Handling Unexpected Outages With SQL Server

A catastrophic failure occurs. There is a backup but you have some lag to pull it from your remote storage or cloud provider. You also need an available and experienced DBA no matter what time it is to handle this as seamlessly as possible. How can your DB system help you overcome this and put your business back on track ASAP?

Where do you start from?

Your business might have many distinct levels of criticality and the first step is to ask yourself some questions to identify where a failure can put it at risk. The two most important ones are: What are your limits for acceptable data loss and downtime? Which point-in-time is enough for your business to get back running?

There are two concepts you can use to answer those questions: Recovery Point Objective (RPO) and Recovery Time Objective (RTO).

RPO is at which point in time you want your data available – at least 2 minutes back of healthy records before the failure, for example.

RTO is the time SLA for your business to get back running within the RPO you defined. For example, you want the service reestablishment in a maximum of 3 minutes. Take a look at which point are your databases currently. Based on these answers, you can start planning your Business Continuity Plan.

Outage types

Understanding what types of outages your SQL Server can suffer from can help you choose the right infrastructure and features to overcome them.

When something fails and only affects one server, like an OS crash, patch installation failure, failing RAM/CPU, disk failure, you need a High Availability solution to move your resources to.

Sometimes your problems are widespread, like network switches or SAN failures, natural disasters like fire, flood, power outage, and so many others. Then, you need a Disaster Recovery solution, such as moving your resources to another datacenter room or location, for example.

Database corruption might occur within SAN failures, SQL bugs that also need processes, and plans to fix that corruption. Human errors include wrong table or database drops, updates without a where clause, and so on.

Features

How can SQL Server help you get rid of outages? Here are the features that will leverage your business’s high availability in a secure and robust fashion.

Backup and Restore – Yes! The most basic piece of data recovery must be taken into account! Make sure your backups are meeting your RPO and RTO and test them periodically. There are different types of backups, full, differentials, transactional log, and also filegroups, that can especially bring your database online faster by pieces.

Replication – Copy and distribute data between databases on local or remote instances at scheduled intervals, with many directions and types available. Replicate only data, ideal for specific subsets and to send/receive data to/from specific locations, like store branches, distribution centers, etc.

Log Shipping – Automated backup and restore of the transactional log to another local or remote database, with read-only replicas available (Standby).

Always-ON FCI (Failover Clustered Instances) – A cluster of 2 or more servers with shared storage. When a failure occurs in one node, all the resources are moved to an available node. These nodes can be on the same datacenter or on a remote one.

Always-ON Availability Groups – A cluster of standalone SQL instances with independent storage volumes where you can move groups of databases instead of the whole instance in a failure. You can take advantage of a secondary read-only replica, backup offload, and can be both async or sync. Ideal for mission-critical environments

Storage Replication – Replication of volumes between servers or clusters, replicating the disk blocks so data can be moved between sites and SAN clusters

Virtual Machine Replication – Replication of virtual instances with the whole server as a snapshot, becoming fully functional VM’s to bring your infrastructure back in a timely fashion.

Investment

I began talking that you must be aware of the criticality of your databases. The less RTO and RPO you need, the more you have to invest in SQL features and infrastructure to provide almost zero data loss. Well-defined criticalities will provide you a scenario where you can invest more in what is more valuable.

Here you have 3 main options:

Enterprise Edition: The full SQL feature bundle, provides high-end solutions and high scaling for computing and availability. All features are available. I’d like to mention here the Distributed Availability Groups (Cross-domain and network availability), online page and file restore, and high I/O scalability.

Standard Edition: A basic SQL feature bundle, provides most of the features of Enterprise with some limitations, like Basic Availability Groups for 2-server clusters, accelerated database recovered, failover cluster instances.

Web and Express: Free editions with very limited resources and features.

Make it happen

Designing and implementing these solutions is an effort that will prove its value when things go really bad with your databases. At BairesDev we have the Top 1% of tech Talent to design and implement the process from budget to technical design, passing through implementation and then support and maintaining this new structure that will save your business in case of small outages or big disasters.

You can start from a very simple implementation like fixing your backup schedules and types, optimizing its parameters like I/O, number of files, and go through more complex design like availability groups between different rooms, domains, and datacenters making sure your business won’t suffer major outages and you will be peaceful and restful to focus on business strategies that will leverage your profits.