
Guidance for Atlas Disaster Recovery

It is critical for enterprises to plan for disaster recovery. We strongly recommend that you prepare a comprehensive disaster recovery (DR) plan that includes elements such as:

  • Your designated recovery point objective (RPO)

  • Your designated recovery time objective (RTO)

  • Automated processes that facilitate alignment with these objectives

Use the recommendations on this page to prepare for and respond to disasters.

To learn more about proactive high availability configurations that can help with disaster recovery, see Recommendations for Atlas High Availability.

To learn about Atlas features that support disaster recovery, see the following pages in the Atlas Architecture Center:

Use the following disaster recovery recommendations to create a DR plan for your organization. These recommendations provide information on steps to take in the event of a disaster.

It is imperative that you test the plans in this section regularly (ideally quarterly, but at least semi-annually). Frequent testing helps prepare the Enterprise Database Management (EDM) team to respond to disasters and keeps the instructions up to date.

Some disaster recovery testing might require actions that EDM users cannot perform. In these cases, open a support case at least a week before the planned test exercise to request the artificial outages.

The following diagram provides an overview of the available disaster recovery configuration options, including the relative gains in RTO and RPO and the relative cost and complexity of deploying each option and remediating various disaster scenarios.

Figure: relative complexity, cost, and RTO/RPO trade-offs of the available disaster recovery configuration options.

Recommendations that apply only to deployments in a single region

Each cloud provider on which you can deploy Atlas clusters provides default data redundancy that helps to mitigate any outages:

  • AWS stores objects on multiple devices across a minimum of three Availability Zones in an AWS Region.

  • Microsoft Azure uses locally redundant storage (LRS) that replicates your data three times within a single data center in the selected region.

  • Google Cloud spreads your data across multiple zones in the selected region.

To enhance your disaster recovery, you can configure Atlas to automatically create copies of your snapshots and oplogs in other regions. This ensures that, even if the primary region experiences an outage, you can restore your cluster using snapshot copies stored in other regions. Atlas optimizes restore speeds by selecting the most efficient option based on region availability, using copied snapshots if restoring to a region where those copies exist. Additionally, if the original snapshot is inaccessible due to a regional outage, Atlas will restore using the nearest available snapshot copy, minimizing downtime and improving recovery resilience. To learn more, see Configure Atlas to Automatically Copy Snapshots to Other Regions.
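If you manage backup policies programmatically, you can apply the copy configuration by patching the cluster's backup schedule through the Atlas Administration API. The following is a minimal sketch only: the endpoint path, versioned Accept header, and copySettings fields reflect the documented v2 API at the time of writing and should be verified against the current API reference, and all identifiers are placeholders.

```python
import requests
from requests.auth import HTTPDigestAuth

BASE = "https://cloud.mongodb.com/api/atlas/v2"
GROUP_ID = "<project-id>"        # placeholder
CLUSTER = "<cluster-name>"       # placeholder
auth = HTTPDigestAuth("<public-key>", "<private-key>")   # placeholder API keys
headers = {"Accept": "application/vnd.atlas.2023-01-01+json",
           "Content-Type": "application/json"}

# Copy daily and weekly snapshots (and oplogs, to allow point-in-time restores
# from the copies) to a second region. Depending on the API version, a
# zoneId or replicationSpecId field may also be required per copy setting.
payload = {
    "copySettings": [
        {
            "cloudProvider": "AWS",
            "regionName": "US_WEST_2",      # copy target region
            "shouldCopyOplogs": True,
            "frequencies": ["DAILY", "WEEKLY"],
        }
    ]
}

resp = requests.patch(
    f"{BASE}/groups/{GROUP_ID}/clusters/{CLUSTER}/backup/schedule",
    json=payload, auth=auth, headers=headers, timeout=30)
resp.raise_for_status()
print(resp.json())
```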

Recommendations that apply only to deployments across multiple regions or multiple cloud providers

If you run a multi-region or multi-cloud MongoDB cluster, configure your backup strategy to account for data corruption risks. Determine how quickly you can identify a potential data corruption issue in your system. Once you establish this detection timeframe, configure your backup retention accordingly so that snapshots taken before the corruption occurred remain available to restore. To account for delays or uncertainty in detecting the issue, include an additional buffer, typically around 10-15%, in your backup retention schedule. This padding helps ensure that you can reliably recover clean data without losing critical information, which improves the overall resilience and reliability of your deployment.
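As a hypothetical worked example of that buffer, suppose you estimate that corruption could go undetected for up to 30 days (the detection window and buffer below are illustrative values, not Atlas defaults):

```python
# Illustrative arithmetic only: pad the backup retention window so that a
# snapshot from before a corruption event is still available when the event
# is finally detected.
detection_window_days = 30    # how long corruption could plausibly go unnoticed
buffer = 0.15                 # 10-15% padding, per the guidance above

retention_days = detection_window_days * (1 + buffer)
print(f"Retain snapshots for at least {retention_days:.0f} days")  # ~35 days
```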

The following recommendations apply to all deployment paradigms.

This section covers the following disaster recovery procedures:

If a single node in your replica set fails due to a partial regional outage, your deployment remains available, provided that you have followed best practices. If you read from secondaries, you might experience degraded performance or an outage if a secondary node fails, because the remaining nodes must absorb the additional load on the now underprovisioned cluster.

You can test a primary node outage in Atlas using the Atlas UI's Test Primary Failover feature or the Test Failover Atlas Administration API endpoint.
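For a scripted test, the following is a minimal Python sketch of calling the Test Failover endpoint; the restartPrimaries path and versioned Accept header should be verified against the current Atlas Administration API reference, and the credentials and identifiers are placeholders.

```python
import requests
from requests.auth import HTTPDigestAuth

BASE = "https://cloud.mongodb.com/api/atlas/v2"
auth = HTTPDigestAuth("<public-key>", "<private-key>")   # placeholder API keys
headers = {"Accept": "application/vnd.atlas.2023-01-01+json"}

# Ask Atlas to restart the primary and force an election on the cluster.
resp = requests.post(
    f"{BASE}/groups/<project-id>/clusters/<cluster-name>/restartPrimaries",
    auth=auth, headers=headers, timeout=30)
resp.raise_for_status()   # a 2xx response indicates the failover test started
```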

In the event of a regional outage, multi-region clusters automatically hold an election and elect a new primary node if necessary. This topology change is communicated automatically to the application through the driver, allowing it to take any necessary corrective action. To maintain application uptime during a regional outage, the application itself must also be deployed with a multi-region topology, and so must any third-party services that the application integrates with. To learn more, see Multi-Region Deployment Paradigm.
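On the application side, drivers that connect with retryable writes enabled generally ride out the election without code changes. The following is a minimal PyMongo sketch under that assumption; the connection string, database, and collection names are placeholders.

```python
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

# Placeholder connection string for a multi-region cluster.
client = MongoClient(
    "mongodb+srv://<user>:<password>@<cluster-host>/?retryWrites=true&w=majority",
    serverSelectionTimeoutMS=30_000,  # allow time for a cross-region election
)

try:
    # With retryable writes, the driver retries this once against the newly
    # elected primary if an election happens mid-operation.
    client["appdb"]["orders"].insert_one({"status": "placed"})
except ConnectionFailure:
    # If the election outlasts the server selection timeout, fall back to your
    # own retry or queueing logic.
    raise
```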

If a single region outage or multi-region outage degrades the state of your cluster, follow these steps:

1
2

You can find information about cluster health in the cluster's Overview tab of the Atlas UI.

3

Based on how many nodes are left online, determine how many new nodes you require to restore the replica set to a normal state.

A normal state is one in which a majority of nodes are available.

4

Depending on the cause of the outage, other regions might also experience unscheduled outages in the near future. For example, if the outage was caused by a natural disaster on the east coast of the United States, avoid other regions on the east coast of the United States in case there are additional issues.

5

Add the required number of nodes for a normal state across regions that are unlikely to be affected by the cause of the outage.

To reconfigure a replica set during an outage by adding regions or nodes, see Reconfigure a Replica Set During a Regional Outage.

6

In addition to adding nodes to restore your replica set to a normal state, you can add further nodes to match the topology that your replica set had before the disaster.

You can test a region outage in Atlas using the Atlas UI's Simulate Outage feature or the Start an Outage Simulation Atlas Administration API endpoint.
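The following is a minimal Python sketch of starting an outage simulation through the Administration API; the outageSimulation path and the outageFilters body shown here should be verified against the current API reference, and all identifiers are placeholders.

```python
import requests
from requests.auth import HTTPDigestAuth

BASE = "https://cloud.mongodb.com/api/atlas/v2"
auth = HTTPDigestAuth("<public-key>", "<private-key>")   # placeholder API keys
headers = {"Accept": "application/vnd.atlas.2023-01-01+json",
           "Content-Type": "application/json"}

# Simulate the loss of a single region. The cluster must keep a majority of
# electable nodes outside the simulated outage.
payload = {"outageFilters": [
    {"cloudProvider": "AWS", "regionName": "US_EAST_1", "type": "REGION"}
]}

resp = requests.post(
    f"{BASE}/groups/<project-id>/clusters/<cluster-name>/outageSimulation",
    json=payload, auth=auth, headers=headers, timeout=30)
resp.raise_for_status()
print(resp.json())   # the simulation document, including its current state
```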

With multi-cloud clusters, you can select electable nodes across cloud providers to maintain high availability. Should the provider in which your primary node is deployed become unavailable, electable nodes can be converted to primary nodes to ensure continuous operation. For example, you can create electable nodes on AWS, Google Cloud, and Microsoft Azure to ensure that if one cloud provider experiences an outage, an electable node on a separate provider can take over as your cluster's primary node. To learn more, see Multi-Cloud Deployment Paradigm.
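To illustrate, the following sketch creates such a topology through the Administration API, with one electable node on each provider. The replicationSpecs/regionConfigs payload follows the advanced cluster schema but is an assumption to verify against the current API reference; instance sizes, region names, and identifiers are placeholders.

```python
import requests
from requests.auth import HTTPDigestAuth

BASE = "https://cloud.mongodb.com/api/atlas/v2"
auth = HTTPDigestAuth("<public-key>", "<private-key>")   # placeholder API keys
headers = {"Accept": "application/vnd.atlas.2023-01-01+json",
           "Content-Type": "application/json"}

# One electable node per cloud provider (3 electable nodes total), so an
# election can still reach a majority if any one provider is unavailable.
payload = {
    "name": "multi-cloud-cluster",
    "clusterType": "REPLICASET",
    "replicationSpecs": [{
        "regionConfigs": [
            {"providerName": "AWS", "regionName": "US_EAST_1", "priority": 7,
             "electableSpecs": {"instanceSize": "M30", "nodeCount": 1}},
            {"providerName": "GCP", "regionName": "CENTRAL_US", "priority": 6,
             "electableSpecs": {"instanceSize": "M30", "nodeCount": 1}},
            {"providerName": "AZURE", "regionName": "US_EAST_2", "priority": 5,
             "electableSpecs": {"instanceSize": "M30", "nodeCount": 1}},
        ]
    }]
}

resp = requests.post(f"{BASE}/groups/<project-id>/clusters",
                     json=payload, auth=auth, headers=headers, timeout=30)
resp.raise_for_status()
```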

Most multi-region Atlas clusters recover automatically from a single region outage. To learn more, see the High Availability section and the Multi-Region Deployment page. If regional outages take down a majority of nodes, you must determine how many nodes to add so that a majority of nodes are healthy again, as shown in the sketch below.
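The calculation is simple majority math; the following illustrative helper (a hypothetical function, not part of any Atlas tooling) shows it:

```python
# Illustrative arithmetic: how many nodes must be added so that healthy nodes
# form a voting majority of the configured replica set size.
def nodes_needed_for_majority(configured: int, healthy: int) -> int:
    majority = configured // 2 + 1
    return max(0, majority - healthy)

# Example: a 5-node replica set with 2 healthy nodes left after a regional
# outage needs 1 more healthy node to regain a majority (3 of 5).
print(nodes_needed_for_majority(configured=5, healthy=2))  # 1
```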

In the highly unlikely event that an entire cloud provider is unavailable, follow these steps to bring your deployment back online:

1

You need this information later in this procedure to restore your deployment.

2

For a list of cloud providers and information, see Cloud Providers.

3

To learn how to view your backup snapshots, see View M10+ Backup Snapshots.

4

Your new cluster must have a topology identical to that of the original cluster.

Alternatively, instead of creating a new cluster, you can add nodes hosted on a different cloud provider to the existing cluster.

5

To learn how to restore your snapshot, see Restore Your Cluster.

6

To find the new connection string, see Connect via Drivers. Review your application stack, because you likely need to redeploy it onto the new cloud provider as well.
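The snapshot restore referenced in step 5 (and in the other procedures on this page that point to Restore Your Cluster) can also be started through the Administration API. The following is a minimal sketch; the backup/restoreJobs path, Accept version, and body fields should be verified against the current API reference, and all identifiers are placeholders.

```python
import requests
from requests.auth import HTTPDigestAuth

BASE = "https://cloud.mongodb.com/api/atlas/v2"
auth = HTTPDigestAuth("<public-key>", "<private-key>")   # placeholder API keys
headers = {"Accept": "application/vnd.atlas.2023-01-01+json",
           "Content-Type": "application/json"}

# Restore a specific snapshot from the unavailable cluster's backups into the
# newly created cluster on the alternative cloud provider. For a point-in-time
# restore, the API also accepts a "pointInTime" delivery type with a timestamp.
payload = {
    "deliveryType": "automated",
    "snapshotId": "<snapshot-id>",
    "targetClusterName": "<new-cluster-name>",
    "targetGroupId": "<project-id>",
}

resp = requests.post(
    f"{BASE}/groups/<project-id>/clusters/<source-cluster-name>/backup/restoreJobs",
    json=payload, auth=auth, headers=headers, timeout=30)
resp.raise_for_status()
print(resp.json()["id"])   # restore job ID to poll for completion
```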

In the highly unlikely event that the Atlas Control Plane and the Atlas UI are unavailable, your cluster is still available and accessible. To learn more, see Platform Reliability. Open a high-priority support ticket to investigate this further.

Capacity issues with computational resources (such as disk space, RAM, or CPU) can result from poor planning or unexpected database traffic, and they are not necessarily the result of a disaster.

If a computational resource reaches the maximum allocated amount and causes a disaster, follow these steps:

1

To view your resource utilization in the Atlas UI, see Monitor Real-Time Performance.

To view metrics with the Atlas Administration API, see Monitoring and Logs; a minimal scripted sketch follows this procedure.

2
3

Note that Atlas will perform this change in a rolling manner, so it should not have any major impact on your applications.

To learn how to allocate more resources, see Edit a Cluster.

4

Important

This is a temporary solution intended to shorten overall system downtime. Once the underlying issue resolves, merge the data from the newly-created cluster into the original cluster and point all applications back to the original cluster.
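The resource-utilization check in step 1 above can be scripted against the Administration API's process measurements endpoint. The following is a minimal sketch; the path, Accept version, and metric names should be verified against the current API reference, and the process ID (hostname and port of a cluster node) and other identifiers are placeholders.

```python
import requests
from requests.auth import HTTPDigestAuth

BASE = "https://cloud.mongodb.com/api/atlas/v2"
auth = HTTPDigestAuth("<public-key>", "<private-key>")   # placeholder API keys
headers = {"Accept": "application/vnd.atlas.2023-01-01+json"}

# Pull one hour of CPU, memory, and connection metrics at one-minute
# granularity for a single mongod process.
params = {
    "granularity": "PT1M",
    "period": "PT1H",
    "m": ["SYSTEM_NORMALIZED_CPU_USER", "MEMORY_RESIDENT", "CONNECTIONS"],
}

resp = requests.get(
    f"{BASE}/groups/<project-id>/processes/<hostname:port>/measurements",
    params=params, auth=auth, headers=headers, timeout=30)
resp.raise_for_status()
for measurement in resp.json()["measurements"]:
    print(measurement["name"], measurement["dataPoints"][-1])
```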

If a computational resource fails and causes your cluster to become unavailable, follow these steps:

1
2
3

To learn how to restore your snapshot, see Restore Your Cluster.

4

Production data might be accidentally deleted due to human error or a bug in the application built on top of the database. If the cluster itself was accidentally deleted, Atlas might retain the volume temporarily.

If the contents of a collection or database have been deleted, follow these steps to restore your data:

1
2

You can use mongoexport to create a copy.

3

If the deletion occurred within the last 72 hours, and you configured continuous backup, use Point in Time (PIT) restore to restore from the point in time right before the deletion occurred.

If the deletion did not occur in the past 72 hours, restore the most recent backup from before the deletion occurred into the cluster.

To learn more, see Restore Your Cluster.

4

You can use mongoimport with upsert mode to import your data and ensure that any data that was modified or added is reflected properly in the collection or database.
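The following is a minimal Python sketch of steps 2 and 4, driving the MongoDB Database Tools through subprocess; the connection string, database, and collection names are placeholders, and mongoexport/mongoimport must be installed locally.

```python
import subprocess

URI = "mongodb+srv://<user>:<password>@<cluster-host>/"   # placeholder

# Step 2: take a working copy of the affected collection before restoring.
subprocess.run(
    ["mongoexport", "--uri", URI, "--db", "appdb", "--collection", "orders",
     "--out", "orders-copy.json"],
    check=True)

# Step 4: after the restore, re-apply documents from the copy in upsert mode
# so that anything modified or added since the snapshot is reflected properly.
subprocess.run(
    ["mongoimport", "--uri", URI, "--db", "appdb", "--collection", "orders",
     "--mode", "upsert", "--file", "orders-copy.json"],
    check=True)
```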

If a driver fails, follow these steps:

1

You can work with the technical support team during this step.

If you determine that reverting to an earlier driver version will fix the issue, continue to the next step.

2
3

This might include application code or query changes. For example, there might be breaking changes if you move across major or minor driver versions.

4
5

Ensure that any other changes from the previous step are reflected in the production environment.
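One lightweight safeguard (a suggestion, not an Atlas feature) is to assert the rolled-back driver version at application startup so that a later dependency bump cannot silently reintroduce the faulty version. A minimal PyMongo sketch, with a placeholder version number:

```python
import pymongo

# The known-good driver version you reverted to (placeholder value).
EXPECTED_DRIVER_VERSION = "4.6.3"

if pymongo.version != EXPECTED_DRIVER_VERSION:
    raise RuntimeError(
        f"Expected PyMongo {EXPECTED_DRIVER_VERSION}, found {pymongo.version}")
```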

Important

This is a temporary solution intended to shorten overall system downtime. Once the underlying issue resolves, merge the data from the newly-created cluster into the original cluster and point all applications back to the original cluster.

If your underlying data becomes corrupted, follow these steps:

1
2
3

To learn how to restore your snapshot, see Restore Your Cluster.

4
5
