
Guidance for Atlas Monitoring and Alerts

On this page

  • Features for Atlas Monitoring and Alerts
  • Recommendations for Atlas Monitoring and Alerts
  • Monitor by Using Metrics
  • Monitor by Configuring Alerts
  • Monitor by Using Atlas Built-In Tools
  • Monitor by Using Logs
  • Automation Examples: Atlas Monitoring and Logging

MongoDB Atlas has a robust set of built-in metrics, telemetry, and logs that you can leverage from Atlas or integrate into your third-party observability and alerting stack. This enables you to monitor and manage your Atlas deployments and respond to incidents proactively and in real time.

Monitoring your deployment allows you to:

  • Understand the health and status of your cluster

  • Learn how operations running on the cluster are impacting the database

  • Learn if your hardware is resource constrained

  • Perform workload and query optimization

  • Detect and react to real-time issues to improve your application stack

Features for Atlas Monitoring and Alerts

Atlas provides several metrics for monitoring and alerting. You can track the health, availability, consumption, and performance of your deployments in visual dashboards and by API. You can also view various cluster metrics, monitor your database performance, configure alerts and alert notifications, and download activity logs.

Metrics

Deployment metrics provide insight into hardware performance and database operation efficiency. Atlas collects metrics for your servers, databases, and MongoDB processes and stores metrics data at various granularity levels. For each granularity level, Atlas computes metrics as averages of the reported metrics at the next finer granularity level. Many metrics have a burst reporting equivalent.

You can monitor metrics in the Atlas UI using the Metrics view, Real-Time Performance Panel, Query Profiler, Performance Advisor, and Namespace Insights page. You can also use the Atlas CLI or Atlas Administration API to funnel metrics into a tool of your choice.

The following Atlas UI Metrics view shows the variety of metrics available to monitor for a sample three-node replica set:

An image showing an example Metrics view for a 3 node replica set.
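
The same measurements that back this view can also be pulled from the command line. The following is a minimal Atlas CLI sketch, not a definitive recipe: the hostname and project ID are placeholders to replace with your own values, and it reuses the granularity, period, and metric-type values shown in the automation examples later on this page.

atlas metrics processes cluster0-shard-00-00.example.mongodb.net:27017 \
--projectId <your-project-id> \
--granularity PT1H \
--period P1D \
--type CONNECTIONS,SYSTEM_CPU_USER \
--output json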

Alerts

Atlas provides alerts for over 200 event types, allowing you to tailor alerts for precise monitoring. Atlas issues alerts for the database and server conditions that you configure in your alert settings. When a condition triggers an alert, Atlas displays a warning symbol on the cluster and sends alert notifications.

You can use the Atlas UI, Atlas Administration API, Atlas CLI, and integrated Terraform resource to configure alerts and alert notifications.
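
For example, before defining new alerts, you can review the alert configurations that already exist in a project. The following is a brief sketch with the Atlas CLI; the project ID is a placeholder, and you can confirm the available flags with atlas alerts settings list --help.

atlas alerts settings list --projectId <your-project-id> --output json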

Monitoring

Atlas monitoring visualizations provide insights into various key metrics, including hardware performance and database operation efficiency. Tools such as real-time performance panels for network and operation visibility, query profilers for tracking efficiency trends, and automated index suggestions allow you to monitor and troubleshoot issues more effectively, driving greater operational efficiency. For example, these charts can help you understand the impact of server restarts and elections on database performance.

Logs

Atlas provides logs for each process in the cluster. Each process keeps an account of its activity in its own log file.

You can download logs by using the Atlas UI, Atlas CLI, and Atlas Administration API. To learn more, see Guidance for Atlas Auditing and Logging.

Recommendations for Atlas Monitoring and Alerts

Monitor by Using Metrics

To monitor your cluster or database performance, you can view cluster metrics such as historical throughput, performance, and utilization. The following table lists some important categories of metrics to monitor:

Atlas Cluster Operations and Connection Metrics

  • Opcounters

  • Operation Execution Times

  • Query Executors and Query Targeting

  • Connections

  • Queues

  • Scan and Order

Hardware Metrics

  • Normalized System CPU

  • Normalized Process CPU

  • Disk Latency

  • Disk IOPS

  • Disk Space Free

  • System Memory

  • Swap Usage

Replication Metrics

  • Replication Lag

  • Replication Oplog Window

  • Replication Headroom

  • Oplog GB/Hour

  • Opcounters - repl

You can use the Atlas UI, Atlas Administration API, and Atlas CLI to view Atlas cluster metrics. Additionally, Atlas provides built-in integrations with third-party tools such as Datadog and Prometheus, and you can also leverage the Atlas Administration API to integrate with other custom metrics tools.
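
As one sketch of wiring up such an integration from the command line, the following command configures the Datadog integration for a project with the Atlas CLI. The API key, region, and project ID are placeholders, and the exact flag names are assumptions to verify with atlas integrations create DATADOG --help.

atlas integrations create DATADOG \
--apiKey <your-datadog-api-key> \
--region US \
--projectId <your-project-id>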

To learn more, see Review Cluster Metrics.

Monitor by Configuring Alerts

Atlas extends into your existing observability stack so that you can get alerts and make data-driven decisions without replacing your current tools or changing your workflows. Atlas can send alert notifications to third-party tools such as Microsoft Teams, PagerDuty, Datadog, Prometheus, Opsgenie, and Splunk On-Call to give you visibility into both database and full-stack performance in the same place.

Configure alerts and notifications for security events, such as failed login attempts, unusual access patterns, and data breaches. In dev and test environments, we recommend configuring alerts that fire after seven days of cluster inactivity so that you can pause or remove idle clusters and save costs.

When you view alerts in the Atlas UI, we recommend that you use available filters to limit results by host, replica set, cluster, shard, and more to help focus on the most critical alerts.

At a minimum, we recommend configuring the following alerts. These alert recommendations provide a baseline; adjust them based on your workload characteristics. Where "high priority" conditions are specified in the following table, we recommend that you configure two alerts for the same condition: one with a low priority level of severity and one with a high priority level of severity. This allows you to configure the alert notification settings for each separately.

Each entry below lists the condition, the recommended low-priority alert threshold, the recommended high-priority alert threshold, and key insights.

Oplog Window

< 24h for 5 minutes

< 1h for 10 minutes

Monitor the replication oplog window, together with replication headroom, to determine whether the secondary may soon require a full resync. The replication oplog window often helps to determine in advance the resilience of secondaries to planned and unplanned outages.

Election events

> 3 for 5 minutes

> 30 for 5 minutes

Monitor election events, which occur when a primary node steps down and a secondary node is elected as the new primary. Frequent election events can disrupt operations and impact availability, causing temporary write unavailability and possible rollback of data. Keeping election events to a minimum ensures consistent write operations and stable cluster performance.

Read IOPS

> 4000 for 2 minutes

> 9000 for 5 minutes

Monitor whether disk IOPS approaches the maximum provisioned IOPS. Determine whether the cluster can handle future workloads.

Write IOPS

> 4000 for 2 minutes

> 9000 for 5 minutes

Monitor whether disk IOPS approaches the maximum provisioned IOPS. Determine whether the cluster can handle future workloads.

Read Latency

> 20ms for 5 minutes

> 50ms for 5 minutes

Monitor disk latency to track the efficiency of reading from and writing to disk.

Write Latency

> 20ms for 5 minutes

> 50ms for more than 5 minutes

Monitor disk latency to track the efficiency of reading from and writing to disk.

Swap use

> 2GB for 15 minutes

> 2GB for 15 minutes

Monitor memory to determine whether to upgrade to a higher cluster tier. This metric represents the average value over the time period specified by the metric granularity.

Host down

15 minutes

24 hours

Monitor your hosts to detect downtime promptly. A host down for more than 15 minutes can impact availability, while downtime exceeding 24 hours is critical, risking data accessibility and application performance.

No primary

5 minutes

5 minutes

Monitor the status of your replica sets to identify instances where there is no primary node. A lack of a primary for more than 5 minutes can halt write operations and impact application functionality.

Missing active mongos

15 minutes

15 minutes

Monitor the status of active mongos processes to ensure effective query routing in sharded clusters. A missing mongos can disrupt query routing.

Page faults

> 50/second for 5 minutes

> 100/second for 5 minutes

Monitor page faults to determine whether to increase your memory. This metric displays the average rate of page faults on this process per second over the selected sample period. In non-Windows environments this applies to hard page faults only.

Replication lag

> 240 seconds for 5 minutes

> 1 hour for 5 minutes

Monitor replication lag to determine whether the secondary might fall off the oplog.

Failed backup

Any occurrence

None

Track backup operations to ensure data integrity. A failed backup can compromise data availability.

Restored backup

Any occurrence

None

Verify restored backups to ensure data integrity and system functionality.

Fallback snapshot failed

Any occurrence

None

Monitor fallback snapshot operations to ensure data redundancy and recovery capability.

Backup schedule behind

> 12 hours

> 12 hours

Check backup schedules to ensure they are on track. Falling behind can risk data loss and compromise recovery plans.

Queued Reads

0-10

> 10

Monitor queued reads to ensure efficient data retrieval. High levels of queued reads may indicate resource constraints or performance bottlenecks, requiring optimization to maintain system responsiveness.

Queued Writes

0-10

> 10

Monitor queued writes to maintain efficient data processing. High levels of queued writes may signal resource constraints or performance bottlenecks, requiring optimization to maintain system responsiveness.

Restarts last hour

> 2

> 2

Track the number of restarts in the last hour to detect instability or configuration issues. Frequent restarts can indicate underlying problems that require immediate investigation to maintain system reliability and uptime.

Primary elections

Any occurrence

None

Monitor primary elections to ensure stable cluster operations. Frequent elections can indicate network issues or resource constraints, potentially impacting the availability and performance of the database.

Maintenance no longer needed

Any occurrence

None

Review unnecessary maintenance tasks to optimize resources and minimize disruptions.

Maintenance started

Any occurrence

None

Track the start of maintenance tasks to ensure planned activities proceed smoothly. Proper oversight helps maintain system performance and minimize downtime during maintenance.

Maintenance scheduled

Any occurrence

None

Monitor scheduled maintenance to prepare for potential system impacts.

CPU steal

> 5% for 5 minutes

> 20% for 5 minutes

Monitor CPU steal on AWS EC2 clusters with Burstable Performance to identify when CPU usage exceeds the guaranteed baseline due to shared cores. High steal percentages indicate the CPU credit balance is depleted, affecting performance.

CPU

> 75% for 5 minutes

> 75% for 5 minutes

Monitor CPU usage to determine whether data is retrieved from disk instead of memory.

Disk partition usage

> 90%

> 95% for 5 minutes

Monitor disk partition usage to ensure sufficient storage availability. High usage levels can lead to performance degradation and potential system outages.

To learn more, see Configure and Resolve Alerts.
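
As an illustration of automating one of the recommendations above, the following sketch creates an alert that emails you when a host goes down. It mirrors the alert-creation commands in the automation examples later on this page; the HOST_DOWN event type name, the cluster name, and the project ID are assumptions and placeholders to verify against atlas alerts settings create --help.

atlas alerts settings create \
--enabled \
--event "HOST_DOWN" \
--matcherFieldName CLUSTER_NAME \
--matcherOperator EQUALS \
--matcherValue myCluster \
--notificationType EMAIL \
--notificationEmailEnabled \
--notificationEmailAddress "myName@example.com" \
--notificationIntervalMin 60 \
--projectId <your-project-id>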

Monitor by Using Atlas Built-In Tools

Atlas provides several tools that allow you to proactively monitor and improve the performance of your database.

The Real-Time Performance Panel (RTPP) in the Atlas UI provides insights into current network traffic, database operations, and hardware statistics about the hosts at a one-second granularity. We recommend that you use RTPP to:

  • Visually identify relevant database operations

  • Evaluate query execution times

  • Evaluate the ratio of documents scanned to documents returned

  • Monitor network load and throughput

  • Discover potential replication lag on secondary members of replica sets

  • Kill operations before they have completed to free up valuable resources

You can't monitor the RTPP through the Atlas Administration API, but you can enable and disable it with the Update One Project Settings endpoint.
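
The following is a rough curl sketch of that call, assuming the v2 Administration API conventions; the endpoint path, version header, and isRealtimePerformancePanelEnabled field name should be verified against the Update One Project Settings reference, and the keys and project ID are placeholders.

curl --user "<public-key>:<private-key>" --digest \
--header "Accept: application/vnd.atlas.2023-01-01+json" \
--header "Content-Type: application/json" \
--request PATCH "https://cloud.mongodb.com/api/atlas/v2/groups/<project-id>/settings" \
--data '{"isRealtimePerformancePanelEnabled": false}'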

To learn more, see Monitor Real-Time Performance.

The Query Profiler identifies slow queries and bottlenecks, and suggests index refinement and query restructuring to improve the performance of your database. It provides visibility into the slowest operations over a 24-hour window in the Atlas UI, making it easier to identify trends and outliers in query efficiency. We recommend that you use this data to pinpoint and troubleshoot poorly performing queries, reducing performance overhead.

You can return log lines for slow queries that the Query Profiler identifies from the Atlas Administration API with Return Slow Queries.
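
The Atlas CLI exposes similar data. The following is a minimal sketch that lists recent slow query log lines for one process; the hostname and project ID are placeholders, and the command and flags are assumptions to confirm with atlas performanceAdvisor slowQueryLogs list --help.

atlas performanceAdvisor slowQueryLogs list \
--processName cluster0-shard-00-00.example.mongodb.net:27017 \
--projectId <your-project-id> \
--output json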

To learn more, see Monitor Query Performance with Query Profiler.

The Performance Advisor automatically analyzes logs for slow-running queries and recommends indexes to create and drop. It analyzes slow queries and provides index suggestions for individual collections, ranked by a calculated impact score, and tailored to your workload. This gives you an easy, instantaneous way to make high-impact performance improvements. We recommend that you monitor regularly, focus on slow queries, and enable the profiler selectively to minimize overhead.

You can use the Atlas UI, Atlas CLI, and the Atlas Administration API to view slow queries and suggestions for improving the performance of your queries from the Performance Advisor.
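
For instance, the following sketch lists the namespaces (collections) for which the Performance Advisor has detected slow queries on one host. The hostname and project ID are placeholders, and the subcommand name is an assumption to verify with atlas performanceAdvisor --help.

atlas performanceAdvisor namespaces list \
--processName cluster0-shard-00-00.example.mongodb.net:27017 \
--projectId <your-project-id>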

You can return log lines for slow queries that the Performance Advisor identifies from the Atlas Administration API with Return Slow Queries. To return suggested indexes and more with the Atlas Administration API, see Performance Advisor.

To learn more, see Monitor and Improve Slow Queries.

The Namespace Insights page in the Atlas UI allows you to monitor collection-level performance and usage metrics. It displays metrics (such as the number of CRUD operations on the collection) and statistics (like average query execution time) for certain hosts and operation types for the collections that you pin for monitoring. This gives you more granular visibility into collection-level performance, which you can use to optimize database performance, resolve issues, and make decisions about scaling, indexing, and query tuning.

To learn more, see Monitor Collection-Level Query Latency.

Monitor by Using Logs

Atlas retains the last 30 days of log messages and system event audit messages. You can download Atlas logs at any point until the end of their retention periods by using the Atlas UI, Atlas Administration API, and Atlas CLI.

To learn more, see View and Download MongoDB Logs.
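
For example, the following sketch downloads the audit log for one host with the Atlas CLI. The hostname and project ID are placeholders, and the mongodb-audit-log.gz file name is an assumption about which log types the command accepts; check atlas logs download --help for the full list.

atlas logs download cluster0-shard-00-00.example.mongodb.net mongodb-audit-log.gz \
--projectId <your-project-id>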

You can also push logs to an AWS S3 bucket. When you configure this feature, Atlas continually pushes logs from mongod, mongos, and audit logs to an AWS S3 bucket. Atlas exports logs every five minutes.

Automation Examples: Atlas Monitoring and Logging

See Terraform examples to enforce our Staging/Prod recommendations across all pillars in one place on GitHub.

The following examples demonstrate how to enable monitoring using Atlas tools for automation.

Run the following command to retrieve the amount of used and free space on the specified disk. This metric can be used to determine if the system is running out of free space.

atlas metrics disks describe atlas-lnmtkm-shard-00-00.ajlj3.mongodb.net:27017 data \
--granularity P1D \
--period P1D \
--type DISK_PARTITION_SPACE_FREE,DISK_PARTITION_SPACE_USED \
--projectId 6698000acf48197e089e4085

Run the following command to create an alert that sends an email notification when your deployment doesn't have a primary.

atlas alerts settings create \
--enabled \
--event "NO_PRIMARY" \
--matcherFieldName CLUSTER_NAME \
--matcherOperator EQUALS \
--matcherValue ftsTest \
--notificationType EMAIL \
--notificationEmailEnabled \
--notificationEmailAddress "myName@example.com" \
--notificationIntervalMin 5 \
--projectId 6698000acf48197e089e4085

Run the following command to enable the Atlas-managed slow operation threshold for your project.

atlas performanceAdvisor slowOperationThreshold enable --projectId 56fd11f25f23b33ef4c2a331

Run the following command to download a compressed file that contains the MongoDB logs for the specified host in your project.

atlas logs download atlas-lnmtkm-shard-00-00.ajlj3.mongodb.net mongodb.gz --projectId 56fd11f25f23b33ef4c2a331

Before you can create resources with Terraform, you must:

  • Create your paying organization and create an API key for the paying organization. Store your API key as environment variables by running the following commands in the terminal:

    export MONGODB_ATLAS_PUBLIC_KEY="<insert your public key here>"
    export MONGODB_ATLAS_PRIVATE_KEY="<insert your private key here>"
  • Install Terraform

We also suggest creating a workspace for your environment.
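
For example, assuming you keep separate staging and production environments, you might create a dedicated workspace before applying any of the configurations below:

# Create (and switch to) a workspace for the staging environment.
terraform workspace new staging
# Confirm which workspace is active.
terraform workspace list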

The following examples demonstrate how to configure alerts and alert notifications. You must create the following files for each example: a file for the variable definitions (for example, variables.tf) and a file for their values (for example, terraform.tfvars). Place the files for each example in their own directory. Change the IDs and names to use your values:

# variables.tf
variable "atlas_org_id" {
  type        = string
  description = "MongoDB Atlas Organization ID"
}
variable "atlas_project_name" {
  type        = string
  description = "The MongoDB Atlas Project Name"
}
variable "atlas_project_id" {
  description = "MongoDB Atlas project id"
  type        = string
}
variable "atlas_cluster_name" {
  description = "MongoDB Atlas Cluster Name"
  default     = "datadog-test-cluster"
  type        = string
}

# terraform.tfvars
atlas_org_id       = "32b6e34b3d91647abb20e7b8"
atlas_project_name = "Customer Portal - Prod"
atlas_project_id   = "67212db237c5766221eb6ad9"
atlas_cluster_name = "myCluster"

Example: Use the following to send an alert notification by email to users with the GROUP_CLUSTER_MANAGER role when the replication oplog window falls below one hour, which could lead to replication lag and data inconsistencies.

resource "mongodbatlas_alert_configuration" "test" {
  project_id = var.atlas_project_id
  event_type = "REPLICATION_OPLOG_WINDOW_RUNNING_OUT"
  enabled    = true
  notification {
    type_name     = "GROUP"
    interval_min  = 10
    delay_min     = 0
    sms_enabled   = false
    email_enabled = true
    roles         = ["GROUP_CLUSTER_MANAGER"]
  }
  matcher {
    field_name = "CLUSTER_NAME"
    operator   = "EQUALS"
    value      = "myCluster"
  }
  threshold_config {
    operator  = "LESS_THAN"
    threshold = 1
    units     = "HOURS"
  }
}

Run the following sample command to retrieve these metrics:

  • OPCOUNTERS - to monitor the number of queries, updates, inserts, and deletes occurring at peak load and ensure that load doesn't increase unexpectedly.

  • TICKETS - to ensure that the number of allowed concurrent reads and writes doesn't drop significantly or frequently.

  • CONNECTIONS - to ensure that the number of sockets used for heartbeats and replication between members isn't above the set limit.

  • QUERY TARGETING - to ensure that the ratio of keys and documents scanned to documents returned, averaged per second, isn't too high.

  • SYSTEM CPU - to ensure that CPU usage is steady.

atlas metrics processes atlas-lnmtkm-shard-00-00.ajlj3.mongodb.net:27017 \
--projectId 56fd11f25f23b33ef4c2a331 \
--granularity PT1H \
--period P7D \
--type CONNECTIONS,OPCOUNTER_DELETE,OPCOUNTER_INSERT,OPCOUNTER_QUERY,OPCOUNTER_UPDATE,TICKETS_AVAILABLE_READS,TICKETS_AVAILABLE_WRITE,QUERY_TARGETING_SCANNED_OBJECTS_PER_RETURNED,QUERY_TARGETING_SCANNED_PER_RETURNED,SYSTEM_CPU_GUEST,SYSTEM_CPU_IOWAIT,SYSTEM_CPU_IRQ,SYSTEM_CPU_KERNEL,SYSTEM_CPU_NICE,SYSTEM_CPU_SOFTIRQ,SYSTEM_CPU_STEAL,SYSTEM_CPU_USER \
--output json

Run the following command to send alerts to a group by email when there are possible connection storms based on the number of connections in your project.

atlas alerts settings create \
--enabled \
--event "OUTSIDE_METRIC_THRESHOLD" \
--metricName CONNECTIONS \
--metricOperator LESS_THAN \
--metricThreshold 1 \
--metricUnits RAW \
--notificationType GROUP \
--notificationRole "GROUP_DATA_ACCESS_READ_ONLY","GROUP_CLUSTER_MANAGER","GROUP_DATA_ACCESS_ADMIN" \
--notificationEmailEnabled \
--notificationEmailAddress "user@example.com" \
--notificationIntervalMin 5 \
--projectId 6698000acf48197e089e4085

Run the following command to retrieve the suggested indexes for collections experiencing slow queries.

atlas performanceAdvisor suggestedIndexes list \
--projectId 56fd11f25f23b33ef4c2a331 \
--processName atlas-zqva9t-shard-00-02.2rnul.mongodb.net:27017

Run the following command to download a compressed file that contains the MongoDB logs for the specified host in your project.

atlas logs download atlas-lnmtkm-shard-00-00.ajlj3.mongodb.net mongodb.gz --projectId 56fd11f25f23b33ef4c2a331

Before you can create resources with Terraform, you must:

  • Create your paying organization and create an API key for the paying organization. Store your API key as environment variables by running the following commands in the terminal:

    export MONGODB_ATLAS_PUBLIC_KEY="<insert your public key here>"
    export MONGODB_ATLAS_PRIVATE_KEY="<insert your private key here>"
  • Install Terraform

The following examples demonstrate how to configure alerts and alert notifications. You must create the following files for each example: a file for the variable definitions (for example, variables.tf) and a file for their values (for example, terraform.tfvars). Place the files for each example in their own directory and replace only the main.tf file between examples. Change the IDs and names to use your values:

# variables.tf
variable "atlas_org_id" {
  type        = string
  description = "MongoDB Atlas Organization ID"
}
variable "atlas_project_name" {
  type        = string
  description = "The MongoDB Atlas Project Name"
}
variable "atlas_project_id" {
  description = "MongoDB Atlas project id"
  type        = string
}
variable "atlas_cluster_name" {
  description = "MongoDB Atlas Cluster Name"
  default     = "datadog-test-cluster"
  type        = string
}
variable "datadog_api_key" {
  description = "Datadog api key"
  type        = string
}
variable "datadog_region" {
  description = "Datadog region"
  default     = "US5"
  type        = string
}
variable "prometheus_user_name" {
  type        = string
  description = "The Prometheus User Name"
  default     = "puser"
}
variable "prometheus_password" {
  type        = string
  description = "The Prometheus Password"
  default     = "ppassword"
}

# terraform.tfvars
atlas_org_id         = "32b6e34b3d91647abb20e7b8"
atlas_project_name   = "Customer Portal - Prod"
atlas_project_id     = "67212db237c5766221eb6ad9"
atlas_cluster_name   = "myCluster"
datadog_api_key      = "1234567890abcdef1234567890abcdef"
datadog_region       = "US5"
prometheus_user_name = "prometheus_user"
prometheus_password  = "secure_prometheus_password"

Example 1: Use the following to integrate with third-party services like Datadog and Prometheus for alert notifications.

resource "mongodbatlas_third_party_integration" "test_datadog" {
  project_id = var.atlas_project_id
  type       = "DATADOG"
  api_key    = var.datadog_api_key
  region     = var.datadog_region
}
resource "mongodbatlas_third_party_integration" "test_prometheus" {
  project_id        = var.atlas_project_id
  type              = "PROMETHEUS"
  user_name         = var.prometheus_user_name
  password          = var.prometheus_password
  service_discovery = "http"
  enabled           = true
}
output "datadog_id" {
  value = mongodbatlas_third_party_integration.test_datadog.id
}
output "prometheus_id" {
  value = mongodbatlas_third_party_integration.test_prometheus.id
}

Example 2: Use the following to send alert notifications to third-party services like Datadog and Prometheus when there is no primary on the replica set for more than 5 minutes.

resource "mongodbatlas_alert_configuration" "test_alert_notification" {
  project_id = var.atlas_project_id
  event_type = "NO_PRIMARY"
  enabled    = true
  notification {
    type_name      = "PROMETHEUS"
    integration_id = mongodbatlas_third_party_integration.test_prometheus.id # ID of the Atlas Prometheus integration
  }
  notification {
    type_name      = "DATADOG"
    integration_id = mongodbatlas_third_party_integration.test_datadog.id # ID of the Atlas Datadog integration
  }
  matcher {
    field_name = "REPLICA_SET_NAME"
    operator   = "EQUALS"
    value      = "myReplSet"
  }
  threshold_config {
    operator  = "GREATER_THAN"
    threshold = 5
    units     = "MINUTES"
  }
}

Example 3: Use the following to send an alert notification by email to users with the GROUP_CLUSTER_MANAGER role when replication lag exceeds one hour, which could result in data inconsistencies.

resource "mongodbatlas_alert_configuration" "test_replication_lag_alert" {
  project_id = var.atlas_project_id
  event_type = "OUTSIDE_METRIC_THRESHOLD"
  enabled    = true
  notification {
    type_name     = "GROUP"
    interval_min  = 10
    delay_min     = 0
    sms_enabled   = false
    email_enabled = true
    roles         = ["GROUP_CLUSTER_MANAGER"]
  }
  matcher {
    field_name = "CLUSTER_NAME"
    operator   = "EQUALS"
    value      = "myCluster"
  }
  metric_threshold_config {
    metric_name = "OPLOG_SLAVE_LAG_MASTER_TIME"
    operator    = "GREATER_THAN"
    threshold   = 1
    units       = "HOURS"
  }
}
