Designing a metrics monitoring and alerting system
By: Simardeep Singh, Statistics Canada
Introduction
Designing a metrics monitoring and alerting system is a crucial step in ensuring the health and performance of any system or application. A well-designed system can help identify potential issues before they become critical, allowing for quick resolution and minimizing downtime.
The first step in designing a metrics monitoring and alerting system is to identify the key metrics that need to be monitored. These metrics should be chosen based on the specific goals and objectives of the system or application, as well as its unique characteristics and requirements. For example, a website may need to monitor metrics such as page load time, user engagement, and server response time, while a mobile app may need to monitor metrics such as battery usage and network performance.
Once the key metrics have been identified, the next step is to determine how they will be collected and stored. This may involve setting up specialized monitoring tools or using existing tools and services. It is important to ensure that the data collected is accurate, reliable, and easily accessible.
The next step is to set up alerts and notifications based on the metrics being monitored. This can be done using a variety of tools and methods, such as email, SMS, or push notifications. The alert thresholds should be carefully chosen so that they are sensitive enough to detect potential issues, but not so sensitive that they generate false alarms.
Finally, it is important to regularly review and assess the performance of the metrics monitoring and alerting system. This can involve analyzing the data collected, identifying areas for improvement, and making any necessary adjustments to the system. By continuously improving the system, it can remain effective and reliable over time.
What are the major components of the system?
A metrics monitoring and alerting system consists of five components:
- Data collection: collects metric data from different resources.
- Data transmission: transfers data from sources to the metrics monitoring system.
- Data storage: organizes and stores incoming data.
- Alerting: analyzes the incoming data, detects anomalies and generates alerts. The system must be able to send alerts to different communication channels configured by the organization.
- Visualization: presents data in graphics, charts, etc. It's easier to identify the patterns, trends or problems when data is presented visually.
How to design the metrics for monitoring and alerting system
In this section, we discuss some fundamentals of building the system, the data model, and the high-level design.
Data modelling: Metrics data is generally recorded as a time series: a set of values with their associated timestamps. The series itself can be identified by its name and, optionally, by a set of labels. Every time series consists of the following:
Table 1: Time series
| Name | Type |
|---|---|
| A metric name | String |
| A set of tags/labels | List of &lt;key: value&gt; pairs |
| An array of values and their timestamps | An array of &lt;value, timestamp&gt; pairs |
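As an illustration, the structure in Table 1 might be modelled like this in Python. This is a minimal sketch; the class and field names are ours, not the schema of any particular database:

```python
from dataclasses import dataclass, field

@dataclass
class TimeSeries:
    """One metric time series: a name, a set of labels, and (value, timestamp) pairs."""
    name: str
    labels: dict                                 # e.g. {"host": "web-01", "region": "us-west"}
    points: list = field(default_factory=list)   # list of (value, unix_timestamp) tuples

    def append(self, value, timestamp):
        self.points.append((value, timestamp))

# Example: CPU load samples from one web server
cpu = TimeSeries(name="cpu.load", labels={"host": "web-01", "region": "us-west"})
cpu.append(0.62, 1_700_000_000)
cpu.append(0.71, 1_700_000_060)
```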
Data access pattern: Consider a real-world scenario where the alerting system must calculate the average CPU load across all the web servers in a specific region, averaged over 10-minute windows. With roughly 10 million operational metrics written per day, many of them collected at high frequency, the system is under constant heavy write load. The read load, by contrast, is spiky: both the visualization and alerting services send queries to the database, and the read volume rises and falls with the access patterns of dashboards and alert rules.
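The kind of label-based aggregation this access pattern implies can be sketched in a few lines of Python. This is purely illustrative; the series layout and the `avg_over_labels` helper are assumptions, not a real query API:

```python
def avg_over_labels(series_list, label_key, label_value):
    """Average the latest value of every series matching a label,
    e.g. the mean CPU load across all web servers in one region."""
    values = [s["points"][-1][0] for s in series_list
              if s["labels"].get(label_key) == label_value]
    return sum(values) / len(values) if values else None

# Toy fleet: two web servers in us-west, one in eu-1
fleet = [
    {"labels": {"region": "us-west", "host": "web-01"}, "points": [(0.25, 100)]},
    {"labels": {"region": "us-west", "host": "web-02"}, "points": [(0.75, 100)]},
    {"labels": {"region": "eu-1", "host": "web-03"}, "points": [(0.90, 100)]},
]
print(avg_over_labels(fleet, "region", "us-west"))  # 0.5
```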
Data storage system: A general-purpose database, in theory, could support time-series data, but it will require extensive tuning to make it work on a large scale. A relational database is not optimized for operations commonly performed against time-series data.
There are many storage systems optimized for time-series data. These optimizations mean far fewer servers are needed to handle the same volume of data. Many of these databases also have custom query interfaces, designed for the analysis of time-series data, that are much easier to use than structured query language (SQL).
Two very popular time-series databases are InfluxDB and Prometheus, which are designed to store large volumes of time-series data and perform real-time analysis. Another feature of a strong time-series database is efficient aggregation. InfluxDB, for example, builds indexes on the labels to facilitate the fast lookup of time series by labels.
High level design
Figure 1: High-level design for a metrics monitoring and alerting system
- Metrics source: This can be application servers, SQL databases, message queues, etc.
- Metrics collector: Gathers metrics data and writes data into the time-series database.
- Time-series database: This stores metrics data as time series. It usually provides a custom-query interface for analyzing and summarizing a large amount of time-series data. It maintains indexes on labels to facilitate the fast lookup of data using the labels.
- Query service: The query service makes it easy to query and retrieve data from the time-series databases.
- Alerting system: This sends alert notifications to various alerting destinations.
- Visualization system: This shows metrics in the form of various graphs/charts.
Design deep dive
Let's investigate the designs in detail:
- Metrics collection
- Scaling the metrics transmission pipeline
- Query service
- Alerting system
- Visualization system
Metrics collection
There are two ways metrics data can be collected – pull or push.
Figure 2: Metrics collection flow
Pull model
In a pull model, the metrics collector pulls the metrics from the sources. Consequently, the metrics collector needs to know the complete list of service endpoints to pull the data from. We can use a reliable, scalable and maintainable service discovery component, such as etcd or ZooKeeper. The service discovery component contains configuration rules about when and where to collect the metrics.
- The metrics collector fetches the configuration metadata of the service endpoints from service discovery. The metadata includes the pulling interval, IP addresses, and timeout and retry parameters.
- The metrics collector pulls the metric data using the HTTP endpoint (for example, web servers) or TCP (transmission control protocol) endpoint (for DB clusters).
- The metrics collector registers a change event notification with the service discovery component to get an update whenever the service endpoints change.
Figure 3: Pull model in detail
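The pull loop described above can be sketched as follows. This is an illustrative example, not a real collector: the endpoint list, the `/metrics` path, and the injectable `fetch` parameter are all assumptions:

```python
import urllib.request

def discover_endpoints():
    """Stand-in for a service-discovery lookup (etcd or ZooKeeper in practice)."""
    return ["http://web-01:9100/metrics", "http://web-02:9100/metrics"]

def scrape(endpoint, timeout=5):
    """Pull the metrics text exposed by one target over HTTP."""
    with urllib.request.urlopen(endpoint, timeout=timeout) as resp:
        return resp.read().decode()

def collect_once(fetch=scrape):
    """One collection cycle: scrape every known endpoint, tolerating failures."""
    results = {}
    for endpoint in discover_endpoints():
        try:
            results[endpoint] = fetch(endpoint)
        except OSError:
            results[endpoint] = None  # target unreachable; retry next interval
    return results
```

In a real deployment this loop would run at the pulling interval from the discovered metadata and re-read the endpoint list whenever service discovery reports a change.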
Push model
In a push model, a collection agent is installed on every server being monitored. A collection agent is long-running software that collects metrics from the services running on the server and pushes those metrics to the metrics collector.
To prevent the metrics collector from falling behind in a push model, the collector should be an autoscaling cluster with a load balancer in front of it (Figure 4). The cluster scales up or down based on the CPU (central processing unit) load of the metrics collectors.
Figure 4: Push model in detail
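A push agent might look roughly like this. The payload shape and collector URL are assumptions for the sketch, and `os.getloadavg` is Unix-specific:

```python
import json
import os
import time
import urllib.request

def read_cpu_load():
    """1-minute load average (available on Unix-like systems)."""
    return os.getloadavg()[0]

def build_payload(host):
    """Serialize one metric sample; the JSON shape here is illustrative."""
    return json.dumps({
        "name": "cpu.load",
        "labels": {"host": host},
        "value": read_cpu_load(),
        "timestamp": int(time.time()),
    }).encode()

def push(collector_url, payload):
    """POST the sample to the collector (fronted by a load balancer)."""
    req = urllib.request.Request(collector_url, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)
```

The real agent would call `build_payload` and `push` on a fixed interval, buffering locally if the collector is temporarily unreachable.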
Pull or push?
So, what's best for a large organization? Knowing the advantages and disadvantages of each approach is important. A large organization generally needs to support both; serverless architectures, for example, typically require the push model because there is no long-running endpoint to pull from.
Push Monitoring System:
Advantages:
- Real-time notifications of issues and alerts
- Can alert multiple recipients at once
- Can be customized to specific needs and requirements
- Can be integrated with other systems and applications
Disadvantages:
- Requires a constant and reliable internet connection to function properly
- Can be overwhelming with too many notifications and alerts
- Can be vulnerable to cyber-attacks and security breaches
Pull Monitoring System:
Advantages:
- Can be accessed remotely and for multiple devices
- Can be set up to check specific metrics and parameters at regular intervals
- Can be easily configured and customized
- Can provide detailed and historical data for analysis and reporting
Disadvantages:
- Requires manual intervention to check and review the data
- May not provide real-time alerts and notifications
- Can be less efficient in identifying and responding to issues and anomalies
Scaling the metrics transmission pipeline
Whether we use the push or pull model, the metrics collector is a cluster of servers, and the cluster receives an enormous amount of data. There's a risk of data loss if the time-series database is unavailable. To mitigate that risk, we can introduce a queueing component, as shown in Figure 5.
Figure 5: Add queues
In this design, the metrics collector sends metric data to a queuing system such as Kafka. Consumers or stream-processing services such as Apache Spark then process the data and push it to the time-series database. This approach has several advantages:
- Kafka is used as a highly reliable and scalable distributed messaging platform.
- It decouples the data collection and processing services from one another.
- It can easily prevent data loss when the database is unavailable by retaining the data in Kafka.
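The decoupling the queue provides can be shown with an in-memory stand-in, with Python's `queue.Queue` playing the role of a Kafka topic and a plain list standing in for the time-series database (both are illustrative, not real infrastructure):

```python
from queue import Queue

metric_queue = Queue()   # stand-in for a Kafka topic
database = []            # stand-in for the time-series database

def collector_send(metric):
    """Collector side: enqueue and return immediately; no dependency on the DB."""
    metric_queue.put(metric)

def consumer_drain():
    """Consumer side: pull from the queue and persist. Because data stays in the
    queue until processed, a DB outage does not lose metrics."""
    while not metric_queue.empty():
        database.append(metric_queue.get())

collector_send({"name": "cpu.load", "value": 0.6})
collector_send({"name": "cpu.load", "value": 0.8})
consumer_drain()
```

With real Kafka, the consumer would additionally track offsets so it can resume or retry after a failure.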
Query service
The query service comprises a cluster of query servers that access the time-series database and handle requests from the visualization or alerting systems. Having a dedicated set of query servers decouples the time-series database from the visualization and alerting systems, which gives us the flexibility to change the time-series database, or the visualization and alerting systems, whenever needed.
To reduce the load of the time-series database and make the query service more performant, cache servers can be added to store query results, as shown in Figure 6.
Figure 6: Cache layer
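The cache layer's behaviour can be sketched with a tiny TTL cache keyed by the query string. The `QueryCache` class and `run_query` helper are assumptions for illustration, not a real caching product:

```python
import time

class QueryCache:
    """Tiny TTL cache for query results, keyed by the query string."""
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._store = {}   # query -> (result, expiry_time)

    def get(self, query):
        entry = self._store.get(query)
        if entry and entry[1] > time.time():
            return entry[0]
        return None

    def put(self, query, result):
        self._store[query] = (result, time.time() + self.ttl)

def run_query(query, cache, backend):
    """Serve from cache when possible; otherwise hit the time-series database."""
    cached = cache.get(query)
    if cached is not None:
        return cached            # cache hit: no DB load
    result = backend(query)      # cache miss: query the database
    cache.put(query, result)
    return result

calls = []
def backend(query):
    """Stand-in for the time-series database; records how often it is hit."""
    calls.append(query)
    return {"series": query, "points": [1, 2, 3]}

cache = QueryCache(ttl_seconds=60)
first = run_query("avg_cpu_us_west", cache, backend)
second = run_query("avg_cpu_us_west", cache, backend)  # served from cache
```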
Storage layer
Space optimization – To optimize storage, the following strategies can be used:
Data encoding and compression: Data encoding is the process of translating data from one format into another, typically for efficient transmission or storage. Data compression is a related process that reduces the number of bits required to represent a given piece of information. Together, encoding and compression can significantly reduce the size of the data.
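One common encoding trick for time-series data is delta encoding of timestamps: because samples usually arrive at a regular interval, storing differences instead of full timestamps yields small, repetitive numbers that compress very well. A minimal sketch:

```python
def delta_encode(timestamps):
    """Keep the first timestamp, then store successive differences.
    Regular sampling intervals produce small, highly compressible deltas."""
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

def delta_decode(deltas):
    """Reverse the encoding by accumulating the differences."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

ts = [1_700_000_000, 1_700_000_010, 1_700_000_020, 1_700_000_030]
encoded = delta_encode(ts)           # [1700000000, 10, 10, 10]
assert delta_decode(encoded) == ts   # lossless round trip
```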
Downsampling: Downsampling is the process of reducing the number of samples in a dataset by removing some data points. This is often done to reduce the amount of data that needs to be processed and to simplify the analysis. Downsampling can be done in a variety of ways, including randomly selecting a subset of the data points, using a specific algorithm to select the data points, or using a specific sampling frequency to reduce the data. If the data retention policy is set to one year, we can downsample the data as in the following example.
- Retention: seven days, no sampling
- Retention: 30 days, downsample to one-minute resolution
- Retention: one year, downsample to one-hour resolution
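A minimal downsampling routine along these lines (illustrative Python, with bucket boundaries aligned to the chosen resolution) could look like this:

```python
def downsample(points, resolution_seconds):
    """Average raw (value, timestamp) points into buckets of the given
    resolution, e.g. 60 for one-minute or 3600 for one-hour rollups."""
    buckets = {}
    for value, ts in points:
        bucket_ts = ts - ts % resolution_seconds   # align to bucket start
        buckets.setdefault(bucket_ts, []).append(value)
    return [(sum(v) / len(v), t) for t, v in sorted(buckets.items())]

raw = [(2.0, 0), (4.0, 30), (6.0, 60), (8.0, 90)]  # 30-second samples
print(downsample(raw, 60))   # [(3.0, 0), (7.0, 60)]
```

In practice, a rollup job would run this kind of aggregation as data ages past each retention boundary, replacing raw points with the coarser averages.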
Alerting system
A monitoring system is very useful for proactive interpretation and investigation, but one of the main advantages of a full monitoring system is that administrators do not have to watch the system continuously. Alerts let you define the situations that require active management, while relying on the passive monitoring software to watch for changing conditions.
The alert flow works as follows:
- Load the config files to the cache servers. Rules are defined as config files on the disk, shown in Figure 7.
Figure 7: Alerting system
- The alert manager fetches alert configs from the cache.
- Based on the config rules, the alert manager calls the query service at a predefined interval. If the value violates the threshold, an alert event is created. The alert manager is responsible for the following:
- Filter, merge, and dedupe alerts. Here's an example of merging alerts that are triggered for the same instance within a short amount of time.
Figure 8: Merge alerts
- Access control—to avoid human error and keep the system secure, it is essential to restrict access to certain alert management operations to authorized individuals only.
- Retry—the alert manager checks alert states and ensures a notification is sent at least once.
- The alert store is a key-value database, such as Cassandra, that keeps the state (inactive, pending, firing, resolved) of all alerts. It ensures a notification is sent at least once.
- Eligible alerts are inserted into a messaging and queuing system such as Kafka.
- Alert consumers pull alert events from the messaging and queuing system.
- Alert consumers process alert events from the messaging and queuing system and send notifications to different channels such as email, text message, PagerDuty, or HTTP endpoints.
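The threshold check and the merge/dedupe behaviour described above can be sketched as follows. The rule format, the dedupe window, and the `AlertManager` class are assumptions for illustration, not a real alerting product:

```python
import time

def evaluate(rule, value):
    """Return True when the queried value violates the rule's threshold."""
    return value > rule["threshold"]

class AlertManager:
    """Suppresses duplicate notifications for the same (rule, instance) pair
    within a dedupe window, so repeated firings merge into one notification."""
    def __init__(self, dedupe_window_seconds=300):
        self.window = dedupe_window_seconds
        self._last_sent = {}   # (rule_name, instance) -> last notification time

    def should_notify(self, rule_name, instance, now=None):
        now = time.time() if now is None else now
        key = (rule_name, instance)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False       # deduped: same alert fired recently
        self._last_sent[key] = now
        return True

rule = {"name": "high_cpu", "threshold": 0.9}
mgr = AlertManager()
assert evaluate(rule, 0.95)                                   # threshold violated
assert mgr.should_notify("high_cpu", "web-01", now=1000)      # first firing: notify
assert not mgr.should_notify("high_cpu", "web-01", now=1100)  # within window: merge
```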
Visualization system
Visualization is built on top of the data layer. Metrics can be shown on the metrics dashboard over various time scales and alerts can be shown on the dashboard. A high-quality visualization system is hard to build. There's a strong argument for using an off-the-shelf system. For example, Grafana can be a very good system for this purpose.
Wrap up
In this article we discussed the design for a metrics monitoring and alerting system. At a high level, we talked about the data collection, time-series database, alerts and visualization. We also dove into some of the important techniques and components, such as:
- Push versus pull model for collecting metrics data.
- Using Kafka to scale the system.
- Choosing the right time-series database.
- Using down sampling to reduce data size.
- Build versus buy options for alerting and visualization systems.
We went through a few iterations to refine the diagram, and our final design looks like this:
Figure 9: Final design
In conclusion, designing a metrics monitoring and alerting system is a crucial step in ensuring the health and performance of any system or application. By carefully selecting the key metrics to monitor, collecting and storing data accurately, setting up effective alerts and notifications, and regularly reviewing and improving the system, it is possible to create a robust and reliable system that can help identify and resolve potential issues before they become critical.
Meet the Data Scientist
If you have any questions about my article or would like to discuss this further, I invite you to Meet the Data Scientist, an event where authors meet the readers, present their topic and discuss their findings.
Thursday, February 16
2:00 to 3:00 p.m. ET
MS Teams – link will be provided to the registrants by email
Register for the Data Science Network's Meet the Data Scientist Presentation. We hope to see you there!
Subscribe to the Data Science Network for the Federal Public Service newsletter to keep up with the latest data science news.
Additional resources
- Datadog
- Splunk
- PagerDuty
- Elastic stack
- Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
- Distributed Systems Tracing with Zipkin
- Prometheus
- OpenTSDB-A Distributed, Scalable Monitoring System
- Data model
- MySQL
- Schema design for time-series data | Cloud Bigtable Documentation
- MetricsDB
- Amazon Timestream
- DB-Engines Ranking of time-series DBMS
- InfluxDB
- etcd
- Service Discovery with Zookeeper
- Amazon CloudWatch
- Graphite
- Push vs. Pull
- Pull doesn't scale or does it?
- Monitoring Architecture
- Push vs. Pull in Monitoring Systems