Monitoring in the Cloud

Comprehensive monitoring is a must now that development moves faster than ever

The cloud has transformed the economics of infrastructure, all but removing the barrier to entry for building applications on world-class technology. It has brought about a fundamental change on the operations side as well: the effortless scaling made possible by the cloud means that the typical organization’s infrastructure is always in flux.

In the following chapters, we will outline a practical monitoring framework for dynamic infrastructure. This framework comes out of our experience monitoring large-scale infrastructure for thousands of customers, as well as for our own rapidly scaling application in the cloud.

Collecting the Right Data

Most infrastructure monitoring data falls into one of two categories: metrics and events.

Metrics

Metrics capture a value pertaining to your systems at a specific point in time. There are two important categories of metrics:

Work Metrics

Work metrics indicate the top-level health of your system by measuring its useful output. They are invaluable for surfacing real, often user-facing issues.

Metric Type	Description	Example (Web server)
Throughput	The amount of work completed per unit time	Requests per second
Success	The portion of work executed successfully	2XX Responses / Total responses
Error	The number, rate, or percentage of erroneous results	5XX Responses / Total responses
Performance	Measurement of how efficiently a component is doing its work	95th percentile response time
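
As a rough illustration of the table above, the four work metrics for a web server could be derived from one window of request records. This is a minimal sketch, not a prescribed implementation: the `Request` type, the `work_metrics` function, and the nearest-rank percentile method are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code of the response
    duration_ms: float  # response time in milliseconds


def work_metrics(requests, window_seconds):
    """Summarize one window of requests into the four work metrics."""
    total = len(requests)
    if total == 0 or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")

    successes = sum(1 for r in requests if 200 <= r.status < 300)
    errors = sum(1 for r in requests if 500 <= r.status < 600)

    # Nearest-rank 95th percentile; integer math avoids float rounding.
    durations = sorted(r.duration_ms for r in requests)
    rank = max(1, (95 * total + 99) // 100)

    return {
        "throughput_rps": total / window_seconds,  # Throughput
        "success_rate": successes / total,         # Success
        "error_rate": errors / total,              # Error
        "p95_ms": durations[rank - 1],             # Performance
    }
```

Note that 4XX responses count toward neither the success rate nor the error rate here, mirroring the 2XX and 5XX examples in the table.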

Resource Metrics

Most components of your infrastructure serve as a resource to other systems. Resource metrics are especially valuable for investigating problems.

Metric Type	Description	Example (Web server)
Utilization	The percentage of time that the resource is busy or how much of the resource’s capacity is in use	Open database connections
Saturation	The amount of requested work that the resource cannot yet service	Disk queue depth
Error	Internal errors that may not be observable in the work the resource produces	Failed connection attempts
Availability	The percentage of time that the resource responded to requests	N/A
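
To make the four resource metrics concrete, here is a small sketch for a hypothetical database connection pool. The `PoolSample` fields and the use of health checks to estimate availability are assumptions for illustration; real resources expose these figures in different ways.

```python
from dataclasses import dataclass


@dataclass
class PoolSample:
    in_use: int           # connections currently checked out
    capacity: int         # maximum number of connections
    waiting: int          # requests queued for a free connection
    failed_attempts: int  # connection attempts that failed in the window
    checks_ok: int        # health checks that got a response
    checks_total: int     # health checks sent in the window


def resource_metrics(sample):
    """Map one sample of a resource onto the four resource metrics."""
    if sample.capacity <= 0 or sample.checks_total <= 0:
        raise ValueError("capacity and checks_total must be positive")
    return {
        "utilization": sample.in_use / sample.capacity,
        "saturation": sample.waiting,
        "errors": sample.failed_attempts,
        "availability": sample.checks_ok / sample.checks_total,
    }
```

Utilization and availability are expressed as fractions between 0 and 1, while saturation and errors are plain counts.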

Events

In contrast to metrics, which are collected more or less continuously, events are discrete, infrequent occurrences. Events capture what happened, at a point in time, with optional additional information. These provide crucial context for understanding changes in your system’s behavior.
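
One possible way to represent such an event is a small record with the three parts named above: what happened, when, and optional extra information. The `Event` class and the sample values are hypothetical and only meant to show the shape of the data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    what: str                                 # what happened
    when: datetime                            # the point in time it happened
    info: dict = field(default_factory=dict)  # optional additional information


# Hypothetical examples: a code release with context, and a bare restart.
deploy = Event(
    what="code release",
    when=datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc),
    info={"version": "1.4.2", "service": "web"},
)
restart = Event(
    what="host restarted",
    when=datetime(2024, 1, 15, 14, 45, tzinfo=timezone.utc),
)
```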

Alerting on What Matters

Automated alerts allow you to spot problems anywhere in your infrastructure, so that you can rapidly identify their causes and minimize service degradation and disruption. Know the levels of alerting urgency:

Alerts as records
(Low severity)

Many alerts will not be associated with a service problem, so a human may never even need to be aware of them.

Alerts as notifications
(Moderate severity)

The next tier of alerting urgency is for issues that do require intervention, but not right away.

Alerts as pages
(High severity)

The most urgent alerts should receive special treatment and be escalated to a page (as in "pager") to urgently request human attention.
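
The three urgency levels above could be encoded as a simple routing table. This is a sketch under assumed names: `Severity`, `ACTIONS`, and `route_alert` are illustrative, and a real system would hand the alert to a log, a chat or email channel, or a paging service instead of returning a string.

```python
from enum import Enum


class Severity(Enum):
    LOW = "low"            # alerts as records
    MODERATE = "moderate"  # alerts as notifications
    HIGH = "high"          # alerts as pages


# What to do with an alert at each level of urgency.
ACTIONS = {
    Severity.LOW: "record",
    Severity.MODERATE: "notify",
    Severity.HIGH: "page",
}


def route_alert(name, severity):
    """Return the action to take for an alert, as 'action: name'."""
    return f"{ACTIONS[severity]}: {name}"
```

For example, `route_alert("disk 80% full", Severity.MODERATE)` returns `"notify: disk 80% full"`.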

Investigating Performance Issues

Investigating is often the least structured aspect of monitoring, driven largely by hunches and guess-and-check. This chapter describes a more directed approach for finding and correcting root causes.

1. Start at the top with work metrics

First, examine the work metrics for the highest-level system that is exhibiting problems. These metrics will usually set the direction for your investigation.

2. Dig into resources

Next, examine the system's resources: physical resources as well as the services that support the system. Well-designed dashboards enable you to quickly scan relevant resource metrics for each system.

3. Did something change?

Next consider events that may be correlated with your metrics. Look for code releases, internal alerts, or other events that were recorded just before the problem developed.

4. Fix it (and don't forget it)

Once you have determined what caused the issue, correct it. Your investigation is complete when symptoms disappear.
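
Step 3 in particular lends itself to a small helper: given the recorded events and the time the problem began, list what happened just before. The function name, the `(time, description)` tuple format, and the 30-minute default window are assumptions for this sketch.

```python
from datetime import datetime, timedelta


def events_before(events, problem_start, lookback=timedelta(minutes=30)):
    """Return events recorded within `lookback` before the problem began.

    `events` is an iterable of (time, description) pairs. The result is
    sorted most recent first, since the latest change is often the first
    one worth checking.
    """
    window_start = problem_start - lookback
    recent = [
        (when, what)
        for when, what in events
        if window_start <= when <= problem_start
    ]
    return sorted(recent, key=lambda e: e[0], reverse=True)
```

An event that shows up in this window is only correlated with the problem; it still has to be confirmed as the cause against the work and resource metrics.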

Build dashboards before you need them:

To keep your investigations focused, set up dashboards in advance. You may
want to set up one dashboard for your high-level application metrics, and
one dashboard for each subsystem.
