Monitoring in the Cloud

Comprehensive monitoring is a must now that development moves faster than ever

The cloud has transformed the economics of infrastructure, sharply lowering the barrier to entry for building applications on world-class technology. It has brought about a fundamental change on the operations side as well: the effortless scaling made possible by the cloud means that the typical organization’s infrastructure is always in flux.

In the following chapters, we will outline a practical monitoring framework for dynamic infrastructure. This framework comes out of our experience monitoring large-scale infrastructure for thousands of customers, as well as for our own rapidly scaling application in the cloud.

Collecting the Right Data

Most infrastructure monitoring data falls into one of two categories: metrics and events.

  • Metrics

Metrics capture a value pertaining to your systems at a specific point in time. There are two important categories of metrics: work metrics and resource metrics.

  • Events

In contrast to metrics, which are collected more or less continuously, events are discrete, infrequent occurrences. Events capture what happened, at a point in time, with optional additional information. These provide crucial context for understanding changes in your system’s behavior. 
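The distinction between the two data types can be sketched in code. This is a minimal illustration, not any particular monitoring tool's data model: the class names, field names, metric name, 15-second interval, and version string are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class MetricPoint:
    """A value pertaining to a system at a specific point in time."""
    name: str         # hypothetical metric name
    value: float
    timestamp: float  # seconds since the Unix epoch


@dataclass(frozen=True)
class Event:
    """A discrete occurrence: what happened, when, plus optional context."""
    title: str
    timestamp: float
    details: Dict[str, str] = field(default_factory=dict)


# Metrics are sampled more or less continuously (here, every 15 seconds).
samples = [
    MetricPoint("web.requests_per_second", 120.0 + i, 1_700_000_000 + 15 * i)
    for i in range(4)
]

# An event is recorded once, when something happens.
deploy = Event("code release", 1_700_000_030, {"version": "1.4.2"})
```

Keeping a timestamp on both types is what later makes it possible to line up an event, such as a code release, against a change in a metric.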

Alerting on What Matters

Automated alerts allow you to spot problems anywhere in your infrastructure, so that you can rapidly identify their causes and minimize service degradation and disruption. Know the levels of alerting urgency:

Alerts as records
(Low severity)

Many alerts will not be associated with a service problem, so a human may never even need to be aware of them.

Alerts as notifications
(Moderate severity)

The next tier of alerting urgency is for issues that do require intervention, but not right away.

Alerts as pages
(High severity)

The most urgent alerts should receive special treatment and be escalated to a page (as in "pager") to urgently request human attention.
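The three urgency levels above can be sketched as a small routing rule. This is an illustrative sketch only: the two boolean inputs and the destination names ("log", "email", "pager") are assumptions for the example, not a prescribed configuration.

```python
from enum import Enum


class Urgency(Enum):
    RECORD = "low"
    NOTIFICATION = "moderate"
    PAGE = "high"


def classify(needs_intervention: bool, needs_it_now: bool) -> Urgency:
    """Map an alert onto one of the three urgency levels."""
    if not needs_intervention:
        # Not tied to a service problem: keep it as a record only.
        return Urgency.RECORD
    # Intervention is required; the only question is how soon.
    return Urgency.PAGE if needs_it_now else Urgency.NOTIFICATION


def route(urgency: Urgency) -> str:
    """Return a hypothetical destination for an alert of the given urgency."""
    if urgency is Urgency.PAGE:
        return "pager"
    if urgency is Urgency.NOTIFICATION:
        return "email"
    return "log"
```

For example, `route(classify(needs_intervention=True, needs_it_now=False))` returns `"email"`, so the issue reaches a human without interrupting anyone.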

Investigating Performance Issues

Investigating is often the least structured aspect of monitoring, driven largely by hunches and guess-and-check. This chapter describes a more directed approach for finding and correcting root causes.

1. Start at the top with work metrics

First examine the work metrics for the highest-level system that is exhibiting problems. These metrics will usually set the direction for your investigation.

2. Dig into resources

Next examine the system's resources: physical resources as well as the services that support the system. Well-designed dashboards enable you to quickly scan relevant resource metrics for each system.

3. Did something change?

Next consider events that may be correlated with your metrics. Look for code releases, internal alerts, or other events that were recorded just before the problem developed.

4. Fix it (and don't forget it)

Once you have determined what caused the issue, correct it. Your investigation is complete when symptoms disappear.
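Step 3 above, looking for events recorded just before the problem developed, can be sketched as a simple time-window filter. The `Event` type, the function name, and the 15-minute default window are assumptions chosen for illustration; a real investigation would query whatever event store your monitoring system provides.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    title: str
    timestamp: float  # seconds since the Unix epoch


def events_before(
    events: List[Event], problem_start: float, window: float = 900.0
) -> List[Event]:
    """Return events within `window` seconds before `problem_start`, newest first."""
    recent = [
        e for e in events
        if problem_start - window <= e.timestamp <= problem_start
    ]
    return sorted(recent, key=lambda e: e.timestamp, reverse=True)


# Hypothetical event history and problem start time.
history = [
    Event("config change", 1_000.0),
    Event("code release", 4_700.0),
    Event("internal alert", 4_950.0),
    Event("autoscaling", 5_200.0),
]
suspects = events_before(history, problem_start=5_000.0)
```

Here `suspects` holds the internal alert and the code release, in that order. The config change is too old to fall inside the window, and the autoscaling event happened after the problem began.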

Build dashboards before you need them:

To keep your investigations focused, set up dashboards in advance. You may
want to set up one dashboard for your high-level application metrics, and
one dashboard for each subsystem.


What are ITSM processes? ITIL version 4 recently went from recommending ITSM “processes” to introducing 34 ITSM “practices”. Their reasoning for this updated terminology is that “elements such as culture, technology, information and data management can be considered to get a holistic view of ways of working”. This more comprehensive approach better reflects the realities of modern organizations.

 

Here, we will not concern ourselves with nuanced differences in the use of practice or process terminology. What’s important and true, no matter what framework your team follows, is that modern IT service teams use organizational resources and follow repeatable procedures to deliver consistent and efficient service. In fact, leveraging practice or process is what distinguishes ITSM from IT.

Change management ensures standard procedures are used for efficient and prompt handling of all changes to IT infrastructure, whether it’s rolling out new services, managing existing ones, or resolving problems in the code. Effective change management provides context and transparency to avoid bottlenecks, while minimizing risk. Don’t feel overwhelmed by these and the even longer list of ITIL practices.

Problem management is the process of identifying and managing the causes of incidents on an IT service. It isn’t just about finding and fixing incidents; it is also about understanding the underlying causes of an incident and identifying the best method to eliminate those root causes.

Incident management is the process to respond to an unplanned event or service interruption and restore the service to its operational state. Considering all the software services organizations rely on today, there are more potential failure points than ever, so this process must be ready to quickly respond to and resolve issues.

IT asset management (also known as ITAM) is the process of ensuring an organization’s assets are accounted for, deployed, maintained, upgraded, and disposed of when the time comes. Put simply, it’s making sure that the valuable items, tangible and intangible, in your organization are tracked and being used.

Knowledge management is the process of creating, sharing, using, and managing the knowledge and information of an organization. It refers to a multidisciplinary approach to achieving organizational objectives by making the best use of knowledge.

Service request management is a repeatable procedure for handling the wide variety of customer service requests, like requests for access to applications, software enhancements, and hardware updates. The service request workstream often involves recurring requests, and benefits greatly from enabling customers with knowledge and automating certain tasks.

It’s simply not enough to have an ITSM solution – you need one that actually accelerates how your teams work.

Atlassian’s ITSM solution unlocks IT at high velocity by streamlining workflows across development and operations at scale. What were once many siloed teams with different ways of working are now integrated and more collaborative than ever before.

ITSM benefits your IT team, and service management principles can improve your entire organization. ITSM leads to efficiency and productivity gains. A structured approach to service management also brings IT into alignment with business goals, standardizing the delivery of services based on budgets, resources, and results. It reduces costs and risks, and ultimately improves the customer experience.