All the data in the world means nothing if it’s not the right data. But when it comes to delivering reliable software and troubleshooting issues, what is the right data?
To answer this question, we created a framework that helps organizations pinpoint critical gaps in data and metrics that are holding them back on their reliability journeys. At the foundation of this framework is the concept of Continuous Reliability (CR), or the notion of balancing speed, complexity, and quality by taking a continuous, proactive approach to reliability across the SDLC. When it comes to CR, it’s not just about what data you can capture, but how you analyze and leverage it.
With increasingly complex systems and ever-growing expectations for digital customer experiences, traditional tools and the shallow data they provide are insufficient. To fully understand what’s going on inside your application and maintain stability, this data must be collected at the code level.
One of the things that makes Harness Service Reliability Management (SRM) a powerful reliability tool is the way that we capture, analyze, and present code-level data across the software delivery lifecycle. In this post, we’ll break down the four key types of data SRM captures and why they’re critical to advancing your journey toward Continuous Reliability.
Capturing all the information about events occurring in your code is critical to deciphering which issues need to be addressed. Before you can effectively prioritize and fix critical code-level issues, you first need visibility into exactly which issues are occurring.
At the most basic level, SRM automatically captures 100% of events happening within your application in both test and production – even those missed by your logging framework or APM tools. This includes:
With SRM, you no longer need to rely on logs and foresight into which events to capture, what to include in a log statement, or how to analyze it.
On top of detecting every event, SRM applies a layer of intelligence to automatically prioritize all events based on severity. That way, your team can focus on the issues that matter most.
Taking into account factors such as whether an error is new, when it was first and last seen, how many times it has occurred, and whether there has been a sudden increase, SRM marks errors as severe based on criteria such as whether a new or increasing error is uncaught, or whether its volume and rate exceed a certain threshold. It considers established baselines and averages to pinpoint anomalies and immediately notify DevOps and SRE teams of events that require immediate resolution.
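To make the idea concrete, here is a minimal sketch of how a severity heuristic like the one described above could be expressed. The field names, thresholds, and `is_severe` function are illustrative assumptions for this post, not SRM's actual schema or algorithm.

```python
from dataclasses import dataclass

# Hypothetical error record; field names are assumptions, not SRM's schema.
@dataclass
class ErrorEvent:
    is_new: bool           # first seen in the current release window
    is_caught: bool        # handled by application code
    count: int             # occurrences in the current window
    baseline_count: float  # average occurrences in prior windows

def is_severe(event: ErrorEvent,
              volume_threshold: int = 100,
              spike_factor: float = 2.0) -> bool:
    """Mark an error severe if it is new and uncaught, exceeds an
    absolute volume threshold, or spikes well above its baseline."""
    if event.is_new and not event.is_caught:
        return True
    if event.count >= volume_threshold:
        return True
    if event.baseline_count > 0 and event.count >= spike_factor * event.baseline_count:
        return True
    return False
```

For example, a brand-new uncaught exception is flagged immediately even at low volume, while a long-standing caught error is only flagged once its rate doubles against its baseline or crosses the volume threshold.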
Many APM vendors will tell you that they provide the root cause of an issue, including “code-level” insights. What they actually mean is that they provide you with a stack trace. Stack traces, while useful, only help identify the layer of code where an issue occurred. From there, you’re left to your own devices, including spending time manually digging through shallow log files to find context that can help you reproduce the issue.
Service Reliability Management helps you go beyond the stack trace, capturing deep data, down to the lowest level of detail – without dependency on developer or operational foresight. This includes:
In the context of software development and reliability, a transaction is a sequence of calls that are treated as a unit, often based on a user-facing function. When a transaction fails, customer experience is often impacted, so it’s important to be able to identify and prioritize these failures in the context of the transactions that they impact.
SRM captures data about every transaction failure: how often the failure occurred, how many transactions it affected, and the response time of those transactions. Using insights from the code events we mentioned above, we can determine the success of a transaction by correlating errors, exceptions, and slowdowns within a given timeframe and surface this data to our users.
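The correlation step can be sketched roughly as follows: a transaction is treated as failed if an error event falls inside its time span or if its response time breaches a slowdown threshold. The record shapes and the `transaction_failed` helper are assumptions for illustration only; SRM's internal model is not public.

```python
from datetime import datetime, timedelta

# Illustrative records; field names are assumptions for this sketch.
transactions = [
    {"id": "t1", "start": datetime(2023, 1, 1, 12, 0, 0), "duration_ms": 120},
    {"id": "t2", "start": datetime(2023, 1, 1, 12, 0, 5), "duration_ms": 2400},
]
errors = [
    {"timestamp": datetime(2023, 1, 1, 12, 0, 5, 500000),
     "type": "NullPointerException"},
]

def transaction_failed(txn, errors, slow_threshold_ms=2000):
    """A transaction counts as failed if an error occurred within its
    time span, or if its response time exceeded the slowdown threshold."""
    end = txn["start"] + timedelta(milliseconds=txn["duration_ms"])
    had_error = any(txn["start"] <= e["timestamp"] <= end for e in errors)
    return had_error or txn["duration_ms"] > slow_threshold_ms
```

Here `t2` is flagged on both counts: the `NullPointerException` lands inside its span, and its 2.4-second response time breaches the threshold, while `t1` completes cleanly.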
Performance metrics include things like throughput – the number of transactions that occur during a given period of time – and response time baselines. The ability to capture data about application performance is critical to understanding what your end users are experiencing, as well as correlating related events that may help with identifying the root cause.
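These two metrics are simple to define. The sketch below shows one common way to compute them; the function names and the 1.5x tolerance factor are assumptions for illustration, not SRM's implementation.

```python
from statistics import mean

def throughput(timestamps: list, window_seconds: float) -> float:
    """Transactions per second over a fixed observation window."""
    return len(timestamps) / window_seconds

def response_time_anomaly(samples_ms: list[float],
                          baseline_ms: float,
                          tolerance: float = 1.5) -> bool:
    """Flag the current window if its mean response time exceeds the
    established baseline by more than the tolerance factor."""
    return mean(samples_ms) > tolerance * baseline_ms
```

For instance, 300 transactions observed over a 60-second window yields a throughput of 5 transactions per second, and a window averaging 500 ms against a 200 ms baseline would be flagged as anomalous.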
SRM focuses on data at the code level of your application, but we recognize the importance of correlating code-level failures with other aspects of your system. For example, what impact did your latest deployment have on CPU/memory utilization? Are there any blocked threads related to this failure? Was this CPU spike caused by the application?
Through the SRM reliability dashboards, you can correlate events, transactions and performance metrics to things like Garbage Collection, Threads, CPU, Class Loading and Memory Consumption, giving you a more comprehensive view into dependencies indirectly related to your application.
What allows SRM to capture this depth and breadth of data that other monitoring tools simply can’t? The not-so-secret secret to our unique capabilities is a combination of a few key elements:
To learn more about how SRM can help you capture deeper data, schedule a call with one of our engineers.
The powerful combination of data and analysis is the key to enterprise-scale observability and reliability. SRM not only helps your team capture a complete picture of how your code is executing, including the errors and slowdowns that occur, but also analyzes and adds meaning to that data so you know exactly which issues to prioritize.