Organizations moving to the cloud have to navigate increasingly complex tech stacks: Kubernetes, containers, microservices and software as a service, not to mention the countless application programming interfaces holding it all together and linking it with on-premise, “monolithic” systems.
Security company Splunk’s third annual report, The State of Observability 2023, covers 10 countries and 16 industries. It examines how companies that have instituted comprehensive observability technology and practices, and those that have not, are using telemetry to see and understand, not just monitor, what is happening in their systems and to dynamically get answers to specific questions about vulnerabilities and performance.
What is observability, and why is it important?
Observability, which has been around since the dawn of mechanical systems analysis, combines systems monitoring with diagnostic analysis. Splunk said its research shows that observability applied to organizations’ digital surfaces is a powerful way to reduce outages, improve app reliability and optimize customer experience, and that it is table stakes for resilience.
Spiros Xanthos, SVP and general manager for the observability practice at Splunk, explained that observability is really a superset of exercises that were once separate: application performance monitoring, infrastructure monitoring and digital experience monitoring, as well as AIOps and log management categories. He said these categories have converged because of:
- Changing software delivery practices, such as the advent of DevOps, infrastructure-as-code and continuous delivery.
- Changing software architectures, such as cloud, microservices and Kubernetes.
“Observability helps IT operations and engineering teams improve digital resilience by lowering the cost of unplanned downtime across all of their infrastructure and applications,” Xanthos said, adding that there are some similarities between observability practice and visibility tools deployed for security.
“Companies … have to be able to detect security issues very quickly before they become breaches and react to them,” Xanthos said. “Similarly, when it comes to observability, we want to be able to detect a failure scenario early on before it becomes something that can cause an outage and essentially result in bad customer experience.
“So whether it’s a security breach that essentially creates, let’s say a trust issue with the users or customer experience issues, both the goals and the tools are very similar,” he added.
Follow the leader: Strong observability means fewer outages
Splunk’s report, based on a survey of 1,750 observability practitioners, managers and experts from organizations with 500 or more employees, picked leaders out of the pack and compared them to relative observability newbies.
Who are the observability leaders?
Splunk defined observability leaders as organizations with at least 24 months of experience with observability that are also ahead of the pack in:
- The ability to correlate data across all observability tools.
- The adoption of artificial intelligence and machine learning for observability.
- Skills specialization in observability and the ability to cover both cloud-native and traditional application architectures.
Leaders were nearly eight times as likely as beginners to say that their ROI on observability tools far exceeded expectations. Approximately 90% of leaders said they were “completely confident” in their ability to meet availability and performance requirements for their applications, and leaders were four times as likely to have resolved instances of unplanned downtime or serious service issues in just minutes versus hours or days.
“Advanced observability leads to much more resilient kind of digital systems,” said Xanthos. Leaders in observability polled by the study reported excellent visibility into:
- Containers (71% of leaders versus 32% of beginners)
- Public cloud IaaS (71% versus 38%)
- Security posture (70% versus 37%)
- On-premise infrastructure (66% versus 34%)
- Applications at the code level (66% versus 31%)
Leaders experience 33% fewer outages per year than beginners, and 80% of organizations that are leaders in observability reported they could find and fix problems faster.
But most organizations haven’t reached leader status. The study said 74% of respondents were beginners.
Holistic approach key to resilience: The ability to see the forest and the trees
More organizations are moving to unified security monitoring and observability, according to Splunk. The combined view provides better context around things that go bump in the night: interface issues, outages, problems and bugs, along with what caused them and how they affect the cloud and on-premise systems they touch. Splunk reported that the ability to see not just the granular problem but also the larger context accelerates resolution.
Respondents to the survey said the reasons they chose to unify observability include:
- More granular and precise threat detection, with 59% of all respondents saying they uncovered security issues more effectively with intelligence and correlation capabilities.
- Better ability to find security vulnerabilities, with 55% saying they uncovered and assessed more vulnerabilities.
- Speed, with 51% of respondents saying they took action on security issues faster, thanks to the remediation capabilities of observability solutions.
On average, respondents reported having 165 business applications, with about half in the public cloud and half on-premise. Seventy-three percent of respondents reported they’ve been using observability tools for over a year, with 14% having used them for more than three years. Forty percent of respondents said they have a formal approach to resilience in place.
The most cited observability tools included:
- Network performance monitoring (79%)
- Security monitoring (78%)
- Application performance monitoring (78%)
- Digital experience monitoring (72%)
- Infrastructure monitoring (70%)
Eighty-one percent of respondents said the number of observability tools and capabilities they use has been increasing recently, with 32% saying the increase is significant.
Xanthos said there are diminishing returns around observability when it comes to the proliferation of tools.
“Observability began with issues around different tools, for things like monitoring infrastructure, networks, applications and tools for doing things like log analysis,” Xanthos said. “The problem with that is that modern systems tend to be obviously very interconnected, so something like a failure in infrastructure can connect to an application problem or a customer experience issue. So observability involves the idea of fully connected tools.
“So, that’s kind of the starting point. In scenarios where customers are using different tools that are not connected, humans have to be able to jump between these tools and troubleshoot, which is much less effective.”
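To make the idea of connected tools concrete, here is a minimal Python sketch of what correlating data across two tools might look like. It is an illustration only, not Splunk's method: the `Event` record, the `correlate` function, the tool names, the host names and the five-minute window are all hypothetical, and it assumes both tools tag events with a shared host name.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    source: str      # hypothetical tool name, e.g. "infra" or "apm"
    host: str        # assumed shared key that lets the two tools be joined
    time: datetime
    message: str


def correlate(infra, app, window=timedelta(minutes=5)):
    """Pair each application event with infrastructure events on the
    same host that happened at most `window` before it."""
    pairs = []
    for a in app:
        for i in infra:
            if i.host == a.host and timedelta(0) <= a.time - i.time <= window:
                pairs.append((i, a))
    return pairs


# Made-up sample data for illustration only.
infra_events = [
    Event("infra", "web-1", datetime(2023, 5, 1, 12, 0), "disk latency spike"),
    Event("infra", "web-2", datetime(2023, 5, 1, 9, 0), "node restarted"),
]
app_events = [
    Event("apm", "web-1", datetime(2023, 5, 1, 12, 3), "checkout errors rising"),
    Event("apm", "web-2", datetime(2023, 5, 1, 12, 4), "slow responses"),
]

linked = correlate(infra_events, app_events)
for i, a in linked:
    print(f"{a.host}: '{i.message}' precedes '{a.message}'")
```

In this sample, only the web-1 events fall within the window, so the sketch links the disk latency spike to the checkout errors without anyone having to jump between tools. Real observability platforms typically join on richer context such as trace IDs, which this sketch does not attempt.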
Many are hedging moves to cloud-native
Based on the study’s results, many organizations are taking a hybrid approach to cloud-native: keeping applications on monoliths while also keeping the flow of cloud-native apps moving.
- Fifty-eight percent of respondents said cloud-native apps will be a bigger proportion of their internally developed apps a year from now, versus 67% last year.
- Forty percent said they will balance cloud-native with on-premise apps.
- Only 2% said they will reduce their cloud-native footprint.