In brief, DevOps is an operations-focused approach to solving development pipeline issues. Site Reliability Engineering is a development-oriented discipline that addresses operational, scale, and reliability issues.
Software is consuming the world, and as our reliance on technology grows, disciplines have emerged to ensure that changes are implemented efficiently and that our systems are available when we need them. Over the last few years, DevOps and Site Reliability Engineering cultures and practices have become mainstream.
DevOps Engineer and SRE are highly sought-after job titles and skillsets. Understanding the overlap and differences between the two skillsets and organizations is critical to avoid falling into the trap of doing “CI/CD” while saying “DevOps/SRE.” Each solves different problems with unique and innovative approaches, ushering in new technological paradigms.
More efficient and reliable systems are not new technological goals. As with many other aspects of technology, the learnings and science behind both goals have grown to the point where specialized organizations and engineers are required.
Breaking down silos is at the heart of the DevOps movement. DevOps Engineers and leaders are trying to resolve the classic disconnect between development and operations teams. This division makes sense because developers aren’t always operators, and operators aren’t always developers. It takes time for organizations to untangle years of these practices, and how quickly mindsets will shift is hard to predict.
Having a scalable, reliable system is a critical problem to solve. As systems become more distributed, it becomes more difficult to ensure reliability or the appearance of reliability across a complex topology of moving parts. Internal and external customers alike expect systems to be available at all times, and even a brief outage (downtime) or decrease in uptime can harm a company’s reputation and bottom line.
DevOps Engineers, in a nutshell, are ops-focused engineers who solve development pipeline issues, whereas SREs are development-focused engineers who solve operational, scale, and reliability problems. Both sets of engineers bridge the gap by bringing their expertise and perspectives to the other side of the equation.
Engineering efficiency and reliability are two distinct domains that overlap to some extent. There is a link between system agility and system robustness. A counter-argument is that agility leads to a high rate of change, and change is inherently unreliable. Today’s challenges are large-scale, and as we continue to push the boundaries, it’s critical to adjust on the fly. The problems that each team solves reveal the culture and skills needed to succeed.
Engineering efficiency can be painted with a broad brush. When you look at some DevOps job postings, it may appear as if the employer is looking for a single person to manage an entire IT department. Because the Software Development Life Cycle (SDLC) can be a tricky path to navigate, DevOps teams strive to eliminate bottlenecks throughout the SDLC by removing production and automation barriers. As a result of Agile’s adoption, production changes are being created and deployed faster, and incremental changes are now the norm.
DevOps teams are purveyors of development tools, from guiding the start of the SDLC with source code management (SCM) recommendations to enabling Continuous Integration and Continuous Delivery in an organization. Because of their scope of responsibilities, DevOps teams can have ownership and oversight over various tools and platforms. SREs, on the other hand, are concerned with the overall health of the system. Finally, the DevOps pillars are worth noting because they provide more insight into the problems DevOps teams solve.
Safety, health, uptime, and the ability to solve unforeseen problems are all priorities for Site Reliability Engineering teams. An idealized version of SREs is that they are only called into action during an incident, assisting with problem-solving until the engineering teams complete proper remediation. Combating incidents is undoubtedly a crucial aspect of the job, but SREs also spend a significant amount of time using their extensive knowledge to ensure that a firefight does not occur in the first place.
SRE practices free development teams to focus on feature development rather than the nuances of achieving and maintaining service level commitments by removing some of the complex burdens of scaling and maintaining uptime in distributed systems.
Metrics are essential to both DevOps and SRE teams because you can’t improve what you can’t measure. The Service Level (SLx) commitments provide indicators and measurements of how well a system performs. SLAs, SLOs, and SLIs are a trio of metrics that capture the agreement reached, the objectives needed to meet that agreement, and the actual measurements against those objectives. You can learn about a system’s health by using SLOs and SLIs.
The commitment/agreement you make with your customers is a Service Level Agreement (SLA). Your customers could be internal, external, or another system. Typically, SLAs are designed to meet the needs of those customers or systems. SLAs have been around for a while, and many engineers would describe an SLA as “we need to respond in 2000ms or less,” which is an SLO in today’s terminology. “We require 99% uptime,” by contrast, would be an SLA.
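To make an uptime commitment such as “99% uptime” concrete, it can help to translate it into the amount of downtime it tolerates. The sketch below is only an illustration: the function name `allowed_downtime` and the 30-day period are assumptions, not part of any particular SLA.

```python
from datetime import timedelta


def allowed_downtime(uptime_target: float, period: timedelta) -> timedelta:
    """Return how much downtime an uptime target tolerates over a period.

    uptime_target is a fraction, e.g. 0.99 for "99% uptime".
    """
    if not 0.0 <= uptime_target <= 1.0:
        raise ValueError("uptime_target must be a fraction between 0 and 1")
    return period * (1.0 - uptime_target)


# Hypothetical example: a 99% uptime commitment measured over 30 days.
budget = allowed_downtime(0.99, timedelta(days=30))
print(f"Downtime tolerated over 30 days: {budget}")
```

Under these assumptions, a 99% target over 30 days leaves roughly 7.2 hours of downtime, which is why each additional “nine” makes a commitment noticeably harder to keep.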
Service Level Objectives (SLOs) are the goals that must be met in order to meet SLAs. Tom Wilkie’s RED method can help you come up with good metrics for SLOs: request rate, errors, and duration. “We need to reply in 2000ms or less 99 percent of the time,” for example, would fall under duration, or the time it takes for your system to complete a request. Google’s Four Golden Signals (latency, traffic, errors, and saturation) are also a useful basis for SLOs; they cover similar ground and add saturation. The purpose of SLIs is to measure performance against SLOs.
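As a rough illustration of the RED method, the three numbers can be derived from a window of request records. This is a minimal sketch: the `Request` record, its fields, and the `red_summary` function are hypothetical names chosen for the example, and a real system would normally read these values from its monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took to complete
    ok: bool            # False if the request ended in an error


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize rate, errors, and duration for one observation window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_second": total / window_seconds,
        "error_ratio": errors / total,
        "mean_duration_ms": sum(r.duration_ms for r in requests) / total,
    }


# Hypothetical sample: four requests observed over a two-second window.
sample = [
    Request(120.0, True),
    Request(180.0, True),
    Request(2400.0, False),
    Request(300.0, True),
]
print(red_summary(sample, window_seconds=2.0))
```

A mean is used for duration only to keep the sketch short; latency objectives are more often expressed as percentiles or as a share of requests under a threshold.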
Service Level Indicators (SLIs) track how well a service adheres to a set of objectives. The SLI is the actual measurement, as opposed to the SLO above that says, “we need to respond in 2000ms or less 99 percent of the time.” For example, if only about 98 percent of requests receive a response in less than 2000 milliseconds, the SLI falls short of the SLO’s goal. When SLOs are being missed, time needs to be spent on fixing the problems causing the slowdowns.
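The relationship between the SLI and the SLO in this example can be sketched in a few lines. The function names `latency_sli` and `meets_slo` and the sample data are assumptions made for illustration; the 2000 ms threshold and 99 percent target come from the example above.

```python
def latency_sli(durations_ms: list[float], threshold_ms: float = 2000.0) -> float:
    """Return the fraction of requests answered within the threshold."""
    if not durations_ms:
        raise ValueError("need at least one request to compute an SLI")
    fast = sum(1 for d in durations_ms if d <= threshold_ms)
    return fast / len(durations_ms)


def meets_slo(sli: float, target: float = 0.99) -> bool:
    """Compare the measured SLI with the SLO target."""
    return sli >= target


# Hypothetical sample: 98 of 100 requests finish within 2000 ms.
sample = [150.0] * 98 + [2500.0] * 2
sli = latency_sli(sample)
print(f"SLI: {sli:.2%}, SLO met: {meets_slo(sli)}")
```

With this sample the measured SLI is 98 percent, so the 99 percent objective is missed, which is the signal to prioritize the work that removes the slowdown.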
System health and availability are not synonymous with engineering efficiency for DevOps teams. The Accelerate metrics are another set of metrics to consider for DevOps teams.
Even if you have the most resilient and robust system in the world, adoption and success will be difficult to achieve if your customers do not complete their journeys.
In their book “Accelerate,” Nicole Forsgren, Jez Humble, and Gene Kim delve into the organizational science of high-performing technology teams. They propose the four key metrics below to evaluate software delivery performance.
In lean manufacturing, lead time is the time it takes from a customer request to its fulfillment. In the technology domain, it can be the time between code being checked in and that code being deployed to production.
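Lead time in this sense can be computed from two timestamps per change. The sketch below assumes you can obtain a check-in time and a production deployment time for each change; the function names and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median


def lead_time(committed_at: datetime, deployed_at: datetime) -> timedelta:
    """Time from code check-in to production deployment for one change."""
    if deployed_at < committed_at:
        raise ValueError("deployment cannot precede the check-in")
    return deployed_at - committed_at


def median_lead_time(changes: list[tuple[datetime, datetime]]) -> timedelta:
    """Median lead time across (committed_at, deployed_at) pairs."""
    seconds = [lead_time(c, d).total_seconds() for c, d in changes]
    return timedelta(seconds=median(seconds))


# Hypothetical changes: (check-in time, production deployment time).
changes = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 11, 0)),
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 15, 0)),
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 19, 0)),
]
print(median_lead_time(changes))
```

The median is used here rather than the mean so that a single unusually slow change does not dominate the figure; that is a design choice of this sketch, not something the metric itself prescribes.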
Deployment frequency is the number of times a project is deployed to production in a given period. The more frequently your internal customers deploy, the more efficient the software delivery process tends to be.
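Deployment frequency can be derived from a list of production deployment dates. This is a minimal sketch that normalizes the count to deployments per week; the function name, the weekly normalization, and the sample dates are assumptions made for illustration.

```python
from datetime import date


def deployments_per_week(deploy_dates: list[date], start: date, end: date) -> float:
    """Average production deployments per 7 days within [start, end], inclusive."""
    days = (end - start).days + 1
    if days <= 0:
        raise ValueError("end must not be earlier than start")
    in_window = sum(1 for d in deploy_dates if start <= d <= end)
    return in_window * 7 / days


# Hypothetical deployment dates; only those inside the window are counted.
deploys = [
    date(2024, 1, 1),
    date(2024, 1, 3),
    date(2024, 1, 5),
    date(2024, 1, 8),
    date(2024, 1, 12),
    date(2024, 1, 20),
]
print(deployments_per_week(deploys, date(2024, 1, 1), date(2024, 1, 14)))
```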
Mean Time to Restore (MTTR) is an incident metric, again taking a page from lean manufacturing, that calculates the average time it takes to restore a system. In software terms, restoring usually means reverting to the last working version of an application. It should not be confused with Mean Time to Repair, which is measured to the point when the repair begins, such as when the rollback starts. The “restore” in Mean Time to Restore is complete only when the system has regained its previous functionality.
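Mean Time to Restore is an average over incidents, measured from the start of each incident to the moment service is restored. The sketch below assumes each incident is recorded as a pair of timestamps; the function name and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (restored_at - started_at) across incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (restored_at - started_at for started_at, restored_at in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents: (incident start, functionality restored).
incidents = [
    (datetime(2024, 2, 1, 10, 0), datetime(2024, 2, 1, 10, 30)),
    (datetime(2024, 2, 9, 14, 0), datetime(2024, 2, 9, 15, 30)),
]
print(mean_time_to_restore(incidents))
```

The end timestamp is the point at which functionality is back, not the point at which the rollback starts, which keeps the figure aligned with “restore” rather than “repair.”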
Change failure rate is the percentage of production changes that are unsuccessful. Given the number of unknowns in production, a change can still fail after passing all the confidence-building exercises leading up to production. A falling change failure rate means you can have more confidence in each release. In modern delivery methods, failing early (in a lower environment) is preferable to failing later (in a higher environment).
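Change failure rate is a simple ratio of failed production changes to all production changes. The function name and the sample counts below are assumptions for illustration, and what counts as a “failed” change (for example, one that needs a rollback or hotfix) is left to your own definition.

```python
def change_failure_rate(total_changes: int, failed_changes: int) -> float:
    """Fraction of production changes that were unsuccessful."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    if not 0 <= failed_changes <= total_changes:
        raise ValueError("failed_changes must be between 0 and total_changes")
    return failed_changes / total_changes


# Hypothetical month: 40 production changes, 3 of which failed.
rate = change_failure_rate(total_changes=40, failed_changes=3)
print(f"Change failure rate: {rate:.1%}")
```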
DevOps Engineers and SREs are both leveraged resources; clearly, there is no 1:1 ratio of Software Engineers to DevOps Engineers or Site Reliability Engineers (though it may appear that way as organizations try to scale). Compared to the first edition of Google’s SRE Book, O’Reilly’s Building Secure and Reliable Systems discusses team structure, positioning SREs as advisors/experts.
Building software at scale requires specialized engineers to help resolve issues and develop new capabilities. These specialized advisors include DevOps Engineers, SREs, and other engineers such as Application Security Engineers.
Evolutyz is here to help you start or continue your DevOps or Site Reliability Engineering journey, no matter where you are in the process. From a DevOps standpoint, the Evolutyz Software Platform’s core capability is to build a robust pipeline consistently. Evolutyz Software Delivery Platform facilitates Continuous Delivery, central to the engineering efficiency mission. With Evolutyz, constructing a safe and reliable pipeline is simple.
In terms of SRE, baseline comparison coverage can be critical. SLA/SLO/SLI management is one of the first steps in establishing an SRE organization, and proper baselines are required to determine what is normal. When used with your preferred tools, Service Guard monitors your deployed applications for deviations from baselines. Evolutyz is excited to work with you and your company to help you achieve your DevOps and SRE goals, as well as the overall goals of software development.