Two hot job titles that did not exist, or were not mainstream, several years ago are DevOps engineer and site reliability engineer (SRE). DevOps engineers can feel like a catch-all for engineering efficiency, system administration, and release management, and they tend to have oddly broad job descriptions. Site reliability engineers, on the other hand, have a more defined focus but a broad scope across the teams they support.
To avoid lumping “DevOps/SRE” together the way “CI/CD” often is, it is important to understand the overlap and differences between the two skillsets and organizations. Each solves distinct challenges with innovative approaches that are ushering in new paradigms in technology. Still, we are human, fallible, and subject to Conway’s Law.
Conway’s Law at Play
A popular organizational topic in DevOps culture is Conway’s Law. If you are unfamiliar with it, Conway’s Law says that system design mimics the communication structure of the organization that builds it. We are people, and we as people design systems. DevOps teams are focused on breaking down the silos that were created by Conway’s Law. These institutional silos are barriers to engineering efficiency: getting a feature across the line takes two or more sets of people with separate goals.
The development organization wants to get features out quicker, and the operational organization wants rigor and process in infrastructure to limit risk. Fast forward to the modern DevOps movement, and we are well on our way to learning how these two teams work better together. That leaves the question: where does site reliability engineering sit?
One of the quintessential DevOps pieces, Gene Kim’s The Phoenix Project, illustrates how problems can be tossed across the proverbial wall. Flipping over to one of the quintessential site reliability engineering pieces, Google’s SRE Book, there is much more focus on approach and on shining a light on the complexity and skills required for operating systems at scale. Because of the specialization needed, site reliability engineering organizations might need to be separate, taking on advisor/expert roles to support multiple teams.
Because we are all still fighting Conway’s Law, having a solid DevOps team does not mean you have site reliability down. The reverse is also true: having site reliability down does not mean you have the engineering efficiency that a DevOps investment brings. The two are solving for different domains.
Two Different Problem Sets
Engineering efficiency and reliability are two separate domains, but they overlap. There is a correlation between agility and more robust systems. A counter-argument might be made that agility brings a fast velocity of change, and change is a detriment to reliability. Today’s challenges are faced at scale, and as we continue to push the boundaries, adjusting on the fly is important. Both teams value metrics: you can’t improve what you can’t measure.
Site reliability engineering organizations focus on safety, uptime, and the ability to remedy unforeseen problems. A romanticized idea is that SREs only spring into action during an incident, helping devise a remedy until the engineering teams can make a proper fix. Combating an incident is certainly an important pillar of the job, but SREs spend a good deal of time applying their expertise to make sure the firefight doesn’t occur in the first place. Both DevOps and SRE teams benefit from continuous learning and improvement, which is an investment by organizations.
One of our guests on Deliver Better, Martyn Coupland, spells out the business value of DevOps. Having a process that allows for changes without a steep learning curve is important, especially in today’s climate with organizations triggering business continuity plans (BCPs). On the site reliability engineering front, the focus is really on the science of resiliency and reliability. From a business standpoint, it might be easier to measure the impact of an outage than the agility to get new features out. Measuring success in both teams can be subjective, as organizational and industry tolerances differ. Comparing several concerns that DevOps and SRE teams have shines a light on the similarities and differences between the teams.
DevOps vs SRE Concern Table
A great way to compare the two roles is to set a DevOps response against an SRE response to the same concern.
| Concern | DevOps response | SRE response |
|---|---|---|
| What do you say you do around here? | Development pipelines. | Resilience, scaling, uptime, robustness. |
| TL;DR | System engineers focusing on development problems. | Software engineers focusing on operational problems. |
| Does the application cluster? | Yes, the application does. We need three nodes. | We use a Raft-based, leader-elected clustering mechanism focused on Apache ZooKeeper. We front the application with Apache Mesos to work through Dominant Resource Fairness constraints. |
| Can we have monitoring? | Yes, we use Prometheus, ELK, and Fluentd and can provide hooks into each. | Concerned about the science around how the monitoring tool works. Black box vs white box monitoring and specific metrics about each. Advising teams on pros/cons. |
| Our deployment failed. | The pipelines we created allow you to re-run. If additional debugging is needed, we can connect the dots with log/trace data. | Unless it caused an incident, we wouldn’t get involved in the remedy. If the deployment regularly fails, we can help decipher why. |
| Typical metrics | Deployment frequency, deployment failure rate. | Error budgets, SLOs, SLIs. |
| War chant | “People, process, technology!” | “There is no root cause!” |
SREs look at a level of detail that is more tool-agnostic and approach-centric, whereas DevOps teams provide tools and pipelines for engineering organizations to further the mission. Both skillsets and teams are certainly important in any modern organization.
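To make the "typical metrics" for each team a little more concrete, here is a minimal Python sketch. It assumes a simple request-based availability SLI and a hypothetical `Deployment` record; the function names, fields, and numbers are illustrative assumptions, not definitions taken from Harness or from Google's SRE Book.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical record of one deployment in a reporting window."""
    day: int          # day index within the window (illustrative only)
    succeeded: bool


def deployment_frequency(deployments, window_days):
    """DevOps metric: average number of deployments per day."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return len(deployments) / window_days


def deployment_failure_rate(deployments):
    """DevOps metric: fraction of deployments that failed."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if not d.succeeded)
    return failed / len(deployments)


def availability_sli(good_requests, total_requests):
    """SRE metric (SLI): fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def error_budget_remaining(slo_target, good_requests, total_requests):
    """SRE metric: fraction of the error budget still unspent.

    The budget is the number of failures the SLO tolerates,
    (1 - slo_target) * total_requests. A negative result means
    the budget has been overspent.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


# Example: 10 deployments over 5 days, 2 of which failed.
deploys = [Deployment(day=i % 5, succeeded=(i not in (3, 7))) for i in range(10)]
frequency = deployment_frequency(deploys, window_days=5)
failure_rate = deployment_failure_rate(deploys)

# Example: a 99.9% SLO with 500 failed requests out of 1,000,000.
sli = availability_sli(999_500, 1_000_000)
budget_left = error_budget_remaining(0.999, 999_500, 1_000_000)
```

The DevOps metrics describe how smoothly change flows through the pipeline, while the SRE metrics describe how much unreliability the service can still absorb before the SLO is breached.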
DevOps and SRE, Better Together
Both can be viewed as leveraged resources; clearly there is not a 1:1 ratio of software engineers to DevOps engineers (though it can feel like it as organizations try to scale) or to site reliability engineers. Compared with the first edition of Google’s SRE Book, O’Reilly’s Building Secure and Reliable Systems discusses team structure, positioning SREs as advisors/experts.
Building software at scale requires specialized engineers to help tackle problems and further capabilities. DevOps engineers, SREs, and other engineers such as application security engineers fall into the category of specialized advisors. In its SRE Book, Google describes all the expertise across multiple domains needed to launch and maintain a product like Gmail, which surprised even me. Harness is here to enable and partner with engineering organizations. The Harness Platform is well-positioned to help further the mission of your DevOps and SRE teams.
Harness Here to Partner
From a DevOps perspective, a core capability of the Harness Platform is creating a robust pipeline with ease and convention. The Harness Platform is an enabler for Continuous Delivery, which is core to engineering efficiency’s mission.
Building a safe and robust pipeline is easy with Harness and Continuous Verification.
On the SRE side, providing baseline comparison coverage can be difficult. The first steps in establishing an SRE organization are SLA/SLO management, and those require proper baselines.
Service Guard runs with your favorite tools, looking for regressions from baselines in your deployed applications.
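As a rough illustration of what "looking for regressions from baselines" can mean, the sketch below flags a regression when a post-deployment metric drifts too far above its baseline. This is not how Service Guard is implemented; the function name `is_regression`, the three-sigma threshold, and the sample latency values are all assumptions made for the example, and it assumes a metric where higher is worse (such as latency or error count).

```python
from statistics import mean, stdev


def is_regression(baseline, current, threshold_sigmas=3.0):
    """Return True when the current mean exceeds the baseline mean by
    more than `threshold_sigmas` baseline standard deviations.

    Assumes a "higher is worse" metric, such as latency in milliseconds.
    """
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two samples")
    if not current:
        raise ValueError("current needs at least one sample")
    limit = mean(baseline) + threshold_sigmas * stdev(baseline)
    return mean(current) > limit


# Hypothetical latency samples (ms) before and after a deployment.
baseline_latency = [100, 102, 98, 101, 99]
healthy_release = [101, 100, 103]
slow_release = [120, 118, 125]

healthy_flagged = is_regression(baseline_latency, healthy_release)
slow_flagged = is_regression(baseline_latency, slow_release)
```

A fixed sigma threshold is only one simple choice. The point is that any comparison like this depends on having a trustworthy baseline first.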
Harness is excited to partner with you and your organization to help further your DevOps and SRE goals. You can always sign up for a Harness Trial to take a look for yourself. If you are getting started on your DevOps or SRE journey, we recently launched Deliver Better, with thought leadership pieces from folks in the industry. If you want to contribute to Deliver Better, give us a shout!