Google SREs pay close attention to toil so they stay focused on engineering projects that deliver value rather than operational work that doesn't.
I spend at least 1 hour a week cutting up Amazon delivery boxes. Pretty much every night I return home from work, there are 2 or 3 boxes waiting for me to slice up so they can neatly fit into our tiny blue recycling bin.
I absolutely hate it. It’s endless. I’ve also cut myself too many times while not paying attention because the task is so dull. Ordering boxes is more fun than recycling them.
For those of us who have built software, it’s a relatively fun, rewarding job. You get to see what you built, and watch your product/service evolve over time.
However, operating and keeping software running is generally not fun. Hacking deployment scripts, staring at log files, and debugging incidents are about as fun as cutting up cardboard boxes.
At Google, SREs go out of their way to focus on engineering projects versus operational work, or as they call it, Toil.
What is Toil?
Toil, as defined by Google, is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
In other words, a pretty accurate description of scripting deployment pipelines as your services grow and evolve over time.
Let’s walk through the various attributes of Toil, and how it relates to your deployment pipelines.
Manual
This includes work such as manually running a script that automates some task.
A common misconception about deployment pipelines (and Continuous Delivery) is that scripting is automation. If your deployment pipeline is made up of 20 different scripts or Jenkins Jobs that need to be individually executed with human guidance, input or approval, it very much remains a manual process despite some of your tasks being automated.
Deployment Pipelines require end-to-end automation so that deployments are “hands off.” If not, you’ll never achieve Continuous Delivery at scale or hourly deployment cycles.
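The difference between scripted tasks and end-to-end automation can be sketched in a few lines. The stage names and commands below are hypothetical placeholders for your own build, test, and deploy jobs; the point is that a single trigger runs every stage in sequence with no human gates, failing fast on the first error:

```python
import subprocess

# Hypothetical stages; in practice these would be your build/test/deploy
# scripts or CI jobs, chained so one trigger runs them all hands-off.
STAGES = [
    ("build",  ["echo", "building artifact"]),
    ("test",   ["echo", "running test suite"]),
    ("deploy", ["echo", "deploying to staging"]),
    ("verify", ["echo", "verifying deployment"]),
]

def run_pipeline(stages):
    """Run every stage end-to-end; fail fast, no human in the loop."""
    for name, cmd in stages:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return f"pipeline failed at stage: {name}"
    return "pipeline succeeded"

print(run_pipeline(STAGES))
```

Contrast this with 20 separate scripts a human must run in the right order: each script may be automated internally, but the pipeline as a whole is still manual.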
Repetitive
Toil is work you do over and over.
Do you find yourself constantly maintaining, editing, or tinkering with deployment scripts? The one thing that remains constant in Continuous Delivery is change. A common challenge we hear from customers is that their deployment pipelines are “brittle” because they’re hard-coded to a specific service or environment configuration.
Deployment Pipelines should be dynamic, consistent, and repeatable processes. They shouldn’t break or have dependencies on any one service or environment configuration. Pipelines should ideally be “templatized” or “codified” so dev teams can inject parameters into them for their specific services and environments.
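As a minimal sketch of what “templatized” means, the fragment below defines one shared pipeline template and injects per-service parameters into it. The template string, service names, and parameters are assumptions for illustration, not any particular tool’s syntax:

```python
from string import Template

# One hypothetical pipeline definition shared by every team,
# parameterized per service/environment instead of hand-written per team.
PIPELINE_TEMPLATE = Template(
    "deploy service=$service env=$environment replicas=$replicas strategy=$strategy"
)

def render_pipeline(**params):
    """Inject service-specific parameters into the shared template."""
    return PIPELINE_TEMPLATE.substitute(**params)

# Same template, different parameters -- no per-service scripting.
print(render_pipeline(service="checkout", environment="prod", replicas=3, strategy="canary"))
print(render_pipeline(service="search", environment="staging", replicas=1, strategy="rolling"))
```

When a pipeline change is needed, you edit the template once rather than every team’s copy.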
Automatable
If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is toil.
Deployment pipelines that rely on humans are a major anti-pattern for Continuous Delivery. If you look at your deployment pipelines today, which of these tasks are automated by machines?
- QA & Test
- Environment Provisioning
- Deployment Strategies (Rolling, Blue/Green, Canary)
- Deployment Verification
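Of the strategies listed above, canary deployments illustrate the automation point well: a machine can ramp traffic and check health far more reliably than a human watching a dashboard. The percentages and health-check function below are assumptions for illustration, not a specific tool’s behavior:

```python
def canary_rollout(steps, is_healthy):
    """Shift traffic to the new version in increments, checking health
    after each step; abort and roll back on the first failure."""
    shifted = 0
    for pct in steps:
        shifted = pct
        if not is_healthy(pct):
            return f"rolled back at {pct}% traffic"
    return f"promoted at {shifted}% traffic"

# Hypothetical ramp: 5% -> 25% -> 50% -> 100% of traffic.
print(canary_rollout([5, 25, 50, 100], is_healthy=lambda pct: True))

# A health check that fails once the canary takes half the traffic:
print(canary_rollout([5, 25, 50, 100], is_healthy=lambda pct: pct < 50))
```

The same loop works for rolling updates (replace instances in batches) or blue/green (a single 0% → 100% cutover with a verification step between).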
Automated build and testing (Continuous Integration) are fairly mature today; the same is not true for deployment and Continuous Delivery.
One big win Harness customers have seen is the application of machine learning in deployment pipelines to automate test, verification, and rollback tasks.
For example, instead of engineers manually looking at events in log files or metrics in New Relic to detect performance or quality regressions post-deployment, those tasks can be automated with ML algorithms to detect and alert on regressions in seconds. Better still, you can use this insight/intelligence to trigger automated rollbacks for failed deployments.
Why do we need humans to manually identify spikes on monitoring charts or dashboards when a machine can find those spikes in a fraction of the time? It’s time to rage with the machines, not against them.
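Harness’s actual models are proprietary, but the core idea of automated regression detection can be sketched with a simple statistical baseline: compare post-deployment metrics against the pre-deployment distribution and flag anything that deviates by more than a few standard deviations. The metric values and threshold below are made-up illustrations:

```python
import statistics

def detect_regression(baseline, post_deploy, z_threshold=3.0):
    """Flag a regression if any post-deployment sample deviates from the
    pre-deployment baseline by more than z_threshold standard deviations."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return any(abs(x - mean) / stdev > z_threshold for x in post_deploy)

# Hypothetical error counts per minute before and after a deployment.
baseline        = [10, 12, 11, 9, 10, 11, 12, 10]
post_deploy_ok  = [11, 10, 12]
post_deploy_bad = [11, 48, 50]   # spike after the new release

if detect_regression(baseline, post_deploy_bad):
    print("regression detected -> trigger automated rollback")
```

A machine running this check every few seconds catches the spike long before a human refreshing a New Relic dashboard would, and its verdict can feed straight into an automated rollback.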
Tactical
Toil is interrupt-driven and reactive, rather than strategy-driven and proactive.
Are your deployment pipelines ad hoc? Do they turn every release into a big event? Do they fail and cause carnage or interruptions for you and the teams and business around you? Or are your deployment pipelines a non-event that happens all the time – just business as usual?
I’ve probably spoken to 100+ DevOps teams in the past 18 months and many fully admit to babysitting and firefighting dev team deployments instead of empowering dev teams with the appropriate pipelines so they can deploy/troubleshoot/rollback themselves. Why? Because each deployment pipeline is specific to each service/dev team and was written by a different DevOps engineer with a slightly different pattern. When pipelines and deployments fail, therefore, it’s a reactive rollercoaster for DevOps engineers to manage given they’re responsible for the deployment logic.
Deployment pipelines done properly are a proactive means to kill a release candidate rather than your DevOps engineers. You can also be more strategic with prescriptive failure and rollback strategies, so the worst case is a small blip rather than a sonic boom.
Devoid of Enduring Value
If your service remains in the same state after you have finished a task, the task was probably toil.
Are your scripted deployment pipelines increasing or restricting value for your business? Is developer velocity increasing, or has the value of your scripted pipelines hit a brick wall? Do your teams spend more time handholding and troubleshooting deployment pipelines than actually deploying new innovation for your business?
In a cloud-native era, deployment pipelines have to empower dev teams to unlock their potential for the business and, more importantly, reduce the toil of deploying and operating services.
Scales linearly as a Service Grows
If the work involved in a task scales up linearly with service size, traffic volume, or user count, that task is probably toil.
Simply put, as you add more microservices to your application or business over time, does the amount of deployment pipeline scripting increase or remain flat? What works for a handful of microservices and dev teams might not work or scale for a few hundred.
If your deployment scripting grows linearly with the number of services or dev teams in your org, then you’re in a world of toil.
Avoid Toil: Scale Deployment Pipelines, Don’t Script Them
Scripting deployment pipelines is what most customers have done for the past 10+ years. Why? Because monolithic apps and services were relatively static and easy to manage and model using Bash or Jenkins.
Today, cloud-native apps are complex to operate because they’re often multi-cloud, elastic and require complex orchestration of microservices, in addition to supporting Continuous Delivery from many independent dev teams.
Fortunately, many open-source and commercial solutions like Harness exist to free you up from manually scripting deployment pipelines. What would take you days or weeks to script can take you minutes or hours to build.
Don’t believe me? Try Harness Continuous Delivery as-a-Service for free.