In this episode of ShipTalk, we talk to Chris Jowett, who is a DevOps Architect at ABC Fitness. Chris has significant system engineering experience and has worked on several operations teams in various organizations. DevOps is the operational focus on scaling development, e.g. engineering efficiency.

There is no one path to becoming a DevOps engineer. Learn from Chris how the evolution of operations teams translates to DevOps, for example the shift from counting the boxes you support to counting the kinds of services you support. Chris also talks about modern approaches such as never patching a system. This and more on this episode!

Transcript

Ravi Lachhman  00:07
Hi, everybody! Welcome back to another episode of [the] ShipTalk podcast. Very excited to be talking to one of our buddies Chris Jowett from ABC Fitness, who’s a DevOps engineer. Chris, why don’t you introduce yourself for listeners who don’t know you?  

Chris Jowett  00:20
Yeah, so I’ve been doing this DevOps thing since before it was technically called DevOps. I’d been doing computer automation; but officially, I’ve had a DevOps title for probably somewhere between three to five years. I’ve been a DevOps Architect officially for about six months.

Ravi Lachhman  00:40
Awesome. So a little bit of background— I was able to cheat and look at Chris’s LinkedIn— and [this is] what really excites me about Chris: there’s usually two camps of people who kind of end up in a DevOps career. Like the name DevOps says, there’s folks coming from a development side who [are] now focusing on operationalizing the development pipeline. And there’s folks [like Chris]— Chris’s background [comes from] a very lengthy, robust career as a systems engineer, coming in and focusing on the development pipelines from a system engineering standpoint.

What’s really funny is most people I know coming from a development background— [well let’s say if] Chris and I were working at the same firm, a decade ago, Chris and I would be buds outside of work, but we’d be pitted against each other in the office because as an engineer, Chris would be viewed as the person not letting me deploy. 

But Chris, why don’t we talk about some of the system engineering principles? How is that important in the DevOps world?

Chris Jowett  01:35
Yeah, so since we’re talking about the way developers and engineers interact, developers are very big on “just get the code on the box and make it go.” They’re more code-focused, obviously; they write the code. Engineers are more focused on the actual system and the health of the system. We tend to not care as much about the code as long as it runs and it doesn’t break our system.

You’ll see this a lot, so we do some design practices around all of that. Namely, not letting people log into boxes is a big way of preventing people from breaking boxes, to the point where even we don’t log into our own boxes anymore.

Ravi Lachhman  02:19
Awesome. Actually, this would have been really [a] true interaction between me and you like 10 years ago, and how the evolution is actually changing.

So I used to be notorious for dropping policies on our Linux boxes. From my standpoint [on] most of the projects I’ve worked on, I wasn’t there from the inception. There [were] a lot of nuances to the applications I’ve worked on, but as we started going down microservices architecture and decomposing things, I would be like, “I don’t know what port these two services talk to each other [on].” So I would drop iptables or SELinux, and then I would get a stern speaking to by somebody else on the operations team saying, “Why in your deployment script did you drop all this stuff? Why are there so many sudo calls here? We’re going to get audited.” But that was it; I would get spoken to, and then [it was like] okay, we have to go figure stuff out.

But now, we work together. How do you start to enforce guardrails and development practices that kind of harmonize both those camps?

Chris Jowett  03:25
Yeah, so one of the nice things that we have now is all the containerized infrastructure. Namely Kubernetes is kind of the big dog. [Now], developers don’t have the ability to do things like drop iptables anymore because you can only affect things inside of your own container. And typically now, we’ve even shifted that install of the application off of the server in the first place, so a CI/CD system builds a container, not on the deployment system. So by the time that it gets deployed, it’s already technically installed. It should already be validated and tested; you’re just running it now. You’re just executing it, so all of that installation madness that used to happen just doesn’t happen on the target system anymore.

Even if you wanted to, you don’t even have the access to do it because typically, we don’t give people access to the host systems. You only can affect what’s inside of your container, and in a properly set up infrastructure, you have security policies limiting what the containers themselves can do.
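
As a rough illustration of the kind of container-level restriction Chris mentions, here is a minimal Python sketch that builds a Kubernetes container spec with a restrictive `securityContext`. The helper name, the container name, and the image path are hypothetical, and nothing here is confirmed to be ABC Fitness's actual policy; the `securityContext` field names are the standard Kubernetes ones.

```python
import json


def restricted_container(name: str, image: str) -> dict:
    """Build a container spec that runs as non-root with no extra privileges.

    Hypothetical helper for illustration only.
    """
    return {
        "name": name,
        "image": image,
        "securityContext": {
            # Refuse to start the container if its image would run as UID 0.
            "runAsNonRoot": True,
            # Block setuid-style privilege escalation inside the container.
            "allowPrivilegeEscalation": False,
            # The application was installed at build time, so the root
            # filesystem does not need to be writable at run time.
            "readOnlyRootFilesystem": True,
            # Drop every Linux capability, including NET_ADMIN, which is
            # what firewall changes such as iptables edits require.
            "capabilities": {"drop": ["ALL"]},
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; substitute your own.
    spec = restricted_container("app-x", "registry.example.com/app-x:1.0.0")
    print(json.dumps(spec, indent=2))
```

A spec like this would sit under `spec.containers` in a pod template. Cluster-wide enforcement is a separate concern that this sketch does not cover.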

Ravi Lachhman  04:29
Yeah, I got it; makes perfect sense. Let’s pretend I’m a developer at your firm. Where does my kubectl access end? This is always a very heated discussion when, let’s say Kubernetes is enabling self-service. But I want to hear from a system engineer expert: where do you stop my kubectl access?

Chris Jowett  04:47
So the answer to that [question] is changing very heavily within ABC right now. Prior to 2021, our main focus as a DevOps team was getting the application built and deployed and running. It was a Greenfield application under heavy development. And we didn’t let developers into the platform— not because we didn’t necessarily want them in, [and] not necessarily because we didn’t trust them. But just because we didn’t have the bandwidth to work on that functionality.

That’s actually a core tenet of what we’re working towards in 2021: enabling more self-service and making it to where the DevOps team can get out of the way. Because with the way we were operating, there [were] a lot of places where we were interjected in that pipeline, where in order to go past this next step you [had to] contact DevOps. We want to take ourselves out of that loop as much as possible and allow developers, where it makes sense, to do things for themselves.

We’re actually figuring out, somewhat, the answer to that question right now. What we’re leaning towards, talking about Kubernetes specifically, is separating applications so each gets its own namespace. And then a developer who is responsible for that application gets some access to that namespace. And especially [in] the lower environments, like in the dev and QA environments (our bottom two environments), we try and be fairly hands-off. We allow them to delete pods or do rollout restarts, to pull logs, to view the service and ingress configurations. They can view basically anything within the namespace.

They even have some limited capabilities to change things. But we don’t let them do certain things like viewing secrets, because you shouldn’t get comfortable with being able to view secrets. You’re going to lose that access; like, in the very next environment it’s gone, so why give it to you in this environment? And then, obviously, as you go up the deployment pipeline, you give access to fewer and fewer people all the way until eventually on the production clusters, it’s basically only DevOps [that] has access to that cluster.
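
The lower-environment access described above maps fairly naturally onto a namespace-scoped Kubernetes RBAC `Role`. The sketch below is one possible reading of it, not the team's actual configuration: the role name `app-developer`, the namespace `app-x-dev`, and the helper functions are assumptions made for illustration.

```python
import json

READ = ["get", "list", "watch"]


def developer_role(namespace: str) -> dict:
    """Build a namespace-scoped Role for a dev or QA namespace (illustrative)."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "app-developer", "namespace": namespace},
        "rules": [
            # View pods and delete them (the controller recreates them).
            {"apiGroups": [""], "resources": ["pods"], "verbs": READ + ["delete"]},
            # Pull logs.
            {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
            # View service and ingress configuration.
            {"apiGroups": [""], "resources": ["services"], "verbs": READ},
            {"apiGroups": ["networking.k8s.io"], "resources": ["ingresses"], "verbs": READ},
            # `kubectl rollout restart` works by patching the Deployment.
            {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": READ + ["patch"]},
            # Deliberately no rule for "secrets".
        ],
    }


def grants(role: dict, resource: str, verb: str) -> bool:
    """Return True if any rule in the role allows `verb` on `resource`."""
    return any(
        resource in rule["resources"] and verb in rule["verbs"]
        for rule in role["rules"]
    )


if __name__ == "__main__":
    role = developer_role("app-x-dev")  # hypothetical namespace name
    print(json.dumps(role, indent=2))
    print("can delete pods:", grants(role, "pods", "delete"))
    print("can read secrets:", grants(role, "secrets", "get"))
```

One caveat on the design: `patch` on deployments permits more than a restart, so a stricter setup might route restarts through a pipeline instead. A `RoleBinding` would still be needed to attach the role to a user or group.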

Ravi Lachhman  06:54
That makes sense. That is a very pragmatic answer. How relieved I feel that some really good engineering principles are bleeding in!

I’ll give you a little bit of background. I was kind of voluntold to start looking at Kubernetes in 2015. My background has been in Java, like J2EE, Java development. And so when I first heard of Docker, I thought it was like pants. Like, “Oh, I go to Kohl’s and buy Docker pants,” so I was like, what is this whale? I literally said that to my manager, and he was like, “You’re out of your mind.”

But the development team was actually tasked with looking at Kubernetes ahead of our system engineering team because we (system engineering) were running on RHEL. We [had] CloudFront and CloudFormation, and we had all these VMs spun up and managed by Satellite. But Kubernetes was a new world, and all the stringency that they had was [gone]; as a developer, it was like, “we’re just gonna do it.” So there was none of that rigor.

And so, it’s very interesting to hear how as technology continues to mature, the practices continue to follow it. What I really like about your solution there is that there’s two radio dials. I want to hear how you manage this. So there’s a radio dial [of] allowing for innovation, and there’s a radio dial of business controls. It sounds like you have a really good model: [in] lower environments, let the developers work and figure things [out], and then you start increasing the rigor. But how do you balance that? How do you balance innovation and control?

Chris Jowett  08:23
Yeah, so everyone would love to be able to say, “Hey, let’s push for maximum innovation; we’ll give everyone access to everything.” It’s kind of a parallel to the zero trust networking models. [In a] hypothetical perfect zero trust networking system, everything is just on the public Internet and everything is assumed to be untrusted, and everything can talk to each other, but only if they’re properly authenticated and authorized. So you can draw parallels between that and how you deploy your infrastructure.

That would be really great to be able to do in theory, but in practice, there are problems with that. And that’s why, with very few exceptions, no businesses actually do zero trust networking. They do a hybrid of zero trust and more of an onion model of network security. And so that’s what you end up having more commonly with your infrastructure as well. You try and open things up as much as you can. You try and give as much trust as you can to your developers to enable innovation, enable them to work quickly, and [to] get out of their way, basically. 

But at some point, you need to say, “Okay, developer, you probably shouldn’t be able to log into the production environment and delete pods,” although your Kubernetes cluster should be built to automatically recover from that. From a pragmatic standpoint, they don’t have a need to do that, so why should you allow them to do it? But then further down the stack in Dev and QA, remove as many of the guardrails as you can. Just make sure that there’s a good separation. While they’re trying to move fast and break things (the mantra that you always hear), you should design it in a way that it’s impossible for them to negatively impact the customer-facing environment.

If they negatively impact their own environment, that’s fine. If they negatively impact customers, now you got a bunch of questions to answer.

Ravi Lachhman  10:25
[Chuckles] An incident; why did this happen?

Chris Jowett  10:28
That’s how you end up working at like 2am from your mother-in-law’s living room on a VPN connection: it’s letting too many people into prod.

Ravi Lachhman  10:38
And for, let’s say, traditional infrastructure, that was the norm, right? [In] pretty much most of my roles, I never had access to UAT or above. It was like a wall. And then all of a sudden this newfangled Kubernetes came, [and it was like], “Oh yeah, there you go.” So it takes a while. All of those principles that Chris just said are extremely good. It just took time to matriculate into new infrastructure. That’s my old argument: technology is cyclical.

Chris Jowett  11:16
Part of the reason why it’s evolved is because on more legacy infrastructures, one server isn’t running just one thing; that’s hugely inefficient. So even in a legacy infrastructure, you’ll have one server that’s running maybe eight or 10 different web applications. It’s not a server for app “x”; it is a web server that runs multiple applications. And you have different teams responsible for different applications, so if you allow a developer into that server [and] to have that kind of access to their application, it becomes very difficult to manage the permissions.

It’s not impossible to manage permissions in such a way that they can’t mess with the other applications; you’re just adding more complexity onto that permissions and security solution. Whereas Kubernetes is kind of designed from the ground up to have that level of isolation built in with namespaces.

Every now and then I hear about people using Kubernetes and not using namespaces. Like, they just deploy everything in the default namespace; like maybe they use kube-system, and then everything else goes in default. It boggles my mind. Why would you not use namespaces? They’re amazing. And especially if you’re looking to allow developers access, step one is to start deploying namespaces. We start with a namespace per environment, and we’re actually moving towards a namespace per environment-application combination. So app “x” in dev will have a namespace, and app “y” in dev will have a different namespace, just because it makes it easier to allow different development teams access to what they need access to.
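
To make the "namespace per environment-application combination" idea concrete, here is a small Python sketch. The `<app>-<env>` naming scheme, the function names, and the label keys are assumptions for illustration; the transcript does not say how ABC Fitness actually names its namespaces. The length and character checks follow the Kubernetes rule that a namespace name must be a valid DNS label.

```python
import json
import re


def _slug(value: str) -> str:
    """Lowercase a value and collapse anything outside [a-z0-9] to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def namespace_name(app: str, env: str) -> str:
    """Return an `<app>-<env>` namespace name (naming scheme is hypothetical)."""
    name = f"{_slug(app)}-{_slug(env)}".strip("-")
    # Namespace names must be DNS labels: non-empty and at most 63 characters.
    if not name or len(name) > 63:
        raise ValueError(f"invalid namespace name: {name!r}")
    return name


def namespace_manifest(app: str, env: str) -> dict:
    """Build a Namespace manifest labelled with its application and environment."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": namespace_name(app, env),
            # Illustrative label keys, useful for selecting by team or tier.
            "labels": {"app": _slug(app), "env": _slug(env)},
        },
    }


if __name__ == "__main__":
    for app in ("app-x", "app-y"):
        print(json.dumps(namespace_manifest(app, "dev")))
```

With one namespace per combination, access like the developer role described earlier can be granted per application team without touching anyone else's workloads.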

Ravi Lachhman  13:00
That’s an excellent answer. You said that very eloquently. It jogged my mind; when I first started deploying— well, I didn’t do the deploy directly—but when we first had our application topology for the Java application that I wrote, we used to have that application server model. So you’re exactly right, like one application server would be serving five or six different applications. 

As time progressed, right before the Kubernetes revolution, we were deploying one to one. It was a lightweight application server [serving] one application. That’s it. And so when we started reflecting on it, it opened up my mind, like, “Oh, yeah, that’s what we did.” When we first started using Kubernetes, it was default. Default had a good ring to it, so we started doing the same thing, like “this Kubernetes cluster, or the small cluster will handle all 10 of our services.” And guess where they all went? Default.  

Chris Jowett  13:55
It’s easy to deploy stuff in there. Just don’t give it a namespace, and that’s where it goes.

Ravi Lachhman  14:01
Yeah, it mimicked how we were deploying in 2008-2009. In 2015, the practices we had in 2014 didn’t catch up to us until 2018. There was like a lag [between] what we used to do in our old systems and how we worked in the new system. So that was [a] very good point you had there.

Kind of changing gears a little bit– Chris has so much experience. For a young system engineer out there, as someone getting into the system engineering game, how did you make the evolution or revolution (depending how you look at it) into DevOps? How did that occur for you?

Chris Jowett  14:41
Yeah, so I had it a little bit interesting. I worked for a company that had a cloud servers offering, and they had a managed tier that was built on, like, “fire your IT team, you let us be your IT team.” We would install your web servers and everything. It was a platform as a service, but we would help you with getting Jenkins up and running, we would install Apache, we would help you get web servers configured, help you get databases configured and everything. You’re doing this for a lot of customers, and it gets really tedious setting up like 50 LAMP stacks in one day.

So you start scripting things because you want to be able to do your job less [and so] you can do things other than your job more. I like to call it being the right kind of lazy. That’s kind of where it started; it was just like “I need to set up like 50 LAMP servers for random WordPress blogs.” And so you just start automating the process. The normal evolution is you write a bash script, and then from there, you move on to something like masterless Puppet or Ansible— something a little bit more advanced. And then over time, the magic day in 2014 happened and Kubernetes showed up. It just revolutionized the whole thing. That’s probably right around the time that people started really talking about DevOps; it’s because now we have a really good platform to do all this automated deployment on. 

But there was definitely a lot of work towards DevOps before then, just with people writing scripts, and that’s how I find most system engineers get into it. It’s just because they want to be able to do their job consistently and quickly. Low effort is obviously another motivating factor, and being able to run a script and get a consistent result out on the other side. It’s what drives engineers to this world.

Ravi Lachhman  16:45
Perfect. Yeah, the level of automation always seems to be increasing. We could take another jog down memory lane, and it’ll actually show the need for this level of automation. One of the metrics that I would say is common, or used to be common, is the engineering burden ratio. So like, how many boxes does Chris maintain? This might be a 20-year-old argument at this point [or] it’s asinine to think about that, but walk us through the evolution of, let’s say, when you were coming out of university or your first gig. How many boxes did you maintain, versus what does the footprint of an average system engineer look like today [with] what they have to maintain?

Chris Jowett  17:26
Yeah, the problem is that the problem has actually changed over time. It used to be how many boxes did you maintain. And I would say [in] my first real gig in the Linux world, I was maintaining thousands, but not all at once. It was like, [a] customer could call in, [and] I’d have to hop on their box and do whatever they needed. But now the conversation has changed away from how many boxes do you manage; that’s almost a meaningless metric. Now, it’s how many types of systems are you managing, since you’ve automated everything. In our case, our infrastructure is built with Spring Boot and NodeJS, so Spring on Java and then NodeJS with a React-based front end. All I care about is: is it a Spring microservice, or is it a NodeJS microservice? Because all the Spring microservices build and deploy the exact same way. They’re all Gradle builds, they all use the same test runner, they all go through SonarQube, they all get built with the same standardized Dockerfile. So I don’t care if it’s five Spring microservices that I’m deploying or 500. All I care about is that they all work the same way, and [that] my interaction with them gets to be the exact same.

So now it’s how many categories of things are you maintaining? And in order to scale how much work a single engineer can do, you want to keep the number of categories low. So I always tell people that, historically, the requirements flowed from the application up to the system engineers: I built this application, here’s how it works, build me a box that satisfies my application requirements. Now, with a DevOps infrastructure, it flows both ways. The infrastructure has requirements as well; if you want your application to go into this system, it must meet this standardized way of operating. It’s almost like meeting an API contract in a way. The requirements flow both ways [to] where it doesn’t matter now how many boxes I maintain. Like, I probably maintain like 30 or 40 Kubernetes hosts, and each one probably has like 20 or 30 pods on it. So you have a medium-sized Kubernetes infrastructure, but it could easily be 500, and it really wouldn’t change anything. Or it could be 10 and it wouldn’t change anything, because it’s still the same number of categories of systems.
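
One way to picture the "categories, not boxes" idea and the platform-side contract is the short Python sketch below. The `Service` type, the category names `spring` and `node`, and both helper functions are hypothetical, chosen only to mirror the two service types Chris names. The real contract (Gradle build, shared test runner, SonarQube, standard Dockerfile) would be enforced by the pipeline rather than by code like this.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# Categories the platform knows how to build and deploy (illustrative).
SUPPORTED_CATEGORIES = {"spring", "node"}


@dataclass(frozen=True)
class Service:
    name: str
    category: str


def categories_in_use(services: Iterable[Service]) -> Set[str]:
    """The operational burden tracks this set, not the number of services."""
    return {s.category for s in services}


def contract_violations(services: Iterable[Service]) -> List[str]:
    """Names of services that do not fit a supported category."""
    return [s.name for s in services if s.category not in SUPPORTED_CATEGORIES]


if __name__ == "__main__":
    # Five services or five hundred: the category count stays the same.
    few = [Service(f"svc-{i}", "spring") for i in range(5)]
    many = [Service(f"svc-{i}", "spring") for i in range(500)]
    print(categories_in_use(few) == categories_in_use(many))
    print(contract_violations(many + [Service("legacy-app", "php")]))
```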

Ravi Lachhman  19:56
Now, that’s actually the most modern way I’ve heard it put. It’s all about services rendered now. The level of effort for users to scale, like you just mentioned, from 10 to 500 or even 15,000— it’s nominal between the amounts, because all the background work of getting the automation up to support that and to meet the SLA you set forth is about the same. The infrastructure is a lot cheaper, more elastic, and so you’re not worried about how you configure certain things and giving your internal customers the tools that they need. 

Another question, because you have a long career in system engineering: are there things that you do today that would have blown your mind 10 years ago or 15 years ago? So we had a little bit of a pre-call (for the listeners); Chris and I did talk about one of the things that they do today that would be asinine to someone 10 or 15 years ago.

Chris Jowett  20:51
I mean, there’s several things that I do today that would [have been] crazy to me. Like, just the fact that we allow so much access onto the cluster to developers— that was always a no-no, like “don’t let developers onto the boxes because they just break things.” And now we’re encouraging self-service. 

There’s a lot of things that I do now that kind of blow my mind. I remember at one point, configuration management systems like Puppet and Chef were the pinnacle of how you maintained systems at scale. And now we’ve moved beyond that to where we don’t even use configuration management systems anymore, because we don’t maintain systems anymore. I haven’t done any system patching in three years because we have disposable infrastructure. When a pod is no longer fit for purpose, we build a new container image and we replace it. When a host system is no longer fit for purpose, we build a new AMI and we change out all our host systems. So we don’t ever, like, log into boxes and install updates; we just replace them every now and then with boxes that already have all the updates installed.
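
The replace-instead-of-patch approach can be sketched as a simple selection step: compare each host's image against the current one and queue the stale hosts for replacement. Everything below is an assumption for illustration, including the `Host` type, the host names, and the image IDs. The actual replacement would be done by whatever tooling manages the node groups, which is not shown.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Host:
    name: str
    image_id: str  # e.g. the AMI the host was launched from


def hosts_to_replace(hosts: Iterable[Host], current_image_id: str) -> List[str]:
    """Return hosts built from an older image; they get replaced, never patched."""
    return [h.name for h in hosts if h.image_id != current_image_id]


if __name__ == "__main__":
    # Made-up host names and image IDs.
    fleet = [
        Host("node-a", "ami-old"),
        Host("node-b", "ami-new"),
        Host("node-c", "ami-old"),
    ]
    print(hosts_to_replace(fleet, "ami-new"))
```

Because updates arrive only through a freshly built image, no step in this flow requires logging into a running box.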

The nice thing about that is it gives you a really good reason to not log into your boxes, because the number one way to prevent people from logging into boxes and breaking things is to stop them from logging into the box in the first place. If they don’t log into the box, then they can’t drop iptables on you. 

Ravi Lachhman  22:20
[Jokingly] I would drop it, still figure out a way.

Chris Jowett  22:24
[Laughs] That’s why we don’t run your containers with root privileges!

Ravi Lachhman  22:27
My favorite part about that is you’re rebuilding infrastructure at any point. So actually, that takes a lot of cues from the development world. It used to be—before we containerized applications—that we could log into the application server. [We’d be] logging into production, or we were looking at various specific things. We might be changing certain properties on the fly in our running application, or doing debugging, versus when we switched over to containerized technology— if you make a change, you’re building a new image, regardless. You’re not logging into the container, you’re not logging into the pod; you are going to re-build. But what I really like about that is that those principles as a software engineer— it’s articulated into your whole infrastructure. And so that’s why I was really excited earlier. 

Chris Jowett  23:22
It’s that mythical disposable infrastructure that everyone wants to have. Some people in leadership positions think that disposable infrastructure and zero downtime deployments are just myths. I actually had a CTO one time [tell] me that zero downtime deployments are mythological and don’t exist, and that I was lying to him, until I proved to him that we could do zero downtime deployments. Then he walked back off it.

Ravi Lachhman  23:48
That’s an application architecture. The only language I speak is English, and it’s like very rare when I hear new words. I remember I heard the word ephemeral a few years ago, and I couldn’t even spell it. I was like a kid in the spelling bee who looks all dumb, like, “Can I get it used in a sentence, please?” So going back to ephemeral infrastructure, you embody it, right?

For those who don’t know what ephemeral means, like me a few years ago, it means that something is transient or short-lived. There’s also another concept. There’s two words I learned around the same time. One is idempotent, and [the other is] ephemeral. Idempotence is a mathematical concept: no matter how many times you apply an operation, you get the same result. So building infrastructure to support that is actually very challenging.

Chris Jowett  24:44
It requires the right mindset. Like, it’s weird. For some people it’s really, really hard to do. But then for other people that have the right mindset, this is the way, you know. So it just really depends on your mindset. It definitely has benefits, though, to be able to say, “I just want to have this pod deployed, and you can just rerun that deploy as many times as you want. It’s already deployed; it’s not like I’m going to deploy two pods, and then I’m going to get two more and have four, and then two more and have six.” You just say, “I want two pods.”

Some people refer to this as declarative, where you say what you want the end state to be and then you don’t say how to get there. But you can’t ignore the fact that something has to give instructions on how to get there. 
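
The "I want two pods" example can be sketched as a tiny reconcile function: the caller declares the desired end state, and the function works out the imperative step needed to get there. This is a toy model with hypothetical names, not how the Kubernetes controllers are implemented. It only shows why rerunning the same declaration is idempotent.

```python
from typing import Tuple


def reconcile(current_replicas: int, desired_replicas: int) -> Tuple[int, str]:
    """Return the new replica count and the action taken to reach it."""
    if desired_replicas < 0:
        raise ValueError("desired_replicas must be non-negative")
    diff = desired_replicas - current_replicas
    if diff > 0:
        action = f"start {diff} pod(s)"
    elif diff < 0:
        action = f"stop {-diff} pod(s)"
    else:
        action = "no change"
    return desired_replicas, action


if __name__ == "__main__":
    state = 0
    # Declaring "two pods" three times never produces four or six pods.
    for _ in range(3):
        state, action = reconcile(state, 2)
        print(state, action)
```

This mirrors the point made above: the declaration says nothing about how to get there, but something, here the `reconcile` function, still has to supply those instructions.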

Ravi Lachhman  25:40
Yeah, and I think it’s the big word of the year: declarative infrastructure. Like, “we give you the state that we want, and some sort of system will go and reconcile what the difference is.” But it’s a challenging architecture [too]. [Jokingly] Trying to use idempotent and ephemeral in the same sentence, like the SATs here. For idempotent systems, the results have to be consistent, but then it goes against the grain of being ephemeral, because ephemeral means things go away. And so having to be consistent and building for things that go away, I think it’s the pinnacle of the struggle with all this new cloud native architecture. It’s completely the opposite of what traditional engineering would teach you how to solve for.

So one more question. I always like to wrap the podcast up with this particular question. Let’s say you were walking down the mean streets, and you ran into a young Chris fresh out of school or fresh out of your first job. With current Chris (i.e. you both meet each other), what would be some advice you would tell your younger self? It could be any set of advice. [Jokingly] Don’t go to jail, anything like that.

Chris Jowett  26:52
[Joking back] Yeah, when that guy makes that offer on the street, just walk away. 

I’ve been pretty lucky in my life that I’ve made a lot of pretty good career moves. But there’s been a couple of times where if I had told myself “trust your gut, and trust your instincts a little more”, especially earlier on in my career, that probably would have worked out a little bit better. And I would also probably encourage myself to get into development a little bit more. That’s actually a challenge that we have on our team. 

We were talking before the podcast about how there’s two ways people get into DevOps; one comes from the development side, and one comes from the system engineering side. Most people in DevOps probably come from the development side. They get into DevOps, because they have a requirement to deploy their code and maybe their company doesn’t have people that deploy code. So they figure out how would a developer deploy code, and bam, you have DevOps. My whole team is nothing but systems engineers that have come into DevOps— from what I’ve been told the rare ones— although to me everyone I know are engineers with DevOps. We have a need for someone that’s more developer-focused. 

And so I know how to program in multiple languages, but I wouldn’t say I’m stellar at it. I probably would have told myself, hey, learn a little bit more about [or] get a little bit more into programming. Get to where you’re not just okay, [but to where] you’re actually kind of good at it. That would be the biggest recommendation I would give myself: don’t be so scared of writing some code.

Ravi Lachhman  28:37
That can be scary sometimes. It’s not how you write but what you write; [in] my years as a software engineer, it comes down to that intersection. I could write pretty much anything, but what you need to write is always the hardest part of the equation.

Chris Jowett  28:52
Yeah. And also, time management. Take some classes on time management, because that is a problem for everyone. Everyone has problems with time management; I’m not alone in that struggle, I’m sure. Classes on time management would really help out a lot if you’re going into this kind of career, especially when you start going into either a manager role or, like me, into like an architect role— especially once you start getting up to like a senior engineer level or in management or architecture, time management becomes just crucial. 

Ravi Lachhman  29:30
Perfect. Well, Chris, thank you so much for your time on the podcast. You’ve been an excellent guest, and hope to catch everybody on the next ShipTalk podcast.

Chris Jowett  29:39
Thanks for having me.