In this episode of Ship Talk, we are joined by Bob Chen, who is a Director of DevOps at ADP. Bob has been practicing DevOps, and now leading DevOps organizations, since the inception of the term. Bob talks about the art and tact required to scale your team and technology and to further the craft. The tips and advice can be applied to any new technology paradigm.
Ravi Lachhman 00:06
Hey everybody, welcome back to another episode of the Ship Talk podcast. Very excited today to be talking to one of our friends, Bob Chen, who’s the Director of DevOps at ADP. Bob, why don’t you introduce yourself for the listeners today?
Bob Chen 00:19
Hey, everybody. Good to be on your podcast, Ravi, [and] thanks for inviting me; I’m always appreciative of that. I’m Bob Chen, currently a Director of DevOps within ADP. My background in the last six to seven years— I’ve been sort of in the DevOps realm of things. But prior to that— I’ll kind of take you far back in my career; not too far back, to where [you’re] just listening to me babble along with baby words. My college degree was not anything near computer science or engineering. I was in engineering, but I then changed over to a business degree.
So after [graduating college], I got a business degree; I went to work for some PC companies—system integrators— as IT. I originally started my career in the IT field, going through the normal certifications for Cisco and for Microsoft. But it was really in early 2008; I was fortunate enough to be running IT for a company called Gehry Technologies. They were an architecture software firm, at that time, and they had a really interesting idea of bringing architecture software to the cloud. That’s really where I had my first taste of the cloud, [and] it kind of started my own journey in terms of grooming me to where I am right now in the DevOps field.
I went from doing IT to really doing system-engineering work; not really IT system engineering, [but] the actual server system-engineering work, per se. At that time, we were kind of dabbling in between AWS and Data Center— actually I was trying out both architecture designs in both Data Center and AWS. This was early 2009, and AWS wasn’t really a whole lot.
Ravi Lachhman 02:37
Just EC2 and like two other things!
Bob Chen 02:39
Yeah, EC2. I remember, it was EC2, S3, and SQS. [Those were] the first three services we were leveraging and [were] available. VPC was not even there at the time— everything was all public [and] external-facing. So that was pretty interesting, right? At that point, we had—whom I consider really a great friend and a mentor— Jim Chung [come] from Yahoo and show me the light, right? He’s like, hey, I want you to help me focus on the system, since you’ve got a good IT background and a good server background. And I managed a lot of corporate data centers and server rooms in the past before, prior to this position. So at that time, I’d already had six, seven years in IT. [He] really showed me and had me try it out and really positioned me towards the cloud.
And it was interesting. I think what was really interesting was [that] he brought a lot of good ideas that were really DevOps-centric— a lot of automation ideas, a lot of scripting. He was sort of an engineering manager. I mean, focused in engineering— he was VP at that time. And he brought a lot of engineering concepts and ideas to my brain at that time. So that’s really how my career started. [It] was kind of dipping into sort of the DevOps model; I had a really good mentor who gave me this great opportunity to focus on the cloud and told me how they did things in the engineering world, and how that can be applied to the server world of things.
Then obviously, from there, I kind of went on. I went to other SaaS companies; AWS kind of grew bigger as well and supported others. I think it was around 2010— probably around 2011, actually— [when] I actually went into a position as DevOps. And that was a DevOps Manager at that time, where I formally had the DevOps title. At that time, I was working for a company called Cake Marketing. We were definitely doing a lot of high-throughput traffic with ad engines and ad marketing. I think that was very unique, and what was also eye-opening for me was [that] the scale I was working in before exponentially exploded when I was working at Cake.
And exponentially exploded, in the sense, where before I normally would be working in a single region— and at that time, the primary region was US East One. US West Two did come online around that time so we did also deploy there, [and] there were also all the other regions globally— Europe, as well as in Asia. But what kind of blew my mind and was another evolution for my DevOps career at that time— and I already had a small team back then of three engineers— was [that] we went from a single region, to two regions, to three regions, to five regions, to six regions in AWS during my tenure there. We went from deploying servers of five, 10, 15, 20, 30, 40, 50. At the high point, I think we even went up to 120 servers, scaling up and down.
At that time, a lot of the server configurations were scripted and executed. The scaling, unfortunately, was manual. So, you know how right now AWS has auto scaling everything? Well, back then, we had auto scaling. Auto scaling was essentially a bunch of DevOps monkeys sitting behind a screen, scaling up and scaling down, scaling up and scaling down, looking at the resource metrics in CloudWatch and scaling up and scaling down. That was very interesting. The multi-region database constraints proved to be quite a big challenge there.
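[Editor's note: the manual loop Bob describes (watch a resource metric, then add or remove servers) can be sketched roughly as below. This is a minimal illustration, not the process used at Cake: the `ScalingPolicy` class, the `desired_server_count` function, and every threshold are hypothetical, and no AWS API is called.]

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    # All values are illustrative assumptions, not figures from the episode,
    # except max_servers, which echoes the ~120-server peak Bob mentions.
    scale_up_cpu: float = 70.0    # average CPU % above which a server is added
    scale_down_cpu: float = 30.0  # average CPU % below which a server is removed
    min_servers: int = 2
    max_servers: int = 120


def desired_server_count(current: int, avg_cpu: float, policy: ScalingPolicy) -> int:
    """Return the fleet size an operator (or a scaling rule) would move to."""
    if avg_cpu > policy.scale_up_cpu:
        target = current + 1
    elif avg_cpu < policy.scale_down_cpu:
        target = current - 1
    else:
        target = current
    # Never go below the floor or above the ceiling.
    return max(policy.min_servers, min(policy.max_servers, target))


if __name__ == "__main__":
    policy = ScalingPolicy()
    for cpu in (85.0, 50.0, 20.0):
        print(cpu, desired_server_count(10, cpu, policy))
```

[A managed auto scaling service evaluates a rule like this continuously; in the setup Bob describes, a person read the metric and made the same decision by hand.]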
I’ll kind of pause there really quick, and just state that if you look at now compared to back then [at] how we’re doing auto scaling and all the options of functionality available now… That was probably around late 2011, early 2012, when we were kind of doing the scaling at that level… It really blows my mind, just [in] this seven-year timespan how the cloud [has] exploded, and how we manage it.
At that time, System Configuration Management was— I’m sure somebody would have thought about it, but at least for me, it was unheard of. We were using a source code repository with Git; we did have as much source and config stored there as we could. But you know, looking back right now, my mind kind of flips. Like, why did I come up with these ideas that currently exist now? Why didn’t these tools exist back then? My life would have been a lot easier. Because if you think about it, a deployment back then, on average, [took] about an hour to two hours to deploy our software. Every single deployment, we had to manually spin up the new servers, run our script, deploy the software, and then scale up and down and so forth.
[We would] kind of do exactly what we do now with blue-green deployment or with traffic-shifting deployments as well. We were kind of doing that manually, which, right now it’s what, like, a single click? And it’s completed in like, two to three minutes. Whereas back then, me and my team, it [was] a team of three doing actual deployments where we all focused on a single region because we had like six regions back then. So we each [would] take a single region and do deployments and validate and so forth, and [we’d] do testing validation across all the regions. [We also would] play a lot with the RDS load balancing, Geo DNS load balancing at that point.
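[Editor's note: a rough sketch of the difference between a blue-green cutover and a traffic-shifting deployment, under the assumption that "blue" is the old fleet and "green" is the new one. The function names and the health-check callback are hypothetical; a real rollout would adjust load balancer or DNS weights rather than return tuples.]

```python
from typing import Callable


def traffic_shift_plan(steps: int) -> list[tuple[int, int]]:
    """Return (blue_percent, green_percent) for each step of the rollout.

    steps=1 is a plain blue-green cutover; more steps shift traffic gradually.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    plan = []
    for i in range(1, steps + 1):
        green = round(100 * i / steps)
        plan.append((100 - green, green))
    return plan


def roll_out(steps: int, healthy: Callable[[int], bool]) -> tuple[int, int]:
    """Walk the plan, validating after each shift; fall back to all-blue on failure."""
    for _blue, green in traffic_shift_plan(steps):
        if not healthy(green):
            return (100, 0)  # roll back: all traffic returns to the old fleet
    return (0, 100)          # every step validated: the new fleet takes everything


if __name__ == "__main__":
    print(traffic_shift_plan(4))
    print(roll_out(4, lambda green_percent: True))
```

[The validation step after each shift corresponds to the per-region "deploy, then validate" work Bob's team did by hand.]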
It’s also kind of around that time [when] I started also attending the AWS re:Invent conferences. And that also alone shaped quite a bit of my career. Going to the conferences like that, talking to like-minded people, talking to a lot of people with great ideas and really understanding and just really learning… I think one thing that I’ve learned throughout my career, including now, is that in our field—technology— you can never stop learning and picking up new things. I think that’s something I’ve tried to ensure that I do personally myself.
Even right now, as I grow more into a management-level position— these days I’m doing about 80% management and 20% technical work or discussions with the team— I try to keep myself grounded and situated. I subscribed to DZone a lot. I have subscribed to a lot of blogs [and] read as much as I can. YouTube— a lot of videos, look at all the software announcements, look at all the AWS blog announcements as well. I ensure I attend all the re:Invents to keep myself up to date. But you know, if you look at it back then, my career from that point forward has been growing (amongst the other organizations I’ve been with) bigger teams, expanding from bigger teams.
I went from Cake to my next position; I went from managing a team of three, to managing a team of five, and then grew that team of five to eight, and so forth. We then went from a team of eight to a team of 12, to a team of 15, and then a team of 18, and so forth. One of the challenges I did struggle [with] the most in my career was juggling multiple region teams. I [had] never been up so much in my life [as] when I first tackled my first region team.
Ravi Lachhman 11:59
Email comes in at night, email comes in in the morning. You wake up in your inbox, your iPhone has several notifications, and you’re like “why”? [chuckles]
Bob Chen 12:08
Well, it was not only that. At that time, I was already a director, and a lot of the escalations came to me directly as well. I remember, my wife used to have this pillow barrier between me and her in bed, and it was kind of used as a cushion to muffle the phone ringing across the room, to get some sanity for her. At that time, we had PagerDuty as well, so escalations [would] come to me and, depending on the severity, would also page out to me as well. So there were naturally a lot of calls in the middle of the night. And at some points, during times when we had major deployments and issues would occur, my wife would prep herself like, “Okay, this is going to be the week where he’s going to get called every single day.” So my wife would intentionally watch TV really late at night.
Ravi Lachhman 13:10
And she’s really tired.
Bob Chen 13:11
Yeah, really tired. So none of these PagerDuty calls are going to wake her up.
Ravi Lachhman 13:16
Oh, that’s so funny.
So in terms of experience, I think Bob has been in the DevOps field the longest of anybody I’ve talked to, like even before the industry blew up. Let me ask you [about] an interesting metric. If you take yourself way back to when you started, there’s a metric called engineering burden. It’s like the engineer-to-server ratio or engineer-to-container ratio today. When you started, how many boxes did you manage versus today in 2020? What is a typical person in your team responsible for?
Bob Chen 13:51
I think without any scripting or automation… I think one person can— this will vary also [depending on] if you are managing Windows or if you are managing Linux. If you are managing Windows, I lower that number quite dramatically.
Ravi Lachhman 14:12
[Chuckles] The day 1 out of school. What were you managing? One box? That was me.
Bob Chen 14:17
[Laughs] I think back then in the late 2000s, before 2010, a single person probably could easily manage 30 servers themselves with scripting and some automation. They could probably push themselves to manage probably 40-50 on top of that. I know at one point, it got to like 60 or 70 [that] I managed myself with another developer at that time. It was already crazy. I mean, we had some really slick automation to push out our code and then to recycle. But for anything that got stuck, we still had to go in and check it. We deployed that many services, and at a given time, you can only do so much.
So I would probably peg that number at the 30-40 mark per person. Obviously, with my team of three at Cake, we got over 100. And at that time, we were also doing probably close to 40 to 50 servers a person. But on top of that, we’re also updating all the load balancers, updating all the Route 53 entries and so forth. So I would say, with DevOps growing and working in the cloud, the number of tasks that are somewhat server-related, but actually application platform-related, also increased on top of that… so load balancers and Route 53, and sometimes there are database configurations as well. So we’re also working with the database team to tackle that.
Ravi Lachhman 16:21
Yeah, that makes perfect sense. Scale brings scale! You know, I never thought of that; you said it really eloquently there. I would just harp on engineer-to-server ratio, but spot on. It’s not just the server. You have other auxiliary things that need to interact— you need storage and computing. You need networking and application infrastructure. Sometimes people like me forget.
Bob Chen 16:43
Correct. Yeah, and if you really look at it right now in the current days and current scheme of things, it’s exponentially worse. Worse in terms of the number of items you actually have to watch and maintain. Because right now you may be able to [have a] single person deploy 100 containers via a single command, but now you’re looking at— and I’m kind of strictly saying AWS, because that’s where most of my experience is—you’ve got the containers, configs, config management, secrets management, load balancers, auto-scaling groups, database, and then if you’re serverless, you also have to manage serverless at the same time, depending on how many servers you deployed.
Before, a service application was a single server. This was all monolithic back then. These days, you’re deploying onto containers. You’re in a big environment, 100 containers per service. So at that, times X amount of services, all the overhead, on top of managing Kubernetes as well as configs and the databases and the routes…. It’s quite dramatic. We have evolved a lot.
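[Editor's note: the multiplication Bob is gesturing at can be made concrete with a small calculation. The function name and all input numbers below are hypothetical; only the "100 containers per service" figure comes from the conversation.]

```python
def units_to_watch(services: int, containers_per_service: int,
                   supporting_pieces_per_service: int) -> int:
    """Total moving parts: containers plus supporting pieces, across all services."""
    return services * (containers_per_service + supporting_pieces_per_service)


if __name__ == "__main__":
    # Hypothetical environment: 20 services, 100 containers each, and 6 supporting
    # pieces per service (configs, secrets, load balancer, scaling group,
    # database, routes).
    print(units_to_watch(services=20, containers_per_service=100,
                         supporting_pieces_per_service=6))
```

[Set against the 30 to 40 servers per person Bob cites for the late 2000s, a count like this is why a single deploy command does not reduce the number of things an engineer has to watch.]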
And automation really is coming into play where, back then if I looked at it as a system engineer-turning-DevOps, you were kind of a script kiddie, essentially. Now, you’re really becoming a full-blown software engineer. You’re required to actually develop code, not write a script.
Ravi Lachhman 18:26
Yeah, that’s something I’ve kind of seen in industry too. Everybody’s becoming a software engineer because of Infrastructure as Code. I always used to get in trouble with the system engineers because — so I’m coming from a very strong software engineering background— and I think one of the things that DevOps really allowed, or this modern way of thinking which Bob certainly ushered in his organizations, is the concept of iteration.
And so for me, as a software engineer, it’s a given that I’m not going to get it right the first time or the 30th time. But as a system engineer, if you go way back to when you were starting, you’re always in production, right? Like you have to get it right. So versus today, we can kind of reconstruct…. Maybe you could touch on how that iteration has crept into the infrastructure world.
Bob Chen 19:15
[Chuckles] Yeah, that is very true. Back then it was “when you do it, you do it during production.” That was definitely seriously in place.
Yeah, we do do a lot of iteration. Definitely now we do a lot of iterations. I do push my team to continuously improve and continuously iterate. Nothing we have ever done is always set in place. If you look at it, I myself continuously iterate myself as well— although not as fast [or] as sharp as a lot of the more talented engineers as compared to myself.
But we do do that. Right now a lot of our infrastructure deployments do go through phases. Usually our first iteration is sitting around “if you’re going to do a deployment of this specific software or infrastructure, how do you do it in its most simplistic form?” That’s iteration number one. It’s ripping out all the fancy stuff, ripping out all the complexity, a lot of the configs. Just to deploy the bare minimum requirements for this specific configuration, for this specific platform— what is really the bare minimum?
And then you want to also look at breaking that up, too. We look at also limiting how much code and script the first initial iteration has, to ensure that we’re simplifying. Because I’ll be honest, I’ve seen deployment scripts from platforms that I manage, where the deployment script was longer than the application we were deploying. The number of lines of code was more than the application, and I’m looking at it like, “Am I deploying out my application, or am I deploying out my own deployment script? It’s an application.” It was a lot of dealing with those where the iteration mentality and concept really comes into key.
And that’s why I also start out in the most simplistic model and work our way up from there. Because at that point, if you want to modularize, it’s easier to look at it holistically and modularize, rather than looking at, you know, a 150-line deployment script code. And [that’s] assuming that you can also fix and troubleshoot easily when that occurs. Not only that— as you iterate and pick it up, you can also tune a lot [more easily] because it’s just easier to go through, easier to see. So if you look at it, we’ll build the most basic requirements; the next step we’ll do is then the next iteration.
And we also want to use the concept of micro-iterations. It’s like, “Okay, well, let’s bring in really small changes, similar to in the microservices world. You want to make small changes to figure out; otherwise, you make a lot of changes and you have to troubleshoot an issue, like ‘Okay, well, which change caused the issue? Did all these changes cause the issue?’”
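[Editor's note: one way to picture micro-iterations is to apply one small change at a time and validate after each, so that a failure points at exactly one change. The function and the change names below are hypothetical, for illustration only.]

```python
from typing import Callable, Iterable, Optional


def apply_in_micro_iterations(
    changes: Iterable[str],
    apply: Callable[[str], None],
    validate: Callable[[], bool],
) -> Optional[str]:
    """Apply changes one by one; return the first change that breaks validation.

    Returns None when every change validates cleanly.
    """
    for change in changes:
        apply(change)
        if not validate():
            return change  # the culprit is known without any guesswork
    return None


if __name__ == "__main__":
    applied: list[str] = []
    culprit = apply_in_micro_iterations(
        ["change-a", "change-b", "change-c"],   # hypothetical change names
        applied.append,
        lambda: "change-b" not in applied,      # pretend change-b breaks things
    )
    print(culprit, applied)
```

[Had all three changes been applied together, the same failed validation would only say that something in the batch was wrong.]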
I guess a lot of pain and experience has gone into this concept of iterations. And continuous improvement is… well, a case example is [that] in the past, I wrote a CloudFormation script to automate the deployment of our database. My initial script was, I think, like 15 lines of code in a YAML file, just to do the bare minimum. The next iteration was adding firewall rules. Second, just initial API calls because that’s the next thing we would add. Then on top of that, [we’d add] custom parameter configs. And then on top of that, we would enable encryption, and so forth.
At some point, we did identify “this script is becoming a little too big to be a single file”, so we started breaking it up and putting it into nested CloudFormation scripts. We broke up security groups, we broke up parameter groups, and everything, right? All that is evolution. So I think when we look at iterations— iteration is good, but also let’s definitely look at iterations in simplicity, right? It’s where you can simplify as well.
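[Editor's note: Bob's original YAML is not shown in the episode, so the sketch below only mirrors the shape of the evolution he describes, using Python dictionaries that serialize to a CloudFormation-style template. The resource types are real CloudFormation types; the logical names, property values, and the template URL are placeholders, and only a subset of the iterations he lists is included.]

```python
import copy
import json


def bare_minimum() -> dict:
    """Iteration 1: one database resource and nothing else."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {"DbPassword": {"Type": "String", "NoEcho": True}},
        "Resources": {
            "Database": {
                "Type": "AWS::RDS::DBInstance",
                "Properties": {
                    "Engine": "postgres",              # placeholder engine
                    "DBInstanceClass": "db.t3.micro",  # placeholder size
                    "AllocatedStorage": "20",
                    "MasterUsername": "exampleuser",   # placeholder name
                    "MasterUserPassword": {"Ref": "DbPassword"},
                },
            }
        },
    }


def add_firewall_rules(template: dict) -> dict:
    """Iteration 2: add a security group and attach it to the database."""
    t = copy.deepcopy(template)
    t["Resources"]["DbSecurityGroup"] = {
        "Type": "AWS::EC2::SecurityGroup",
        "Properties": {
            "GroupDescription": "Database access (illustrative)",
            "SecurityGroupIngress": [
                {"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
                 "CidrIp": "10.0.0.0/16"}  # placeholder network range
            ],
        },
    }
    t["Resources"]["Database"]["Properties"]["VPCSecurityGroups"] = [
        {"Fn::GetAtt": ["DbSecurityGroup", "GroupId"]}
    ]
    return t


def enable_encryption(template: dict) -> dict:
    """A later iteration: turn on storage encryption."""
    t = copy.deepcopy(template)
    t["Resources"]["Database"]["Properties"]["StorageEncrypted"] = True
    return t


def split_into_nested_stack(template: dict, template_url: str) -> dict:
    """Once the file grows too big: move the security group into a nested stack.

    Assumes the child template exposes an output named GroupId.
    """
    t = copy.deepcopy(template)
    del t["Resources"]["DbSecurityGroup"]
    t["Resources"]["SecurityStack"] = {
        "Type": "AWS::CloudFormation::Stack",
        "Properties": {"TemplateURL": template_url},
    }
    t["Resources"]["Database"]["Properties"]["VPCSecurityGroups"] = [
        {"Fn::GetAtt": ["SecurityStack", "Outputs.GroupId"]}
    ]
    return t


if __name__ == "__main__":
    v1 = bare_minimum()
    v2 = add_firewall_rules(v1)
    v3 = enable_encryption(v2)
    # The URL is a placeholder, not a real template location.
    v4 = split_into_nested_stack(v3, "https://example.com/security-groups.yaml")
    print(json.dumps(v4, indent=2))
```

[Each function takes the previous template and returns a slightly larger one, which is the "bare minimum first, one small addition at a time" progression Bob describes.]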
Ravi Lachhman 24:30
It’s happening earlier these days; they’re teaching Comp Sci super early to the kids these days.
I have another question for you. That was a really excellent answer. It sounds like you’re running a sprint, like an agile dev team. Every single one of those concepts you gave to me was like, “this is exactly how a software engineering team runs”, every single time. So there’s a lot of harmony there.
So this is going to be a very greedy question, because I have to answer these questions on Friday, so I will bestow them upon Bob here. There’s no right or wrong answer. You were scaling DevOps teams before DevOps was even a term. And so we flip this on its head; it can be any technology, right? It can be, you’re building a big data center, the PaaS engineering team, or it could be whatever is in the future— the cryptocurrency mining team of 2025. [chuckles] In such a bleeding-edge practice at the time, even if it’s mainstream now— but it’s still futuristic— how did you start scaling that? You went from one person to three to five to, you know… How do you justify [needing] more people for covering up their space and technology, and hoping to hang their hat on it?
Bob Chen 25:47
Yeah, part of the discussion earlier was like, how much a single engineer can handle in support really comes into play. Because I do push thresholds in terms of what a single engineer can and cannot do. This derives from a lot of my experience doing them myself in the past, right? I definitely encourage folks or leaders to be able to do the work that your engineers do. And I do that myself right now— to understand what is actually involved, because understanding what’s involved and what’s required is going to be key to how you scale your team. Because with that, you have the information armed to determine when your engineer is going to be (in a rough estimate) on the brink of what they can and cannot do or support before the quality of their work starts degrading.
And it varies platform by platform, team by team. Rightly so, it’s because of how they’re structured, what automation they have in place, and so forth. A lot of my team actually is also centered and structured around how IT operations and system operations used to operate. It’s the follow-the-sun model. So technically also ensuring every region has the same amount of resources across the board. That way, in their time zone, they have the proper resources to properly support before they have to wake anybody up.
I guess my rule of thumb at this moment is usually [to] place it on quality of the code or of their work, as well as quality of life on the engineer. Because I think one thing we also need to understand is how does this impact an engineer’s quality of life; back then, everybody usually used to think that, you know, you do whatever you can for them to ensure that the system stays up and running, and to keep your company happy. The same model, same mentality applies now, but there’s a lot of companies. A lot of engineers have options, so you also need to ensure that they’re happy as well. You need to keep that in mind, as well.
I have my own grading scale in terms of determining when to add. To share some insight into that, it’s definitely based off the number of incidents or errors that the engineer can make. That’s one of my metrics that I use to gauge if I need to add scale, because having [done] my own budget for a number of years, the first two to three years of managing a budget were quite painful. There were a lot of learning experiences.
Ravi Lachhman 29:02
We’re going through that right now. [laughs]
Bob Chen 29:05
So I tried to play it safe and slowly increase from the scale perspective for people. Because what I do not like and have done in the past before was scaling up too fast, and then having to let people go. As a manager, that impacted me quite a bit. In [all] honesty, having to let somebody go and impacting their own family livelihood. So I try to also ensure that I scale out slowly, and not too fast. To prevent that is— not everybody can. I guess it all depends on your ability to negotiate with product managers and engineering teams. You push back and make do with whatever you can.
But the metric, again, is number of errors. I do also look at number of deployments and so forth, how much infrastructure, how long an environment build takes. If we’re doing a lot of planning work, I might look at it to determine if we want to bring in a contractor to help on the side; usually I do set budget aside for contractors.
I think if I continue, I’m going to put a lot of people to sleep with that, because a lot is not pertaining to what actually is required from a DevOps perspective in operations. But there’s a lot. You want to look at everything. My biggest thing is usually looking at the quality, the uptime, and then the people.
Ravi Lachhman 30:53
Yeah, those are great, stellar points. It’s always hard the more senior you get. You have to start setting KPIs and stuff like that. It’s like, huh? Where do I even start with the KPIs? Or to your point, those are very salient points. You can take a look at the burden the person’s having, if they’re starting to make errors— those are telltale signs. I’ve never had that discussion with anybody before, so the listeners are like “that’s the secret sauce of management.”
Bob Chen 31:24
The secret sauce or the boring sauce? [Laughs] Yeah, I mean, I’ve got my engineers and I’ll ask them, “Hey, you want to step into my shoes for a little bit, and then kind of see our days?” And they’re always like, “No, you do what you do. We’ll do what we do. Stay out of my shoe, and I’ll stay out of your shoe.”
Ravi Lachhman 31:41
[Chuckles] I have two more questions for you.
Bob Chen 31:43
Ravi Lachhman 31:44
I always like to ask these two questions to DevOps leaders. They’re both kind of separate, but one question might have a longer answer. So [if] you’re hiring for another engineer on your team— what are some qualities you look for when you hire?
Bob Chen 32:00
Yeah, that is tough to do. I’ve made bad hires, I’ve made good hires. I like to say I’ve made a lot of good hires in the past, but I hate to say that I’ve also made a lot of bad hires in the past. Just being realistic here. But a lot of the qualities I’ve come to learn from my experience.
Skillset can be taught. The number one trait and quality that I look for is actually how well the impending hire, [and their] personality will jive with the team. I’ve had too many instances in the past, where I’ve had engineers leave because they either don’t like another engineer, or that’s causing them grief… So that’s the first thing I look at. When you build a team, when you are leading a team— having one bad engineer is one of the most frustrating things to deal with, because when that bad engineer acts, other people react to it. Then that level of stress can be threefold, fourfold, or even fivefold on top of your regular things you have to deal with.
So that’s my number one trait because again, as I mentioned, skill sets [and] technical skills— those can be groomed, as long as the engineer shows a willingness and ability to learn. That is not an issue. I’d rather hire somebody that I can trust and rely on [who] will jive with the team. That’s really what I look for these days, and that has been working pretty well, ensuring that they work together. Because one thing that I also found out [was] that when I purely look for that trait and everybody works together, everybody also helps each other out when the time comes. When there’s somebody who needs help or somebody who’s a little bit down, everybody steps up. And to be honest with you, as a team lead, I cannot ask for anything more and better than that.
So that is usually the only trait and the quality that I look for when I interview. A lot of the technical items, I’ll let the team hammer through it. What I do do is I do screen them for technical chops before I let the team dig into them and see what they really can and cannot do. And also, the way I interview is also to ensure that I don’t overburden the team or others with a bunch of interviews. I usually do a phone screening; for [the] phone screening, I’m usually initially screening for primarily personality, a little bit light on the technical skill set.
After that, depending on how I think they did, I might do a second phone screening, but I might focus a little bit more in depth on actual personality [and] dig into the background a little bit and also digging a little bit into a technical discussion and giving them some scenarios and discussions. I will definitely at least do that before I bring them in in-person or give them a skillset test remotely, and then bring them in for in-person. Because I’ll be honest, a lot of feedback I get from my teams before when we’re hiring is also the amount of time it takes for them to interview somebody on top of their existing workload. It’s tremendous. They could be lying to me as well—I don’t know, right? But I do care. And then that’s one of the things I actually do—I’ll do as much due diligence as I can to limit what they have to do and the amount of effort they have to put in for interviewing future team members. But I’ll do my best to ensure that the person that they speak with is the person they actually want to speak with.
Ravi Lachhman 37:06
That’s pretty awesome. I’m literally dealing with that right now. I even brought that to Steve, my boss, like, hey, when do we introduce the interview candidates to the rest of the team? That was yesterday. So all of these things are top of mind for me. [Jokingly] Again instead of [Ship Talk], it’s Ravi’s greedy podcast.
These are all great things. I wish there was a podcast like this years ago to shield the team from the interview process, because like you I get inundated with candidates. It’s like, “yeah, talk to everybody.” It’ll be all day. Good tips!
Bob Chen 37:41
In the past, you know, I usually probably screened about 15 to 20 people. Probably about at the most three of that number [would] get past me and into the actual team. And in some cases, I might even have the lead engineer on the team do additional phone screening, and that probably [would] weed out and down to one or two. The ideal goal for me is to have them only speak to two or three— at the most— in person as well. So that is kind of what I deduce it down to.
Ravi Lachhman 38:21
That’s perfect. And so one last question— and this is how I end every one of the Ship Talk podcasts—is the intrinsic question. Let’s say it’s with young Bob and current Bob. So imagine you just graduated from university or college, and you were walking down the street with current Bob and you ran into your young self. What would be any advice— could be life, technology, anything— that you would tell young Bob with what you know today?
Bob Chen 38:50
Yeah, interesting. Never been asked that question before, at least personally. Um, you know, I probably would say… [jokingly] pick a different career.
Ravi Lachhman 39:06
[Laughs] Boom! Don’t get paged out. Yeah, turn your phone upside down.
Bob Chen 39:10
No, but I think I definitely would tell myself it’s a good career. But I probably would definitely say be a little bit more understanding [of how] the software engineering principles and practices apply to DevOps and how DevOps totally leveraged it. I definitely would have told myself back then, “Hey, focus a lot on software development and coding skills and abilities.” Because I’ll be honest, I think if I did do more of that back then and actually maybe got a computer science degree versus a business degree, that would have helped me out quite a bit. Or I would have gotten a [quicker], bigger jump and leap in terms of getting DevOps, to where DevOps is right now. A lot of the practices and everything would have been ingrained into my head versus picking them up along the way as I work with engineering groups on that.
And really, the last thing is stay true to yourself. You know, a lot of the things that I’ve done in my past really got me to where I am. So stay true to yourself, and believe in yourself, and there’s nothing you can and cannot do. I mean, I’ll be honest, in the past, I’ve told myself, there’s no way in hell you’d be able to script and automate code. But hey, I was able to do it, get to where I am right now. And [I have been] fortunate [enough] to work with a lot of amazing engineers, meet a lot of great talents, and make a lot of great friends.
Ravi Lachhman 40:53
That’s awesome. Well, thanks, Bob, for being on our podcast today. I really appreciate it. Really salient and sage advice from a DevOps leader in the field. So Bob, thanks so much for being on Ship Talk.
Bob Chen 41:05
Yeah, no, thank you. It’s been a great pleasure, and thanks for inviting me to your podcast.