It’s the most spooktacular time of the year, and we’re thrilled at the terrifying tales you submitted through our Developer Horror Stories Contest. Make no bones about it, we had trouble picking witch one was the most unboolievably scary. Migration nightmares, bloody bugs, ghastly failures - so much to sink our teeth into.
So this was a pre-sales engagement where signing the contract was contingent on a working demo. I assembled a team of developers and we built a proof of concept. It was a plain old simple web app hosted on a desktop. Why a desktop? Because this was still 2012 and cloud was not a thing. We basically repurposed desktop computers for simple use cases such as these. After the proof of concept was fully developed, our team travelled to the customer's office to demo the application. We had four scenarios overall to show and tell. It all seemed to work until we got past the 50% mark in the demo. Then nothing worked. The customer was understanding and was willing to look at the rest of the demo the following day, which went well.
So, why did it fail? Imagine a new hire walking to a desktop computer and switching it off to save electricity because he saw the desk unoccupied for 2 days. Thanks bro!
We were building our green product at the time - I guess it would be classified as a renewable energy tech project - and we were a big AWS shop. As you can imagine, all the microservices that made up that product were backed by an integration with an AWS service, like S3, RDS, SQS, Lambda - you name it. Those were all Amazon services that we were relying on to back our applications for this renewable product. And we made a decision at the time that we wanted to move over to Azure.
We didn't know much about it, we didn't have much expertise around it. A lot of our teammates came from AWS shops previously, so this really was uncharted territory for us. We had one of our teams lead the charge and figure out, “Okay, what will it take to migrate over from AWS to Azure?”
At the time, Azure was still on the come-up. It was still significantly behind AWS in terms of feature parity and the services and capabilities that were needed to satisfy our regulatory needs, because this is renewable energy. We had to make sure that the data was secure, transactions were happening in real time, the time series metrics were always being streamed properly. Those kinds of things came into account when we were looking at Azure versus AWS.
Eventually, we settled on price. Azure was cheaper, they gave us a lot more things like Microsoft Teams - the whole office package - and this major Azure credit. So we start the migration without any full understanding of what it means to deploy an application and have it integrate with the services. We did the work, and many teams were able to adopt Azure and piece together some solutions to integrate with their service. However, it wasn't one-to-one. As a business, we're thinking, “Well, we need to cut over from AWS very soon and move to Azure.” Our customers are waiting on the service so that they can consume it for their needs. And we're sitting here, slowly moving this along. We're ‘trial by fire’ on every single microservice. We're trying to figure out what's not working. The application is down because we're using this Azure service that's not returning data the way we're expecting it to, or because of networking, we're unable to talk to the resource.
It was a disaster. It got to a point where we had done all this work and were 25% of the way there on migrating over. This was now over a year in, and we had a leadership change. And what happened was, they were like, “Oh my God, why is this taking so long? Are we really trying to shave a few bucks just to move to a different cloud provider? I don't think so.”
They moved all those migrated applications back to AWS. For those of us who already started discontinuing some of our services, we had to re-integrate with AWS. So that added another 6 months of work. People were furious because they were putting so much time into figuring out the migration strategy. Teammates were already giving us frameworks to use to migrate the rest of our services. It just came down to - we thought about this based on price, but the reality is we never needed to undergo this kind of pain. And we ended up going back to AWS.
I still remember that many teams were super upset. After some of them quit and went on to startups in the Bay Area, they were like, “We're done with this big company mentality. Why do we have to migrate if we needed certain services to be in Azure to sign a deal?” We should have approached it with a multicloud strategy, not a migration.
The lesson learned here for me was that when you're making a big decision like this, you should design your applications in a way that is agnostic of your cloud providers. You should have a generic contract in there so that you can communicate with any of these cloud providers’ services, because you never know. Tomorrow, you might need to be in a different cloud provider to go after a different set of customers. You need to make sure your application is resilient and flexible that way.
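To make the "generic contract" idea concrete, here is a minimal sketch in Python. Every name in it (BlobStore, InMemoryBlobStore, archive_reading) is hypothetical and does not come from the story. The in-memory class only stands in for a real adapter, which would wrap a vendor SDK for something like S3 or Azure Blob Storage and is not shown here.

```python
from typing import Protocol


class BlobStore(Protocol):
    """Hypothetical provider-neutral contract for object storage."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryBlobStore:
    """Stand-in adapter for this sketch; a real one would wrap a vendor SDK."""

    def __init__(self) -> None:
        self._objects = {}  # maps object key -> stored bytes

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def archive_reading(store: BlobStore, meter_id: str, payload: bytes) -> str:
    """Application code: depends only on the contract, never on a provider."""
    key = f"readings/{meter_id}"
    store.put(key, payload)
    return key


# Assumed example values, for illustration only.
store = InMemoryBlobStore()
key = archive_reading(store, "meter-42", b"13.7 kWh")
print(key, store.get(key))
```

With this shape, moving to another provider would mean writing one more adapter that satisfies BlobStore, while archive_reading and the rest of the application stay untouched. Whether that holds for services with very different semantics is exactly the kind of thing worth testing early.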
But that was a terrifying couple of months after that, all that work. It was like you were running a marathon, you were coming in first, and you got injured in the last mile. That was the feeling that we had. It was terrible. It was terrifying. But you know, a lot of learning from that, for sure. Going forward, make sure you have a solid path, make sure your application is not tied to a specific cloud provider, and architect your applications to have a platform layer that allows you to integrate with other things. Those are the learnings that we definitely picked up from that time.
We had a massive contract with Pivotal, a company under the Dell umbrella at the time. They were pushing pair programming, the Twelve-Factor app, being able to build an application and quickly deploy it to a managed PaaS layer so you don't have to worry about AWS, you don't have to worry about regions. Cloud Foundry was a nice layer on top of that, to take away that complexity.
We heavily used Cloud Foundry to deploy a lot of our business-facing applications and consumer-facing applications. Cloud Foundry had its perks, but when we needed to do more complex things like canary deployments and blue-green deployments, or run these applications in an on-prem environment or in a disconnected on-prem environment, it wasn't really easy.
We needed to containerize these applications, because Cloud Foundry’s on-premise installation was not approved by a lot of our customers. They're very regulated. They approved Docker, and they approved Kubernetes on-premise. So we had to undergo this migration where we would port over all of our Cloud Foundry applications, which were not Dockerized applications at the time. They were JARs, WARs, and zip files that were deployed into a prod space essentially for people to consume. We used all these plugins, like the autoscaler plugin. We'd always have to manage the buildpacks and figure out which ones we needed to use, and which we weren't compatible with.
This Docker effort required us to make sure that we had all the dependencies, everything baked in and ready to go so that when this container spun up, it actually ran, and on top of that, it was able to communicate with other services. And on top of that, we had another complexity, which was learning Kubernetes. We had an internal tool, which did CI/CD, but it deployed onto a Mesosphere containerized cluster. That was not Kubernetes. It had its own formula for deployment. We were stuck in this place where we had to learn Kubernetes and then come up with a migration strategy to move all these business-facing applications onto Kubernetes and containerize them.
The challenge was that we had a 6-month deadline. We came up with a migration tool to move these applications over. What happened was, while we were trying to do so, we also went the full hardcore developer route, which was, “Oh, we have to set up Kubernetes on our own on-premise cluster, we don't want to consume EKS or any other service provider for it.” And then on top of that, we had to manage the RBAC, and come up with the migration tool to see, “Okay, these PCF manifests need to be converted into Kubernetes-style manifests, and we don't even know what all the mappings are.” So we had to come up with the hybrid mapping model.
We tried to mask Kubernetes away from the developers so that they would not have to go through the heavy lifting of understanding what those YAML files were made of. They would just give us an app, and we would magically convert it from PCF to Kubernetes.
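For readers who have not seen this kind of conversion, here is a rough Python sketch of what mapping one Cloud Foundry application entry onto a Kubernetes Deployment might look like. It is not the tool from the story. The function names, the example app, and the image reference are all made up for illustration, and treating a Cloud Foundry "512M" as a Kubernetes "512Mi" is an assumption. Note that a buildpack-deployed app has no container image at all, so the sketch has to take one as a separate argument.

```python
# Manifest keys this sketch knows how to translate; everything else is reported.
MAPPED_KEYS = {"name", "instances", "memory", "env"}


def to_k8s_memory(cf_memory: str) -> str:
    """Turn a Cloud Foundry size like '512M' or '1GB' into '512Mi' or '1Gi'."""
    text = cf_memory.strip().upper()
    if text.endswith("B"):
        text = text[:-1]
    number, unit = text[:-1], text[-1:]
    if unit not in ("M", "G") or not number.isdigit():
        raise ValueError(f"unsupported memory value: {cf_memory!r}")
    return f"{number}{unit}i"


def convert_app(app: dict, image: str) -> tuple[dict, list[str]]:
    """Return a Deployment dict plus the manifest keys that were NOT mapped."""
    labels = {"app": app["name"]}
    container = {
        "name": app["name"],
        "image": image,
        "env": [
            {"name": k, "value": str(v)}
            for k, v in sorted(app.get("env", {}).items())
        ],
    }
    if "memory" in app:
        container["resources"] = {
            "limits": {"memory": to_k8s_memory(app["memory"])}
        }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app["name"], "labels": labels},
        "spec": {
            "replicas": app.get("instances", 1),
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }
    # Surface what was skipped instead of silently dropping it.
    unmapped = sorted(set(app) - MAPPED_KEYS)
    return deployment, unmapped


# Hypothetical app entry, invented for this example.
pcf_app = {
    "name": "billing-api",
    "instances": 3,
    "memory": "512M",
    "env": {"SPRING_PROFILES_ACTIVE": "prod"},
    "services": ["billing-db"],
    "buildpacks": ["java_buildpack"],
}
deployment, unmapped = convert_app(
    pcf_app, image="registry.example.com/billing-api:1.0"
)
print(unmapped)
```

The list of unmapped keys is the interesting part. Bound services and buildpacks have no one-line Kubernetes equivalent, so a converter that reports them at least makes the gaps visible before a container fails to come up healthy.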
We tried the first three applications and things just didn't work. Our mapping was wrong. The containers failed to get healthy. People didn't understand what a service was. People didn't know the Pod concept. They weren't familiar with how Kubernetes orchestrates these containers and schedules them and they didn't have this concept of a stateful application, or that if there's a service that requires a DB, you want it to have a persistent volume. That was not there either.
There were a lot of things that we missed when we were migrating this over. And as a result, the 6-month timeline we had, we just couldn't hit it. We actually failed. Coming off another failed migration, this one was the final straw for a lot of people. We thought we had the learnings from that previous failed migration. We had a migration strategy, we had the mapping, it was automated so users wouldn't have to struggle through it or do any heavy lifting for it. But we just fell short because a) our own knowledge of Kubernetes wasn't correct and b) when we did the modeling, we never took into account all these corner cases. Not all the apps are made the same way and not all apps are deployed the same way, we found out. These PCF applications were getting force-fitted into this hack of a template that we came up with.
We ended up having all this work that was scrapped. It got so bad, to the point where people were like, “Forget about it, we'll ride it out with Cloud Foundry. In fact, we can go open source Cloud Foundry and manage all our own nodes and the whole Cloud Foundry installation if we really want to, but we are not going to invest much more effort into Kubernetes. That's out of the question.”
That resulted in people getting very upset. A lot of teammates left after that project, because that was the final straw for them. That is another horror story: you have to make sure you have these plans in place. Otherwise you're going to be in this mess where people are unhappy, they've done all this work, tried to bring things over, and it just doesn't work. You waste time for your team, you waste time for your business, and your customers who are using that application are gonna be like, “What the heck, you guys took this down so you can publish a new version and it's not there. You have to turn back on the old application.” That stinks. It costs a business a lot of money, and it can very much cost people their jobs. It's a dangerous game if you don't do it right.
Some years ago I was hired to customize a financial system. First test I ran exploded in my face. Turns out my predecessor would test systems with no data, and if it ran, moved it to production. The programmer's last name was “Bugg.”
--
There you have it, ghouls and gals, scary stories to fill your tomb. See you next year - and until then, creep it real.