Eighteen months ago, we set out to evolve our cloud infrastructure. In this article, we share what we learned and highlight key aspects of our approach.
The objectives for this initiative were to create new production clusters for scale and to offer our SaaS service in more geographies outside the US.
Ours is a cloud-native stack with dozens of microservices deployed in a Kubernetes cluster. We opted for OpenTofu and Terragrunt as our IaC tools and standardized on Helm charts as service artifacts. Our CI process produces Docker images and corresponding Helm charts as versioned, immutable artifacts; we bake image tags into the charts.
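To make "baking image tags into the chart" concrete, here is a minimal sketch of what a CI step could do, assuming chart values are held as a plain dictionary. The function name, value layout, and tag format are all hypothetical, not our actual CI tooling.

```python
# Sketch: a CI step pins the freshly built image tag into the Helm chart's
# default values, so the packaged chart references exactly one image build
# and becomes a versioned, immutable artifact. All names are illustrative.

def bake_image_tag(values: dict, service: str, tag: str) -> dict:
    """Return a copy of chart values with the service's image tag pinned."""
    baked = {**values}
    image = {**baked.get(service, {}).get("image", {})}
    image["tag"] = tag  # immutable build tag, e.g. a version plus Git SHA
    baked[service] = {**baked.get(service, {}), "image": image}
    return baked

values = {"api": {"image": {"repository": "registry.example.com/api",
                            "tag": "latest"}}}
baked = bake_image_tag(values, "api", "1.42.0-abc1234")
print(baked["api"]["image"]["tag"])  # → 1.42.0-abc1234
```

Because the tag is fixed at package time rather than at deploy time, the same chart version always deploys the same image, which is what makes the artifact immutable.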
We divided our stack into four tiers from the bottom (Tier-1) to the top (Tier-4), as illustrated in the diagram below:
These tiers reflect the separation of concerns between security and operations. Each tier functions as follows:
Tier-1 needs the most privileged access (it needs to be IAM admin). Our Security Operations team operates this tier from their workstations. The Tier-1 setup also deploys a Harness Delegate with an IAM role that has the permissions (scoped to the project) required to manage the other tiers. We operate Tiers 2-4 through Harness pipelines. Harness' RBAC system provides granular controls for managing access at the environment level. For production environments, we restrict access to Tier-2 and Tier-3 to Cloud Engineers (who manage our production infrastructure), while Tier-4 is available to individual application teams for their independent service deployments.
We use an external secrets manager to pass secrets from lower tiers to upper tiers; Cloud and Application engineering teams never see the secrets themselves. Wherever applicable, we use keyless workload identities in our application and infrastructure tiers.
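The article does not name the secrets tooling, but assuming the open-source External Secrets Operator, the hand-off might look like the sketch below: a lower tier writes a secret to the cloud secret store, and an upper-tier namespace pulls it via an `ExternalSecret` without any engineer handling the value. Store, secret, and namespace names are hypothetical.

```yaml
# Sketch only: an upper-tier workload consumes a secret published by a
# lower tier, mediated by the cluster's secret store. Names are illustrative.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-db-credentials
  namespace: tier4-app
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: cloud-secret-store   # configured by the lower tier's setup
    kind: ClusterSecretStore
  target:
    name: db-credentials       # Kubernetes Secret created in this namespace
  data:
    - secretKey: password
      remoteRef:
        key: prod-db-password  # entry the lower tier wrote to the store
```

The point of the pattern is that only machine identities touch the secret: the lower tier writes it, the operator syncs it, and application pods mount the resulting Kubernetes Secret.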
This tiered approach to the infrastructure stack ensures best-in-class security controls for our cloud infrastructure and provides flexibility and agility for application teams' development flows.
We have more than a dozen independent development teams. Devspaces are on-demand, production-like environments where teams can test their features. We use the same infrastructure stack to build devspaces: each one is implemented as an isolated namespace (Tiers 3 and 4) in a shared cluster (Tiers 1 and 2). Developers can deploy feature builds to their devspaces while the rest of the stack runs a production-like configuration managed by the central team.
Devspaces have proven very versatile in our development process. They have removed the bottleneck of a shared integration environment, enabling each development team to do end-to-end feature testing in its own environment. They serve a variety of use cases: feature testing, performance testing, demo environments for early feedback, and documentation. In a typical week, we see over a thousand feature-build deployments across a hundred-odd devspaces. To keep them cost-efficient, we have built features like TTLs and per-team cost visibility.
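The TTL mechanic can be sketched as a periodic reaping pass. The real implementation lives in our platform tooling and reads creation timestamps and TTL annotations from the cluster; this minimal sketch uses plain values, and all names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Sketch: decide which devspaces have outlived their TTL. A real reaper
# would list namespaces and read a TTL annotation; here both are inputs.

def expired_devspaces(devspaces: dict, ttl: timedelta,
                      now: datetime) -> list:
    """Return names of devspaces whose age exceeds the TTL, sorted."""
    return sorted(name for name, created in devspaces.items()
                  if now - created > ttl)

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
spaces = {
    "team-billing-feat-x": now - timedelta(days=10),
    "team-search-demo": now - timedelta(days=2),
}
print(expired_devspaces(spaces, ttl=timedelta(days=7), now=now))
# → ['team-billing-feat-x']
```

Deleting a devspace is cheap precisely because it is just a namespace: reclaiming it tears down Tiers 3 and 4 for that team while the shared cluster keeps running.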
All aspects of our infrastructure are version-controlled in Git, including infrastructure and pipeline definitions and environment-specific configuration. Commit history gives us an audit trail, pull request flows govern changes, and Git-based versioning makes environment setup completely repeatable.
Standardized IaC driven by Harness pipelines has given us a flexible mechanism for creating environments for various use cases at Harness. We call this approach Environment-as-a-Service. The diagram below depicts the different use cases where we employ it.
This initiative has had a significant impact at Harness. In the last few months, we have created three new production clusters, one integration environment, two QA environments, and over a hundred devspaces. The time to create a new production cluster has dropped from many weeks to a few hours.
We are working with some of our large enterprise customers who are interested in adopting our approach. Going forward, we plan to improve documentation and system usability, and to open-source our infrastructure repository so that others can benefit from this work.