

For decades, SCM has meant one thing: Source Code Management. Git commits, branches, pull requests, and version history. The plumbing of software delivery. But as AI agents show up in every phase of the software development lifecycle, from writing a spec to shipping code to reviewing a PR, the acronym is quietly undergoing its most important transformation yet.
And this isn't a rebrand. It's a rethinking of what a source repository is, what it stores, and what it serves, not just to developers, but to the agents working alongside them.
AI agents in software development are powerful but contextually blind by default. Ask a coding agent to implement a feature and it will reach out and read files, one by one, directory by directory, until it has assembled enough context to act. Ask a code review agent to assess a PR and it will crawl through the codebase to understand what changed and why it matters.
Anthropic's 2026 Agentic Coding Trends Report documents this shift in detail: the SDLC is changing dramatically as single agents evolve into coordinated multi-agent teams operating across planning, coding, review, and deployment. The report projects the AI agents market to grow from $7.84 billion in 2025 to $52.62 billion by 2030. But as agents multiply across the lifecycle, so does their hunger for codebase context, and so does the cost of getting that context wrong.
This approach has two brutal failure modes: the context the agent assembles is incomplete, because file-by-file crawling never guarantees it finds the abstractions that matter; and assembling it is slow and expensive, because every task starts with a fresh crawl of the repository.
The result? Agents that hallucinate implementations because they missed a key abstraction three directories away. Code reviewers that flag style issues but miss architectural regressions. PRD generators that know the syntax of your codebase but not its soul.
The bottleneck is not the model. It is the absence of a pre-computed, semantically rich, always-available representation of the entire codebase: a context engine.
Consider a simple task: "Add rate limiting to the /checkout endpoint."
Without a context engine, a coding agent opens checkout.go, reads the handler function, and writes a token-bucket rate limiter inline at the top of the handler. The code compiles. The tests pass. The PR looks clean.
The agent missed three things: an established rate-limiting pattern already used elsewhere in the service, a shared interface that existing limiters implement, and the standard metrics pipeline that observability depends on.
The code works. The team that maintains it finds it wrong in every way that matters. A senior engineer catches these issues in review, requests changes, and the cycle restarts. Multiply this by every agent-generated PR across every team, every day.
With a context engine, the same agent queries before writing code: "How is rate limiting implemented in this service?" The context engine returns the existing rate-limiting implementation and where it lives, the shared interface it implements, the metrics it emits, and the test conventions that cover it.
The agent writes a new rate limiter that follows the established pattern, implements the shared interface, emits metrics through the standard pipeline, and includes tests that match the existing style. The PR wins approval on the first pass.
The difference is context quality, not model quality.
The Language Server Protocol (LSP) transformed developer tooling in the past decade. By standardizing the interface between editors and language-aware backends, LSP gave every IDE, from VS Code to Neovim, access to autocomplete, go-to-definition, hover documentation, and real-time diagnostics. LSP was designed to serve a specific consumer: a human developer, working interactively, in a single file at a time. That design made the right trade-offs for its era: analysis computed lazily, on demand, as the developer types; state scoped to a single editing session; and a view of the code driven by the open file and the cursor position.
For interactive development, these are strengths. LSP excels at what it was built to do.
Agents are a different class of consumer. They don't sit in a file waiting for cursor events. They operate across entire repositories, across SDLC phases, often in parallel. They need the full semantic picture before they start, not incrementally as they navigate.
Agents need not a replacement for LSP, but a complement: something pre-built, always available, queryable at repo scale, and semantically complete, ready before anyone opens a file.
Lossless Semantic Trees (LST), pioneered by the OpenRewrite project (born at Netflix, commercialized by Moderne), take a different approach to code representation.
Unlike the traditional Abstract Syntax Tree (AST), an LST preserves formatting, whitespace, and comments (hence "lossless"), carries full type attribution resolved against the project's dependencies, and can be serialized and stored, so an entire codebase can be queried without re-parsing.
This is the first layer of a Source Context Management system. Not raw files. Not a running language server. A pre-indexed semantic tree of the entire codebase, queryable by agents at any time.
A proper Source Context Management system is not a single component. It is a three-layer stack that turns a repository from a file store into something agents can actually reason over.
Every file in the repository is parsed into an LST and simultaneously embedded into a vector representation. This creates two complementary indices: a structural index derived from the LSTs, for precise, symbol-level queries, and a semantic index derived from the embeddings, for natural-language and similarity search.
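As a toy illustration of the dual index (hypothetical symbols and chunks; a real system would use an LST parser and a learned embedding model, where this sketch substitutes a bag-of-words vector):

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words term counts (a real engine uses a learned model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Structural index: symbol -> location, built from the parsed trees.
structural_index = {
    "RateLimiter.allow": {"file": "ratelimit/limiter.go", "kind": "method"},
    "CheckoutHandler": {"file": "checkout/handler.go", "kind": "function"},
}

# Semantic index: chunk -> vector, for natural-language lookup.
semantic_index = {
    "token bucket rate limiter middleware":
        embed("token bucket rate limiter middleware"),
    "checkout handler validates cart and charges card":
        embed("checkout handler validates cart and charges card"),
}

def semantic_search(query, k=1):
    q = embed(query)
    ranked = sorted(semantic_index,
                    key=lambda c: cosine(q, semantic_index[c]), reverse=True)
    return ranked[:k]

print(semantic_search("how is rate limiting implemented?"))
```

An agent asking "how is rate limiting implemented?" lands on the rate-limiter chunk via the semantic index, then pivots to the structural index for the exact symbol and file.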
The LST and semantic indices are projected into a code knowledge graph, a property graph where nodes are functions, classes, modules, interfaces, and comments, and edges are relationships: calls, imports, inherits, implements, modifies, tests.
This graph enables queries like: which functions call this one, which tests cover this module, what implements this interface, and what breaks if this signature changes.
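A minimal sketch of such queries over a toy property graph (illustrative symbol names, not from a real codebase):

```python
# Toy code knowledge graph: nodes are symbols, edges are typed relationships.
edges = [
    ("checkout.Handler", "calls", "ratelimit.TokenBucket"),
    ("api.Gateway", "calls", "ratelimit.TokenBucket"),
    ("ratelimit.TokenBucket", "implements", "ratelimit.Limiter"),
    ("ratelimit_test.TestBucket", "tests", "ratelimit.TokenBucket"),
]

def query(relation, target):
    """Who has `relation` pointing at `target`? e.g. all callers of a function."""
    return sorted(src for src, rel, dst in edges
                  if rel == relation and dst == target)

# "What breaks if TokenBucket's signature changes?" -> its callers.
print(query("calls", "ratelimit.TokenBucket"))
# "Which tests cover TokenBucket?"
print(query("tests", "ratelimit.TokenBucket"))
```

A production system would back this with a real graph store and index the edges in both directions, but the query shape is the same.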
The context engine exposes itself through a Model Context Protocol (MCP) server or REST API, so any agent (whether a coding agent, a review agent, a risk assessment agent, or a documentation agent) can query the context engine directly, retrieving precisely the subgraph or semantic chunk it needs, without ever touching the raw file system.
The key insight: agents never read files. They query the context engine.
A single context engine can serve every phase of the software development lifecycle.
A PRD agent queries the context engine to understand existing capabilities, technical constraints, and module boundaries before generating a requirements document. It produces specs grounded in what the system actually is, not what someone thinks it is.
A spec agent traverses the code graph to identify affected components, surface similar prior implementations, flag integration points, and propose an architecture, all without reading a single file directly.
A coding agent retrieves the precise subgraph surrounding the feature area: the types it needs to implement, the interfaces it must satisfy, the patterns used in adjacent modules, the test conventions for this package. It writes code that fits the codebase, not just code that compiles.
A review agent queries the context engine to understand the semantic diff, not just what lines changed, but what that change means for the rest of the system. It can immediately surface: every caller affected by a changed signature, interfaces whose contracts the change alters, and deviations from the patterns used in adjacent modules.
A risk agent scores every PR against the code graph, identifying high-centrality nodes (code that many things depend on), historically buggy modules, and changes that cross team ownership boundaries. No DORA metrics spreadsheet required.
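The centrality idea can be sketched in a few lines (hypothetical module names; a real engine would compute this over the full code graph):

```python
from collections import Counter

# Dependency edges: (dependent, dependency). Hypothetical module names.
deps = [
    ("checkout", "auth"), ("billing", "auth"), ("profile", "auth"),
    ("checkout", "cart"), ("billing", "ledger"),
]

# In-degree centrality: how many modules depend on each node.
centrality = Counter(dependency for _, dependency in deps)

def risk_score(changed_modules, threshold=2):
    """Flag a PR as high-risk if it touches any high-centrality module."""
    hot = {m for m in changed_modules if centrality[m] >= threshold}
    return {"high_risk": bool(hot), "hot_modules": sorted(hot)}

print(risk_score(["auth", "profile"]))  # auth has 3 dependents -> high risk
```

Real scoring would also weight historical defect density and ownership boundaries, but in-degree alone already separates "edits a leaf module" from "edits the thing everything depends on."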
A documentation agent can traverse the code graph to generate living documentation (architecture diagrams, module dependency maps, API contracts) that updates automatically as the codebase evolves. Design principles can be encoded as graph constraints and validated on every merge.
When a production incident occurs, an on-call agent queries the context engine with the failing component and gets an immediate blast-radius map, the last 10 changes to that subgraph, the owners, and the test coverage status. Time-to-understanding drops from hours to seconds.
The business case is simple: fewer review cycles, more agent-generated PRs approved on the first pass, and time-to-understanding during incidents measured in seconds instead of hours.
This is not a theoretical architecture. The pieces exist today: OpenRewrite and Moderne for lossless semantic trees, embedding models and vector databases for semantic search, property-graph databases for the code knowledge graph, and the Model Context Protocol for the agent-facing interface.
The missing piece is not any individual component. It is the platform that assembles them into a unified, repo-attached context engine that every agent in the SDLC can query through a single interface.
Source Context Management faces real engineering challenges: keeping the index fresh on every commit, incremental re-parsing at monorepo scale, consistent coverage across languages, and the storage and compute cost of maintaining graph and vector indices side by side.
This is the shift:
A repository is not a collection of files. A repository is a knowledge graph with a version history attached.
Git's job is to version that knowledge. The context engine's job is to make it queryable. The agent's job is to act on it.
Follow this model and the consequences are concrete. Every CI/CD pipeline should include a context engine update step, as natural as running tests. Every developer platform should expose a context engine API alongside its code hosting API. Every AI coding tool should be evaluated not just on model quality but on context engine quality.
Source code repositories that don't invest in their context layer will produce agents that are fast but wrong. Repositories with rich, well-maintained context engines will produce agents that feel like senior engineers, because they have the same depth of understanding of the codebase that a senior engineer carries in their head.
The LSP gave us IDE intelligence. Git gave us version control. Docker gave us portable environments. Kubernetes gave us cluster orchestration. Each of these was an infrastructure primitive that unlocked a new generation of developer tooling.
The context engine is the next such primitive. It is the prerequisite for every agentic SDLC capability worth building. And like every infrastructure primitive before it, the teams and platforms that build it first will be hard to catch.
SCM is no longer just about managing source code. It's about managing the context that makes the source code understandable.


In today's always-on digital economy, a single slow page or unexpected crash during peak traffic can cost businesses thousands or even millions of dollars in lost revenue, damaged reputation, and frustrated customers. Imagine Black Friday shoppers abandoning carts because your e-commerce site buckles under load, or a SaaS platform going down during a major product launch. This is where load testing becomes non-negotiable.
Load testing simulates real-world user traffic to ensure your applications, websites, and APIs stay fast, stable, and scalable. It's a cornerstone of performance testing that helps teams catch bottlenecks early, validate SLAs, and build resilient systems.
If you're searching for a complete load testing guide, what is load testing, or how to perform load testing, you're in the right place. This beginner-friendly introduction covers everything from the basics to best practices, with practical steps anyone can follow.
Load testing is a type of performance testing that evaluates how your system behaves under expected (and sometimes peak) user loads. It simulates concurrent users, requests, or transactions to measure key metrics such as response times (average, p95, p99), throughput (requests per second), error rates, resource utilization (CPU, memory, database connections), latency, and scalability.
Unlike unit or functional tests that check "does it work?", load testing answers: "How does it perform when 1,000 (or 100,000) people use it at once?"
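Those metrics are straightforward to compute from raw samples. A small sketch (nearest-rank percentiles; the sample values are made up):

```python
def summarize(latencies_ms, window_s, errors):
    """Compute core load-test metrics from raw latency samples."""
    xs = sorted(latencies_ms)
    def pct(p):  # nearest-rank percentile
        return xs[min(len(xs) - 1, int(p / 100 * len(xs)))]
    return {
        "avg_ms": sum(xs) / len(xs),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
        "throughput_rps": len(xs) / window_s,
        "error_rate": errors / len(xs),
    }

samples = [40, 42, 45, 50, 55, 60, 80, 120, 300, 900]  # ms
print(summarize(samples, window_s=5, errors=1))
```

Note how one 900 ms outlier barely moves the average but dominates p95/p99, which is exactly why SLAs are written against percentiles rather than means.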
Done early and often, load testing reduces risk across the lifecycle. It confirms capacity assumptions, reveals infrastructure limits, and proves that recent changes haven’t slowed critical paths. The result is fewer production incidents and fewer late-night fire drills.
Key terminology to anchor your approach: virtual users (simulated concurrent users), ramp-up (how quickly load increases), steady state (sustained load at target concurrency), think time (pauses between simulated user actions), and percentiles (the p95/p99 latency thresholds most SLAs are written against).
Effective load testing quantifies capacity, validates autoscaling, and uncovers issues like thread pool starvation, database contention, cache thrash, and third-party limits. With data in hand, you can tune connection pools, garbage collection, caching tiers, and CDN strategies so the app stays fast when it counts.
Skipping load testing is like launching a rocket without wind-tunnel tests: risky and expensive. Here's why it's essential: it protects revenue during traffic peaks, validates SLAs before your customers do, exposes bottlenecks while they're still cheap to fix, and gives capacity planning real numbers instead of guesses.
Investing in load testing upfront keeps teams focused on building, not firefighting. Many major outages (think major retailers or banking apps) trace back to untested load scenarios. Load testing helps you ship with confidence.
Not all traffic patterns are the same, and your system shouldn't be tested with a one-size-fits-all approach. Different load testing scenarios help you understand how your application behaves under various real-world conditions, from everyday usage to extreme, unpredictable events: baseline tests simulate expected traffic, stress tests push beyond capacity to find the breaking point, spike tests model sudden surges like a flash sale, and soak tests hold sustained load for hours to expose leaks and slow degradation.
Load testing isn’t just about throwing traffic at your system, it’s about understanding how your application behaves under real-world conditions and uncovering hidden bottlenecks before your users do.
Here's a step-by-step guide to load testing: define objectives and target SLAs; model realistic user scenarios and data; script the test in your tool of choice; run it against a production-like environment, ramping load gradually; monitor client-side metrics alongside server resources; then analyze the bottlenecks, fix them, and re-run.
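The core run-and-measure loop can be sketched with nothing but the standard library; the request function here is a stub so the sketch runs offline, and you would swap in a real HTTP client to test an actual endpoint:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request():
    """Stand-in for an HTTP call so the sketch runs offline."""
    latency = random.uniform(0.01, 0.05)  # pretend 10-50 ms responses
    time.sleep(latency)
    return latency * 1000  # report latency in ms

def run_load(virtual_users=20, requests_per_user=5):
    # Each worker thread plays one virtual user firing requests.
    with ThreadPoolExecutor(max_workers=virtual_users) as pool:
        futures = [pool.submit(fake_request)
                   for _ in range(virtual_users * requests_per_user)]
        latencies = sorted(f.result() for f in futures)
    p95 = latencies[int(0.95 * len(latencies))]
    return {"requests": len(latencies), "p95_ms": round(p95, 1)}

print(run_load())
```

Dedicated tools (Locust, k6, JMeter) add ramp-up schedules, distributed workers, and reporting on top of exactly this pattern.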
Load testing is an iterative process, not a one-time activity. The more consistently you test and refine, the more resilient and reliable your system becomes over time.
Moving into 2026 and beyond, AI is shifting load testing from a manual, scheduled chore into an intelligent, autonomous process. Instead of relying on static scripts, AI agents now ingest vast streams of real-world data including recent incident reports, deployment logs, and even design changes documented in wikis to generate context-sensitive testing scenarios. This ensures that performance suites are no longer generic; they are hyper-targeted to the specific risks introduced by the latest code commits or environmental shifts, allowing teams to catch bottlenecks before they ever reach production.
The relationship between testing and infrastructure has also become a two-way street. Beyond just identifying breaking points, AI-driven analysis of load test results now provides proactive recommendations for deployment configurations. By correlating performance metrics with resource allocation, these systems can suggest the "golden path" for auto-scaling thresholds, memory limits, and container orchestration. This creates a continuous feedback loop where the load test doesn't just pass or fail; it actively optimizes the production environment for peak efficiency.
As AI agents proliferate, load testing is no longer just about hitting a server with traffic; it's about managing the explosion of agentic orchestration. With organizations deploying hundreds of specialized AI agents, a single user request can trigger a "storm" of inter-agent communication, where one agent's output becomes another's prompt. Traditional load tests fail here because they can't predict these emergent behaviors or the cascading latency that occurs when multiple agents reason, call external APIs, and update shared memory simultaneously. Testing must now account for "prompt bloat" and context contamination, where excessive or conflicting data fed into these agent chains causes performance to degrade or costs to spike unexpectedly.
To survive this complexity, performance engineering in 2026 has shifted toward dynamic environment testing and automated "prompt volume" estimation. Load testers are now using tools like AI Gateways to monitor and rate-limit the massive volume of prompts moving between agents, ensuring that "reasoning loops" don't turn into infinite, resource-draining cycles. By simulating thousands of parallel agent trajectories in virtual sandboxes, teams can identify the specific point where a flurry of prompts causes an LLM's context window to "clash," leading to the 30–40% drops in accuracy often seen under heavy organizational load.
When selecting a load testing tool, teams often start with open-source options for flexibility and cost, then move to enterprise or cloud-managed solutions for scale, collaboration, and integrations.
Some of the most popular and widely used load testing tools in 2026: Apache JMeter, k6, Locust, and Gatling on the open-source side, plus cloud-managed platforms that layer distributed execution, collaboration, and reporting on top of them.
Choose based on scripting language, scale needs, and integration. For teams already invested in Locust or seeking to combine load testing with chaos engineering in CI/CD pipelines, platforms like Harness Resilience Testing provide seamless native support to elevate your testing strategy.
As systems grow more distributed and user expectations continue to rise, load testing in 2026 is no longer optional; it's a continuous discipline. The right best practices, such as testing early and continuously in CI/CD, using production-like environments and realistic data, ramping load gradually, monitoring server-side metrics alongside client-side results, and setting explicit pass/fail criteria tied to your SLAs, ensure that your application is not just fast, but also resilient and reliable under real-world conditions.
Adopting these best practices helps you move beyond basic performance testing toward building truly resilient systems. In 2026, it’s not just about handling traffic, it’s about thriving under pressure.
Load testing turns unknowns into knowns and panic into process. It isn't a "nice-to-have"; it's essential for delivering the fast, reliable digital experiences that customers (and your bottom line) demand.
By following this guide, you'll identify issues early, optimize performance, and build systems that scale confidently.
Ship faster, break less, and stay resilient.




You're tagging Docker images with build numbers.
Build #47 is your latest production release on main. A developer pushes a hotfix to release-v2.1; that run becomes build #48. Another merges to develop: build #49. A week later someone asks, "What build number are we on for production?" You check the registry and see #47, #52, #58, #61 on main. The numbers in between? Scattered across feature branches that may never ship. Your build numbers have stopped telling a useful story.
That's the reality when your CI platform uses a single global counter. Every run, on every branch, increments the same number. For teams using GitFlow, trunk-based development, or any branching strategy, that means gaps, confusion, and versioning that doesn't match how you actually ship.
TL;DR: Harness CI now supports branch-scoped build sequence IDs via <+pipeline.branchSeqId>.
Each branch gets its own counter. No gaps. No confusion.
Most CI platforms give you one incrementing counter per pipeline. Push to main, push to develop, push to a feature branch: the same counter ticks up. So you get gaps on every branch, tags that don't map to releases, and no way to tell which number belongs to which line of development.

This is now built directly into Harness CI as a first-class capability.
Add <+pipeline.branchSeqId> where you need the number—for example, in a Docker build-and-push step:
tags:
  - <+pipeline.branchSeqId>
  - <+codebase.branch>-<+pipeline.branchSeqId>
  - latest
Trigger runs on main, then on develop, then on a feature branch. Each branch gets its own sequence: main might be 1, 2, 3… develop 1, 2, 3… feature/x 1, 2. Your tags become meaningful: main-42, develop-15, feature-auth-3. No more guessing which number belongs to which branch.
The expression is <+pipeline.branchSeqId>; check out the Harness variables documentation. Webhook triggers (push, PR, branch, release) and manual runs (with the branch taken from the codebase config) are supported. For tag-only or other runs without branch context, the expression returns null, so you can handle that in your pipeline if needed.

Branch and repo are taken from the trigger payload when possible (webhooks) or from the pipeline's codebase configuration (for example, manual runs). We normalize them so that the same repo and branch always map to the same logical key: branch names get refs/heads/ (or similar) stripped, and repo URLs are reduced to a canonical form (for example, github.com/org/repo). That way, whether you use https://..., git@..., or different casing, you get one counter per branch.
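A sketch of that normalization (the exact rules Harness applies may differ; this illustrates the idea):

```python
import re

def canonical_key(repo_url, branch_ref):
    """Normalize a repo URL + branch ref to one logical counter key.
    Illustrative only; the real normalization rules may differ."""
    # Strip ref prefixes like refs/heads/ or refs/remotes/origin/.
    branch = re.sub(r"^refs/(heads|remotes/[^/]+)/", "", branch_ref)
    # Reduce the repo URL to host/org/repo, case-insensitive, no scheme or .git.
    repo = repo_url.lower()
    repo = re.sub(r"^(https?://|git@)", "", repo)
    repo = repo.replace(":", "/").removesuffix(".git").rstrip("/")
    return f"{repo}#{branch}"

# https://, git@, and casing variants all map to the same counter:
print(canonical_key("https://GitHub.com/Org/Repo.git", "refs/heads/main"))
print(canonical_key("git@github.com:org/repo", "main"))
```

Both calls yield the same key, which is exactly why one counter serves the branch regardless of how the repo URL arrived in the trigger payload.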
The counter is stored and updated with an atomic increment. Parallel runs on the same branch still get distinct, sequential numbers. The value is attached to the run's metadata and exposed through the pipeline execution context so <+pipeline.branchSeqId> resolves correctly at runtime.
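The atomic-increment behavior can be illustrated with an in-memory counter guarded by a lock (a sketch of the described semantics, not the actual implementation, which persists counters server-side):

```python
import threading
from collections import defaultdict

class BranchSequences:
    """Per-key counters with an atomic increment."""
    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)

    def next_id(self, key):
        with self._lock:  # atomic read-modify-write
            self._counters[key] += 1
            return self._counters[key]

seq = BranchSequences()
ids = []
threads = [threading.Thread(target=lambda: ids.append(seq.next_id("repo#main")))
           for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Parallel runs on the same branch still get distinct, sequential numbers.
print(sorted(ids) == list(range(1, 51)))
```

Fifty concurrent "runs" produce exactly 1 through 50 with no duplicates or gaps, which is the guarantee the feature makes per branch.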
A few common uses:
- Image tags: <+pipeline.branchSeqId>, and optionally <+codebase.branch>-<+pipeline.branchSeqId>, for clear, branch-specific tags.
- Chart versioning: <+pipeline.branchSeqId> --app-version <+codebase.commitSha>, so the chart version tracks the build number and the app version tracks the commit.
- Environment builds: "<+pipeline.branchSeqId>", so production and staging each have a clear, branch-local build number.

For teams that need control or migration support, branch sequences are also manageable via API:
# List all branch sequences for a pipeline
GET /pipelines/{pipelineIdentifier}/branch-sequences
# Reset counter for a specific branch
DELETE /pipelines/{pipelineIdentifier}/branch-sequences/branch?branch=main&repoUrl=github.com/org/repo
# Set counter to a specific value (e.g., after a major release)
PUT /pipelines/{pipelineIdentifier}/branch-sequences/set?branch=main&repoUrl=github.com/org/repo&sequenceId=100

All of this is gated by the same feature flag, so only accounts that have adopted the feature use the APIs.
To get started, enable the feature flag CI_ENABLE_BRANCH_SEQUENCE_ID (Account Settings → Feature Flags, or reach out to the Harness team), then use <+pipeline.branchSeqId> in steps, tags, or env vars. If branch context isn't available, the expression returns null; design your pipeline to handle that (for example, skip tagging or use a fallback) for tag builds or edge cases.
Feature availability may vary by plan. Check with your Harness account or Harness Developer Hub for your setup.
This isn't just a Harness problem we solved—it's an industry gap. Here's how major CI platforms compare:
Most platforms treat build numbers as an afterthought. Harness CI treats them as a first-class versioning primitive. For teams migrating from Jenkins or Azure DevOps, the model will feel familiar. For teams on GitHub Actions, GitLab, or CircleCI, this fills a gap that previously required external services or custom scripts.
This is the first release of branch-scoped sequence IDs. The foundations are in place: per-branch counters, expression support, and APIs. We're not done.
We're listening. If you use this feature and hit rough edges, or have ideas for tag-scoped sequences, dashboard visibility, or trigger conditions, we want to hear about it. Share feedback.


--
Key Takeaways:
The Harness MCP server is an MCP-compatible interface that lets AI agents discover, query, and act on Harness resources across CI/CD, GitOps, Feature Flags, Cloud Cost Management, Security Testing, Resilience Testing, Internal Developer Portal, and more.
--
The first wave of MCP servers followed a natural pattern: take every API endpoint, wrap it in a tool definition, and expose it to the LLM. It was fast to build, easy to reason about, and it was exactly how we built the first Harness MCP server. That server taught us a lot: solid Go codebase, well-crafted tools, broad platform coverage across 30 toolsets. It also taught us where the one-tool-per-endpoint model hits a wall.
For platforms the size of Harness, spanning the entire SDLC, the pattern doesn't scale. When you expose one tool per API endpoint, you're asking the LLM to be a routing layer, forcing it to do something a switch statement does better. Every tool definition consumes context that could be spent on reasoning. At ~175 tools, that's ~26% of the LLM's context window before the developer even types a prompt.
So we iterated. The Harness MCP v2 redesign does the same work with 11 tools at ~1.6% context consumption. The answer isn't fewer features, it's a different architecture: a registry-based dispatch model where the LLM reasons about what to do, and the server handles how to do it.
When an MCP client connects to a server, it loads every tool definition into the LLM's context window. Every name, description, parameter schema, and annotation. For the first Harness server at ~130+ active tools, here's what that costs:

That's the core insight: the first server uses ~26% of context on tool definitions before any work begins. The v2 uses ~1.6%.
This isn't a theoretical concern. Research on LLM behavior in large context windows, including Liu et al.'s "Lost in the Middle" findings, shows that models struggle to use information placed deep within long contexts. As Ryan Spletzer recently wrote, dead context doesn't sit inertly: "It dilutes the signal. The model's attention is spread across everything in the window, so the more irrelevant context you pack in, the less weight the relevant context carries."
Anthropic's own engineering team has documented this trade-off: direct tool calls consume context for each definition and result, and agents scale better when the tool surface area is deliberately constrained.
The problem compounds in real-world developer environments. If you're running Cursor or Claude Code with a Playwright MCP, a GitHub MCP, and the Harness MCP, those tool definitions stack. EclipseSource's analysis shows that a standard set of MCP servers can eat 20% of the context window before you even type a prompt. The recommendation: stay below 40% total context utilization. Any MCP server with 100+ tools, ours included, would consume more than half that budget on its own.
The context window tax isn't unique to Harness: it's an industry-wide problem. Here's how the v2 server compares to popular MCP servers in the wild:

Lunar.dev research: "5 MCP servers, 30 tools each → 150 total tools injected. Average tool description: 200–500 tokens. Total overhead: 30,000–60,000 tokens. Just in tool metadata." MCP server v2 at ~3,150 tokens would represent just 5–10% of a typical multi-server setup's overhead.
Real-world Claude Code user: A developer on Reddit r/ClaudeCode with Playwright, Context7, Azure, Postgres, Zen, and Firecrawl MCPs reported 83.3K tokens (41.6% of 200K) consumed by MCP tools immediately after /clear. That's before a single prompt.
Anthropic's code execution findings: Anthropic's engineering team reported that a workflow consuming 150,000 tokens was reduced to ~2,000 tokens (a 98.7% reduction) by switching from direct tool calls to code-based tool invocation. The principle is clear: fewer, smarter tools beat more, narrower ones.
MCPAgentBench: An academic benchmark found that "nearly all evaluated models exhibit a decline of over 10 points in task efficiency when tool selection complexity increases." Models overwhelmed with tools prioritize task resolution over execution efficiency. They get the job done, but waste tokens doing it.
Cursor enforces an 80-tool cap, OpenAI limits to 128 tools, and Claude supports up to ~120. The v2 server's 11 tools leave massive headroom to run Harness alongside other MCP servers without hitting these limits.
Consider a concrete example: a developer running Cursor with Playwright (21 tools), GitHub MCP (~40 tools), and the old Harness MCP (~175 tools) would hit ~236 tools, well past Cursor's 80-tool cap. With v2 Harness (11 tools), the same stack is 72 tools, comfortably under the limit.
With Claude Code, the same old stack would burn ~76,400 tokens (~38%) on tool definitions alone. With v2, it drops to ~27,550 tokens (~14%), freeing ~48,850 tokens for actual reasoning and conversation.
The MCP ecosystem is in the middle of a reckoning. Scalekit ran 75 benchmark runs comparing CLI and MCP for identical GitHub tasks on Claude Sonnet 4, and CLI won on every efficiency metric: 10–32x cheaper, 100% reliable vs MCP’s 72%. For a simple “what language is this repo?” query, CLI used 1,365 tokens. MCP used 44,026 — almost entirely from schema injection of 43 tool definitions the agent never touched.
The Playwright team shipped the same verdict in hardware. Their new CLI tool saves browser state to disk instead of flooding context. In BetterStack’s benchmarks, CLI used ~150 tokens per interaction vs MCP’s ~7,400+ of accumulated page state. CircleCI found CLI completed browser tasks with 33% better token efficiency and a 77 vs 60 task completion score.
The CLI camp’s argument is real: schema bloat kills performance. But their diagnosis points at the wrong layer. The problem isn’t MCP. It’s naive MCP server design.
CLI wins when the agent already knows the tool. gh, kubectl, terraform: these have extensive training data. The agent composes commands from memory, pays zero schema overhead, and gets terse, predictable output. Scalekit found that adding an 800-token “skills document” to CLI reduced tool calls and latency by a third.
CLI also wins on composition. Piping grep into jq into xargs chains operations in a single tool call. An MCP agent doing the same work makes N round-trips through the LLM, each one burning context.
But CLI’s advantages dissolve the moment you cross three boundaries:
CLI works when the agent knows the command. For a platform like Harness, with 122+ resource types across CI/CD, GitOps, FinOps, security, chaos, and IDP, the agent can’t know the API surface from training data alone. MCP’s harness_describe tool lets the agent discover capabilities at runtime. CLI would require the agent to guess curl commands against undocumented APIs.
As Scalekit themselves concluded: “The question isn’t CLI or MCP. It’s who is your agent acting for?” CLI auth gives the agent ambient credentials: your token. For multi-tenant, multi-user environments (which is where Harness operates), MCP provides per-user OAuth, explicit tool boundaries, and structured audit trails.
CLI agents can run arbitrary shell commands. An MCP server constrains the agent to declared tools with typed inputs. The v2 server’s elicitation-based confirmation flows, fail-closed deletes, and read-only mode are protocol-level safety guarantees that CLI can’t replicate.
The CLI vs MCP debate is really about schema bloat and naive tool design. The v2 Harness MCP server eliminates the arguments against MCP without losing the arguments for it:
Schema bloat? 11 tools at ~3,150 tokens. That’s less than a single CLI help output for a complex tool. Cursor’s 80-tool cap? We use 11. The 44,026-token GitHub MCP problem? We’re 14x leaner.
Round-trip overhead? The registry-based dispatch means the agent makes one tool call to harness_diagnose and gets back a complete execution analysis — pipeline structure, stage/step breakdown, timing, logs, and root cause. A CLI agent would need to chain 4–5 API calls to assemble the same picture.
Discovery? harness_describe is a zero-API-call local schema lookup. The agent discovers 125+ resource types without a single network request. CLI would require a man page the agent has never seen.
Composition? Skills + prompt templates encode multi-step workflows (build-deploy-app, debug-pipeline-failure) as server-side orchestration. The agent reasons about what to do; the server handles how to chain it. Same efficiency as a CLI pipe, with protocol-level safety.
The real lesson from the benchmarks: MCP servers with 43+ tools and no architecture for context efficiency will lose to CLI on cost metrics. But a well-designed MCP server with 11 tools, a registry, and a skills layer outperforms both naive MCP and naive CLI — and provides authorization, safety, and discoverability that CLI architecturally cannot.
We stopped designing for API parity and started designing for agent usability.
The v2 server is built around a registry-based dispatch model. Instead of one tool per endpoint, we expose 11 intentionally generic verbs. The intelligence lives in the registry: a declarative data structure that maps resource types to API operations.

When an agent calls harness_list(resource_type="pipeline"), the server looks up pipeline in the registry, resolves the API path, injects scope parameters (account, org, project), makes the HTTP call, extracts the relevant response data, and appends a deep link to the Harness UI. The agent never needs to know the underlying API structure.
Each registry entry is a declarative ResourceDefinition:
{
  resourceType: "pipeline",
  displayName: "Pipeline",
  toolset: "pipelines",
  scope: "project",
  identifierFields: ["pipeline_id"],
  operations: {
    list: {
      method: "GET",
      path: "/pipeline/api/pipelines/list",
      queryParams: { search_term, page, size },
      responseExtractor: (raw) => raw.content
    },
    get: {
      method: "GET",
      path: "/pipeline/api/pipelines/{pipeline_id}",
      responseExtractor: (raw) => raw.data
    }
  }
}
Adding support for a new Harness module requires adding one declarative object to the registry. No new tool definitions. No changes to MCP tool schemas. The LLM's tool vocabulary stays constant as the platform grows.
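To make the dispatch flow concrete, here is a toy dispatcher over a registry of the same shape (Python for illustration; the paths and field names mirror the example entry, and the real server does far more, including auth, response extraction, and deep-link generation):

```python
# Minimal registry: resource type -> verb -> API operation spec.
REGISTRY = {
    "pipeline": {
        "list": {"method": "GET", "path": "/pipeline/api/pipelines/list"},
        "get":  {"method": "GET", "path": "/pipeline/api/pipelines/{pipeline_id}"},
    },
}

def dispatch(verb, resource_type, scope, **params):
    """One generic tool call -> registry lookup -> concrete API request spec."""
    op = REGISTRY[resource_type][verb]
    path = op["path"].format(**params)  # inject resource identifiers
    return {"method": op["method"], "path": path, "query": dict(scope)}

# The agent calls a generic verb; the server resolves the concrete endpoint.
req = dispatch("get", "pipeline",
               scope={"accountId": "acc1", "orgId": "default", "projectId": "demo"},
               pipeline_id="build_and_push")
print(req["path"])  # /pipeline/api/pipelines/build_and_push
```

Supporting a new resource type means adding one dictionary entry; the tool surface the LLM sees, a handful of generic verbs, never changes.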
Today, the registry covers 125+ resource types across 30 toolsets, spanning the full Harness platform:
The architecture wasn't designed in a vacuum. We built it specifically for the environments developers actually use.
Cursor and Windsurf connect via stdio transport — the server runs as a local process alongside the IDE. With 11 tools instead of 130+, the Cursor agent has a minimal, clear menu. It doesn't waste reasoning cycles on tool selection or get confused by 40 CCM-specific tools when the developer is debugging a pipeline failure.
For teams that only use specific Harness modules, HARNESS_TOOLSETS lets you filter at startup:
{
  "mcpServers": {
    "harness": {
      "command": "npx",
      "args": ["-y", "harness-mcp-v2@latest"],
      "env": {
        "HARNESS_API_KEY": "pat.xxx.yyy.zzz",
        "HARNESS_TOOLSETS": "pipelines,services,connectors"
      }
    }
  }
}
The agent only sees resource types from the enabled toolsets. The rest don't exist as far as the LLM is concerned.
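The filtering itself is a one-liner over the registry. A sketch, assuming registry entries carry a `toolset` field as shown earlier (the `enabledEntries` helper is hypothetical):

```typescript
// Sketch of startup-time toolset filtering driven by HARNESS_TOOLSETS.
type Entry = { resourceType: string; toolset: string };

function enabledEntries(all: Entry[], toolsetsEnv?: string): Entry[] {
  if (!toolsetsEnv) return all; // no filter set: everything is visible
  const enabled = new Set(toolsetsEnv.split(",").map((s) => s.trim()));
  return all.filter((e) => enabled.has(e.toolset));
}
```

Because filtering happens before tool registration, the disabled resource types never enter the model's context at all.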
Claude Code excels at multi-step workflows. We leaned into that with 26 prompt templates across four categories.
Each prompt template encodes a multi-step workflow the agent can execute. debug-pipeline-failure doesn't just fetch an execution — it calls harness_diagnose, follows chained failures, and produces a root cause analysis with actionable fixes.
The v2 server also supports multi-project workflows without hardcoded environment variables. An agent can dynamically discover the account structure, then scope subsequent calls with org_id and project_id parameters. No configuration changes needed.
Every tool accepts an optional url parameter. Paste a Harness UI URL, a pipeline page, an execution log, a dashboard, and the server automatically extracts the account, org, project, and resource identifiers. The agent gets context without the developer having to specify it manually.
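URL-based context extraction reduces to pattern matching on the path. A minimal sketch, assuming Harness UI URLs of the shape `.../account/<acct>/.../orgs/<org>/projects/<project>/...` (the real server handles more URL variants; `extractScope` is an illustrative name):

```typescript
// Illustrative parser that pulls scope identifiers out of a Harness UI URL.
function extractScope(url: string): {
  accountId?: string;
  orgId?: string;
  projectId?: string;
} {
  const pick = (re: RegExp) => url.match(re)?.[1];
  return {
    accountId: pick(/\/account\/([^/]+)/),
    orgId: pick(/\/orgs\/([^/]+)/),
    projectId: pick(/\/projects\/([^/]+)/),
  };
}
```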
Reducing tool count solves the context efficiency problem. But developers don't just need fewer tools — they need tools that know how to chain together into real workflows. That's where Harness Skills come in.
The v2 server ships with a companion skills layer (github.com/thisrohangupta/harness-skills) that turns raw MCP tool access into guided, multi-step workflows. Skills are IDE-native agent instructions that teach the AI how to use the MCP server effectively — without the developer having to explain Harness concepts or orchestration patterns.
Skills operate at three levels:
Every IDE gets a base instruction file, loaded automatically when the agent starts:
These files teach the agent: what the 11 tools do, how Harness scoping works (account → org → project), dependency ordering (always verify referenced resources exist before creating dependents), and how to extract context from Harness UI URLs.
The 26 MCP prompt templates registered directly in the server. Any MCP client can invoke them. They encode multi-step workflows with phase gates, e.g., build-deploy-app structures a 4-phase workflow (clone → scan → CI pipeline → deploy) with explicit "do not proceed until this step is done" checkpoints.
Specialized SKILL.md files that function as slash commands in the IDE. Each skill includes YAML frontmatter (trigger phrases, metadata), phased instructions, worked examples, performance notes, and troubleshooting steps.
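A skill file might look like the following. This is a hypothetical example to show the shape; the field names, trigger phrases, and phases are illustrative, not copied from the harness-skills repository:

```markdown
---
name: deploy-service
description: Deploy an existing service through its Harness pipeline
triggers: ["deploy", "ship", "release"]
---

## Phase 1 — Verify prerequisites
1. Call `harness_list(resource_type="service")` and confirm the service exists.
2. Verify the target environment and infrastructure exist before proceeding.

## Phase 2 — Execute
1. Call `harness_execute` with the resolved pipeline identifier.
2. Poll the execution until it reaches a terminal state.

## Troubleshooting
- If the pipeline is not found, re-check org/project scope before retrying.
```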
Without skills, a developer says "deploy my Node.js app" and the agent has to figure out the right Harness concepts, the correct ordering, and the proper API calls from scratch. With skills, the flow is:
harness_list / harness_create / harness_execute calls

The skills layer delivers three measurable improvements:
Without skills, the agent typically needs 3–5 exploratory tool calls to understand Harness's resource model before starting real work. Skills encode this knowledge upfront — the agent knows to check for existing connectors before creating a pipeline, to verify environments exist before deploying, and to use harness_describe for schema discovery instead of trial-and-error.
Harness resources have strict dependency chains (connector → secret → service → environment → infrastructure → pipeline → trigger). Skills encode the 7-step "Deploy New Service" and 8-step "New Project Onboarding" workflows as ordered sequences. The agent doesn't discover dependencies through failures, it follows the prescribed order.
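Encoding that chain is deliberately simple. A sketch, assuming an ordered list is enough (the `creationOrder` helper is hypothetical, not the actual skills implementation):

```typescript
// The dependency chain, ordered from "create first" to "create last".
const DEPENDENCY_ORDER = [
  "connector", "secret", "service", "environment",
  "infrastructure", "pipeline", "trigger",
] as const;

// Sort a requested set of resources into a safe creation order.
function creationOrder(requested: string[]): string[] {
  const rank = new Map(
    DEPENDENCY_ORDER.map((r, i) => [r, i] as [string, number]),
  );
  return [...requested].sort(
    (a, b) => (rank.get(a) ?? Infinity) - (rank.get(b) ?? Infinity),
  );
}
```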
Each failed API call and retry burns tokens. Skills eliminate the most common failure modes (wrong scope, missing dependencies, incorrect parameter formats) by teaching the agent the patterns before execution. The combination of 11 tools (minimal context overhead) plus skills (minimal wasted calls) means more of the context window is available for the developer's actual task.
The first Harness MCP server (harness/mcp-server) pioneered the IDE-native pattern with a review-mcp-tool command that works across Cursor, Claude Code, and Windsurf via symlinked definitions:
One canonical definition in .harness/commands/, symlinked to all three. Update once, propagate everywhere.
The v2 skills layer extends this pattern from developer-tool commands to full DevOps workflows, the same "define once, deploy to every IDE" architecture, applied to pipeline creation, deployment debugging, cost analysis, and security review.
MCP servers that can create, update, and delete resources need safety guardrails. We built them in from the start.
Human-in-the-loop confirmation: All write operations use MCP elicitation to request explicit user confirmation before executing. The agent presents what it intends to do; the developer approves or rejects.
Fail-closed destructive operations: harness_delete is blocked entirely if the MCP client doesn't support elicitation. No silent deletions.
Read-only mode: Set HARNESS_READ_ONLY=true for shared environments, demos, or when you want agents to observe but not act.
Secrets safety: The secret resource type exposes metadata (name, type, org, project) but never the secret value itself.
Rate limiting and retries: Configurable rate limits (default: 10 req/s), automatic retries with backoff for transient failures, and bounded pagination to prevent runaway list operations.
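A token bucket is the classic way to implement such a limit. The sketch below is illustrative (the server's actual limiter may differ); it refills continuously and admits a request only when a whole token is available:

```typescript
// Sketch of a token-bucket rate limiter (e.g. the default 10 req/s).
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(private ratePerSec: number, now = Date.now()) {
    this.tokens = ratePerSec; // start full
    this.lastRefill = now;
  }

  // Returns true if a request may proceed at time `now` (ms).
  tryAcquire(now = Date.now()): boolean {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.ratePerSec,
      this.tokens + elapsedSec * this.ratePerSec,
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```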
The v2 server supports two transports:
For team deployments, the HTTP transport is compatible with MCP gateways like Portkey, LiteLLM, and Envoy-based proxies, enabling shared control planes with centralized auth, observability, and policy enforcement.
# Local (Cursor, Claude Code)
npx harness-mcp-v2@latest
# Remote (team deployment)
npx harness-mcp-v2@latest http --port 3000
# Docker
docker run -e HARNESS_API_KEY=pat.xxx.yyy.zzz harness-mcp-v2
The shift from 130+ tools to 11 isn't about simplification for its own sake. It's about recognizing that the best MCP servers are capability-oriented agent interfaces, not API mirrors.
Building the first Harness MCP server taught us the same lesson the broader ecosystem is learning: when you expose one tool per API endpoint, you're asking the LLM to be a routing layer. You're consuming context on definitions that could be used for reasoning. And you're fighting against the LLM's actual strengths (reasoning, planning, and multi-step problem solving) by forcing it to do something a switch statement does better. That first server made the cost concrete. The v2 is our answer.
The registry pattern inverts this. The tool vocabulary is stable: 11 verbs today, 11 verbs when Harness ships 50 more resource types. The registry is extensible. The skills layer is composable. The LLM reasons about what to do, and the server handles how to do it. That's not just an efficiency win — it's the correct division of labor between an LLM and a server.
This is the pattern we think more MCP servers should adopt, especially platforms with broad API surfaces. The MCP specification itself is built on the idea that servers expose capabilities, not endpoints. We took that literally.
The efficiency gains from the v2 architecture translate directly into concrete, time-saving use cases for developers operating within their IDEs. The combination of a minimal tool surface (11 tools), deep resource knowledge (125+ resource types), and pre-encoded workflows (Harness Skills) allows the agent to handle complex DevOps tasks with minimal guidance.
Some other use cases:
Debug a Failed CI Pipeline: Get root cause and logs for a pipeline run.
Onboard New Service: Create a Service, Environment, Infrastructure, and initial Connector.
Review Cloud Cost Anomaly: Investigate a sudden spike in cloud spend.
Check Compliance Status: Verify a service's SBOM compliance against OPA policies.
Deploy App to Prod: Execute a canary deployment pipeline.
npx harness-mcp-v2@latest
Configure with your Harness PAT (account ID is auto-extracted):
HARNESS_API_KEY=pat.<accountId>.<tokenId>.<secret>
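The auto-extraction works because the PAT format embeds the account ID. A sketch of the idea (the `accountIdFromPat` helper is illustrative, not the server's actual code):

```typescript
// A Harness PAT has the shape pat.<accountId>.<tokenId>.<secret>,
// so the account ID can be read directly from the key.
function accountIdFromPat(apiKey: string): string {
  const parts = apiKey.split(".");
  if (parts.length !== 4 || parts[0] !== "pat") {
    throw new Error("not a Harness PAT");
  }
  return parts[1];
}
```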
Full source: github.com/thisrohangupta/harness-mcp-v2
Official Harness MCP Server: github.com/harness/mcp-server
---
The Harness MCP server is an MCP-compatible server that lets AI agents interact with Harness resources using a small set of generic tools.
Each exposed tool adds metadata to the model context. A smaller tool surface leaves more room for reasoning and task execution.
Instead of exposing one tool per API endpoint, it uses 11 generic tools plus a registry that maps resource types to the correct API operations.
The post mentions Cursor, Claude Code, Claude Desktop, Windsurf, Gemini CLI, and other MCP-compatible clients.
The design includes write confirmations, fail-closed delete behavior, read-only mode, and controls for retries, rate limiting, and deployment transport.


This is part 1 of a five-part series on building production-grade AI engineering systems.
Across this series, we will cover:
Most teams experimenting with AI coding agents focus on prompts.
That is the wrong starting point.
Before you optimize how an agent thinks, you must standardize what it sees.
AI agents do not primarily fail because of reasoning limits. They fail because of environmental ambiguity. They are dropped into repositories designed exclusively for humans and expected to infer structure, conventions, workflows, and constraints from scattered documentation.
If AI agents are contributors, then the repository itself must become agent-native.
The foundational step is introducing a standardized instruction layer that every agent can read.
That layer is AGENTS.md.
The Real Problem: Context Silos
Every coding agent needs instructions. Where those instructions live depends on the tool.
One IDE reads from a hidden rules directory.
Another expects a specific markdown file.
Another uses proprietary configuration.
This fragmentation creates three systemic problems.
1. Tool-dependent prompt locations
Instructions are locked into IDE-specific paths. Change tools and you lose institutional knowledge.
2. Tribal knowledge never gets committed
When a developer discovers the right way to guide an agent through a complex module, that guidance often lives in chat history. It never reaches version control. It never becomes part of the repository’s operational contract.
3. Inconsistent agent behavior
Two engineers working on the same codebase but using different agents receive different outputs because the instruction surfaces are different.
The repository stops being the single source of truth.
For human collaboration, we solved this decades ago with READMEs, contribution guides, and ownership files. For AI collaboration, we are only beginning to standardize.
What AGENTS.md Is
AGENTS.md is a simple, open, tool-agnostic format for providing coding agents with project-specific instructions. It is now part of the broader open agentic ecosystem under the Agentic AI Foundation, with broad industry adoption.
It is not a replacement for README.md. It is a complement.
Design principle:
Humans need quick starts, architecture summaries, and contribution policies.
Agents need deterministic build commands, exact test execution steps, linter requirements, directory boundaries, prohibited patterns, and explicit assumptions.
Separating these concerns provides:
Several major open source repositories have already adopted AGENTS.md. The pattern is spreading because it addresses a real structural gap.
Recent evaluations have also shown that explicit repository-level agent instructions outperform loosely defined “skills” systems in practical coding scenarios. The implication is clear. Context must be explicit, not implied.
A Real Example: OpenAI’s Agents SDK
A practical example of this pattern can be seen in the OpenAI Agents Python SDK repository.
The project contains a root-level AGENTS.md file that defines operational instructions for contributors and AI agents working on the codebase. The full file is available in the repository on GitHub.
Instead of leaving workflows implicit, the repository encodes them directly into agent-readable instructions. For example, the file requires contributors to run verification checks before completing changes:
Run `$code-change-verification` before marking work complete.

It also explicitly scopes where those rules apply, such as changes to core source code, tests, examples, or documentation within the repository.
Rather than expecting an agent to infer these processes from scattered documentation, the project defines them as explicit instructions inside the repository itself.
This is the core idea behind AGENTS.md.
Operational guidance that would normally live in prompts, chat history, or internal knowledge becomes version-controlled infrastructure.
Designing an Effective Root AGENTS.md
A root AGENTS.md should be concise. Under 300 lines is a good constraint. It should be structured, imperative, and operational.
A practical structure includes four required sections.
This section establishes the mental model.
Include:
Agents are pattern matchers. The clearer the structural map, the fewer incorrect assumptions they make.
This section must be precise.
Include:
Avoid vague language. Replace “run tests” with explicit commands.
Agents execute what they are told. Precision reduces drift.
This section defines conventions.
Rather than bloating AGENTS.md, reference a separate coding standards document for:
The root file should stay focused while linking to deeper guidance.
This is where most teams underinvest.
Document:
Agents tend to repeat statistically common patterns. Your codebase may intentionally diverge from those patterns. This section is where you enforce that divergence.
Think of this as defensive programming for AI collaboration.
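Putting the four sections together, a root AGENTS.md might look like this. Every project name, command, and path below is a hypothetical example, not a prescription:

```markdown
# AGENTS.md

## Project overview
Monorepo for the billing platform. Services live in `services/`,
shared libraries in `libs/`. Node 20 + TypeScript throughout.

## Build and test
- Install: `pnpm install`
- Build: `pnpm build`
- Test (required before any commit): `pnpm test`
- Lint (zero warnings allowed): `pnpm lint --max-warnings 0`

## Coding standards
Follow `docs/CODING_STANDARDS.md`. Do not add new dependencies
without updating `docs/DEPENDENCIES.md`.

## Prohibited patterns
- Never write to the database directly; use the repository layer in `libs/db`.
- Never log request bodies; they may contain PII.
```

Note the register: imperative, operational, and short enough to survive intact in an agent's context window.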
Hierarchical AGENTS.md: Scaling Context Correctly
Large repositories require scoped context.
A single root file cannot encode all module-specific constraints without becoming noisy. The solution is hierarchical AGENTS.md files.
Structure example:
root/
  AGENTS.md
  module-a/
    AGENTS.md
  module-b/
    AGENTS.md
    sub-feature/
      AGENTS.md

Agents automatically read nested AGENTS.md files when operating inside those directories. Context scales from general to specific.
Root defines global conventions.
Module-level files define local invariants.
Feature-level files encode edge-case constraints.
This reduces irrelevant context and increases precision.
It also mirrors how humans reason about codebases.
Compatibility Across Tools
A standard file location matters.
Some agents natively read AGENTS.md. Others require simple compatibility mechanisms such as symlinks that mirror AGENTS.md into tool-specific filenames.
The key idea is a single source of truth.
Do not maintain multiple divergent instruction files. Normalize on AGENTS.md and bridge outward if needed.
The goal is repository-level portability. Change tools without losing institutional knowledge.
Best Practices for Agent Instructions
To make AGENTS.md effective, follow these constraints.
Write imperatively.
Use direct commands. Avoid narrative descriptions.
Avoid redundancy.
Do not duplicate README content. Reference it.
Keep it operational.
Focus on what the agent must do, not why the project exists.
Update it as the code evolves.
If the build process changes, AGENTS.md must change.
Treat violations as signal.
If agents consistently ignore documented rules, either the instruction is unclear or the file is too long and context is being truncated. Reset sessions and re-anchor.
AGENTS.md is not static documentation. It is part of the execution surface.
Ownership and Governance
If agents are contributors, then their instruction layer requires ownership.
Each module-level AGENTS.md should be maintained by the same engineers responsible for that module. Changes to these files should follow the same review rigor as code changes.
Instruction drift is as dangerous as code drift.
Version-controlled agent guidance becomes part of your engineering contract.
Why Teams Are Adopting AGENTS.md
Repositories across the industry have begun implementing AGENTS.md as a first-class artifact. Large infrastructure projects, developer tools, SDKs, and platform teams are standardizing on this pattern.
The motivation is consistent:
AGENTS.md transforms prompt engineering from a personal habit into a shared, reviewable, versioned discipline.
Vercel published evaluation results showing that repository-level AGENTS.md context outperformed tool-specific skills in agent benchmarks.
Why This Matters Now
AI agents are rapidly becoming embedded in daily development workflows.
Without a standardized instruction layer:
The repository must become the stable contract between humans and machines.
AGENTS.md is the first structural step toward that contract.
It shifts agent collaboration from ad hoc prompting to engineered context.
Foundation Before Optimization
In the next post, we will examine a different failure mode.
Even with a perfectly structured AGENTS.md, long AI sessions degrade. Context accumulates. Signal dilutes. Hallucinations increase. Performance drops as token counts rise.
This phenomenon is often invisible until it causes subtle architectural damage.
Part 2 will focus on defeating context rot and enforcing session discipline using structured planning, checkpoints, and meta-prompting.
Before you scale orchestration.
Before you add subagents.
Before you optimize cost across multiple model providers.
You must first stabilize the environment.
An agent-native repository is the foundation.
Everything else builds on top of it.


An API failure is any response that doesn't conform to the behavior the client expects when invoking the system. One example is a client requesting a list of users from an API and receiving an empty object ({}) instead. A successful response has a status code in the 200 range; an unsuccessful response carries either an HTTP error status code or, in some clients, a status of 0, which indicates the request never received a response at all.
An API will raise an exception if it can’t process a client request correctly. The following are the common error codes and their meanings:
An API failure can happen because of issues with the endpoints like network connectivity, latency, and load balancing issues. The examples below may give you a good understanding of what causes an API failure.
Some APIs are better left locked down, available only to clients using an approved key. However, when you don't set up the correct permissions for users, you can impede the application's basic functionality. If you're using an external API, like Facebook, Twitter, or even Google Analytics, make sure you're granting your users permission to access the data they need. Also, keep on top of any newly added features that can increase security risks.
If you’re leveraging external APIs requiring extra configuration, get the correct API key so the app has the proper permissions. Also, provide your clients with API keys relevant to their authorization levels. Thus, your users will have the correct permissions and will seamlessly access your application.
We've all seen it happen a million times: an API that was supposed to be scoped to consenting users turns out to be exposed to everyone. That is often reasonably benign, but when credentials leak, things can get ugly fast, and companies lose brand trust. The biggest problem here is admins retaining unsecured access to sensitive data.
Using a secure key management system that gates key visibility behind an explicit "View Keys" permission will help mitigate this risk. For example, you could use AWS Key Management Service (AWS KMS) to create and manage your encryption keys. If you can't centralize key management, then at the very least protect keys behind a strong master password, and hand them out only when needed.
Untrusted tokens and session variables can cause problems for how a website functions, causing timing issues with page loads and login calls or even creating a denial of service, which can harm the end-user experience and your brand.
The best way to secure sensitive data is token authentication, which encodes user data into the token itself along with a timestamp. You can then enforce expiry: whenever you issue tokens, they expire after a set amount of time, or are valid for API requests only. Session variables are usually derived from your authentication keys and should be handled the same way as your privileged keys, with encryption. And keep your keys out of the hands of anyone who doesn't need them.
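The expiry check itself is simple. A minimal sketch of validating a token's embedded timestamp against a time-to-live (illustrative only; production systems should use a signed format such as JWT rather than a bare struct):

```typescript
// Sketch: a token carrying an issued-at timestamp, validated against a TTL.
type Token = { userId: string; issuedAt: number };

function isTokenValid(token: Token, ttlMs: number, now = Date.now()): boolean {
  // Reject anything older than the allowed time-to-live.
  return now - token.issuedAt <= ttlMs;
}
```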
If you’re using an API to power a website, you must upload new data in real time or save it to a cache for later use. When you set an expiry time for an API and fail to update, you make it unavailable. When a user or application tries to access it after the expiry, they get a 404 or 500 error.
You should use a middle ground option—a proxy API. This will allow you to cache your data before you make it available and only allow access to the correct bits of the APIs as needed. You should also schedule tasks that run daily to import updated data and bring it into your system.
This one isn't necessarily a mistake, but it happens when developers aren't careful about how they name things or use an improper URL structure for their API endpoints. When the URL structure is too complex or contains invalid characters, you will get errors and failures. Consider this example of bad URL structure: "http://example.com/api/v1?mode=get". The query string is the wrong place to specify the request type; HTTP already expresses that through the method (GET, POST, and so on). Since GET is the default, a better URL is simply "http://example.com/api/v1".
Remove unsafe characters such as angle brackets (<>) from your URLs; they are reserved as delimiters and must not appear unencoded. Also, design URLs to be readable for users. For example, "https://example.com/users/name" tells users they're querying user names, unlike "https://example.com/usr/nm". And avoid spaces anywhere in a URL, including in the query string after the "?"; unencoded spaces are invalid and lead to misparsed requests.
This happens when trying to build multiple ways of accessing multiple applications. You do this by relying on generic endpoints instead of target audiences and specific applications. Creating a lot of different paths for the same data results in non-intuitive routes.
There are several ways to go about this, but for most, you want to use a network proxy system that can handle the different data access methods and bring it all into one spot. This will help minimize potential issues with your APIs routes and help with user confusion and brand damage.
This can happen when organizations are not properly securing their public IP addresses, or there is no solid monitoring process. This exposes your assets by providing easy access to anyone. Exposed IPs make your application vulnerable to DDoS attacks and other forms of abuse or phishing.
Make sure you properly manage your IP addresses and have a solid monitoring system. Block traffic on any ports and protocols your services don't use, and enforce strict firewall rules on your network. You should only allow service access through secure transport methods like TLS.
API errors are a plague on the internet. Sometimes they come as very poor performance that can produce long response times and bring down APIs, or they can be network-related and cause unavailable services. They’re often caused by problems such as inconsistent resource access errors, neglect in proper authentication checks, faulty authentication data validation on endpoints, failure to read return codes from an endpoint, etc. Once organizations recognize what causes API failures and how to mitigate them, they seek web application and API protection (WAAP) platforms to address the security gaps. Harness WAAP by Traceable helps you analyze and protect your application from risk and thus prevent failures.
Harness WAAP is the industry’s leading API security platform that identifies APIs, evaluates API risk posture, stops API attacks, and provides deep analytics for threat hunting and forensic research. With visual depictions of API paths at the core of its technology, its platform applies the power of distributed tracing and machine learning models for API security across the entire software development lifecycle. Book a demo today.


Argo CD is a Kubernetes-native continuous delivery controller that follows GitOps principles: Git is the source of truth, and Argo CD continuously reconciles what’s running in your cluster with what’s declared in Git.
That pull-based reconciliation loop is the real shift. Instead of pipelines pushing manifests into clusters, Argo CD runs inside the cluster and pulls the desired state from Git (or Helm registries) and syncs it to the cluster. The result is an auditable deployment model where drift is visible and rollbacks are often as simple as reverting a Git commit.
For enterprise teams, Argo CD becomes a shared platform infrastructure. And that changes what “install” means. Once Argo CD is a shared control plane, availability, access control, and upgrade safety matter as much as basic deployment correctness because failures impact every team relying on GitOps.
A basic install is “pods are running.” An enterprise install is:
Argo CD can be installed in two ways: as a "core" (headless) install for cluster admins who don't need the UI/API server, or as a multi-tenant install. Multi-tenant is the default choice for enterprise platform teams running GitOps across many application teams.
Before you start your Argo CD install, make sure the basics are in place. You can brute-force a proof of concept with broad permissions and port-forwarding. But if you’re building a shared service, doing a bit of prep up front saves weeks of rework.
If your team is in a regulated environment, align on these early:
Argo CD install choices aren’t about “works vs doesn’t work.” They’re about how you want to operate Argo CD a year from now.
Helm (recommended for enterprise):
Upstream manifests:
If your Argo CD instance is shared across teams, Helm usually wins because version pinning, values-driven configuration, and repeatable upgrades are easier to audit, roll back, and operate safely over time.
Enterprises often land in one of these models:
As a rule: start with one shared instance and use guardrails (RBAC + AppProjects) to keep teams apart. Add instances only when you really need to (for example, because of regulatory separation, disconnected environments, or blast-radius requirements).
When Argo CD is a shared dependency, high availability (HA) is important. If teams depend on Argo CD to deploy, a single-replica argocd-server is both a bottleneck and a single point of failure: one outage pages every team at once.
There are three common access patterns:
For most enterprise teams, the sweet spot is Ingress + TLS + SSO, with internal-only access unless your operating model demands external access.
If you’re building Argo CD as a shared service, Helm gives you the cleanest path to versioned, repeatable installs.
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
# Optional: list available versions so you can pin one
helm search repo argo/argo-cd --versions | head -n 10
In enterprise environments, “latest” isn’t a strategy. Pin a chart version so you can reproduce your install and upgrade intentionally.
kubectl create namespace argocd
Keeping Argo CD isolated in its own namespace simplifies RBAC, backup scope, and day-2 operations.
Start by pulling the chart’s defaults:
helm show values argo/argo-cd > values.yaml
Then make the minimum changes needed to match your access model. Many tutorials demonstrate NodePort because it’s easy, but most enterprises should standardize on Ingress + TLS.
Here’s a practical starting point (adjust hostnames, ingress class, and TLS secret to match your environment):
# values.yaml (example starter)
global:
  domain: argocd.example.internal

configs:
  params:
    # Common when TLS is terminated at an ingress or load balancer.
    server.insecure: "true"

server:
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - argocd.example.internal
    tls:
      - secretName: argocd-tls
        hosts:
          - argocd.example.internal

# Baseline resource requests to reduce noisy-neighbor issues.
controller:
  resources:
    requests:
      cpu: 200m
      memory: 512Mi

repoServer:
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
This example focuses on access configuration and baseline resource isolation. In most enterprise environments, teams also explicitly manage RBAC policies, NetworkPolicies, and Redis high-availability decisions as part of the Argo CD platform configuration.
If your clusters can’t pull from public registries, you’ll need to mirror Argo CD and dependency images (Argo CD, Dex, Redis) into an internal registry and override chart values accordingly.
Use helm upgrade --install so your install and upgrade command is consistent.
helm upgrade --install argocd argo/argo-cd \
--namespace argocd \
--values values.yaml
Validate that core components are healthy:
kubectl get pods -n argocd
kubectl get svc -n argocd
kubectl get ingress -n argocd
If something is stuck, look at events:
kubectl get events -n argocd --sort-by=.lastTimestamp | tail -n 30
Most installs include these core components:
Knowing what each component does helps you troubleshoot quickly when teams start scaling usage.
Your goal is to get a clean first login and then move toward enterprise access (Ingress + TLS + SSO).
kubectl port-forward -n argocd svc/argocd-server 8080:443
Then open https://localhost:8080.
It’s common to see an SSL warning because Argo CD ships with a self-signed cert by default. For a quick validation, proceed. For enterprise usage, use real TLS via your ingress/load balancer.
Once DNS and TLS are wired:
If your ingress terminates TLS at the edge, running the Argo CD API server with TLS disabled behind it (for example, server.insecure: "true") is a common pattern.
Default username is typically admin. Retrieve the password from the initial secret:
kubectl -n argocd get secret argocd-initial-admin-secret \
-o jsonpath="{.data.password}" | base64 --decode; echo
After you’ve logged in and set a real admin strategy using SSO and RBAC, the initial admin account should be treated as a break-glass mechanism only. Disable or tightly control its use, rotate credentials, and document when and how it is allowed.
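Argo CD supports disabling the built-in admin account declaratively via the argocd-cm ConfigMap. A minimal example (apply this only after SSO login is verified, or you can lock yourself out):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Disable the built-in admin account once SSO + RBAC are in place.
  admin.enabled: "false"
```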
If you want a quick Argo CD install for learning or validation, upstream manifests get you there fast.
Important context: the standard install.yaml manifest is designed for same-cluster deployments and includes cluster-level privileges. It’s also the non-HA install type that’s typically used for evaluation, not production. If you need a more locked-down footprint, Argo CD also provides namespace-scoped and HA manifest options in the upstream manifests.
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
Validate:
kubectl get pods -n argocd
kubectl get svc -n argocd
Then port-forward to access the UI:
kubectl port-forward -n argocd svc/argocd-server 8080:443
Use admin plus the password from argocd-initial-admin-secret as shown in the prior section.
For enterprise rollouts, treat manifest installs as a starting point. If you’re standardizing Argo CD across environments, Helm is easier to control and upgrade.
A real install isn’t “pods are running.” A real install is “we can deploy from Git safely.” This quick validation proves:
Keep it boring and repeatable. For example:
apps/
  guestbook/
    base/
    overlays/
      dev/
      prod/
Or, if you deploy with Helm:
apps/
  my-service/
    chart/
    values/
      dev.yaml
      prod.yaml
Even for a test app, start with the guardrail. AppProjects define what a team is allowed to deploy, and where.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: team-sandbox
  namespace: argocd
spec:
  description: "Sandbox boundary for initial validation"
  sourceRepos:
    - "https://github.com/argoproj/argocd-example-apps.git"
  destinations:
    - namespace: sandbox
      server: https://kubernetes.default.svc
  namespaceResourceWhitelist:
    - group: "apps"
      kind: Deployment
    - group: ""
      kind: Service
    - group: "networking.k8s.io"
      kind: Ingress
Apply it:
kubectl apply -f appproject-sandbox.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook
  namespace: argocd
spec:
  project: team-sandbox
  source:
    repoURL: https://github.com/argoproj/argocd-example-apps.git
    targetRevision: HEAD
    path: guestbook
  destination:
    server: https://kubernetes.default.svc
    namespace: sandbox
  syncPolicy:
    automated:
      selfHeal: true
      prune: false
    syncOptions:
      - CreateNamespace=true
Note: In many enterprise environments, namespace creation is restricted to platform workflows or Infrastructure as Code pipelines. If that applies to your organization, remove CreateNamespace=true and require namespaces to be provisioned separately.
Apply it:
kubectl apply -f application-guestbook.yaml
Now confirm the Application reports Synced and Healthy:
kubectl get applications -n argocd
By default, Argo CD polls repos periodically. Many teams configure webhooks (GitHub/GitLab) so Argo CD can refresh and sync quickly when changes land. It’s not required for day one, but it improves feedback loops in active repos.
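As a sketch of the GitHub case: you point the repository's webhook at Argo CD's /api/webhook endpoint and store the shared secret in the argocd-secret Secret, following the upstream key-name convention (the secret value below is a placeholder to replace):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: argocd-secret
  namespace: argocd
type: Opaque
stringData:
  # Shared secret GitHub uses to sign webhook payloads (placeholder value)
  webhook.github.secret: replace-with-a-random-string
```

The webhook on the GitHub side would target https://<your-argocd-host>/api/webhook with the same secret, so Argo CD can verify the payload signature before refreshing.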
This is where most enterprise rollouts either earn trust or lose it. If teams don’t trust the platform, they won’t onboard their workloads.
Focus on these enterprise minimums:
Practical rollout order:
Break-glass access should exist, but it should be documented, auditable, and rare.
Enterprise teams don’t struggle because they can’t install Argo CD. They struggle because Argo CD becomes a shared dependency—and shared dependencies need operational maturity.
At scale, pressure points are predictable:
Plan a path to HA before you onboard many teams. If HA Redis is part of your design, validate node capacity so workloads can spread across failure domains.
Keep monitoring simple and useful:
Also, decide alert ownership and escalation paths early. Platform teams typically own Argo CD availability and control-plane health, while application teams own application-level sync and runtime issues within their defined boundaries.
Git is the source of truth for desired state, but you still need to recover platform configuration quickly.
Backup: capture the argocd namespace configuration, including ConfigMaps, Secrets, and Argo CD custom resources such as Applications and AppProjects.
Then run restore tests on a schedule. The goal isn’t perfection—it’s proving you can regain GitOps control safely.
A safe enterprise approach:
Avoid “random upgrades.” Treat Argo CD as platform infrastructure with controlled change management.
Argo CD works well on EKS, but enterprise teams often have extra constraints: private clusters, restricted egress, and standard AWS ingress patterns.
Common installation approaches on EKS:
For access, most EKS enterprise teams standardize on an ingress backed by AWS Load Balancer Controller (ALB) or NGINX, with TLS termination at the edge.
An enterprise-grade Argo CD install is less about getting a UI running and more about putting the right foundations in place: a repeatable deployment method (typically Helm), a stable endpoint for access and SSO, and clear boundaries so teams can move fast without stepping on each other. If you take away one thing, make it this: treat Argo CD like shared platform infrastructure, not a one-off tool.
Start with a pinned, values-driven Helm install. Then lock in the enterprise minimums: SSO, RBAC, and AppProjects, before you onboard your second team. Finally, operationalize it with monitoring, backups, and a staged upgrade process so Argo CD stays reliable as your cluster and application footprint grows.
When you need orchestration, approvals, and progressive delivery across complex releases, pair GitOps with Harness CD. Request a demo.
These are quick answers to the most common questions that business teams have when they install Argo CD.
Most enterprise teams should install Argo CD with Helm because it lets you pin versions, keep configuration in Git, and upgrade predictably. Upstream manifests are a fast way to get started if you’re evaluating Argo CD.
Use an internal hostname, terminate TLS at your ingress or load balancer, and require SSO for interactive access. Don’t expose Argo CD publicly unless you have a clear business need.
Pin your chart/app versions, test upgrades in a non-production environment, and then move the same change to other environments. After the upgrade, check that you can log in, access the repo, and sync with a real app.
Use RBAC and AppProjects to set boundaries on a single shared instance, so app teams can deploy only from approved repos to approved namespaces and clusters.
Back up the argocd namespace (ConfigMaps, Secrets, and CRs) and keep app definitions in Git. Run restore tests on a schedule so recovery steps are proven, not theoretical.


Engineering organizations are waking up to something that used to be optional: measurement.
Not vanity dashboards. Not a quarterly “engineering metrics review” that no one prepares for. Real measurement that connects delivery speed, quality, and reliability to business outcomes and decision-making.
That shift is a good sign. It means engineering leaders are taking the craft seriously.
But there are two patterns I keep seeing across the industry that turn this good intention into a slow-motion failure. Both patterns look reasonable on paper. Both patterns are expensive. And both patterns lead to the same outcome: a metrics tool becomes shelfware, trust erodes, and leaders walk away thinking, “Metrics do not work here.”
Engineering metrics do work. But only when leaders use them the right way, for the right purpose, with the right operating rhythm.
Here are the two patterns, and how to address them.
This is the silent killer.
An engineering executive buys a measurement platform and rolls it out to directors and managers with a message like: “Now you’ll have visibility. Use this to improve.”
Then the executive who sponsored the initiative rarely uses the tool themselves.
No consistent review cadence. No decisions being made with the data. No visible examples of metrics guiding priorities. No executive-level questions that force a new standard of clarity.
What happens next is predictable.
Managers and directors conclude that engineering metrics are optional. They might log in at first. They might explore the dashboards. But soon the tool becomes “another thing” competing with real work. And because leadership is not driving the behavior, the culture defaults to the old way: opinions, anecdotes, and local optimization.
If leaders are not driving direction with data, why would managers choose to?
This is not a tooling problem. It is a leadership ownership problem.
If measurement is important, the most senior leaders must model it.
That does not mean micromanaging teams through numbers. It means creating a clear expectation that engineering metrics are part of how the organization thinks, communicates, and makes decisions.
Here is what executive ownership looks like in practice:
When executives do this, managers follow. Not because they are told to, but because the organization has made measurement real.
This is the other trap, and it is even more common.
There is a false belief that if an organization has DORA metrics, improvements in throughput and quality will automatically follow. As if measurement itself were the intervention.
But measurement does not create performance. It reveals performance.
A tool can tell you:
Those are powerful signals. But they do not change anything on their own.
If the system that produces those numbers stays the same, the numbers stay the same.
This is why organizations buy tools, instrument everything, and still feel stuck. They measured the pain, but never built the discipline to diagnose and treat the cause.
If you want metrics to lead to improvement, you need two things:
Without definitions, metrics turn into arguments. Everyone interprets the same number differently, then stops trusting the system.
Without a practice, metrics turn into observation. You notice, you nod, then you go back to work.
The purpose of measurement is not to create pressure. It is to create clarity. Clarity about where the system is constrained, what tradeoffs you are making, and whether your interventions actually helped.
Here is the shift that unlocks everything:
The goal is not to measure engineers.
The goal is to measure the system.
More specifically, the goal is to prove whether a change you made actually improved outcomes.
A change could be:
If you cannot measure movement after you make a change, you are operating on opinions and hope.
If you can measure movement, you can run engineering like a disciplined improvement engine.
This is where DORA metrics become extremely valuable: when they are used as confirmation and learning, not as a scoreboard.
The best leaders I have worked with do not hand leadership over to dashboards. They use metrics as confirmation of what they already sense, and as a way to test assumptions.
That is the role of measurement. It turns gut feel into validated understanding, then turns interventions into provable outcomes.
If you want measurement to drive real improvement, here is a straightforward structure that scales.
Use DORA as a baseline, but make definitions explicit:
This prevents endless debates and keeps the organization aligned.
You do not need a heavy process. You need consistency.
A strong starting point:
A metric without a lever becomes a complaint.
Examples:
This is the part most organizations skip.
Pick one change. Implement it. Measure before and after. Learn. Repeat.
Improvement becomes a system, not a motivational speech.
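As a hedged sketch of that before/after step, the check can be as simple as comparing a metric's median across the two windows (the data here is purely illustrative):

```python
from statistics import median

# Lead time in hours, sampled before and after one change (e.g., smaller PRs).
# These numbers are illustrative, not real benchmarks.
before = [30, 42, 36, 51, 44]
after = [22, 28, 31, 26, 35]

# Median is more robust than mean against one outlier release.
delta = median(after) - median(before)
print(f"median lead time moved {delta:+.1f}h")  # 42h -> 28h, so -14.0h
```

The point is not the statistics; it is that the comparison is written down before the change ships, so the result can confirm or refute the intervention.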
This brings us back to Pattern #1.
If executives use the tool and drive decisions with it, measurement becomes real. If they do not, the tool becomes optional, and optional always loses.
The organizations that do this well eventually stop talking about “metrics adoption.” They talk about “how we run the business.”
Measurement becomes part of how engineering communicates with leadership, how priorities get set, how teams remove friction, and how investment decisions are made.
And the biggest shift is this: they stop expecting a measurement tool to fix problems, and they use measurement to prove that the problems are being fixed.
That is the point. Not dashboards, not reporting, not performance theater: Clarity, decisions, experiments, and outcomes.
In the end, measurement is not the transformation. It is the instrument panel that tells you whether your transformation is working.


Managing a global army of 450+ developers, Devan has seen the "DevOps to DevSecOps to AI" evolution firsthand. Here are the core AI data security insights from his conversation on ShipTalk.
IBM’s internal "OnePipeline" isn't just a convenience; it’s a necessity. Built on Tekton (CI) and Argo CD, the platform has become the standard for both SaaS and on-prem deliveries.
This transition highlights a growing industry trend: as code generation accelerates, the bottleneck shifts to delivery and security. This phenomenon, often called the AI Velocity Paradox, suggests that without downstream automation, upstream speed gains are often neutralized by manual security gates.
IBM uses an internal AI coding agent called "Bob." But how do you ensure AI-generated code doesn't become technical debt?
"It’s not just about the code working; it’s about it being maintainable. If you don't provide context, the AI will build its own functions for JWT validation or encryption instead of using your existing, secure SDKs." — Devan Shah
To combat this, the team implements:
Quantifying the success of these initiatives is the next frontier. For organizations looking to move beyond "vibes" and toward hard data, the ebook Measuring the Impact of AI Development Tools offers a framework for tracking how these assistants actually affect cycle time and code quality.
We’ve all heard of "Garbage In, Garbage Out," but in the AI era, Devan warns of "Crown Jewels In, Crown Jewels Out." If you feed sensitive data or hard-coded secrets into an LLM training set, that model becomes a potential leak for attackers using sophisticated prompt injection.
Data Security Posture Management (DSPM) has emerged as a critical layer. It solves three major problems:
For a deeper technical look at how infrastructure must be architected to protect customer data in this environment, the Harness Data Security Whitepaper provides an excellent breakdown of security and privacy-by-design principles.
We are moving beyond simple chatbots to Agentic Workflows—where AI agents talk to other agents and API endpoints.
When asked how to balance speed with security, Devan's philosophy is simple: Identify the "Bare Minimum 15." You don't need a list of 300 compliance checks to start, but you do need 10 to 15 non-negotiables:
Whether you are a startup or a global giant like IBM, the goal of software delivery remains the same: Ship fast, but stay out of legal trouble. By integrating AI data security guardrails and robust data protection directly into the pipeline, security stops being a "speed bump" and starts being a foundational feature.
Want to dive deeper? Connect with Devan Shah on LinkedIn to follow his latest work, and subscribe to the ShipTalk podcast for more insights on using AI for everything after code.



Your developers complain about 20-minute builds while your cloud bill spirals out of control. Pipeline sprawl across teams creates security gaps you can't even see. These aren't separate problems. They're symptoms of a lack of actionable data on what actually drives velocity and cost.
The right CI metrics transform reactive firefighting into proactive optimization. With analytics data from Harness CI, platform engineering leaders can cut build times, control spend, and maintain governance without slowing teams down.
Platform teams who track the right CI metrics can quantify exactly how much developer time they're saving, control cloud spending, and maintain security standards while preserving development velocity. The importance of tracking CI/CD metrics lies in connecting pipeline performance directly to measurable business outcomes.
Build time, queue time, and failure rates directly translate to developer hours saved or lost. Research shows that 78% of developers feel more productive with CI, and most want builds under 10 minutes. Tracking median build duration and 95th percentile outliers can reveal your productivity bottlenecks.
Harness CI delivers builds up to 8X faster than traditional tools, turning this insight into action.
Cost per build and compute minutes by pipeline eliminate the guesswork from cloud spending. AWS CodePipeline charges $0.002 per action-execution-minute, making monthly costs straightforward to calculate from your pipeline metrics.
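As a back-of-the-envelope sketch of that calculation (the pipeline volumes below are illustrative, only the per-minute rate comes from the text):

```python
# Estimate monthly CI cost from a published per-minute rate.
RATE_PER_ACTION_MINUTE = 0.002  # USD, e.g. AWS CodePipeline's rate cited above

def monthly_cost(builds_per_day: int, minutes_per_build: float, days: int = 30) -> float:
    """Total compute-minute cost for one pipeline over a month."""
    total_minutes = builds_per_day * minutes_per_build * days
    return total_minutes * RATE_PER_ACTION_MINUTE

# A hypothetical pipeline: 200 builds/day at 12 minutes each.
print(f"${monthly_cost(builds_per_day=200, minutes_per_build=12):.2f}")  # $144.00
```

The same arithmetic, run per pipeline tag, is what makes cross-team cost comparisons possible.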
Measuring across teams helps you spot expensive pipelines, optimize resource usage, and justify infrastructure investments with concrete ROI.
SBOM completeness, artifact integrity, and policy pass rates ensure your software supply chain meets security standards without creating development bottlenecks. NIST and related EO 14028 guidance emphasize machine-readable SBOMs and automated hash verification for all artifacts.
However, measurement consistency remains challenging. A recent systematic review found that SBOM tooling variance creates significant detection gaps, with tools reporting between 43,553 and 309,022 vulnerabilities across the same 1,151 SBOMs.
Standardized metrics help you monitor SBOM generation rates and policy enforcement without manual oversight.
Not all metrics deserve your attention. Platform engineering leaders managing 200+ developers need measurements that reveal where time, money, and reliability break down, and where to fix them first.
So what does this look like in practice? Let's examine the specific metrics.
Build duration becomes most valuable when you track both median (p50) and 95th percentile (p95) times rather than simple averages. Research shows that timeout builds have a median duration of 19.7 minutes compared to 3.4 minutes for normal builds. That’s over five times longer.
While p50 reveals your typical developer experience, p95 exposes the worst-case delays that reduce productivity and impact developer flow. These outliers often signal deeper issues like resource constraints, flaky tests, or inefficient build steps that averages would mask. Tracking trends in both percentiles over time helps you catch regressions before they become widespread problems. Build analytics platforms can surface when your p50 increases gradually or when p95 spikes indicate new bottlenecks.
Keep builds under seven minutes to maintain developer engagement. Anything over 15 minutes triggers costly context switching. By monitoring both typical and tail performance, you optimize for consistent, fast feedback loops that keep developers in flow. Intelligent test selection reduces overall build durations by up to 80% by selecting and running only tests affected by the code changes, rather than running all tests.

An example of build durations dashboard (on Harness)
Queue time measures how long builds wait before execution begins. This is a direct indicator of insufficient build capacity. When developers push code, builds shouldn't sit idle while runners or compute resources are tied up. Research shows that heterogeneous infrastructure with mixed processing speeds creates excessive queue times, especially when job routing doesn't account for worker capabilities. Queue time reveals when your infrastructure can't handle developer demand.
Rising queue times signal it's time to scale infrastructure or optimize resource allocation. Per-job waiting time thresholds directly impact throughput and quality outcomes. Platform teams can reduce queue time by moving to Harness Cloud's isolated build machines, implementing intelligent caching, or adding parallel execution capacity. Analytics dashboards track queue time trends across repositories and teams, enabling data-driven infrastructure decisions that keep developers productive.
Build success rate measures the percentage of builds that complete successfully over time, revealing pipeline health and developer confidence levels. When teams consistently see success rates above 90% on their default branches, they trust their CI system to provide reliable feedback. Frequent failures signal deeper issues — flaky tests that pass and fail randomly, unstable build environments, or misconfigured pipeline steps that break under specific conditions.
Tracking success rate trends by branch, team, or service reveals where to focus improvement efforts. Slicing metrics by repository and pipeline helps you identify whether failures cluster around specific teams using legacy test frameworks or services with complex dependencies. This granular view separates legitimate experimental failures on feature branches from stability problems that undermine developer productivity and delivery confidence.

An example of Build Success/Failure Rate Dashboard (on Harness)
Mean time to recovery measures how fast your team recovers from failed builds and broken pipelines, directly impacting developer productivity. Research shows organizations with mature CI/CD implementations see MTTR improvements of over 50% through automated detection and rollback mechanisms. When builds fail, developers experience context switching costs, feature delivery slows, and team velocity drops. The best-performing teams recover from incidents in under one hour, while others struggle with multi-hour outages that cascade across multiple teams.
Automated alerts and root cause analysis tools slash recovery time by eliminating manual troubleshooting, reducing MTTR from 20 minutes to under 3 minutes for common failures. Harness CI's AI-powered troubleshooting surfaces failure patterns and provides instant remediation suggestions when builds break.
Flaky tests pass or fail non-deterministically on the same code, creating false signals that undermine developer trust in CI results. Research shows 59% of developers experience flaky tests monthly, weekly, or daily, while 47% of restarted failing builds eventually passed. This creates a cycle where developers waste time investigating false failures, rerunning builds, and questioning legitimate test results.
Tracking flaky test rate helps teams identify which tests exhibit unstable pass/fail behavior, enabling targeted stabilization efforts. Harness CI automatically detects problematic tests through failure rate analysis, quarantines flaky tests to prevent false alarms, and provides visibility into which tests exhibit the highest failure rates. This reduces developer context switching and restores confidence in CI feedback loops.
Cost per build divides your monthly CI infrastructure spend by the number of successful builds, revealing the true economic impact of your development velocity. CI/CD pipelines consume 15-40% of overall cloud infrastructure budgets, with per-run compute costs ranging from $0.40 to $4.20 depending on application complexity, instance type, region, and duration. This normalized metric helps platform teams compare costs across different services, identify expensive outliers, and justify infrastructure investments with concrete dollar amounts rather than abstract performance gains.
Automated caching and ephemeral infrastructure deliver the biggest cost reductions per build. Intelligent caching automatically stores dependencies and Docker layers. This cuts repeated download and compilation time that drives up compute costs.
Ephemeral build machines eliminate idle resource waste. They spin up fresh instances only when builds are queued, then terminate immediately after completion. Combine these approaches with right-sized compute types to reduce infrastructure costs by 32-43% compared to oversized instances.
Cache hit rate measures what percentage of build tasks can reuse previously cached results instead of rebuilding from scratch. When teams achieve high cache hit rates, they see dramatic build time reductions. Docker builds can drop from five to seven minutes to under 90 seconds with effective layer caching. Smart caching of dependencies like node_modules, Docker layers, and build artifacts creates these improvements by avoiding expensive regeneration of unchanged components.
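The metric itself is a simple ratio; here is a minimal sketch of computing it from build task records (the task schema with a boolean "cached" field is an assumption, not a real API):

```python
def cache_hit_rate(tasks: list[dict]) -> float:
    """Fraction of build tasks served from cache.

    Each task dict is assumed to carry a boolean 'cached' field
    (hypothetical schema; adapt to your CI platform's export format).
    """
    if not tasks:
        return 0.0
    hits = sum(1 for t in tasks if t["cached"])
    return hits / len(tasks)

tasks = [{"cached": True}, {"cached": True}, {"cached": False}, {"cached": True}]
print(f"{cache_hit_rate(tasks):.0%}")  # 75%
```

Comparing this ratio between a clean build and a warm build is the "clean vs. fully cached" measurement described above.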
Harness Build and Cache Intelligence eliminates the manual configuration overhead that traditionally plagues cache management. It handles dependency caching and Docker layer reuse automatically. No complex cache keys or storage management required.
Measure cache effectiveness by comparing clean builds against fully cached runs. Track hit rates over time to justify infrastructure investments and detect performance regressions.
Test cycle time measures how long it takes to run your complete test suite from start to finish. This directly impacts developer productivity because longer test cycles mean developers wait longer for feedback on their code changes. When test cycles stretch beyond 10-15 minutes, developers often switch context to other tasks, losing focus and momentum. Recent research shows that optimized test selection can accelerate pipelines by 5.6x while maintaining high failure detection rates.
Smart test selection optimizes these feedback loops by running only tests relevant to code changes. Harness CI Test Intelligence can slash test cycle time by up to 80% using AI to identify which tests actually need to run. This eliminates the waste of running thousands of irrelevant tests while preserving confidence in your CI deployments.
Categorizing pipeline issues into domains like code problems, infrastructure incidents, and dependency conflicts transforms chaotic build logs into actionable insights. Harness CI's AI-powered troubleshooting provides root cause analysis and remediation suggestions for build failures. This helps platform engineers focus remediation efforts on root causes that impact the most builds rather than chasing one-off incidents.

Visualizing issue distribution reveals whether problems are systemic or isolated events. Organizations using aggregated monitoring can distinguish between infrastructure spikes and persistent issues like flaky tests. Harness CI's analytics surface which pipelines and repositories have the highest failure rates. Platform teams can reduce overall pipeline issues by 20-30%.
Artifact integrity coverage measures the percentage of builds that produce signed, traceable artifacts with complete provenance documentation. This tracks whether each build generates Software Bills of Materials (SBOMs), digital signatures, and documentation proving where artifacts came from. While most organizations sign final software products, fewer than 20% deliver provenance data and only 3% consume SBOMs for dependency management. This makes the metric a leading indicator of supply chain security maturity.
Harness CI automatically generates SBOMs and attestations for every build, ensuring 100% coverage without developer intervention. The platform's SLSA L3 compliance capabilities generate verifiable provenance and sign artifacts using industry-standard frameworks. This eliminates the manual processes and key management challenges that prevent consistent artifact signing across CI pipelines.
Tracking CI metrics effectively requires moving from raw data to measurable improvements. The most successful platform engineering teams build a systematic approach that transforms metrics into velocity gains, cost reductions, and reliable pipelines.
Tag every pipeline with service name, team identifier, repository, and cost center. This standardization creates the foundation for reliable aggregation across your entire CI infrastructure. Without consistent tags, you can't identify which teams drive the highest costs or longest build times.
Implement naming conventions that support automated analysis. Use structured formats like team-service-environment for pipeline names and standardize branch naming patterns. Centralize this metadata using automated tag enforcement to ensure organization-wide visibility.
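One way to enforce such a convention automatically is a validation check in CI; a minimal sketch (the exact pattern and allowed environments are assumptions to adapt to your org):

```python
import re

# Enforce the team-service-environment convention mentioned above.
# Segment characters and the environment list are assumptions.
PIPELINE_NAME = re.compile(r"^(?P<team>[a-z0-9]+)-(?P<service>[a-z0-9-]+)-(?P<env>dev|staging|prod)$")

def validate_pipeline_name(name: str) -> bool:
    """Return True if a pipeline name follows team-service-environment."""
    return PIPELINE_NAME.fullmatch(name) is not None

print(validate_pipeline_name("payments-checkout-api-prod"))  # True
print(validate_pipeline_name("Checkout_Prod"))               # False
```

A check like this, run when pipelines are created or changed, is what "automated tag enforcement" looks like in its simplest form.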
Modern CI platforms eliminate manual metric tracking overhead. Harness CI provides dashboards that automatically surface build success rates, duration trends, and failure patterns in real-time. Teams can also integrate with monitoring stacks like Prometheus and Grafana for live visualization across multiple tools.
Configure threshold-based alerts for build duration spikes or failure rate increases. This shifts you from fixing issues after they happen to preventing them entirely.
Focus on p95 and p99 percentiles rather than averages to identify critical performance outliers. Drill into failure causes and flaky tests to prioritize fixes with maximum developer impact. Categorize pipeline failures by root cause — environment issues, dependency problems, or test instability — then target the most frequent culprits first.
Benchmark cost per build and cache hit rates to uncover infrastructure savings. Optimized caching and build intelligence can reduce build times by 30-40% while cutting cloud expenses.
Standardize CI pipelines using centralized templates and policy enforcement to eliminate pipeline sprawl. Store reusable templates in a central repository and require teams to extend from approved templates. This reduces maintenance overhead while ensuring consistent security scanning and artifact signing.
Establish Service Level Objectives (SLOs) for your most impactful metrics: build duration, queue time, and success rate. Set measurable targets like "95% of builds complete within 10 minutes" to drive accountability. Automate remediation wherever possible — auto-retry for transient failures, automated cache invalidation, and intelligent test selection to skip irrelevant tests.
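The "95% of builds complete within 10 minutes" target above can be checked mechanically; a hedged sketch (build durations are illustrative, and the empty-window policy is a choice to make deliberately):

```python
def slo_compliance(durations_min: list[float], threshold_min: float = 10.0) -> float:
    """Fraction of builds finishing within the threshold."""
    if not durations_min:
        return 1.0  # vacuously compliant; pick the policy that suits you
    within = sum(1 for d in durations_min if d <= threshold_min)
    return within / len(durations_min)

builds = [3.2, 4.8, 6.1, 9.5, 14.0]  # minutes, illustrative
rate = slo_compliance(builds)
print(f"{rate:.0%} within 10 min -> SLO {'met' if rate >= 0.95 else 'missed'}")
# prints "80% within 10 min -> SLO missed"
```

Wiring this into an alert is the accountability mechanism: the number either clears the target or it pages someone.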
The difference between successful platform teams and those drowning in dashboards comes down to focus. Elite performers track build duration, queue time, flaky test rates, and cost per build because these metrics directly impact developer productivity and infrastructure spend.
Start with the measurements covered in this guide, establish baselines, and implement governance that prevents pipeline sprawl. Focus on the metrics that reveal bottlenecks, control costs, and maintain reliability — then use that data to optimize continuously.
Ready to transform your CI metrics from vanity to velocity? Experience how Harness CI accelerates builds while cutting infrastructure costs.
Platform engineering leaders often struggle with knowing which metrics actually move the needle versus creating metric overload. These answers focus on metrics that drive measurable improvements in developer velocity, cost control, and pipeline reliability.
Actionable metrics directly connect to developer experience and business outcomes. Build duration affects daily workflow, while deployment frequency impacts feature delivery speed. Vanity metrics look impressive, but don't guide decisions. Focus on measurements that help teams optimize specific bottlenecks rather than general health scores.
Build duration, queue time, and flaky test rate directly affect how fast developers get feedback. While coverage monitoring dominates current practices, build health and time-to-fix-broken-builds offer the highest productivity gains. Focus on metrics that reduce context switching and waiting.
Cost per build and cache hit rate reveal optimization opportunities that maintain quality while cutting spend. Intelligent caching and optimized test selection can significantly reduce both build times and infrastructure costs. Running only relevant tests instead of entire suites cuts waste without compromising coverage.
Begin with pipeline metadata standardization using consistent tags for service, team, and cost center. Most CI platforms provide basic metrics through built-in dashboards. Start with DORA metrics, then add build-specific measurements as your monitoring matures.
Daily monitoring of build success rates and queue times enables immediate issue response. Weekly reviews of build duration trends and monthly cost analysis drive strategic improvements. Automated alerts for threshold breaches prevent small problems from becoming productivity killers.



Modern unit testing in CI/CD can help teams avoid slow builds by using smart strategies. Choosing the right tests, running them in parallel, and using intelligent caching all help teams get faster feedback while keeping code quality high.
Platforms like Harness CI use AI-powered test intelligence to reduce test cycles by up to 80%, showing what’s possible with the right tools. This guide shares practical ways to speed up builds and improve code quality, from basic ideas to advanced techniques that also lower costs.
Knowing what counts as a unit test is key to building software delivery pipelines that work.
A unit test exercises a single piece of your code, such as a function, class method, or a small group of related components, and checks one behavior at a time. Unit tests differ from integration tests in that they focus on your code's logic in isolation, which makes it easier to pinpoint the cause when something fails.
Unit tests should only check code that you wrote and not things like databases, file systems, or network calls. This separation makes tests quick and dependable. Tests that don't rely on outside services run in milliseconds and give the same results no matter where they are run, like on your laptop or in a CI pipeline.
Unit tests are one of the most important parts of a CI/CD pipeline because they surface problems immediately after code changes. Because they are so fast, developers can run them many times a minute while coding. This keeps feedback loops tight, which makes bugs easier to find and stops them from reaching later stages of the pipeline.
Teams that run full test suites on every commit catch problems early by focusing on three things: making tests fast, choosing the right tests, and keeping tests organized. Good unit testing helps developers stay productive and keeps builds running quickly.
Deterministic Tests for Every Commit
Unit tests should finish in seconds, not minutes, so that they can be quickly checked. Google's engineering practices say that tests need to be "fast and reliable to give engineers immediate feedback on whether a change has broken expected behavior." To keep tests from being affected by outside factors, use mocks, stubs, and in-memory databases. Keep commit builds to less than ten minutes, and unit tests should be the basis of this quick feedback loop.
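One way to keep tests deterministic is to replace outside dependencies with test doubles. The sketch below, with illustrative names (`get_display_price`, `price_cents` are not from any real codebase), shows a stub standing in for a remote pricing service so the test runs in milliseconds with no network:

```python
from unittest.mock import Mock

# Hypothetical code under test: in production, pricing_client makes a
# network call to a pricing service.
def get_display_price(pricing_client, sku):
    cents = pricing_client.price_cents(sku)
    return f"${cents / 100:.2f}"

def test_get_display_price_formats_cents():
    # A stub replaces the real client, so the test is fast and gives the
    # same result on a laptop or in a CI pipeline.
    stub_client = Mock()
    stub_client.price_cents.return_value = 1999
    assert get_display_price(stub_client, "SKU-1") == "$19.99"

test_get_display_price_formats_cents()
```

Because nothing leaves the process, a suite of such tests can run on every commit without pushing the build past the ten-minute budget.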
As projects get bigger, running all tests on every commit can slow teams down. Test Impact Analysis looks at coverage data to figure out which tests really check the code that has been changed. AI-powered test selection chooses the right tests for you, so you don't have to guess or sort them by hand.
To get the most out of your infrastructure, use selective execution and run tests at the same time. Divide test suites into equal-sized groups and run them on different machines simultaneously. Smart caching of dependencies, build files, and test results helps you avoid doing the same work over and over. When used together, these methods cut down on build time a lot while keeping coverage high.
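Dividing a suite into equal-sized groups is usually done with timing data from a previous run. A minimal sketch of that idea (the greedy balancing strategy and file names are illustrative, not any specific CI platform's algorithm):

```python
# Greedy partition of test files into N roughly equal-duration shards,
# using per-file timings recorded on an earlier run.
def shard_tests(timings, num_shards):
    """timings: {test_file: seconds}. Returns num_shards lists of files,
    balanced by total duration so parallel machines finish together."""
    shards = [{"total": 0.0, "files": []} for _ in range(num_shards)]
    # Place the slowest files first, always into the lightest shard.
    for name, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
        lightest = min(shards, key=lambda s: s["total"])
        lightest["files"].append(name)
        lightest["total"] += secs
    return [s["files"] for s in shards]

timings = {"test_api.py": 40, "test_db.py": 35, "test_ui.py": 30, "test_util.py": 5}
print(shard_tests(timings, 2))
# → [['test_api.py', 'test_util.py'], ['test_db.py', 'test_ui.py']]
```

Each shard then runs on its own machine; wall-clock time drops toward the duration of the slowest shard rather than the whole suite.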
Standardized Organization for Scale
Using consistent names, tags, and organization for tests helps teams track performance and keep quality high as they grow. Set clear rules for test types (like unit, integration, or smoke) and use names that show what each test checks. Analytics dashboards can spot flaky tests, slow tests, and common failures. This helps teams improve test suites and keep things running smoothly without slowing down developers.
A good unit test uses the Arrange-Act-Assert pattern. For example, you might test a function that calculates order totals with discounts:
```python
def test_apply_discount_to_order_total():
    # Arrange: set up test data
    order = Order(items=[Item(price=100), Item(price=50)])
    discount = PercentageDiscount(10)
    # Act: execute the function under test
    final_total = order.apply_discount(discount)
    # Assert: verify the expected outcome
    assert final_total == 135  # 150 - 10% discount
```
In the Arrange phase, you set up the objects and data you need. In the Act phase, you call the method you want to test. In the Assert phase, you check that the result is what you expected.
Testing Edge Cases
Real-world code needs to handle more than just the usual cases. Your tests should also check edge cases and errors:
```python
def test_apply_discount_with_empty_cart_returns_zero():
    order = Order(items=[])
    discount = PercentageDiscount(10)
    assert order.apply_discount(discount) == 0

def test_apply_discount_rejects_negative_percentage():
    order = Order(items=[Item(price=100)])
    with pytest.raises(ValueError):
        PercentageDiscount(-5)
```
Notice the naming style: test_apply_discount_rejects_negative_percentage clearly shows what's being tested and what should happen. If this test fails in your CI pipeline, you'll know right away what went wrong, without searching through logs.
When teams want faster builds and fewer late-stage bugs, the benefits of unit testing are clear. Good unit tests help speed up development and keep quality high.
When you use smart test execution in modern CI/CD pipelines, these benefits get even bigger.
Disadvantages of Unit Testing: Recognizing the Trade-Offs
Unit testing is valuable, but knowing its limits helps teams choose the right testing strategies. These downsides matter most when you’re trying to make CI/CD pipelines faster and more cost-effective.
Research shows that automatically generated tests can be harder to understand and maintain. Studies also show that statement coverage doesn’t always mean better bug detection.
Industry surveys show that many organizations have trouble with slow test execution and unclear ROI for unit testing. Smart teams solve these problems by choosing the right tests, using smart caching, and working with modern CI platforms that make testing faster and more reliable.
Developers use unit tests in three main ways that affect build speed and code quality. These practices turn testing into a tool that catches problems early and saves time on debugging.
In test-driven development (TDD), developers write unit tests before they start coding, which improves the design and cuts down on debugging. According to research, TDD finds 84% of new bugs, while traditional testing only finds 62%. This method gives you feedback right away, so failing tests guide what you do next.
Unit tests are like automated guards that catch bugs when code changes. Developers write tests to recreate bugs that have been reported, and then they check that the fixes work by running the tests again after the fixes have been made. Automated tools now generate test cases from issue reports. They are 30.4% successful at making tests that fail for the exact problem that was reported. To stop bugs that have already been fixed from coming back, teams run these regression tests in CI pipelines.
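A regression test of this kind typically pins the exact reported input. A sketch, where the function name, bug number, and the rounding bug itself are all hypothetical:

```python
# Sketch of a regression test: first reproduce a reported bug with a
# failing test, then keep the test in CI so the bug cannot return.
def apply_percentage_discount(total_cents, percent):
    # Hypothetical bug report: int() truncation once under-charged the
    # discount on odd totals; round() is the fix being guarded.
    return total_cents - round(total_cents * percent / 100)

def test_regression_discount_rounding():
    # The exact case from the (hypothetical) bug report.
    assert apply_percentage_discount(999, 3) == 969

test_regression_discount_rounding()
```

The test fails against the old code, passes against the fix, and then lives in the pipeline permanently as a tripwire.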
Good developer testing doesn't look at infrastructure or glue code; it looks at business logic, edge cases, and public interfaces. Testing public methods and properties is best; private details that change often should be left out. Test doubles help developers keep business logic separate from systems outside of their control, which makes tests more reliable. Integration and system tests are better for checking how parts work together, especially when it comes to things like database connections and full workflows.
Slow, unreliable tests can slow down CI and hurt productivity, while also raising costs. The following proven strategies help teams check code quickly and cut both build times and cloud expenses.
Choosing between manual and automated unit testing directly affects how fast and reliable your pipeline is.
Manual Unit Testing: Flexibility with Limitations
Manual unit testing means developers write and run tests by hand, usually early in development or when checking tricky edge cases that need human judgment. This works for old systems where automation is hard or when you need to understand complex behavior. But manual testing can’t be repeated easily and doesn’t scale well as projects grow.
Automated Unit Testing: Speed and Consistency at Scale
Automated testing transforms test execution into fast, repeatable processes that integrate seamlessly with modern development workflows. Modern platforms leverage AI-powered optimization to run only relevant tests, cutting cycle times significantly while maintaining comprehensive coverage.
Why High-Velocity Teams Prioritize Automation
Fast-moving teams use automated unit testing to keep up speed and quality. Manual testing is still useful for exploring and handling complex cases, but automation handles the repetitive checks that make deployments reliable and regular.
Difference Between Unit Testing and Other Types of Testing
Knowing the difference between unit, integration, and other test types helps teams build faster and more reliable CI/CD pipelines. Each type has its own purpose and trade-offs in speed, cost, and confidence.
Unit Tests: Fast and Isolated Validation
Unit tests are the most important part of your testing plan. They test single functions, methods, or classes without using any outside systems. You can run thousands of unit tests in just a few minutes on a good machine. This keeps you from having problems with databases or networks and gives you the quickest feedback in your pipeline.
Integration Tests: Validating Component Interactions
Integration testing makes sure that the different parts of your system work together. There are two main types of tests: narrow tests that use test doubles to check specific interactions (like testing an API client with a mock service) and broad tests that use real services (like checking your payment flow with real payment processors). Integration tests use real infrastructure to find problems that unit tests might miss.
End-to-End Tests: Complete User Journey Validation
The top of the testing pyramid is end-to-end tests, which simulate complete user journeys through your app. These tests give the highest confidence, but they are the slowest to run and the hardest to debug: a bug a unit test would catch in seconds can take days to track down through an end-to-end failure, and the tests themselves tend to be brittle.
The Test Pyramid: Balancing Speed and Coverage
The best testing strategy uses a pyramid: many small, fast unit tests at the bottom, some integration tests in the middle, and just a few end-to-end tests at the top.
Modern development teams use a unit testing workflow that balances speed and quality. Knowing this process helps teams spot slow spots and find ways to speed up builds while keeping code reliable.
Developers write code on their own machines and run unit tests locally before pushing changes. Running tests locally catches bugs early; the code then goes to version control, where CI pipelines take over. This staged process keeps developers productive by surfacing problems when they are easiest to fix.
Once code is in the pipeline, automation tools run unit tests on every commit and give feedback right away. If a test fails, the pipeline stops deployment and lets developers know right away. This automation stops bad code from getting into production. Research shows this method can cut critical defects by 40% and speed up deployments.
Modern CI platforms use Test Intelligence to only run the tests that are affected by code changes in order to speed up this process. Parallel testing runs test groups in different environments at the same time. Smart caching saves dependencies and build files so you don't have to do the same work over and over. These steps can help keep coverage high while lowering the cost of infrastructure.
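The core idea behind this kind of test selection can be sketched in a few lines: keep a coverage map from a previous run recording which source files each test touched, then run only the tests that intersect the commit's changed files. (This is a simplified illustration, not Harness's actual implementation; all names are made up.)

```python
# Test impact analysis sketch: select only the tests that could be
# affected by the files changed in a commit.
def select_impacted_tests(coverage_map, changed_files):
    """coverage_map: {test_name: set of source files it exercised}."""
    changed = set(changed_files)
    return sorted(
        test for test, covered in coverage_map.items()
        if covered & changed  # test touches at least one changed file
    )

coverage_map = {
    "test_checkout": {"cart.py", "pricing.py"},
    "test_login": {"auth.py"},
    "test_pricing": {"pricing.py"},
}
print(select_impacted_tests(coverage_map, ["pricing.py"]))
# → ['test_checkout', 'test_pricing']  (test_login is safely skipped)
```

Real systems refine this with finer-grained (per-line or per-method) coverage and fall back to the full suite when the mapping is stale, but the selection principle is the same.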
Teams analyze test results through dashboards that track failure rates, execution times, and coverage trends. Analytics platforms surface patterns like flaky tests or slow-running suites that need attention. This data drives decisions about test prioritization, infrastructure scaling, and process improvements. Regular analysis ensures the unit testing approach continues to deliver value as codebases grow and evolve.
Using the right unit testing techniques can turn unreliable tests into a reliable way to speed up development. These proven methods help teams trust their code and keep CI pipelines running smoothly:
These methods work together to build test suites that catch real bugs and stay easy to maintain as your codebase grows.
As we've discussed in the context of CI/CD workflows, the first step to good unit testing is isolation: testing your code without relying on outside systems that might be slow or unavailable. Dependency injection helps here because it lets you substitute test doubles for real dependencies when tests run.
It is easier for developers to choose the right test double if they know the differences between them. Fakes are simple working versions, such as in-memory databases. Stubs return set data that can be used to test queries. Mocks keep track of what happens so you can see if commands work as they should.
This approach keeps tests fast and accurate no matter when or where you run them. When teams apply good isolation, tests run 60% faster and the flaky failures that stall development drop sharply.
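The three kinds of double described above can be shown side by side. A minimal sketch (class and variable names are illustrative):

```python
from unittest.mock import Mock

# Fake: a simple working implementation, like an in-memory "database".
class FakeUserRepo:
    def __init__(self):
        self._users = {}
    def save(self, user_id, name):
        self._users[user_id] = name
    def find(self, user_id):
        return self._users.get(user_id)

repo = FakeUserRepo()
repo.save(1, "Ada")
assert repo.find(1) == "Ada"  # behaves like the real thing, in memory

# Stub: returns canned data, used to test queries.
stub_clock = Mock()
stub_clock.now.return_value = "2025-01-01"
assert stub_clock.now() == "2025-01-01"

# Mock: records what happened, used to verify commands were issued.
mailer = Mock()
mailer.send("ada@example.com", "Welcome!")
mailer.send.assert_called_once_with("ada@example.com", "Welcome!")
```

The rule of thumb: fakes for state, stubs for queries, mocks for commands.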
Beyond isolation, teams need ways to expand test coverage without hand-writing every case. Property-based testing lets you declare rules that should always hold, then automatically generates hundreds of test cases against them. This approach excels at finding edge cases and boundary conditions that manually written tests miss.
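Libraries like Hypothesis automate this; the hand-rolled sketch below shows the underlying idea, checking one invariant against many generated inputs (the `clamp` function is an illustrative example):

```python
import random

def clamp(value, low, high):
    return max(low, min(value, high))

def test_clamp_always_within_bounds():
    rng = random.Random(42)  # seeded so the run is reproducible
    for _ in range(500):
        low = rng.randint(-1000, 1000)
        high = low + rng.randint(0, 1000)
        value = rng.randint(-2000, 2000)
        result = clamp(value, low, high)
        # Property: the result always lies inside [low, high],
        # whatever the inputs happen to be.
        assert low <= result <= high

test_clamp_always_within_bounds()
```

Instead of five hand-picked examples, you state the rule once and let randomness probe the boundaries for you.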
Parameterized testing gives you similar benefits, but you have more control over the inputs. You don't have to write extra code to run the same test with different data. Tools like xUnit's Theory and InlineData make this possible. This helps find more bugs and makes it easier to keep track of your test suite.
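In pytest the idiomatic form is `@pytest.mark.parametrize`; the table-driven sketch below shows the same idea without requiring a test runner (the validation rule is illustrative):

```python
# One test body, many rows of data: the table-driven core of
# parameterized testing (pytest.mark.parametrize / xUnit Theory
# generate one named test per row automatically).
def is_valid_username(name):
    return 3 <= len(name) <= 20 and name.isalnum()

CASES = [
    ("ada", True),       # shortest allowed
    ("ab", False),       # too short
    ("a" * 21, False),   # too long
    ("ada_99", False),   # underscore is not alphanumeric
    ("Ada99", True),
]

def test_is_valid_username_cases():
    for name, expected in CASES:
        assert is_valid_username(name) is expected, name

test_is_valid_username_cases()
```

Adding a newly discovered edge case is a one-line change to the table, not a new test function.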
Both methods work best when you choose the right tests to run. You only run the tests you need, so platforms that know which tests matter for each code change give you full coverage without slowing things down.
The last step is testing complex output, such as JSON responses or generated code. Golden tests (also called snapshot tests) make this easier by saving the expected output as reference files, so you don't have to hand-write complicated assertions.
If your code’s output changes, the test fails and shows what’s different. This makes it easy to spot mistakes, and you can approve real changes by updating the snapshot. This method works well for testing APIs, config generators, or any code that creates structured output.
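The mechanics fit in a few lines. A sketch of a golden-test helper (the report structure and file naming are illustrative; real snapshot libraries add diffing and approval commands):

```python
import json
import pathlib
import tempfile

def build_report(orders):
    # Stand-in for code that produces structured output.
    return {"count": len(orders), "total": sum(orders)}

def assert_matches_golden(data, golden_path):
    path = pathlib.Path(golden_path)
    rendered = json.dumps(data, indent=2, sort_keys=True)
    if not path.exists():
        path.write_text(rendered)  # first run records the snapshot
        return
    # Later runs fail if the output drifts; deleting or regenerating
    # the file "approves" an intentional change.
    assert rendered == path.read_text(), f"output differs from {golden_path}"

golden = pathlib.Path(tempfile.mkdtemp()) / "report.golden.json"
assert_matches_golden(build_report([100, 50]), golden)  # records snapshot
assert_matches_golden(build_report([100, 50]), golden)  # passes: unchanged
```

Serializing with `sort_keys` keeps the snapshot stable across runs, which is exactly the "don't test details that change often" balance discussed below.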
Teams that use full automated testing frameworks see code coverage go up by 32.8% and catch 74.2% more bugs per build. Golden tests help by making it easier to check complex cases that would otherwise need manual testing.
The main thing is to balance thoroughness with easy maintenance. Golden tests should check real behavior, not details that change often. When you get this balance right, you’ll spend less time fixing bugs and more time building features.
Picking the right unit testing tools helps your team write tests efficiently, instead of wasting time on flaky tests or slow builds. The best frameworks work well with your language and fit smoothly into your CI/CD process.
Modern teams use these frameworks along with CI platforms that offer analytics and automation. This mix of good tools and smart processes turns testing from a bottleneck into a productivity boost.
Smart unit testing can turn CI/CD from a bottleneck into an advantage. When tests are fast and reliable, developers spend less time waiting and more time releasing code. Harness Continuous Integration uses Test Intelligence, automated caching, and isolated build environments to speed up feedback without losing quality.
Want to speed up your team? Explore Harness CI and see what's possible.


Filtering data is at the heart of developer productivity. Whether you’re looking for failed builds, debugging a service or analysing deployment patterns, the ability to quickly slice and dice execution data is critical.
At Harness, users across CI, CD and other modules rely on filtering to navigate complex execution data by status, time range, triggers, services and much more. While our legacy filtering worked, it had major pain points — hidden drawers, inconsistent behaviour and lost state on refresh — that slowed both developers and users.
This blog dives into how we built a new Filters component system in React: a reusable, type-safe and feature-rich framework that powers the filtering experience on the Execution Listing page (and beyond).
Our old implementation revealed several weaknesses as Harness scaled:
These problems shaped our success criteria: discoverability, smooth UX, consistent behaviour, reusable design and decoupled components.

Building a truly reusable and powerful filtering system required exploration and iteration. Our journey involved several key stages and learning from the pitfalls of each:
Shifted to React functional components but kept logic centralised in the FilterFramework. Each filter was conditionally rendered based on visibleFilters array. Framework fetched filter options and passed them down as props.
```
COMPONENT FilterFramework:
    STATE activeFilters = {}
    STATE visibleFilters = []
    STATE filterOptions = {}

    ON visibleFilters CHANGE:
        FOR EACH filter IN visibleFilters:
            IF filterOptions[filter] NOT EXISTS:
                options = FETCH filterData(filter)
                filterOptions[filter] = options

    ON activeFilters CHANGE:
        makeAPICall(activeFilters)

    RENDER:
        <AllFilters setVisibleFilters={setVisibleFilters} />
        IF 'services' IN visibleFilters:
            <DropdownFilter
                name="Services"
                options={filterOptions.services}
                onAdd={updateActiveFilters}
                onRemove={removeFromVisible}
            />
        IF 'environments' IN visibleFilters:
            <DropdownFilter ... />
```
Pitfalls: Adding new filters required changes in multiple places, creating a maintenance nightmare and poor developer experience. The framework had minimal control over filter implementation, lacked proper abstraction and scattered filter logic across the codebase, making it neither “stupid-proof” nor scalable.
Improved the previous approach by accepting filters as children and using React.cloneElement to inject callbacks (onAdd, onRemove) from the parent framework. This gave developers a cleaner API to add filters.
```jsx
children.forEach(child => {
  if (visibleFilters.includes(child.props.filterKey)) {
    return React.cloneElement(child, {
      onAdd: (label, value) => {
        activeFilters[child.props.filterKey].push({ label, value });
      },
      onRemove: () => {
        delete activeFilters[child.props.filterKey];
      }
    });
  }
});
```
Pitfalls: React.cloneElement is an expensive operation that causes performance issues with frequent re-renders, and it's considered an anti-pattern by the React team. The approach tightly coupled filters to the framework's callback signature, made prop flow implicit and difficult to debug, and created type safety issues since TypeScript struggles with dynamically injected props.
The winning design uses React Context API to provide filter state and actions to child components. Individual filters access setValue and removeFilter via useFiltersContext() hook. This decouples filters from the framework while maintaining control.
```
COMPONENT Filters({ children, onChange }):
    STATE filtersMap = {}    // { search: { value, query, state } }
    STATE filtersOrder = []  // ['search', 'status']

    FUNCTION updateFilter(key, newValue):
        serialized = parser.serialize(newValue)  // Type → String
        filtersMap[key] = { value: newValue, query: serialized }
        updateURL(serialized)
        onChange(allValues)

    ON URL_CHANGE:
        parsed = parser.parse(urlString)  // String → Type
        filtersMap[key] = { value: parsed, query: urlString }

    RENDER:
        <Context.Provider value={{ updateFilter, filtersMap }}>
            {children}
        </Context.Provider>
END COMPONENT
```
Benefits: This solution eliminated the performance overhead of cloneElement, decoupled filters from framework internals and made it easy to add new filters without touching framework code. The Context API provides clear data flow that's easy to debug and test, with type safety through TypeScript.
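The update path in that pseudocode can be sketched framework-agnostically. In this illustrative TypeScript sketch (names like `createFilterStore` are made up; the React Context and URL wiring are omitted), the store serializes the typed value, keeps both forms, and notifies the consumer:

```typescript
type FilterEntry = { value: unknown; query: string };

function createFilterStore(
  serialize: (v: unknown) => string,
  onChange: (all: Record<string, unknown>) => void
) {
  const filtersMap: Record<string, FilterEntry> = {};
  return {
    updateFilter(key: string, newValue: unknown): string {
      const serialized = serialize(newValue); // Type → String
      filtersMap[key] = { value: newValue, query: serialized };
      // Consumers only ever see typed values; the string form stays
      // internal (in the real framework it is pushed to the URL).
      onChange(
        Object.fromEntries(
          Object.entries(filtersMap).map(([k, entry]) => [k, entry.value])
        )
      );
      return serialized;
    },
    filtersMap,
  };
}

// Usage: a boolean "onlyActive" filter, mirroring the parser examples.
let latest: Record<string, unknown> = {};
const store = createFilterStore(String, (all) => { latest = all; });
store.updateFilter("onlyActive", true);
console.log(latest.onlyActive, store.filtersMap.onlyActive.query);
```

The consumer never touches serialization or the map directly, which is the inversion of control the next section describes.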

The Context API in React unlocks something truly powerful — Inversion of Control (IoC). This design principle is about delegating control to a framework instead of managing every detail yourself. It’s often summed up by the Hollywood Principle: “Don’t call us, we’ll call you.”
In React, this translates to building flexible components that let the consumer decide what to render, while the component itself handles how and when it happens.
Our Filters framework applies this principle: you don’t have to manage when to update state or synchronise the URL. You simply define your filter components and the framework orchestrates the rest — ensuring seamless, predictable updates without manual intervention.
Our Filters framework demonstrates Inversion of Control in three key ways.
The result? A single, reusable Filters component that works across pipelines, services, deployments or repositories. By separating UI logic from business logic, we gain flexibility, testability and cleaner architecture — the true power of Inversion of Control.
COMPONENT DemoPage:
STATE filterValues
FilterHandler = createFilters()
FUNCTION applyFilters(data, filters):
result = data
IF filters.onlyActive == true:
result = result WHERE item.status == "Active"
RETURN result
filteredData = applyFilters(SAMPLE_DATA, filterValues)
RENDER:
<RouterContextProvider>
<FilterHandler onChange = (updatedFilters) => SET filterValues = updatedFilters>
// Dropdown to add filters dynamically
<FilterHandler.Dropdown>
RENDER FilterDropdownMenu with available filters
</FilterHandler.Dropdown>
// Active filters section
<FilterHandler.Content>
<FilterHandler.Component parser = booleanParser filterKey = "onlyActive">
RENDER CustomActiveOnlyFilter
</FilterHandler.Component>
</FilterHandler.Content>
</FilterHandler>
RENDER DemoTable(filteredData)
</RouterContextProvider>
END COMPONENTOne of the key technical challenges in building a filtering system is URL synchronization. Browsers only understand strings, yet our applications deal with rich data types — dates, booleans, arrays and more. Without a structured solution, each component would need to manually convert these values, leading to repetitive, error-prone code.
The solution is our parser interface, a lightweight abstraction with just two methods: parse and serialize.
- parse converts a URL string into the type your app needs.
- serialize does the opposite, turning that typed value back into a string for the URL.

This bidirectional system runs automatically — parsing when filters load from the URL and serialising when users update filters.
```typescript
const booleanParser: Parser<boolean> = {
  parse: (value: string) => value === 'true',   // "true" → true
  serialize: (value: boolean) => String(value)  // true → "true"
}
```
At the heart of our framework lies the FiltersMap — a single, centralized object that holds the complete state of all active filters. It acts as the bridge between your React components and the browser, keeping UI state and URL state perfectly in sync.
Each entry in the FiltersMap contains three key fields: value, the typed filter value; query, its serialized string form for the URL; and state, which tracks whether the filter is visible, applied, or hidden.
You might ask — why store both the typed value and its string form? The answer is performance and reliability. If we only stored the URL string, every re-render would require re-parsing, which quickly becomes inefficient for complex filters like multi-selects. By storing both, we parse only once — when the value changes — and reuse the typed version afterward. This ensures type safety, faster URL synchronization and a clean separation between UI behavior and URL representation. The result is a system that’s predictable, scalable, and easy to maintain.
```typescript
interface FilterType<T = any> {
  value?: T            // The actual filter value
  query?: string       // Serialized string for URL
  state: FilterStatus  // VISIBLE | FILTER_APPLIED | HIDDEN
}
```
Let's trace how a filter value moves through the system — from user interaction to URL synchronization.
It all starts when a user interacts with a filter component — for example, selecting a date. This triggers an onChange event with a typed value, such as a Date object. Before updating the state, the parser’s serialize method converts that typed value into a URL-safe string.
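A date parser following the same two-method shape as the booleanParser illustrates this step (the exact `Parser` interface is restated here as an assumption, and the date format is illustrative):

```typescript
interface Parser<T> {
  parse: (value: string) => T;
  serialize: (value: T) => string;
}

const dateParser: Parser<Date> = {
  // "2024-03-15" from the URL → a Date object the UI can work with
  parse: (value: string) => new Date(value + "T00:00:00Z"),
  // Date → compact, URL-safe "2024-03-15"
  serialize: (value: Date) => value.toISOString().slice(0, 10),
};

const fromUrl = dateParser.parse("2024-03-15");
console.log(dateParser.serialize(fromUrl)); // round-trips to "2024-03-15"
```

Because parse and serialize are exact inverses, a value can travel URL → state → URL any number of times without drifting.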
The framework then updates the FiltersMap with both versions: the typed value and the serialized query.
From here, two things happen simultaneously:
- The URL is updated with the new serialized string, keeping the filter state shareable and bookmarkable.
- The onChange callback fires, passing typed values back to the parent component — allowing the app to immediately fetch data or update visualizations.

The reverse flow works just as seamlessly. When the URL changes — say, the user clicks the back button — the parser's parse method converts the string back into a typed value, updates the FiltersMap and triggers a re-render of the UI.
All of this happens within milliseconds, enabling a smooth, bidirectional synchronization between the application state and the URL — a crucial piece of what makes the Filters framework feel so effortless.

For teams tackling similar challenges — complex UI state management, URL synchronization and reusable component design — this architecture offers a practical blueprint to build upon. The patterns used are not specific to Harness; they are broadly applicable to any modern frontend system that requires scalable, stateful and user-driven filtering.
The team’s core objectives — discoverability, smooth UX, consistent behavior, reusable design and decoupled elements — directly shaped every architectural decision. Through Inversion of Control, the framework manages the when and how of state updates, lifecycle events and URL synchronization, while developers define the what — business logic, API calls and filter behavior.
By treating the URL as part of the filter state, the architecture enables shareability, bookmarkability and native browser history support. The Context API serves as the control distribution layer, removing the need for prop drilling and allowing deeply nested components to seamlessly access shared logic and state.
Ultimately, Inversion of Control also paved the way for advanced capabilities such as saved filters, conditional rendering, and sticky filters — all while keeping the framework lightweight and maintainable. This approach demonstrates how clear objectives and sound architectural principles can lead to scalable, elegant solutions in complex UI systems.


In most teams, the question is no longer "Do we need an internal developer portal?" but "Do we really want to run Backstage ourselves?"
Backstage proved the internal developer portal (IDP) pattern, and it works. It gives you a flexible framework, plugins, and a central place for services and docs. It also gives you a long-term commitment: owning a React/TypeScript application, managing plugins, chasing upgrades, and justifying a dedicated platform squad to keep it all usable.
That's why Backstage alternatives like Harness IDP and managed Backstage services exist, and why so many platform teams evaluate them carefully before making a decision.
Backstage was created by Spotify to fix real platform engineering problems: painful onboarding, scattered documentation, unclear ownership, and no clear paths for new services. When Spotify open-sourced Backstage in 2020, the goal was clear. The core value props are good: a software catalog, templates for new services, and a single place for the tools teams need to work together.
The problem is not the concept. It is the operating model. Backstage is a framework, not a product. If you adopt it, you are committing to:
Once Backstage moves beyond a proof of concept, it takes a lot of engineering work to keep it reliable, secure, and up to date. Many companies don't realize how much work it takes. At the same time, platforms like Harness are showing that you don't have to build everything yourself to get good results from a portal.
When you look at how Harness connects IDP to CI, CD, IaC Management, and AI-powered workflows, you start to see an alternate model: treat the portal as a product you adopt, then spend platform engineering energy on standards, golden paths, and self-service workflows instead of plumbing.
When you strip away branding, almost every Backstage alternative fits one of three patterns. The differences are in how much you own and how much you offload:
| | Build (Self-Hosted Backstage) | Hybrid (Managed Backstage) | Buy (Commercial IDP) |
|---|---|---|---|
| You own | Everything: UI, plugins, infra, roadmap | Customization, plugin choices, catalog design | Standards, golden paths, workflows |
| Vendor owns | Nothing | Hosting, upgrades, security patches | Platform, upgrades, governance tooling, support |
| Engineering investment | High (2–5+ dedicated engineers) | Medium (1–2 engineers for customization) | Low (configuration, not code) |
| Time to value | Months | Weeks to months | Weeks |
| Flexibility | Unlimited | High, within Backstage conventions | Moderate, within vendor abstractions |
| Governance & RBAC | Build it yourself | Build or plugin-based | Built-in |
| Best for | Large orgs wanting full control | Teams standardized on Backstage who want less ops | Teams prioritizing speed, governance, and actionability |
What This Actually Means
You fork or deploy OSS Backstage, install the plugins you need, and host it yourself. Or you build your own internal portal from scratch. Either way, you now own:
Backstage gives you the most flexibility: you can add your own custom plugins, model your internal world however you want, and connect it to any tool. If you're willing to invest heavily, that freedom is very powerful.
Where It Breaks Down
In practice, that freedom has a price:
This path can still work. If you run a very large organization, want to treat the portal as a core internal product, have strong React/TypeScript and platform skills, and truly need unlimited customization, building on Backstage is a reasonable choice. Just remember that you are not choosing a tool; you are committing people to a long-term project.
What This Actually Means
Managed Backstage providers run and host Backstage for you. You still get the framework and everything that goes with it, but you don't have to fix Kubernetes manifests at 2 a.m. or investigate upstream patch releases.
Vendor responsibilities typically include:
You get "Backstage without the server babysitting."
Where The Trade-Offs Show Up
You also inherit Backstage's structural limits:
Hybrid works well if you have already standardized on Backstage concepts, want to keep the ecosystem, and simply refuse to run your own instance. If you're just starting out with IDPs and are still looking into things like golden paths, self-service workflows, and platform-managed scorecards, it might be helpful to compare hybrid Backstage to commercial IDPs that were made to be products from the start.
What This Actually Means
Commercial IDPs approach the space from the opposite angle. You do not start with a framework, you start with a product. You get a portal that ships with:
The main point that sets them apart is how well that portal is connected to the systems that your developers use every day. Some products act as a metadata hub, bringing together information from your current tools. Harness does things differently. The IDP is built right on top of a software delivery platform that already has CI, CD, IaC Management, Feature Flags, and more.
Why Teams Go This Route
Teams that choose commercial Backstage alternatives tend to prioritize:
You trade some of Backstage's absolute freedom for a more focused, maintainable platform. For most organizations, that is a win.
People often think that the difference is "Backstage is free; commercial IDPs are expensive." In reality, the choice is "Where do you want to spend?"
With open source, you avoid license fees but spend engineering capacity. With commercial IDPs like Harness, it's the opposite: you pay a subscription and keep engineers focused on higher-value platform work. Either way, a platform's purpose is to serve the teams that build on it; build versus buy simply decides who does the undifferentiated heavy lifting.
This is how it works in practice:
| Dimension | Open-Source Backstage | Commercial IDP (e.g., Harness) |
|---|---|---|
| Upfront cost | Free (no license fees) | Subscription or usage-based pricing |
| Engineering staffing | 2–5+ engineers dedicated at scale | Minimal—vendor handles core platform |
| Customization freedom | Unlimited—you own the code | Flexible within vendor abstractions |
| UX consistency | Drifts as teams extend the portal | Controlled by product design |
| AI/automation depth | Add-on or custom build | Native, grounded in delivery data |
| Vendor lock-in risk | Low (open source) | Medium (tied to platform ecosystem) |
| Long-term TCO (3–5 years) | High (hidden in headcount) | Predictable (visible in contract) |
Backstage is a solid choice if you explicitly want to own design, UX, and technical debt. Just be honest about how much that will cost over the next three to five years.
Commercial IDPs like Harness come with pre-made catalogs, scorecards, workflows, and governance that show you the best ways to do things. In short, it's ready to use right away. You get faster rollout of golden paths, self-service workflows, and environment management, as well as predictable roadmaps and vendor support.
The real question is what you want your platform team to do: shipping features in your portal framework, or defining and evolving the standards that drive better software delivery.
Compared with other Backstage options, Harness IDP is best understood as a platform-level choice rather than a standalone portal. It builds on Backstage where that makes sense (for example, to tap into the plugin ecosystem), but it is packaged as a curated product on top of the broader Harness Software Delivery Platform.
A few design principles stand out:
If you frame Backstage alternatives around "How much of this work do we want to own?" and "Should our portal be a UI or a control plane?", Harness falls naturally into the group that treats the IDP as part of a connected delivery platform rather than as a separate piece of infrastructure.
A lot of teams say, "We'll start with Backstage, and if it gets too hard, we'll move to something else." That sounds safe on paper. In production, migrating off Backstage gets harder the longer you run it.
Common points where things go wrong include:
The point isn't "never choose Backstage." The point is that if you do, you should think of it as a strategic choice, not an experiment you can easily undo in a year.
Whether you are comparing Backstage alone, Backstage in a managed form, or commercial platforms like Harness, use a lens that goes beyond feature checklists. These seven questions will help you cut through the noise.
If a solution cannot give you concrete answers here, it is not the right Backstage alternative for you.
Choosing among Backstage alternatives comes down to one question: what kind of work do you want your platform team to own?
Open source Backstage gives you maximum flexibility and maximum responsibility. Managed Backstage reduces ops burden but keeps you within Backstage's conventions. Commercial IDPs like Harness narrow the surface area you maintain and connect your portal directly to CI/CD, environments, and governance.
If you want fast time to value, built-in governance, and a portal that acts rather than just displays, connect with Harness.


We’ve all seen it happen. A DevOps initiative starts with high energy, but two years later, you’re left with a sprawl of "fragile agile" pipelines. Every team has built their own bespoke scripts, security checks are inconsistent (or non-existent), and maintaining the system feels like playing whack-a-mole.
This is where the industry is shifting from simple DevOps execution to Platform Engineering.
The goal of a modern platform team isn't to be a help desk that writes YAML files for developers. The goal is to architect a "Golden Path"—a standardized, pre-vetted route to production that is actually easier to use than the alternative. It reduces the cognitive load for developers while ensuring that organizational governance isn't just a policy document, but a reality baked into every commit.
In this post, I want to walk through the architecture of a Golden Standard Pipeline. We’re going to look beyond simple task automation and explore how to weave Governance, Security, and Supply Chain integrity into a unified workflow that stands the test of time.
A Golden Standard Pipeline isn't defined by the tools you use—Harness, GitLab, GitHub Actions—but by its layers of validation. It's not enough to simply "build and deploy" anymore. We need to architect a system that establishes trust at every single stage.
I like to break this architecture down into four distinct domains:

The Principle: Don't process what you can't approve.
In traditional pipelines, we often see compliance checks shoehorned in right before production deployment. This is painful. There is nothing worse than waiting 20 minutes for a build and test cycle, only to be told you can't deploy because you used a non-compliant base image.
In a Golden Standard architecture, we shift governance to Step Zero.
By implementing Policy as Code (using frameworks like OPA) at the very start of execution, we solve a few problems:
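In practice these rules would typically be written in OPA's Rego and evaluated before any build step runs. The following Python sketch mirrors that decision logic so the fail-fast behavior is concrete; the specific rules (registry allowlist, required labels) are illustrative assumptions, not a prescribed policy set:

```python
# Step Zero policy gate: reject non-compliant pipelines before any build minutes
# are spent. In production this logic would live in OPA/Rego; this Python sketch
# shows the equivalent decision flow.

ALLOWED_REGISTRIES = {"registry.internal.example.com"}   # assumed org allowlist
REQUIRED_LABELS = {"team", "cost-center"}                # assumed required metadata

def evaluate_policy(pipeline_input: dict) -> list[str]:
    """Return a list of violations; an empty list means the pipeline may proceed."""
    violations = []
    registry = pipeline_input.get("base_image", "").split("/")[0]
    if registry not in ALLOWED_REGISTRIES:
        violations.append(f"base image registry '{registry}' is not on the allowlist")
    missing = REQUIRED_LABELS - pipeline_input.get("labels", {}).keys()
    if missing:
        violations.append(f"missing required labels: {sorted(missing)}")
    return violations

# A non-compliant request is rejected at Step Zero, not after a 20-minute build.
violations = evaluate_policy({
    "base_image": "docker.io/library/ubuntu:latest",
    "labels": {"team": "payments"},
})
print(violations)
```

The key property is that `evaluate_policy` needs only the pipeline's declared inputs, so it can run in milliseconds before any compute is provisioned.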
The Principle: Security must speed the developer up, not slow them down.
The "Inner Loop" is sacred ground. This is where developers live. If your security scanning adds friction or takes too long, developers will find a way to bypass it. To solve this, we rely on Parallel Orchestration.
Instead of running checks linearly (Lint, then SAST, then Secrets), we group "Code Smells," "Linting," and "Security Scanners" to run simultaneously.
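The parallel grouping can be sketched with standard-library concurrency. The check names come from the text; the durations and the pass-through scanner stub are illustrative assumptions:

```python
# Run independent inner-loop checks concurrently instead of serially.
import time
from concurrent.futures import ThreadPoolExecutor

def run_check(name: str, duration: float) -> tuple[str, bool]:
    """Stand-in for a real scanner: sleeps to simulate work, then reports a pass."""
    time.sleep(duration)
    return name, True

# Assumed per-check durations (seconds), for illustration only.
checks = {"lint": 0.2, "sast": 0.3, "secrets-scan": 0.25}

start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda kv: run_check(*kv), checks.items()))
elapsed = time.perf_counter() - start

# Wall-clock time approaches max(durations), not sum(durations).
print(results)
print(f"elapsed ≈ {elapsed:.2f}s (serial would be ≈ {sum(checks.values()):.2f}s)")
```

Because the checks share no state, the fan-out is safe, and the developer's wait is bounded by the slowest check rather than the sum of all of them.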
This gives us a huge architectural advantage:
The Principle: Prove the origin and ingredients of your software.
This is the biggest evolution we've seen in CI/CD recently. We need to stop treating the build artifact (Docker image/Binary) as a black box. Instead, we generate three critical pieces of metadata that travel with the artifact: an SBOM (the artifact's ingredients), a provenance attestation (its origin: which source revision and which builder produced it), and a cryptographic signature (proof it hasn't been tampered with since the trusted build).
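A minimal sketch of what metadata that travels with the artifact can look like, using only Python's standard library. Real pipelines use dedicated tools for this (for example, Syft for SBOMs and Cosign for signing), and every field name and key below is an illustrative assumption:

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

def artifact_digest(artifact_bytes: bytes) -> str:
    """Content-addressable identity of the build output."""
    return "sha256:" + hashlib.sha256(artifact_bytes).hexdigest()

def sign_digest(digest: str, signing_key: bytes) -> str:
    """Stand-in for real signing (Cosign/KMS in production): HMAC over the digest."""
    return hmac.new(signing_key, digest.encode(), hashlib.sha256).hexdigest()

def provenance(digest: str, git_sha: str, builder: str) -> dict:
    """Minimal provenance record: what was built, from what source, by whom."""
    return {
        "artifact": digest,
        "source_revision": git_sha,
        "builder": builder,
        "built_at": datetime.now(timezone.utc).isoformat(),
    }

artifact = b"\x00example image layer bytes\x00"          # stand-in for the real binary
digest = artifact_digest(artifact)
signature = sign_digest(digest, signing_key=b"demo-key")  # never hardcode real keys
record = provenance(digest, git_sha="3f2c9ab", builder="ci-runner-01")
print(json.dumps({"digest": digest, "signature": signature, "provenance": record}, indent=2))
```

The point of the sketch is the shape, not the crypto: every downstream stage can re-derive the digest from the bytes it received and check it against the signed record, which is what makes the artifact auditable rather than a black box.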
The Principle: Build once, deploy everywhere.
A common anti-pattern I see is rebuilding artifacts for different environments—building a "QA image" and then rebuilding a "Prod image" later. This introduces risk.
In the Golden Standard, the artifact generated and signed in Layer 3 is the exact same immutable object deployed to QA and Production. We use a Rolling Deployment strategy with an Approval Gate between environments. The production stage explicitly references the digest of the artifact verified in QA, ensuring zero drift.
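That digest-pinning rule can be sketched as a simple gate. The class and method names here are hypothetical, invented for illustration; the invariant they enforce is the one described above, that production only accepts the exact bytes QA verified:

```python
import hashlib

def digest(artifact: bytes) -> str:
    """Content-addressable identity of an immutable artifact."""
    return "sha256:" + hashlib.sha256(artifact).hexdigest()

class DeploymentGate:
    """Prod may only deploy the exact digest that was verified in QA."""

    def __init__(self) -> None:
        self.qa_verified: set[str] = set()

    def verify_in_qa(self, artifact: bytes) -> str:
        """Record the digest of an artifact that passed QA."""
        d = digest(artifact)
        self.qa_verified.add(d)
        return d

    def deploy_to_prod(self, artifact: bytes) -> str:
        """Refuse anything whose digest was not verified in QA (zero drift)."""
        d = digest(artifact)
        if d not in self.qa_verified:
            raise RuntimeError(f"refusing to deploy unverified artifact {d}")
        return d

gate = DeploymentGate()
image = b"immutable build output"
qa_digest = gate.verify_in_qa(image)
prod_digest = gate.deploy_to_prod(image)   # same bytes, same digest: allowed
# gate.deploy_to_prod(b"rebuilt image")    # would raise: rebuilt bytes hash differently
```

This is exactly why rebuilding a "Prod image" is an anti-pattern: even an innocuous rebuild produces different bytes, a different digest, and therefore an artifact that was never actually tested.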
To successfully build this, your platform needs to provide specific capabilities mapped to these layers.

Tools change. Jenkins, Harness, GitHub Actions—they all evolve. But the architecture remains constant. If you adhere to these principles, you future-proof your organization:
Adopting a Golden Standard architecture transforms the CI/CD pipeline from a simple task runner into a governance engine. By abstracting security and compliance into these reusable layers, Platform Engineering teams can guarantee that every microservice—regardless of the language or framework—adheres to the organization's highest standards of trust.