CI/CD with Robert Erez

The Pragmatic Engineer 1h15 9 min #90

Summary

  • Rob Erez, a CI/CD expert and early engineer at Octopus Deploy, joins the podcast to share hard-won lessons from over a decade of building deployment tooling at scale. The conversation covers progressive delivery in practice, the realities of GitOps, why rollbacks are overrated, how AI is changing CI/CD, and what it actually takes to run deployment infrastructure for thousands of customers. A recurring theme: most teams just want to ship software reliably, and the industry’s dogmatic attachment to specific methodologies often obscures that simple goal.

From Skype to Octopus Deploy: A CI/CD Origin Story

  • Rob and the host worked together on the Skype for Web team in the mid-2010s, where they quietly practiced continuous delivery years before it became mainstream.
    • Their team inherited the Outlook.com plugin serving ~400 million users per month, running on Azure.
    • Despite a formal change advisory board (CAB) process requiring weekly sign-offs, the team built an automated pipeline that ran tests, promoted through staging, and shipped to production multiple times per week.
    • They used New Zealand as a canary deployment target: it was the first country to reach a new day (because of its time zone), English-speaking, and small enough that a bug wouldn’t cause massive damage.
    • This experience opened Rob’s eyes to the power of progressive delivery and faster feedback loops.
  • After returning to Australia, Rob joined Octopus Deploy as employee #8 or #9, drawn by the company’s focus on deployment tooling and the interesting problems in the CI/CD space.
    • Octopus Deploy was originally built in Brisbane and had a startup culture where everyone, including the CEO, was an engineer writing code.

CI/CD Maturity Stages

  • Rob describes a maturity model that teams typically progress through:
    • YOLO: Deploying directly to production with no process.
    • Continuous Integration (CI): Merging code changes into a single branch and running tests continuously.
    • Continuous Delivery: Ensuring the deployment process itself is tested, so you could push to production at any time with a button click. Changes flow through dev, staging, and other environments, but the final production push may still be manual.
    • Continuous Deployment: Changes flow through the entire pipeline and reach production automatically without manual intervention.
  • The key difference between continuous delivery and continuous deployment is whether the final step to production is automated.
  • Not every company should aim for continuous deployment. Regulated industries, compliance requirements, and the need for coordinated release timing all make continuous delivery a perfectly valid end state.
  • The core value of reaching continuous delivery is risk mitigation: you’ve tested the process itself and can push whenever you’re ready.
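
The difference between continuous delivery and continuous deployment can be sketched as a single gate in front of production. This is a minimal illustration, not how any particular tool models it: the `Pipeline` class, the environment names, and the `approved` flag are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical environment names, in promotion order.
ENVIRONMENTS = ["dev", "staging", "production"]


@dataclass
class Pipeline:
    # True models continuous deployment, False models continuous delivery.
    auto_promote_to_production: bool
    deployed: dict = field(default_factory=dict)  # environment -> version

    def deploy(self, env, version):
        # Stand-in for the real (already tested) deployment process.
        self.deployed[env] = version

    def run(self, version, approved=False):
        """Promote `version` through the environments.

        Returns the last environment reached.
        """
        reached = None
        for env in ENVIRONMENTS:
            needs_button = env == "production" and not self.auto_promote_to_production
            if needs_button and not approved:
                break  # verified and ready, waiting for a human to press the button
            self.deploy(env, version)
            reached = env
        return reached
```

With `auto_promote_to_production=False`, a run stops at staging until someone calls `run(version, approved=True)`. With `True`, the same run goes all the way to production. Everything before the gate is identical, which is the point: the deployment process itself is exercised either way.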

Kubernetes and the On-Prem Reality

  • Kubernetes emerged from Google’s internal Borg system and was released partly to level the playing field between cloud vendors (AWS, Google Cloud, Azure) by making workloads portable.
  • It won the container orchestration space over alternatives like Nomad, Docker Swarm, Azure Service Fabric, and others, largely because of its cross-platform nature and cloud vendor adoption.
  • Despite being labeled “cloud native,” a significant number of Octopus Deploy’s Kubernetes customers run on-premise.
    • Financial institutions and regulated industries often want full control over their infrastructure while leveraging Kubernetes’ declarative model.
    • Examples include Kubernetes clusters running in point-of-sale systems across hundreds of stores, and even on research vessels at sea.
      • The research vessel case is particularly interesting: ships may be out for weeks or months, so deployments can only happen when they return to port, creating unique availability challenges.
  • Kubernetes fits into a broader trend of declarative infrastructure tools (like Terraform and Puppet) where you define desired state and the system continuously reconciles reality to match it.

GitOps: What It Actually Is

  • GitOps was coined by Weaveworks around 2017 and gained traction alongside Kubernetes. The core idea is extending Kubernetes’ continuous reconciliation pattern further back into the supply chain, pulling desired state from a Git repository.
  • The four pillars of GitOps:
    1. Declarative: State is defined declaratively, not through imperative steps.
    2. Versioned and Immutable: Desired state is stored somewhere with version history and immutability (a tag or commit SHA that can’t be changed).
    3. Pulled, Not Pushed: The GitOps agent pulls state from the repository rather than having it pushed from outside.
    4. Continuously Reconciled: The system constantly corrects drift between desired and actual state.
  • Despite the name, none of the four pillars actually require Git. The name has created an industry-wide misconception that everything must live in Git.
    • This causes real problems, like teams trying to store secrets in Git (even encrypted) when they shouldn’t be there at all.
    • As long as you can achieve versioning and immutability, the storage mechanism doesn’t matter.
  • GitOps is growing primarily because Kubernetes is growing. It’s most mature in the Kubernetes space, with some experiments extending it to Terraform and other infrastructure.
  • Rob’s pragmatic take: GitOps is not necessary for all teams. Some of the absolutism in the community is counterproductive. Teams should use GitOps principles where they help, but not force every process into a Git-centric model.
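
The four pillars can be sketched without any Git at all. In the illustration below, the "repository" is just a dictionary keyed by revision ID, which stands in for any store that offers versioning and immutability. The `Agent` class, the `reconcile` function, and the shape of the state are assumptions made for the example, not the API of any real GitOps tool.

```python
import copy


def reconcile(desired, actual):
    """Mutate `actual` until it matches `desired`; return the corrective actions."""
    actions = []
    for name, spec in desired.items():
        if actual.get(name) != spec:
            actions.append(("apply", name))
            # Copy so that later drift in the cluster cannot alter the stored revision.
            actual[name] = copy.deepcopy(spec)
    for name in list(actual):
        if name not in desired:
            actions.append(("delete", name))
            del actual[name]
    return actions


class Agent:
    """Runs next to the workloads and pulls desired state; nothing pushes to it."""

    def __init__(self, store, cluster):
        self.store = store      # revision id -> declarative desired state
        self.cluster = cluster  # live state, which may drift

    def sync(self, revision):
        desired = self.store[revision]  # pulled by the agent, pinned to one revision
        return reconcile(desired, self.cluster)
```

Calling `sync` on a timer gives continuous reconciliation: a run against an unchanged cluster returns an empty list, while a run after someone edits the live state by hand returns the actions needed to undo the drift.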

Platform Teams as an Organizational Pattern

  • Platform teams have emerged as a solution to the scaling problems that arise when every application team owns their own DevOps process end-to-end.
    • In the old model, dev teams threw code over the wall to ops teams.
    • DevOps improved this by giving engineering teams ownership of operations, but at scale, every team reinvented the wheel differently, creating fragmentation and context overload.
  • Platform teams define best practices and provide self-service mechanisms (often through an Internal Developer Portal, or IDP) so application teams can spin up projects from templates without becoming infrastructure experts.
    • Application teams still own the operational running of their services (preserving DevOps feedback loops) but don’t need to be experts in every deployment tool.
  • This pattern is common in larger organizations but not necessary for smaller companies where the app team can handle everything.

Progressive Delivery in Practice

  • Progressive delivery is the evolution beyond continuous delivery: instead of shipping to all of production at once, you release changes in a controlled, gradual way.
  • Canary deployments: Route a small percentage of traffic to a new version, gradually increasing while monitoring for issues. Named after the canary in a coal mine — an early warning system.
    • The unit of change is the entire application, so if you’ve made 20 commits, you’re testing all 20 changes at once.
  • Blue-green deployments: Run the new version alongside the old, validate it independently, then swap all traffic at once. Avoids some cold-start issues and gives you a validation window before customers are affected.
  • Feature toggles (feature flags): Rob considers these the most useful progressive delivery strategy for application-level changes.
    • The unit of change can be as granular as a single line of code.
    • Customer targeting is far more precise (e.g., “everyone from Germany with this product in their basket”).
    • Rollback is instant — flip a switch rather than redeploying.
    • Decouples deployment from release: you can ship code on Monday and enable the feature on Tuesday when you’re ready to monitor.
    • Multiple teams shipping simultaneously can each control their own release timing independently.
  • Schema changes remain the hardest problem in progressive delivery. They require careful multi-stage processes (expand and contract patterns) and are where most teams’ rollback strategies fall apart.
    • Octopus Deploy faces a unique challenge here because they support both SaaS (where they control the rollout) and on-prem (where customers might skip multiple versions).
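
A feature toggle with customer targeting can be sketched in a few lines. This is an illustration under stated assumptions, not the design of any specific toggle product: the `Toggle` class, the context keys (`user_id`, `country`, `basket`), and the example rule are hypothetical. The percentage bucket borrows the canary idea of exposing a change to only part of the audience.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


def everyone(ctx):
    # Default targeting rule: every user is eligible.
    return True


def german_with_gift_card(ctx):
    # Example rule: "everyone from Germany with this product in their basket".
    return ctx["country"] == "DE" and "gift-card" in ctx["basket"]


@dataclass
class Toggle:
    name: str
    enabled: bool = False                      # the switch flipped for instant rollback
    rule: Callable[[dict], bool] = everyone    # who is eligible at all
    rollout_percent: int = 100                 # share of eligible users who get it

    def is_on(self, ctx):
        if not self.enabled or not self.rule(ctx):
            return False
        # Hash the toggle name and user id so each user lands in a stable bucket.
        key = f"{self.name}:{ctx['user_id']}".encode()
        bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
        return bucket < self.rollout_percent
```

Application code would branch on `toggle.is_on(ctx)`, so the new path can be deployed on Monday with `enabled=False` and released later by changing configuration only. Setting `enabled` back to `False` turns the feature off for everyone without a redeploy.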

Rollbacks: Why You Should Roll Forward Instead

  • Many customers ask for a “rollback button,” but true rollback is harder than it sounds.
    • For completely stateless systems, GitOps makes rollback straightforward (revert the Git commit).
    • For systems with databases, rolling back code that has already interacted with a migrated schema can cause serious data inconsistencies.
    • Even providing “anti-migrations” to undo schema changes doesn’t solve the problem of data that was created or transformed under the new schema.
  • Rob’s strong advice: avoid talking about rollbacks entirely. Always roll forward.
    • If there’s a bug in version 2, the fix is version 3, not going back to version 1.
    • This is where fast feedback loops and hotfix processes matter most.
    • The only exception is stateless application logic, where a Git revert or feature toggle flip can safely undo a change.
  • Feature toggles are the ideal rollback mechanism for application changes because they’re instant and don’t require redeployment.
    • But even with toggles, schema compatibility must be maintained within the toggle’s code paths.
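
The expand and contract pattern mentioned earlier is what keeps rolling forward safe when a schema changes. The sketch below uses an in-memory SQLite database and a hypothetical `customers` table; the table, the column names, and the four function names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.execute("INSERT INTO customers (full_name) VALUES ('Ada Lovelace')")


def expand(conn):
    # Step 1 (expand): add the new columns next to the old one.
    # Code from the previous version keeps working because nothing was removed.
    conn.execute("ALTER TABLE customers ADD COLUMN first_name TEXT")
    conn.execute("ALTER TABLE customers ADD COLUMN last_name TEXT")


def backfill(conn):
    # Step 2: copy existing rows into the new shape.
    rows = conn.execute(
        "SELECT id, full_name FROM customers WHERE first_name IS NULL"
    ).fetchall()
    for row_id, full_name in rows:
        first, _, last = full_name.partition(" ")
        conn.execute(
            "UPDATE customers SET first_name = ?, last_name = ? WHERE id = ?",
            (first, last, row_id),
        )


def insert_customer(conn, first, last):
    # Step 3: new code writes both shapes, so old readers still see full_name.
    conn.execute(
        "INSERT INTO customers (full_name, first_name, last_name) VALUES (?, ?, ?)",
        (f"{first} {last}", first, last),
    )


def contract(conn):
    # Step 4 (contract): run only once no deployed version reads full_name.
    # A table rebuild is used here because it works on every SQLite version.
    conn.execute(
        "CREATE TABLE customers_new "
        "(id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
    )
    conn.execute(
        "INSERT INTO customers_new SELECT id, first_name, last_name FROM customers"
    )
    conn.execute("DROP TABLE customers")
    conn.execute("ALTER TABLE customers_new RENAME TO customers")
```

Between steps 1 and 4 both the old and the new application version can run against the same database, which is why a buggy release can be followed by a fix rather than a rollback. Step 4 is the only destructive step, and it is deferred until the old code path is gone.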

Feature Toggle Hygiene

  • Feature toggles create a maintenance problem: they’re easy to add and hard to remove, leading to toggle sprawl across the codebase.
  • Octopus Deploy manages this by:
    • Wrapping toggles with metadata about which team owns them and an expiry date.
    • Sending CI notifications when a toggle passes its expiry, prompting the owning team to clean it up.
    • Using observability to track when a toggle was last evaluated, helping identify safe removal windows.
  • Best practice: remove the toggle from code first, wait for that change to reach production (which may take weeks), then remove the toggle configuration from the platform. Tools that track this lifecycle prevent premature deletion.
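
The hygiene practices above can be sketched as a small check that a CI job might run. This is not Octopus Deploy's implementation: the `ToggleDefinition` fields, the report format, and the `quiet_days` threshold are hypothetical choices made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ToggleDefinition:
    name: str
    owner_team: str    # who gets notified
    expires_on: date   # when the toggle should be gone from the codebase


def expired_toggles(toggles, today):
    """Return the toggles whose expiry date has passed."""
    return [t for t in toggles if t.expires_on < today]


def ci_report(toggles, today):
    """Build one notification line per expired toggle for the owning team."""
    return [
        f"{t.name}: expired {t.expires_on.isoformat()}, owner {t.owner_team}"
        for t in expired_toggles(toggles, today)
    ]


def safe_to_delete_config(code_removed_in_production, last_evaluated, today, quiet_days=14):
    """Platform-side configuration goes last.

    It is deleted only after the code that reads the toggle has left production
    and the toggle has not been evaluated for `quiet_days` days.
    """
    if not code_removed_in_production:
        return False
    return (today - last_evaluated).days >= quiet_days
```

Passing `today` in as an argument, rather than reading the clock inside the functions, keeps the check deterministic and easy to test. The second function encodes the removal order: code first, configuration only after a quiet period.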

AI’s Impact on CI/CD

  • AI is the elephant in the room, but its impact on CI/CD is still early and tightly coupled to how development teams adopt AI coding agents.
  • The biggest expected change: much higher code velocity, which means pipelines will need to handle significantly higher throughput.
  • The emphasis may shift from pipeline speed to risk management:
    • When AI agents generate code and can babysit the build/test process themselves, shaving 10 minutes off a build matters less because the human engineer has already moved on.
    • The focus becomes managing the risk of AI-generated code reaching production.
  • Feature toggles become even more important in an AI-heavy world:
    • They let you ship AI-generated code fast while controlling feature rollout independently.
    • AI agents can use toggles to react to issues quickly without human intervention.
  • Octopus Deploy is adding AI capabilities pragmatically (MCP server, recovery agents that review logs) rather than plastering “AI” on everything for marketing purposes.

Development Environment Evolution

  • The classic environment progression is dev → test → prod, but this is a simplification.
    • Dev is the first point of integration — does the deployment process work at all?
    • Test is kept in sync with production (using sanitized data) for QA and product review.
  • Ephemeral environments are increasingly replacing shared test environments:
    • Each feature branch gets its own full-fledged environment spun up pre-merge, with all necessary dependencies.
    • The team can access it (e.g., via a URL) to validate the feature, then it’s torn down when the PR is merged.
    • This eliminates contention for shared test environments and speeds up feedback.
  • Cloud-based development environments (like VS Code connecting to cloud containers) were heavily discussed a few years ago but have faded from the conversation, possibly due to complexity with multi-service environments and stateful dependencies.
  • AI agents make ephemeral environments even more valuable: an agent can spin up an environment, validate its own code (especially UI changes), and tear it down — all without human involvement.
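
The lifecycle of an ephemeral environment (created when a pull request opens, reachable via a URL, removed when the PR closes) can be sketched as follows. The class, the event method names, and the `preview.example.com` domain are hypothetical; a real setup would provision infrastructure where this sketch only records a URL.

```python
import re


def slugify(branch):
    """Turn a branch name into a DNS-safe label fragment."""
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    return slug[:40].strip("-")


class EphemeralEnvironments:
    def __init__(self, base_domain="preview.example.com"):
        self.base_domain = base_domain
        self.active = {}  # PR number -> environment URL

    def on_pr_opened(self, pr_number, branch):
        # A real implementation would deploy the branch and its dependencies here.
        url = f"https://pr-{pr_number}-{slugify(branch)}.{self.base_domain}"
        self.active[pr_number] = url
        return url

    def on_pr_closed(self, pr_number):
        # Tear down on merge or close; returns the URL that was removed, if any.
        return self.active.pop(pr_number, None)
```

Because each environment is keyed by PR number, two branches never contend for the same test environment, and an agent or a reviewer can be handed the URL as soon as `on_pr_opened` returns.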

Running Octopus Deploy’s SaaS Platform

  • Octopus Deploy’s SaaS offering started as an experiment in 2020 using dedicated VMs per customer, which was not cost-effective: infrastructure cost $100 per customer per month against roughly $20 of revenue.
  • It was rebuilt on Kubernetes using a cell-based architecture called a “Reef”:
    • Each customer instance runs in its own pod within a cluster, with dedicated Azure database and other resources.
    • This allowed the SaaS offering to become viable and scale to several thousand customers running thousands of deployments per month.
  • Current engineering focus: making the deployment process itself more resilient.
    • Currently, deployment steps are stored in memory, requiring downtime during upgrades (stop tasks, kill instance, spin up new one).
    • The goal is to reduce this downtime to as close to zero as possible, though going from seconds to true zero-downtime is a much larger architectural challenge.

The On-Prem Reality and Business Strategy

  • Octopus Deploy maintains both SaaS and on-prem offerings, which creates significant engineering complexity.
    • SaaS allows gradual, controlled rollouts (a few days to reach all customers).
    • On-prem customers upgrade on their own schedule: it takes ~200 days for 50% of on-prem customers to adopt a new version, and ~400+ days for 75%. Some customers run versions from 5-7 years ago.
    • This means every new release must support upgrades from very old versions, adding significant testing and schema migration burden.
  • The majority of Octopus Deploy’s customers are still on-prem: banks, other financial institutions, and governments that want full control over their infrastructure.
  • Supporting on-prem is a deliberate business strategy:
    • It’s where the demand and revenue are.
    • There’s less competition because most infrastructure startups have gone SaaS-only.
    • Customers who are happy with an old version and keep paying are, from a business perspective, ideal customers — even if the engineering team wants them on the latest version.
  • This mirrors a broader pattern with AI: some customers will want to pin a specific model version and run it on their own infrastructure rather than accept continuous updates.
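
The requirement that every release must upgrade cleanly from very old versions is commonly met with an ordered chain of migrations, so a customer who skipped several releases simply replays every step in between. The sketch below is an assumption about how such a chain can look, not a description of Octopus Deploy's upgrade mechanism. It uses SQLite's `user_version` pragma to record the installed schema version, and the tables are hypothetical.

```python
import sqlite3

# Ordered, append-only list of schema changes. Shipped entries are never edited.
MIGRATIONS = {
    1: "CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT)",
    2: "ALTER TABLE projects ADD COLUMN description TEXT",
    3: "CREATE TABLE releases (id INTEGER PRIMARY KEY, project_id INTEGER, version TEXT)",
}


def schema_version(conn):
    # SQLite stores one integer per database for exactly this kind of bookkeeping.
    return conn.execute("PRAGMA user_version").fetchone()[0]


def upgrade(conn):
    """Apply every migration newer than the installed version, in order."""
    applied = []
    for version in sorted(MIGRATIONS):
        if version <= schema_version(conn):
            continue
        conn.execute(MIGRATIONS[version])
        # `version` is an int taken from our own dict, so formatting it in is safe.
        conn.execute(f"PRAGMA user_version = {version}")
        applied.append(version)
    conn.commit()
    return applied
```

An installation still on schema version 1 and a brand-new installation end up with the same tables after `upgrade`, which is the property the testing burden described above has to protect for every supported starting version.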

Recommendations

  • For engineers wanting to adopt progressive delivery: start with one feature toggle. The first time you use it to instantly turn off a production bug at 2 AM, you’ll never want to go back. The main risk is toggle sprawl, so build hygiene processes from the start.
  • Book recommendations:
    • The Phoenix Project by Gene Kim, Kevin Behr, and George Spafford: a foundational text on why engineers should be involved in the operational side of what they ship. Parts are dated, but the core message is timeless.
    • Radical Candor by Kim Scott: a framework for communicating with both empathy and directness, useful for any engineer working with teams.
    • Anything by Greg Egan (e.g., Diaspora, Schild’s Ladder): hard science fiction by a mathematician that builds entire stories from single scientific premises.