Systematically Terraforming a Brownfield of Cloud Infrastructure
Some thinking, trade-offs, theory building, and method-making one might end up doing in the course of bringing Infrastructure as Code (IaC) discipline to brownfield (and greenfield) services, at a small regulated fintech company, with an even smaller engineering team that serves several business units, including one of India's largest national tax gateways. Only somewhat easier than reading a long compound sentence without pausing for breath. Phew.
Contents
Nota bene
Take what is useful, discard the rest.
- This post was salvaged from draft hell 1. Parts of it may not have aged well. Other parts may be repetitive and repetitive.
- It is not specifically about Hashicorp's Terraform 2 (or about me 3), though both of us are players on the board.
- Follow the ToC for a quick topical summary.
- The sample Terraform monorepo layout here illustrates what I ended up making.
This is a long read.
Suit up!

The field is brown
A Fortune 500 CTO counts many zeroes (and nines). A Startup CTO counts fewer. Yet they are equally sleep-deprived.
— Zen koan
'twas the Before Times, at a small "fintech" company with a smaller engineering team. 'twas a CTO one really wanted to work for. Shortly after joining his team, one was bidden: "Go forth! Terraform all our Infrastructure."
The Goal…
… 4 was to replace the extant ad-hoc semi-automation, with a comprehensive system for use company-wide, over the long haul.
"Mnhmm, so like, we're a 50-ish person company? Total? How hard could this be?"
— My brain.
Such pleasant thoughts.
One looked closer.
And closer.
Closer…
Until one was staring into the abyss, haunted by the specter of abject failure. Because complexity manifests stunningly easily, even in a small company.
A conglomerate in startup clothing
Though small, the company had (has) many business interests (and consequently, business units): supply chain finance, corporate tax compliance, services that touch receivables and payments. This means…
- Handling many different categories of customer financial records.
- Working with banks, NBFCs, government IT infrastructure (private MPLS lines, anyone?), and enterprise IT systems (a lot of SAP!).
- Working under lavishly sprawling Banking and Finance regulations.
- Keeping Enterprise IT certification current (PCI-DSS, ISO…).
Smallness, alas, is no excuse to the powers that be. So. Many. Audits. Plus, each business unit has its own distinct customer base, operating model, and third party integrations.
From a DevOps point of view, such business diversification would have been merely onerous in a less regulated industry. However, in a heavily regulated one, it spells trouble. All business units must be siloed not just legally, but also computationally (repo owners, builds and deploys, storage, servers, networking etc.).
If everything is siloed, how does one make One System to rule them all?
No stopping the world
While we build out this grand new all-encompassing vision, business continuity demands that the business continue to use the legacy mixed bag of semi-automated scripts and manual AWS console processes. And people must be able to build their new stuff while our new stuff is still not up to snuff.
Organisation-wide impact
"Uhhh, so like, if this project fails it will, at the very least, nuke several person-years of productivity? Because several teams are counting on it, and they will each be set back at least several months?"
— My brain, staring into the abyss.
The un-system we had was getting unwieldy and risky by the day, owing to the increasing pace and diversification of the company's business interests. Some, like the national tax compliance gateway, were growing fast. Meanwhile, extant infrastructure management and IT operations had accreted over many years, trailing the accretion of business interests.
But the un-system we had worked. So if the new thing didn't, the old thing would eventually fail too, and by that time nobody would have the time or budget to make another new thing without taking away from whatever they were actually supposed to be building. Not that I would know, because then I'd be out of a job.
No move fast and break things
One of the unique delights of regulation / compliance is that a CTO must sit on their hands for six months before they can allow people like you and me to touch production. Nobody is allowed to move fast and break things here, sorry.
Luckily there is not-production which you and I can break at will. We know because we have, in fact, broken non-prod. Some of us (not I, sadly) have done it often enough to earn a custom emoji in the corporate Slack.
One more thing…
The number one priority for this Infra-as-code project was to "lift and shift" a tax gateway service from its crufty docker swarm situation to "something that lets everybody sleep better at night".
A fifth of the whole country's corporate tax filings were flowing through said gateway. Traffic was spiky and bi-weekly, aligned with routine corporate tax compliance deadlines. However, the service was gearing up for "e-way bill" tax compliance traffic, which would be live, and cause at least an order of magnitude jump in total traffic. Goods carriers must generate an "e-way bill" for the exact shipment, and register it with the tax gateway on a "just in time" basis. Transit is not allowed sans that document.
In other words, this service absolutely could not afford to go offline at any point in the infrastructure migration (and certainly not after!). Otherwise a large fraction of India's corporate tax filings, and worse, goods transit would be disrupted. When it comes to things that must flow, even small disruptions have large knock-on effects 5.
So no pressure, really.
A theory of change
Our CTO had evaluated options and chosen Hashicorp's Terraform 6 for our IaC requirements.
Tools model reality
Any tool imposes design trade-offs. Choosing it means we have chosen its baked-in assumptions and opinions about what IaC means. Design dominates everything, and so, to have a shot at success, one must deeply understand the tool's model of reality.
In the case of Hashicorp Terraform, understanding its state model, and learning the ins and outs of its behaviour under live fire, is critical to one's thinking about change, and to one's process and workflow of change management.
All green fields turn brown
My experience is certainly not unique. Most of us have had to introduce Terraform in an already-live, maybe semi-automated runtime. Even if one starts "greenfield", change is inevitable. One's system will change in small and big ways, and one's codebase will grow and be refactored. Terraform itself will change too. One will have to keep up with all these changes. For example, I started using Terraform circa HCL 0.11.x and then had to migrate our codebase to HCL 0.12.x, which introduced breaking changes in HCL itself.
Operator knowledge is tacit
As with any operator tool, real-world Terraform usage involves a lot of tacit and tribal knowledge, in addition to a solid grasp of one's local operating context. The official documentation and manuals can do only so much for an operator.
It is therefore important to discover ways to model reality with the tools at hand, and construct a closely reasoned theory of change suited to the needs of the business.
Learn from all available prior art
A ton of prior art—tribal knowledge—helped me understand how to use Terraform effectively, while meeting my system design goals. Thanks to all these people for publicly sharing how they think. When there are no textbooks, stories are all we have.
Design sensibilities: My first devops role (we called our practice "production engineering") shaped my thinking about infrastructure automation. My former colleague and team lead explains it in his talk:
Automation Principles at Helpshift - Raghu Udiyar (video, slides)
Everyone knows automation in operations is necessary, and is a key objective of any DevOps practice. But if not started on the right foot, or with the right objective in mind, it can often lead to false starts and bad design choices. Building on it and extending it to incorporate new requirements gets increasingly difficult, leading to hacks and rewrites.
At Helpshift we follow certain principles and guidelines that have helped us avoid these pitfalls, and allow us to build a stable and extensible automations framework using Ansible.
I will walk through these principles, the thought process behind them, and the results through demos.
Raghu is a production engineering manager at Helpshift, leading a team responsible for the Helpshift infrastructure: operations, systems architecture, performance engineering, etc.
Terraform mastery: advanced users sharing how they solved various real-world problems; simple as well as sophisticated.
Experimental Self-education
The regulators created mandatory bench time, my CTO provided resources and liberty to run all sorts of experiments, and modern clouds make it cheap to run them. So I had leeway to figure out:
- What the heck is "immutable infrastructure" supposed to be?
- Ought we even treat infra as immutable, and if yes, what are the rules?
- The constraints and sharp edges of the HCL language.
- The ins and outs of state management with Terraform.
- Design and tooling for safe change management, operations procedures and operator safety, staff onboarding, compliance audits, code migrations, error recovery etc.
The intention was to:
- Locate any hard constraints and one-way streets (architecture/service lock-ins)
- As far as possible eschew one-way choices
- Find and validate at least one migration path away from any foundational one-way choice
- Identify stated and unstated system boundaries
- Find and mitigate limitations, warts, and idiosyncrasies of the chosen tools
- Discover safe workflows and operational properties of the new management system
- Articulate "how-to" and "why-to" knowledge for operators and maintainers
With 20/20 hindsight, here is a list of exercises to do, to acquire das Fingerspitzengefühl with Terraform. In real life, I did these things as they occurred to me.
Bare-bones Setup
Write some terraform to set up some infrastructure:
- Write everything in a single directory
- Let it be something small but representative; let's say, three or four distinct resources. e.g. a VPC with public/private subnet, an S3 bucket, maybe a datastore.
- Do NOT use third-party modules at this time. Just keep it simple and write everything yourself.
Progressively riskier code refactoring
- Small refactor: Using the bare-bones setup, try to improve names of some resources (run-of-the-mill refactoring). Run terraform plan and observe the changes.
- Small run-time change: Next, go to the AWS console and alter some resource properties there, such as resource names, or resource config parameters. Try to fix your terraform to the new reality (e.g. someone did an emergency fix via the AWS console).
- Medium refactor: Now try to split your terraform code into a better directory structure… assume the codebase has grown and needs to be factored better. Pass dependencies from one directory to another. Consider what order you would need to run terraform plan/apply in, if something changes in one of the directories.
- Large refactor: Try to bring in a third-party module to manage one of the pieces. This emulates the situation where you originally decided said third-party module was overkill, but your usage has since grown to the point where adopting the module is the better choice.
- Add environments: Now try to support a "test", "staging", and "prod" setup, with the same structure but varied configuration, without duplicating code. The setups must be controlled by variable files only. Configs can change (e.g. you want cheaper resources in staging), but the architecture must be the same, because changes to infra must be testable in a safe environment before propagating to "staging", and then finally to "production". (See the sketch after this list.)
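A minimal sketch of that last exercise, assuming an AWS provider and hypothetical variable names: the architecture lives in code once, and only the knobs vary per environment via tfvars files.

# variables.tf -- every knob is a declared variable; no hard-coded values
variable "env_name"       { type = string }
variable "ami_id"         { type = string }
variable "instance_type"  { type = string }
variable "instance_count" { type = number }

# main.tf -- identical for every environment
resource "aws_instance" "app" {
  count         = var.instance_count
  ami           = var.ami_id
  instance_type = var.instance_type

  tags = {
    Environment = var.env_name
  }
}

# env-staging.tfvars -- same architecture as prod, cheaper parts:
#   env_name       = "staging"
#   instance_type  = "t3.small"
#   instance_count = 1
#   ami_id         = "ami-0123456789abcdef0"   (placeholder)
#
# terraform plan -var-file=env-staging.tfvars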
Migration Exercise
Try to change something fundamental, like your VPC's CIDR block (e.g. a new requirement to migrate to a different VPC structure). Once a VPC is created, its primary CIDR block cannot be modified. There is no way to do an in-place upgrade, so a "whole data center migration" is needed.
- Think through everything that must be accounted for in order to make the migration succeed.
- Assume the current VPC is a live system that cannot go down while the migration is in progress.
Disaster Exercise
Lose your state file, but keep your infrastructure. Try to rebuild your Terraform state file from your live infrastructure.
Destroy critical parts of your infrastructure, but keep your state file. Try to restore your infrastructure from the state file.
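A hedged sketch of the first drill, assuming the lost state covered an AWS VPC and an S3 bucket whose live IDs can still be read off the console or CLI: terraform import re-attaches each live resource to a fresh, empty state, one address at a time.

# Rebuild state from live infrastructure (IDs here are placeholders).
terraform import aws_vpc.main vpc-0abc1234def567890
terraform import aws_s3_bucket.artifacts example-artifacts-bucket

# Sanity check: a clean plan means the rebuilt state matches both the
# code and the live infrastructure.
terraform plan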
Take copious notes about the steps taken, issues faced, manual workarounds needed. And at the very least, make notes about how to:
- Prevent this from happening.
- Recover if the worst happens.
- Teach colleagues how to respond.
I didn't attempt to solve for the whole thing, only attempted prevention. This, I did with a combination of S3 bucket versioning policies + IAM policies + scripted CLI to defend against typos / accidental command invocations + checklists for change management procedures + some operator training (more like admonishment: "If we lose that state file, we're dead.").
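For the prevention piece, here is a minimal sketch of a versioned state bucket plus a lock table, written in the provider syntax of that era; names are illustrative, and this is only a slice of the IAM and tooling mentioned above.

# The bucket that holds all remote state, with versioning on so that
# an overwritten or deleted state file can be recovered.
resource "aws_s3_bucket" "tf_state" {
  bucket = "example-terraform-state"
  acl    = "private"

  versioning {
    enabled = true
  }
}

# A DynamoDB table for state locking, so two operators cannot
# apply against the same state at the same time.
resource "aws_dynamodb_table" "tf_lock" {
  name         = "example-terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}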
Service not fully supported by the tool
This is a bonus exercise. I did it, by chance, without wanting to.
Pick the most complicated service from your cloud provider that you want to use in "guru" mode; configured to the hilt. See if Terraform supports it fully. Or trawl the AWS provider issues to find a service that it does not yet support fully. Figure out what you'd do in that case. Legitimate options include:
- Patch the provider source.
- Work around with an out-of-band script and/or checklist.
- Do nothing. Defer until the provider rolls out an update.
In my case the Terraform AWS Provider did not support certain advanced routing rules for the AWS Application Load Balancer. If memory serves, this was being discussed in the AWS provider Github at the time. Since I had a one-time-only use case (a live migration), I got away by scripting the rules, because AWS CLI had the requisite support.
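For flavour, an illustrative (not the actual) out-of-band workaround of that sort, using the AWS CLI to create an ALB listener rule that the provider of the day could not express; the ARNs are placeholders.

# One-off script kept alongside the Terraform code, run by hand
# during the migration window.
aws elbv2 create-rule \
  --listener-arn "$LISTENER_ARN" \
  --priority 42 \
  --conditions Field=path-pattern,Values='/api/v2/*' \
  --actions Type=forward,TargetGroupArn="$TARGET_GROUP_ARN"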
Creating the design space
Though I didn't invent anything new, all the outside lessons coalesced through The Force within 7, into a (I think) rather neat solution, over the weeks and months it took to find it. In other words, I made it up as I went along and what follows could be totally not industry standard practice!
Here is what I thunk (and think).
Large-scale Architecture must be Functional and Composable
Terraform itself, however, is not functional. It is a highly stateful, object-oriented system, with what amounts to remote procedure calls: it complects structure, behaviour, and state. Ansible, by contrast, is imperative in the small but functional in the large, because it is an idempotent transform of infra -> infra, with UNIX-like composable workflows and dynamic in-process state used only as a reference (dynamic inventory).
Constrain change, not people
As a designer, I strongly prefer to build a system that helps us:
- Strictly gate-keep, structure, and micro-manage changes to live production, at one extreme of the management spectrum.
- And at the other extreme, enable a fresh engineer to play with infrastructure willy-nilly. On day one of employment, a newcomer should be able to clone the codebase and follow a bunch of instructions to spin up, modify, destroy production-like infrastructure without any supervision whatsoever.
Any infrastructure automation system must limit the "blast radius" of changes. The longer a thing lives, the greater the odds that something will go awry. At the same time, one must make experimentation easy and safe, to ease change management, and to onboard new people through tons of hands-on practice.
If a system has such a capability, it signals to me that it has solved for strategic risk management as well as daily-driver operator safety.
Change service infrastructure to suit the management model
A key change was to migrate application runtime to a "serverless" model, because this gels better with the convenient fiction of "Immutable" infrastructure.
The company had already chosen to switch to managed databases, and had previously adopted containers. However, we operated our own EC2 VMs and/or our own Docker Swarms. We adopted AWS Fargate 8, and it worked out quite well for us.
So if there is a general design principle for Terraformers, it would be: "Subtract explicit management of mutable infrastructure (e.g. a pool of EC2 VMs)." Managed infrastructure bills can rack up fast, so use some good ol' napkin arithmetic to make a sound tradeoff (management complexity × operating expense × degree of control and observability needed × vendor lock-in).
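As a flavour of what "subtracting mutable infrastructure" looked like in code, here is a hedged sketch of an ECS-on-Fargate service; all names and variables are illustrative, not our actual setup. The point is that there is no fleet of EC2 VMs left to patch, resize, or babysit.

variable "ecs_cluster_id"         { type = string }
variable "ecs_execution_role_arn" { type = string }
variable "app_count"              { type = number }
variable "private_subnet_ids"     { type = list(string) }
variable "app_security_group_id"  { type = string }

resource "aws_ecs_task_definition" "app" {
  family                   = "example-app"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "1024"
  execution_role_arn       = var.ecs_execution_role_arn

  container_definitions = jsonencode([
    {
      name         = "app"
      image        = "example/app:stable"
      portMappings = [{ containerPort = 8080 }]
    }
  ])
}

resource "aws_ecs_service" "app" {
  name            = "example-app"
  cluster         = var.ecs_cluster_id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = var.app_count
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [var.app_security_group_id]
  }
}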
Build your own tooling instead of buying into a framework
There are many frameworks and tools out there to help operators create and maintain their terraform code, as well as to apply it safely against live services.
However, I felt Terraform itself was a huge variable. Third-party tools and frameworks impose their own bespoke ideas and abstractions. Outsourcing critical thinking (why to do what?) to industrial automation (frameworks) is quick and convenient in the short term, but risky in the long run. I wanted to ensure we acquired enough first-party operational knowledge about Terraform, specific to our unique constraints, before locking ourselves into some third-party approach.
This paid off because, as I progressively discovered what worked for us, it became clearer that none of the extant third party tools / frameworks / guides would have served us in the long run. Further, because we needed to cater to only our specific workflow automation needs, only in AWS, we could tool up with just a handful of tiny shell scripts.
Find a core unit of abstraction and define it clearly
- We "Layer" infrastructure: On-disk, a "Layer" is just a directory containing Terraform files, and crucially, has its own state file. The files follow a certain naming convention to help separate functionally different pieces for ease of maintenance. A Layer includes a notion of 'main' code by convention, which can be hand-rolled Terraform, or a third party "Module".
- A "Layer" is a stateful object: Each "Layer" defines some piece infrastructure (say, a VPC), or a "service" (cluster of apps, or data store), or a prerequisite (e.g. IAM). The art lies in choosing what to put in a Layer, because it is a unit of composition and a unit of change management of some logical unit of one's live runtime. Layers can look different for different systems.
- A "Layer" is NOT a "Module": To reiterate, in this architecture, a Layer is a stateful object. It is not to be confused with a "Module", which is also just a directory containing Terraform files. However, a Module is supposed to be a unit of HCL code reusability, and so, it must not be associated with any state. i.e. a "Module" must be function-like.
Share Practices, Protect Secrets
Each business unit is a silo by necessity. It owns and operates its own infrastructure and services. Each unit needs total control over its secrets, its specific system architecture, its pace of development, and crucially who can effect changes in production.
However a small company of this type can't afford to — in fact, it simply cannot — have each unit build their automation from the ground up. Thus, each business unit also needs to use and share the meta-stuff; tooling, techniques, workflows, operator experience.
Think like Terraform when laying out code
Terraform's programming and state model firmly dictated directory layout, entity naming convention, coding convention, etc. All of this is version controlled under git, of course, which in turn allows us to safely run tooling that modifies code en masse (e.g. recreate auto-generated files as needed).
Use a mono-repo for great good
A single repository can contain a directory for each department. It's always possible to split out each department into its own repo later, if some hard constraint dictates it be so. Absent that, it's far better to put it all in one repo. Doing this well is invaluable in helping departments share practices while protecting secrets.
# Each "department" runs / owns its own infrastructure,
# but benefit from shared tooling.
project-repo-root
|
| # shared tooling and scripts
|_ tools/*.sh
|
| # Each "department" owns and maintains its own infra.
| # in a dedicated directory.
|_ department-A
|_ department-B
|_ department-C
|
| # Global Vars must be common to two or more "layers",
| # e.g. CIDR blocks, security groups, load balancers etc.
|_ global-vars/{test,dev,prod}.tfvars
|
|_ vpc-layer
|
|_ iam-layer
|
|_ common-load-balancer-layer
|
|_ alpha-app-servers-layer # uses common load balancer
|
|_ delta-app-servers-layer # uses common load balancer
|
|_ job-runners-layer # does not need load balancer
|
|_ search-service-layer # has own dedicated load balancer
|
|_ ... # more layers
|
|_ ... # more layers
|
|_ foobar-layer
| # variable declarations auto-generated using
| # tfvars declared in department-wide global-vars
|- input-global.tf
|
| # variables declared and maintained manually
| # by THIS layer's department / owner
|- input-local.tf
|
| # terraform version etc. must be hard-coded
| # because of HCL's bootstrap problem
|- main-hardcoded.tf
|
| # "main" has ALL the resources for the layer.
| # Everything is parameterised based on env name.
|- main.tf
|
| # Export state to layer N+1 that uses THIS layer
|- output.tf
|
| # Per-environment variables and input-local overrides,
| # PLUS mandatory state isolation for each environment.
|- env-test.tfvars
|- env-test-s3-backend.conf
|
|- env-stage.tfvars
|- env-stage-s3-backend.conf
|
|- env-prod.tfvars
|- env-prod-s3-backend.conf
This is a classic monorepo structure, albeit with the following rationale / utility:
- At the top level, each department's code is namespaced in a directory named after it. All shared workflows/scripts go into a top-level "tools" directory too.
- Inside a "department", the directory-per-Layer structure is precisely because Terraform treats all files in a directory as one thing.
- Inside each layer, files follow a definite naming convention to tell maintainers where to look for what. e.g. "input-*.tf" and "output.tf" files are interface boundaries between a layer and the "global" parts of the system, and also between Layers. Some boilerplate HCL code is also separated out from a maintenance point of view. E.g. main-hardcoded.tf can simply be overwritten using a tooling script if we need to bump Terraform's version, and "input-global.tf" can be regenerated for all layers whenever global-vars change. (A sketch of these boilerplate pieces follows after this list.)
- Within each .tf file, every configuration parameter is written as a variable. There are no hard-coded values whatsoever.
- Each Layer has environment-specific configurations: The environment-specific .tfvars files control the value of variables that are environment specific. e.g. The number of VMs in a cluster, or the configuration of a load balancer, or a database. The vars files can optionally be encrypted on disk by department / team.
- Unified command-line interface: And last but not least, given a good directory name-spacing and file naming convention, one can design a standard, unified command-line workflow to help operators of all departments safely run all the tooling, be it for maintenance, update, or plan/apply. Pragmatic tooling serves as living documentation of operations practices, and multiplies benefits across departments.
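A hedged sketch of those boilerplate pieces and the per-environment init/plan invocation, with illustrative bucket names and versions.

# main-hardcoded.tf -- pinned versions plus an *empty* backend block,
# because backend configuration cannot reference variables
# (HCL's bootstrap problem).
terraform {
  required_version = "~> 0.12.29"

  backend "s3" {}
}

# env-test-s3-backend.conf -- per-environment state isolation,
# injected at init time via partial backend configuration:
#
#   bucket = "example-terraform-state"
#   key    = "test/department-A/foobar-layer/terraform.tfstate"
#   region = "ap-south-1"
#
# The wrapper tooling (or an operator) then runs, per environment:
#
#   terraform init -backend-config=env-test-s3-backend.conf
#   terraform plan -var-file=../global-vars/test.tfvars -var-file=env-test.tfvars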
Thinking in Systems pays off
Big systems are made of little systems. The overall system (design convention + code layout + tooling + ops workflow) helped address several problems operators face.
Over the years, I've learned some new terms and concepts that helped me reflect on what I was doing and frame it in ways I can explain to others. This section outlines various levels of abstraction to think at, as a designer of an infra-as-code system.
Though most design choices were intentional, not everything was clear and organised a priori in my head. Much was synthesised while in the thick of things, wrought of common sense, trial and error, a bunch of luck, and prior operator experience. A gray beard helps, I am told :)
Think brownfield-first
- Design for graded migrations.
- Expect multiple tactical lift-and-shift operations. Use these as learning / improvement opportunities.
- Discover the right IaC architecture and abstractions.
- Find a balance between services, procedures, tooling and operational complexity. e.g. your team already has infra, but wants to adopt terraform (pretty much every team in the world)
- don't bother with VPC/IAM/crazy-critical stuff; start by replacing replaceable services (e.g. things that can be blue/green deployed)
- meanwhile each non-migrated layer can simply be a stub, importing state from the live environment via data sources, and exporting only relevant state via output.tf (sketched after this list)
- then slowly terraform in rising order of mission-criticality / global side effects (e.g. borking IAM can blow up everything)
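A hedged sketch of such a stub layer, assuming the live VPC was created by hand and we know (or can look up) its ID; all names are illustrative.

variable "legacy_vpc_id" { type = string }

# The stub reads live, hand-made resources via data sources...
data "aws_vpc" "legacy" {
  id = var.legacy_vpc_id
}

data "aws_subnet_ids" "legacy_private" {
  vpc_id = data.aws_vpc.legacy.id

  tags = {
    Tier = "private"
  }
}

# ...and exports only what downstream layers need, via output.tf,
# so they can be written as if the VPC layer were already terraformed.
output "vpc_id" {
  value = data.aws_vpc.legacy.id
}

output "private_subnet_ids" {
  value = data.aws_subnet_ids.legacy_private.ids
}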
Create an environment for elite devops culture
Be hardcore about "full-systems" IaC
- No more button-clicking in AWS console
- Fine-grained IAM and security controls
- Centralized Logging and Monitoring
- CI/CD pipelines and push-button deploys
- Auto-scaling
Promote Safety Culture. Security derives from this.
- Psychological Safety
- Operator Safety
- System Safety
Deliver vastly better capabilities than the extant un-system.
- Zero-downtime live infrastructure switching
- Composability
- System Invariants
- Dev/stage/prod parity of architecture and subsystem relationships
Promote tool ownership.
- Learn/teach the trade-offs imposed by Terraform's model.
- Embrace limitations and work around them.
Craft procedures and tools for safe incremental changes.
Incremental change is routine. Errors happen when routine tasks are poorly understood. Every operator should be trained in routine procedures, but nobody should have to operate relying solely on their own memory. e.g. In the system I arrived at, introducing new common vars meant running through a checklist like this:
- Manually add var(s) to the global environment file(s) (tfvars files located via the standard file and directory naming convention).
- Use a script to auto-generate input-global.tf for all layers (a sketch of such a script follows below).
- Check git diff to verify it's OK.
- Commit.
- Plan/Apply, and
- If success, immediately push commit to remote
- If failure, recover from failure, commit a fix
- GOTO Plan/Apply
Ditto for migrating main-hardcoded.tf (e.g. doing HCL version upgrades).
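A hedged sketch of what such a regeneration script might look like; the path tools/gen-input-global.sh is hypothetical, and it assumes simple name = value tfvars with one assignment per line.

#!/usr/bin/env bash
# Regenerate input-global.tf for every layer of a department, deriving
# variable declarations from the names found in a global tfvars file.
set -euo pipefail

department="$1"    # e.g. department-A
tfvars_file="$2"   # e.g. department-A/global-vars/test.tfvars

for layer in "${department}"/*-layer; do
  {
    echo "# AUTO-GENERATED from ${tfvars_file} -- do not edit by hand."
    grep -oE '^[A-Za-z_][A-Za-z0-9_]*' "${tfvars_file}" \
      | sed 's/.*/variable "&" {}/'
  } > "${layer}/input-global.tf"
done

A git diff then shows exactly which layers picked up the new declarations, which is what the "check git diff" step of the checklist verifies.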
The acid test for a rock-solid routine procedure is: can any operator run it flawlessly under live fire?
Figure out decision-making at various levels of abstraction
Decisions about separation of concerns based on "unit-ness" of a thing:
Examples:
- If multiple services are to share a load-balancer, then the load-balancer must be its own layer.
- If a service has its own load balancer, then the "layer" is service+load-balancer.
- Each datastore gets its own layer, and if a store is multi-tenant, it may make sense to further isolate each tenant to its own layer
Decisions about separation of layers based on rate of change:
Examples:
- A VPC once defined almost never changes, and it is shared by everything, so it goes in its own layer
- IAM roles and policies are added/updated (rarely removed), and shared, so they go into an IAM layer. Also good for IAM visibility.
- Security groups are more frequently changed, and are shared, so they go into a single layer
Decisions about separation of environments based on access level of operator:
Example: allow anybody in the ops team to CRUD the 'test' environment. These are by definition throwaway, and encourage experimentation — "fearless devops".
- enforce access to s3 state — newbies don't get access to "staging" and "prod" state — even terraform plan will fail (see the policy sketch after this list)
- some staff get access to "staging", once they become familiar with team ops
- no individual user gets IAM on "prod" state; this is a controlled user, probably on a jump box or some such place, accessible only via a change management process, and where durable audit logs note who used it when, and why
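A hedged sketch (illustrative bucket and policy names) of what such enforcement can look like in IAM: a newcomer's role gets this policy and nothing more, so even terraform plan against "staging" or "prod" state fails at the S3 layer.

resource "aws_iam_policy" "tf_state_test_only" {
  name = "terraform-state-test-only"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "ListStateBucket"
        Effect   = "Allow"
        Action   = ["s3:ListBucket"]
        Resource = "arn:aws:s3:::example-terraform-state"
      },
      {
        Sid      = "ReadWriteTestStateOnly"
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject"]
        Resource = "arn:aws:s3:::example-terraform-state/test/*"
      }
    ]
  })
}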
Factor in department-specific change management
e.g. create some explicit order of layers and perform plan/apply operations only in that order (and destroy in exactly the reverse order)
- I simply committed a file with a whitelist of layer names — a util script would run terraform commands only in that order (sketched after this list)
- this little trick also let me add / prototype a new layer, without messing with the existing ones … the new layer would make it to the whitelist, after it was OK in test/stage
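A hedged sketch of that util script; the names tools/run-layers.sh and layer-order.txt (one layer directory per line) are hypothetical.

#!/usr/bin/env bash
# Run a terraform action across a department's layers, strictly in the
# order committed in layer-order.txt; destroys walk the list in reverse.
set -euo pipefail

department="$1"   # e.g. department-A
env="$2"          # e.g. test | stage | prod
action="$3"       # e.g. plan | apply | destroy

layers=$(cat "${department}/layer-order.txt")
if [ "${action}" = "destroy" ]; then
  layers=$(echo "${layers}" | tac)   # GNU coreutils: reverse the lines
fi

for layer in ${layers}; do
  (
    cd "${department}/${layer}"
    terraform init -backend-config="env-${env}-s3-backend.conf"
    terraform "${action}" \
      -var-file="../global-vars/${env}.tfvars" \
      -var-file="env-${env}.tfvars"
  )
done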
Choose a sexy code name for branding reasons
I called the project "Skywalker".
Okay, that may be too pun-y to be sexy. But it did make people chuckle. And I could joke around that we couldn't possibly fail because The Source was with us :)
Epilogue
Implementation details of the concrete solution and tooling I developed must have changed a lot since. Though I hope the core design and architecture haven't (i.e. that they have been a continued success).
The smell of success is wonderful
Anyone who has done a project with lots of moving parts and a nebulous end state knows that so many things can go wrong at any time. What does "done" even mean? Will the gnawing self-doubt leave already, please? So imagine my utter disbelief, when it not only worked, but it worked well!
- The project fulfilled the design intention of generating a reference architecture that the whole company could adopt, along with tools, checklists, runbooks, and procedures to play with it in our AWS test account.
- Various teams were able to start adopting it immediately to create and operate their own bespoke infrastructure for greenfield services. They gained confidence that they could port over their legacy systems at their pace.
- The whole thing proved itself in production (design, implementation, tooling, checklists). I ported and live-migrated the company's most mission-critical system. Then we ran the spanking new system without a hitch for several months. And then, after I had moved on from the company into a sabbatical 9, other operators entirely deprecated the old system without having to call me back for assistance.
Computers and software are amazing
Our tools and clouds are extremely leveraged. When used with even a bit of care and forethought, a single person can get a lot done. This, I knew from experience 10.
Even so, it took literal first-hand experience for the insight to sink in viscerally… the power available to a single person willing to plough through several hundred pages of manuals, incur hair loss due to insane APIs, and doggedly try failed experiment after failed experiment. Until it all just… works!
That person was I, and that guy still can't believe he ended up pulling off this whole terraforming exercise almost entirely by himself (like I mentioned, very small team!). This work would have required a whole department in the early 2000s.
None of it is possible without supportive bosses and peers
Thanks! You know who you are!