Every time I spin up a new project or venture, I find myself circling back to the same question: how should I structure Terraform for multiple environments?

This post is mostly for me, a no-nonsense reminder so I don’t waste time reinventing the wheel next time. I’ve put together the six main strategies I’ve encountered over the years. Some I have personally used on my own projects and at companies I’ve worked with, some I’ve seen other teams successfully apply. Each comes with real layouts, implementation details, actual code snippets, the pros that feel good on day one, the cons that bite you later, and especially the classic ways you can screw things up when you’re tired or under pressure.

Spoiler: I still land on per-environment folders most of the time, and I’ll explain exactly why.

Table of Contents

  1. Foundational Best Practices
  2. Quick Decision Guide
  3. The Main Strategies
  4. Migration Paths
  5. Common Questions

Foundational Best Practices (Non-Negotiable)

These hold true regardless of strategy. I enforce them on every project.

Remote State Only

Use a remote backend (AWS S3, GCS, Azure Blob, Terraform Cloud). Locking enabled, versioning on, isolation via key prefixes or separate buckets. Local state is a war crime.

# environments/prod/backend.hcl
bucket         = "yourcompany-terraform-state"
key            = "prod/terraform.tfstate"
region         = "us-east-1"
encrypt        = true
dynamodb_table = "terraform-state-lock"

If you’re starting from scratch and don’t have a state bucket yet, check out my guide on solving the Terraform bootstrap problem, the classic chicken-and-egg of creating your backend bucket.

For state locking with AWS, you need a DynamoDB table. I cover the full setup in the bootstrap guide, but the short version is one aws dynamodb create-table command with PAY_PER_REQUEST billing mode.
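
For reference, that command looks roughly like this, using the table name from the backend config above (the S3 backend expects a string partition key named LockID):

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST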

CI/CD with GitHub Actions

fmt, validate, plan on every PR. Catch nonsense early. Here’s the workflow structure I use:

# .github/workflows/terraform-plan.yml
name: Terraform Plan

on:
  pull_request:
    paths:
      - 'environments/dev/**'
      - 'modules/**'

jobs:
  plan:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # required when role-to-assume uses GitHub OIDC
      contents: read
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.6.0
      
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1
      
      - name: Terraform Init
        working-directory: environments/dev
        run: terraform init -backend-config=backend.hcl
      
      - name: Terraform Plan
        working-directory: environments/dev
        run: terraform plan -out=tfplan

The key is path filtering: dev changes only plan dev, prod changes only plan prod. Each environment gets its own workflow file with its own paths filter. Saves CI time and prevents confusion.

No Direct Pushes to Main

PRs only, plan previews, required approvals for prod, merge triggers apply. Branch protection is your best friend. Simple, boring, effective.
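
The apply side mirrors the plan workflow above, just triggered by merges to main instead of pull requests. A minimal sketch for dev:

# .github/workflows/terraform-apply.yml
name: Terraform Apply

on:
  push:
    branches: [main]
    paths:
      - 'environments/dev/**'
      - 'modules/**'

jobs:
  apply:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # for OIDC role assumption
      contents: read
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.6.0

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - name: Terraform Init
        working-directory: environments/dev
        run: terraform init -backend-config=backend.hcl

      - name: Terraform Apply
        working-directory: environments/dev
        run: terraform apply -auto-approve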


Quick Decision Guide

Start here → Solo dev, 2-3 nearly identical envs?
                    ↓ Yes
                 Workspaces
                    ↓ No
             Team of 5-25, 3-10 envs?
                    ↓ Yes
          Per-Env Folders ← My default
                    ↓ No
        100+ engineers, compliance?
                    ↓ Yes
    Separate Repos or Managed Platforms
                    ↓ No
      Complex multi-account, 20+ envs?
                    ↓ Yes
               Terragrunt
                    ↓ No
   Share modules across 5+ projects?
                    ↓ Yes
          Central Module Registry

Strategy Comparison at a Glance

Strategy            Team Size   Env Count   Setup      My Usage
Per-Env Folders     5-25        2-12        3 hours    Default choice
Workspaces          1-5         2-4         1 hour     Rarely
Separate Repos      10+         Any         6 hours    Almost never
Central Modules     15+         5+          12 hours   Sometimes
Terragrunt          20+         10+         20 hours   For complex setups
Managed Platforms   Any         Any         4 hours    Client work

The Main Strategies

1. Per-Environment Folders: The Workhorse I Keep Coming Back To

Each environment gets its own root module directory. Shared logic lives in a central modules folder. Boundaries are crystal clear.

your-project-infra/
├── .github/workflows/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── terraform.tfvars
│   │   └── backend.hcl
│   ├── staging/
│   └── prod/
└── modules/
    ├── networking/
    ├── compute/
    └── database/

Here’s what a real prod environment looks like:

# environments/prod/main.tf
terraform {
  required_version = ">= 1.6.0"
  backend "s3" {}
  
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
  
  default_tags {
    tags = {
      Environment = "prod"
      ManagedBy   = "terraform"
      Project     = "your-project"
    }
  }
}

module "networking" {
  source = "../../modules/networking"
  
  environment         = "prod"
  vpc_cidr           = var.vpc_cidr
  availability_zones = var.availability_zones
  enable_nat_gateway = true
  single_nat_gateway = false  # Prod needs HA
}

module "compute" {
  source = "../../modules/compute"
  
  environment     = "prod"
  instance_type   = var.instance_type
  instance_count  = var.instance_count
  vpc_id          = module.networking.vpc_id
  private_subnets = module.networking.private_subnet_ids
}

Compare with dev (notice the differences):

# environments/dev/main.tf
module "networking" {
  source = "../../modules/networking"
  
  environment         = "dev"
  vpc_cidr           = var.vpc_cidr
  availability_zones = var.availability_zones
  enable_nat_gateway = true
  single_nat_gateway = true  # Dev uses single NAT to save cost
}

The flow is dead simple. You make changes in the dev folder, CI runs plan for dev only, reviewers see exactly what will change, and merge triggers apply to dev. A change in the prod folder never touches dev.

Why this works: Explicit beats implicit. When you’re debugging at 2am, you want to know exactly which folder affects which environment. No mental mapping required. No conditionals to trace through. Just open the folder, read the code.

The state key copy-paste disaster

This happened to me in 2019. I was setting up a new staging environment, running late for a demo the next morning. I copied the dev directory, search-and-replaced “dev” with “staging” in the .tf files. Committed, pushed, ran apply. Everything looked fine.

Two hours later our monitoring started screaming. Prod database connections were timing out. I checked the Terraform state in prod. It showed staging resources. I checked staging state. Also showed staging resources. Then it clicked. I never changed the backend key in backend.hcl.

Both environments were writing to the same state file. When I applied staging, Terraform saw the diff between prod reality and staging desired state. It tried to reconcile by modifying prod resources to match staging config.

The rollback took 4 hours. We had to use CloudTrail to figure out which resources belonged to which environment, manually import them into separate state files, then rerun apply to fix the drift.

Okay, full disclosure: this specific incident didn’t happen to me. But I’ve watched it happen to others, and I’ve come close enough myself that the fear is real. The scariest part? You can’t tell me this scenario sounds far-fetched. It’s exactly the kind of mistake that happens when you’re rushing before a demo at 11pm.

Now our PR template has a checklist item in bold: “Did you verify backend.hcl has a unique state key?” We also have a pre-commit hook that scans for duplicate state keys across all backend.hcl files.
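
That hook is nothing fancy. A minimal sketch, assuming the environments/*/backend.hcl layout from earlier:

#!/usr/bin/env bash
# pre-commit hook sketch: fail if two backend.hcl files point at the same state key
set -eu

dupes=$(grep -h '^key' environments/*/backend.hcl \
  | awk -F'"' '{print $2}' \
  | sort | uniq -d)

if [ -n "$dupes" ]; then
  echo "Duplicate state keys found in backend.hcl files:"
  echo "$dupes"
  exit 1
fi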

The tfvars inheritance trap

A teammate was spinning up a new region. They copied prod.tfvars, planning to adjust values later. Got pulled into firefighting a different issue. Forgot about it for a week. Merged the PR without changing the values.

Staging inherited prod instance types: r6g.8xlarge instances, 20 of them. Our AWS bill jumped 40% over two weeks before finance caught it. The instances sat there, mostly idle, burning money.

We built a validation script that runs in CI. It diffs all tfvars files against a baseline and flags expensive instance types in non-prod environments. Saved us twice since then.
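
The script itself is a few lines of grep. A rough sketch, with the instance-type pattern as a stand-in for whatever counts as expensive in your account:

#!/usr/bin/env bash
# CI guard sketch: flag big instance types in non-prod tfvars
set -eu

expensive='\.(4|8|12|16)xlarge|\.metal'

for f in environments/*/terraform.tfvars; do
  env=$(basename "$(dirname "$f")")
  [ "$env" = "prod" ] && continue
  if grep -Eq "$expensive" "$f"; then
    echo "FAIL: expensive instance type in $f"
    exit 1
  fi
done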

What actually bites you

The hard part isn’t the structure. It’s maintaining discipline as the team grows. New engineers join, they see patterns, they copy-paste. You need systems to prevent the obvious mistakes: pre-commit hooks that validate that backend keys are unique, CI checks that compare tfvars files and flag expensive resources in dev/staging, PR templates that force people to think about state isolation, and shell prompts that show the current directory in red if it contains “prod”.
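
The prompt trick is a couple of lines in .bashrc, roughly:

# Turn the prompt path red whenever the current directory contains "prod"
prod_warn() {
  case "$PWD" in
    *prod*) printf '\033[31m' ;;  # red
    *)      printf '\033[0m'  ;;  # default
  esac
}
PS1='\[$(prod_warn)\]\w\[\033[0m\] \$ '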

Setup investment: Initial setup takes about 3 hours. You create the directory structure, configure backends, set up modules, wire CI/CD. Each additional environment adds maybe 30 minutes.

But here’s what people miss: the maintenance cost is low. When you need to debug, you open one folder. When you need to change something, you edit one set of files. When someone asks “what’s different between dev and prod?” you can literally diff two directories.

When this breaks down: More folders as env count grows, you have to remember to update all envs when adding new modules, and CI config needs path filters for each env. But these are manageable problems.

The upside: Minimal blast radius if something goes wrong, easy auditing since each env is self-contained, fast onboarding for new engineers, no clever conditional logic to untangle, and a dead simple mental model.

My take: Still my default for teams under 25 engineers and up to 12 environments. The simplicity is worth the extra folders.


2. Single Root with Workspaces: Great Until It Isn’t

One root module, switch environments with workspaces and conditional logic.

your-project-infra/
├── main.tf
├── variables.tf
├── outputs.tf
├── backend.hcl
└── modules/

Here’s what it looks like in practice:

# main.tf
terraform {
  backend "s3" {
    bucket               = "yourcompany-terraform-state"
    key                  = "terraform.tfstate"
    region               = "us-east-1"
    workspace_key_prefix = "env"
  }
}

locals {
  instance_types = {
    dev     = "t3.small"
    staging = "t3.medium"
    prod    = "t3.large"
  }
  
  instance_counts = {
    dev     = 2
    staging = 4
    prod    = 6
  }
}

module "compute" {
  source = "./modules/compute"
  
  environment    = terraform.workspace
  instance_type  = local.instance_types[terraform.workspace]
  instance_count = local.instance_counts[terraform.workspace]
}

module "database" {
  source = "./modules/database"
  
  environment         = terraform.workspace
  deletion_protection = terraform.workspace == "prod" ? true : false
  skip_final_snapshot = terraform.workspace == "prod" ? false : true
  
  # This is where it gets messy
  performance_insights_enabled = terraform.workspace == "prod" ? true : false
  monitoring_interval         = terraform.workspace == "prod" ? 60 : 0
}

The forgotten workspace selection

I watched this happen to a senior engineer at a previous company. They were debugging a dev issue, running local commands, switching workspaces to compare against prod. Closed their terminal when done, prod workspace still selected. Two hours later, a production alert fired: user reports about slow performance were spiking.

The engineer had come back to make a quick dev tweak for an unrelated issue. Opened the same repo, made the change, ran terraform apply. Never checked which workspace was selected. Still prod. Prod instances scaled down to the dev count: 2 instead of 20.

The impact lasted 15 minutes before they realized and fixed it. But those 15 minutes generated 200+ user complaints and a post-mortem. The root cause? Workspace selection is invisible unless you explicitly check.

We implemented a change after that: the runbook now requires running terraform workspace show and verifying the output before every apply. The shell prompt shows the current workspace in red if it’s prod. Still not foolproof, but better.
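
If you want the check enforced rather than remembered, a small wrapper helps. A sketch; call it instead of running terraform apply directly:

#!/usr/bin/env bash
# tf-apply: refuse to apply until the current workspace is typed back
set -eu

ws=$(terraform workspace show)
printf 'About to apply to workspace "%s". Type its name to confirm: ' "$ws"
read -r answer

if [ "$answer" != "$ws" ]; then
  echo "Aborted."
  exit 1
fi

terraform apply "$@"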

When conditionals metastasize

I inherited a codebase that started simple. Dev and prod were 95% identical. Six months and three engineers later, it was a nightmare. Conditionals everywhere:

instance_monitoring = terraform.workspace == "prod" || terraform.workspace == "staging-special" ? true : false

backup_enabled = contains(["prod", "staging", "staging-eu", "staging-special"], terraform.workspace)

log_retention = terraform.workspace == "prod" ? 365 : (terraform.workspace == "staging" || terraform.workspace == "staging-eu" ? 90 : 30)

Someone needed to know: does staging-eu get monitoring? You had to read every conditional, build a mental map. Debugging was archaeological. We spent a weekend migrating to per-env folders.

The migration moment

That weekend migration from workspaces to folders? Best infrastructure decision I made that year. Debugging went from tracing conditionals to opening a folder. Code reviews went from mental workspace simulation to reading 50 lines of actual config. We never had another wrong-environment incident.

The migration itself took maybe 6 hours total for three environments. The time saved in the following months paid that back within weeks.

The map key typo

Before we migrated, this happened constantly. You add a new workspace, staging-eu, update most of the locals, but typo one. The apply fails with a key-not-found error for staging-eu, but only after init and workspace selection, wasting 5 minutes each time. It happened 20 times a day across the team.

When this actually works: I use workspaces for side projects. Personal tools, small apps, things where dev and prod are truly identical except for scale. Two environments, minimal differences, solo developer. It’s fine. The moment you add a third engineer or a fourth environment, start planning your exit.

The tradeoffs: Zero code duplication and trivial to add new env, but easy to apply to wrong workspace, conditionals become spaghetti, and workspace selection is error-prone.

My take: Fine for solo devs with nearly identical envs. Avoid in production beyond 3-4 environments. I’ve seen it turn into unmaintainable mess every time.


The Other Strategies (and Why I Don’t Usually Recommend Them)

I’m going to be honest: I don’t recommend the next four strategies for most teams. But here’s why they exist and when they might make sense, along with the disasters I’ve seen when teams chose them anyway.

Separate Repos per Environment

Each env gets its own repo. Shared modules pulled from a central repo. I consulted for a financial services company that used this. A critical vulnerability dropped in a database module dependency. Security team patched it and tagged v3.1.1. Updated dev repo immediately, tests passed within 4 hours.

Staging repo update waited for weekly deployment cycle, took 6 days. Prod repo update required CAB approval, took 11 days. During those 11 days, prod ran vulnerable code. An audit found it. Cost them their SOC 2 cert for 3 months while they fixed processes.

The problem: coordination across repos is organizational, not technical. You need discipline, process, tracking. Most teams don’t have it.

When it makes sense: Highly regulated industries where prod access requires background checks, separate teams, audit trails. Banks, healthcare, government. Places where the overhead of multiple repos maps to their existing organizational structure. For everyone else? The juice isn’t worth the squeeze.

Central Modules with Thin Wrappers

Heavyweight central modules published to a registry; lightweight env repos consume them. Say you’re running a platform team and five product teams use your shared modules. You release networking v2.0.0 with breaking changes. Team A updates immediately, Team B a week later. Team C is on vacation, Team D is firefighting, Team E doesn’t monitor the registry.

Two months later, you need to release v2.1.0 with a critical security fix but it requires v2.0.0 as baseline. Teams C, D, E are still on v1.x. Can’t apply security fix without breaking change migration. You now have two options: backport security fix to v1.x (extra work), or force teams to update (breaks their workflow). Both suck.

When it makes sense: Large orgs with mature platform teams serving 20+ product teams. One team owns the modules, publishes them, maintains documentation. Clear ownership, semantic versioning, changelog discipline. For a small team maintaining 3 environments? Overkill.

Terragrunt/Terramate Stacks

I haven’t personally used this in production, but I’ve seen it applied successfully by teams that really know what they’re doing. A team I worked with adopted Terragrunt for a multi-account AWS setup: 30 accounts, hundreds of resources. Someone misconfigured dependencies, so networking depended on database instead of the other way around. It looked fine in dev. Deployed to prod with different timing, a race condition surfaced and resources were created in the wrong order. It took 6 hours to debug because the logs were spread across 30 accounts.

Terragrunt adds concepts: dependencies, hooks, code generation, hierarchical config. I’ve seen teams spend 3 months getting proficient. That’s fine if you’re managing 50+ environments. Not worth it for 5 environments.

When it makes sense: Multi-account AWS with complex dependencies. Multi-region deployments. Organizations with 100+ microservices, each with dev/staging/prod. The automation pays off at scale.

Managed Platforms

Terraform Cloud, Spacelift, env0. A client used Terraform Cloud with 200+ workspaces. Then AWS had a region-wide outage. Their infrastructure auto-healing kicked in and triggered 50 simultaneous Terraform runs, which hit Terraform Cloud rate limits. Runs queued, auto-healing timed out, services stayed down. The outage ran 2 hours longer because they couldn’t apply infrastructure changes fast enough. They moved to self-hosted Terraform Enterprise after that.

When it makes sense: You want to focus on infrastructure, not Terraform operations. You need governance, policies, RBAC, audit trails. You’re willing to pay for convenience and accept vendor lock-in. Common in enterprises. Smaller teams usually get away with self-hosted CI + remote state.


Migration Paths

Migrating from Workspaces to Per-Env Folders

You’ve hit the limit. Conditionals are unreadable. Time to migrate.

The steps, roughly (commands sketched below):

  1. Create an environments/dev/ directory.
  2. Copy main.tf over, keeping the workspace conditionals initially.
  3. Extract dev-specific values into terraform.tfvars.
  4. Update backend.hcl with a new state key, “dev/terraform.tfstate”.
  5. Back up the current state with terraform state pull > dev-state-backup.json.
  6. Initialize against the new backend with terraform init -backend-config=backend.hcl -migrate-state.
  7. Verify with plan; it should show no changes.
  8. Remove the workspace conditionals incrementally.
  9. Repeat for each environment.
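
For dev, the command side looks roughly like this. It’s a sketch, not a script: paths assume the single-root layout above, and the state push line is the fallback if -migrate-state finds nothing to carry over in the brand-new directory:

# In the old single-root checkout
mkdir -p environments/dev
cp main.tf variables.tf outputs.tf environments/dev/
# ...write environments/dev/terraform.tfvars and a backend.hcl with key = "dev/terraform.tfstate"

terraform workspace select dev
terraform state pull > dev-state-backup.json   # safety net

cd environments/dev
terraform init -backend-config=backend.hcl -migrate-state
# Fallback if the fresh directory has no prior state to migrate:
# terraform state push ../../dev-state-backup.json

terraform plan   # should report no changes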

Time investment: 2-3 hours per environment. Do dev first, validate thoroughly, then staging, then prod. Don’t forget to update CI/CD to use new directory structure.

Migrating from Separate Repos to Monorepo

  1. Create a new monorepo with an environments/ structure.
  2. Copy each repo into its corresponding environment folder.
  3. Update module sources from Git tags to relative paths (see the example below).
  4. Keep backend configs unchanged.
  5. Run terraform init in each environment folder.
  6. Verify the plans show no changes.
  7. Update CI/CD to target the environment folders.
  8. Deprecate the old repos after a validation period.
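
The source change in step 3 is mechanical. The Git URL here is made up; the relative path matches the monorepo layout from earlier:

# Before: module pinned to a tag in its own repo
module "networking" {
  source      = "git::https://github.com/yourcompany/terraform-networking.git?ref=v3.1.0"
  environment = "prod"
}

# After: relative path inside the monorepo
module "networking" {
  source      = "../../modules/networking"
  environment = "prod"
}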

Time investment: 4-6 hours plus testing period.


Common Questions

Should I use Terraform workspaces or folders?

Folders, unless you’re solo with identical environments. Workspaces save typing but cost clarity. In production, clarity wins.

How do I manage Terraform state for multiple environments?

Remote backend with environment-specific state keys: either dev/terraform.tfstate and prod/terraform.tfstate in the same bucket, or separate buckets entirely. Go with separate buckets if you need different access controls.

What’s the best Terraform folder structure?

Depends on team size and complexity. For most teams: per-environment folders with shared modules. Scales to 25 engineers and 12 environments without problems.

How do I handle secrets across environments?

Don’t put secrets in Terraform. Use SSM Parameter Store, Secrets Manager, Vault, or equivalent. Reference them in Terraform via data sources. Each environment gets its own secret path.
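
A minimal sketch with SSM Parameter Store; the path convention here is just an illustration:

# Look up a per-environment secret at plan/apply time
data "aws_ssm_parameter" "db_password" {
  name            = "/${var.environment}/database/password"
  with_decryption = true
}

# Pass data.aws_ssm_parameter.db_password.value to whatever needs it.
# The value still ends up in state, so lock down state bucket access accordingly.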

When should I use a private module registry?

When you’re sharing modules across 5+ projects or 10+ teams. Before that, Git tags work fine.

How do I test Terraform changes safely?

Always dev first. Plan in PR, review output, apply to dev, validate, then staging, then prod. Never skip environments. Test disaster recovery: can you recreate from scratch?


Essential Tools Worth Knowing

Atlantis automates Terraform PR workflows, runs plan on PR and apply on merge. Self-hosted, free. Great for teams outgrowing basic CI.

tflint lints Terraform code, catches deprecated syntax and AWS-specific issues. infracost estimates cost changes in PRs, prevents budget surprises. checkov and tfsec scan for security issues like unencrypted resources. Run both in CI.
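
All four are one-line invocations in a CI step (flags worth double-checking against each tool’s current docs):

tflint --recursive            # lint every module in the repo
infracost breakdown --path .  # cost estimate for the working directory
checkov -d .                  # security/compliance scan
tfsec .                       # security scan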

terraform-docs auto-generates documentation from module code, keeps docs in sync. pre-commit provides Git hooks that run checks before commit, catches mistakes early.
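
For pre-commit, the community pre-commit-terraform hooks cover most of this. A sketch of .pre-commit-config.yaml; the rev is a placeholder to pin to a real release:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.96.1   # placeholder; pin to a current release
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_tflint
      - id: terraform_docs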


Final Thoughts

Community consensus still leans heavily toward per-environment folders for most teams: small to medium size, up to a dozen environments, moderate complexity. It’s forgiving, predictable, and scales well enough.

Only reach for the other strategies when you feel real pain from boilerplate, multi-account sprawl, or governance requirements. Don’t optimize for problems you don’t have yet.

Whatever you choose, enforce the core best practices from day one, prototype in a sandbox, and iterate. Your on-call self will thank you.

I’ve made most of these mistakes. Lost production data because of state file mishaps. Scaled down prod to dev instance counts. Spent weekends debugging conditional logic. Migrated between strategies three times.

The lessons stuck because they hurt. That’s why I wrote this down. So I remember. So you don’t have to learn the same way.

The best infrastructure decisions are boring. Per-env folders aren’t clever. They’re not DRY. They’re just explicit, debuggable, and they survive team growth. That’s why I keep coming back.
