I keep a personal wiki of infrastructure patterns I’ve used. This is one of those notes, cleaned up for public consumption. Every time I start a fresh Terraform project, I reference this. You’re welcome to steal it.
Table of Contents
- TL;DR - The Pattern That Works
- The Problem Nobody Talks About
- Why This Actually Matters
- The Four Approaches People Try
- The Bootstrap Module Pattern
- Migrating Existing Infrastructure
- S3-Compatible Backends
- Production Failure Patterns
- Bootstrap Principles
- The Complete Checklist
- Common Questions
TL;DR - The Pattern That Works
If you care about audits, recovery time, and team growth, the correct way to bootstrap Terraform state is:
- Use a dedicated Terraform bootstrap module
- Store its state locally and temporarily
- Create the remote backend (S3 or compatible) with:
- versioning enabled
- encryption at rest
- public access blocked
- locking configured
- Point all main infrastructure at that backend
- Never allow main infrastructure to create or modify its own state backend
Everything below explains why this survives audits, production incidents, and team turnover.
The Problem Nobody Talks About
You’re starting a new Terraform project. You know you need remote state storage because local state files are a disaster waiting to happen. You want S3 with versioning, encryption, and locking. So you write the Terraform code to create the bucket.
Then you hit the wall: Terraform needs a backend to store state during resource creation. But the backend doesn’t exist yet. You’re trying to use Terraform to create the thing Terraform needs to work.
This is the bootstrap problem, and it’s the first real test of whether you actually understand infrastructure as code or you’re just moving ClickOps into HCL files.
Why This Actually Matters
Bad bootstrapping doesn’t fail immediately. It fails later, when the cost is higher.
Picture this
You’re six months into a project. Team has grown from 3 to 15 engineers. Someone spins up a staging environment by copying the production Terraform code. They manually create a state bucket through the console because that’s what the setup notes say (or what they remember).
Different naming convention than prod. Different region (closer to their location). Forgot to enable lifecycle policies.
Fast forward another six months. Compliance audit. Auditor asks: “Show me your state bucket configuration.”
You pull up AWS console. Two buckets. Completely different security postures. One has versioning, one doesn’t. One has encryption with specific settings, one has whatever the defaults were. One blocks public access explicitly, one relies solely on IAM.
The audit finding: “Inconsistent security controls across environments.”
Time investment: roughly 40 hours across multiple people. Root cause: state backend created outside of code, leading to silent drift.
What proper bootstrapping prevents
Proper bootstrapping gives you consistency by default. Disaster recovery becomes trivial: rerun the bootstrap module instead of reconstructing from CloudTrail. New engineers onboard by running code, not copying wiki commands.
I learned this the hard way after we lost a state bucket during a cleanup and spent two days reconstructing infrastructure that should have taken minutes.
The Four Approaches People Try
There are four approaches you’ll see. Three have problems that only surface later.
Approach 1: Manual Bucket Creation
Create the bucket manually through AWS console or CLI, then point Terraform at it.
aws s3api create-bucket --bucket my-terraform-state --region us-east-1
aws s3api put-bucket-versioning --bucket my-terraform-state \
--versioning-configuration Status=Enabled
When it works: Solo developer, throwaway POC, everything gets deleted next week.
When it fails: Everything else.
The AWS S3 security checklist has roughly 15 items: versioning, encryption, public access blocks, lifecycle policies, and so on. You'll remember 12 of them.
Three months pass. Security scanner flags your bucket for missing encryption. You enable it now. But compliance wants to know about historical state files. Were there credentials in those unencrypted files?
Now you’re auditing every previous state version to prove no exposure occurred.
The multi-environment divergence
Imagine inheriting infrastructure where three engineers each set up their own environment over a year. No coordination.
Final state:
- Dev: terraform-dev-state, us-east-1, no encryption, versioning enabled
- Staging: my-company-tfstate-staging, us-west-2, AES256 encryption, no versioning
- Prod: prod-terraform-state-bucket-2024, eu-west-1, KMS encryption, versioning enabled
Different names, regions, security configurations. Your monitoring script becomes a mess of special cases. When the auditor asks about state management policy, there’s no consistent answer.
Root cause: each bucket created manually, in isolation, with different assumptions.
Approach 2: Terraform with Local Backend, Then Migrate
Start your main project with local backend, use it to create the S3 bucket, then switch to remote backend and migrate.
# Initially
terraform {
backend "local" {}
}
resource "aws_s3_bucket" "state" {
bucket = "my-terraform-state"
}
# After apply, change to remote backend
terraform {
backend "s3" {
bucket = "my-terraform-state"
key = "terraform.tfstate"
region = "us-east-1"
}
}
Why people choose this: Single codebase, no extra directories. Feels simple.
Why it’s risky: Your main infrastructure code has permission to create and modify its own state backend. That’s privilege escalation. The service account running applies shouldn’t control where state is stored.
If you lose local state between initial apply and migration (laptop crash, forgot to commit), you’re in trouble. The bucket exists but Terraform doesn’t know about it. Manual import required or delete-and-recreate (which might violate retention policies).
The lost local state scenario
You’re setting up production Terraform. Local backend, create state bucket, about to migrate. Urgent customer issue. Context-switch. Work from home that day.
Next morning, back at office desktop. Local state file is on your laptop at home. Not in git (correctly gitignored).
Run terraform apply to continue. Error: bucket already exists.
Options: import manually (45 minutes of syntax debugging), delete bucket (30-day retention policy blocks it), or drive home for the laptop.
Root cause: temporary local state with no isolation from main infrastructure.
Approach 3: Dedicated Bootstrap Module
Separate Terraform project using local state to create just the backend. Main infrastructure points to bootstrapped backend.
project/
├── bootstrap/
│   ├── main.tf
│   ├── variables.tf
│   └── terraform.tfstate   # Local, gitignored
└── infrastructure/
    ├── main.tf
    └── backend.tf
Why this works: Complete separation. Bootstrap is small, focused, runs once. Main infrastructure never has permission to modify its own backend.
The trade-off: Two terraform init and terraform apply cycles. Some engineers resist the extra step.
The separation pays off during incidents and audits.
When separation saved everything
Financial services scenario, SOC 2 compliance. Auditor requirement: prove production engineers cannot tamper with audit history. State files are that history.
Bootstrap module creates state bucket with specific IAM policy. Bucket writable only by CI/CD service account. Engineers have read-only access. They run plans and applies through CI/CD, cannot directly modify state.
Someone leaves on bad terms. Still has AWS console access for a few hours during offboarding. Cannot destroy infrastructure because they cannot modify state. Security team uses state history to verify no unauthorized changes.
The separation meant 8 hours of setup. It also meant zero risk during a security incident.
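If you want that control baked into the bootstrap module itself, a minimal sketch of the bucket policy looks like the following. The CI role ARN is a placeholder I made up, not something from the original setup; adapt it to your account and keep a break-glass admin role in mind before applying a Deny this broad.
# bootstrap/policy.tf (sketch; hypothetical CI role ARN)
resource "aws_s3_bucket_policy" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "DenyStateWritesExceptCICD"
      Effect    = "Deny"
      Principal = "*"
      Action    = ["s3:PutObject", "s3:DeleteObject", "s3:DeleteObjectVersion"]
      Resource  = "${aws_s3_bucket.terraform_state.arn}/*"
      Condition = {
        StringNotLike = {
          # Only the CI/CD role may write state; everyone else is read-only
          "aws:PrincipalArn" = "arn:aws:iam::123456789012:role/terraform-ci"
        }
      }
    }]
  })
}
Engineers can still read state for plans; only the CI/CD role can write it.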
Approach 4: Separate Account for Backend
Backend resources in dedicated AWS account. Main infrastructure uses cross-account access.
AWS Organization:
├── management-account (state buckets)
├── dev-account (uses management bucket)
├── staging-account (uses management bucket)
└── prod-account (uses management bucket)
When it makes sense: Regulated industries, strict compliance, need to prove separation of duties.
The cost: Multiple accounts, cross-account IAM, assume-role chains, credential rotation. Significant overhead.
Works well in large organizations with security teams. For a 10-person startup, it’s overkill. For a bank, it might be required.
The Bootstrap Module Pattern
This is what goes in my wiki. Least pain, most reliability, passes audits.
Step 1: Create the Bootstrap Module
# bootstrap/main.tf
terraform {
required_version = ">= 1.6.0"
backend "local" {}
}
provider "aws" {
region = var.region
}
resource "aws_s3_bucket" "terraform_state" {
bucket = var.state_bucket_name
tags = {
Name = "Terraform State"
Environment = var.environment
ManagedBy = "terraform-bootstrap"
}
}
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
}
bucket_key_enabled = true
}
}
resource "aws_s3_bucket_public_access_block" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_s3_bucket_lifecycle_configuration" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
rule {
id = "expire-old-versions"
status = "Enabled"
noncurrent_version_expiration {
noncurrent_days = 90
}
}
rule {
id = "abort-incomplete-uploads"
status = "Enabled"
abort_incomplete_multipart_upload {
days_after_initiation = 7
}
}
}
output "state_bucket_id" {
value = aws_s3_bucket.terraform_state.id
description = "S3 bucket name for Terraform state"
}
output "state_bucket_region" {
value = aws_s3_bucket.terraform_state.region
description = "S3 bucket region"
}
output "state_bucket_arn" {
value = aws_s3_bucket.terraform_state.arn
description = "S3 bucket ARN"
}
# bootstrap/variables.tf
variable "region" {
description = "AWS region for state bucket"
type = string
default = "us-east-1"
}
variable "state_bucket_name" {
description = "Terraform state bucket name (globally unique)"
type = string
}
variable "environment" {
description = "Environment (dev, staging, prod)"
type = string
default = "shared"
}
# bootstrap/terraform.tfvars
region = "us-east-1"
state_bucket_name = "mycompany-terraform-state-2025"
environment = "shared"
Step 2: Run the Bootstrap
cd bootstrap
terraform init
terraform plan
Should create exactly 5 resources: bucket, versioning, encryption, public access block, lifecycle.
terraform apply
Takes about 15 seconds. Save the outputs.
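The outputs are what the main infrastructure backend block will reference, so capture them while you're still in the directory. One way to do it:
terraform output state_bucket_id
terraform output -json > bootstrap-outputs.json   # optional local copy of all outputs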
Step 3: Configure Main Infrastructure
# infrastructure/main.tf
terraform {
required_version = ">= 1.10.0"
backend "s3" {
bucket = "mycompany-terraform-state-2025"
key = "infrastructure/terraform.tfstate"
region = "us-east-1"
encrypt = true
# Terraform 1.10+ native S3 locking, no DynamoDB table
use_lockfile = true
}
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
provider "aws" {
region = "us-east-1"
}
Step 4: Initialize and Migrate
cd ../infrastructure
terraform init
If you have existing local state, Terraform prompts migration. Type yes.
If starting fresh, just confirm. Done.
Migration mechanics
Terraform reads the local state and uploads it to S3 as the first object version; from then on, every operation goes through the remote backend.
Always backup first:
cp terraform.tfstate terraform.tfstate.backup-$(date +%Y%m%d-%H%M%S)
terraform init -migrate-state
aws s3 ls s3://mycompany-terraform-state-2025/infrastructure/
rm terraform.tfstate.backup-*
Migrating Existing Infrastructure
You have manually-created infrastructure. Now you want Terraform to manage it.
The Import Workflow
# 1. Bootstrap state backend first
# 2. Write Terraform matching existing resources
resource "aws_vpc" "legacy" {
cidr_block = "10.0.0.0/16"
tags = {
Name = "legacy-vpc"
}
}
# 3. Import
terraform import aws_vpc.legacy vpc-12345678
For complex infrastructure with hundreds of resources, consider Terraformer (auto-generates code) or Former2 (AWS web UI). For production-critical systems, writing the code by hand and then importing is the most reliable approach.
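If you're on Terraform 1.5 or newer, config-driven import blocks are an alternative to the CLI command; a minimal sketch using the same example VPC ID:
# import.tf (sketch; remove the block once the import has been applied)
import {
  to = aws_vpc.legacy
  id = "vpc-12345678"
}

# Optionally let Terraform draft the resource code for review:
#   terraform plan -generate-config-out=generated.tf
Either way the result is the same: the resource lands in remote state without being recreated.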
The Staged Migration Pattern
Don’t import everything at once. Stage by blast radius.
Week 1: Bootstrap + Networking
┌─────────────────────────────┐
│ State backend               │
│ VPCs, subnets, route tables │
└─────────────────────────────┘
   ↓ terraform plan shows no changes
Week 2: Compute
┌─────────────────────────────┐
│ EC2, ASG, launch templates  │
│ Load balancers              │
└─────────────────────────────┘
   ↓ verify and stabilize
Week 3: Data Stores (careful)
┌─────────────────────────────┐
│ RDS, DynamoDB, S3           │
│ ElastiCache                 │
└─────────────────────────────┘
   ↓ test thoroughly
Week 4: Everything Else
┌─────────────────────────────┐
│ IAM, security groups        │
│ CloudWatch, DNS             │
└─────────────────────────────┘
After each stage, run terraform plan until it shows zero changes. That’s your confidence check.
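A way to make that confidence check scriptable is the plan's detailed exit code:
terraform plan -detailed-exitcode; echo "exit code: $?"
# 0 = no changes (the stage is clean), 2 = drift remains, 1 = error
Wire that into CI and a stage isn't done until the exit code is 0.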
The full-stack import disaster
Imagine inheriting 200 AWS resources from an acquisition. Management wants it “Terraformed” in one sprint to show integration progress.
Someone writes all Terraform in three days. Imports all 200 resources Friday afternoon. Feels good.
Monday, sanity check: terraform plan wants to destroy and recreate 60 resources.
Why? Tag formatting differences. Default value mismatches. Implicit dependencies not captured. Security group rules in a different order.
Two weeks go into fixing it. More than once, production resources get modified accidentally because the Terraform code was wrong.
Root cause: no staged verification, large blast radius prevented early error detection.
S3-Compatible Backends
This section is only relevant if you are not using AWS S3. If you are on AWS, you can safely skip to Production Failure Patterns.
MinIO, DigitalOcean Spaces, Wasabi, Backblaze B2, and Hetzner Object Storage all speak the S3 API. Not all of them implement it completely.
S3-compatible does not mean S3-equivalent.
Before using any S3-compatible backend for production, verify:
- Locking works under concurrent applies - two simultaneous applies, one must wait
- Versioning produces distinct object versions - version IDs differ after each apply
- Encryption is real - download state file, verify actual encryption
- Lifecycle policies execute - old versions actually get deleted
If any fail, you discover it during an incident, not during setup.
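A sketch of how you might check the first two items with the AWS CLI pointed at the alternative endpoint (the MinIO hostname and bucket name here are placeholders):
# Confirm versioning is on and actually produces distinct versions
aws --endpoint-url https://minio.example.com s3api get-bucket-versioning \
  --bucket terraform-state
aws --endpoint-url https://minio.example.com s3api list-object-versions \
  --bucket terraform-state --prefix infrastructure/ \
  --query 'Versions[].VersionId'
# For locking, run terraform apply in two terminals; the second must wait on the lock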
Basic Configuration
terraform {
backend "s3" {
bucket = "terraform-state"
key = "infrastructure/terraform.tfstate"
region = "us-east-1" # Often required but ignored
endpoints {
s3 = "https://minio.example.com"
}
use_path_style = true
skip_s3_checksum = true
skip_region_validation = true
use_lockfile = true
}
}
The MinIO Checksum Problem
MinIO doesn’t support AWS S3’s modern checksums (CRC32, SHA256). Terraform 1.6+ tries to use them.
Symptom: terraform init works. terraform apply fails with signature errors.
Fix: skip_s3_checksum = true
This can cost you hours debugging IAM and networking. Now you know to add that flag immediately.
Provider Quick Reference
DigitalOcean Spaces: Works well, no lifecycle policies (manual cleanup needed)
MinIO: Skip checksums, test locking thoroughly, versioning solid
Wasabi: 90-day minimum retention (early deletion still costs)
Production Failure Patterns
Learn these once. They repeat across teams and organizations.
Pattern: Unversioned State
Trigger: Versioning disabled to save costs ($2/month)
Failure: State corruption with no rollback capability
Impact: Days reconstructing infrastructure from CloudTrail and memory
Prevention: Versioning from day zero, non-negotiable
Picture 40 AWS resources. Someone fat-fingers terraform destroy instead of plan. Confirms without reading. Everything destroyed.
Check state bucket for previous versions. Versioning was disabled months ago for “cost savings.”
Recovery: three engineers, two days, manually reconstructing and importing. Plus production downtime. Plus the incident report explaining why there were no backups.
Prevention cost: $2/month for versioning.
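A one-liner worth putting in a runbook or a scheduled check (bucket name taken from the bootstrap output):
aws s3api get-bucket-versioning --bucket mycompany-terraform-state-2025
# Expected: { "Status": "Enabled" }. Anything else is a finding.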
Pattern: Forgotten Encryption
Trigger: Encryption not configured during manual bucket creation
Failure: Compliance audit finding for unencrypted sensitive data
Impact: 20+ hours auditing historical state versions for credentials
Prevention: Encryption enabled before any sensitive data arrives
Security scanner flags bucket three months after creation. You enable encryption immediately.
Auditor asks: “Were there credentials in the unencrypted historical state?”
You audit every state version manually. Search for password =, secret =, API tokens. Find several database passwords in old state.
Next question: “Are these still valid? If so, they were exposed.”
Root cause: encryption not in bootstrap code, added as afterthought.
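If you ever have to run that audit, a rough sketch of the loop (assumes jq; the bucket and key are the earlier examples, not anything canonical):
# Pull every historical state version and grep for likely secrets
BUCKET=my-terraform-state
KEY=terraform.tfstate
aws s3api list-object-versions --bucket "$BUCKET" --prefix "$KEY" \
  --query 'Versions[].VersionId' --output json | jq -r '.[]' | while read -r v; do
  aws s3api get-object --bucket "$BUCKET" --key "$KEY" \
    --version-id "$v" "state-$v.json" > /dev/null
  grep -E -q '(password|secret|token)' "state-$v.json" && echo "possible secret in version $v"
done
The loop is the easy part; figuring out whether each hit is a still-valid credential is where the 20+ hours go.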
Pattern: DynamoDB Lock Table Deletion
Trigger: Cost optimization deletes “unused” DynamoDB table
Failure: All Terraform applies fail with lock acquisition errors
Impact: 4+ hours diagnosing, team-wide deployment blockage
Prevention: Use Terraform 1.10+ native S3 locking, no DynamoDB needed
Someone reviews DynamoDB tables for cost savings. Sees terraform-lock with zero metrics (locks are short-lived). Looks unused. Deletes it.
Next 20 deployments across different teams fail. Everyone assumes AWS API issue. Takes 4 hours to connect it to missing table.
100-person engineering team, deployments blocked half a day.
Prevention: use_lockfile = true in Terraform 1.10+. No separate lock table to break.
Pattern: Region Mismatch
Trigger: Copy-paste backend config from different project
Failure: Cryptic endpoint errors, no clear indication of wrong region
Impact: 30 minutes to 2 hours debugging authentication and networking
Prevention: Use bootstrap output values, never hardcode region
Bucket in us-east-1. Backend config says us-west-2 (copied from another project).
Error: “The bucket must be addressed using the specified endpoint.”
You debug IAM permissions (correct), networking (fine), bucket policies (proper). Eventually notice region mismatch.
Change to us-east-1. Terraform thinks you’re migrating backends. Need terraform init -reconfigure.
Root cause: hardcoded region instead of using bootstrap output value.
Terraform State Bootstrap Principles
If you remember nothing else, remember these:
- State must never manage itself - separation prevents privilege escalation
- State is part of your audit log - treat it like compliance-critical data
- Versioning is mandatory, not optional - recovery depends on it
- Locking failures are production outages - concurrent applies corrupt state
- Bootstrap code is intentionally small and disposable - easy to recreate, hard to break
The Complete Bootstrap Checklist
This checklist is intentionally exhaustive. You don’t need to memorize it. Copy it once, use it when needed, thank yourself later.
Pre-flight
- Decided bootstrap approach (default: dedicated module)
- Chosen bucket naming convention (include year for rotation)
- Determined bucket region (match main infrastructure)
- Verified AWS credentials and IAM permissions
Bootstrap Module Creation
- Created bootstrap/ directory
- Added main.tf with local backend
- Added S3 bucket resource with unique name
- Enabled versioning (required)
- Enabled encryption (AES256 minimum, KMS for high-security)
- Configured all four public access blocks
- Added lifecycle policy (90-day noncurrent version expiration)
- Added lifecycle policy (7-day incomplete upload abort)
- Added appropriate tags
- Added outputs for bucket name, region, ARN
- Created variables.tf and terraform.tfvars
Bootstrap Execution
- Ran terraform init in bootstrap directory
- Ran terraform plan, reviewed carefully
- Verified plan shows exactly 5 resources
- Ran terraform apply, confirmed success
- Verified bucket exists in AWS console
- Verified versioning enabled
- Verified encryption configured
- Verified public access blocks enabled
- Saved output values
Main Infrastructure Configuration
- Created backend config in infrastructure/main.tf
- Used exact bucket name from bootstrap output
- Used exact region from bootstrap output
- Set encrypt = true
- Set use_lockfile = true (Terraform 1.10+)
- For S3-compatible: added required flags
State Migration
- Backed up local state: cp terraform.tfstate terraform.tfstate.backup-$(date +%Y%m%d)
- Ran terraform init -migrate-state
- Confirmed migration completed
- Verified state exists in S3
- Verified local state removed
- Deleted backup after verification
Post-Bootstrap Validation
- Ran terraform plan (should show no changes)
- Tested concurrent read (two terminals, both run plan)
- Tested locking (two terminals, both run apply, one waits)
- Created second state version with trivial change
- Verified multiple versions exist in S3
- Added bootstrap to version control
- Added *.tfstate* to .gitignore
- Documented process in wiki
For S3-Compatible Backends
- Tested endpoint connectivity
- Verified path-style URLs work
- Confirmed checksum support or disabled it
- Tested locking with concurrent applies
- Validated versioning creates distinct versions
- Documented provider-specific quirks
Security and Compliance
- Verified IAM policies restrict bucket modification
- Confirmed encryption key management
- Enabled bucket logging if required
- Verified data residency compliance
- Added monitoring/alerting
- Documented controls for audits
Common Questions
Should I store bootstrap state in git?
No. Add it to .gitignore. If the bootstrap state is lost, re-running the bootstrap will fail with "bucket already exists"; import the existing bucket into the fresh local state instead: terraform import aws_s3_bucket.terraform_state bucket-name.
Can I use the same bucket for multiple environments?
Yes, with different state keys. But I don’t recommend it. Blast radius too large. Separate buckets cost ~$5/month each and provide better isolation.
What if I need to delete the state bucket?
Verify you want to delete all infrastructure state. Empty bucket completely (all versions). Remove lifecycle policies if they prevent deletion. Then delete bucket.
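Emptying a versioned bucket is the fiddly part, since aws s3 rm only removes current objects. A sketch (assumes jq; destructive, so read it twice before running):
BUCKET=mycompany-terraform-state-2025
# Delete all object versions, then all delete markers
# (delete-objects caps at 1000 keys per call; loop if you have more)
aws s3api list-object-versions --bucket "$BUCKET" \
  --query '{Objects: Versions[].{Key: Key, VersionId: VersionId}}' \
  --output json > /tmp/versions.json
aws s3api delete-objects --bucket "$BUCKET" --delete file:///tmp/versions.json
aws s3api list-object-versions --bucket "$BUCKET" \
  --query '{Objects: DeleteMarkers[].{Key: Key, VersionId: VersionId}}' \
  --output json > /tmp/markers.json
aws s3api delete-objects --bucket "$BUCKET" --delete file:///tmp/markers.json
aws s3 rb "s3://$BUCKET"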
How do I rotate the state bucket yearly?
Create new bootstrap with new name (include new year). Run it. Update infrastructure backend config. Run terraform init -migrate-state. Delete old bucket after verifying migration.
Do I need DynamoDB for locking?
Not with Terraform 1.10+. Use use_lockfile = true for native S3 locking. Older versions need a DynamoDB table.
What happens with simultaneous applies and no locking?
Both proceed. Potential state corruption. One person’s changes might overwrite the other’s. Always use locking.
Should I use KMS or AES256 encryption?
AES256 for most cases. KMS if you need audit trails (CloudTrail logs KMS operations), key rotation, or compliance requires it. KMS adds complexity and cost.
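If you do go the KMS route, the change to the bootstrap module is small: add a key and swap the encryption resource. A sketch (resource names are just illustrative):
resource "aws_kms_key" "terraform_state" {
  description         = "Terraform state encryption key"
  enable_key_rotation = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.terraform_state.arn
    }
    bucket_key_enabled = true
  }
}
The backend block also accepts kms_key_id if you want the state object written explicitly with that key.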
How often should I clean up old state versions?
90 days is reasonable. Long enough for recovery, short enough to avoid paying for years of history. Adjust for compliance requirements.
Can I use Terraform Cloud instead?
Yes. Handles state, locking, versioning. No bootstrap needed. Trade-off: dependency on Terraform Cloud availability and pricing.
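For completeness, the configuration is a cloud block instead of a backend block; a minimal sketch with placeholder organization and workspace names:
terraform {
  cloud {
    organization = "my-org"
    workspaces {
      name = "infrastructure"
    }
  }
}
No bootstrap module, no bucket; state and locking live in Terraform Cloud.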
What if state gets corrupted?
Download previous version from S3. Verify with terraform show -json. Replace current state. Run terraform plan to see differences. Apply corrections carefully.
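A rough sequence, assuming you already found the good version ID with list-object-versions (shown under Essential Tools):
aws s3api get-object --bucket mycompany-terraform-state-2025 \
  --key infrastructure/terraform.tfstate \
  --version-id "GOOD_VERSION_ID" recovered.tfstate
terraform show recovered.tfstate          # eyeball the resources before trusting it
terraform state push recovered.tfstate    # overwrite the corrupted remote state
terraform plan                            # confirm the diff matches expectations
If lineage or serial checks block the push and you're certain the file is right, terraform state push -force overrides them.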
How do I migrate between backends?
Update backend config. Run terraform init -migrate-state. Always backup first. Test in dev before touching production.
Essential Tools
For bootstrapping:
- Terraform >= 1.10 (native S3 locking via use_lockfile)
- AWS CLI (verification, testing)
- jq (parsing JSON output)
For state management:
terraform state list # All resources
terraform state show aws_instance.ex # Inspect resource
terraform state pull > backup.tfstate # Download for backup
terraform state rm aws_instance.ex # Remove from state
For S3-compatible providers:
curl https://minio.example.com # Test connectivity
mc alias set myminio https://... # MinIO client
For disaster recovery:
# List all state versions
aws s3api list-object-versions \
--bucket mycompany-terraform-state-2025 \
--prefix infrastructure/
# Download specific version
aws s3api get-object \
--bucket mycompany-terraform-state-2025 \
--key infrastructure/terraform.tfstate \
--version-id "version-id" \
old-state.tfstate
Final Thoughts
The bootstrap problem is your first real infrastructure decision. Handle it wrong and you fight your tooling for months. Handle it right and you forget it exists.
State isolation from main infrastructure. Versioning from day zero. Encryption before sensitive data arrives. Locking that prevents corruption.
Manual bucket creation works for weekend experiments. Everything else needs code.
The bootstrap module costs an extra hour upfront. It saves days when things break.
Bootstrap with code. Version your state. Encrypt from the start. Test your locking.
Never trust infrastructure you created manually.