From a Single Terraform Root to Modules + Live: Phased Migration, Import, and State Surgery

This post records one question I keep coming back to: if a Terraform setup starts with a single root and later needs dev and prod environments, a shared layer, and CI/CD boundaries, how do you migrate without disturbing the existing environment?

My final conclusion is simple:

  • do not rewrite everything in one shot
  • do not split CI/CD first
  • do not delete old roots before new roots converge

The method that actually works is phased migration.

What the pre-migration problem looks like

A very common early Terraform layout looks like this:

infra/terraform/
  bootstrap/
  platform/
  service/

This layout works well in the early stage because it is fast.

But once requirements grow, multiple problems appear at the same time:

  • shared resources and env resources live in the same root
  • service root depends on platform root assumptions
  • secrets are not environment-safe
  • production domain logic leaks into dev
  • CI/CD is coupled to legacy roots

At this point, the most dangerous thought is: delete the old roots directly and switch to a new structure.

That usually does not succeed.

My target structure

The goal is not just renaming folders. The goal is changing the ownership model.

infra/terraform/
  modules/
    shared/
    platform/
    service/
  live/
    shared/
    dev/
      platform/
      service/
    prod/
      platform/
      service/

The key points of this target structure are:

  • modules/* defines reusable infrastructure contracts
  • live/* owns real state and environment realization

Migration order matters more than final structure

The end state matters less than the order in which you reach it.

This is the order I follow:

  1. create new modules and live roots
  2. keep legacy roots during migration
  3. move or import resources into new live roots
  4. let new roots converge to a clean state
  5. then switch CI/CD to the new paths
  6. delete legacy roots only at the end

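Moving step 5 or step 6 earlier is the easiest way to break things: CI/CD would start applying from roots that have not converged yet, and the legacy roots you would fall back to would already be gone.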

Resource Adoption Method Selection

This is the most practical migration question.

I split it this way:

Cases for Adopting Existing Remote Resources

This bucket maps to terraform import. It covers resources that already exist in the cloud provider and are simply not yet owned by the new root:

  • Cloud Run service
  • Cloud SQL instance
  • service account
  • IAM member
  • Artifact Registry repository

Import is the natural choice for these resources. On Terraform 1.5 and later it can be written declaratively as an import block:

import {
  to = module.service.google_cloud_run_v2_service.api
  id = "projects/my-project/locations/us-east1/services/app-dev"
}
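
For the CLI form of the same adoption, a minimal shell sketch might look like the following. The helper name import_if_missing is hypothetical, and the sketch assumes the target root has already been initialized with terraform init. It only relies on terraform state list, terraform import, and the global -chdir option.

```shell
# Hypothetical helper: import ADDRESS (with provider ID) into the root at DIR,
# unless that address is already tracked in the root's state.
import_if_missing() {
  dir=$1 address=$2 id=$3
  # `terraform state list` prints one resource address per line;
  # grep -Fx matches the whole line literally.
  if terraform -chdir="$dir" state list 2>/dev/null | grep -Fxq -- "$address"; then
    echo "already in state: $address"
    return 0
  fi
  terraform -chdir="$dir" import "$address" "$id"
}

# Example call, using the address and ID from the import block above:
#   import_if_missing live/dev/service \
#     module.service.google_cloud_run_v2_service.api \
#     "projects/my-project/locations/us-east1/services/app-dev"
```

The state check makes the helper safe to re-run when a migration is interrupted halfway through a list of resources.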

Cases for Internal Ownership Moves

This bucket maps to terraform state mv, and the most typical one is random_password.

For this type of resource, the remote system does not directly keep the exact value represented in Terraform state. If you re-import or recreate it, you may not be “adopting an existing value”; you may be “generating a new value.”

That is very dangerous for things like DB passwords.

terraform state mv \
  -state=old.tfstate \
  -state-out=new.tfstate \
  random_password.db \
  module.platform.random_password.db

In this case, what you must protect is not the resource name, but the value itself.
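
When both roots use a remote backend, old.tfstate and new.tfstate do not exist as local files until you pull them. One way to wrap the state mv command is sketched below. This is a sketch under several assumptions: the helper name move_across_roots is hypothetical, both roots are initialized and already have state, the function is called from a directory with no Terraform backend configured (so the -state and -state-out options act on plain local files), and nobody else applies to either root while it runs.

```shell
# Hypothetical helper: move one address from the root at OLD_DIR to the root
# at NEW_DIR by pulling both states, moving locally, and pushing both back.
move_across_roots() {
  old_dir=$1 new_dir=$2 src=$3 dst=$4
  # Work in a temp dir with absolute paths, because -chdir changes
  # where relative paths resolve.
  work=$(mktemp -d) || return 1
  terraform -chdir="$old_dir" state pull > "$work/old.tfstate" || return 1
  terraform -chdir="$new_dir" state pull > "$work/new.tfstate" || return 1
  # Same command as in the post, applied to the pulled copies.
  terraform state mv \
    -state="$work/old.tfstate" \
    -state-out="$work/new.tfstate" \
    "$src" "$dst" || return 1
  # Push the destination first so the value is never missing from both states.
  terraform -chdir="$new_dir" state push "$work/new.tfstate" || return 1
  terraform -chdir="$old_dir" state push "$work/old.tfstate" || return 1
  echo "moved $src -> $dst (working copies kept in $work)"
}

# Example call:
#   move_across_roots platform live/dev/platform \
#     random_password.db module.platform.random_password.db
```

The working copies contain secrets in plain text, so delete the temp directory once both plans are verified.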

Initial Cutover Edge Cases

In theory, we all want No changes as soon as the new root is in place.

In practice, the first migration usually has exceptions, especially in these cases:

  • existing custom domain mapping
  • manually populated secrets
  • runtime values that were never written in tfvars
  • old resources with names that no longer match the new contract

These items usually do not naturally fall into an idealized automation flow.

So now I split migration into two step types:

Type                               Description
structural migration               modules, roots, state ownership
first-deploy exception handling    import-only resources, secret copy, runtime value capture

If you force these two concerns into one straight line, the runbook becomes overly optimistic.

Deployment Pipeline Cutover Timing

I am now very sure about one thing:

Switch CI/CD at the end.

The reason is simple. Until the new live roots have passed these checks:

  • live/shared plan is clean
  • live/dev/platform plan is clean
  • live/dev/service plan is clean

CI/CD should not be changed to fully depend on new roots yet.

Otherwise, once auto deploy starts running, you hit all of these at once:

  • code path changed
  • workflow changed
  • state ownership changed
  • runtime contract changed

At that point, debugging becomes very hard.

A safer way is:

  1. make the new roots valid
  2. import or move state
  3. reconcile to zero drift
  4. only then switch workflows over
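
The "reconcile to zero drift" step can be made into an explicit gate. The sketch below relies on terraform plan -detailed-exitcode, which exits 0 when the plan is empty, 1 on error, and 2 when changes are present. The function names are hypothetical, the root paths are the ones from the target layout, and each root is assumed to be initialized already.

```shell
# Succeeds only if `terraform plan` reports no changes for the root at $1.
# With -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present.
plan_is_clean() {
  terraform -chdir="$1" plan -detailed-exitcode -input=false >/dev/null 2>&1
}

# Check every root passed as an argument and report each result.
# Returns non-zero if any root errored or still shows changes.
all_roots_clean() {
  failed=0
  for root in "$@"; do
    if plan_is_clean "$root"; then
      echo "clean: $root"
    else
      echo "NOT clean: $root"
      failed=1
    fi
  done
  return "$failed"
}

# Example call:
#   all_roots_clean live/shared live/dev/platform live/dev/service
```

Plan output is discarded here to keep the gate terse; when a root is reported as not clean, rerun terraform plan in that root by hand to see the diff.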

Exit Criteria for the Refactor

I now use a strict standard for myself.

The migration is not done just because terraform apply succeeds.

I check all of these:

  • new roots can plan with no changes
  • old roots no longer own migrated resources
  • current workflows point only to new roots
  • runtime behavior is still correct
  • production-only resources still stay in production paths only

If any one of these is missing, the migration is not truly complete.
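
The second criterion, that old roots no longer own migrated resources, is easy to assume and worth checking mechanically. A minimal sketch, assuming a hypothetical helper name assert_legacy_released and a list of migrated addresses supplied on standard input:

```shell
# Hypothetical helper: read migrated resource addresses from stdin (one per
# line) and report any that the legacy root at $1 still tracks.
# Returns 0 if none are left, 1 if some remain, 2 if the state cannot be read.
assert_legacy_released() {
  dir=$1
  state=$(terraform -chdir="$dir" state list) || return 2
  found=0
  while IFS= read -r address; do
    [ -n "$address" ] || continue
    # Whole-line literal match against the legacy root's address list.
    if printf '%s\n' "$state" | grep -Fxq -- "$address"; then
      echo "still owned by $dir: $address"
      found=1
    fi
  done
  return "$found"
}

# Example call:
#   printf '%s\n' random_password.db | assert_legacy_released platform
```

A resource that shows up here is owned by two states at once, which is exactly the condition under which deleting the legacy root would destroy something live.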

Phase Checklist I Reuse

If I do a similar migration again, I will directly reuse this checklist:

Phase 1
- create modules
- create live roots
- keep legacy roots untouched

Phase 2
- import shared resources
- import or move dev platform resources
- import or move dev service resources
- copy secret values if needed

Phase 3
- verify all new roots show no changes
- switch workflows to the new live roots
- verify deploys end to end

Phase 4
- bring up prod
- move prod-only resources like custom domain
- verify domain and runtime behavior

Phase 5
- delete legacy workflow
- delete legacy roots
- remove legacy IAM and secrets

The benefit is clarity: you always know which phase is blocked, instead of treating migration as one vague large task.

Runtime Value Retrieval Commands

In this kind of migration, what is easiest to forget is often not the Terraform itself, but where to fetch current runtime values such as BASE_URL. These are the commands I check most often.

Inspect Deployed Container Reference

gcloud run services describe app-dev \
  --region us-east1 \
  --format='value(spec.template.spec.containers[0].image)'

Inspect Active Endpoint Value

gcloud run services describe app-dev \
  --region us-east1 \
  --format='value(status.url)'

Inspect BASE_URL Runtime Value

gcloud run services describe app-dev \
  --region us-east1 \
  --format=json \
  | jq -r '.spec.template.spec.containers[0].env[]? | select(.name=="BASE_URL") | .value'

Inspect Full Runtime Variable Array

gcloud run services describe app-dev \
  --region us-east1 \
  --format='yaml(spec.template.spec.containers[0].env)'

Populate New Per-Env Credential Entry

gcloud secrets versions access latest --secret=auth-backend-secret \
  | gcloud secrets versions add auth-backend-secret-dev --data-file=-
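
After a copy like this, it can be worth confirming that the two secrets really hold the same payload before anything starts reading the new one. A minimal sketch, with a hypothetical helper name secrets_match; it uses only the gcloud secrets versions access command shown above:

```shell
# Hypothetical helper: succeeds if the latest versions of secrets $1 and $2
# hold identical payloads. Values stay in shell variables and are never printed.
# Note: command substitution strips trailing newlines, so payloads that differ
# only in trailing newlines compare as equal.
secrets_match() {
  a=$(gcloud secrets versions access latest --secret="$1") || return 2
  b=$(gcloud secrets versions access latest --secret="$2") || return 2
  [ "$a" = "$b" ]
}

# Example call:
#   secrets_match auth-backend-secret auth-backend-secret-dev && echo "copy verified"
```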

Retrieve Latest Credential Content

gcloud secrets versions access latest --secret=database-url-dev

Conclusion

The biggest risk in a Terraform migration is not writing extra HCL files. The risk is coupling an ownership change, a workflow switch, and a runtime change into a single step.

The truly stable way is:

  • modules first
  • live roots next
  • state migration after that
  • CI/CD switch later
  • legacy deletion last

As long as the order is right, a modules + live refactor can be very stable and does not need a big-bang rewrite.
