Cloud Run Runtime Contract: Environment Variable Injection, Secret-Gated Auth Bypass, and Auto-Deploy Pipeline

This post records how I defined a runtime contract for a FastAPI service on Cloud Run, how I designed an auth bypass for the dev environment, and how I built the pipeline that deploys automatically on every push.

Unlike what I wrote earlier about WIF branch boundaries and dev/prod isolation, this post focuses on the “contract” between the app layer and the infra layer: what Terraform injects, what app code consumes, and where to draw the boundary.

Service boundary: no direct provider client calls

At first I thought an app on Cloud Run should fetch its own secrets from Secret Manager. Later I found several problems with that approach:

  1. App code has to import google.cloud.secretmanager, which couples the domain/application layer to GCP.
  2. Every startup, or every request, needs an API call, which adds latency.
  3. Tests have to mock the GCP SDK.
  4. The IAM scope changes from “runtime SA can use the secret” to “runtime SA can read the secret + can call the Secret Manager API.”

Cloud Run natively supports injecting Secret Manager values into environment variables. After Terraform sets it up, app code only needs os.getenv():

dynamic "env" {
  # Map of env var name => Secret Manager secret ID.
  for_each = {
    DATABASE_URL     = "database-url-${var.env}"
    CLERK_SECRET_KEY = "clerk-secret-key"
    ADMIN_API_KEY    = "admin-api-key"
  }
  content {
    name = env.key
    value_source {
      secret_key_ref {
        secret  = env.value
        version = "latest"
      }
    }
  }
}

App code only sees environment variables and does not know where values come from. So I defined two contract constants in config:

GCP_SECRET_DELIVERY_MODE = "environment-injection-only"
GCP_IDENTITY_MODE = "service-account-adc-only"
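
A minimal sketch of what consuming the injected values could look like on the app side. The helper `require_env` and the error type `MissingEnvError` are hypothetical names, not taken from the real config module. The functions read from a mapping so tests can pass a plain dict instead of patching os.environ.

```python
import os
from collections.abc import Mapping


class MissingEnvError(RuntimeError):
    """Hypothetical error type: a required variable is absent or empty."""


def require_env(name: str, environ: Mapping[str, str] = os.environ) -> str:
    # The value may come from Secret Manager injection or from a plain
    # Terraform env block; the app cannot tell and does not need to.
    value = environ.get(name, "")
    if not value:
        raise MissingEnvError(f"required environment variable {name} is not set")
    return value


def load_database_url(environ: Mapping[str, str] = os.environ) -> str:
    # No GCP import anywhere: the contract is just "the variable exists".
    return require_env("DATABASE_URL", environ)
```

Failing at startup when a value is missing keeps a misconfigured revision from serving traffic with half of its settings.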

Then I use tests to ensure nobody imports the Secret Manager client under src/:

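# SOURCE_ROOT is a pathlib.Path pointing at src/, defined elsewhere in the test module.
FORBIDDEN_RUNTIME_SECRET_PATTERNS = (
    "google.cloud.secretmanager",
    "SecretManagerServiceClient",
)

def test_source_tree_does_not_use_runtime_secret_manager_reads():
    for path in SOURCE_ROOT.rglob("*.py"):
        content = path.read_text(encoding="utf-8")
        for pattern in FORBIDDEN_RUNTIME_SECRET_PATTERNS:
            # Name the file and pattern so a CI failure is actionable.
            assert pattern not in content, f"{path} contains {pattern!r}"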

The domain layer and application layer are also not allowed to use any GCP identity API:

FORBIDDEN_MODULE_IDENTITY_PATTERNS = (
    "google.cloud",
    "google.auth",
    "secretmanager",
    "GOOGLE_APPLICATION_CREDENTIALS",
)

def test_module_domain_and_application_layers_do_not_depend_on_gcp_identity_apis():
    module_root = SOURCE_ROOT / "modules"
    for layer in ("application", "domain"):
        # Matches every .py file under modules/<name>/<layer>/, at any depth.
        for path in module_root.glob(f"*/{layer}/**/*.py"):
            content = path.read_text(encoding="utf-8")
            for pattern in FORBIDDEN_MODULE_IDENTITY_PATTERNS:
                assert pattern not in content, f"{path} contains {pattern!r}"

With this guard, even if someone accidentally adds a GCP-related import in the domain layer, CI blocks it.

Environment variable categories

In config, all environment variables are split into explicit categories:

| Category | Variables | Injected by |
|---|---|---|
| Required secrets | DATABASE_URL, CLERK_SECRET_KEY, ADMIN_API_KEY | Secret Manager → env var |
| Optional secrets | RESEND_API_KEY | Secret Manager → env var |
| Required plain | ENV, BASE_URL | Terraform env block |
| Optional plain | RESEND_FROM_EMAIL | Terraform env block |
| Code defaults | SHOPIFY_API_VERSION, SHOPIFY_SCOPES | App code has default values |
| Dev-only plain | ALLOW_DEV_AUTH_BYPASS | Terraform env block (conditional) |
| Dev-only secrets | DEV_BYPASS_SECRET | Secret Manager → env var (conditional) |

Each category has a corresponding tuple constant, and tests verify those constant values were not changed by accident.
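
As a rough sketch, the tuple constants and the tests that pin them might look like this. The constant names and test names are hypothetical; only the variable names come from the category listing above.

```python
# Hypothetical constant names; the real config module may name them differently.
REQUIRED_SECRET_ENV_VARS = ("DATABASE_URL", "CLERK_SECRET_KEY", "ADMIN_API_KEY")
OPTIONAL_SECRET_ENV_VARS = ("RESEND_API_KEY",)
REQUIRED_PLAIN_ENV_VARS = ("ENV", "BASE_URL")
OPTIONAL_PLAIN_ENV_VARS = ("RESEND_FROM_EMAIL",)
DEV_ONLY_PLAIN_ENV_VARS = ("ALLOW_DEV_AUTH_BYPASS",)
DEV_ONLY_SECRET_ENV_VARS = ("DEV_BYPASS_SECRET",)

ALL_CATEGORIES = (
    REQUIRED_SECRET_ENV_VARS,
    OPTIONAL_SECRET_ENV_VARS,
    REQUIRED_PLAIN_ENV_VARS,
    OPTIONAL_PLAIN_ENV_VARS,
    DEV_ONLY_PLAIN_ENV_VARS,
    DEV_ONLY_SECRET_ENV_VARS,
)


def test_required_secrets_are_pinned():
    # Changing the contract means editing this literal too, so the change
    # shows up explicitly in review instead of slipping through.
    assert REQUIRED_SECRET_ENV_VARS == (
        "DATABASE_URL",
        "CLERK_SECRET_KEY",
        "ADMIN_API_KEY",
    )


def test_categories_do_not_overlap():
    names = [name for category in ALL_CATEGORIES for name in category]
    assert len(names) == len(set(names))
```

Duplicating the literal in the test is deliberate: the test only has value if it cannot be satisfied by editing the constant alone.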

Secret-Gated Auth Bypass

When testing APIs in the dev environment, getting a real Clerk JWT every time is painful. So I wanted a dev bypass: send a fake token and simulate a login. But if the bypass has no guard, anyone can impersonate any user.

Three environments, three behaviors

| Environment | Bypass behavior | Token format |
|---|---|---|
| LOCAL | Enabled by default, no secret required | dev:<user_id> |
| DEV | Requires flag + secret | dev:<secret>:<user_id> |
| PROD | Cannot be enabled (crash at startup) | n/a |

LOCAL is the most relaxed: local development needs no extra setup, and dev:user_1 just works. DEV adds one layer of protection: the token must include a secret, and the secret is verified with hmac.compare_digest, a constant-time comparison.

Protection layers

The whole design has five protection layers:

  1. PROD crash: config.py checks at startup; if ALLOW_DEV_AUTH_BYPASS is set in PROD, it raises immediately and the app does not start.
  2. Secret required: if bypass is enabled in DEV but DEV_BYPASS_SECRET is not set, the app also crashes at startup.
  3. Token validation: ClerkAuthClient verifies the secret with hmac.compare_digest, not ==.
  4. Terraform guard: injection of ALLOW_DEV_AUTH_BYPASS and DEV_BYPASS_SECRET is conditional, only present when var.allow_dev_auth_bypass == true.
  5. Platform guard: the dev-bypass-secret resource in Secret Manager is also conditional; if the TF variable is not enabled, it does not exist.

The token check from layer 3:

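def _parse_dev_token(self, token: str) -> str:
    # token[4:] drops the leading "dev:" prefix.
    payload = token[4:]
    if self._env == Environment.LOCAL:
        # LOCAL: the whole payload is the user id, no secret required.
        if not payload:
            raise AuthenticationError("dev bypass: empty user id")
        return payload
    # DEV: the payload must be <secret>:<user_id>.
    parts = payload.split(":", 1)
    if len(parts) != 2 or not parts[0] or not parts[1]:
        raise AuthenticationError("dev bypass: expected dev:<secret>:<user_id>")
    secret, user_id = parts
    # Compare as bytes: compare_digest raises TypeError on non-ASCII str input.
    if not hmac.compare_digest(secret.encode(), self._dev_bypass_secret.encode()):
        raise AuthenticationError("dev bypass: invalid secret")
    return user_id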
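
Layers 1 and 2 from the list above could be sketched as a single startup check. This is an illustration, not the actual config.py: the `Environment` enum values, the `ConfigError` type, and the function name `validate_bypass_config` are all assumptions.

```python
from enum import Enum
from typing import Optional


class Environment(str, Enum):
    # Stand-in for the app's own Environment enum; the values are assumed.
    LOCAL = "local"
    DEV = "dev"
    PROD = "prod"


class ConfigError(RuntimeError):
    """Hypothetical error raised at startup when the config is unsafe."""


def validate_bypass_config(
    env: Environment,
    allow_bypass: bool,
    dev_bypass_secret: Optional[str],
) -> None:
    # Layer 1: the bypass flag must never be set in PROD.
    if env is Environment.PROD and allow_bypass:
        raise ConfigError("ALLOW_DEV_AUTH_BYPASS must not be set in PROD")
    # Layer 2: in DEV, the flag alone is not enough; a secret must exist.
    if env is Environment.DEV and allow_bypass and not dev_bypass_secret:
        raise ConfigError("DEV_BYPASS_SECRET is required when bypass is enabled in DEV")
```

Raising at import or startup time means a misconfigured revision never becomes healthy, so Cloud Run keeps serving the previous one.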

Why constant-time comparison is required

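An ordinary string comparison (==) can return as soon as it finds the first differing character, so the response time leaks how much of a guess was correct. An attacker can use that to recover the secret one character at a time (a timing attack). hmac.compare_digest is designed so that its running time does not depend on where the inputs differ.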
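
To make the leak concrete, here is a small model of an early-exit comparison that counts how many characters it examined, next to a wrapper around hmac.compare_digest. `naive_equal_steps` is a teaching model only, not how CPython implements ==, and `safe_equal` is a hypothetical helper name.

```python
import hmac


def naive_equal_steps(a: str, b: str) -> tuple[bool, int]:
    """Early-exit comparison that also reports how many characters it examined."""
    steps = 0
    if len(a) != len(b):
        return False, steps
    for x, y in zip(a, b):
        steps += 1
        if x != y:
            # Work done so far depends on the length of the matching prefix.
            return False, steps
    return True, steps


def safe_equal(a: str, b: str) -> bool:
    # Encode first: compare_digest accepts str only when it is ASCII-only.
    return hmac.compare_digest(a.encode("utf-8"), b.encode("utf-8"))
```

With the secret "s3cret", the guess "x3cret" is rejected after one step and "s3creX" after six. That difference in work is exactly what a timing attack measures.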

Single Source of Truth

The flag that controls the bypass lives in a GitHub repo variable (vars.ALLOW_DEV_AUTH_BYPASS), and both the infra workflow and the deploy workflow read it:

env:
  TF_VAR_allow_dev_auth_bypass: ${{ vars.ALLOW_DEV_AUTH_BYPASS || 'false' }}

When the infra workflow runs the platform Terraform, this value decides whether the Secret Manager resource is created. When the deploy workflow runs the service Terraform, it decides whether the environment variables are injected. One variable controls both paths, so there is no inconsistency like “platform created the secret but service did not inject it.”

Auto-Deploy Pipeline

Push to dev triggers deploy-dev, and push to main triggers deploy-prod. The pipeline has four stages:

test → build → migrate → deploy (+ health check)

Test

The same lint, type check, and tests as the CI workflow that runs on PRs:

- name: Lint and test
  run: |
    uv run --directory api ruff check src tests
    uv run --directory api ruff format --check src tests
    uv run --directory api pyright
    uv run --directory api pytest tests -q

Why run them again in the deploy pipeline? After several PRs are merged into dev or main, their changes combine, and the merged state may fail tests that each PR passed on its own. The test stage in the deploy pipeline is the last line of defense.

Build

Build the Docker image, push it to Artifact Registry, and resolve its immutable digest:

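- name: Build and push
  run: |
    TAG="${GCP_REGION}-docker.pkg.dev/${GCP_PROJECT_ID}/pebble/pebble-api:sha-${GITHUB_SHA::7}"
    docker build -f api/Dockerfile -t "$TAG" api/
    docker push "$TAG"
    # Shell variables do not survive across steps; one way to hand the tag on
    # is to export it through GITHUB_ENV.
    echo "BUILD_TAG=$TAG" >> "$GITHUB_ENV"

- name: Resolve immutable digest
  run: |
    # After the push, RepoDigests holds the reference in repo@sha256:... form.
    IMAGE=$(docker inspect --format='{{index .RepoDigests 0}}' "${BUILD_TAG}")
    echo "image=$IMAGE" >> "$GITHUB_OUTPUT"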

sha-${GITHUB_SHA::7} serves as a human-readable tag, but the deploy uses the immutable digest (sha256:...), not the tag. A tag can be overwritten to point at a different image; a digest always identifies the same content.
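
If you want the pipeline to refuse a tag-based reference outright, a small guard along these lines could run before Terraform. It is a sketch under assumptions: the function name `is_digest_pinned` is hypothetical, and the pattern only covers sha256 digests.

```python
import re

# A digest reference has the form <repository>@sha256:<64 lowercase hex chars>.
_DIGEST_REF = re.compile(r"[^@\s]+@sha256:[0-9a-f]{64}")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True only for references pinned by digest, never for tags."""
    return _DIGEST_REF.fullmatch(image_ref) is not None
```

For example, a reference ending in `:sha-abc1234` is rejected, while the same repository followed by `@sha256:` and 64 hex characters is accepted.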

Migrate

Use Cloud SQL Proxy to connect to Cloud SQL and run alembic upgrade head:

- name: Run database migration
  run: |
    CONN=$(gcloud sql instances describe "${GCP_SQL_INSTANCE}" \
      --project="${GCP_PROJECT_ID}" --format='value(connectionName)')
    cloud-sql-proxy "${CONN}" --port 5432 &
    sleep 2
    SECRET_URL=$(gcloud secrets versions access latest \
      --secret="database-url-${ENV}" --project="${GCP_PROJECT_ID}")
    CI_URL=$(echo "${SECRET_URL}" \
      | sed 's|@/pebble?host=/cloudsql/.*|@127.0.0.1:5432/pebble|')
    DATABASE_URL="${CI_URL}" uv run --directory api alembic upgrade head

One detail here: at runtime, DATABASE_URL connects to the Cloud SQL Auth Proxy sidecar through a Unix socket path (?host=/cloudsql/...), but in CI we connect to a locally started Cloud SQL Proxy over TCP (127.0.0.1:5432). So the URL has to be rewritten with sed.
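
The same rewrite expressed in Python, which makes it easy to unit test. The function name `to_ci_database_url` is hypothetical, and the sample URL below (driver scheme, user, instance name) is made up for illustration.

```python
import re

# Mirrors the sed expression: everything from "@/pebble?host=/cloudsql/" to the
# end of the line is replaced with a TCP host, port, and database name.
_SOCKET_SUFFIX = re.compile(r"@/pebble\?host=/cloudsql/.*")


def to_ci_database_url(runtime_url: str) -> str:
    return _SOCKET_SUFFIX.sub("@127.0.0.1:5432/pebble", runtime_url, count=1)


# Made-up example value; real URLs come from Secret Manager.
example = "postgresql+psycopg://app:pw@/pebble?host=/cloudsql/proj:region:instance"
print(to_ci_database_url(example))
# prints postgresql+psycopg://app:pw@127.0.0.1:5432/pebble
```

A URL that does not match the socket form is returned unchanged, which is also how the sed command behaves.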

The migration runs before the deploy so that the schema is up to date when the new revision starts. If the migration fails, the deploy does not run.

Deploy

Use Terraform to update Cloud Run service:

- name: Terraform plan
  run: |
    terraform plan -lock-timeout=5m \
      -var-file=terraform.tfvars \
      -var="api_image=${API_IMAGE}" \
      -out=tfplan

- name: Terraform apply
  run: terraform apply -lock-timeout=5m tfplan

The image digest is passed into Terraform with -var, and Terraform then rolls out a new Cloud Run revision. Applying a saved plan file (-out=tfplan) means apply executes exactly the changes that plan printed, with no surprises in between.

Health Check

After the deploy, wait until the new Cloud Run revision is up, then call /health:

- name: Health check
  run: |
    API_URL=$(terraform output -raw api_url)
    for i in $(seq 1 30); do
      if curl -sf "${API_URL}/health"; then
        echo "Health check passed"
        exit 0
      fi
      sleep 5
    done
    echo "::error::Health check failed after 150 seconds"
    exit 1

Prod has one extra health check against the custom domain, but it reports a warning instead of an error: DNS and certificate propagation can take longer, and that should not mark the whole deploy as failed.
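
The polling logic can also be sketched in Python, which is convenient if you want to test it without a network. All names here (`wait_until_healthy`, `http_ok`) are hypothetical. Unlike the shell loop, this version skips the sleep after the final failed attempt.

```python
import time
import urllib.error
import urllib.request
from collections.abc import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, connection failures, and timeouts.
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

A caller would pass something like `lambda: http_ok(api_url + "/health")` as the probe. The Cloud Run URL check would treat a False result as a failure, while the custom-domain check would only log a warning.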

Environment-specific behavior differences

| Item | Dev | Prod |
|---|---|---|
| Trigger | push to dev | push to main |
| cancel-in-progress | true | false |
| Branch validation | allow dev only | allow main only |
| Health check | Cloud Run URL only | Cloud Run URL + custom domain |
| Secret suffix | database-url-dev | database-url-prod |
| GitHub environment | dev | prod |

cancel-in-progress: true is reasonable on dev: with continuous pushes, only the latest run matters. Prod uses false because we do not want an in-progress deploy to be canceled halfway.

Pitfalls I hit

Infrastructure stack execution order

The Secret Manager resource is created by the platform Terraform, while env var injection is configured by the service Terraform. When adding the bypass secret, I ran the service TF deploy first, and Cloud Run returned “Permission denied on secret” because the secret did not exist yet.

Correct order: platform TF (create secret + IAM) → store secret value → service TF (deploy)

Resource created but still empty

The platform TF google_secret_manager_secret resource only creates the secret itself; it does not store a value. Cloud Run can read the secret only once it has at least one version. So after the platform apply, you still need to store a value manually:

echo -n "the-actual-secret" | gcloud secrets versions add dev-bypass-secret --data-file=-

This step is not automated, because the secret value should not appear in any repo.

Pipeline definition change missed the merge request

I added TF_VAR_allow_dev_auth_bypass to .github/workflows/deploy-dev.yml, but that change was not included in the PR. After the merge, the infra workflow ran without a value for the TF variable, and the deploy failed.

Lesson: workflow file changes and the code changes they serve should be in the same PR.

Lessons learned

  1. App code does not touch the GCP SDK: use env injection instead of fetching secrets at runtime. The domain/application layers stay fully decoupled from the cloud platform.
  2. Enforce the runtime contract with tests: constants plus a pattern scan, so CI blocks every violation.
  3. Dev bypass needs a secret gate: a flag alone is not enough; add a secret checked with constant-time comparison.
  4. Run the tests again in the pipeline: a passing PR CI does not guarantee the post-merge state also passes.
  5. Platform TF before service TF: secrets and IAM must exist first, then the deploy can succeed.
  6. Ship workflow changes together with the code they serve: otherwise the workflow cannot read the new settings after the merge.

References