Cloud Run Runtime Contract: Environment Variable Injection, Secret-Gated Auth Bypass, and Auto-Deploy Pipeline
This post records how I defined a runtime contract for a FastAPI service on Cloud Run, how I designed an auth bypass for the dev environment, and how I built the auto-deploy pipeline.
Unlike my earlier posts on WIF branch boundaries and dev/prod isolation, this post focuses on the “contract” between the app layer and the infra layer: what Terraform injects, what app code consumes, and where to draw the boundary.
Service boundary: no direct provider client calls
At first I thought an app on Cloud Run should fetch secrets from Secret Manager by itself. Later I found this approach has several problems:
- App code needs to import google.cloud.secretmanager, which couples the domain/application layer to GCP.
- Every startup or every request needs an API call, which adds latency.
- Tests need to mock the GCP SDK.
- The IAM scope changes from “runtime SA can use secret” to “runtime SA can read secret + can call Secret Manager API.”
Cloud Run natively supports injecting Secret Manager values into environment variables. Once Terraform sets that up, app code only needs os.getenv().
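A minimal sketch of what consuming an injected value could look like on the app side. The helper name read_required and the optional env parameter (added only so the function can be exercised without touching the real process environment) are assumptions for illustration, not the project's actual config code.

```python
import os
from typing import Mapping, Optional


def read_required(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return a required setting, failing fast if it was not injected."""
    # Hypothetical helper: defaults to the real process environment.
    source = os.environ if env is None else env
    value = source.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value


# The app reads the value like any other environment variable; nothing here
# knows (or needs to know) that it originated in Secret Manager.
# database_url = read_required("DATABASE_URL")
```

Failing at startup when a required variable is absent tends to surface a broken injection as a failed revision rather than as a runtime error on the first request.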
App code only sees environment variables and does not know where the values come from. So I defined two contract constants in config.
Then I use tests to ensure nobody imports the Secret Manager client under src/.
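One way such a guard could be written is an AST-based import scan, sketched below. The function names, the FORBIDDEN_PREFIXES constant, and the choice of AST parsing over a text pattern scan are assumptions for illustration; the same function could be pointed at the domain and application packages with a different prefix list.

```python
import ast
from pathlib import Path
from typing import List, Sequence, Tuple

# Hypothetical list of module prefixes that must not be imported under src/.
FORBIDDEN_PREFIXES = ("google.cloud.secretmanager", "google.cloud.secretmanager_v1")


def imported_modules(source: str) -> List[str]:
    """Return the dotted names of everything a Python source file imports."""
    modules: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # "from google.cloud import secretmanager" -> google.cloud.secretmanager
            modules.extend(f"{node.module}.{alias.name}" for alias in node.names)
    return modules


def find_forbidden_imports(
    root: Path, prefixes: Sequence[str] = FORBIDDEN_PREFIXES
) -> List[Tuple[str, str]]:
    """Return (file, module) pairs for every forbidden import below root."""
    violations: List[Tuple[str, str]] = []
    for path in sorted(root.rglob("*.py")):
        for module in imported_modules(path.read_text(encoding="utf-8")):
            if any(module == p or module.startswith(p + ".") for p in prefixes):
                violations.append((str(path), module))
    return violations


def test_no_secret_manager_imports() -> None:
    # Runs under pytest in CI; any violation fails the build.
    assert find_forbidden_imports(Path("src")) == []
```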
The domain layer and application layer are also not allowed to use any GCP identity API.
With this guard, even if someone accidentally adds a GCP-related import in the domain layer, CI blocks it.
Environment variable categories
In config, all environment variables are split into explicit categories:
| Category | Variables | Injected by |
|---|---|---|
| Required secrets | DATABASE_URL, CLERK_SECRET_KEY, ADMIN_API_KEY | Secret Manager → env var |
| Optional secrets | RESEND_API_KEY | Secret Manager → env var |
| Required plain | ENV, BASE_URL | Terraform env block |
| Optional plain | RESEND_FROM_EMAIL | Terraform env block |
| Code defaults | SHOPIFY_API_VERSION, SHOPIFY_SCOPES | App code has default values |
| Dev-only plain | ALLOW_DEV_AUTH_BYPASS | Terraform env block (conditional) |
| Dev-only secrets | DEV_BYPASS_SECRET | Secret Manager → env var (conditional) |
Each category has a corresponding tuple constant, and tests verify those constant values were not changed by accident.
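A minimal sketch of how the category tuples and their pinning tests might look. The variable names come from the table above; the tuple names and the missing_required helper are hypothetical.

```python
from typing import List, Mapping

# Tuple names are assumptions; the members mirror the category table.
REQUIRED_SECRETS = ("DATABASE_URL", "CLERK_SECRET_KEY", "ADMIN_API_KEY")
OPTIONAL_SECRETS = ("RESEND_API_KEY",)
REQUIRED_PLAIN = ("ENV", "BASE_URL")
OPTIONAL_PLAIN = ("RESEND_FROM_EMAIL",)


def missing_required(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are absent or empty in env."""
    return [name for name in REQUIRED_SECRETS + REQUIRED_PLAIN if not env.get(name)]


def test_required_contract_is_pinned() -> None:
    # The literals are repeated on purpose: changing the contract then
    # requires a deliberate, reviewable change to the test as well.
    assert REQUIRED_SECRETS == ("DATABASE_URL", "CLERK_SECRET_KEY", "ADMIN_API_KEY")
    assert REQUIRED_PLAIN == ("ENV", "BASE_URL")
```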
Secret-Gated Auth Bypass
When testing APIs in the dev environment, getting a real Clerk JWT every time is painful. So I need a dev bypass: send a fake token and simulate a login. But if the bypass has no guard, anyone can impersonate any user.
Three environments, three behaviors
| Environment | Bypass behavior | Token format |
|---|---|---|
| LOCAL | Enabled by default, no secret required | dev:<user_id> |
| DEV | Requires flag + secret | dev:<secret>:<user_id> |
| PROD | Cannot be enabled (crash at startup) | — |
LOCAL is the most relaxed: local development needs no extra setup, and dev:user_1 works. DEV adds one protection layer: the token must include a secret, and the secret is verified with hmac.compare_digest, a constant-time comparison.
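The three behaviors can be sketched as a single parsing function. This is an illustration under assumptions: the function name, the way the environment name is passed in, and the handling of malformed tokens are hypothetical and not the actual ClerkAuthClient code.

```python
import hmac
from typing import Optional


def parse_bypass_token(
    token: str, env: str, expected_secret: Optional[str] = None
) -> Optional[str]:
    """Return the user id carried by a valid bypass token, else None."""
    prefix = "dev:"
    if not token.startswith(prefix):
        return None
    rest = token[len(prefix):]

    if env == "LOCAL":
        # dev:<user_id>, no secret required
        return rest or None

    if env == "DEV":
        # dev:<secret>:<user_id>
        if not expected_secret:
            return None
        secret, sep, user_id = rest.partition(":")
        if not sep or not user_id:
            return None
        # Compare as bytes so non-ASCII input cannot raise a TypeError.
        if not hmac.compare_digest(secret.encode(), expected_secret.encode()):
            return None
        return user_id

    # Any other environment (including PROD) never accepts bypass tokens.
    return None
```

For example, parse_bypass_token("dev:user_1", "LOCAL") yields "user_1", while the same token in DEV is rejected because it carries no secret.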
Protection layers
The whole design has five protection layers:
- PROD crash: config.py checks at startup; if ALLOW_DEV_AUTH_BYPASS is set in PROD, it raises immediately and the app does not start.
- Secret required: if the bypass is enabled in DEV but DEV_BYPASS_SECRET is not set, it also crashes.
- Token validation: ClerkAuthClient verifies the secret with hmac.compare_digest, not ==.
- Terraform guard: injection of ALLOW_DEV_AUTH_BYPASS and DEV_BYPASS_SECRET is conditional, only present when var.allow_dev_auth_bypass == true.
- Platform guard: the dev-bypass-secret resource in Secret Manager is also conditional; if the TF variable is not enabled, it does not exist.
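The first two layers are startup checks, which could be sketched roughly as follows. The function name, the set of values treated as "enabled", and the uppercase ENV values are assumptions; the real config.py may differ.

```python
from typing import Mapping


def bypass_enabled(env: Mapping[str, str]) -> bool:
    """Validate bypass settings at startup; return whether bypass is active."""
    environment = env.get("ENV", "").upper()
    # Assumption: "1" or "true" (case-insensitive) means the flag is on.
    flag = env.get("ALLOW_DEV_AUTH_BYPASS", "").strip().lower() in ("1", "true")

    if environment == "PROD" and flag:
        # Layer 1: refuse to start at all.
        raise RuntimeError("ALLOW_DEV_AUTH_BYPASS must not be enabled in PROD")
    if environment == "DEV" and flag and not env.get("DEV_BYPASS_SECRET"):
        # Layer 2: a flag without a secret is a misconfiguration.
        raise RuntimeError("DEV_BYPASS_SECRET is required when bypass is enabled")
    if environment == "LOCAL":
        return True  # enabled by default for local development
    return flag
```

Raising at import or startup time means a misconfigured revision never becomes healthy, so it never receives traffic.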
Why constant-time comparison is required
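An ordinary string comparison (==) can return as soon as it finds the first differing character, so the response time leaks how long the matching prefix is. An attacker can use that to recover the secret one character at a time (a timing attack). hmac.compare_digest is designed so that its running time does not depend on where the inputs differ.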
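To make the leak concrete, here is a small illustration that counts how many character comparisons an early-exit comparison performs. The helper naive_equal_steps is hypothetical and exists only to show the effect; it models the early-exit idea rather than CPython's actual implementation of ==.

```python
import hmac


def naive_equal_steps(a: str, b: str) -> int:
    """Count the character comparisons an early-exit comparison performs."""
    steps = 0
    for x, y in zip(a, b):
        steps += 1
        if x != y:
            break  # early exit: the work done depends on the matching prefix
    return steps


# A guess sharing a longer prefix with the secret costs more work,
# and that difference is what a timing attack measures.
first_guess = naive_equal_steps("secret", "xxxxxx")   # differs at position 0
closer_guess = naive_equal_steps("secret", "sexxxx")  # differs at position 2

# compare_digest gives the same answer without the prefix-dependent exit.
matches = hmac.compare_digest(b"secret", b"secret")
```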
Single Source of Truth
The flag controlling the bypass lives in a GitHub repo variable (vars.ALLOW_DEV_AUTH_BYPASS), and both the infra workflow and the deploy workflow read it.
When the infra workflow runs the platform Terraform, this value decides whether the Secret Manager resource is created. When the deploy workflow runs the service Terraform, it decides whether the environment variables are injected. One variable controls both paths, so there is no inconsistency like “platform created the secret but service did not inject it.”
Auto-Deploy Pipeline
A push to dev triggers deploy-dev, and a push to main triggers deploy-prod. The pipeline has four stages (test, build, migrate, deploy), followed by a health check.
Test
This stage runs the same lint, type check, and tests as the CI workflow that runs on PRs.
Why run them again in the deploy pipeline: once several PRs are merged into dev or main, their changes combine, and the merged state may fail tests that each PR passed on its own. The deploy pipeline test is the last line of defense.
Build
Build the Docker image, push it to Artifact Registry, and resolve the immutable digest.
sha-${GITHUB_SHA::7} serves as a human-readable tag, but the deploy references the immutable digest (sha256:...), not the tag. A tag can be overwritten; a digest cannot.
Migrate
Use Cloud SQL Proxy to connect to Cloud SQL and run alembic upgrade head.
One detail here: at runtime, DATABASE_URL connects to the Cloud SQL Auth Proxy sidecar through a Unix socket path (?host=/cloudsql/...), but in CI we connect to a locally started Cloud SQL Proxy over TCP (127.0.0.1:5432). So the URL has to be rewritten with sed.
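The pipeline does this rewrite with sed; the same transformation is sketched below in Python to show what changes. The function name and the exact shape of the input URL (an empty host with the socket directory in a host query parameter) are assumptions about how the runtime URL is written.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def socket_url_to_tcp(url: str, host: str = "127.0.0.1", port: int = 5432) -> str:
    """Rewrite a Unix-socket style database URL to a TCP host:port URL."""
    parts = urlsplit(url)
    # Keep "user:password@" untouched and replace only the (empty) host part.
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = f"{userinfo}{at}{host}:{port}"
    # Drop the host=/cloudsql/... parameter; keep any other query options.
    query = urlencode(
        [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "host"]
    )
    return urlunsplit((parts.scheme, netloc, parts.path, query, parts.fragment))
```

For a hypothetical input such as postgresql://app:pw@/appdb?host=/cloudsql/p:r:i, this produces postgresql://app:pw@127.0.0.1:5432/appdb.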
The migration runs before the deploy so that the schema is up to date when the new revision starts. If the migration fails, the deploy does not run.
Deploy
Use Terraform to update the Cloud Run service.
The image digest is passed into Terraform with -var, and Terraform then updates the Cloud Run revision. Applying a saved plan file (-out=tfplan) means apply executes exactly the changes that plan produced, which avoids surprises.
Health Check
After the deploy, wait until the new Cloud Run revision is up, then call /health.
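The wait-then-check step amounts to polling with retries. A minimal Python sketch of that idea follows; the function names, attempt count, and delay are assumptions, and the actual workflow step may be implemented differently. The fetch and sleep parameters exist only so the loop can be tested without a network.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except OSError:
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_healthy(
    base_url: str,
    attempts: int = 10,
    delay: float = 3.0,
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll <base_url>/health until it returns 200 or attempts run out."""
    url = base_url.rstrip("/") + "/health"
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return True
        if attempt < attempts:
            sleep(delay)
    return False
```

A caller can treat a False result as a hard failure for the Cloud Run URL and as a logged warning for a custom domain.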
Prod has one extra custom-domain health check, but a failure there is a warning rather than an error: DNS and certificate propagation can take longer and should not mark the whole deploy as failed.
Environment-specific behavior differences
| Item | Dev | Prod |
|---|---|---|
| Trigger | push to dev | push to main |
| cancel-in-progress | true | false |
| Branch validation | allow dev only | allow main only |
| Health check | Cloud Run URL only | Cloud Run URL + custom domain |
| Secret suffix | database-url-dev | database-url-prod |
| GitHub environment | dev | prod |
cancel-in-progress: true is reasonable on dev: with continuous pushes, only the latest run matters. Prod uses false because we do not want an in-progress deploy to be canceled halfway.
Pitfalls I hit
Infrastructure stack execution order
The Secret Manager resource is created by the platform Terraform, while env var injection is set up by the service Terraform. When adding the bypass secret, I ran the service TF deploy first, and Cloud Run returned “Permission denied on secret” because the secret did not exist yet.
Correct order: platform TF (create secret + IAM) → store secret value → service TF (deploy)
Resource created but still empty
The platform TF google_secret_manager_secret resource only creates the secret; it does not store a value. Cloud Run can only read a secret that has at least one version, so after the platform apply you still need to store a value manually (for example with gcloud secrets versions add).
This step is not automated, because the secret value should not appear in any repo.
Pipeline definition change missed the merge request
I added TF_VAR_allow_dev_auth_bypass in .github/workflows/deploy-dev.yml, but that change was not included in the PR. After the merge, the infra workflow ran without the TF variable value, and the deploy failed.
Lesson: workflow file changes and the code changes they serve should go in the same PR.
Lessons learned
- App code does not touch the GCP SDK: use env injection instead of fetching secrets at runtime. The domain/application layers stay fully decoupled from the cloud platform.
- Enforce the runtime contract with tests: constants plus a pattern scan, so CI blocks every violation.
- A dev bypass needs a secret gate: a flag alone is not enough; add a secret checked with constant-time comparison.
- Run tests again in the pipeline: a passing PR CI does not guarantee that the post-merge state also passes.
- Platform TF before service TF: secrets and IAM must exist first, then the deploy can succeed.
- Ship workflow changes together with the code: otherwise the workflow cannot read the new settings after the merge.