Kubernetes Hands-On: Stable Networking, Zero-Downtime Deploys, Autoscaling, and Troubleshooting

Lou Chang included in DevOps Kubernetes

2026-02-25

Operations learning notes

Following the previous post (Cluster, Pod, Deployment), this one covers stable network access, zero-downtime updates, restart-loop behavior, automatic replica scaling, and the troubleshooting workflow you reach for when things go wrong.

Prerequisite: a kind cluster already running a go-api Deployment (3 replicas).

Stable Network Entry Points

Why You Need a Stable Endpoint

Pod IPs are ephemeral — every rebuild gets a new one:

kubectl get pod go-api-xxx -o wide
# IP: 10.244.0.5

kubectl delete pod go-api-xxx
kubectl get pod go-api-yyy -o wide
# IP: 10.244.0.8 — changed

Right now we have 3 Pods, each with a different IP, and any of them could change at any moment. How do other workloads connect?

without Service:
  Pod 1: 10.244.0.5  ┐
  Pod 2: 10.244.0.6  ├── which IP? what if it changes?
  Pod 3: 10.244.0.7  ┘

with Service:
  Service: go-api (stable DNS + stable IP)
    ├── routes to Pod 1
    ├── routes to Pod 2
    └── routes to Pod 3

How the Endpoint Finds Its Backends

A Service finds its backends through a label selector, matching Pods by label:

# service.yaml
spec:
  selector:
    app: go-api        # find all Pods with this label

# deployment.yaml (Pod template)
template:
  metadata:
    labels:
      app: go-api      # matches the selector above

Labels are key-value tags attached to Pods. The Service uses them to decide “which Pods belong to me.”
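
The same label query can be run by hand to see which Pods a selector would pick up. A minimal sketch, assuming the `app=go-api` label from the manifests above; `pods_by_label` is a hypothetical helper name, not something kubectl provides.

```shell
# List the Pods that carry a given label, i.e. the candidates a Service
# with the same selector would consider as backends.
# pods_by_label is a hypothetical helper name.
pods_by_label() {
  kubectl get pods -l "$1" -o wide
}

# Usage (needs a running cluster):
#   pods_by_label app=go-api
```

If the list is empty, the Service selector and the Pod template labels probably do not match.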

Endpoint Definition File

apiVersion: v1
kind: Service
metadata:
  name: go-api
spec:
  type: ClusterIP
  selector:
    app: go-api
  ports:
    - port: 8080
      targetPort: 8080
      protocol: TCP

Three Exposure Categories

Type	Purpose	Access
ClusterIP	In-cluster access (default)	Reachable only from inside the cluster
NodePort	Expose via a port on the Node	`<NodeIP>:<NodePort>`
LoadBalancer	Cloud environments get an external IP	GKE/EKS auto-assigns an external IP

Backend List: The Addresses Behind the Endpoint

kubectl describe svc go-api

Focus on the Endpoints field:

Selector:   app=go-api
ClusterIP:  10.96.252.198
Endpoints:  10.244.0.5:8080, 10.244.0.6:8080, 10.244.0.7:8080

Endpoints is the live list of Pod IPs that currently match the label selector and are ready to receive traffic.

Delete a Pod, the Deployment rebuilds it, and Endpoints updates automatically:

before: 10.244.0.5, 10.244.0.6, 10.244.0.7
after:  10.244.0.8, 10.244.0.6, 10.244.0.7
        ^^^^^^^^^^^ new Pod

The Service’s ClusterIP 10.96.252.198 never changed. That is what “stable entry point” means.
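
To watch the backend list change without reading the whole describe output, you can query the EndpointSlice objects that Kubernetes maintains for each Service. A small sketch; `service_backends` is a hypothetical helper name, and it relies on the standard `kubernetes.io/service-name` label that the control plane sets on those slices.

```shell
# Show the EndpointSlices (backend addresses) that belong to a Service.
# service_backends is a hypothetical helper name.
service_backends() {
  kubectl get endpointslices -l "kubernetes.io/service-name=$1"
}

# Usage (needs a running cluster):
#   service_backends go-api
```

Run it before and after deleting a Pod: the addresses change while the Service's ClusterIP stays the same.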

Name Resolution Inside the Cluster

Other Pods inside the cluster can reach the Service by DNS name:

go-api.default.svc.cluster.local
  │      │       │      │
  │      │       │      └── cluster domain
  │      │       └── fixed suffix
  │      └── namespace
  └── service name

Within the same namespace you can shorten it to go-api:8080.
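
The naming scheme is mechanical enough to express as a one-line function. This is only an illustration: `svc_fqdn` is a hypothetical helper, the `staging` namespace is made up, and `cluster.local` is the usual default cluster domain, which a cluster operator can change.

```shell
# Build the fully qualified DNS name of a Service.
# Arguments: service name, namespace (default: "default"),
# cluster domain (default: "cluster.local").
# svc_fqdn is a hypothetical helper name.
svc_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "${2:-default}" "${3:-cluster.local}"
}

svc_fqdn go-api            # go-api.default.svc.cluster.local
svc_fqdn go-api staging    # go-api.staging.svc.cluster.local
```

From a Pod in another namespace, the short name go-api no longer resolves to this Service; use at least go-api.default, or the full name.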

Tunneling From Your Laptop to the Workload Network

ClusterIP is reachable only from inside the cluster. From your laptop, open a tunnel with port-forward:

your laptop                          K8s cluster
localhost:8080 ──── tunnel ────→ Service go-api:8080 ──→ Pod

kubectl port-forward svc/go-api 8080:8080

# in another terminal
curl http://localhost:8080/healthz

port-forward is for development and debugging only. Production uses LoadBalancer or Ingress instead.

Zero-Downtime Revision Deployment

Triggering a Revision

# build new image
docker build -t go-api:0.0.3 -f Dockerfile .
kind load docker-image go-api:0.0.3 --name devops-lab

# trigger update
kubectl set image deployment/go-api go-api=go-api:0.0.3

Watching the Transition

kubectl get pods --watch

old-pod-1   Running                          ← 3 old Pods running
new-pod-1   Pending → ContainerCreating → Running   ← new Pod 1 ready
old-pod-1   Terminating                      ← THEN old Pod 1 killed
new-pod-2   Pending → Running               ← new Pod 2 ready
old-pod-2   Terminating                      ← old Pod 2 killed
new-pod-3   Pending → Running               ← new Pod 3 ready
old-pod-3   Terminating                      ← old Pod 3 killed

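Key point: an old Pod is terminated only after its replacement is Ready, not merely Running. Readiness comes from the readiness probe; without one, a container counts as Ready as soon as it starts, so zero downtime in practice depends on a probe that reflects when the app can actually serve traffic.

This ordering is controlled by the Deployment’s strategy: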

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1        # at most 1 extra Pod during update
    maxUnavailable: 0  # no downtime allowed
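
The two settings translate into simple bounds on the Pod count during a rollout. A worked sketch for absolute values only (both fields also accept percentages, which this does not handle); `rollout_bounds` is a hypothetical helper name.

```shell
# Pod-count bounds during a rolling update, for absolute
# (non-percentage) maxSurge / maxUnavailable values.
# Arguments: replicas, maxSurge, maxUnavailable.
# rollout_bounds is a hypothetical helper name.
rollout_bounds() {
  replicas=$1
  max_surge=$2
  max_unavailable=$3
  echo "min_available=$((replicas - max_unavailable)) max_total=$((replicas + max_surge))"
}

rollout_bounds 3 1 0   # min_available=3 max_total=4
```

With 3 replicas, maxSurge: 1 and maxUnavailable: 0, there are never fewer than 3 available Pods and never more than 4 in total, which is the one-in, one-out pattern shown in the watch output above.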

Two Replica Sets Coexisting

kubectl get rs

go-api-65577fc4f9   3   3   3   2m     ← new ReplicaSet (0.0.3), 3 Pods
go-api-668dcc5dd    0   0   0   4d     ← old ReplicaSet (0.0.2), scaled to 0

The old ReplicaSet is not deleted — its Pod count scales down to 0. K8s keeps it on purpose so you can roll back.

Revision Trail and Reverting to an Earlier Version

# see revision history
kubectl rollout history deployment/go-api
# REVISION 1 ← go-api:0.0.2
# REVISION 2 ← go-api:0.0.3

# rollback to previous version
kubectl rollout undo deployment/go-api

# check — old ReplicaSet scales back up
kubectl get rs

rollout undo pulls the old ReplicaSet from 0 back to 3, and shrinks the new one from 3 to 0. It follows the same rolling-update process — still zero downtime.
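
With no arguments, rollout undo goes back one revision. To return to a specific entry from the history, kubectl also accepts `--to-revision`. A short sketch; `rollback_to` is a hypothetical helper name, and revision 1 is the go-api:0.0.2 entry from the history listing above.

```shell
# Roll a Deployment back to a specific revision, then wait until
# the rollout has finished.
# Arguments: deployment name, revision number.
# rollback_to is a hypothetical helper name.
rollback_to() {
  kubectl rollout undo "deployment/$1" --to-revision="$2" &&
    kubectl rollout status "deployment/$1"
}

# Usage (needs a running cluster):
#   rollback_to go-api 1
```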

Restart Loops: When a Workload Keeps Dying

What a Restart Loop Looks Like

When a container keeps crashing, kubelet does not keep restarting it immediately. It applies exponential backoff:

crash #1 → restart immediately
crash #2 → wait ~10s, then restart
crash #3 → wait ~20s, then restart
crash #4 → wait ~40s, then restart
...keeps doubling, up to 5 minutes max

kubelet is not giving up — it waits longer each time. If the program itself has a bug, restarting right away would just crash right away again, wasting resources.
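
The schedule above can be reproduced with a few lines of arithmetic. This is a simplified model of the numbers listed in this post (10s base, doubling, 5-minute cap), not kubelet's actual code; `backoff_delay` is a hypothetical helper name.

```shell
# Approximate wait before the restart that follows crash number N:
# 0s after the first crash, then 10s, 20s, 40s, ... capped at 300s.
# backoff_delay is a hypothetical helper name.
backoff_delay() {
  n=$1
  if [ "$n" -le 1 ]; then
    echo 0
    return
  fi
  delay=10
  i=2
  while [ "$i" -lt "$n" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 300 ]; then
    delay=300
  fi
  echo "$delay"
}

for n in 1 2 3 4 5 6 7 8; do
  echo "crash #$n -> wait $(backoff_delay "$n")s"
done
```

The cap is reached at the seventh crash (320s is clamped to 300s), so a Pod that has been failing for a while restarts only about once every five minutes.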

Observing the Restart Loop

kubectl get pods --watch

my-pod   Running                        ← container starts
my-pod   Error                          ← container crashed (exit code != 0)
my-pod   Running   1                    ← kubelet restarted it
my-pod   Error                          ← crashed again
my-pod   CrashLoopBackOff              ← kubelet: "crashing too often, waiting..."
my-pod   Running   2 (14s ago)          ← restarted after 14s delay
my-pod   Error                          ← crashed again
my-pod   CrashLoopBackOff              ← waiting even longer (27s)

What the Three States Mean

STATUS	Meaning
Error	Container just died
CrashLoopBackOff	Crashed repeatedly; kubelet is waiting out the backoff delay before the next restart
Running	Restart succeeded

What to Do When Your Workload Is Stuck in a Loop

# see the logs from the PREVIOUS crashed container
kubectl logs <pod-name> --previous

--previous is the key: it shows the logs of the previous container instance (the one that crashed) rather than the freshly restarted one, so you can see what the program printed just before it died.
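
When the logs are empty, the exit code of the crashed container is the next clue. The sketch below interprets a code using the common convention that values above 128 mean "killed by signal (code - 128)"; `explain_exit_code` is a hypothetical helper name.

```shell
# Translate a container exit code into a rough cause.
# Codes above 128 follow the convention 128 + signal number.
# explain_exit_code is a hypothetical helper name.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "application error (exit code $code)"
  fi
}

explain_exit_code 137   # killed by signal 9
explain_exit_code 1     # application error (exit code 1)
```

The code itself can be read with kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'. Signal 9 is SIGKILL, which on Kubernetes often points to the container exceeding its memory limit.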

Peeking Under the Hood

Control-Plane Components

kubectl get pods -n kube-system

etcd-control-plane               ← database, stores all cluster state
kube-apiserver-control-plane      ← front door, all requests go through here
kube-scheduler-control-plane      ← decides which Node runs each Pod
kube-controller-manager-control-plane  ← runs reconciliation loops
kube-proxy-xxxxx                  ← network rules (iptables/ipvs)
coredns-xxxxx (x2)               ← DNS for service discovery

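-n is short for --namespace. On a kind cluster, the control-plane components run as Pods in the kube-system namespace, alongside cluster add-ons such as kube-proxy and CoreDNS. kubelet is the exception: it runs on each node as a system service, not as a Pod.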

Event Records: Watching Components Coordinate

kubectl describe pod <name>

The Events section records the full lifecycle of a Pod:

Scheduled  → default-scheduler  → assigned to devops-lab-control-plane
Pulled     → kubelet             → image already present
Created    → kubelet             → container created
Started    → kubelet             → container started

Note: by default, Events are retained for only 1 hour. Pods older than that show no Events; only recently created or recently changed ones do.

Cluster-Wide Activity Log

kubectl get events --sort-by='.lastTimestamp'

Without specifying a Pod name, this lists events for all resources. You can see the ReplicaSet’s Pod-creation records:

Normal  SuccessfulCreate  replicaset/go-api-668dcc5dd  Created pod: go-api-668dcc5dd-h5ttf

This shows that the ReplicaSet, not the Deployment itself, creates the Pods.
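
On a busy cluster the full event list gets noisy, so it helps to narrow it with a field selector. A small sketch with two hypothetical helper names, `warning_events` and `events_for`; the field selectors themselves are standard kubectl options.

```shell
# Show only Warning events, oldest first.
# warning_events is a hypothetical helper name.
warning_events() {
  kubectl get events --field-selector type=Warning --sort-by='.lastTimestamp'
}

# Show events that refer to one named object (Pod, ReplicaSet, ...).
# events_for is a hypothetical helper name.
events_for() {
  kubectl get events --field-selector "involvedObject.name=$1" --sort-by='.lastTimestamp'
}

# Usage (needs a running cluster):
#   warning_events
#   events_for go-api-668dcc5dd
```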

Automatic Replica Scaling

Why You Need Autoscaling

Without an autoscaler, a Deployment’s replica count is static: set it to 3 and it stays at 3 until someone changes it. Real traffic fluctuates:

without HPA:
  replicas: 3 → always 3 Pods, even at 3am with zero traffic

with HPA:
  traffic high → scale up to 7 Pods
  traffic low  → scale down to 2 Pods

Prerequisite: Resource Usage Collector

Autoscaling needs to know each Pod’s CPU usage. The metrics-server collects that data:

HPA: "CPU usage is how much?"
 ↓
metrics-server: "let me ask kubelet on each Node"
 ↓
kubelet: "Pod A uses 30m CPU, Pod B uses 45m CPU"
 ↓
HPA: "over threshold, scale up"

Without metrics-server, autoscaling is blind.

Installing the Usage Collector (Local Cluster)

# install
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# kind uses self-signed certs, need to skip TLS verification
kubectl -n kube-system patch deployment metrics-server \
  --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'

# wait for ready
kubectl -n kube-system rollout status deployment/metrics-server

--kubelet-insecure-tls is needed here because kind’s kubelets serve self-signed certificates. Managed clusters (GKE/EKS) normally do not require it, and it should not be used in production.

What In-Place Patching Does

patch modifies a K8s resource in place, without rewriting the entire YAML. With --type='json', each operation names a path, and that path maps directly onto the YAML structure:

spec:                          # /spec
  template:                    # /spec/template
    spec:                      # /spec/template/spec
      containers:              # /spec/template/spec/containers
        - name: metrics-server # /spec/template/spec/containers/0
          args:                # /spec/template/spec/containers/0/args
            - --cert-dir=/tmp
            - --kubelet-insecure-tls  # ← /args/- means append here

How do you figure out the path? Inspect the structure first with -o yaml:

kubectl -n kube-system get deployment metrics-server -o yaml
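
Once you know a patch path, the same location can be read back with `-o jsonpath`, which is a quick way to confirm that a patch landed. The converter below is a rough sketch for simple paths only: `pointer_to_jsonpath` is a hypothetical helper name, and the append token `-` used in the patch has no jsonpath equivalent, so pass the path without it.

```shell
# Convert a simple JSON Patch path into a kubectl jsonpath expression:
#   /spec/template/spec/containers/0/args
#   -> {.spec.template.spec.containers[0].args}
# Numeric segments become array indexes; everything else becomes a field.
# pointer_to_jsonpath is a hypothetical helper name.
pointer_to_jsonpath() {
  printf '{%s}\n' "$(printf '%s' "$1" | sed -E 's#/([0-9]+)(/|$)#[\1]\2#g; s#/#.#g')"
}

pointer_to_jsonpath /spec/template/spec/containers/0/args
# {.spec.template.spec.containers[0].args}
```

For example, kubectl -n kube-system get deployment metrics-server -o jsonpath='{.spec.template.spec.containers[0].args}' prints the argument list, where --kubelet-insecure-tls should now appear.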

Verifying the Usage Collector

kubectl top pods

If you see CPU and memory numbers, metrics-server is working. Right after installation it can take a short while before the first metrics appear.

Autoscaler Definition File

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: go-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: go-api              # which Deployment to scale
  minReplicas: 2              # minimum Pods
  maxReplicas: 10             # maximum Pods
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50  # scale up when avg CPU > 50%

The HPA does not manage Pods directly. It changes the Deployment’s replica count, and the Deployment adjusts the number of Pods.

averageUtilization: 50 is measured against each Pod’s CPU request (resources.requests.cpu, 50m in this Deployment), so 50% = 25m. If average CPU usage across the Pods exceeds 25m, the HPA starts scaling up. If the containers have no CPU request, the HPA cannot compute a utilization at all.
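
How many replicas the HPA asks for follows from one formula: desired = ceil(current replicas × current utilization / target utilization), clamped to the min/max range. The sketch below implements just that; it deliberately ignores the HPA's tolerance band (10% by default) and the scale-down stabilization window, and `hpa_desired` is a hypothetical helper name.

```shell
# desired = ceil(current * utilization / target), clamped to [min, max].
# Arguments: current replicas, current utilization %, target utilization %,
# minReplicas, maxReplicas.
# hpa_desired is a hypothetical helper name.
hpa_desired() {
  current=$1
  util=$2
  target=$3
  min=$4
  max=$5
  # Integer ceiling division: (a + b - 1) / b
  desired=$(( (current * util + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then
    desired=$min
  fi
  if [ "$desired" -gt "$max" ]; then
    desired=$max
  fi
  echo "$desired"
}

hpa_desired 3 64 50 2 10   # 4  (3 * 64 / 50 = 3.84, rounded up)
hpa_desired 3 0 50 2 10    # 2  (idle, clamped to minReplicas)
```

With 3 replicas at 64% against a 50% target, the result is 4, which is the 3 to 4 step you see when watching the HPA under load.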

Watching Autoscaling in Action

# apply HPA
kubectl apply -f k8s/hpa.yaml

# watch HPA
kubectl get hpa --watch

NAME     TARGETS        MINPODS  MAXPODS  REPLICAS
go-api   cpu: 0%/50%    2        10       3          ← idle
go-api   cpu: 27%/50%   2        10       3          ← load starting
go-api   cpu: 64%/50%   2        10       3          ← over threshold!
go-api   cpu: 65%/50%   2        10       4          ← scaled up: 3 → 4
go-api   cpu: 42%/50%   2        10       4          ← 4 Pods share load, CPU drops

How to generate load:

kubectl run load-test --rm -it --image=busybox --restart=Never -- sh -c \
  "while true; do wget -q -O- http://go-api.default.svc.cluster.local:8080/healthz; done"

After you stop the load, the HPA does not shrink immediately: scale-down waits for the default 5-minute stabilization window, and then the replica count drops back to minReplicas: 2.

Troubleshooting Workflow

When you hit a problem, always follow this order — from broad to narrow:

kubectl get pods                          ← what's the STATUS?
 ↓
status tells you the next step:

  ImagePullBackOff
   → image name wrong or forgot kind load

  Pending
   → kubectl describe pod <name>
   → Events: usually resource or scheduling issue

  CrashLoopBackOff
   → kubectl logs <name> --previous
   → application error in the logs

  Running but not working
   → kubectl logs <name>
   → check application logic

  don't know where to start
   → kubectl get events --sort-by='.lastTimestamp'
   → see everything that happened recently
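
The decision tree above fits in a single case statement. This is only a memory aid that prints the next command to try; `next_step` is a hypothetical helper name, and suggesting kubectl describe pod for image-pull failures is an addition here (its Events section shows the exact pull error).

```shell
# Suggest the next command for a given Pod STATUS.
# Arguments: status, pod name.
# next_step is a hypothetical helper name.
next_step() {
  case "$1" in
    ImagePullBackOff|ErrImagePull|Pending)
      echo "kubectl describe pod $2" ;;
    CrashLoopBackOff)
      echo "kubectl logs $2 --previous" ;;
    Running)
      echo "kubectl logs $2" ;;
    *)
      echo "kubectl get events --sort-by=.lastTimestamp" ;;
  esac
}

next_step CrashLoopBackOff my-pod   # kubectl logs my-pod --previous
next_step Pending my-pod            # kubectl describe pod my-pod
```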

Frequently Used Diagnostic Commands

Situation	Command
Pod status looks wrong	`kubectl get pods`
Why won’t it start	`kubectl describe pod <name>`
Application error	`kubectl logs <name>`
Log from before a crash	`kubectl logs <name> --previous`
Live observation	`kubectl get pods --watch`
Cluster-wide activity	`kubectl get events --sort-by='.lastTimestamp'`
Shell into a container	`kubectl exec -it <name> -- sh`
Check backends behind a Service	`kubectl describe svc <name>`
View full YAML	`kubectl get <resource> <name> -o yaml`

kubectl exec lets you drop inside a container to look around. But if the image is built from scratch and has no shell, there is nothing to exec into. The missing shell is also one reason scratch images have a smaller attack surface.
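
For shell-less images, kubectl debug can attach a temporary container that brings its own tools. A sketch, assuming the cluster supports ephemeral containers and that the container inside the Pod is named go-api; `debug_shell` is a hypothetical helper name.

```shell
# Attach an ephemeral busybox container to a running Pod, targeting
# one of its containers so the two share a process namespace.
# Arguments: pod name, target container name.
# debug_shell is a hypothetical helper name.
debug_shell() {
  kubectl debug -it "$1" --image=busybox --target="$2"
}

# Usage (needs a running cluster):
#   debug_shell go-api-65577fc4f9-k9f9p go-api
```

The application image stays untouched; the debug container disappears when the Pod is replaced.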

Caveats for Reading Container Output

# wrong — "go-api" is Deployment name, not Pod name
kubectl logs go-api

# correct — use actual Pod name
kubectl logs go-api-65577fc4f9-k9f9p

# shortcut — pick a Pod from Deployment automatically
kubectl logs deployment/go-api

# follow mode (like tail -f)
kubectl logs deployment/go-api -f

Common Questions

How Do Local Multi-Container Runs Relate to Container Orchestration?

Completely different tools, solving problems at different stages:

Docker Compose → single machine, multiple containers (dev/test)
Kubernetes     → multiple machines, managed containers (production)

They are alternatives for different environments, not tools you combine for the same workload:

dev (your laptop):  docker-compose.yml → docker compose up
production (cloud): k8s/*.yaml         → kubectl apply

Kubernetes does not read docker-compose.yml. Once you are in K8s, you describe workloads with K8s’ own YAML.

Dividing Responsibility Across Tool Boundaries

Terraform        → infrastructure: EC2, VPC, EKS cluster, RDS
K8s YAML         → application deployment: Pods, Services, HPA
Docker Compose   → local dev only: quick multi-container setup

Layered relationship:

Terraform runs first: creates EKS cluster + RDS + VPC
 ↓
K8s YAML runs next:   deploys your app inside the cluster
 ↓
Docker Compose:        unrelated, only used on your laptop

Can Local Cluster Manifests Move Straight to Managed Clusters?

The YAML can move almost straight over; the cluster itself cannot. The Kubernetes API is standardized: whatever runs underneath (kind, GKE, EKS), the manifests you pass to kubectl apply use the same format.

Things that need adjustment:

Item	kind	GKE/EKS (production)
Image source	`kind load` from local machine	Container Registry (GCR/ECR)
Service type	ClusterIP + port-forward	LoadBalancer
Resource requests	Set casually	Tune to actual load
Ingress	Not needed	Domain, HTTPS
Secrets	Hard-coded or unused	Secret Manager (follow least privilege)

Core Deployment, ReplicaSet, and autoscaling logic does not change.

Does the Container Runtime Use the Desktop Engine Internally?

K8s talks to its container runtime through the CRI (Container Runtime Interface). On kind, and on most managed clusters, that runtime is containerd, not Docker.

your laptop
├── Docker daemon
│   └── devops-lab-control-plane ← this is a Docker container (kind)
│       └── K8s cluster
│           └── containerd       ← K8s uses this, not Docker
│               ├── Pod 1
│               ├── Pod 2
│               └── Pod 3

docker stop devops-lab-control-plane stops the entire kind cluster container, not an individual Pod. All operations inside K8s use kubectl.

In production (GKE/EKS) you will not use the docker command against the cluster at all; at most you use it to build and push images.

Teardown Instructions

Removing Your Application (Keep the Environment Running)

kubectl delete -f k8s/hpa.yaml
kubectl delete -f k8s/service.yaml
kubectl delete -f k8s/deployment.yaml

Order: autoscaler → Service → Deployment (from outer layer inward).

Destroying the Entire Environment

kind delete cluster --name devops-lab

One line, everything gone.

Pausing for Later

docker stop devops-lab-control-plane
# next time:
docker start devops-lab-control-plane

End-to-End Setup Walkthrough

1. write YAML (deployment.yaml, service.yaml, hpa.yaml)
2. build image → load into kind (or push to registry)
3. kubectl apply -f deployment.yaml   ← app runs
4. kubectl apply -f service.yaml      ← app is accessible
5. install metrics-server              ← once per cluster
6. kubectl apply -f hpa.yaml          ← auto-scaling enabled

Updating the code:

edit code → docker build new tag → kind load → kubectl set image → done

Key Takeaways

Commands You Will Use Daily

kubectl get pods
kubectl get pods -o wide
kubectl describe pod <name>
kubectl logs <name>
kubectl logs <name> --previous
kubectl apply -f <file>
kubectl delete -f <file>
kubectl get events --sort-by='.lastTimestamp'

Concepts Worth Committing to Memory

Deployment → ReplicaSet → Pod → Container
desired state vs actual state → reconciliation loop
Pod is ephemeral → IP changes → need Service
Service finds Pods by label selector
Rolling update: new Pod ready → then kill old Pod
CrashLoopBackOff: exponential backoff (10s → 20s → 40s → ... → 5min max)

Don’t Memorize: Look It Up

kubectl explain deployment.spec     # YAML format reference
kubectl get deploy <name> -o yaml   # see full YAML of any resource
kubectl api-resources               # all resource types and abbreviations
