Kubernetes is not a deployment tool

Posted on Mar 30, 2026

It is a distributed state machine.

You declare what you want and the system reconciles toward it, forever.

That is the whole thing.

The mental model for k8s

  1. read desired state
  2. read actual state
  3. if different, close gap
  4. goto 1

That is what every controller does in Kubernetes.

The desired state is read from the API server, which is essentially a database with an HTTP frontend and a pub/sub system (the watch mechanism). etcd is the actual storage.

Every component (scheduler, kubelet, kube-controller-manager) is just a controller running this loop against different resources.

You never tell Kubernetes how to do something. You declare what you want. The diff is the system’s problem.

When something is broken, ask: what controller owns this resource? What is it reconciling against? That almost always points to the bug.

API request lifecycle

  1. kubectl apply
  2. authn
  3. authz (RBAC)
  4. admission webhooks
  5. validation
  6. etcd

Once it hits etcd, every watching controller gets a watch event (ADDED or MODIFIED) and starts reconciling.

Key API concepts

  • resourceVersion
    • Optimistic concurrency. If yours doesn’t match, your write fails. Prevents races.
  • generation / observedGeneration
    • generation bumps on spec change. observedGeneration is what the controller has actually processed. If they differ, the controller is behind.
  • finalizers
    • String list on an object. Deletion is blocked until all finalizers are removed. How controllers do cleanup before an object dies.
  • ownerReferences
    • Garbage collection. When the owner is deleted, owned objects are deleted too. A ReplicaSet owns its Pods.
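
As a sketch, both finalizers and ownerReferences live under metadata. The names here are placeholders, and in real objects the uid must match the owner’s actual UID:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-abc12
  finalizers:
  - example.com/cleanup       # deletion is blocked until a controller removes this entry
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: web                 # hypothetical owning ReplicaSet
    uid: "<replicaset-uid>"   # placeholder; must be the owner's real UID
    controller: true
```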


Architecture

Two planes. Control plane tells the cluster what to do. Data plane does it.

  1. Control plane (usually 1-3 nodes)
    • kube-apiserver
    • etcd
    • kube-scheduler
    • kube-controller-manager
    • cloud-controller-manager
  2. Data plane (every worker node)
    • kubelet
    • kube-proxy
    • container runtime (e.g. containerd or CRI-O)
    • CNI plugin
| Component | What It Does |
| --- | --- |
| kube-apiserver | The only thing that talks to etcd. All state reads/writes go through here. Provides REST + watch API. |
| etcd | Distributed KV store using Raft consensus. Source of truth. Back it up. If it dies, the cluster is blind. |
| kube-scheduler | Watches for unscheduled pods. Runs filter plugins (can this node run this pod?) then score plugins (which node is best?). Writes the node name to the pod spec. Kubelet does the rest. |
| kube-controller-manager | Runs ~30 controllers in one process (ReplicaSet, Deployment, Job, etc.), all doing the reconcile loop. |
| cloud-controller-manager | Cloud-specific controllers: provision a load balancer for Services of type LoadBalancer, assign node IPs, etc. |
| kubelet | Runs on every node. Watches assigned pods via the API, calls the CRI to start/stop containers, reports node/pod status. |
| kube-proxy | Maintains iptables (or IPVS) rules on each node so Service IPs route to pod IPs. Not a traditional proxy. |
| container runtime | containerd or CRI-O. Pulls images, creates containers via the OCI spec. Kubelet talks to it via the CRI gRPC API. |

Misconception: kube-proxy does not proxy traffic. It programs the kernel’s netfilter rules so the kernel routes packets directly. The name is historical.


Workloads

The pod is the unit of scheduling. One or more containers, one IP, one set of volumes, one lifecycle. You almost never create pods directly; instead, you use a controller that manages pods for you.

Pod spec essentials:

spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests: # what scheduler uses for bin-packing
        cpu: 100m
        memory: 128Mi
      limits:   # kernel enforces this via cgroups
        cpu: 500m
        memory: 256Mi
    readinessProbe: # controls traffic routing
      httpGet: {path: /healthz, port: 8080}
    livenessProbe:  # controls restart
      httpGet: {path: /healthz, port: 8080}
  initContainers: # run to completion before app containers start
  - name: migrate
    image: migrate-tool:latest

Tip: Always set requests and limits. No requests = the scheduler schedules blind. No limits = one container can OOM the whole node. QoS class is determined by this: Guaranteed (requests == limits), Burstable (requests set, limits higher), BestEffort (neither set, evicted first).

| Controller | Use When | Key Fields |
| --- | --- | --- |
| Deployment | Stateless apps, rolling updates | replicas, strategy (RollingUpdate/Recreate), maxSurge, maxUnavailable |
| StatefulSet | Databases, anything needing stable identity or ordered rollout | serviceName (headless), volumeClaimTemplates, podManagementPolicy |
| DaemonSet | One pod per node (log agents, monitoring, CNI) | updateStrategy, node targeting via tolerations |
| Job | Run-to-completion tasks | completions, parallelism, backoffLimit, completionMode: Indexed |
| CronJob | Scheduled jobs | schedule (cron), concurrencyPolicy, startingDeadlineSeconds |
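
A minimal Deployment sketch showing the rolling-update fields from the table (names and labels are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most 1 extra pod during a rollout
      maxUnavailable: 0    # never drop below the desired replica count
  selector:
    matchLabels: {app: app}
  template:
    metadata:
      labels: {app: app}
    spec:
      containers:
      - name: app
        image: nginx:1.25
```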

Multi-container pod patterns:

  • sidecar
    • Extends the main container. Log shipper, service mesh proxy (Envoy). Runs alongside, same lifecycle.
  • init container
    • Runs before app containers. DB migrations, config fetching, wait-for-dependency. Must exit 0.
  • ephemeral container
    • Debug-only. Injected into a running pod with kubectl debug. Not in the spec by default.

Graceful shutdown: K8s sends SIGTERM, waits terminationGracePeriodSeconds (default 30s), then SIGKILL. Use a preStop hook to delay shutdown if your app needs it. Handle SIGTERM in your app.
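
A sketch of the shutdown knobs. The sleep 5 is a placeholder for whatever drain delay your app needs; preStop runs to completion before SIGTERM is sent:

```yaml
spec:
  terminationGracePeriodSeconds: 45   # default is 30
  containers:
  - name: app
    image: nginx:1.25
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]  # give endpoints time to drop the pod
```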


Configuration and Secrets

Two objects. ConfigMap for non-sensitive config. Secret for sensitive data (base64-encoded by default, which is encoding, not encryption).

| How to Consume | Works With | Note |
| --- | --- | --- |
| env var | ConfigMap, Secret | Snapshot at pod start — changes don’t propagate |
| envFrom | ConfigMap, Secret | Inject all keys as environment variables at once |
| volume mount | ConfigMap, Secret | Files in a directory — live updates (with ~60s delay) |
| projected volume | serviceAccountToken, ConfigMap, Secret, downwardAPI | Combine multiple sources into one mount path |
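
A pod-spec fragment combining these consumption modes (the object names app-config and db-creds are hypothetical):

```yaml
spec:
  containers:
  - name: app
    image: nginx:1.25
    envFrom:
    - configMapRef: {name: app-config}     # every key becomes an env var
    env:
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef: {name: db-creds, key: password}
    volumeMounts:
    - name: config
      mountPath: /etc/app
  volumes:
  - name: config
    configMap: {name: app-config}          # files update live (~60s delay)
```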

Downward API: expose pod metadata to the container

Pod name, namespace, labels, resource limits: available as env vars or files. No API call needed from inside the container.
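
A container-spec fragment as a sketch, assuming a container named app:

```yaml
env:
- name: POD_NAME
  valueFrom:
    fieldRef: {fieldPath: metadata.name}
- name: POD_NAMESPACE
  valueFrom:
    fieldRef: {fieldPath: metadata.namespace}
- name: MEM_LIMIT
  valueFrom:
    resourceFieldRef: {containerName: app, resource: limits.memory}
```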

Secrets are not encrypted by default: They’re base64. Anyone with etcd access or the right RBAC can read them. To actually encrypt: configure EncryptionConfiguration on the API server, or use an external secrets operator (ESO + Vault/AWS Secrets Manager). For production, use external secrets.


Storage and CSI

| Object | What It Is |
| --- | --- |
| PersistentVolume (PV) | A piece of storage in the cluster. Created by an admin or dynamically by a provisioner. Has a lifecycle independent of any pod. |
| PersistentVolumeClaim (PVC) | A request for storage by a user. “I need 10Gi, ReadWriteOnce.” Binds to a matching PV. |
| StorageClass | Defines a provisioner + parameters. PVC references it → provisioner creates the PV automatically (dynamic provisioning). |

Tip: use volumeBindingMode: WaitForFirstConsumer. The default, Immediate, provisions the PV in an arbitrary AZ before the pod is scheduled, which causes pod-PV AZ mismatches. WaitForFirstConsumer delays provisioning until the pod is scheduled, so the provisioner knows which AZ to use.
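
A StorageClass sketch with this mode set, assuming the AWS EBS CSI driver:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wffc
provisioner: ebs.csi.aws.com            # assumption: AWS EBS CSI driver installed
volumeBindingMode: WaitForFirstConsumer # provision only once the pod is scheduled
parameters:
  type: gp3
```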

Access modes

  • RWO
    • ReadWriteOnce. One node mounts read/write. Standard for block storage (EBS, PD).
  • ROX
    • ReadOnlyMany. Many nodes read. Rare.
  • RWX
    • ReadWriteMany. Many nodes read/write. Needs NFS or a distributed filesystem.
  • RWOP
    • ReadWriteOncePod. Only one pod (not just node) mounts read/write. Strongest isolation.

CSI

CSI (Container Storage Interface) is the plugin system. Every cloud volume, NFS driver, or Ceph plugin implements this spec. You almost never implement it, but knowing the pieces explains every stuck-volume bug you’ll ever hit.

  1. controller plugin
    • One instance (Deployment). Talks to the storage backend API. Creates/deletes/attaches/detaches volumes.
  2. node plugin
    • DaemonSet — runs on every node. Mounts/unmounts the volume on the node. Calls NodeStageVolume, NodePublishVolume.
  3. external-provisioner
    • Sidecar. Watches PVCs, calls CreateVolume on the driver when a new PVC appears.
  4. external-attacher
    • Sidecar. Watches VolumeAttachment objects, calls ControllerPublishVolume to attach to the node.
  5. external-resizer
    • Sidecar. Watches PVCs for capacity changes, triggers ControllerExpandVolume.
  6. external-snapshotter
    • Sidecar. Handles VolumeSnapshot CRDs → calls CreateSnapshot on the driver.

Debugging stuck volumes: Pod stuck in ContainerCreating with a volume error → check VolumeAttachment (external-attacher), then check the node plugin logs (NodeStageVolume/NodePublishVolume). PV stuck Released → check the reclaim policy and whether a finalizer is blocking deletion.


Networking: it is just routing

Three rules define the whole model:

  1. Every pod gets a unique IP across the entire cluster
  2. Any pod can reach any other pod by IP, no NAT
  3. Containers in the same pod share a network namespace (same IP, same port space)

How this is implemented is the CNI plugin’s problem (Calico, Cilium, Flannel, etc). Kubernetes doesn’t care; it just requires the above contract.

Services: stable endpoints in front of pods

Pods die and get new IPs. A Service is a stable virtual IP (ClusterIP) with a DNS name that always routes to healthy pods matching its selector.

| Type | Reachable From | Use Case |
| --- | --- | --- |
| ClusterIP | Inside cluster only | Default. Internal service-to-service communication. |
| NodePort | Outside cluster via node IP:port | Dev/testing. Exposes a port on every node. |
| LoadBalancer | Outside cluster via cloud LB IP | Production external traffic. Cloud-controller-manager provisions the LB. |
| ExternalName | Inside cluster | CNAME alias for an external DNS name. No proxying. |
| Headless (clusterIP: None) | Inside cluster | DNS returns pod IPs directly. Required for StatefulSets. No virtual IP. |
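
A minimal ClusterIP Service sketch (name and labels hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web           # routes to ready pods carrying this label
  ports:
  - port: 80           # the port on the ClusterIP
    targetPort: 8080   # the container port
  # type defaults to ClusterIP; set clusterIP: None for a headless Service
```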

How ClusterIP actually works: The ClusterIP is not assigned to any interface. kube-proxy programs iptables/IPVS rules on every node: packets destined to the ClusterIP get DNAT’d to a real pod IP by the kernel. There’s no userspace proxy involved.

DNS

CoreDNS runs in the cluster.

Every Service gets a DNS name: service-name.namespace.svc.cluster.local.

Pods in the same namespace can just use service-name.

StatefulSet pods get stable DNS per-pod: pod-name.service-name.namespace.svc.cluster.local.

Note: the ndots:5 gotcha. The default pod resolv.conf has ndots:5, so any name with fewer than 5 dots walks the search domains before being tried as an absolute name. api.github.com → tries api.github.com.default.svc.cluster.local first. Adds latency. For external FQDNs, add a trailing dot or tune dnsConfig.options.ndots.
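
A pod-spec fragment tuning ndots, as a sketch:

```yaml
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "1"   # external FQDNs resolve without the search-domain walk
```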

Ingress

L7 routing in front of Services.

Rules match on hostname and path, route to a Service backend.

An IngressController (nginx, Traefik, etc.) implements the actual proxying — Ingress is just config.
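
A minimal Ingress sketch, assuming an nginx ingress class and a hypothetical host and backend Service:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: nginx        # assumption: an nginx ingress controller is installed
  rules:
  - host: app.example.com        # hypothetical hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web            # hypothetical Service
            port: {number: 80}
```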

Gateway API is the replacement. It’s GA. More expressive, supports TCP/gRPC routes, splits roles between infra (Gateway) and app (HTTPRoute). Start here for new clusters.

NetworkPolicy

Firewall rules for pods.

Default: all traffic allowed. A NetworkPolicy with a podSelector creates a whitelist — only matching ingress/egress is allowed.

Policies are additive (union). Enforced by the CNI plugin (not kube-proxy).
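
A sketch: allow only frontend pods to reach api pods on port 8080 (labels hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-only
spec:
  podSelector:
    matchLabels: {app: api}       # the policy applies to these pods
  policyTypes: [Ingress]
  ingress:
  - from:
    - podSelector:
        matchLabels: {app: frontend}
    ports:
    - port: 8080                  # all other ingress to app=api is now denied
```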


Scheduling, autoscaling and reliability

Controlling where pods land

| Mechanism | Use |
| --- | --- |
| nodeSelector | Simple label match. Pod goes to nodes with this label. |
| nodeAffinity | Same idea but with required/preferred rules and richer expressions. |
| podAffinity / anti-affinity | Co-locate or spread pods relative to other pods. Anti-affinity for HA (avoid same node). |
| taints + tolerations | Taints repel pods from nodes. Tolerations allow pods to ignore a taint. Used to dedicate nodes (GPU, spot). |
| topology spread constraints | Spread pods evenly across zones/nodes. Preferred over anti-affinity for large-scale distribution. |
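
A pod-spec fragment sketching the last two mechanisms (the gpu taint key and labels are hypothetical):

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1                                  # zones may differ by at most one pod
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels: {app: web}
  tolerations:
  - key: "gpu"                                  # hypothetical taint on dedicated nodes
    operator: "Exists"
    effect: "NoSchedule"
```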

Autoscaling

  • HPA
    • Horizontal Pod Autoscaler. Scales replicas based on CPU, memory, or custom metrics. Checks every 15s.
  • VPA
    • Vertical Pod Autoscaler. Adjusts resource requests. In Auto mode it restarts pods. Use Off mode to just get recommendations first.
  • KEDA
    • Event-driven autoscaling. Scale on Kafka lag, queue depth, cron schedule, HTTP request rate. More flexible than HPA.
  • Cluster Autoscaler
    • Adds/removes nodes. Triggers when pods are unschedulable (scale up) or nodes are underutilized (scale down).
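
A minimal HPA sketch targeting a hypothetical Deployment:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                    # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # target 70% of requested CPU
```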

Reliability primitives

| Object | What It Does |
| --- | --- |
| PodDisruptionBudget | Limits voluntary disruptions. During a drain, K8s won’t evict pods if it would violate the PDB. Use this so drains don’t take down your whole service. |
| ResourceQuota | Per-namespace cap on total CPU, memory, and object count. Prevents one team from starving another. |
| LimitRange | Per-namespace default requests/limits and min/max enforcement. Prevents pods with no resources set from being scheduled. |
| PriorityClass | Higher priority pods preempt lower priority pods if the cluster is full. Use for critical workloads. |
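
A PDB sketch (selector hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2            # or use maxUnavailable instead
  selector:
    matchLabels: {app: web}  # drains won't evict below 2 ready pods
```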


Security

Authentication vs authorization

Authn: who are you? K8s supports x509 certs, bearer tokens, OIDC (the standard choice), and webhook token authentication. There is no internal user database — users exist only as credentials.

Authz: what can you do? RBAC is the answer.

RBAC — four objects

  1. Role
    • Permissions within a namespace. verbs (get, list, watch, create, update, patch, delete) on resources.
  2. ClusterRole
    • Same but cluster-scoped. Also used for non-namespaced resources (nodes, PVs).
  3. RoleBinding
    • Grants a Role to a subject (user, group, ServiceAccount) in a namespace.
  4. ClusterRoleBinding
    • Grants a ClusterRole cluster-wide.
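
A sketch granting a hypothetical CI ServiceAccount read access to pods in one namespace:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: dev
rules:
- apiGroups: [""]                 # "" is the core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: dev
subjects:
- kind: ServiceAccount
  name: ci-bot                    # hypothetical ServiceAccount
  namespace: dev
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```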

ServiceAccount

Identity for pods. A pod runs as a ServiceAccount — its token is a projected volume mounted at /var/run/secrets/kubernetes.io/serviceaccount/token. This token is used to authenticate against the API server. IRSA (AWS) and Workload Identity (GCP) use this mechanism to grant pods cloud IAM permissions.

Admission — the last gate before etcd

After authz, admission webhooks can mutate or reject requests. Order matters. The built-in controllers run first (NamespaceLifecycle, LimitRanger, ResourceQuota…), then your webhooks.

  • MutatingAdmissionWebhook
    • Can modify the object. Used by service mesh injectors (inject Envoy sidecar), secret injectors, defaulters.
  • ValidatingAdmissionWebhook
    • Can only allow or reject. Used for policy enforcement.
  • ValidatingAdmissionPolicy
    • CEL-based in-tree validation. No webhook required. GA in 1.30. Prefer this over webhooks for simple rules.
  • PodSecurity admission
    • Enforces Pod Security Standards (privileged / baseline / restricted) per namespace. Replaced PodSecurityPolicy.

Runtime security essentials

Set on securityContext: runAsNonRoot: true, readOnlyRootFilesystem: true, allowPrivilegeEscalation: false, drop all Linux capabilities then add only what you need.

Use seccompProfile: RuntimeDefault as a baseline.
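
Putting those settings together, as a sketch:

```yaml
spec:
  securityContext:                      # pod-level defaults
    runAsNonRoot: true
    seccompProfile: {type: RuntimeDefault}
  containers:
  - name: app
    image: nginx:1.25
    securityContext:                    # container-level hardening
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]       # only if you must bind ports < 1024
```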


Extensibility and the operator pattern

Kubernetes is designed to be extended.

The mechanism is: add your own API types (CRDs), then write a controller that reconciles them. That’s an operator.

CRD — Custom Resource Definition

Register a new API type (e.g. Postgres in group databases.example.com).

From that point it behaves like any built-in type — you can kubectl apply it, watch it, RBAC it.

The CRD defines the schema (validated via OpenAPI/CEL).
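
A trimmed CRD sketch for the Postgres example (the fields under spec are illustrative):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: postgresclusters.databases.example.com  # must be <plural>.<group>
spec:
  group: databases.example.com
  scope: Namespaced
  names:
    kind: PostgresCluster
    plural: postgresclusters
    singular: postgrescluster
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              replicas: {type: integer, minimum: 1}
              version: {type: string}
```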

The operator pattern

operator = CRD + controller

You define what a Postgres cluster looks like (the CRD).

The controller does the rest: when it sees a Postgres object, it reconciles the actual StatefulSets, Services, Secrets, and backups toward what the spec says.

The controller uses an informer cache (local copy of API objects, kept in sync via watch) so it never hammers the API server.

Work is queued in a rate-limited queue. With leader election, only one replica of the controller acts at a time.

Server-side apply

Instead of client-side three-way merge, the API server tracks field ownership.

Two controllers can manage the same object if they own different fields. Conflicts are explicit.

Use this for operators — it prevents the “last-write-wins” problem.


Observability

Metrics: three layers

| Tool | What It Gives You |
| --- | --- |
| metrics-server | CPU/memory per pod/node in real time. Powers HPA and kubectl top. In-memory only — no history. |
| kube-state-metrics | Object state metrics: deployment replicas, pod phase, job completions. Not resource usage. |
| node-exporter | Host-level metrics: disk, network, filesystem, CPU steal. Typically runs as a DaemonSet. |

Prometheus scrapes all three. Prometheus Operator makes this declarative with ServiceMonitor and PodMonitor CRDs.

Logs

kubectl logs reads from the container runtime’s log files on the node; --previous gets the logs of the last crashed container.

For aggregation: run Fluent Bit as a DaemonSet, ship to Loki/Elastic/CloudWatch. Don’t log to files inside the container — log to stdout/stderr.

Events

The most underused debugging tool.

kubectl describe pod foo shows Events. They have a reason, a message, a count, and they reference the involved object.

Everything that happened to an object is there. They expire after ~1 hour by default — if you want history, ship them to a persistent store.

Debugging checklist:

  1. kubectl describe pod — events tell you what happened
  2. kubectl logs --previous — last crash
  3. kubectl debug -it --image=busybox — ephemeral container in the pod’s network namespace
  4. Check controller logs (deployment controller, etc.) if the pod never got created


Cluster operations

Node lifecycle

cordon -> marks node unschedulable (no new pods)
drain -> evicts all pods (respects PDBs)
uncordon -> node back in rotation

etcd: backup or cry

etcd is the cluster. If it’s gone, so is everything. Snapshot: etcdctl snapshot save.

Restore: etcdctl snapshot restore then reconfigure the API server to point at the restored data dir.

Test your restore procedure before you need it.

Package management

  • Helm
    • Templated YAML packaged as charts. Values override defaults. Hooks for lifecycle. The standard for distributing third-party software.
  • kustomize
    • Overlay patches on base YAML. No templating — pure structural patching. Built into kubectl. Better for your own app configs.
  • GitOps (ArgoCD/Flux)
    • Git is the source of truth. Controller syncs cluster to whatever’s in the repo. Drift detection, auto-sync, rollback by reverting git commit.

Version skew policy: kubelet can be up to three minor versions behind the API server (raised from two in Kubernetes 1.28). kubectl can be within one minor version of the API server. During upgrades: upgrade the control plane first, then the nodes. Never skip minor versions.


Node and runtime internals

This section explains the Linux mechanisms behind production problems that look like Kubernetes bugs but aren’t. Useful to know before going on-call.

You often hear “Kubernetes is just Linux”, and that is essentially correct: Kubernetes relies on the Linux kernel for most of its core functionality.

cgroups — how limits are actually enforced

CPU limit → cpu.cfs_quota_us / cpu.cfs_period_us.

If your container uses its entire quota in the first part of a period, it’s throttled for the rest — even if the node has idle CPUs.

This is the source of latency spikes on CPU-limited pods with no visible CPU saturation.

Memory limit → memory.limit_in_bytes (cgroups v1) or memory.max (cgroups v2).

Exceed it → OOM killed by the kernel. No warning. OOMKilled exit code in pod status.

cgroups v1 vs v2: cgroups v2 (unified hierarchy) is default on modern kernels (>= kernel 5.8, most distros now). Memory accounting is more accurate. CPU throttling behavior differs slightly. Know which one your nodes are running.

Linux namespaces — isolation per pod

Each pod gets its own: net (network stack, IP), pid (process tree), mnt (filesystem), uts (hostname), ipc.

Containers in the same pod share the net and ipc namespaces — they can talk on localhost and see each other’s processes if shareProcessNamespace: true.

Device plugins — GPUs and other hardware

Hardware that isn’t CPU/memory is exposed via device plugins.

Plugin registers with kubelet, advertises capacity (e.g. nvidia.com/gpu: 4). Pod requests it in resources. Kubelet allocates it. The plugin handles the actual device assignment to the container.
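
A pod-spec fragment requesting a GPU, as a sketch (image name hypothetical):

```yaml
spec:
  containers:
  - name: train
    image: cuda-app:latest      # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1       # extended resources go under limits; no fractions
```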

Dynamic Resource Allocation (DRA) is the next-gen version (GA as of Kubernetes 1.34): more flexible, structured parameters, not just counts.

Kubelet eviction

When a node is under memory/disk pressure, kubelet evicts pods.

Order: BestEffort first, then Burstable (those exceeding requests), then Guaranteed. This is why you always set resource requests: it determines your eviction priority.
