jpnt blog / Journal / Kubernetes is not a deployment tool

Kubernetes is not a deployment tool

#infrastructure #programming

It is a distributed state machine.

You declare what you want and the system reconciles toward it, forever.

That is the whole thing.

The mental model for k8s

  1. read desired state
  2. read actual state
  3. if different, close gap
  4. goto 1

That is what every controller does in Kubernetes.

The desired state is read from the API server, which is just a database with an HTTP frontend and a pub/sub system (the watch mechanism). etcd is the actual storage.

Every component (scheduler, kubelet, kube-controller-manager) is just a controller running this loop against different resources.

You never tell Kubernetes how to do something. You declare what you want. The diff is the system's problem.

When something is broken, ask: which controller owns this resource? What is it reconciling against? That always points to the bug.

API request lifecycle

  1. kubectl apply
  2. authn
  3. authz (RBAC)
  4. admission webhooks
  5. validation
  6. etcd

Once it hits etcd, every watching controller gets a MODIFIED event and starts reconciling.


Architecture

Two planes. Control plane tells the cluster what to do. Data plane does it.

  1. Control plane (usually 1-3 nodes)
    • kube-apiserver
    • etcd
    • kube-scheduler
    • kube-controller-manager
    • cloud-controller-manager
  2. Data plane (every worker node)
    • kubelet
    • kube-proxy
    • container runtime (containerd, CRI-O, etc.)
    • CNI plugin
| Component | What It Does |
| --- | --- |
| kube-apiserver | The only thing that talks to etcd. All state reads/writes go through here. Provides REST + watch API. |
| etcd | Distributed KV store using Raft consensus. Source of truth. Back it up. If it dies, the cluster is blind. |
| kube-scheduler | Watches for unscheduled pods. Runs filter plugins (can this node run this pod?) then score plugins (which node is best?). Writes the node name to the pod spec. Kubelet does the rest. |
| kube-controller-manager | Runs ~30 controllers in one process (ReplicaSet, Deployment, Job, etc.), all doing the reconcile loop. |
| cloud-controller-manager | Cloud-specific controllers: provision a load balancer for Services of type LoadBalancer, assign node IPs, etc. |
| kubelet | Runs on every node. Watches assigned pods via API, calls CRI to start/stop containers, reports node/pod status. |
| kube-proxy | Maintains iptables (or IPVS) rules on each node so Service IPs route to pod IPs. Not a traditional proxy. |
| container runtime | containerd or CRI-O. Pulls images, creates containers via OCI spec. Kubelet talks to it via the CRI gRPC API. |

Misconception: kube-proxy does not proxy traffic. It programs the kernel's netfilter rules so the kernel routes packets directly. The name is historical.


Workloads

The pod is the unit of scheduling. One or more containers, one IP, one set of volumes, one lifecycle. You almost never create pods directly; instead, you use a controller that manages pods for you.

Pod spec essentials:

spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests: # what scheduler uses for bin-packing
        cpu: 100m
        memory: 128Mi
      limits:   # kernel enforces this via cgroups
        cpu: 500m
        memory: 256Mi
    readinessProbe: # controls traffic routing
      httpGet: {path: /healthz, port: 8080}
    livenessProbe:  # controls restart
      httpGet: {path: /healthz, port: 8080}
  initContainers: # run to completion before app containers start
  - name: migrate
    image: migrate-tool:latest

Tip: Always set requests and limits. No requests = the scheduler schedules blind. No limits = one container can OOM the whole node. QoS class is determined by this: Guaranteed (requests == limits), Burstable (requests set, limits higher or unset), BestEffort (neither set, evicted first).

| Controller | Use When | Key Fields |
| --- | --- | --- |
| Deployment | Stateless apps, rolling updates | replicas, strategy (RollingUpdate/Recreate), maxSurge, maxUnavailable |
| StatefulSet | Databases, anything needing stable identity or ordered rollouts | serviceName (headless), volumeClaimTemplates, podManagementPolicy |
| DaemonSet | One pod per node (log agents, monitoring, CNI) | updateStrategy, node targeting via tolerations |
| Job | Run-to-completion tasks | completions, parallelism, backoffLimit, completionMode: Indexed |
| CronJob | Scheduled jobs | schedule (cron), concurrencyPolicy, startingDeadlineSeconds |
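As a sketch, a minimal Deployment tying the rolling-update fields together (names and images are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels: {app: web}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most 1 extra pod during rollout
      maxUnavailable: 0  # never drop below the desired replica count
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
      - name: app
        image: nginx:1.25
```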

Multi-container pod patterns: sidecar (a helper container running alongside the app, e.g. a log shipper or proxy), ambassador (a local proxy to an external service), adapter (reshapes the app's output for the rest of the cluster).

Graceful shutdown: K8s sends SIGTERM, waits terminationGracePeriodSeconds (default 30s), then SIGKILL. Use a preStop hook to delay shutdown if your app needs it. Handle SIGTERM in your app.
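A minimal shutdown sketch: the preStop sleep delays SIGTERM so endpoint removal propagates before the app stops accepting connections (the 10s value is illustrative):

```yaml
spec:
  terminationGracePeriodSeconds: 60  # default is 30
  containers:
  - name: app
    image: nginx:1.25
    lifecycle:
      preStop:
        exec:
          # give kube-proxy/endpoints time to drop this pod before SIGTERM
          command: ["sleep", "10"]
```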


Configuration and Secrets

Two objects. ConfigMap for non-sensitive config. Secret for sensitive data (base64-encoded by default, which means it is not encrypted).

| How to Consume | Works With | Note |
| --- | --- | --- |
| env var | ConfigMap, Secret | Snapshot at pod start — changes don’t propagate |
| envFrom | ConfigMap, Secret | Inject all keys as environment variables at once |
| volume mount | ConfigMap, Secret | Files in a directory — live updates (with ~60s delay) |
| projected volume | serviceAccountToken, ConfigMap, Secret, downwardAPI | Combine multiple sources into one mount path |
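The consumption mechanisms side by side, as a sketch (the `app-config` / `app-secrets` names are placeholders):

```yaml
spec:
  containers:
  - name: app
    image: nginx:1.25
    env:
    - name: LOG_LEVEL                     # single key, snapshot at pod start
      valueFrom:
        configMapKeyRef: {name: app-config, key: log-level}
    envFrom:
    - secretRef: {name: app-secrets}      # all keys become env vars
    volumeMounts:
    - name: config
      mountPath: /etc/app                 # files, updated live (~60s delay)
  volumes:
  - name: config
    configMap: {name: app-config}
```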

Downward API: expose pod metadata to the container

Pod name, namespace, labels, resource limits: available as env vars or files. No API call needed from inside the container.
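For example, exposing pod metadata as environment variables via fieldRef/resourceFieldRef:

```yaml
spec:
  containers:
  - name: app
    image: nginx:1.25
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef: {fieldPath: metadata.name}
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef: {fieldPath: metadata.namespace}
    - name: MEM_LIMIT
      valueFrom:
        resourceFieldRef: {containerName: app, resource: limits.memory}
```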

Secrets are not encrypted by default: They're base64. Anyone with etcd access or the right RBAC can read them. To actually encrypt: configure EncryptionConfiguration on the API server, or use an external secrets operator (ESO + Vault/AWS Secrets Manager). For production, use external secrets.
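An EncryptionConfiguration sketch for encrypting Secrets at rest (passed to kube-apiserver via --encryption-provider-config; the key name and value are placeholders):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: ["secrets"]
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded 32-byte key>
  - identity: {}   # fallback so pre-existing plaintext secrets stay readable
```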


Storage and CSI

| Object | What It Is |
| --- | --- |
| PersistentVolume (PV) | A piece of storage in the cluster. Created by an admin or dynamically by a provisioner. Has a lifecycle independent of any pod. |
| PersistentVolumeClaim (PVC) | A request for storage by a user. "I need 10Gi, ReadWriteOnce." Binds to a matching PV. |
| StorageClass | Defines a provisioner + parameters. PVC references it → provisioner creates the PV automatically (dynamic provisioning). |

Tip: volumeBindingMode: WaitForFirstConsumer. The default is Immediate: the PV is provisioned in a random AZ before the pod is scheduled, which causes pod-PV AZ mismatches. Use WaitForFirstConsumer so provisioning waits until the pod is scheduled and the AZ is known.
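A StorageClass sketch with WaitForFirstConsumer set (assumes the AWS EBS CSI driver; provisioner and parameters vary by backend):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wait
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer  # provision after the pod is scheduled
```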

Access modes

ReadWriteOnce (RWO): mounted read-write by one node. ReadOnlyMany (ROX): read-only by many nodes. ReadWriteMany (RWX): read-write by many nodes. ReadWriteOncePod (RWOP): read-write by a single pod.

CSI

CSI (Container Storage Interface) is the plugin system. Every cloud volume, NFS driver, or Ceph plugin implements this spec. You almost never implement it, but knowing the pieces explains every stuck-volume bug you'll ever hit.

  1. controller plugin
    • One instance (Deployment). Talks to the storage backend API. Creates/deletes/attaches/detaches volumes.
  2. node plugin
    • DaemonSet — runs on every node. Mounts/unmounts the volume on the node. Calls NodeStageVolume, NodePublishVolume.
  3. external-provisioner
    • Sidecar. Watches PVCs, calls CreateVolume on the driver when a new PVC appears.
  4. external-attacher
    • Sidecar. Watches VolumeAttachment objects, calls ControllerPublishVolume to attach to the node.
  5. external-resizer
    • Sidecar. Watches PVCs for capacity changes, triggers ControllerExpandVolume.
  6. external-snapshotter
    • Sidecar. Handles VolumeSnapshot CRDs → calls CreateSnapshot on the driver.

Debugging stuck volumes: Pod stuck in ContainerCreating with a volume error → check VolumeAttachment (external-attacher), then check the node plugin logs (NodeStageVolume/NodePublishVolume). PV stuck Released → check the reclaim policy and whether a finalizer is blocking deletion.


Networking: it is just routing

Three rules define the whole model:

  1. Every pod gets a unique IP across the entire cluster
  2. Any pod can reach any other pod by IP, no NAT
  3. Containers in the same pod share a network namespace (same IP, same port space)

How this is implemented is the CNI plugin's problem (Calico, Cilium, Flannel, etc). Kubernetes doesn't care; it just requires the above contract.

Services: stable endpoints in front of pods

Pods die and get new IPs. A Service is a stable virtual IP (ClusterIP) with a DNS name that always routes to healthy pods matching its selector.

| Type | Reachable From | Use Case |
| --- | --- | --- |
| ClusterIP | Inside cluster only | Default. Internal service-to-service communication. |
| NodePort | Outside cluster via node IP:port | Dev/testing. Exposes a port on every node. |
| LoadBalancer | Outside cluster via cloud LB IP | Production external traffic. Cloud-controller-manager provisions the LB. |
| ExternalName | Inside cluster | CNAME alias for an external DNS name. No proxying. |
| Headless (clusterIP: None) | Inside cluster | DNS returns pod IPs directly. Required for StatefulSets. No virtual IP. |
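A ClusterIP and a headless Service side by side, as a sketch (selectors and ports are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector: {app: web}
  ports:
  - port: 80          # virtual ClusterIP port
    targetPort: 8080  # container port
---
apiVersion: v1
kind: Service
metadata:
  name: db-headless
spec:
  clusterIP: None     # headless: DNS returns pod IPs directly
  selector: {app: db}
  ports:
  - port: 5432
```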

How ClusterIP actually works: The ClusterIP is not assigned to any interface. kube-proxy programs iptables/IPVS rules on every node: packets destined to the ClusterIP get DNAT'd to a real pod IP by the kernel. There's no userspace proxy involved.

DNS

CoreDNS runs in the cluster.

Every Service gets a DNS name: service-name.namespace.svc.cluster.local.

Pods in the same namespace can just use service-name.

StatefulSet pods get stable DNS per-pod: pod-name.service-name.namespace.svc.cluster.local.

Note: ndots:5 gotcha: Default resolv.conf has ndots:5. Any name with fewer than 5 dots triggers a search domain walk before going to the root. api.github.com → tries api.github.com.default.svc.cluster.local first. Adds latency. For external FQDNs, add a trailing dot or tune dnsConfig.options.ndots.
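Tuning it down looks like this (a per-pod sketch; pick the ndots value that fits your naming):

```yaml
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"   # names with >= 2 dots skip the search-domain walk
```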

Ingress

L7 routing in front of Services.

Rules match on hostname and path, route to a Service backend.

An IngressController (nginx, Traefik, etc.) implements the actual proxying — Ingress is just config.
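A host/path rule sketch (assumes an nginx ingress class is installed; hostnames and service names are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: nginx
  rules:
  - host: example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api
            port: {number: 8080}
```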

Gateway API is the replacement. It's GA. More expressive, supports TCP/gRPC routes, splits roles between infra (Gateway) and app (HTTPRoute). Start here for new clusters.

NetworkPolicy

Firewall rules for pods.

Default: all traffic allowed. A NetworkPolicy with a podSelector creates a whitelist — only matching ingress/egress is allowed.

Policies are additive (union). Enforced by the CNI plugin (not kube-proxy).
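The standard starting point is a default-deny policy plus explicit allows, sketched here (label names are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}          # empty selector = all pods in the namespace
  policyTypes: ["Ingress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
spec:
  podSelector:
    matchLabels: {app: api}
  ingress:
  - from:
    - podSelector:
        matchLabels: {app: frontend}
```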


Scheduling, autoscaling and reliability

Controlling where pods land

| Mechanism | Use |
| --- | --- |
| nodeSelector | Simple label match. Pod goes to nodes with this label. |
| nodeAffinity | Same idea but with required/preferred rules and richer expressions. |
| podAffinity / anti-affinity | Co-locate or spread pods relative to other pods. Anti-affinity for HA (avoid same node). |
| taints + tolerations | Taints repel pods from nodes. Tolerations allow pods to ignore a taint. Used to dedicate nodes (GPU, spot). |
| topology spread constraints | Spread pods evenly across zones/nodes. Preferred over anti-affinity for large-scale distribution. |
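A sketch combining zone spreading with a toleration for dedicated nodes (the `gpu` taint key and labels are hypothetical):

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1                              # zones differ by at most 1 pod
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels: {app: web}
  tolerations:
  - key: "gpu"                              # allow scheduling onto tainted GPU nodes
    operator: "Exists"
    effect: "NoSchedule"
```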

Autoscaling

HPA scales pod replicas on metrics (CPU/memory via metrics-server, or custom metrics). VPA adjusts requests/limits instead of replica count. Cluster Autoscaler (or Karpenter) adds and removes nodes when pods are unschedulable.

Reliability primitives

| Object | What It Does |
| --- | --- |
| PodDisruptionBudget | Limits voluntary disruptions. During a drain, K8s won't evict pods if it would violate the PDB. Use this so drains don't take down your whole service. |
| ResourceQuota | Per-namespace cap on total CPU, memory, and object count. Prevents one team from starving another. |
| LimitRange | Per-namespace default requests/limits and min/max enforcement. Prevents pods with no resources set from being scheduled. |
| PriorityClass | Higher priority pods preempt lower priority pods if the cluster is full. Use for critical workloads. |
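A PDB sketch (selector is a placeholder; choose minAvailable or maxUnavailable, not both):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2        # drains stall rather than go below 2 ready pods
  selector:
    matchLabels: {app: web}
```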


Security

Authentication vs authorization

Authn: who are you? K8s supports x509 certs, bearer tokens, OIDC (standard), webhook. No internal user database — users exist only in credentials.

Authz: what can you do? RBAC is the answer.

RBAC — four objects

  1. Role
    • Permissions within a namespace. verbs (get, list, watch, create, update, patch, delete) on resources.
  2. ClusterRole
    • Same but cluster-scoped. Also used for non-namespaced resources (nodes, PVs).
  3. RoleBinding
    • Grants a Role to a subject (user, group, ServiceAccount) in a namespace.
  4. ClusterRoleBinding
    • Grants a ClusterRole cluster-wide.
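A read-only Role bound to a ServiceAccount, as a sketch (the `dev` namespace and `ci-bot` account are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: dev
rules:
- apiGroups: [""]            # "" = core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: dev
subjects:
- kind: ServiceAccount
  name: ci-bot
  namespace: dev
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```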

ServiceAccount

Identity for pods. A pod runs as a ServiceAccount — its token is a projected volume mounted at /var/run/secrets/kubernetes.io/serviceaccount/token. This token is used to authenticate against the API server. IRSA (AWS) and Workload Identity (GCP) use this mechanism to grant pods cloud IAM permissions.

Admission — the last gate before etcd

After authz, admission webhooks can mutate or reject requests. Order matters. The built-in controllers run first (NamespaceLifecycle, LimitRanger, ResourceQuota…), then your webhooks.

Runtime security essentials

Set on securityContext: runAsNonRoot: true, readOnlyRootFilesystem: true, allowPrivilegeEscalation: false, drop all Linux capabilities then add only what you need.

Use seccompProfile: RuntimeDefault as a baseline.
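Putting the baseline together in one pod spec sketch (the image is a placeholder and must actually run as non-root; add back only the capabilities you need):

```yaml
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: my-app:1.0
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]  # only if binding a port below 1024
```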


Extensibility and the operator pattern

Kubernetes is designed to be extended.

The mechanism is: add your own API types (CRDs), then write a controller that reconciles them. That's an operator.

CRD — Custom Resource Definition

Register a new API type (e.g. Postgres in group databases.example.com).

From that point it behaves like any built-in type — you can kubectl apply it, watch it, RBAC it.

The CRD defines the schema (validated via OpenAPI/CEL).
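A trimmed CRD sketch for the Postgres example (the group and fields are from the running example; the schema is illustrative):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: postgresqls.databases.example.com  # must be <plural>.<group>
spec:
  group: databases.example.com
  names:
    kind: Postgres
    plural: postgresqls
    singular: postgresql
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              replicas: {type: integer, minimum: 1}
              version:  {type: string}
```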

The operator pattern

operator = CRD + controller

You define what a Postgres cluster looks like (the CRD).

The controller does the rest: when it sees a Postgres object, it reconciles the actual StatefulSets, Services, Secrets, and backups toward what the spec says.

The controller uses an informer cache (local copy of API objects, kept in sync via watch) so it never hammers the API server.

Work is queued in a rate-limited queue. On leader election, only one replica of the controller acts at a time.

Server-side apply

Instead of client-side three-way merge, the API server tracks field ownership.

Two controllers can manage the same object if they own different fields. Conflicts are explicit.

Use this for operators — it prevents the "last-write-wins" problem.


Observability

Metrics: three layers

| Tool | What It Gives You |
| --- | --- |
| metrics-server | CPU/memory per pod/node in real time. Powers HPA and kubectl top. In-memory only — no history. |
| kube-state-metrics | Object state metrics: deployment replicas, pod phase, job completions. Not resource usage. |
| node-exporter | Host-level metrics: disk, network, filesystem, CPU steal. Typically runs as a DaemonSet. |

Prometheus scrapes all three. Prometheus Operator makes this declarative with ServiceMonitor and PodMonitor CRDs.
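A ServiceMonitor sketch (assumes the Prometheus Operator CRDs are installed; label and port names are placeholders):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web
spec:
  selector:
    matchLabels: {app: web}  # matches Services, not pods
  endpoints:
  - port: metrics            # named port on the Service
    interval: 30s
```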

Logs

kubectl logs reads from the container runtime log files on the node; --previous gets the logs of the last crashed container.

For aggregation: run Fluent Bit as a DaemonSet, ship to Loki/Elastic/CloudWatch. Don't log to files inside the container — log to stdout/stderr.

Events

The most underused debugging tool.

kubectl describe pod foo shows Events. They have a reason, a message, a count, and they reference the involved object.

Everything that happened to an object is there. They expire after ~1 hour by default — if you want history, ship them to a persistent store.

Debugging checklist:

  1. kubectl describe pod — events tell you what happened
  2. kubectl logs --previous — last crash
  3. kubectl debug -it --image=busybox — ephemeral container in the pod's network namespace
  4. Check controller logs (deployment controller, etc.) if the pod never got created


Cluster operations

Node lifecycle

cordon -> marks node unschedulable (no new pods)
drain -> evicts all pods (respects PDBs)
uncordon -> node back in rotation

etcd: backup or cry

etcd is the cluster. If it's gone, so is everything. Snapshot: etcdctl snapshot save.

Restore: etcdctl snapshot restore then reconfigure the API server to point at the restored data dir.

Test your restore procedure before you need it.

Upgrades and version skew

Version skew policy: kubelet can be at most 2 minor versions behind the API server. kubectl can be ±1. During upgrades: upgrade control plane first, then nodes. Never skip minor versions.


Node and runtime internals

This section explains the node-level machinery behind production problems that look like Kubernetes bugs but aren't. Useful to know before going on-call.

You often hear "Kubernetes is just Linux", and that is correct: fundamentally, Kubernetes relies on the Linux kernel for the majority of its functionality.

cgroups — how limits are actually enforced

CPU limit → cpu.cfs_quota_us / cpu.cfs_period_us.

If your container uses its entire quota in the first part of a period, it's throttled for the rest — even if the node has idle CPUs.

This is the source of latency spikes on CPU-limited pods with no visible CPU saturation.

Memory limit → memory.limit_in_bytes (cgroups v1) / memory.max (cgroups v2).

Exceed it → OOM killed by the kernel. No warning. OOMKilled exit code in pod status.

cgroups v1 vs v2: cgroups v2 (unified hierarchy) is default on modern kernels (>= kernel 5.8, most distros now). Memory accounting is more accurate. CPU throttling behavior differs slightly. Know which one your nodes are running.

Linux namespaces — isolation per pod

Each pod gets its own: net (network stack, IP), pid (process tree), mnt (filesystem), uts (hostname), ipc.

Containers in the same pod share the net and ipc namespaces — they can talk on localhost and see each other's processes if shareProcessNamespace: true.

Device plugins — GPUs and other hardware

Hardware that isn't CPU/memory is exposed via device plugins.

Plugin registers with kubelet, advertises capacity (e.g. nvidia.com/gpu: 4). Pod requests it in resources. Kubelet allocates it. The plugin handles the actual device assignment to the container.

Dynamic Resource Allocation (DRA) is the next-gen version (GA in 1.34): more flexible, structured parameters, not just counts.

Kubelet eviction

When a node is under memory/disk pressure, kubelet evicts pods.

Order: BestEffort first, then Burstable (those exceeding requests), then Guaranteed. This is why you always set resource requests: it determines your eviction priority.
