jpnt blog / Journal / Kubernetes is not a deployment tool

Kubernetes is not a deployment tool

#infrastructure #programming

It is a distributed state machine.

You declare what you want and the system reconciles toward it, forever.

That is the whole thing.

The mental model for k8s

  1. read desired state
  2. read actual state
  3. if different, close gap
  4. goto 1

That is what every controller does in Kubernetes.

The desired state is read from the API server, which is just a database with an HTTP frontend and a pub/sub system (the watch mechanism). etcd is the actual storage.

Every component (scheduler, kubelet, kube-controller-manager) is just a controller running this loop against different resources.

You never tell Kubernetes how to do something. You declare what you want. The diff is the system's problem.

When something is broken, ask: which controller owns this resource? What is it reconciling against? That always points to the bug.

API request lifecycle

  1. kubectl apply
  2. authn
  3. authz (RBAC)
  4. admission webhooks
  5. validation
  6. etcd

Once it hits etcd, every watching controller gets a MODIFIED event and starts reconciling.


Architecture

Two planes. Control plane tells the cluster what to do. Data plane does it.

  1. Control plane (usually 1-3 nodes)
    • kube-apiserver
    • etcd
    • kube-scheduler
    • kube-controller-manager
    • cloud-controller-manager
  2. Data plane (every worker node)
    • kubelet
    • kube-proxy
    • container runtime (containerd, CRI-O, etc.)
    • CNI plugin
| Component | What It Does |
| --- | --- |
| kube-apiserver | The only thing that talks to etcd. All state reads/writes go through here. Provides REST + watch API. |
| etcd | Distributed KV store using Raft consensus. Source of truth. Back it up. If it dies, the cluster is blind. |
| kube-scheduler | Watches for unscheduled pods. Runs filter plugins (can this node run this pod?) then score plugins (which node is best?). Writes the node name to the pod spec. Kubelet does the rest. |
| kube-controller-manager | Runs ~30 controllers in one process (ReplicaSet, Deployment, Job, etc.), all doing the reconcile loop. |
| cloud-controller-manager | Cloud-specific controllers: provision a load balancer for Services of type LoadBalancer, assign node IPs, etc. |
| kubelet | Runs on every node. Watches assigned pods via API, calls CRI to start/stop containers, reports node/pod status. |
| kube-proxy | Maintains iptables (or IPVS) rules on each node so Service IPs route to pod IPs. Not a traditional proxy. |
| container runtime | containerd or CRI-O. Pulls images, creates containers via OCI spec. Kubelet talks to it via the CRI gRPC API. |

Misconception: kube-proxy does not proxy traffic. It programs the kernel's netfilter rules so the kernel routes packets directly. The name is historical.


Workloads

The pod is the unit of scheduling. One or more containers, one IP, one set of volumes, one lifecycle. You almost never create pods directly; instead, you use a controller that manages pods for you.

Pod spec essentials:

spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests: # what scheduler uses for bin-packing
        cpu: 100m
        memory: 128Mi
      limits:   # kernel enforces this via cgroups
        cpu: 500m
        memory: 256Mi
    readinessProbe: # controls traffic routing
      httpGet: {path: /healthz, port: 8080}
    livenessProbe:  # controls restart
      httpGet: {path: /healthz, port: 8080}
  initContainers: # run to completion before app containers start
  - name: migrate
    image: migrate-tool:latest

Tip: Always set requests and limits. No requests = the scheduler schedules blind. No limits = one container can OOM the whole node. QoS class is determined by this: Guaranteed (requests == limits), Burstable (requests set, limits higher or unset), BestEffort (neither set, evicted first).

| Controller | Use When | Key Fields |
| --- | --- | --- |
| Deployment | Stateless apps, rolling updates | replicas, strategy (RollingUpdate/Recreate), maxSurge, maxUnavailable |
| StatefulSet | Databases, anything needing stable identity or ordered rollouts | serviceName (headless), volumeClaimTemplates, podManagementPolicy |
| DaemonSet | One pod per node (log agents, monitoring, CNI) | updateStrategy, node targeting via tolerations |
| Job | Run-to-completion tasks | completions, parallelism, backoffLimit, completionMode: Indexed |
| CronJob | Scheduled jobs | schedule (cron), concurrencyPolicy, startingDeadlineSeconds |
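As a sketch, a minimal Deployment tying the rolling-update fields together (names and images are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels: {app: web}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most 1 extra pod during rollout
      maxUnavailable: 0  # never drop below the desired replica count
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
      - name: app
        image: nginx:1.25
```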

Multi-container pod patterns: sidecar (a helper container running alongside the app, e.g. a log shipper or proxy), ambassador (a local proxy to an external service), adapter (reshapes the app's output for the rest of the cluster).

Graceful shutdown: K8s sends SIGTERM, waits terminationGracePeriodSeconds (default 30s), then SIGKILL. Use a preStop hook to delay shutdown if your app needs it. Handle SIGTERM in your app.
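A minimal shutdown sketch: the preStop sleep delays SIGTERM so endpoint removal propagates before the app stops accepting connections (the 10s value is illustrative):

```yaml
spec:
  terminationGracePeriodSeconds: 60  # default is 30
  containers:
  - name: app
    image: nginx:1.25
    lifecycle:
      preStop:
        exec:
          # give kube-proxy/endpoints time to drop this pod before SIGTERM
          command: ["sleep", "10"]
```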


Configuration and Secrets

Two objects. ConfigMap for non-sensitive config. Secret for sensitive data (base64-encoded by default, which means it is not encrypted).

| How to Consume | Works With | Note |
| --- | --- | --- |
| env var | ConfigMap, Secret | Snapshot at pod start — changes don’t propagate |
| envFrom | ConfigMap, Secret | Inject all keys as environment variables at once |
| volume mount | ConfigMap, Secret | Files in a directory — live updates (with ~60s delay) |
| projected volume | serviceAccountToken, ConfigMap, Secret, downwardAPI | Combine multiple sources into one mount path |
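The consumption mechanisms side by side, as a sketch (the `app-config` / `app-secrets` names are placeholders):

```yaml
spec:
  containers:
  - name: app
    image: nginx:1.25
    env:
    - name: LOG_LEVEL                     # single key, snapshot at pod start
      valueFrom:
        configMapKeyRef: {name: app-config, key: log-level}
    envFrom:
    - secretRef: {name: app-secrets}      # all keys become env vars
    volumeMounts:
    - name: config
      mountPath: /etc/app                 # files, updated live (~60s delay)
  volumes:
  - name: config
    configMap: {name: app-config}
```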

Downward API: expose pod metadata to the container

Pod name, namespace, labels, resource limits: available as env vars or files. No API call needed from inside the container.
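For example, exposing pod metadata as environment variables via fieldRef/resourceFieldRef:

```yaml
spec:
  containers:
  - name: app
    image: nginx:1.25
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef: {fieldPath: metadata.name}
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef: {fieldPath: metadata.namespace}
    - name: MEM_LIMIT
      valueFrom:
        resourceFieldRef: {containerName: app, resource: limits.memory}
```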

Secrets are not encrypted by default: They're base64. Anyone with etcd access or the right RBAC can read them. To actually encrypt: configure EncryptionConfiguration on the API server, or use an external secrets operator (ESO + Vault/AWS Secrets Manager). For production, use external secrets.
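An EncryptionConfiguration sketch for encrypting Secrets at rest (passed to kube-apiserver via --encryption-provider-config; the key name and value are placeholders):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: ["secrets"]
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded 32-byte key>
  - identity: {}   # fallback so pre-existing plaintext secrets stay readable
```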


Storage and CSI

| Object | What It Is |
| --- | --- |
| PersistentVolume (PV) | A piece of storage in the cluster. Created by an admin or dynamically by a provisioner. Has a lifecycle independent of any pod. |
| PersistentVolumeClaim (PVC) | A request for storage by a user. "I need 10Gi, ReadWriteOnce." Binds to a matching PV. |
| StorageClass | Defines a provisioner + parameters. PVC references it → provisioner creates the PV automatically (dynamic provisioning). |

Tip: volumeBindingMode: WaitForFirstConsumer. The default is Immediate: the PV is provisioned in a random AZ before the pod is scheduled, which causes pod-PV AZ mismatches. Use WaitForFirstConsumer so provisioning waits until the pod is scheduled and the AZ is known.
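A StorageClass sketch with WaitForFirstConsumer set (assumes the AWS EBS CSI driver; provisioner and parameters vary by backend):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wait
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer  # provision after the pod is scheduled
```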

Access modes

ReadWriteOnce (RWO): mounted read-write by one node. ReadOnlyMany (ROX): read-only by many nodes. ReadWriteMany (RWX): read-write by many nodes. ReadWriteOncePod (RWOP): read-write by a single pod.

CSI

CSI (Container Storage Interface) is the plugin system. Every cloud volume, NFS driver, or Ceph plugin implements this spec. You almost never implement it, but knowing the pieces explains every stuck-volume bug you'll ever hit.

  1. controller plugin
    • One instance (Deployment). Talks to the storage backend API. Creates/deletes/attaches/detaches volumes.
  2. node plugin
    • DaemonSet — runs on every node. Mounts/unmounts the volume on the node. Calls NodeStageVolume, NodePublishVolume.
  3. external-provisioner
    • Sidecar. Watches PVCs, calls CreateVolume on the driver when a new PVC appears.
  4. external-attacher
    • Sidecar. Watches VolumeAttachment objects, calls ControllerPublishVolume to attach to the node.
  5. external-resizer
    • Sidecar. Watches PVCs for capacity changes, triggers ControllerExpandVolume.
  6. external-snapshotter
    • Sidecar. Handles VolumeSnapshot CRDs → calls CreateSnapshot on the driver.

Debugging stuck volumes: Pod stuck in ContainerCreating with a volume error → check VolumeAttachment (external-attacher), then check the node plugin logs (NodeStageVolume/NodePublishVolume). PV stuck Released → check the reclaim policy and whether a finalizer is blocking deletion.


Networking: it is just routing

Three rules define the whole model:

  1. Every pod gets a unique IP across the entire cluster
  2. Any pod can reach any other pod by IP, no NAT
  3. Containers in the same pod share a network namespace (same IP, same port space)

How this is implemented is the CNI plugin's problem (Calico, Cilium, Flannel, etc). Kubernetes doesn't care; it just requires the above contract.

Services: stable endpoints in front of pods

Pods die and get new IPs. A Service is a stable virtual IP (ClusterIP) with a DNS name that always routes to healthy pods matching its selector.

| Type | Reachable From | Use Case |
| --- | --- | --- |
| ClusterIP | Inside cluster only | Default. Internal service-to-service communication. |
| NodePort | Outside cluster via node IP:port | Dev/testing. Exposes a port on every node. |
| LoadBalancer | Outside cluster via cloud LB IP | Production external traffic. Cloud-controller-manager provisions the LB. |
| ExternalName | Inside cluster | CNAME alias for an external DNS name. No proxying. |
| Headless (clusterIP: None) | Inside cluster | DNS returns pod IPs directly. Required for StatefulSets. No virtual IP. |
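A ClusterIP and a headless Service side by side, as a sketch (selectors and ports are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector: {app: web}
  ports:
  - port: 80          # virtual ClusterIP port
    targetPort: 8080  # container port
---
apiVersion: v1
kind: Service
metadata:
  name: db-headless
spec:
  clusterIP: None     # headless: DNS returns pod IPs directly
  selector: {app: db}
  ports:
  - port: 5432
```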

How ClusterIP actually works: The ClusterIP is not assigned to any interface. kube-proxy programs iptables/IPVS rules on every node: packets destined to the ClusterIP get DNAT'd to a real pod IP by the kernel. There's no userspace proxy involved.

DNS

CoreDNS runs in the cluster.

Every Service gets a DNS name: service-name.namespace.svc.cluster.local.

Pods in the same namespace can just use service-name.

StatefulSet pods get stable DNS per-pod: pod-name.service-name.namespace.svc.cluster.local.

Note: ndots:5 gotcha: Default resolv.conf has ndots:5. Any name with fewer than 5 dots triggers a search domain walk before going to the root. api.github.com → tries api.github.com.default.svc.cluster.local first. Adds latency. For external FQDNs, add a trailing dot or tune dnsConfig.options.ndots.
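Tuning it down looks like this (a per-pod sketch; pick the ndots value that fits your naming):

```yaml
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"   # names with >= 2 dots skip the search-domain walk
```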

Ingress

L7 routing in front of Services.

Rules match on hostname and path, route to a Service backend.

An IngressController (nginx, Traefik, etc.) implements the actual proxying — Ingress is just config.
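A host/path rule sketch (assumes an nginx ingress class is installed; hostnames and service names are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: nginx
  rules:
  - host: example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api
            port: {number: 8080}
```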

Gateway API is the replacement. It's GA. More expressive, supports TCP/gRPC routes, splits roles between infra (Gateway) and app (HTTPRoute). Start here for new clusters.

NetworkPolicy

Firewall rules for pods.

Default: all traffic allowed. A NetworkPolicy with a podSelector creates a whitelist — only matching ingress/egress is allowed.

Policies are additive (union). Enforced by the CNI plugin (not kube-proxy).
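The standard starting point is a default-deny policy plus explicit allows, sketched here (label names are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}          # empty selector = all pods in the namespace
  policyTypes: ["Ingress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
spec:
  podSelector:
    matchLabels: {app: api}
  ingress:
  - from:
    - podSelector:
        matchLabels: {app: frontend}
```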


Scheduling, autoscaling and reliability

Controlling where pods land

| Mechanism | Use |
| --- | --- |
| nodeSelector | Simple label match. Pod goes to nodes with this label. |
| nodeAffinity | Same idea but with required/preferred rules and richer expressions. |
| podAffinity / anti-affinity | Co-locate or spread pods relative to other pods. Anti-affinity for HA (avoid same node). |
| taints + tolerations | Taints repel pods from nodes. Tolerations allow pods to ignore a taint. Used to dedicate nodes (GPU, spot). |
| topology spread constraints | Spread pods evenly across zones/nodes. Preferred over anti-affinity for large-scale distribution. |
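A sketch combining zone spreading with a toleration for dedicated nodes (the `gpu` taint key and labels are hypothetical):

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1                              # zones differ by at most 1 pod
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels: {app: web}
  tolerations:
  - key: "gpu"                              # allow scheduling onto tainted GPU nodes
    operator: "Exists"
    effect: "NoSchedule"
```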

Autoscaling

HPA scales pod replicas on metrics (CPU/memory via metrics-server, or custom metrics). VPA adjusts requests/limits instead of replica count. Cluster Autoscaler (or Karpenter) adds and removes nodes when pods are unschedulable.

Reliability primitives

| Object | What It Does |
| --- | --- |
| PodDisruptionBudget | Limits voluntary disruptions. During a drain, K8s won't evict pods if it would violate the PDB. Use this so drains don't take down your whole service. |
| ResourceQuota | Per-namespace cap on total CPU, memory, and object count. Prevents one team from starving another. |
| LimitRange | Per-namespace default requests/limits and min/max enforcement. Prevents pods with no resources set from being scheduled. |
| PriorityClass | Higher priority pods preempt lower priority pods if the cluster is full. Use for critical workloads. |
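A PDB sketch (selector is a placeholder; choose minAvailable or maxUnavailable, not both):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2        # drains stall rather than go below 2 ready pods
  selector:
    matchLabels: {app: web}
```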


Security

Authentication vs authorization

Authn: who are you? K8s supports x509 certs, bearer tokens, OIDC (standard), webhook. No internal user database — users exist only in credentials.

Authz: what can you do? RBAC is the answer.

RBAC — four objects

  1. Role
    • Permissions within a namespace. verbs (get, list, watch, create, update, patch, delete) on resources.
  2. ClusterRole
    • Same but cluster-scoped. Also used for non-namespaced resources (nodes, PVs).
  3. RoleBinding
    • Grants a Role to a subject (user, group, ServiceAccount) in a namespace.
  4. ClusterRoleBinding
    • Grants a ClusterRole cluster-wide.
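A read-only Role bound to a ServiceAccount, as a sketch (the `dev` namespace and `ci-bot` account are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: dev
rules:
- apiGroups: [""]            # "" = core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: dev
subjects:
- kind: ServiceAccount
  name: ci-bot
  namespace: dev
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```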

ServiceAccount

Identity for pods. A pod runs as a ServiceAccount — its token is a projected volume mounted at /var/run/secrets/kubernetes.io/serviceaccount/token. This token is used to authenticate against the API server. IRSA (AWS) and Workload Identity (GCP) use this mechanism to grant pods cloud IAM permissions.

Admission — the last gate before etcd

After authz, admission webhooks can mutate or reject requests. Order matters. The built-in controllers run first (NamespaceLifecycle, LimitRanger, ResourceQuota…), then your webhooks.

Runtime security essentials

Set on securityContext: runAsNonRoot: true, readOnlyRootFilesystem: true, allowPrivilegeEscalation: false, drop all Linux capabilities then add only what you need.

Use seccompProfile: RuntimeDefault as a baseline.
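Putting the baseline together in one pod spec sketch (the image is a placeholder and must actually run as non-root; add back only the capabilities you need):

```yaml
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: my-app:1.0
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]  # only if binding a port below 1024
```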


Extensibility and the operator pattern

Kubernetes is designed to be extended.

The mechanism is: add your own API types (CRDs), then write a controller that reconciles them. That's an operator.

CRD — Custom Resource Definition

Register a new API type (e.g. Postgres in group databases.example.com).

From that point it behaves like any built-in type — you can kubectl apply it, watch it, RBAC it.

The CRD defines the schema (validated via OpenAPI/CEL).
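A trimmed CRD sketch for the Postgres example (the group and fields are from the running example; the schema is illustrative):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: postgresqls.databases.example.com  # must be <plural>.<group>
spec:
  group: databases.example.com
  names:
    kind: Postgres
    plural: postgresqls
    singular: postgresql
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              replicas: {type: integer, minimum: 1}
              version:  {type: string}
```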

The operator pattern

operator = CRD + controller

You define what a Postgres cluster looks like (the CRD).

The controller does the rest: when it sees a Postgres object, it reconciles the actual StatefulSets, Services, Secrets, and backups toward what the spec says.

The controller uses an informer cache (local copy of API objects, kept in sync via watch) so it never hammers the API server.

Work is queued in a rate-limited queue. On leader election, only one replica of the controller acts at a time.

Server-side apply

Instead of client-side three-way merge, the API server tracks field ownership.

Two controllers can manage the same object if they own different fields. Conflicts are explicit.

Use this for operators — it prevents the "last-write-wins" problem.


Observability

Metrics: three layers

| Tool | What It Gives You |
| --- | --- |
| metrics-server | CPU/memory per pod/node in real time. Powers HPA and kubectl top. In-memory only — no history. |
| kube-state-metrics | Object state metrics: deployment replicas, pod phase, job completions. Not resource usage. |
| node-exporter | Host-level metrics: disk, network, filesystem, CPU steal. Typically runs as a DaemonSet. |

Prometheus scrapes all three. Prometheus Operator makes this declarative with ServiceMonitor and PodMonitor CRDs.
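A ServiceMonitor sketch (assumes the Prometheus Operator CRDs are installed; label and port names are placeholders):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web
spec:
  selector:
    matchLabels: {app: web}  # matches Services, not pods
  endpoints:
  - port: metrics            # named port on the Service
    interval: 30s
```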

Logs

kubectl logs reads from the container runtime log files on the node; --previous gets the logs of the last crashed container.

For aggregation: run Fluent Bit as a DaemonSet, ship to Loki/Elastic/CloudWatch. Don't log to files inside the container — log to stdout/stderr.

Events

The most underused debugging tool.

kubectl describe pod foo shows Events. They have a reason, a message, a count, and they reference the involved object.

Everything that happened to an object is there. They expire after ~1 hour by default — if you want history, ship them to a persistent store.

Debugging checklist:

  1. kubectl describe pod — events tell you what happened
  2. kubectl logs --previous — last crash
  3. kubectl debug -it --image=busybox — ephemeral container in the pod's network namespace
  4. Check controller logs (deployment controller, etc.) if the pod never got created


Cluster operations

Node lifecycle

cordon -> marks node unschedulable (no new pods)
drain -> evicts all pods (respects PDBs)
uncordon -> node back in rotation

etcd: backup or cry

etcd is the cluster. If it's gone, so is everything. Snapshot: etcdctl snapshot save.

Restore: etcdctl snapshot restore then reconfigure the API server to point at the restored data dir.

Test your restore procedure before you need it.

Upgrades and version skew

Version skew policy: kubelet can be at most 2 minor versions behind the API server. kubectl can be ±1. During upgrades: upgrade control plane first, then nodes. Never skip minor versions.


Node and runtime internals

This section explains the node-level machinery behind production problems that look like Kubernetes bugs but aren't. Useful to know before going on-call.

You often hear "Kubernetes is just Linux", and that is correct: fundamentally, Kubernetes relies on the Linux kernel for the majority of its functionality.

cgroups — how limits are actually enforced

CPU limit → cpu.cfs_quota_us / cpu.cfs_period_us.

If your container uses its entire quota in the first part of a period, it's throttled for the rest — even if the node has idle CPUs.

This is the source of latency spikes on CPU-limited pods with no visible CPU saturation.

Memory limit → memory.limit_in_bytes (cgroups v1) / memory.max (cgroups v2).

Exceed it → OOM killed by the kernel. No warning. OOMKilled exit code in pod status.

cgroups v1 vs v2: cgroups v2 (unified hierarchy) is default on modern kernels (>= kernel 5.8, most distros now). Memory accounting is more accurate. CPU throttling behavior differs slightly. Know which one your nodes are running.

Linux namespaces — isolation per pod

Each pod gets its own: net (network stack, IP), pid (process tree), mnt (filesystem), uts (hostname), ipc.

Containers in the same pod share the net and ipc namespaces — they can talk on localhost and see each other's processes if shareProcessNamespace: true.

Device plugins — GPUs and other hardware

Hardware that isn't CPU/memory is exposed via device plugins.

Plugin registers with kubelet, advertises capacity (e.g. nvidia.com/gpu: 4). Pod requests it in resources. Kubelet allocates it. The plugin handles the actual device assignment to the container.

Dynamic Resource Allocation (DRA) is the next-gen version (GA in 1.34): more flexible, structured parameters, not just counts.

Kubelet eviction

When a node is under memory/disk pressure, kubelet evicts pods.

Order: BestEffort first, then Burstable (those exceeding requests), then Guaranteed. This is why you always set resource requests: it determines your eviction priority.
