Kubernetes Interview Questions and Answers

This guide is for engineers preparing for Kubernetes-heavy interviews: DevOps, platform, SRE, and backend roles where k8s is on the job description. It covers what interviewers actually probe at each level, from the conceptual basics through the hands-on troubleshooting that separates a junior from a senior. Every question below comes with a model answer you can adapt, plus the commands and gotchas that show you have run these systems in production.

Reading answers is not the same as being able to fix a broken cluster while someone watches. The strongest preparation pairs concept review with real debugging reps. prepme.io turns a pasted job description into a free briefing that surfaces hands-on k3s/Kubernetes and Terraform labs, auto-graded ABCD exams, and an architecture design round, so you practice on the exact stack the role expects rather than generic trivia.

What Kubernetes interviews actually test, by seniority

Kubernetes interviews are layered, and the level you are interviewing for changes what counts as a good answer. Juniors are expected to define the core objects and explain the declarative model: you describe desired state in YAML, the API server persists it to etcd, and controllers reconcile actual state toward it. Mid-level candidates must connect objects into working systems, reason about failure modes, and read the output of kubectl describe and kubectl logs fluently. Senior and staff candidates are judged on trade-offs, blast radius, multi-tenancy, security posture, and how they would design and operate clusters at scale.

A reliable tell of seniority is whether you reach for the right diagnostic command unprompted. A junior says a pod is broken; a senior says kubectl describe pod shows the events, kubectl get events --sort-by=.lastTimestamp gives the timeline, and kubectl logs --previous shows the last crash. Interviewers also watch for honest scoping: knowing what you would check first, second, and third under time pressure beats reciting every possible cause.

Junior: define pods, Deployments, Services, namespaces; explain desired vs actual state and the reconciliation loop.
Mid: wire objects together, read describe/logs/events, explain rolling updates, probes, and resource requests vs limits.
Senior: design for multi-tenancy and least-privilege RBAC, reason about scheduling and capacity, debug live incidents, and justify trade-offs.
All levels: show a deliberate diagnostic order rather than guessing.

Core concepts: pods, Deployments, Services, namespaces

The pod is the smallest deployable unit and the atom of scheduling: one or more containers that share a network namespace (same IP and port space) and can share volumes. You rarely create bare pods in production because nothing reschedules them when a node dies. Instead a Deployment manages a ReplicaSet, which manages pods, giving you declarative scaling, rolling updates, and rollbacks (kubectl rollout status, kubectl rollout undo). StatefulSets are the equivalent for workloads needing stable network identity and per-pod persistent storage; DaemonSets run one pod per node for agents like log shippers or CNI components.

A Service gives a stable virtual IP and DNS name in front of an ephemeral set of pods selected by labels, load-balancing across the ready endpoints. ClusterIP is internal-only, NodePort exposes a static port on every node, and LoadBalancer provisions an external load balancer via the cloud provider. The Service decouples callers from pod churn: pods come and go, the Service name stays constant. Namespaces are virtual partitions for scoping names, applying ResourceQuotas and RBAC, and organizing tenants or environments, but they are not a hard security boundary by themselves; cross-namespace network traffic is allowed by default and must be restricted with NetworkPolicies.

Deployment manages ReplicaSet manages pods; never depend on bare pods for HA.
Labels and selectors are the glue: Services, ReplicaSets, and NetworkPolicies all target pods by label.
StatefulSet for stable identity and storage; DaemonSet for one-per-node agents; Job/CronJob for batch and scheduled work.
Namespaces scope names and policy, not network isolation by default.
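
To make the label-and-selector glue concrete, here is a minimal sketch of equality-based matching, the kind a Service selector uses. The pod names and labels are hypothetical, and the function is an illustration of the rule rather than Kubernetes' own implementation.

```python
def selects(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """Equality-based matching: every selector pair must appear in the pod's labels."""
    # Note: an empty selector matches everything in this sketch; real objects
    # differ (a Service without a selector gets no automatic endpoints).
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Hypothetical pods, for illustration only.
pods = {
    "api-7d9f-abc": {"app": "api", "tier": "backend", "pod-template-hash": "7d9f"},
    "api-7d9f-def": {"app": "api", "tier": "backend", "pod-template-hash": "7d9f"},
    "worker-5c2a-xyz": {"app": "worker", "tier": "backend"},
}

service_selector = {"app": "api"}
matched = sorted(name for name, labels in pods.items() if selects(service_selector, labels))
print(matched)  # ['api-7d9f-abc', 'api-7d9f-def']
```

Extra labels on a pod do not prevent a match; only the keys named in the selector are compared, which is why a typo in a single label silently leaves a Service with no endpoints.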

Networking, storage, and config: Ingress, DNS, PV/PVC, ConfigMaps, Secrets

Cluster DNS (CoreDNS) makes Services discoverable by name. A Service named api in namespace prod resolves at api.prod.svc.cluster.local, and same-namespace callers can just use api. An Ingress is not a Service type; it is an L7 routing layer that maps hostnames and paths to backend Services, and it does nothing without an Ingress controller (ingress-nginx, Traefik, or a cloud controller) actually running and watching Ingress objects. A classic interview trap is claiming you created an Ingress and it should work, when no controller is installed to act on it. (Newer clusters increasingly use the Gateway API for the same job, so mentioning it signals you are current.)

Storage uses the PersistentVolume and PersistentVolumeClaim split: a PVC is the workload's request for storage (size, access mode), and it binds to a PV that satisfies it, usually provisioned dynamically through a StorageClass. Access modes matter: ReadWriteOnce allows read-write mounting from a single node, so when a multi-replica Deployment shares one RWO volume and its pods land on different nodes, the extra pods typically get stuck in ContainerCreating with a Multi-Attach error in their events, because the volume cannot attach to a second node (they stay Pending instead when the claim never binds or the volume's node affinity rules out every candidate node). ConfigMaps hold non-sensitive configuration and Secrets hold sensitive values; both can be mounted as files or injected as environment variables. Remember that stock Secrets are only base64-encoded and are not encrypted at rest unless you enable an encryption provider. Changes to a mounted ConfigMap propagate to the file after a short delay, but the process must be designed to reload it, and env-var injections do not update a running pod at all.

CoreDNS: service.namespace.svc.cluster.local; cross-namespace calls need the namespace in the name.
Ingress needs a running controller; the object alone routes nothing. Gateway API is the modern successor.
ReadWriteOnce volumes attach to one node; this is a common cause of pods stuck in ContainerCreating (Multi-Attach error) on scale-out.
Secrets are base64, not encrypted by default; enable encryption at rest and tight RBAC.
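
The base64 point is easy to demonstrate. The sketch below uses a made-up value to show that the encoding in a Secret's data field is reversible by anyone who can read the object, with no key involved.

```python
import base64

# Hypothetical value as it would appear under `data:` in a Secret manifest.
encoded = base64.b64encode(b"s3cr3t-password").decode("ascii")
print(encoded)

# Anyone who can read the Secret object can reverse it without any key.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)  # s3cr3t-password
```

This is why read access to Secrets in RBAC should be treated as access to the plaintext.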

Scheduling, reliability, and RBAC: probes, quotas, ServiceAccounts, least-privilege

The scheduler places pods based on resource requests, node selectors, affinity and anti-affinity, taints and tolerations, and topology spread constraints. Requests reserve capacity and drive scheduling; limits cap usage at runtime. Get this wrong and you either pack nodes too tightly (no headroom, evictions) or waste capacity. A pod whose requests no node can satisfy stays Pending, and kubectl describe pod will show the FailedScheduling event explaining exactly which predicate failed (Insufficient cpu, Insufficient memory, untolerated taint).
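
As a rough mental model of the resource check, the sketch below compares a pod's requests against each node's remaining allocatable capacity. The node names and numbers are hypothetical, and the real scheduler also evaluates taints, affinity, and topology rules that are not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    allocatable_cpu_m: int      # millicores
    allocatable_mem_mi: int     # MiB
    requested_cpu_m: int = 0    # sum of requests of pods already on the node
    requested_mem_mi: int = 0


def unschedulable_reasons(node: Node, cpu_m: int, mem_mi: int) -> list[str]:
    """Return why the pod does not fit on this node (empty list means it fits)."""
    reasons = []
    if node.requested_cpu_m + cpu_m > node.allocatable_cpu_m:
        reasons.append("Insufficient cpu")
    if node.requested_mem_mi + mem_mi > node.allocatable_mem_mi:
        reasons.append("Insufficient memory")
    return reasons


# Hypothetical two-node cluster.
nodes = [
    Node("node-a", 2000, 4096, requested_cpu_m=1800, requested_mem_mi=1024),
    Node("node-b", 2000, 4096, requested_cpu_m=500, requested_mem_mi=3900),
]

# Pod requesting 500m CPU and 512Mi memory.
report = {n.name: unschedulable_reasons(n, 500, 512) for n in nodes}
print(report)

pending = all(report.values())  # every node has at least one reason -> Pending
print("Pending" if pending else "Schedulable")  # Pending
```

The comparison uses requests already reserved on the node, not live usage, which is why a cluster can look idle in metrics and still refuse to schedule a pod.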

Reliability hinges on probes. A liveness probe restarts a hung container; a readiness probe removes a pod from Service endpoints until it is ready (without restarting it); a startup probe protects slow-booting apps from being killed by liveness before they finish initializing. A common production outage is a liveness probe that is too aggressive, restarting healthy-but-busy pods into a CrashLoop. On security, every pod runs as a ServiceAccount, and RBAC (Roles and ClusterRoles bound by RoleBindings and ClusterRoleBindings) governs what that identity can do. The senior-level answer is least privilege: grant a namespaced Role with only the verbs and resources a workload needs, never bind the cluster-admin ClusterRole to application identities, and prefer per-namespace scoping so a compromised pod cannot read or write outside its own namespace.

Requests drive scheduling and reserve capacity; limits cap runtime usage; set both deliberately.
Liveness restarts, readiness gates traffic, startup protects slow boots; do not conflate them.
RBAC verbs and resources scoped by Role/ClusterRole + bindings; default to namespaced least-privilege.
Use ResourceQuota and LimitRange per namespace to stop a noisy tenant from starving the cluster.

Troubleshooting: CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled

Live troubleshooting is where senior interviews are won or lost, and the discipline is always the same: kubectl get pods to see status, kubectl describe pod for events, and kubectl logs (add --previous to read the crashed instance). CrashLoopBackOff means the container keeps exiting and Kubernetes is backing off (exponentially, capped at five minutes) between restarts; the cause is almost always in the logs of the previous attempt: a bad config or missing env var, a failed dependency at startup, a non-zero exit, or an over-aggressive liveness probe. The status itself is a symptom, not a diagnosis, so never stop at the name.

ImagePullBackOff and ErrImagePull are registry problems: a wrong image name or tag, a private registry with no imagePullSecret, rate limiting, or the node having no network path to the registry. kubectl describe pod shows the exact pull error. Pending means the scheduler cannot place the pod: insufficient CPU or memory across all nodes, an unsatisfiable nodeSelector or affinity rule, an untolerated taint, or a PVC that will not bind (for example because no StorageClass or PV can satisfy the claim). A ReadWriteOnce volume already attached on another node is a related trap, but it usually surfaces after scheduling as ContainerCreating with a Multi-Attach error. OOMKilled appears as the container's last state with reason OOMKilled and exit code 137 (128 + 9, the signal number of SIGKILL): the process exceeded its memory limit and the cgroup OOM killer terminated it. The fix is to right-size the memory limit, find the leak or unbounded buffer, and confirm requests and limits reflect real usage measured under load.

CrashLoopBackOff: read kubectl logs --previous; check config, env, dependencies, and probe aggressiveness.
ImagePullBackOff: verify image name/tag, imagePullSecret for private registries, and node-to-registry network.
Pending: read the FailedScheduling event for insufficient resources, taints, affinity, or unbound PVC.
OOMKilled (exit 137): raise the memory limit only after measuring real usage, and fix the underlying leak.
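
The back-off behind CrashLoopBackOff can be sketched in a few lines. The five-minute cap comes from the text above; the 10-second starting delay and the doubling are the default kubelet behaviour as described in the upstream documentation, and should be treated as an assumption for your cluster version.

```python
def crashloop_delays(restarts: int, base: int = 10, cap: int = 300) -> list[int]:
    """Delay in seconds before each of the first `restarts` restarts."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]


delays = crashloop_delays(7)
print(delays)  # [10, 20, 40, 80, 160, 300, 300]
```

In practice this means a pod that has been crashing for a while may sit idle for up to five minutes between attempts, so a fix can look like it did nothing until the next restart fires.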

How to prepare so the answers stick

Memorizing definitions gets you through the first ten minutes; the interview turns when someone hands you a broken cluster or asks you to design one. Build muscle memory on the diagnostic loop until kubectl describe, kubectl get events --sort-by=.lastTimestamp, and kubectl logs --previous are reflexes. Practice deliberately breaking things in a throwaway cluster: ship a bad image tag, set a memory limit too low, apply a too-aggressive liveness probe, request more CPU than any node has, then walk yourself through finding and fixing each one out loud as you would in the interview.

prepme.io is built for exactly this loop. Paste the job description you are targeting and you get a free briefing that maps the role to three practice surfaces: real in-browser k3s/Kubernetes and Terraform labs where you debug live pods (the flagship), auto-graded ABCD exams across easy, medium, and hard to drill recall, and an architecture design round for the whiteboard portion. Each task is graded 0 to 100 with written feedback so you know exactly where you stand, and a free Company Coverage card researches the employer (recent news, funding, culture, and likely interview intel) so you can tailor your answers. Generating and viewing the briefing is free; you can try the demo before deciding to subscribe and practice the hands-on labs.

Drill the diagnostic loop until describe/events/logs --previous are automatic.
Break a sandbox cluster on purpose, then narrate the fix as if interviewing.
Target practice to the actual JD instead of generic question dumps.
Pair concept recall (exams) with hands-on reps (labs) and design practice.

Common interview questions & answers

What is the difference between a Deployment, a ReplicaSet, and a pod?

A pod is the smallest deployable unit, one or more co-located containers sharing a network namespace and storage. A ReplicaSet ensures a specified number of identical pod replicas are running at any time. A Deployment is a higher-level controller that manages ReplicaSets to give you declarative rolling updates, rollbacks, and scaling. In practice you create Deployments; on each update the Deployment creates a new ReplicaSet and scales the old one down gradually, which is what enables zero-downtime rollouts and kubectl rollout undo.

How does a Service find the pods it routes to, and what are the main Service types?

A Service selects pods by label and load-balances across the ready endpoints behind a stable virtual IP and DNS name, so callers never depend on individual pod IPs that churn. The endpoint list is maintained automatically as pods pass or fail readiness. ClusterIP exposes the Service only inside the cluster, NodePort opens a static port on every node, and LoadBalancer asks the cloud provider for an external load balancer. Headless Services (clusterIP: None) skip the virtual IP and return pod IPs directly via DNS, which StatefulSets use for stable per-pod records.

Explain liveness, readiness, and startup probes and when each matters.

A liveness probe detects a hung or deadlocked container: when it fails, the kubelet kills the container, which is then restarted according to the pod's restartPolicy. A readiness probe controls whether a pod receives traffic by adding or removing it from Service endpoints, without restarting it, which is what you want during warmup or temporary dependency loss. A startup probe protects slow-starting apps by holding off liveness and readiness checks until the app has booted, preventing premature restarts. A frequent production mistake is an aggressive liveness probe that restarts healthy-but-busy pods, turning a load spike into a CrashLoopBackOff.
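
A quick way to reason about how aggressive a probe is: multiply periodSeconds by failureThreshold. The sketch below does that arithmetic with the Kubernetes defaults (10 and 3) and a hypothetical startup probe; it is an approximation that ignores timeoutSeconds and initialDelaySeconds.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    period_seconds: int = 10      # Kubernetes default periodSeconds
    failure_threshold: int = 3    # Kubernetes default failureThreshold

    def failure_budget(self) -> int:
        """Approximate seconds of consecutive failures tolerated before the kubelet acts."""
        return self.period_seconds * self.failure_threshold


liveness = Probe()
# Hypothetical startup probe for a slow-booting app.
startup = Probe(period_seconds=10, failure_threshold=30)

print(liveness.failure_budget())  # 30
print(startup.failure_budget())   # 300
```

With these numbers the app gets about five minutes to boot under the startup probe, after which the much tighter liveness budget takes over.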

A pod is stuck in CrashLoopBackOff. Walk me through diagnosing it.

CrashLoopBackOff means the container starts, exits, and Kubernetes is backing off before restarting, so the status is a symptom not a cause. I run kubectl describe pod to read the events and last state, then kubectl logs <pod> --previous to see output from the crashed instance, since the current one may not have logged yet. The usual causes are a missing or malformed config or env var, a failed startup dependency, a non-zero exit from the app, or an over-aggressive liveness probe. I fix the root cause shown in the previous-instance logs rather than just restarting and hoping.

What causes ImagePullBackOff, and how do you fix it?

ImagePullBackOff (preceded by ErrImagePull) means the kubelet cannot pull the container image. kubectl describe pod shows the exact error. Common causes are a wrong image name or tag, a private registry with no imagePullSecret referenced by the pod or its ServiceAccount, registry rate limiting, or the node having no network route to the registry. The fix follows the error: correct the tag, attach the right imagePullSecret, authenticate to the registry, or restore connectivity. I also confirm the image actually exists in the registry, since a typo in the tag is the most common culprit.

A pod stays Pending and never schedules. What do you check?

Pending means the scheduler cannot place the pod, so I read kubectl describe pod for the FailedScheduling event, which names the failing predicate. The usual reasons are insufficient CPU or memory across all nodes relative to the pod's requests, an unsatisfiable nodeSelector or affinity rule, a taint with no matching toleration, or a PVC that cannot bind. A subtle related case is a ReadWriteOnce volume already attached on another node: the second replica is usually scheduled but then hangs in ContainerCreating with a Multi-Attach error, so I check the pod's events rather than assuming a scheduling problem. I fix it by right-sizing requests, adding capacity, adjusting affinity or tolerations, or correcting the storage access mode.

What does OOMKilled mean and how do you remediate it properly?

OOMKilled means the container exceeded its memory limit and the cgroup out-of-memory killer terminated it, which shows as last state OOMKilled with exit code 137 in kubectl describe pod. The wrong fix is to blindly raise the limit; the right approach is to measure actual memory use under realistic load, set the limit above the true working set with headroom, and align the request so the scheduler reserves enough and the pod gets a stable QoS class. If usage grows unbounded I treat it as a leak or an unbounded buffer or cache and fix the application, because a higher limit only delays the next kill.
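
Exit code 137 follows the common shell convention of 128 plus the signal number. The helper below is a small illustrative decoder with a hand-picked signal table (Linux numbering); it is not part of any Kubernetes tooling.

```python
# Small lookup of signals commonly seen in container exits (Linux numbering).
SIGNALS = {9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit(code: int) -> str:
    """Translate a container exit code into a short human-readable hint."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        number = code - 128
        return f"killed by signal {number} ({SIGNALS.get(number, 'unknown')})"
    return f"application error (exit {code})"


print(describe_exit(137))  # killed by signal 9 (SIGKILL)
print(describe_exit(143))  # killed by signal 15 (SIGTERM)
print(describe_exit(1))    # application error (exit 1)
```

Exit 137 only says the process received SIGKILL; it is the OOMKilled reason in the container's last state that confirms the memory limit was the cause.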

How does RBAC work in Kubernetes, and what does least privilege look like?

RBAC grants permissions through Roles and ClusterRoles, which list allowed verbs on resources, bound to subjects by RoleBindings (namespace-scoped) and ClusterRoleBindings (cluster-wide). Every pod authenticates as a ServiceAccount, so workload permissions come from what is bound to that account. Least privilege means giving each workload a namespaced Role with only the verbs and resources it needs, never binding cluster-admin to application identities, and scoping per namespace so a compromised pod cannot reach other tenants. kubectl auth can-i, with --as=system:serviceaccount:<namespace>:<name> to impersonate the workload's identity, is a useful check, but it evaluates authorization only and can mislead for cluster-scoped resources, so confirm with a real API call rather than relying on can-i alone.
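
The additive nature of RBAC rules can be sketched as a simple matcher. The PolicyRule shape mirrors the apiGroups, resources, and verbs fields of a Role, but the class, the helper names, and the example role are hypothetical, and the sketch ignores resourceNames and non-resource URLs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRule:
    api_groups: tuple[str, ...]
    resources: tuple[str, ...]
    verbs: tuple[str, ...]


def _matches(granted: tuple[str, ...], wanted: str) -> bool:
    # "*" is the RBAC wildcard for any value in that field.
    return "*" in granted or wanted in granted


def is_allowed(rules: list[PolicyRule], api_group: str, resource: str, verb: str) -> bool:
    """RBAC is additive: any single matching rule grants the request; there are no deny rules."""
    return any(
        _matches(r.api_groups, api_group)
        and _matches(r.resources, resource)
        and _matches(r.verbs, verb)
        for r in rules
    )


# Hypothetical namespaced Role for a workload that only reads ConfigMaps.
# The empty string is the core API group.
config_reader = [
    PolicyRule(api_groups=("",), resources=("configmaps",), verbs=("get", "list", "watch")),
]

print(is_allowed(config_reader, "", "configmaps", "get"))     # True
print(is_allowed(config_reader, "", "secrets", "get"))        # False
print(is_allowed(config_reader, "", "configmaps", "delete"))  # False
```

Because nothing can subtract a permission, the only way to keep a workload least-privileged is to avoid granting broad rules in the first place.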

What is the difference between resource requests and limits, and why do both matter?

A request is the amount of CPU or memory the scheduler reserves for a container and uses to decide which node fits the pod; a limit is the runtime ceiling the container cannot exceed. CPU over the limit is throttled, while memory over the limit gets the container OOMKilled. If requests are too low you oversubscribe nodes and trigger evictions under pressure; if too high you waste capacity and leave pods Pending. The interview-grade answer is that requests and limits together define the pod's Quality of Service class (Guaranteed, Burstable, BestEffort), which determines eviction order when a node runs low on resources.
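
The QoS rules can be written down as a short classifier. This is a simplified sketch: it compares quantity strings literally (so "500m" and "0.5" would not count as equal), ignores init containers, and the Container class is a hypothetical stand-in for the container spec.

```python
from dataclasses import dataclass, field

RESOURCES = ("cpu", "memory")  # only these two resources affect the QoS class


@dataclass
class Container:
    requests: dict[str, str] = field(default_factory=dict)
    limits: dict[str, str] = field(default_factory=dict)

    def effective_requests(self) -> dict[str, str]:
        # When only a limit is set, Kubernetes defaults the request to the limit.
        return {**self.limits, **self.requests}


def qos_class(containers: list[Container]) -> str:
    # BestEffort: no container sets any CPU or memory request or limit.
    if all(
        res not in c.requests and res not in c.limits
        for c in containers
        for res in RESOURCES
    ):
        return "BestEffort"
    # Guaranteed: every container has CPU and memory limits equal to its requests.
    guaranteed = all(
        res in c.limits and c.effective_requests().get(res) == c.limits[res]
        for c in containers
        for res in RESOURCES
    )
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([Container()]))  # BestEffort
print(qos_class([Container(limits={"cpu": "1", "memory": "1Gi"})]))  # Guaranteed
print(qos_class([Container(requests={"cpu": "250m"},
                           limits={"cpu": "500m", "memory": "256Mi"})]))  # Burstable
```

Under node memory pressure, BestEffort pods are the first candidates for eviction and Guaranteed pods the last, which is the practical reason to set requests equal to limits for critical workloads.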

How do ConfigMaps and Secrets differ, and what are their security caveats?

Both inject configuration into pods as environment variables or mounted files, but ConfigMaps are for non-sensitive data and Secrets are for sensitive values. The key caveat is that stock Secrets are only base64-encoded, not encrypted, so anyone with API read access or etcd access can decode them unless you enable encryption at rest and lock down RBAC. Another gotcha is propagation: changes to a mounted ConfigMap or Secret eventually update the file, but env-var injections are fixed at pod start and never update a running pod, and most apps need an explicit reload to pick up file changes anyway.

What is an Ingress and why might one not work even though it is created?

An Ingress is an L7 routing object that maps hostnames and paths to backend Services, enabling host-based and path-based routing, TLS termination, and a single entry point for many Services. The critical point is that an Ingress object does nothing on its own; it requires an Ingress controller (ingress-nginx, Traefik, or a cloud controller) actually running in the cluster to watch Ingress resources and configure the proxy. If you create an Ingress and traffic does not route, first verify a controller is installed and healthy, then that the backend Service and its endpoints exist and that DNS points at the controller's external address. Many newer clusters use the Gateway API instead, which generalizes the same model.

How does cluster DNS resolve Service names across namespaces?

CoreDNS provides in-cluster DNS, giving every Service a record at service.namespace.svc.cluster.local. Within the same namespace you can call a Service by its short name, but to reach a Service in another namespace you must include the namespace, for example payments.prod.svc.cluster.local or at minimum payments.prod. A frequent bug is an app using a bare Service name and failing to reach a Service that lives in a different namespace. For StatefulSets, a headless Service gives each pod a stable DNS name like pod-0.service.namespace.svc.cluster.local, which is how clustered databases find their peers.
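
The naming scheme is mechanical enough to express as string builders. The sketch below assumes the default cluster domain cluster.local; the service, pod, and namespace names are examples only.

```python
def service_fqdn(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Fully qualified DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def pod_fqdn(pod: str, headless_service: str, namespace: str,
             cluster_domain: str = "cluster.local") -> str:
    """Stable per-pod record for a StatefulSet pod behind a headless Service."""
    return f"{pod}.{service_fqdn(headless_service, namespace, cluster_domain)}"


def shortest_name(service: str, service_ns: str, caller_ns: str) -> str:
    """Shortest name that resolves through the pod's default DNS search path."""
    return service if service_ns == caller_ns else f"{service}.{service_ns}"


print(service_fqdn("payments", "prod"))           # payments.prod.svc.cluster.local
print(pod_fqdn("db-0", "db", "prod"))             # db-0.db.prod.svc.cluster.local
print(shortest_name("payments", "prod", "prod"))  # payments
print(shortest_name("payments", "prod", "web"))   # payments.prod
```

The last line is the bug described above in miniature: a caller in another namespace that uses the bare name payments will not resolve the Service in prod.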

How would you design multi-tenant isolation so one team's workloads cannot affect another's?

I would isolate tenants by namespace and layer several controls, because a namespace alone is not a security boundary. RBAC scoped per namespace ensures each tenant's ServiceAccounts can only act within their own namespace; ResourceQuota and LimitRange cap CPU, memory, and object counts so a noisy tenant cannot starve the cluster; and NetworkPolicies are required to actually block cross-namespace traffic, which is allowed by default. For stronger isolation I would add Pod Security Standards, separate node pools via taints, and for hard guarantees consider per-tenant virtual clusters, since cluster-scoped resources like CRDs and ClusterRoles are shared and can collide across tenants.

Practice this for real, from your target job

Reading about it only gets you so far. Paste a job description into prepme.io and get hands-on k3s/Terraform labs, auto-graded exams, and an architecture round, all generated for that exact role and scored 0 to 100. Generating a briefing is free.

FAQ

How long does it take to prepare for a Kubernetes interview?

If you already run k8s daily, a focused week of review plus hands-on debugging reps is usually enough. Coming in cold, plan three to six weeks: learn the core objects, then spend most of your time breaking and fixing a sandbox cluster, because interviewers test whether you can diagnose live problems, not just recite definitions.

Should I get the CKA or CKAD certification before interviewing?

A certification signals baseline competence, and the CKA in particular forces hands-on kubectl fluency that transfers directly to interviews. It is not required and will not substitute for being able to reason about trade-offs and debug under pressure. Treat the cert's hands-on prep as the real value and the credential as a bonus.

Do Kubernetes interviews include live, hands-on troubleshooting?

Increasingly yes, especially for DevOps, platform, and SRE roles. You may be given a broken cluster or a failing manifest and asked to find and fix the problem while explaining your reasoning. Practicing the describe, events, and logs --previous loop in a real cluster is the single highest-value preparation.

What is the most common reason candidates fail Kubernetes interviews?

Stopping at the symptom. Saying a pod is in CrashLoopBackOff or Pending without a methodical next step signals you have read about Kubernetes but not operated it. Strong candidates name the exact command they would run next and explain what each line of output would tell them.

How do I prepare for the specific Kubernetes setup a company uses?

Tailor your prep to the job description rather than studying generic question lists. On prepme.io you paste the JD and get a free briefing with hands-on labs, exams, and a design round matched to that role's stack, plus a free Company Coverage card that surfaces likely interview intel about the employer.

Are managed Kubernetes services like EKS, GKE, and AKS tested differently?

The core concepts are identical, but managed-service roles add questions about cloud integration: IAM and IRSA or EKS Pod Identity for pod identity on AWS, cloud load balancers and Ingress controllers, managed node groups and autoscaling, and the provider's storage classes. Know which responsibilities the managed control plane covers and which remain yours.