prepme.io

Kubernetes Troubleshooting Interview Questions

If you are interviewing for a DevOps, SRE, or platform engineering role, the Kubernetes round is rarely about reciting definitions. It is fix-it-live: the interviewer hands you a broken cluster (or describes a failing workload) and watches how you reason from symptom to root cause. This guide is for mid-to-senior engineers who already know the basics and need to sharpen the diagnostic muscle that actually gets graded — the order you run commands in, the signals you read, and the trade-offs you name out loud.

You will get a workflow for the failure modes interviewers reach for most — CrashLoopBackOff, ImagePullBackOff, Pending pods, failing probes, OOMKilled, and RBAC denials — plus 14 real interview questions with strong model answers and a short FAQ on how to prepare. Treat the commands here as muscle memory: in a live round, the candidate who runs kubectl describe and reads the Events before guessing almost always wins.

Why senior Kubernetes interviews are fix-it-live, not recall

At junior levels, an interviewer might ask you to define a Deployment or list the parts of a Pod. At mid-senior level that stops earning points. What separates a strong candidate is a repeatable diagnostic loop: observe the symptom, gather evidence with describe and logs and events, form a hypothesis, and confirm it before you change anything. Interviewers are listening for whether you narrow the search space methodically or thrash by editing manifests at random.

The single most important habit to demonstrate is reading the cluster's own telling of the story before touching it. kubectl get pods shows you the headline status; kubectl describe pod shows the Events at the bottom — the scheduler, kubelet, and controllers all narrate their decisions there. kubectl logs (and logs --previous for a crashed container) shows the application's own voice. Almost every Kubernetes failure is explained by one of those three sources, and a candidate who reaches for them in order looks like someone who has actually run production.

Say your reasoning out loud and name trade-offs. "I'd bump the memory limit, but first I want to know whether this is a real leak or just an under-provisioned limit, because raising it blindly can hide a leak until it takes down a node" is the kind of sentence that moves an interview from a pass to a strong hire.

  • The diagnostic order that wins: get -> describe (read Events) -> logs (and --previous) -> events --sort-by, then form a hypothesis.
  • Confirm before you mutate: a fix applied to the wrong root cause wastes the round and signals guessing.
  • Verbalize trade-offs: every remediation has a cost (masking a leak, loosening a quota, weakening isolation) — naming it shows seniority.

CrashLoopBackOff and ImagePullBackOff: the logs and describe workflow

CrashLoopBackOff means the container starts, exits, and the kubelet keeps restarting it with exponential backoff (10s, 20s, 40s, doubling up to a 5-minute cap). The status itself is not the bug — it tells you the process is dying. Your first move is kubectl logs <pod> -c <container>, and if the current attempt is mid-backoff and empty, kubectl logs <pod> --previous to read the last crashed instance's output. Then kubectl describe pod to see the Last State, the exit code, and the reason. Exit code 1 is a generic application error (read the logs); 137 is 128+9 (SIGKILL), usually an OOMKill against the memory limit or a SIGKILL after a liveness-probe-triggered termination; 139 is 128+11 (SIGSEGV, a segfault); and 143 is 128+15 (SIGTERM). A container that exits 0 immediately is often a missing long-running command or a misconfigured entrypoint.
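To make the backoff schedule above concrete, here is a minimal sketch. The function name `backoff_delays` is hypothetical and not part of kubectl or the kubelet. It assumes the 10-second base and 5-minute cap quoted in this section, and it ignores any reset of the timer after a container has run cleanly for a while.

```python
def backoff_delays(restarts, base=10, cap=300):
    """Return the wait in seconds before each of the first `restarts` restarts.

    Assumption: a simple doubling schedule with a hard cap, as described in
    the text. This models the idea, not the kubelet's actual implementation.
    """
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))  # never wait longer than the cap
        delay *= 2                      # double for the next attempt
    return delays


print(backoff_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

The practical takeaway for a live round: after five or six restarts the pod only retries every five minutes, so waiting for the next crash is slow and logs --previous is the faster route to the error.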

Common root causes a candidate should rattle off: a bad config or missing environment variable, a ConfigMap or Secret that mounts fine but holds the wrong values, a dependency that is not reachable yet at startup (a database that is not up), a failing migration on boot, an application bug, or a liveness probe that is too aggressive and kills a slow-starting app before it is ready. The fix depends entirely on which one, which is why you read logs first rather than editing the Deployment. Note that a ConfigMap or Secret that is missing altogether does not produce CrashLoopBackOff: a missing key or object referenced from env usually surfaces as CreateContainerConfigError, and a missing volume source leaves the pod in ContainerCreating with FailedMount Events, because the kubelet cannot even build the container. The distinction itself is a good thing to name.

ImagePullBackOff (and its transient sibling ErrImagePull) is a different class entirely: the container never started because the kubelet could not pull the image. The Events in describe pod tell you exactly which: a typo in the image name or tag, a tag that does not exist, a private registry with no imagePullSecret (manifests as 'pull access denied' or 'unauthorized'), rate limiting (Docker Hub 429 'toomanyrequests'), or no network route to the registry. The fix is almost never in the application — it is the image reference, the pull secret, or registry connectivity.

  • kubectl logs <pod> --previous reads the last crashed container — the single most missed command in interviews.
  • Exit codes: 0 = clean exit (often a missing long-running process), 1 = app error, 137 = 128+9 SIGKILL (OOM or post-liveness kill), 139 = 128+11 SIGSEGV, 143 = 128+15 SIGTERM.
  • A missing ConfigMap key or Secret shows as CreateContainerConfigError, not CrashLoopBackOff — know the difference.
  • ImagePullBackOff is a registry/auth/typo problem, not an app problem — read the Events line, fix the image ref or imagePullSecret.
  • A too-aggressive livenessProbe can manufacture a CrashLoopBackOff on a slow-booting app; check probe timing before blaming the code.
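The exit-code arithmetic in the bullets above (128 plus the signal number) can be sketched as a small decoder. The function name and the wording of the returned hints are illustrative assumptions. Only the three signals named in this guide are mapped.

```python
# Signal numbers named in this guide (Linux numbering).
SIGNALS = {9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code):
    """Turn a container exit code into a first hypothesis."""
    if code == 0:
        return "clean exit: look for a missing long-running process"
    if code > 128:
        signum = code - 128  # exit codes above 128 encode 128 + signal number
        name = SIGNALS.get(signum, f"signal {signum}")
        return f"killed by {name}"
    return "application error: read the logs"


for code in (0, 1, 137, 139, 143):
    print(code, "->", explain_exit_code(code))
```

A decoded signal is only a starting hypothesis. For 137 in particular, the Last State reason in describe pod still has to confirm whether it was an OOMKill or a kill after a failed liveness probe.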

Pending pods: scheduling, resource quotas, and node pressure

A Pod stuck in Pending has not been scheduled to a node, so logs are useless (there is no running container). The answer is almost always in kubectl describe pod, specifically the FailedScheduling Events from the default-scheduler, which state why no node fit. The classic line is '0/N nodes are available: Insufficient cpu' or 'Insufficient memory': on every node, the new pod's requests (not limits, not actual usage) exceed what remains of the node's allocatable after subtracting the requests of the pods already placed there. Remember the scheduler bin-packs on requests; a node can be 20% utilized in reality and still reject a pod whose requests do not fit the remaining headroom.
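The requests-versus-usage point above is easier to see with numbers. This is a deliberately simplified model of the fit check, not the scheduler's real code. The dictionary keys (`cpu_m` for millicores, `memory_mi` for MiB) and the function name are assumptions made for the example.

```python
def fits(node_allocatable, scheduled_requests, pod_request):
    """True if pod_request fits in allocatable minus already-reserved requests.

    Actual CPU or memory usage never enters the calculation, which is the
    whole point: only declared requests count.
    """
    for resource, wanted in pod_request.items():
        reserved = sum(r.get(resource, 0) for r in scheduled_requests)
        if reserved + wanted > node_allocatable.get(resource, 0):
            return False
    return True


node = {"cpu_m": 4000, "memory_mi": 8192}
running = [
    {"cpu_m": 1500, "memory_mi": 2048},
    {"cpu_m": 2000, "memory_mi": 2048},
]

print(fits(node, running, {"cpu_m": 1000, "memory_mi": 512}))  # False
print(fits(node, running, {"cpu_m": 500, "memory_mi": 512}))   # True
```

In the first call only 500m of CPU remains unreserved, so the pod is rejected even if the two running pods are idle.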

Beyond raw capacity, the scheduler reports other unsatisfiable constraints in the same Events block: 'node(s) had untolerated taint' (the pod lacks a matching toleration, common with dedicated or control-plane nodes), 'node(s) didn't match Pod's node affinity/selector' (a nodeSelector or nodeAffinity nothing satisfies), 'node(s) had volume node affinity conflict' (a PVC bound to a zone with no schedulable node), or unsatisfiable pod topology spread / anti-affinity (the single-node case where required hostname anti-affinity blocks a second replica forever). A separate Pending cause is an unbound PVC — the pod waits because its PersistentVolumeClaim is still Pending, which itself is often a missing StorageClass or no provisioner.

On a multi-tenant or quota'd cluster, a workload that never starts can also be a ResourceQuota or LimitRange problem rather than node capacity. If a namespace ResourceQuota is exhausted, pod creation is rejected outright (visible as a FailedCreate Event on the ReplicaSet, with the replicas never appearing as Pods). A LimitRange bites in two ways: a pod whose requests or limits fall outside its min/max is rejected at admission just like a quota violation, and a pod that sets no requests inherits the LimitRange defaults, which can be large enough that no node fits and the pod sits Pending. This is exactly the kind of platform-side gotcha senior interviews probe, and it is why prepme's hands-on labs run on a real multi-tenant k3s cluster where each candidate gets a namespace with its own ResourceQuota, so a 'resource-quota-exceeded' scenario behaves like production rather than a toy.

  • Pending = unscheduled, so describe pod (FailedScheduling Events), not logs, holds the answer.
  • Scheduling is bin-packing on requests, not usage — a lightly-loaded node can still be 'Insufficient cpu/memory'.
  • Read the exact scheduler message: Insufficient cpu/memory, untolerated taint, node affinity/selector mismatch, volume node affinity conflict, or topology spread / anti-affinity failure.
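  • Namespace ResourceQuota exhaustion surfaces as FailedCreate on the ReplicaSet (no Pod ever appears); LimitRange min/max violations are rejected the same way, while oversized LimitRange defaults can leave a Pod unschedulable. In each case the root cause is namespace policy, not a shortage of nodes.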
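The ResourceQuota case in the last bullet comes down to an admission-time comparison of used plus requested against the hard limit. The sketch below illustrates that comparison under simplifying assumptions: quantities are plain integers (millicores, MiB, pod count), and the function name is hypothetical.

```python
def quota_violations(hard, used, requested):
    """Return the quota entries a new pod would push over the hard limit."""
    violations = []
    for name, amount in requested.items():
        if name in hard and used.get(name, 0) + amount > hard[name]:
            violations.append(name)
    return sorted(violations)


hard = {"requests.cpu": 2000, "requests.memory": 4096, "pods": 10}
used = {"requests.cpu": 1800, "requests.memory": 1024, "pods": 4}
new_pod = {"requests.cpu": 500, "requests.memory": 256, "pods": 1}

print(quota_violations(hard, used, new_pod))  # ['requests.cpu']
```

Node capacity appears nowhere in the check, which is why the cluster can look half empty while the ReplicaSet keeps reporting FailedCreate.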

Failing readiness and liveness probes and rollout stalls

Probes are where many real outages and many interview scenarios hide, because the three probe types do very different things and confusing them is a classic mistake. A readiness probe failing removes the pod from Service endpoints — the container keeps running but receives no traffic; this is why a Deployment can be 'Running' yet the app returns no responses, and why a rollout stalls with new pods never going Ready. A liveness probe failing restarts the container — too aggressive a liveness probe (short initialDelaySeconds, tight timeoutSeconds) will kill a healthy-but-slow app and produce a CrashLoopBackOff. A startup probe gates the other two during boot, which is the correct fix for slow-starting apps rather than inflating liveness delays.
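To see why a startup probe suits slow boots, it helps to compute how long a probe gives the app before the kubelet gives up. The sketch below uses the standard probe fields (failureThreshold, periodSeconds, initialDelaySeconds) but is only an approximation: it ignores timeoutSeconds and scheduling jitter, and the function names are hypothetical.

```python
def startup_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate time an app has to start before the probe gives up."""
    return initial_delay_seconds + failure_threshold * period_seconds


def boot_fits(boot_seconds, failure_threshold, period_seconds, initial_delay_seconds=0):
    """True if an app needing `boot_seconds` to start fits inside the budget."""
    budget = startup_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds)
    return boot_seconds <= budget


# A probe with failureThreshold=3 and periodSeconds=10 tolerates roughly 30s.
print(boot_fits(120, failure_threshold=3, period_seconds=10))   # False
# A startup probe with failureThreshold=30 and periodSeconds=10 allows about 300s.
print(boot_fits(120, failure_threshold=30, period_seconds=10))  # True
```

The design point is that the generous budget applies only during boot. Once the startup probe succeeds, liveness can stay tight and still catch a hung process quickly.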

To debug, kubectl describe pod surfaces 'Readiness probe failed:' / 'Liveness probe failed:' Events with the underlying error (HTTP status, connection refused, timeout). The frequent root causes: the probe path is wrong (the app serves /health but the probe hits /healthz — the exact bug shown on the prepme homepage terminal), the port is wrong, the timeout is shorter than the app's real response time under load, or the probe needs auth the kubelet does not send. Always confirm with kubectl get endpoints <service> (or kubectl get endpointslices) — empty endpoints behind a Service that should have backends almost always means readiness is failing.

Rollout stalls deserve their own attention because RollingUpdate behavior is governed by maxUnavailable and maxSurge plus readiness. If new pods never pass readiness, the rollout is stuck by design: beyond the few replicas maxUnavailable allows, it will not tear down old pods while new ones are unready, which is the deployment strategy protecting you. kubectl rollout status deployment/<name> hangs (until the Deployment's progress deadline expires), kubectl get pods shows new-revision pods at 0/1 Ready, and kubectl rollout undo deployment/<name> is the immediate mitigation while you fix the probe or the new image. A senior answer mentions both: undo to restore service, then root-cause the readiness failure offline.
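The effect of maxSurge and maxUnavailable is easiest to reason about as two bounds on pod counts. The sketch below assumes the documented rounding (percentages round up for maxSurge and down for maxUnavailable) and the default of 25% for both. The function names are hypothetical.

```python
def resolve(value, replicas, round_up):
    """Resolve an absolute number or a percentage string against replicas."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = int(value[:-1]) * replicas
        # Integer math avoids floating-point surprises when rounding.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return the pod-count envelope a RollingUpdate must stay inside."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(10))
# {'max_total_pods': 13, 'min_available_pods': 8}
```

With 10 replicas the rollout may start up to 3 extra pods and take down at most 2 old ones. If none of the new pods turn Ready, it stops there, which is the stall described above.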

  • readiness = traffic gating (pulled from Service endpoints), liveness = restart, startup = boot gate — never conflate them.
  • An empty kubectl get endpoints <service> is the fingerprint of a failing readiness probe (or a selector mismatch).
  • Slow-start apps need a startupProbe, not a bloated livenessProbe initialDelaySeconds.
  • Stuck rollout: new pods at 0/1 Ready -> the strategy is correctly holding; kubectl rollout undo to mitigate, then fix the probe/image.

OOMKilled, resource limits, and RBAC / permission errors

OOMKilled appears in kubectl describe pod under Last State with reason OOMKilled and exit code 137. It means the container exceeded its memory limit and the kernel cgroup OOM killer terminated it. This is per-container against the limit, distinct from node-level memory pressure that evicts whole pods. The trap is to immediately raise the limit. First decide: is the limit genuinely too low for the workload, or is the app leaking? Raising the limit on a leak just delays the crash and risks the node. Use kubectl top pod (needs metrics-server), the trend in restart counts, and the app's own metrics to tell them apart. Also know the requests-vs-limits trade-off: requests drive scheduling, and requests together with limits set the QoS class (a pod whose every container has requests == limits for both cpu and memory is Guaranteed and last to be evicted; a pod that sets some requests or limits but falls short of that is Burstable; a pod that sets none is BestEffort and first to die under node pressure).
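The QoS rules above can be written as a short classifier. This sketch assumes each container is a plain dictionary with optional `requests` and `limits` maps, and it compares quantity strings literally (so "500m" and "0.5" would not be treated as equal). The function name is hypothetical.

```python
def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable, or BestEffort."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An unset request defaults to the limit, so a limit alone is enough.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


full = {"requests": {"cpu": "500m", "memory": "256Mi"},
        "limits": {"cpu": "500m", "memory": "256Mi"}}
memory_only = {"requests": {"memory": "256Mi"}, "limits": {"memory": "256Mi"}}

print(qos_class([full]))         # Guaranteed
print(qos_class([memory_only]))  # Burstable
print(qos_class([{}]))           # BestEffort
```

The second case is the one worth naming in an interview: matching memory requests and limits alone does not make a pod Guaranteed, because cpu must match too, in every container.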

Node pressure is the related failure: when a node runs low on memory or disk (ephemeral-storage), the kubelet evicts pods by QoS class, and you will see pods with status Evicted and a message about the resource. kubectl describe node shows conditions like MemoryPressure or DiskPressure. The remediation is a mix of right-sizing requests/limits, adding capacity, and ensuring noisy-neighbor workloads have limits so one bad pod cannot starve a node.

RBAC and permission errors are a category interviewers love because they test whether you understand the security model. The signature is a 'Forbidden' error: 'User "system:serviceaccount:ns:sa" cannot get resource "pods" in API group "" in the namespace "x"'. Read it literally — it names the subject, the verb, the resource, and the scope. The fix is a Role (namespaced) or ClusterRole (cluster-wide) plus a RoleBinding/ClusterRoleBinding tying it to the ServiceAccount. kubectl auth can-i get pods --as=system:serviceaccount:ns:sa -n ns lets you test impersonated permissions directly. One gotcha worth naming: with a wildcard namespaced Role, kubectl auth can-i can return a false-positive 'yes' for cluster-scoped resources because it defaults the review to the local namespace — trust the real API call over can-i. A subtler permission failure is a pod whose ServiceAccount lacks rights to do what the app needs (read a Secret, list endpoints); also distinguish RBAC denials from admission-controller rejections (PodSecurity, OPA/Gatekeeper, Kyverno), which also surface as creation errors but for policy, not identity.
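Reading the Forbidden message literally can even be done mechanically. The sketch below pulls the subject, verb, resource, API group, and namespace out of a message shaped like the one quoted above. The regular expression is an assumption fitted to that example, and real messages may differ between versions, so treat it as an illustration rather than a parser to depend on.

```python
import re

# Pattern fitted to the example message in this section (an assumption).
FORBIDDEN = re.compile(
    r'User "(?P<subject>[^"]+)" cannot (?P<verb>\S+) resource "(?P<resource>[^"]+)"'
    r' in API group "(?P<group>[^"]*)"'
    r'(?: in the namespace "(?P<namespace>[^"]+)")?'
)


def parse_forbidden(message):
    """Return the RBAC fields named in a Forbidden message, or None."""
    match = FORBIDDEN.search(message)
    return match.groupdict() if match else None


msg = ('User "system:serviceaccount:ns:sa" cannot get resource "pods" '
       'in API group "" in the namespace "x"')
print(parse_forbidden(msg))
```

The extracted fields map one-to-one onto what the fix needs: the verb and resource go into the Role's rules, the namespace decides between a Role and a ClusterRole, and the subject goes into the binding.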

  • OOMKilled (exit 137, per-container limit) != node MemoryPressure eviction (whole pod, status Evicted, QoS-ordered) — name which one you're seeing.
  • Don't reflexively raise the memory limit: separate 'under-provisioned' from 'leaking' with kubectl top, restart trend, and app metrics.
  • QoS classes from requests/limits: Guaranteed (requests==limits, cpu and memory) > Burstable (<) > BestEffort (none) determines eviction order under node pressure.
  • RBAC 'Forbidden' errors are self-documenting: subject + verb + resource + scope -> add a Role/ClusterRole + binding; verify with kubectl auth can-i --as.
  • Distinguish RBAC denials from admission-controller policy rejections (PodSecurity, Gatekeeper, Kyverno) — same surface, different fix.

How to prepare for a live Kubernetes troubleshooting round

Reading about CrashLoopBackOff is not the same as fixing one under a stranger's gaze. The way to prepare is to break clusters and fix them until the diagnostic loop is automatic — get, describe, logs --previous, events — so you are not spending working memory on syntax during the interview. Build a personal runbook of the top failure modes (the six in this guide cover the vast majority), each with the one or two commands that confirm it and the typical fixes, then practice until you can narrate it.

This is exactly what prepme is built for. Paste the actual job description you are interviewing against and prepme generates a free briefing with real k3s/Kubernetes and Terraform labs tailored to that role's stack — live containers you debug in the browser, including scenarios like a broken probe path, an exhausted ResourceQuota, an OOMKilled workload, a missing RBAC ServiceAccount, an ImagePullBackOff typo, and a Service selector mismatch. Each lab is graded 0-100 with specific feedback, the auto-graded ABCD exams (easy/medium/hard, 30 questions) drill the conceptual recall, and the architecture design round (currently in beta) covers the whiteboard side. Generating and viewing the briefing — including free Company Coverage web research on the employer — costs nothing; hands-on practice is $20/month, cancel anytime. Try the demo briefing to see a full set of labs before you commit.

  • Internalize one diagnostic loop (get -> describe/Events -> logs --previous -> events) until it is reflexive.
  • Keep a one-page runbook: per failure mode, the confirming command and the typical fixes.
  • Practice on a real cluster, not slides — break and fix CrashLoopBackOff, Pending, probe failures, OOMKilled, and RBAC denials.
  • Rehearse narrating trade-offs out loud; interviewers grade reasoning, not just the final kubectl command.
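One way to keep the one-page runbook honest is to write it as data. The sketch below is a possible shape, not a prescribed format. The symptom keys and the placeholder style are assumptions, and the commands are the ones this guide already uses.

```python
# Per failure mode: the commands that confirm it, in the order to run them.
RUNBOOK = {
    "CrashLoopBackOff": ["kubectl logs <pod> --previous", "kubectl describe pod <pod>"],
    "ImagePullBackOff": ["kubectl describe pod <pod>"],
    "Pending": ["kubectl describe pod <pod>"],
    "ProbeFailure": ["kubectl describe pod <pod>", "kubectl get endpoints <service>"],
    "OOMKilled": ["kubectl describe pod <pod>", "kubectl top pod"],
    "Forbidden": ["kubectl auth can-i <verb> <resource> --as=<subject> -n <namespace>"],
}


def first_command(symptom):
    """Return the first confirming command, falling back to describe pod."""
    return RUNBOOK.get(symptom, ["kubectl describe pod <pod>"])[0]


print(first_command("CrashLoopBackOff"))  # kubectl logs <pod> --previous
print(first_command("SomethingUnknown"))  # kubectl describe pod <pod>
```

The fallback mirrors the habit this guide recommends: when the symptom is unfamiliar, describe the pod and read the Events before doing anything else.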

Common interview questions & answers

A pod is in CrashLoopBackOff. Walk me through how you debug it.

First I confirm the symptom with kubectl get pods, then kubectl describe pod to read the Last State, exit code, and Events. The key command is kubectl logs <pod> --previous, because a pod mid-backoff often has an empty current log and the real error is in the last crashed instance. I map the exit code (1 = app error, 137 = OOM/SIGKILL, 139 = segfault) to a hypothesis, and I check whether a too-aggressive liveness probe is killing a slow-starting app rather than a genuine code crash. I fix the root cause (bad config, a wrong value in a ConfigMap or Secret, failed migration, unreachable dependency, or probe timing) rather than just restarting and hoping. I'd also note that a missing ConfigMap key or Secret typically shows as CreateContainerConfigError, not CrashLoopBackOff, so the status itself already narrows the search.

What is the difference between CrashLoopBackOff and ImagePullBackOff?

CrashLoopBackOff means the container did start but the process keeps exiting, so the kubelet restarts it with exponential backoff — the bug is in the application, config, or probe, and you read it with kubectl logs (use --previous for the crashed instance). ImagePullBackOff means the container never started because the kubelet could not pull the image, so logs are empty and the answer is in describe pod's Events. ImagePullBackOff causes are a wrong image name or tag, a private registry with no imagePullSecret (pull access denied / unauthorized), registry rate limiting (Docker Hub 429 toomanyrequests), or no network route to the registry. They are completely different classes: one is a runtime problem, the other is a registry/auth/typo problem.

A pod has been Pending for five minutes. Where do you look and what are the likely causes?

Logs are useless because nothing is running, so I go straight to kubectl describe pod and read the default-scheduler FailedScheduling Events. The most common message is '0/N nodes are available: Insufficient cpu/memory', which means the pod's requests do not fit in the free allocatable headroom of any node, remembering that the scheduler bin-packs on requests, not actual usage. Other scheduler messages point elsewhere: untolerated taint, node affinity or nodeSelector mismatch, volume node affinity conflict, or topology spread / anti-affinity failure. I also check whether it is a namespace ResourceQuota exhaustion (often a FailedCreate on the ReplicaSet, where no Pod ever appears), a LimitRange rejecting the pod at admission or injecting default requests too large to schedule, or an unbound PVC waiting on a StorageClass. Those are namespace-policy or storage causes rather than a plain shortage of nodes.

A Deployment shows all pods Running but the app returns nothing. What is happening?

Running means the container process is up, but readiness is what gates traffic, so a failing readiness probe keeps pods Running while pulling them out of the Service endpoints. I confirm with kubectl get endpoints <service> — if it is empty, either readiness is failing or the Service selector does not match the pod labels. Then kubectl describe pod shows the 'Readiness probe failed' Event with the underlying error, usually a wrong path (app serves /health, probe hits /healthz), wrong port, or a timeout too short for the app under load. If the pods are Ready but endpoints are still empty, I compare the Service spec.selector against the pod labels. The fix is correcting the probe or the selector, not the application code.

You see exit code 137 and reason OOMKilled. What does it mean and how do you respond?

Exit 137 is 128+9, i.e. the process received SIGKILL; with reason OOMKilled it means the container exceeded its memory limit and the kernel cgroup OOM killer terminated it. That is a per-container event against the limit, distinct from node-level memory pressure that evicts whole pods. My first decision is not to blindly raise the limit but to determine whether the limit is genuinely too low or the app is leaking, using kubectl top pod (metrics-server), the restart trend, and app metrics. If it is a leak, raising the limit only delays the crash and risks the node. I also consider the QoS implications: a pod is Guaranteed, and so last to be evicted under node pressure, only when every container sets requests equal to limits for both cpu and memory; matching memory alone leaves it Burstable.

Explain the difference between liveness, readiness, and startup probes.

A readiness probe controls traffic — when it fails, the pod is removed from Service endpoints but keeps running, so traffic is withheld until it recovers. A liveness probe controls restarts — when it fails, the kubelet kills and restarts the container, which means a too-aggressive liveness probe can manufacture a CrashLoopBackOff on a healthy but slow app. A startup probe gates both liveness and readiness during boot, and is the correct mechanism for slow-starting applications instead of inflating liveness initialDelaySeconds. Conflating these three is one of the most common Kubernetes mistakes and a frequent source of self-inflicted outages.

A rolling update is stuck — new pods never become Ready. What do you do?

kubectl rollout status hangs and kubectl get pods shows the new-revision pods at 0/1 Ready, which means the rollout is correctly refusing to tear down more healthy old pods than maxUnavailable permits while new ones are unready. The deployment strategy is protecting service. The immediate mitigation is kubectl rollout undo deployment/<name> to restore the previous revision, then I root-cause offline. The usual cause is a failing readiness probe in the new version, a bad new image (ImagePullBackOff or a crash on boot), or a missing config the new revision needs. I also check maxUnavailable and maxSurge, since they govern how the update progresses.

You get 'Error from server (Forbidden): pods is forbidden: User cannot list resource pods'. How do you fix it?

The error is self-documenting RBAC: it names the subject (often a ServiceAccount), the verb (list), the resource (pods), and the namespace scope. The fix is to grant exactly that — a Role for namespaced access or a ClusterRole for cluster-wide, bound to the subject with a RoleBinding or ClusterRoleBinding. I verify with kubectl auth can-i list pods --as=system:serviceaccount:ns:sa -n ns. One gotcha: with a wildcard namespaced Role, auth can-i can falsely report 'yes' for cluster-scoped resources because it defaults the review to the local namespace, so I trust the real API call over can-i. I also distinguish a true RBAC denial from an admission-controller rejection like PodSecurity, Gatekeeper, or Kyverno, which look similar but are policy, not identity.

How does the scheduler decide a node has 'Insufficient cpu' even though the node looks idle?

The scheduler bin-packs on resource requests, not on real-time utilization. It sums the requests of all pods already assigned to a node and only schedules a new pod if its requests fit in the remaining allocatable headroom. So a node running at 15% actual CPU can still reject a pod, because earlier pods reserved capacity via their requests even though they are not using it. The remediation is right-sizing requests to reflect real needs, or adding capacity — and this is exactly why over-requesting wastes cluster money while under-requesting causes noisy-neighbor problems and risky overcommit.

What is the difference between an OOMKilled container and an Evicted pod?

OOMKilled is per-container: the container exceeded its own memory limit and the cgroup OOM killer terminated just that container, shown as reason OOMKilled with exit 137 in Last State. Eviction is node-level: when a node hits MemoryPressure or DiskPressure, the kubelet proactively evicts whole pods to reclaim resources, and you see pod status Evicted with a message about the resource. Eviction order follows QoS class — BestEffort pods go first, then Burstable, with Guaranteed pods last. kubectl describe node reveals the pressure conditions, and the fix combines right-sizing requests, setting limits so a noisy pod cannot starve the node, and adding capacity.

How do you debug a Service that has no endpoints?

An empty kubectl get endpoints <service> means no Ready pods are currently backing the Service, which has two common causes. First, the Service selector does not match any pod labels — I compare the Service spec.selector against the pods' labels exactly, since a single mismatched label key breaks it silently. Second, the matching pods exist but are not Ready, so a failing readiness probe is pulling them out of endpoints; I check pod readiness and the describe pod probe output. I also verify the Service targetPort matches the container's actual containerPort. Endpoints (or EndpointSlices) is the fast diagnostic that disambiguates a selector bug from a readiness bug.
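The selector-versus-readiness split above can be sketched in a few lines. The pod representation (a dictionary with `labels` and `ready`) and the function names are assumptions for illustration. The matching rule itself is the standard one: every key/value in the selector must be present in the pod's labels.

```python
def selector_matches(selector, pod_labels):
    """True if every selector key/value pair appears in the pod's labels."""
    return bool(selector) and all(pod_labels.get(k) == v for k, v in selector.items())


def diagnose_endpoints(selector, pods):
    """Say why a Service would have no endpoints, given its selector and pods."""
    matching = [p for p in pods if selector_matches(selector, p["labels"])]
    if not matching:
        return "selector mismatch"
    if not any(p["ready"] for p in matching):
        return "readiness failing"
    return "ok"


pods = [{"labels": {"app": "web", "tier": "frontend"}, "ready": False}]

print(diagnose_endpoints({"app": "web"}, pods))      # readiness failing
print(diagnose_endpoints({"app": "website"}, pods))  # selector mismatch
```

Extra labels on the pod are harmless. Only a selector key that is absent or carries a different value breaks the match.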

A pod cannot reach another service in the cluster. How do you isolate whether it is DNS, network policy, or the service itself?

I work outward layer by layer from inside the pod with kubectl exec. First DNS: nslookup <service>.<namespace>.svc.cluster.local — if it fails, the issue is CoreDNS or the service name/namespace (a cross-namespace lookup is a classic trap), not connectivity. Then the destination: kubectl get endpoints to confirm the target Service actually has Ready backends, ruling out a readiness or selector problem at the other end. Then connectivity: try the ClusterIP and port directly (wget or nc) — if DNS resolves and endpoints exist but the connection is refused or times out, I suspect a NetworkPolicy denying traffic, which I confirm by listing policies in both namespaces. Isolating DNS vs endpoints vs NetworkPolicy in that order keeps me from guessing.
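The layer-by-layer order above amounts to a small decision procedure. The sketch below encodes it. The three boolean inputs stand for the results of the manual checks (nslookup, kubectl get endpoints, and a direct connection attempt), and the function names and `cluster.local` default are assumptions.

```python
def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Build the fully qualified in-cluster DNS name for a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def isolate(dns_resolves, endpoints_ready, connection_ok):
    """Return the first layer that fails, checked in the order described."""
    if not dns_resolves:
        return "DNS: check CoreDNS and the service name/namespace"
    if not endpoints_ready:
        return "Service: no Ready backends, check readiness and the selector"
    if not connection_ok:
        return "NetworkPolicy: list policies in both namespaces"
    return "no fault found at these layers"


print(service_fqdn("api", "shop"))  # api.shop.svc.cluster.local
print(isolate(dns_resolves=True, endpoints_ready=True, connection_ok=False))
```

Using the fully qualified name in the DNS check removes the cross-namespace ambiguity, since a bare service name only resolves within the caller's own namespace.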

What does kubectl get events --sort-by=.lastTimestamp give you that describe does not?

kubectl describe pod scopes events to a single object, but a failure often spans several objects — the ReplicaSet's FailedCreate, the scheduler's FailedScheduling, the kubelet's probe failures — and a namespace-wide, time-sorted event stream lets me see the whole sequence and ordering. It is especially useful for intermittent or cascading issues where the triggering event happened on a different resource than the one I am staring at, and for catching FailedCreate from quota or admission rejections that never produce a pod to describe. The caveat is that events are retained for only about an hour by default (the kube-apiserver --event-ttl), so for older incidents I rely on logs, metrics, or an external event store.

When would you reach for kubectl debug or an ephemeral container instead of logs and describe?

When the container image is distroless or minimal and has no shell, or when the process is crashing too fast to exec into, or when I need tools (curl, dig, ps) that the application image does not ship. kubectl debug -it <pod> --image=busybox --target=<container> attaches an ephemeral debug container that shares the target container's process namespace (and the pod's network namespace), so I can inspect the live environment without rebuilding the image or adding debug tooling to production. I can also use kubectl debug node/<node> -it --image=busybox to start a pod on a specific node that runs in the host's namespaces with the node's filesystem mounted at /host, for host-level investigation; by default that pod is not privileged. It is the right escalation once describe, logs, and events have not been enough.


FAQ

How should I prepare for a live Kubernetes troubleshooting interview?

Practice the diagnostic loop until it is reflexive rather than memorizing facts. Break real clusters and fix them — CrashLoopBackOff, Pending pods, probe failures, OOMKilled, and RBAC denials — so that get, describe, logs --previous, and events come automatically under pressure. Keep a one-page runbook mapping each failure mode to its confirming command and typical fixes, and rehearse narrating your reasoning out loud, since interviewers grade how you think, not just the final command.

What kubectl commands should I have memorized for a troubleshooting round?

At minimum: kubectl get pods -o wide, kubectl describe pod (always read the Events), kubectl logs <pod> --previous and -c <container>, kubectl get events --sort-by=.lastTimestamp, kubectl get endpoints, kubectl top pod/node, kubectl auth can-i --as, kubectl rollout status/undo, and kubectl debug for shell-less images. Knowing which one to reach for first per symptom matters more than knowing every flag.

Do interviewers expect me to fix problems or just explain them?

For mid-to-senior roles, usually both — they want to see you reason from symptom to root cause and then apply the correct fix, often in a live or shared terminal. The graded signal is whether you confirm the root cause before mutating anything and whether you can name the trade-offs of your fix, such as not masking a memory leak by raising a limit. Explaining without being able to drive kubectl, or driving without a hypothesis, both read as weaker than a calm, evidence-led fix.

How is a Kubernetes troubleshooting interview different from a general Kubernetes interview?

A general Kubernetes interview spans architecture, objects, networking, and concepts and may be largely verbal. A troubleshooting round is narrower and more practical: you are handed a broken state and judged on diagnostic method, command fluency, and how quickly you isolate the layer at fault. It rewards production scar tissue — recognizing that an empty endpoints list means readiness or a selector bug, or that exit 137 means a SIGKILL (often OOM) — over textbook recall.

How can prepme.io help me practice Kubernetes troubleshooting specifically?

Paste the job description you are targeting and prepme generates a free briefing with real k3s/Kubernetes labs matched to that role's stack — live containers you debug in the browser, including broken-probe, exhausted-quota, OOMKilled, missing-RBAC, ImagePullBackOff, and Service-selector scenarios — each graded 0 to 100 with specific feedback. It also produces auto-graded ABCD exams (easy/medium/hard, 30 questions) for conceptual recall and an architecture design round. Generating and viewing the briefing is free; hands-on practice is $20/month, cancel anytime, and you can try the demo briefing first.
