SRE Interview Questions and Answers
This guide is for engineers preparing for a Site Reliability Engineer interview: people who already know Linux, Kubernetes, and at least one cloud platform, and now need to prove they can reason about reliability as an engineering discipline rather than a checklist of tools. SRE loops lean harder than generalist DevOps loops on quantified reliability, on-call judgment, and the trade-off between feature velocity and stability, so memorizing definitions is not enough. You have to defend numbers under pressure.
How an SRE interview differs from a generalist DevOps interview
A DevOps interview tends to test whether you can build and ship a delivery pipeline: containerize an app, write the Terraform, wire up CI/CD, get it running on Kubernetes. An SRE interview assumes you can already do that and instead probes whether you can keep it running at a defined level of reliability under real failure. The center of gravity shifts from 'how do I deploy this' to 'how do I measure that it is healthy, decide when to stop shipping, and recover fast when it breaks.'
Practically, that means three things dominate the loop. First, you must speak in measured reliability: SLIs, SLOs, and error budgets, with actual percentages and the math behind them. Second, you will get an on-call and incident-response round where the interviewer cares about your decision-making under uncertainty, not just your kubectl recall. Third, expect a systems-debugging scenario where they degrade a service and watch how you triage. Treating reliability as a quantity you trade against feature velocity is the single mental shift that separates SRE candidates from strong DevOps candidates.
A useful framing borrowed from Google's SRE practice: reliability is the most important feature, but 100 percent is the wrong target because it is effectively impossible to reach, costs disproportionately more for each added nine, and your users often cannot tell the difference between 100 percent and 99.99 percent through their own flaky networks and devices anyway. Interviewers reward candidates who can say where the right reliability target comes from (the product, the user journey, the cost of an extra nine) rather than reflexively maximizing uptime.
- DevOps round: can you build and ship the pipeline. SRE round: can you measure, defend, and recover the running system.
- Three SRE-specific rounds to expect: measured reliability (SLOs and error budgets), on-call and incident response, and live systems debugging.
- Never default to 100 percent uptime; the right reliability target is a business decision driven by the user journey and the cost of each extra nine.
Reliability fundamentals: SLIs, SLOs, error budgets, and toil
An SLI (Service Level Indicator) is a measured ratio of good events to valid events, expressed as a percentage. A good SLI is defined from the user's perspective at the boundary where they experience the service, for example the fraction of HTTP requests served in under 300ms at the load balancer, or the fraction of requests returning a non-5xx status. An SLO (Service Level Objective) is the target for that SLI over a window, for example 99.9 percent of requests succeed over a rolling 28 days. An SLA (Service Level Agreement) is the contractual version with financial or legal penalties, and it should always be looser than your internal SLO so you breach the SLO and react well before you ever breach the SLA.
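As a minimal sketch of the good-over-valid definition above, the snippet below computes an availability SLI from request counts and compares it to an SLO. The function names and the sample counts are illustrative assumptions, not taken from any particular monitoring tool.

```python
def sli(good_events: int, valid_events: int) -> float:
    """Return the SLI as a fraction in [0, 1]: good events / valid events."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive; define the denominator first")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events / valid_events


def meets_slo(good_events: int, valid_events: int, slo: float) -> bool:
    """True when the measured SLI is at or above the SLO target (e.g. 0.999)."""
    return sli(good_events, valid_events) >= slo


# Hypothetical 28-day window: 10,000,000 valid requests, 7,500 of them failed.
valid = 10_000_000
good = valid - 7_500
print(f"SLI = {sli(good, valid):.4%}")
print("meets 99.9% SLO:", meets_slo(good, valid, 0.999))
```

The explicit check on the denominator mirrors the point that an SLO is only meaningful once "valid event" has been written down.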
The error budget is the most important derived concept: it is simply 100 percent minus the SLO. A 99.9 percent availability SLO leaves a 0.1 percent budget, which over a 30-day month is about 43 minutes of allowable downtime (and therefore roughly 2.2 hours per 90-day quarter). That budget is a currency. When the budget is healthy you can ship aggressively and take risks; when it is exhausted you freeze risky launches and spend the next cycle on reliability work. This turns the classic dev-versus-ops tension into a shared, data-driven policy instead of an argument.
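The arithmetic behind those downtime figures is short enough to derive on a whiteboard. The sketch below assumes an availability SLO where every minute of the window counts equally; the function name is hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed full-outage minutes in the window for an availability SLO (e.g. 0.999)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


for target in (0.999, 0.9995, 0.9999):
    print(f"{target:.2%}: {error_budget_minutes(target):.1f} min per 30 days")

# The same 99.9% target over a 90-day quarter, expressed in hours.
print(f"{error_budget_minutes(0.999, window_days=90) / 60:.2f} h per quarter")
```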
Toil is the other fundamental. Toil is manual, repetitive, automatable, tactical work that scales linearly with service size and carries no enduring value: think hand-editing configs, manually restarting a stuck pod every night, or copy-pasting the same runbook step. SRE practice caps toil (Google's well-known guideline is under 50 percent of an SRE's time) so the rest goes to engineering that reduces future toil. In interviews, name a concrete piece of toil you eliminated and quantify the time it gave back.
- Availability cheat sheet (per 30-day month): 99.9 percent is about 43m of downtime; 99.95 percent about 22m; 99.99 percent (four nines) about 4.3m.
- Pick SLIs that map to a user journey, not to infrastructure internals. CPU at 80 percent is not an SLI; checkout requests succeeding in under 1s is.
- Set SLOs slightly tighter than the SLA, and give every SLO an explicit measurement window and a documented denominator (what counts as a valid event).
Observability: metrics, logs, traces, Prometheus and Grafana
Observability is your ability to ask new questions about a system's internal state from its external outputs without shipping new code to answer them. The three common pillars are metrics (cheap, aggregatable numeric time series, great for dashboards and alerting), logs (discrete, often high-cardinality events, great for forensic detail), and traces (the path of a single request across services, great for finding which hop is slow). A strong answer explains the trade-off: metrics scale cheaply but lose per-request detail; logs and traces keep detail but cost far more to store and query at volume, which is why teams sample traces and cap log retention.
Prometheus is the de facto open-source metrics backbone. It pulls (scrapes) metrics over HTTP from instrumented targets, stores them as time series identified by a metric name plus labels, and is queried with PromQL. Know the four core metric types: counters (monotonic, only go up between resets, like total requests, which you read via rate()), gauges (go up and down, like current memory), histograms (bucketed observations for latency, which let you compute quantiles server-side with histogram_quantile over aggregated buckets), and summaries (client-side quantiles). A common gotcha: you cannot meaningfully average pre-computed percentiles such as a p99 across instances, so you store latency as a histogram and compute the quantile over the aggregated buckets. Grafana then visualizes those series and is where you build SLO burn-rate dashboards.
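To make the percentile gotcha concrete, here is a simplified Python sketch of the linear interpolation that histogram_quantile performs over cumulative buckets. It is an approximation for illustration, not Prometheus's actual implementation, and the two pods and their bucket counts are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


def merge(*histograms: list[tuple[float, int]]) -> list[tuple[float, int]]:
    """Sum cumulative counts bucket by bucket; all inputs must share the same bounds."""
    return [(row[0][0], sum(count for _, count in row)) for row in zip(*histograms)]


INF = math.inf
# Hypothetical pods: A is fast and busy, B is slow and lightly loaded.
pod_a = [(0.1, 900), (0.5, 1000), (1.0, 1000), (5.0, 1000), (INF, 1000)]
pod_b = [(0.1, 0), (0.5, 10), (1.0, 50), (5.0, 100), (INF, 100)]

averaged_p99 = (histogram_quantile(0.99, pod_a) + histogram_quantile(0.99, pod_b)) / 2
merged_p99 = histogram_quantile(0.99, merge(pod_a, pod_b))
print(f"average of per-pod p99s: {averaged_p99:.2f}s")
print(f"p99 over merged buckets: {merged_p99:.2f}s")
```

With these invented counts the averaged figure lands near 2.7s while the merged-bucket estimate is near 4.1s, because averaging ignores how many requests each pod actually served.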
Alerting maturity is a frequent differentiator. Naive alerting fires on causes (CPU is high, disk is 80 percent full) and produces noise. Mature SRE alerting fires on symptoms tied to the SLO, typically using multi-window multi-burn-rate alerts: a fast-burn alert (the canonical Google example: burning roughly 14.4x faster than the SLO allows over a 1-hour window, which would consume 2 percent of a 30-day budget in an hour) pages immediately for an acute outage, while a slow-burn alert (for example 1x over 3 days) opens a ticket for a steady degradation. In the Site Reliability Workbook's example the intermediate tier, 6x over a 6-hour window (5 percent of the budget), also pages rather than tickets; where you draw that line is a team decision. The principle is that every page should be actionable, novel, and urgent; if it does not require a human to act now, it is a ticket, not a page.
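The burn-rate numbers can be checked with a few lines of arithmetic. The sketch below assumes a 30-day SLO period and a deliberately simplified two-tier policy; real multi-window rules usually pair each long window with a shorter confirmation window, which is omitted here. The function names and thresholds in classify are illustrative.

```python
PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the ratio the SLO allows (1.0 = exactly on budget)."""
    return error_ratio / (1 - slo)


def budget_consumed(rate: float, window_hours: float) -> float:
    """Fraction of the period's budget spent if `rate` is sustained for the window."""
    return rate * window_hours / PERIOD_HOURS


def classify(error_ratio_1h: float, error_ratio_3d: float, slo: float) -> str:
    """Hypothetical two-tier policy: page on fast burn, ticket on slow burn."""
    if burn_rate(error_ratio_1h, slo) >= 14.4:
        return "page"
    if burn_rate(error_ratio_3d, slo) >= 1.0:
        return "ticket"
    return "ok"


print(f"{budget_consumed(14.4, window_hours=1):.1%} of the budget in one hour")
print(classify(error_ratio_1h=0.02, error_ratio_3d=0.003, slo=0.999))    # acute outage
print(classify(error_ratio_1h=0.002, error_ratio_3d=0.0015, slo=0.999))  # slow leak
```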
- RED method for request-driven services: Rate, Errors, Duration. USE method for resources: Utilization, Saturation, Errors.
- Use histograms for latency SLIs so quantiles aggregate correctly across pods; never average percentiles.
- Page on symptoms (user-facing SLO burn), ticket on causes. Multi-burn-rate alerts catch both fast outages and slow leaks while keeping pages rare.
Incident response: on-call, runbooks, blameless postmortems, MTTR
Incident response is where SRE judgment is most visible. A well-run incident has clear roles: an Incident Commander who owns decisions and coordination (and is explicitly not the person elbow-deep in the fix), an Operations/subject-matter lead doing the hands-on remediation, and a Communications lead handling stakeholder and status-page updates. The first job in almost every incident is to mitigate, not to root-cause. Rolling back the last deploy, failing over to a healthy region, or shedding load restores users faster than diagnosing why the new code is broken; you can do the forensics after the bleeding stops.
Runbooks are the codified, step-by-step responses to known alert conditions, and every alert that can page should link to one. A good runbook states the symptom, the likely causes, the exact diagnostic commands, the mitigation steps, and an escalation path. Runbooks reduce mean time to recovery precisely because they let a tired on-call engineer at 3am act without reconstructing tribal knowledge. Note the key metrics: MTTD (detect), MTTA (acknowledge), MTTR (mean time to recovery or to repair, depending on the team's definition), and MTBF (between failures). SRE work usually drives MTTR down faster than it pushes MTBF up, because fast, safe recovery is more tractable than preventing every failure.
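Because teams define these metrics differently, it helps to pin the definitions down in code. The sketch below assumes MTTD is measured from impact start to detection, MTTA from detection to acknowledgement, and MTTR from impact start to recovery; the Incident fields and the two sample incidents are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    started: datetime       # user impact begins
    detected: datetime      # alert fires
    acknowledged: datetime  # on-call acknowledges the page
    recovered: datetime     # user impact ends


def mean_minutes(deltas: list) -> float:
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def summarize(incidents: list[Incident]) -> dict[str, float]:
    return {
        "MTTD": mean_minutes([i.detected - i.started for i in incidents]),
        "MTTA": mean_minutes([i.acknowledged - i.detected for i in incidents]),
        "MTTR": mean_minutes([i.recovered - i.started for i in incidents]),
    }


incidents = [
    Incident(datetime(2024, 1, 1, 3, 0), datetime(2024, 1, 1, 3, 4),
             datetime(2024, 1, 1, 3, 6), datetime(2024, 1, 1, 3, 30)),
    Incident(datetime(2024, 1, 9, 12, 0), datetime(2024, 1, 9, 12, 10),
             datetime(2024, 1, 9, 12, 11), datetime(2024, 1, 9, 13, 0)),
]
summary = summarize(incidents)
for name, minutes in summary.items():
    print(f"{name}: {minutes:.1f} min")
```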
Blameless postmortems are the cultural backbone. The premise is that humans operating a complex system act reasonably given the information and tooling they had, so a postmortem targets the system and process, never the individual. Naming a person as the root cause is a red flag in interviews; the right output is concrete action items with owners (better guardrails, an added automated check, a clearer runbook, a removed footgun) plus an honest timeline. The test of a blameless culture is whether the engineer who triggered the incident feels safe writing the postmortem themselves.
- Mitigate first, diagnose second: rollback, failover, or load-shed to stop user impact before chasing root cause.
- Separate the Incident Commander (decisions, coordination) from the hands-on responder so no one is both fixing and coordinating.
- Postmortems are blameless and action-oriented: every contributing factor yields an owned, tracked follow-up, never a name.
Capacity planning and a worked production-debugging scenario
Capacity planning asks: given expected demand plus headroom for spikes and failures, how much resource do you provision, and when do you scale? An SRE answer distinguishes organic growth (model from historical trend) from inorganic events (a launch, a marketing push, Black Friday) that need explicit forecasting. You provision to a target utilization (commonly 60 to 70 percent steady-state) so a node or zone failure does not immediately saturate the survivors, and you load-test to find the real breaking point rather than trusting CPU as a proxy. On Kubernetes you connect this to the Horizontal Pod Autoscaler scaling pods on a metric, the Cluster Autoscaler (or Karpenter) adding nodes, and requests/limits sized so the scheduler can actually bin-pack and so you avoid CPU throttling and OOMKills.
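The headroom reasoning can be turned into a back-of-the-envelope calculation. The sketch below assumes replicas are spread evenly across zones, that per-replica throughput comes from a load test, and that the goal is to stay at or below the target utilization after losing one zone. All names and numbers are illustrative.

```python
import math


def replicas_needed(peak_rps: float, rps_per_replica: float,
                    target_utilization: float = 0.65, zones: int = 3) -> int:
    """Total replicas so that the survivors of a one-zone failure stay at or below target."""
    if zones < 2:
        raise ValueError("need at least two zones to survive a zone failure")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Replicas required to carry peak load at the target utilization.
    serving = math.ceil(peak_rps / (rps_per_replica * target_utilization))
    # The remaining (zones - 1) zones must hold that many replicas on their own.
    per_zone = math.ceil(serving / (zones - 1))
    return per_zone * zones


total = replicas_needed(peak_rps=12_000, rps_per_replica=500)
survivors = total - total // 3
print(f"provision {total} replicas; after a zone loss {survivors} remain "
      f"at {12_000 / (survivors * 500):.0%} utilization")
```

In this example the cost of surviving a zone failure is visible directly: 37 replicas would carry the peak, but 57 are provisioned.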
Now a scenario interviewers love: 'p99 latency on the checkout service jumped from 200ms to 3s ten minutes ago, error rate is still near zero. Walk me through it.' A structured response beats raw command recall. First, confirm scope from the SLO dashboard: is it all requests or one endpoint, one region, one version, one pod? Errors near zero plus a latency spike suggests requests still succeed but are waiting on something, so I am thinking saturation or a slow dependency, not a crash. Check the RED metrics and the deploy timeline: did anything ship at minute zero? If a deploy correlates, the fastest mitigation is to roll it back and confirm latency recovers, then investigate offline.
If no deploy correlates, follow the request with a trace to find which hop owns the new latency: is it the database (a missing index, lock contention, connection-pool exhaustion), a downstream API, or the pod itself? On the pod, check whether CPU is being throttled (container_cpu_cfs_throttled_seconds_total rising against a tight limit), whether it is GC-thrashing or near its memory limit, and whether the node is saturated. A classic culprit here is connection-pool or thread-pool exhaustion: a slow dependency causes requests to queue, latency climbs while error rate stays low until timeouts finally trip. Throughout, I narrate hypotheses and how each metric confirms or rejects them. The interviewer is scoring the systematic narrowing far more than whether you remember an exact flag.
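One way to reason about the pool-exhaustion hypothesis is Little's Law (average concurrency equals arrival rate times time in system), which the scenario above does not name but which explains why a slow dependency fills a pool while errors stay flat. The traffic figures and pool size below are invented for illustration.

```python
def in_flight(rps: float, latency_s: float) -> float:
    """Little's Law: average concurrent requests = arrival rate x time in system."""
    return rps * latency_s


def pool_saturated(rps: float, latency_s: float, pool_size: int) -> bool:
    """True when steady-state demand for connections exceeds the pool, so requests queue."""
    return in_flight(rps, latency_s) > pool_size


POOL_SIZE = 50  # hypothetical connection-pool limit
for label, latency in (("healthy dependency", 0.05), ("slow dependency", 0.4)):
    needed = in_flight(200, latency)
    state = "queueing" if pool_saturated(200, latency, POOL_SIZE) else "ok"
    print(f"{label}: ~{needed:.0f} connections needed of {POOL_SIZE} -> {state}")
```

At the same request rate, an eightfold increase in dependency latency needs eight times the connections, which is why the symptom is waiting rather than failing until timeouts start to fire.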
- Latency up, errors flat usually means waiting (saturation, slow dependency, pool exhaustion), not crashing. Errors up usually means a failing dependency or a bad deploy.
- Always correlate against the deploy timeline first; the cheapest fix for a regression is a rollback that you confirm on the dashboard.
- Watch for CPU CFS throttling and OOMKills as silent latency causes on Kubernetes when limits are set too tight.
How to prepare effectively
Reading SRE theory creates the dangerous illusion of competence. You can recite the error-budget formula and still freeze when an interviewer says 'pods are in CrashLoopBackOff, the dashboard is red, go.' The fix is deliberate, hands-on reps against realistic failures: deliberately break a Kubernetes deployment and recover it, instrument a service with Prometheus and write the PromQL for a latency SLI, then drive a mock incident end to end including the postmortem write-up. Aim to be able to talk while you type, because that running commentary is exactly what the interview round rewards.
Tailor your prep to the actual job. A platform-SRE role at a Kubernetes-heavy shop weights cluster internals and on-call differently than an SRE role at a database-centric company. Pull the specific stack from the job description, drill the failure modes of that stack, and research the company's reliability posture and recent engineering signals before you walk in. This is exactly the loop prepme.io is built for: paste a real SRE job description and it generates a free briefing that points you to three kinds of practice. Its flagship is hands-on Kubernetes and Terraform labs you debug in a real in-browser container; alongside that are auto-graded ABCD multiple-choice exams across easy, medium, and hard, and architecture and design whiteboard interviews (currently in beta). It also adds a free, on-demand Company Coverage card that web-researches the employer's news, layoffs, funding, culture, and interview intel, with cited sources. Generating and viewing the briefing is free; doing the practice (labs, exams, and design) is $20 a month, cancel anytime, and every lab, exam, and design task is graded 0 to 100 with specific feedback, so you find your weak spots before the interviewer does.
- Practice talking while you debug; the live narration of your hypothesis is half the score in the systems round.
- Drill the exact stack in the job description, not a generic checklist, and prepare numbers you can defend (your SLOs, your error budget, your incidents).
- Rehearse a real incident end to end including a blameless postmortem, since the write-up reveals your judgment more than the fix did.
Common interview questions & answers
What is the difference between an SLI, an SLO, and an SLA?
An SLI is a measured indicator of service health, expressed as the ratio of good events to valid events, such as the percentage of requests served under 300ms. An SLO is the internal target for that SLI over a window, for example 99.9 percent over 28 days. An SLA is the externally contracted promise with financial or legal consequences for breach. You always set the SLO tighter than the SLA so you detect and react to degradation internally before you ever violate a customer contract.
What is an error budget and how does a team use it?
An error budget is 100 percent minus the SLO, so a 99.9 percent availability SLO yields a 0.1 percent budget, which over a 30-day month is roughly 43 minutes of allowable failure. It is a shared currency between product and reliability: when the budget is healthy the team can ship features fast and take launch risks, and when it is exhausted the policy is to freeze risky changes and spend the cycle on reliability and toil reduction. This replaces the dev-versus-ops argument with a number both sides agreed to in advance.
How much downtime do 99.9 percent and 99.99 percent allow per month, and how do you decide which to target?
Measured over a 30-day month, 99.9 percent (three nines) allows about 43 minutes of downtime and 99.99 percent (four nines) allows about 4.3 minutes. You do not pick the highest number by default, because each additional nine is dramatically more expensive and users often cannot perceive the difference through their own network and device variance. The right target comes from the product and the user journey: what does an extra nine cost to engineer, and does the business actually need it? Over-targeting reliability burns budget you could have spent on features.
What are the three pillars of observability and what is each best for?
Metrics are cheap, aggregatable numeric time series, ideal for dashboards, trend analysis, and alerting on SLO burn. Logs are discrete, often high-cardinality event records, ideal for forensic detail when you already know roughly where to look. Traces follow a single request across services, ideal for pinpointing which hop introduced latency in a distributed call. The trade-off is cost versus detail: metrics scale cheaply but lose per-request granularity, while logs and traces preserve detail but get expensive at volume, which is why traces are sampled and log retention is capped.
Explain the four Prometheus metric types and when you would use each.
A counter is monotonic and only increases (until a process restart resets it), used for cumulative totals like requests served or errors, and you typically look at its rate() over time. A gauge moves up and down, used for instantaneous values like current memory or in-flight requests. A histogram buckets observations and lets you compute quantiles server-side with histogram_quantile, which is the correct way to measure latency because the buckets aggregate across instances. A summary computes quantiles client-side, which is cheaper to query but cannot be meaningfully aggregated across pods, so histograms are preferred for SLIs.
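To illustrate why counters are read through rate() rather than by subtracting raw values, the sketch below computes an increase that tolerates a reset. It is a simplification: Prometheus's rate() also extrapolates to the window boundaries, which is left out here, and the sample values are made up.

```python
def increase(samples: list[float]) -> float:
    """Total increase across samples, treating any drop as a counter reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarts from 0, so the new value is the increase.
        total += cur - prev if cur >= prev else cur
    return total


def per_second_rate(samples: list[float], window_seconds: float) -> float:
    """Average per-second rate over the window covered by the samples."""
    return increase(samples) / window_seconds


# Hypothetical scrapes every 15s; the process restarts between the 3rd and 4th sample.
scrapes = [100, 160, 220, 15, 75]
print("naive last - first:", scrapes[-1] - scrapes[0])
print("reset-aware increase:", increase(scrapes))
print("requests per second:", per_second_rate(scrapes, window_seconds=60))
```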
Why should you alert on symptoms rather than causes, and what is a multi-burn-rate alert?
Cause-based alerts (CPU high, disk filling) generate noise because many of them never reach the user, leading to alert fatigue and ignored pages. Symptom-based alerts fire on user-facing SLO burn, so every page corresponds to real impact. A multi-burn-rate alert watches error-budget consumption over multiple windows: a fast-burn rule pages immediately when the budget is being consumed many times faster than allowed over a short window (the canonical Google example is roughly 14.4x over 1 hour, i.e. 2 percent of a 30-day budget in an hour), while a slow-burn rule (for example 1x over 3 days) opens a ticket when a smaller leak persists. The Site Reliability Workbook's example also pages on an intermediate tier of 6x over 6 hours. This catches both sudden outages and slow degradations while keeping pages rare and actionable.
Walk me through your first actions in a major production incident.
First I declare the incident and establish roles, separating the Incident Commander who coordinates from the responder doing the hands-on work. Then I prioritize mitigation over diagnosis: I check whether a recent deploy correlates and, if so, roll it back, or fail over to a healthy region, or shed load to stop user impact immediately. I keep communication flowing through regular status updates so stakeholders are not pinging the responders. Only after the bleeding stops do I move to root cause, and afterward I drive a blameless postmortem with owned action items.
What is toil and how do you manage it as an SRE?
Toil is manual, repetitive, automatable work that has no enduring value and scales linearly with the service, such as hand-restarting a stuck process nightly or copy-pasting the same remediation. It is corrosive because it crowds out engineering and grows with the system. SRE practice caps toil, with Google's well-known guideline being under 50 percent of an SRE's time, so the remainder goes to automation that permanently reduces future toil. The strongest answer names a specific piece of toil you automated away and quantifies the recurring time it returned to the team.
What makes a postmortem blameless, and why does it matter?
A blameless postmortem operates on the premise that engineers acted reasonably given the information and tools available to them, so it analyzes the system and process rather than assigning fault to a person. Its output is a clear timeline plus concrete, owned action items: better guardrails, an added automated check, a clearer runbook, or a removed footgun. It matters because blame drives incidents underground; if people fear punishment they hide mistakes and the organization stops learning. The cultural litmus test is whether the engineer who triggered the incident feels safe authoring the postmortem.
The p99 latency of a service spiked but the error rate stayed near zero. How do you debug it?
Latency up with errors flat means requests are succeeding but waiting, which points to saturation or a slow dependency rather than a crash. I first scope it on the SLO dashboard (one endpoint, one region, one version, one pod?) and check the deploy timeline, since a correlated deploy is fastest to fix by rollback. If nothing shipped, I follow a trace to find which hop owns the new latency: a database with lock contention or pool exhaustion, a slow downstream API, or the pod itself. On the pod I check CPU CFS throttling against a tight limit, memory pressure or GC, and node saturation, with connection-pool or thread-pool exhaustion behind a slow dependency being a classic cause of rising latency with flat errors.
How do you approach capacity planning for a service?
I separate organic growth, which I model from historical usage trends, from inorganic events like launches or sales that require explicit forecasting. I provision to a steady-state target utilization, often 60 to 70 percent, so the loss of a node or availability zone does not immediately saturate the survivors. I load-test to find the real breaking point rather than trusting CPU as a proxy, and on Kubernetes I tie this to the Horizontal Pod Autoscaler, the Cluster Autoscaler or Karpenter, and correctly sized requests and limits so the scheduler can bin-pack without causing throttling or OOMKills. The plan also reserves headroom for spikes and for failover capacity.
What is the difference between MTTR, MTTD, and MTBF, and which does SRE work usually improve most?
MTTD is mean time to detect a problem, MTTA is mean time to acknowledge it, MTTR is mean time to recover or repair, and MTBF is mean time between failures. SRE work typically drives MTTR down faster than it raises MTBF, because making recovery fast and safe (good runbooks, easy rollbacks, automated failover) is more tractable than preventing every possible failure in a complex system. Reducing MTTR also directly protects the error budget, since shorter incidents consume less of it. A good answer notes that slow detection (MTTD) often dominates total impact, so observability investment pays off there too.
How do you decide what to make an alert that pages versus a ticket?
A page is justified only when a human must take action right now to prevent or stop user-facing impact, so it should be actionable, urgent, and novel. If the condition can wait until business hours, can be auto-remediated, or does not directly threaten the SLO, it belongs in a ticket queue rather than waking someone. Every page should link to a runbook so the on-call engineer can act without reconstructing context at 3am. The goal is a low, high-signal page rate, because a flood of non-actionable pages causes fatigue and real alerts get missed.
How do reliability goals and feature velocity get balanced in practice?
They are balanced through the error-budget policy. As long as the service is within its error budget, the team is free to ship features quickly and absorb the inherent risk of change. When the budget is exhausted, the agreed policy kicks in: pause risky launches and redirect effort to reliability, testing, and toil reduction until the budget recovers. This turns a perennial cultural fight into a data-driven, pre-negotiated rule, and it gives product a concrete incentive to invest in reliability because reliability work is what unlocks future shipping.
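An error-budget policy is easiest to defend when it is expressed as a rule that both sides can run. The sketch below is one hypothetical version with a single threshold at zero remaining budget; real policies often add intermediate tiers, and the request counts are invented.

```python
def budget_remaining(slo: float, good_events: int, valid_events: int) -> float:
    """Fraction of the error budget left: 1.0 is untouched, 0 or below is exhausted."""
    allowed_bad = (1 - slo) * valid_events
    actual_bad = valid_events - good_events
    return 1 - actual_bad / allowed_bad


def release_decision(remaining: float) -> str:
    """Pre-agreed rule: ship while budget remains, otherwise freeze risky launches."""
    return "ship" if remaining > 0 else "freeze risky launches"


# Hypothetical window with 1,000,000 valid requests against a 99.9% SLO.
for failed in (400, 1_200):
    left = budget_remaining(0.999, 1_000_000 - failed, 1_000_000)
    print(f"{failed} failures: {left:+.0%} budget left -> {release_decision(left)}")
```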
FAQ
What is the difference between an SRE and a DevOps engineer interview?
Both assume cloud, Linux, and containers, but a DevOps interview centers on building and shipping delivery pipelines, while an SRE interview centers on measuring and defending reliability. Expect SRE loops to dig into SLIs, SLOs, error budgets, on-call judgment, incident command, and a live production-debugging scenario. The mental shift is treating reliability as a quantity you trade against feature velocity, not as an afterthought.
How long should I study for an SRE interview?
If you already work in DevOps or platform engineering, two to four weeks of focused, hands-on practice is usually enough to be sharp. The constraint is rarely reading; it is reps. Spend most of the time breaking and recovering real systems, writing PromQL for latency SLIs, and running mock incidents end to end including the postmortem, rather than re-reading theory you can already recite.
Do I need to memorize the availability nines and downtime numbers?
You should know the common ones cold because interviewers use them as a quick filter: over a 30-day month, 99.9 percent is about 43 minutes of downtime and 99.99 percent is about 4.3 minutes. More importantly, be ready to derive an error budget from an SLO on the spot and explain why you would not blindly target the most nines. Showing the reasoning matters more than reciting a table.
What books or sources should I study for SRE interviews?
The Google SRE Book and The Site Reliability Workbook are the canonical, freely readable references for SLOs, error budgets, toil, and postmortems, and most interview rubrics echo their vocabulary. Pair that reading with hands-on Prometheus, Grafana, and Kubernetes practice so the concepts are muscle memory, not trivia. Then tailor to the specific employer's stack and reliability posture before the interview.
How can I practice the live systems-debugging round realistically?
The best practice is debugging real, broken infrastructure while narrating your reasoning out loud, because the interview scores your systematic triage as much as the fix. Deliberately break Kubernetes deployments and recover them, then rehearse explaining each hypothesis and the metric that confirms or rejects it. prepme.io generates hands-on labs in real in-browser containers from a pasted SRE job description and grades each attempt 0 to 100 with feedback, which mirrors this round closely.
Should I research the company before an SRE interview, and how?
Yes. Knowing a company's reliability posture, recent incidents or layoffs, funding, and engineering culture lets you tailor answers and ask sharp questions, which interviewers read as seniority. Pull the exact stack from the job description and prepare for its specific failure modes. prepme.io's free Company Coverage card speeds up the research by web-searching the employer's news, layoffs, funding, culture, and interview intel, with cited sources.
Related guides
- Kubernetes Interview Questions and Answers: Kubernetes interview questions with model answers: pods, Deployments, Services, Ingress, RBAC, probes, plus live debugging of CrashLoopBackOff, Pending, and OOMKilled.
- Kubernetes Troubleshooting Interview Questions: Senior Kubernetes troubleshooting interview questions with model answers: debug CrashLoopBackOff, Pending pods, probes, OOMKilled, RBAC, and stuck rollouts.
- AWS DevOps Interview Questions and Answers: Real AWS DevOps interview questions with strong model answers: ECS vs EKS, VPC and IAM least-privilege, Terraform vs CloudFormation, CI/CD, scaling, and incidents.