Site Reliability Engineer (SRE) Interview Questions

35+ SRE interview questions covering SLOs, error budgets, incident management, observability, capacity planning, and toil reduction — with production-grade answers.

Tags: sre, slo, sla, error-budget, incident-management, observability, chaos-engineering, reliability


SLOs, SLAs, SLIs, and Error Budgets

Define SLA, SLO, and SLI and explain how they relate to each other.

SLI (Service Level Indicator): A quantitative measurement of some aspect of service quality. The raw metric.

  • Example: "The proportion of HTTP requests that complete successfully in under 200 ms."

SLO (Service Level Objective): The target value for an SLI over a time window that you commit to internally.

  • Example: "99.9% of requests will complete successfully in under 200 ms, measured over a rolling 28 days."

SLA (Service Level Agreement): A contractual agreement with customers that includes consequences (credits, refunds) if SLOs are missed.

  • Example: Customer contracts promise 99.5% availability; an SLO of 99.9% gives the team a buffer before breaching the SLA.

Relationship: SLIs are measured → compared to SLOs → SLOs inform the SLA. The SLO should always be stricter than the SLA to give an internal buffer.


What is an error budget and how is it used?

An error budget is the amount of unreliability you're allowed before missing your SLO.

If your SLO is 99.9% availability over 30 days:

  • Total minutes in 30 days: 43,200
  • Allowed downtime (0.1%): 43.2 minutes
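
A quick Python sketch of this arithmetic (the SLO, window, and downtime figures are illustrative):

# Error budget for a 99.9% SLO over a 30-day window (illustrative values).
slo = 0.999
window_minutes = 30 * 24 * 60                  # 43,200 minutes
budget_minutes = window_minutes * (1 - slo)    # 43.2 minutes of allowed downtime

downtime_so_far = 20                           # minutes consumed this window
remaining = budget_minutes - downtime_so_far
print(f"budget {budget_minutes:.1f} min, remaining {remaining:.1f} min "
      f"({remaining / budget_minutes:.0%})")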

How the budget drives decisions:

  • Healthy (above ~50% remaining): teams can deploy more aggressively and run experiments.
  • Burning fast: freeze non-critical deployments; focus on reliability improvements.
  • Exhausted: no new feature deployments until the budget resets or the SLO is revised downward (after stakeholder discussion).

Error budgets make reliability a shared concern between product and engineering — product wants features; burning the budget reduces deployment velocity.


What is an error budget burn rate and why alert on it rather than on instantaneous availability?

Burn rate measures how fast you're consuming the error budget relative to budget replenishment speed.

  • Burn rate 1 = exactly consuming the budget at the SLO pace.
  • Burn rate 14.4 = consuming 14.4× normal — the budget will be exhausted in 1/14.4 of the window.

Why not alert on instantaneous availability:

  • A 5-minute outage on a quiet night barely moves the 30-day SLO.
  • A sustained 1% error rate over days silently burns the budget without triggering a spike alert.

Recommended multiwindow alert (from Google SRE Workbook):

  • 1-hour + 5-minute burn rate > 14.4: page immediately (fast and significant burn)
  • 6-hour + 30-minute burn rate > 6: page (moderate burn threatening budget)
  • 3-day burn rate > 1: ticket only (slow but persistent)
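
To make burn rate concrete, here is a minimal Python sketch (the thresholds mirror the Workbook values above; the function names are illustrative):

def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How fast the budget burns: 1.0 means it lasts exactly the SLO window."""
    return observed_error_rate / (1 - slo)

print(burn_rate(0.01, 0.999))  # 1% errors vs a 99.9% SLO -> burn rate 10.0

def should_page(rate_1h: float, rate_5m: float, threshold: float = 14.4) -> bool:
    """Multiwindow check: both the long and short windows must burn fast."""
    return rate_1h > threshold and rate_5m > threshold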

How do you choose meaningful SLIs for a service?

Focus on SLIs that directly reflect user experience, not internal technical metrics:

Availability: ratio of successful requests to total requests. ("Successful" = HTTP 2xx/3xx, not 5xx.)

Latency: proportion of requests completing under a threshold. Use percentiles — p50 (median), p95, p99 — not the mean, which hides long-tail latency.

Throughput: requests per second a service can successfully handle.

Error rate: proportion of requests resulting in errors.

Coverage (data systems): proportion of data processed correctly vs. expected.

Avoid internal SLIs like "CPU < 80%". High CPU doesn't necessarily mean users are suffering. Low latency at high CPU is fine.
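
For example, request-based SLIs can be computed directly from raw latency samples — a minimal sketch (the 200 ms threshold and nearest-rank percentile method are illustrative):

def latency_sli(samples_ms: list[float], threshold_ms: float = 200.0) -> float:
    """Proportion of requests completing under the threshold."""
    return sum(s < threshold_ms for s in samples_ms) / len(samples_ms)

def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: percentile(samples, 99) gives p99."""
    ordered = sorted(samples_ms)
    return ordered[min(int(p / 100 * len(ordered)), len(ordered) - 1)]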


What happens when the error budget is exhausted?

  1. Freeze deployments: Non-critical feature deployments are blocked until budget resets.
  2. Reliability sprint: Engineering prioritises reliability work over features — fixing flakiness, reducing MTTR, improving observability.
  3. Post-mortem: Conduct a blameless post-mortem on what burned the budget.
  4. Negotiate SLO: If budget is consistently exhausted because the SLO is too tight for actual user needs, work with stakeholders to revise it.
  5. Communicate: Inform customers proactively if their SLA is being approached.

How do you set an initial SLO for a new service?

When there's no historical data:

  1. Start conservative: 99.5% availability is easier to achieve than 99.99%, and you can tighten it later.
  2. Measure first: Deploy with monitoring; measure actual availability for 4–8 weeks.
  3. Define "good": The SLI threshold (e.g., < 500 ms latency) should align with what users actually experience as acceptable, not an arbitrary number.
  4. Consult stakeholders: product managers and customer-facing teams often have intuition about user tolerance for downtime and latency.
  5. Leave headroom above SLA: Internal SLO should be stricter than contractual SLA by at least 0.5%.

What is the difference between availability, reliability, and durability?

Availability: Fraction of time a service is accessible and functioning. Measured in nines: 99.9% = 43 min downtime/month.

Reliability: Probability of the system performing its intended function without failure over a specified period. Closely related to availability but includes correctness — a system can be available but returning wrong data (unreliable).

Durability: For data systems specifically — the probability that stored data will not be lost. Azure Blob Storage with LRS is designed for eleven nines (99.999999999%) of durability: losing data is extremely unlikely even if the service has availability issues.


Incident Management & On-Call

What is your process when you're paged for a production incident?

  1. Acknowledge the alert — stops escalation; signals you're on it.
  2. Establish context — read the alert, check dashboards, scope of impact.
  3. Communicate — open an incident channel; post initial status ("investigating DB connectivity on prod API").
  4. Triage and mitigate first — before root cause analysis. If a rollback stops user pain, do it immediately.
  5. Escalate when needed — don't work a P1 alone for more than 15 minutes without pulling someone in.
  6. Document in real time — timeline of actions in the incident doc; others must be able to pick it up.
  7. Resolve and restore — validate metrics returning to SLO.
  8. Write post-mortem — within 24–48 hours while fresh.

What is a post-mortem and what makes it blameless?

A post-mortem is a structured review of an incident to understand what happened, why, and how to prevent recurrence.

Blameless means:

  • Focus on systems and processes that allowed the failure, not on individuals who made mistakes.
  • People act rationally given what they knew at the time; "human error" is a symptom, not a cause.
  • Engineers describe their actions honestly without fear of punishment — essential for accurate root cause analysis.
  • The goal is learning and prevention, not assigning blame.

Post-mortem structure:

  • Summary and impact
  • Timeline of events
  • Contributing factors (5 Whys / Fishbone)
  • Root cause(s)
  • Action items with owners and due dates
  • What went well

How do you prioritise incidents (P1/P2/P3)?

| Priority | Definition | Response | Example |
| --- | --- | --- | --- |
| P1 (Critical) | Total outage or widespread user impact | Page on-call immediately, bridge open | Production API returning 500 for all users |
| P2 (High) | Significant degradation or partial outage | Page on-call during business hours | 10% of users experiencing 5× latency |
| P3 (Medium) | Minor degradation, workaround exists | Next business day | Monitoring dashboard misconfigured |
| P4 (Low) | Cosmetic or non-functional issue | Backlog | Outdated help text |

Severity is by user impact, not by how stressful it feels to fix.


What is MTTR and MTBF? How do you improve them?

MTBF (Mean Time Between Failures): Average time between incidents. Improve by building more resilient systems, fixing recurring failure modes, chaos engineering.

MTTR (Mean Time to Recover): Average time from incident start to service restored. Improve by:

  • Better alerting (detect faster)
  • Pre-written runbooks (investigate faster)
  • Feature flags and quick rollback capability (mitigate faster)
  • Practiced incident response

Good SREs focus on MTTR as aggressively as MTBF — you can't prevent all failures, but you can recover quickly.
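
As a quick illustration of the two definitions, computing both from an incident log (timestamps made up, in hours):

incidents = [(0.0, 0.5), (100.0, 101.5), (250.0, 250.25)]  # (start, recovered)

mttr = sum(end - start for start, end in incidents) / len(incidents)
uptimes = [incidents[i + 1][0] - incidents[i][1] for i in range(len(incidents) - 1)]
mtbf = sum(uptimes) / len(uptimes)  # mean uptime between recovery and next failure
print(f"MTTR {mttr:.2f} h, MTBF {mtbf:.1f} h")  # MTTR 0.75 h, MTBF 124.0 h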


What on-call practices reduce engineer burnout?

  • Alert quality: Only actionable, user-impacting alerts page engineers. Noisy alerts lead to burnout.
  • Rotation: Rotate on-call weekly; always have a secondary. No engineer on-call alone for extended periods.
  • 24-hour rule: Engineers who were paged overnight should not be expected to work a full day the next day.
  • Runbooks: Written, tested runbooks for common incidents reduce cognitive load.
  • Post-incident recovery time: Time after a major incident to decompress and document.
  • Toil tracking: If on-call is dominated by repetitive manual work, it must be automated.

Explain escalation paths and runbooks.

Runbook: Step-by-step procedure for responding to a specific alert or incident type. Should include: diagnostic steps, mitigation options, rollback instructions, escalation contacts. Written proactively, tested in fire drills.

Escalation path: Who to contact if the current responder cannot resolve the incident within a time threshold:

  1. On-call engineer (L1)
  2. Senior/domain expert (L2)
  3. Service owner / engineering manager (L3)
  4. Executive / crisis team (P0 incidents)

Escalation should be fast and stigma-free. Escalating early when stuck is professional, not a sign of failure.


What is an incident command system and when is it used?

The Incident Command System (ICS) is a structured management approach for complex incidents with multiple responders. Roles:

  • Incident Commander (IC): Coordinates the response; owns communication; makes time-sensitive decisions. Does NOT investigate or fix.
  • Technical Lead: Leads the technical investigation.
  • Communications Lead (Comms): Updates status page, emails customers, posts to incident channels.
  • Scribe: Documents the timeline in real time.

ICS prevents the chaos of multi-person incidents where everyone is investigating simultaneously with no coordination. Use for P1 incidents with 3+ engineers involved.


Observability

What are the four golden signals of monitoring (Google SRE)?

  1. Latency: Time to serve a request. Distinguish successful vs. error latency — slow errors mask real signal.
  2. Traffic: Demand on the system (requests/sec, bytes/sec, transactions/sec).
  3. Errors: Rate of failing requests (5xx, exceptions, business logic failures).
  4. Saturation: How full a resource is — CPU, memory, disk, connection pools. Saturation predicts degradation before it becomes an outage.

If you can only instrument four things, instrument these.


What is Prometheus and how does a scrape-based model work?

Prometheus is an open-source time-series monitoring system. Instead of agents pushing metrics, Prometheus scrapes (pulls) metrics from instrumented endpoints every 15–60 seconds.

scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:8080']

Applications expose /metrics in the Prometheus text format (or via client libraries). Prometheus stores the scraped samples in a local TSDB. Alertmanager handles alert routing and deduplication. Grafana visualises the data.
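
For illustration, instrumenting a Python service with the official prometheus_client library might look like this (the metric names and port are examples):

from prometheus_client import Counter, Histogram, start_http_server
import random, time

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

start_http_server(8080)  # exposes /metrics for Prometheus to scrape

while True:  # simulate serving traffic
    with LATENCY.time():                       # records duration into histogram buckets
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUESTS.labels(status="200").inc()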

Benefits of pull model: Prometheus controls the rate; easy to discover service health (if scrape fails, service is down); no UDP/TCP firewall holes required for push.


How do you instrument an application for observability?

Use OpenTelemetry (OTel) — vendor-neutral, standard instrumentation SDK:

  1. Metrics: Counters (total requests), gauges (active connections), histograms (latency distribution by bucket).
  2. Traces: startSpan / endSpan wrapping key operations; propagate traceparent header on outbound calls.
  3. Logs: Structured JSON logs (key-value pairs, not freeform text). Include trace_id and span_id to correlate logs with traces.

OTel collector receives all signals and exports to your backend (Prometheus, Jaeger, Grafana, Azure Monitor, Datadog). You instrument once; swap backends without code changes.
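
A minimal tracing sketch with the OpenTelemetry Python SDK (the service, span, and attribute names are illustrative):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Console exporter for the sketch; production would export OTLP to a collector.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def charge_card(order_id: str) -> None:
    # start_as_current_span joins the caller's trace via context propagation.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)
        ...  # call the payment provider here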


What is structured logging and why is it better than plain text logs?

Plain text: "User 42 logged in at 14:32" — requires regex parsing; hard to query at scale.

Structured logging:

{"timestamp": "2026-04-07T14:32:00Z", "level": "info", "event": "user_login", "user_id": 42, "duration_ms": 23, "ip": "1.2.3.4"}

Benefits:

  • Machine-parseable — instantly query user_id=42 or event=user_login AND duration_ms > 1000.
  • Consistent schema across services makes cross-service correlation trivial.
  • Log aggregation tools (Elasticsearch, Loki) index structured fields — faster, cheaper queries than full-text search.
  • Add trace_id to link logs to distributed traces.
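
One lightweight way to emit such logs from Python using only the standard library (field names match the example above; a hypothetical sketch, not a production formatter):

import json, logging, time

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        entry.update(getattr(record, "fields", {}))  # merge structured key-values
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("user_login", extra={"fields": {"user_id": 42, "duration_ms": 23}})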

How do you distinguish between a symptom and a root cause in incident investigation?

Symptom: What the system is doing wrong (high error rate, slow responses, users can't log in).

Root cause: The fundamental failure that, if fixed, prevents recurrence.

5 Whys technique:

  1. Error rate is 10% → Why? → DB queries are timing out
  2. DB queries time out → Why? → Connection pool exhausted
  3. Connection pool exhausted → Why? → Connection leak introduced in the v2.3.1 deploy
  4. Leak shipped to production → Why? → The error path didn't close connections, and no test caught it
  5. No test coverage for the error path → Why? → No testing standards for error handling

The root cause is "no testing standards for error handling." Fixing just the connection leak would leave the gap open for the next occurrence.


What is log aggregation and why do you need it?

In a microservices architecture with hundreds of pods, logs are scattered across nodes and ephemeral containers. Log aggregation collects all logs centrally.

Common stack:

  • FluentBit (lightweight) or Fluentd — run as DaemonSet on each K8s node, tail container logs, parse and forward.
  • Elasticsearch (OpenSearch) — index and store logs for full-text search.
  • Kibana (Grafana Explore) — query and visualise.

Or managed: Azure Monitor Log Analytics, AWS CloudWatch Logs, Datadog Logs.

Without aggregation: you're SSH-ing into individual nodes hoping the pod is still running and its logs haven't rotated.


What is a distributed trace and how do you use it to debug latency?

A trace is a collection of spans — each span records an operation with start time, duration, service name, and attributes. Spans are linked by parent-child relationships.

To debug latency:

  1. Take a slow request's trace ID from logs.
  2. Open the trace in Jaeger/Zipkin/Application Insights.
  3. The waterfall view shows each service's contribution to total latency.
  4. Identify the largest span = the bottleneck (e.g., auth-service taking 800 ms of a 1000 ms request).
  5. Drill into that span — attributes show SQL query, cache hit/miss, downstream call.

Distributed tracing removes the "which service is slow?" guesswork from multi-service debugging.


Capacity Planning & Scalability

How do you approach capacity planning for a service?

  1. Establish baselines: Measure current load (requests/sec, CPU, memory) and resource headroom.
  2. Model growth: Work with product on expected traffic growth (feature launch, seasonal events).
  3. Load test: Identify the service's breaking point and which resource hits saturation first.
  4. Buffer: Provision at least 1.5–2× expected peak, and never run services above ~70% saturation, so there is headroom for spikes (see the sketch after this list).
  5. Auto-scaling: Configure HPA (K8s) or VMSS so the system self-adjusts within defined bounds.
  6. Review regularly: Monthly capacity reviews; model next quarter's growth.
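
The buffer arithmetic from step 4, as a tiny sketch (all numbers illustrative):

import math

def replicas_needed(peak_rps: float, rps_per_replica: float,
                    buffer: float = 1.5, max_utilisation: float = 0.7) -> int:
    """Replicas to serve peak × buffer while keeping each below ~70% saturation."""
    effective_capacity = rps_per_replica * max_utilisation
    return math.ceil(peak_rps * buffer / effective_capacity)

print(replicas_needed(peak_rps=3000, rps_per_replica=500))  # -> 13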

What is load testing and what tools do you use?

Load testing sends synthetic traffic to a system to measure behaviour under stress, identify breaking points, and validate scaling.

Types:

  • Load test: Expected peak traffic sustained.
  • Stress test: Beyond peak to find the breaking point.
  • Soak test: Sustained load for hours/days to detect memory leaks.
  • Spike test: Sudden burst (e.g., marketing campaign).

Tools: k6 (developer-friendly, code-based), Locust (Python), Artillery, JMeter (UI-based).

// k6 example
import http from 'k6/http';
import { check } from 'k6';
export const options = { vus: 100, duration: '5m' };
export default function () {
  const r = http.get('https://api.example.com/health');
  check(r, { 'status 200': (r) => r.status === 200 });
}

How do you handle stateful services in a horizontally scaled architecture?

Stateful services (databases, caches, message queues) can't be naively replicated like stateless apps. Strategies:

  • Read replicas: Scale read traffic horizontally across multiple DB replicas; all writes go to primary.
  • Sharding: Partition data by key (user ID range, geographic region) across multiple DB instances.
  • External state: Move session state to Redis, allowing stateless app servers to scale freely.
  • CQRS: Separate read (query) and write (command) models — each can scale independently.
  • Managed services: Use Azure SQL, Cosmos DB, Redis Cache — the provider handles most scaling complexity.

What is a circuit breaker pattern?

A circuit breaker wraps calls to external services and "trips" (opens) when failure rate exceeds a threshold — preventing your service from hammering a failing dependency and degrading further.

Three states:

  • Closed (normal): Requests pass through.
  • Open (tripped): Requests fail fast with a fallback (cached response, default value, error). Dependency not called.
  • Half-Open (probing): After a timeout, allows a probe request. If it succeeds, circuit closes. If not, back to open.

Libraries: Resilience4j (Java), Polly (.NET), pybreaker (Python).

Prevents cascading failures from propagating through microservices chains.
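
A deliberately minimal Python sketch of the three-state machine (thresholds and timeout are illustrative; real libraries like pybreaker add thread safety, fallbacks, and metrics):

import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "half_open"  # timeout elapsed: allow one probe request
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"  # trip (or re-trip after a failed probe)
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"  # success closes the circuit
        return result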


Chaos Engineering

What is chaos engineering and why do SRE teams use it?

Chaos engineering is the practice of deliberately introducing failures into production (or production-like) systems to discover weaknesses before they cause real incidents.

Why: Complex distributed systems have emergent failure modes that can't be predicted from reading code or architecture diagrams alone. You find weaknesses either in a controlled chaos experiment or during a real incident — better to choose the former.

Pioneered by Netflix (Chaos Monkey terminates EC2 instances randomly to ensure Netflix can survive instance failures).


How would you design a chaos experiment?

  1. Define steady state: What does "normal" look like? (Request success rate > 99.9%, p99 latency < 200 ms)
  2. Hypothesise: "We believe that killing one of three API pods will not affect the success rate because the load balancer will redistribute traffic."
  3. Apply chaos: Terminate one pod. Use tools like Chaos Mesh, Litmus, Gremlin.
  4. Observe: Does the system behave according to the hypothesis?
  5. Fix gaps: If the success rate dropped (hypothesis wrong), fix the gap before repeating.
  6. Document: Record findings; add to runbook.

Start in staging; move to production only when confident in blast radius control.
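
The steady-state observation in step 4 can be scripted; a rough sketch using only the standard library (the URL and sample count are made up):

import time, urllib.request

def success_rate(url: str, n: int = 100) -> float:
    """Sample the endpoint n times and return the fraction of 2xx responses."""
    ok = 0
    for _ in range(n):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                ok += 200 <= resp.status < 300
        except Exception:
            pass  # count as a failure
        time.sleep(0.1)
    return ok / n

# Run before, during, and after the pod kill; compare with the hypothesis.
print(f"{success_rate('https://staging.example.com/health'):.1%}")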


What chaos tools and techniques are commonly used?

| Tool | Platform | What it does |
| --- | --- | --- |
| Chaos Mesh | Kubernetes | Pod kill, network latency/loss, CPU stress |
| Litmus | Kubernetes | Full chaos engineering platform, ChaosHub |
| Gremlin | Multi-platform | Managed chaos platform (hosted) |
| AWS FIS | AWS | Native fault injection; stop EC2, inject latency |
| Azure Chaos Studio | Azure | Fault injection for VMs, AKS, SQL |

Common experiments: kill a pod, add 200 ms network latency between services, exhaust CPU, fill disk, simulate DNS failure, block access to a dependency.


What is game day?

A game day is a planned rehearsal where the team deliberately causes failures in a production or staging environment during business hours, with the full team observing.

Purpose:

  • Test incident response process (not just the system).
  • Validate runbooks and escalation paths.
  • Build team confidence in handling failures.
  • Identify organisational and tooling gaps.

Format: announce the game day in advance; define blast radius and rollback procedures; run 2–3 failure scenarios; debrief afterwards.


Toil & Automation

What is toil in the SRE context?

Toil (Google's definition): Work that is manual, repetitive, automatable, tactical, has no enduring value, and scales with service growth.

Examples: manually rotating secrets, responding to the same low-value alert daily, manually provisioning environments from a checklist.

The Google SRE guideline: keep toil below 50% of each SRE's time; the rest should be engineering work (automation, reliability improvements, project work). Unchecked toil crowds out reliability investment.


How do you reduce toil through automation?

  1. Identify: Track on-call tickets categorised as toil. Quantify time spent.
  2. Prioritise: Automate the most frequent and time-consuming items first.
  3. Build: Write runbook steps as code — Ansible playbooks, shell scripts, Terraform modules, custom operators.
  4. Self-service: Build internal tools so other teams can perform routine actions themselves without SRE involvement.
  5. Measure: Track toil percentage monthly; demonstrate reduction.

Example: manually updating DNS on every deployment → automated with ExternalDNS in K8s — removes 30 minutes of toil per deployment.


What is a reliability review / production readiness review?

A Production Readiness Review (PRR) is a checklist-based review of a service before it goes to production, ensuring reliability standards are met.

Typical checklist:

  • SLOs defined and agreed.
  • Monitoring and alerting in place.
  • Runbooks written.
  • Capacity plan documented.
  • DR procedure tested.
  • On-call rotation set up.
  • Incident classification defined.
  • Load tested to 2× expected peak.
  • Secrets management via Key Vault/Vault — no hardcoded creds.
  • Rollback procedure documented and tested.

PRRs prevent services from going to production without SRE support infrastructure.