Intermediate · 2.5 hours · 20 min read

DevOps Engineer Interview Questions

40+ DevOps interview questions covering CI/CD, Docker, Kubernetes, IaC with Terraform, monitoring, and Git workflows — with in-depth, practical answers.

Tags: devops · ci-cd · docker · kubernetes · terraform · git · monitoring · pipelines

Questions: 40+ · Topics: 6 · Est. time: 2.5 hours

CI/CD Pipelines

What is CI/CD and what problem does it solve?

Continuous Integration (CI) is the practice of merging code changes into a shared repository frequently (many times per day), with automated builds and tests running on each merge.

Continuous Delivery (CD) extends CI by automatically deploying every passing build to a staging environment, ready to release to production at any time.

Continuous Deployment goes one step further — every passing build is automatically deployed to production without human approval.

Problems solved:

  • Eliminates long, painful integration phases ("integration hell").
  • Bugs are found within minutes of introduction, not weeks.
  • Deployment becomes a routine, low-risk event rather than a high-stress exercise.
  • Reduces time-to-market for features.

Walk me through a CI/CD pipeline you have built.

Strong answer template:

"My pipeline for a Node.js API used GitHub Actions. On every pull request:

  1. Install dependencies, run ESLint and unit tests.
  2. Build Docker image.
  3. Run container security scan (Trivy).

On merge to main:
  4. Push image to Azure Container Registry (tagged with commit SHA).
  5. Deploy to staging via Helm chart update.
  6. Run integration test suite against staging.
  7. Require manual approval for prod.
  8. Helm upgrade to production; canary rollout.

If any step fails, the pipeline stops; Slack notification is sent to the team channel."

Key elements interviewers look for: automated testing at every stage, security scanning, image tagging strategy, environment separation, rollback capability.
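
For illustration, a minimal GitHub Actions sketch of the pull-request stage described above — the workflow name, image name, and the assumption that Trivy is installed on the runner are all illustrative:

# .github/workflows/ci.yml — sketch of the PR stage only
name: ci
on:
  pull_request:

jobs:
  test-build-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run lint
      - run: npm test
      - run: docker build -t myapp:${{ github.sha }} .
      - name: Scan image (assumes Trivy is available on the runner)
        run: trivy image --exit-code 1 --severity HIGH,CRITICAL myapp:${{ github.sha }}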


What is a blue/green deployment vs a canary release?

Blue/Green:

  • Two identical production environments: blue (current) and green (new version).
  • Traffic switches 0% → 100% instantly at cutover.
  • Rollback: instantly flip back to blue.
  • Requires: 2× infrastructure cost.

Canary:

  • New version receives a small slice of traffic (e.g., 5%), progressively increased as metrics validate stability.
  • Catches issues before 100% of users are affected.
  • Requires: gradual traffic shifting (e.g., Kubernetes Argo Rollouts, AWS CodeDeploy).

Use blue/green when you want instant rollback and can afford the cost. Use canary when you want statistical validation with real production traffic before full rollout.
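
As a sketch of how progressive traffic shifting is declared with Argo Rollouts — the weights, pause durations, image, and labels below are illustrative, not a prescribed configuration:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: myregistry.example.com/api:v1.2.3
  strategy:
    canary:
      steps:
      - setWeight: 5                  # 5% of traffic to the new version
      - pause: {duration: 10m}        # watch error rate and latency before continuing
      - setWeight: 25
      - pause: {duration: 10m}
      - setWeight: 100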


How do you handle failed deployments? What does a rollback strategy look like?

Rollback strategy depends on the deployment method:

  • Helm rollback: helm rollback <release> <revision> — restores to the previous chart version and values. Fast and reliable.
  • Kubernetes: kubectl rollout undo deployment/<name> — rolls back to the previous ReplicaSet.
  • Blue/Green: Switch load balancer target group back to blue immediately.
  • Feature flags: Disable the flag — no redeployment needed.
  • Database: Backward-compatible schema changes (additive only during deployment window). Avoid dropping columns until feature is fully rolled out.

Philosophy: Design every deployment to be rollback-safe. Never make non-backwards-compatible DB changes in the same deployment commit as application changes.


What are environment variables and how should secrets be handled in a pipeline?

Environment variables parameterise application behaviour across environments (dev, staging, prod) — database URLs, feature flag settings, log levels.

Secret handling rules:

  1. Never store secrets in source code or environment variable files committed to git.
  2. Use the platform's secret store: GitHub Secrets, Azure DevOps Key Vault variable groups, HashiCorp Vault.
  3. Use workload identity federation (OIDC) to authenticate CI runners to cloud providers — no static secret needed.
  4. Secrets should only be injected at runtime, visible only to the step that needs them.
  5. Rotate secrets regularly; scan with tools like truffleHog or GitHub Advanced Security for leaks.
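
A hedged GitHub Actions sketch combining rules 3 and 4 — OIDC federation to Azure via azure/login, plus a secret injected only into the step that needs it (the secret names are placeholders):

permissions:
  id-token: write      # lets the job request an OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - name: Run DB migrations
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}   # visible only to this step
        run: npm run migrate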

What is a build artifact and how do you version it?

A build artifact is the immutable output of a CI build — a Docker image, a compiled binary, a ZIP package.

Versioning strategies:

  • Semantic versioning (v1.2.3) for libraries and APIs — meaningful to consumers.
  • Git commit SHA (first 7 chars) for internal services — uniquely traces every artifact to its source code.
  • Build number — sequential, simple, but less traceable.

Best practice: tag Docker images with both the commit SHA and a human-readable tag (e.g., v1.2.3). Never use latest as the only production tag — it's mutable and untraceable.
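
For example, a pipeline step that applies both tags to the same image (registry name and version are illustrative):

- name: Build, tag and push
  run: |
    docker build -t myregistry.azurecr.io/api:${{ github.sha }} .
    docker tag myregistry.azurecr.io/api:${{ github.sha }} myregistry.azurecr.io/api:v1.2.3
    docker push myregistry.azurecr.io/api:${{ github.sha }}
    docker push myregistry.azurecr.io/api:v1.2.3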


What is a pipeline as code? Why is it better than a GUI-configured pipeline?

Pipeline as code stores CI/CD configuration in a file in the repository (.github/workflows/ci.yml, Jenkinsfile, azure-pipelines.yml).

Benefits:

  • Version controlled: Changes are reviewed via pull request, with history and rollback.
  • Reproducible: Every developer sees and can run the same pipeline.
  • Reusable: Shared templates/reusable workflows reduce duplication.
  • Auditable: The diff shows exactly what changed in the pipeline configuration.

GUI-configured pipelines create a separate source of truth, often undocumented, and are difficult to replicate across projects or recover after tool migrations.
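
Reusable workflows take this further — a shared pipeline definition called from each service repository (the org, repo, and input names below are illustrative):

# Shared definition: .github/workflows/build.yml in a central repo
on:
  workflow_call:
    inputs:
      image-name:
        required: true
        type: string

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t ${{ inputs.image-name }}:${{ github.sha }} .

# Caller in a service repo:
# jobs:
#   build:
#     uses: my-org/pipelines/.github/workflows/build.yml@v1
#     with:
#       image-name: api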


What are pipeline caching strategies and why do they matter?

Caching reduces CI time by reusing expensive steps (dependency install, Docker layer builds) from previous runs.

Dependency caching (GitHub Actions cache action): Cache node_modules keyed by package-lock.json hash. If lockfile unchanged, restore cache and skip npm install.

Docker layer caching: Use --cache-from or BuildKit's inline cache to reuse unchanged image layers. Critical for large images — avoids reinstalling OS packages every run.

Why it matters: A pipeline slowed by cache misses kills developer productivity. Aim for < 3 minutes for a typical CI run; dependency install alone can take 2–5 minutes without caching.
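
A minimal caching sketch using the GitHub Actions cache action — this variant caches npm's download cache keyed by the lockfile hash, so npm ci stays correct but runs much faster on a cache hit:

- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ hashFiles('**/package-lock.json') }}
    restore-keys: npm-
- run: npm ci     # fast when the cache above was restored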


Docker & Containers

What problem does Docker solve and how does it differ from a VM?

Docker packages an application and all its dependencies (runtime, libraries, config) into a portable container image that runs identically everywhere.

Aspect | Container | Virtual Machine
Isolation level | Process namespace (OS kernel shared) | Full OS virtualisation
Startup time | Milliseconds | Minutes
Image size | Tens–hundreds of MB | Gigabytes
Overhead | Very low | Higher (full guest OS)
Use case | Application packaging | Full OS isolation, legacy apps

Docker solves the "it works on my machine" problem by ensuring the build environment (OS libraries, runtime version) is captured in the image.


Explain the Dockerfile instructions you use most.

FROM node:20-alpine          # Base image — use slim/alpine variants for smaller images
WORKDIR /app                 # Sets working directory for subsequent instructions
COPY package*.json ./        # Copy dependency manifest first (layer cache optimisation)
RUN npm ci --omit=dev        # Install production deps; npm ci is deterministic
COPY . .                     # Copy application code (after deps — cache hit on code changes)
RUN npm run build
EXPOSE 3000                  # Document the port (does not publish it)
USER node                    # Drop privileges — never run as root
CMD ["node", "dist/index.js"] # Default command

Key practices: multi-stage builds to separate build tooling from runtime image; COPY --chown to set file ownership; use specific image tags not latest.


What is a multi-stage Docker build and why use it?

Multi-stage builds use multiple FROM statements in one Dockerfile. Each stage can copy artefacts from previous stages, discarding build tools and intermediate layers.

# Stage 1: build
FROM golang:1.22 AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o server .

# Stage 2: minimal runtime image
FROM gcr.io/distroless/static
COPY --from=builder /app/server /server
ENTRYPOINT ["/server"]

The final image contains only the compiled binary — no Go toolchain, no source code, no package manager. Result: a 10 MB image instead of 1+ GB.


What is Docker Compose and when would you use it over Kubernetes?

Docker Compose defines and runs multi-container applications via a single docker-compose.yml file. Brings up all services, networks, and volumes with docker compose up.

Use Compose for:

  • Local development — spin up app + database + cache with one command.
  • Simple single-host deployments.
  • Running integration tests in CI.

Use Kubernetes when:

  • You need to orchestrate containers across multiple nodes.
  • You require self-healing, auto-scaling, rolling updates, service discovery.
  • You're running production workloads beyond a single VM.
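
A minimal docker-compose.yml sketch for local development (image tags, ports, and credentials are illustrative):

services:
  api:
    build: .
    ports:
      - "3000:3000"
    environment:
      DB_HOST: db                    # service name resolves via built-in DNS
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: dev-only-password
    volumes:
      - db-data:/var/lib/postgresql/data

volumes:
  db-data: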

How do you reduce Docker image size?

  1. Use slim base images: node:20-alpine (50 MB) vs node:20 (350 MB).
  2. Multi-stage builds: Exclude build tools from the final image.
  3. Minimise layers: Combine RUN commands with &&.
  4. .dockerignore: Exclude node_modules, .git, test files from the build context.
  5. Remove caches: RUN apt-get install -y package && rm -rf /var/lib/apt/lists/*.
  6. Prefer distroless or scratch for compiled languages — no shell, no package manager.

Small images: faster pulls, smaller attack surface, reduced registry storage cost.


What is container image scanning and how do you integrate it into CI?

Container image scanning analyses layers for OS packages and language dependencies with known CVEs (Common Vulnerabilities and Exposures).

Tools: Trivy (open source, fast), Grype, Snyk, Azure Defender for Containers.

CI integration:

- name: Scan image
  run: trivy image --exit-code 1 --severity HIGH,CRITICAL myapp:${{ github.sha }}

--exit-code 1 fails the build on HIGH/CRITICAL CVEs. Best practice: fail on CRITICAL, warn on HIGH, review MEDIUM periodically. Keep base images updated to reduce CVE count.


What does it mean for a container to be stateless?

A stateless container stores no persistent data locally — any state is externalised to a database, object store, or message queue. On restart, the container starts fresh from the same image.

Benefits:

  • Any instance can handle any request (horizontal scaling).
  • Containers are freely replaceable — no migration of local state.
  • Blue/green and rolling deploys are safe.

Stateful workloads (databases, caches) in containers require Persistent Volumes in Kubernetes (e.g., Azure Disk, EBS) to survive pod restarts.
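
For example, a PersistentVolumeClaim that a stateful pod mounts so its data survives restarts (size and storage class are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: managed-csi     # e.g. an Azure Disk CSI class
  resources:
    requests:
      storage: 20Gi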


Explain Docker networking modes.

Mode | Description | Use case
bridge | Default. Container gets a private IP; communicates via virtual bridge. Expose ports explicitly. | Single-host multi-container apps
host | Container shares host network namespace. No port mapping. Linux-only. | Performance-critical, host port access
overlay | Multi-host networking for Docker Swarm / Kubernetes. | Distributed clusters
none | No networking. | Security-isolated jobs

In Docker Compose, containers on the same network can refer to each other by service name (built-in DNS).


Kubernetes

What is Kubernetes and what problems does it solve?

Kubernetes (K8s) is an open-source container orchestrator that automates:

  • Scheduling: placing containers on appropriate nodes.
  • Self-healing: restarting failed containers, replacing failed nodes.
  • Scaling: horizontal pod autoscaling based on CPU/memory/custom metrics.
  • Rolling updates and rollbacks.
  • Service discovery and load balancing.
  • Secret and configuration management.

Without Kubernetes: you'd need to manually place containers on servers, monitor their health, restart failures, and manage networking between services.


Explain Pods, Deployments, and Services.

Pod: The smallest deployable unit. Contains one or more tightly coupled containers sharing network namespace and storage volumes.

Deployment: Manages a desired number of identical Pod replicas (ReplicaSet). Handles rolling updates and rollbacks.

spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1

Service: A stable DNS name and virtual IP for a set of pods (selected by labels). Types:

  • ClusterIP — internal only.
  • NodePort — exposes on every node's port.
  • LoadBalancer — provisions an external cloud LB.
  • ExternalName — maps to an external FQDN.
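
A minimal ClusterIP Service sketch selecting a Deployment's pods by label (names and ports are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  type: ClusterIP
  selector:
    app: api            # matches the pod template labels of the Deployment
  ports:
  - port: 80            # stable port other services call
    targetPort: 8080    # container port on the selected pods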

What is a Namespace and why use it?

Namespaces provide logical isolation within a cluster:

  • Separate teams or environments (dev, staging, prod) without multiple clusters.
  • Per-namespace RBAC: team A can't touch team B's resources.
  • Resource quotas per namespace limit runaway consumption.
  • Network policies can restrict cross-namespace communication.

kube-system is reserved for Kubernetes control plane components. default is the fallback namespace. Production workloads should always be in explicitly named namespaces.
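
A sketch of a namespace paired with a ResourceQuota to cap consumption (names and limits are illustrative):

apiVersion: v1
kind: Namespace
metadata:
  name: team-a-prod
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a-prod
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    pods: "100"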


What is a ConfigMap vs a Secret?

ConfigMap: Stores non-sensitive configuration as key-value pairs or files.

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config          # name is illustrative
data:
  LOG_LEVEL: "info"
  DB_HOST: "postgres.prod.svc.cluster.local"

Secret: Stores sensitive values (passwords, tokens). Values are base64-encoded (not encrypted by default — often encrypted at rest via KMS integration).

Both can be mounted as environment variables or volumes into pods. Secrets should never be committed to git; use External Secrets Operator or a CSI driver to sync from Key Vault / Vault.
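
A sketch of how a pod consumes both — every ConfigMap key as an environment variable, plus a single key from a Secret (the Secret name and key are illustrative):

containers:
- name: api
  image: myregistry.example.com/api:v1.2.3
  envFrom:
  - configMapRef:
      name: app-config            # every key becomes an env var
  env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: password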


What is a Horizontal Pod Autoscaler (HPA)?

HPA automatically scales the number of pod replicas based on observed metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api                  # name is illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

When CPU across all pods exceeds 60%, HPA adds replicas. Requires metrics-server in the cluster. Custom metrics (RPS, queue depth) need KEDA or custom metrics adapter.


What is the difference between a liveness probe and a readiness probe?

Liveness probe: Checks if the container is alive. If it fails, K8s kills and restarts the container. Use for detecting deadlocks or corrupted state.

Readiness probe: Checks if the container is ready to receive traffic. If it fails, the pod is removed from Service endpoints. Use during startup (waiting for DB connection) or transient overload.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Startup probe: Like liveness, but only runs during startup — prevents liveness probe from killing a slow-starting container.
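
A hedged startup probe sketch for a slow-starting container — with these (illustrative) numbers it tolerates up to 5 minutes of startup before liveness checks take over:

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30     # 30 × 10 s = up to 5 minutes to start
  periodSeconds: 10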


How does Kubernetes handle a failed node?

  1. kubelet on the node stops sending heartbeats to the control plane.
  2. After the node heartbeat timeout (default 40 seconds), the node is marked NotReady.
  3. After the pod eviction timeout (default 5 minutes), the pods on that node are marked for eviction.
  4. The ReplicaSet controller schedules replacement pods on healthy nodes.
  5. If using PodDisruptionBudgets, the rescheduling respects minimum available replica counts.

The total recovery time is ~5–6 minutes by default. Tune eviction timeouts for faster recovery in critical services.
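
A PodDisruptionBudget, as mentioned in step 5, is declared like this — it guards voluntary disruptions such as node drains; the name and threshold are illustrative:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2           # never take the api below 2 ready pods during a drain
  selector:
    matchLabels:
      app: api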


What is Helm and what problem does it solve?

Helm is a Kubernetes package manager. It templates K8s manifests using Go templates, making it easy to:

  • Parameterise deployments for different environments (values files).
  • Package and version an entire application as a chart.
  • Install/upgrade/rollback complex multi-manifest deployments with a single command.
  • Share and reuse components (stable charts from Artifact Hub).

helm upgrade --install my-app ./chart \
  -f values-prod.yaml \
  --set image.tag=$IMAGE_TAG \
  --wait

--wait blocks until all pods are healthy — essential for CD pipelines.
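
A hedged sketch of the values file referenced above and the template lines that consume it (keys and defaults are illustrative):

# values-prod.yaml
replicaCount: 5
image:
  repository: myregistry.azurecr.io/api
  tag: ""                   # supplied per deploy via --set image.tag=$IMAGE_TAG

# In templates/deployment.yaml:
#   replicas: {{ .Values.replicaCount }}
#   image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"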


Infrastructure as Code

What is Infrastructure as Code and why does it matter?

IaC manages and provisions infrastructure through machine-readable configuration files rather than manual processes.

Benefits:

  • Reproducibility: Same code produces identical infrastructure every time.
  • Version control: Infrastructure changes reviewed via pull request with history and rollback.
  • Drift detection: IaC tools detect configuration drift and revert to desired state.
  • Documentation: The code is the documentation.
  • Consistency: Eliminates snowflake servers configured manually over years.

Tools: Terraform (multi-cloud), Bicep (Azure), CloudFormation (AWS), Pulumi (code-native).


Explain Terraform's core workflow.

terraform init     # Download provider plugins, initialise backend
terraform plan     # Generate and review the execution plan (diff)
terraform apply    # Apply the plan to create/update/destroy resources
terraform destroy  # Tear down all managed resources

State file: Terraform tracks what it has created in terraform.tfstate. Always store state in a remote backend (S3, Azure Blob, Terraform Cloud) — never in Git (may contain secrets).

Key concepts: providers, resources, data sources, modules, variables, outputs.


What is Terraform state and how do you manage it in a team?

Terraform state maps configuration to real infrastructure. Without it, Terraform can't know what exists and what needs to change.

Team state management:

  • Remote backend: Store state in Azure Blob Storage or S3 — shared, versioned, accessible to all team members.
  • State locking: prevents two apply runs from modifying state simultaneously (race conditions that corrupt state). Azure Blob locks via blob leases; the S3 backend typically uses a DynamoDB table for its lock.
  • Workspaces: Separate state per environment (terraform workspace new prod).

Never manually edit the state file. Use terraform state mv, terraform import, or terraform state rm for state manipulation.


What is the difference between terraform plan and terraform apply?

terraform plan is a dry run — it shows exactly what Terraform will create, update, or destroy without making any changes. Always review the plan before applying.

terraform apply executes the plan. In CI/CD, save the plan to a file (-out=plan.tfplan) and apply exactly that plan: prevents drift between plan and apply.

terraform plan -out=plan.tfplan
# Review/approve
terraform apply plan.tfplan

What are Terraform modules and why should you use them?

A module is a reusable, self-contained collection of Terraform resources. It encapsulates a logical group (e.g., a VNet with subnets and NSGs) and exposes variables/outputs as its interface.

module "vnet" {
  source        = "./modules/vnet"
  name          = "prod-vnet"
  address_space = ["10.0.0.0/16"]
  location      = "uksouth"
}

Benefits: DRY principle, consistent patterns across environments, versioned (via Git tags or registry), independently testable.


How do you test Infrastructure as Code?

Level | Tool | What it tests
Static analysis | terraform validate, tflint | Syntax, type errors, best practices
Security scanning | checkov, tfsec | Misconfigurations (public storage, no encryption)
Unit tests | Terratest (Go), pytest-terraform | Module logic in isolation
Integration tests | Terratest with real cloud resources | Full resource creation and validation
Policy as code | Sentinel (TFE), OPA | Governance rules (e.g., no untagged resources)

Run static analysis and security scanning in every CI run. Integration tests are slower and costlier — reserve them for PRs to main or nightly runs.
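
The first two rows translate into a couple of CI steps, for example (flags kept minimal; treat this as a sketch that assumes the tools are installed on the runner):

- run: terraform init -backend=false && terraform validate
- run: tflint --recursive
- run: checkov -d . --quiet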


Monitoring & Alerting

What is the difference between metrics, logs, and traces?

The three pillars of observability:

Metrics: Numeric measurements sampled over time (CPU %, HTTP request rate, error count). Good for dashboards and threshold alerts. Examples: Prometheus, Azure Monitor Metrics.

Logs: Timestamped, immutable records of discrete events ("user 42 logged in", "DB query failed"). Good for debugging specific incidents. Examples: Application Insights, ELK stack.

Traces: End-to-end records of a request's path through multiple services — shows latency at each hop and where failures occur. Essential for microservices debugging. Examples: Jaeger, Zipkin, Azure Monitor distributed tracing.


What is an SLO and how do you set up alerts for it?

An SLO (Service Level Objective) is a target for a service metric over a time window — e.g., "99.9% of requests complete in under 200 ms over a rolling 30-day window."

Error budget = 1 – SLO. At 99.9%, you have 43.8 minutes/month of allowable downtime.

Alert on error budget burn rate rather than instantaneous thresholds:

  • 1-hour burn rate > 14.4× → page on-call immediately (fast burn — roughly 2% of the 30-day budget consumed in a single hour; at that rate the entire budget is gone in about two days).
  • 6-hour burn rate > 6× → page during business hours (moderate burn).

This avoids noisy alerts on brief spikes that don't meaningfully impact the SLO.
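
As a sketch, the fast-burn condition above expressed as a Prometheus alerting rule — the metric name and labels are illustrative; 0.001 is the error budget for a 99.9% SLO:

groups:
- name: slo-burn
  rules:
  - alert: ErrorBudgetFastBurn
    expr: |
      (
        sum(rate(http_requests_total{code=~"5.."}[1h]))
        /
        sum(rate(http_requests_total[1h]))
      ) > (14.4 * 0.001)
    for: 2m
    labels:
      severity: page
    annotations:
      summary: "Error budget burning at more than 14.4x over the last hour"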


What is a distributed tracing system and why is it needed in microservices?

In a microservice architecture, a single user request might traverse 10+ services. Without distributed tracing, debugging latency or errors means correlating logs from each service manually.

Distributed tracing attaches a trace ID to every request, propagated via HTTP headers (traceparent). Each service records a span — the time spent within that service. All spans for a trace are stitched together to provide a waterfall view.

Tools: Jaeger, Zipkin, Azure Monitor Application Insights, AWS X-Ray, OpenTelemetry (standard instrumentation SDK).
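
A minimal OpenTelemetry Collector config sketch — receive OTLP spans from services and forward them to a Jaeger backend (the endpoint is illustrative):

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]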


What is alert fatigue and how do you prevent it?

Alert fatigue occurs when too many low-quality alerts desensitise on-call engineers, causing them to ignore or dismiss alerts without investigating — eventually missing real incidents.

Prevention:

  • Alert on symptoms, not causes — "error rate > 1%" is actionable; "CPU > 80%" usually isn't.
  • Set good thresholds — based on historical data, not guesses.
  • Require persistence — make the condition hold for 5+ minutes (e.g., a for: duration) before firing, or use multi-window burn-rate alerts.
  • Runbooks: Every alert must have an associated runbook — if you can't write one, the alert probably shouldn't exist.
  • Regular reviews: Remove alerts that page without anyone taking action.

How do you monitor a Kubernetes cluster?

Layer | Tool | What to monitor
Node | Prometheus node-exporter | CPU, memory, disk I/O, network
Kubernetes objects | kube-state-metrics | Pod phases, deployment replicas, PVC status
Application | App instrumentation (Prometheus SDK) | Custom metrics: request rate, latency, error rate
Logs | Fluentd/Fluent Bit → ELK/Loki | Container stdout/stderr
Traces | OpenTelemetry Collector | Distributed request traces
Dashboards | Grafana | Unified view across all layers

Key alerts: pod CrashLoopBackOff, deployment unavailable replicas, node NotReady, persistent volume full.
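
For instance, a CrashLoopBackOff alert built on kube-state-metrics might look like this (thresholds are illustrative):

- alert: PodCrashLooping
  expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.namespace }}/{{ $labels.pod }} is in CrashLoopBackOff"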


Git & Version Control

What is the difference between git merge and git rebase?

Both integrate changes from one branch into another.

Merge: Creates a merge commit preserving the full branch history. Non-destructive; safe on shared branches.

A---B---C---------M   main (M = merge commit with two parents)
         \       /
          D---E       feature

Rebase: Re-applies commits from the feature branch on top of the base branch as if written there. Creates a linear history; cleaner log.

A---B---C---D'---E'  (rebased feature)

Rule: Never rebase shared/public branches (rewrites history, breaks others' clones). Use rebase for local cleanup before merging. Use merge for integrating into main.


What is a Git branching strategy? Describe GitFlow and trunk-based development.

GitFlow:

  • main — production.
  • develop — integration branch.
  • feature/* — individual features off develop.
  • release/* — stabilisation before release.
  • hotfix/* — emergency patches off main.

Suitable for scheduled release cycles, multiple versions in production.

Trunk-Based Development (TBD):

  • Single main (trunk) branch.
  • Developers commit directly to main or use very short-lived feature branches (< 1 day).
  • Feature flags hide incomplete features.
  • CI runs on every commit; always releasable.

Suitable for high-deployment-frequency teams (SaaS, CI/CD-mature).


How do you handle a broken main branch?

  1. Stop the bleeding: Announce in the team channel that main is broken; block further merges until it's fixed.
  2. Identify the cause: git log --oneline -10, check CI run for the failing commit.
  3. Revert vs fix-forward: git revert <sha> creates a new commit undoing the change (safe for a shared branch) — prefer it for complex changes. For simple issues, fix forward by committing the fix directly to main.
  4. Verify CI green before announcing main is stable again.

Never use git reset --hard on main in a shared repository — it rewrites history and breaks everyone else's clones.


What is a pull request and what should a good code review process look like?

A pull request (PR) is a request to merge a branch into a base branch, with a built-in review workflow.

A good code review:

  • Automated checks first: CI must pass before human review starts (tests, lint, security).
  • Focused PRs: Small, single-concern PRs are reviewed faster and more thoroughly.
  • What reviewers check: Correctness, edge cases, security implications, test coverage, performance, readability.
  • Constructive feedback: Suggest alternatives, not just problems. Distinguish blocking vs. non-blocking comments.
  • Approval requirements: At least one (ideally two) approvals before merge, especially for main.
  • Merge strategy: Squash and merge for feature branches (clean history); merge commit for release branches.

How do you resolve a merge conflict?

  1. git fetch && git rebase origin/main (or git merge origin/main).
  2. Git marks conflicting files with conflict markers.
  3. Open the file; the conflict looks like:
<<<<<<< HEAD
const timeout = 3000;
=======
const timeout = 5000;
>>>>>>> feature/update-timeout
  4. Edit the file to the correct resolution (may combine both changes, or choose one).
  5. Remove the conflict markers.
  6. git add <resolved-file> then git rebase --continue (or git commit for a merge).
  7. Push; the conflict is resolved.

Prefer a 3-way merge tool (VS Code, IntelliJ, git mergetool) for complex conflicts.