# Platform Engineer Interview Questions
35+ platform engineering interview questions covering internal developer platforms, Kubernetes, GitOps, service mesh, developer experience, and observability — for senior/staff-level roles.
Questions: 17+ · Topics: 7 · Est. time: 3 hours
## Internal Developer Platforms (IDP)

### What is a platform engineer and how does the role differ from a DevOps engineer?
A platform engineer builds and maintains the internal tooling, infrastructure, and self-service capabilities that enable product engineers to deploy and operate their services efficiently.
| | DevOps Engineer | Platform Engineer |
|---|---|---|
| Focus | CI/CD pipelines, delivery processes | Internal developer platform (IDP), paved roads |
| Customer | Operations team | Product/application engineers |
| Output | Working pipelines | Composable platform capabilities |
| Mindset | Process improvement | Product thinking for internal tools |
Platform engineering applies product management principles to internal infrastructure — the "customers" are developers, and the "product" is the platform itself.
### What is an Internal Developer Platform (IDP) and what are its core components?
An IDP is a self-service layer over infrastructure that lets developers deploy, operate, and observe their services without deep platform expertise.
Core components:
- Service catalog — inventory of services with ownership, dependencies, documentation (e.g., Backstage).
- Infrastructure self-service — golden path templates for provisioning databases, queues, storage (via Terraform modules or Crossplane).
- CI/CD workflows — standardised pipelines that developers customise without writing from scratch.
- Secrets management — centralised vault (Vault, AWS Secrets Manager, External Secrets Operator).
- Observability stack — pre-configured metrics, logs, traces with per-team dashboards.
- Environment management — ephemeral environments for PRs, staging promotion workflows.
The goal: reduce cognitive load. Developers focus on business logic; the platform handles the rest.
### How would you design a "golden path" for a new microservice?
A golden path is an opinionated, well-supported route that works for most teams. For a new microservice:
- Template in Backstage (or similar) → developer fills in service name, language, owner, SLA tier.
- Scaffolding → Backstage creates: GitHub repo with standard structure, CI pipeline (GitHub Actions), Dockerfile, Helm chart skeleton, Datadog dashboards, PagerDuty integration.
- GitOps registration → ArgoCD Application manifest added to the platform config repo → service auto-deploys to dev.
- Promotions — PR to the `staging` branch → auto-deploy. PR to `main` → requires approval + smoke tests → prod.
- Infrastructure — if the service needs a database, a Terraform module request via the catalog → infrastructure provisioned and connection string injected as a Kubernetes Secret via External Secrets Operator.
The developer never touches Kubernetes YAML, Terraform, or CI config directly — unless they need to escape the golden path.
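The scaffolding step could be driven by a Backstage Software Template along these lines — a minimal sketch; the template name, skeleton path, and GitHub org are placeholders:

```yaml
# Hypothetical Backstage scaffolder template (names and paths are illustrative)
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: microservice-golden-path
  title: New Microservice
spec:
  parameters:
    - title: Service details
      required: [name, owner]
      properties:
        name:
          type: string
        owner:
          type: string
  steps:
    - id: fetch
      action: fetch:template      # render the skeleton repo with the parameters
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          owner: ${{ parameters.owner }}
    - id: publish
      action: publish:github      # create the repo in the org
      input:
        repoUrl: github.com?owner=org&repo=${{ parameters.name }}
```

The skeleton directory would contain the standard repo structure, CI workflow, Dockerfile, and Helm chart described above, templated with the developer's inputs.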
## Kubernetes

### Explain the Kubernetes control plane components and what each does.
| Component | Role |
|---|---|
| kube-apiserver | REST API frontend — all kubectl commands, controllers, and nodes talk to this |
| etcd | Distributed key-value store — the source of truth for all cluster state |
| kube-scheduler | Assigns pods to nodes based on resources, affinity, taints/tolerations |
| kube-controller-manager | Runs control loops: Deployment, ReplicaSet, Node, Job controllers |
| cloud-controller-manager | Interfaces with the cloud provider API (provision load balancers, volumes) |
On worker nodes: kubelet (runs pods), kube-proxy (manages iptables/ipvs rules for Service networking), and the container runtime (containerd).
### How does Kubernetes networking work? Explain how a request reaches a pod.
Kubernetes networking requires every pod to have a unique IP and be reachable from every other pod without NAT — the CNI plugin (Calico, Cilium, Flannel) implements this.
Request flow for a Service:
```
External traffic
  → LoadBalancer Service (cloud LB)
  → NodePort on worker nodes
  → ClusterIP (virtual IP)
  → kube-proxy rewrites to a pod IP (iptables/IPVS rule)
  → Pod (matched by selector label)
```
For Ingress:
```
External traffic
  → Cloud LB → Ingress Controller (nginx, Traefik)
  → Ingress rules (host/path matching)
  → ClusterIP Service
  → Pod
```
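The Ingress path above corresponds to a manifest roughly like this — a sketch; the hostname, service name, and ingress class are placeholders:

```yaml
# Minimal Ingress sketch (host and service names are illustrative)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  ingressClassName: nginx        # selects the Ingress Controller
  rules:
    - host: app.example.com      # host matching
      http:
        paths:
          - path: /              # path matching
            pathType: Prefix
            backend:
              service:
                name: my-app     # the ClusterIP Service in front of the pods
                port:
                  number: 80
```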
### A pod is in CrashLoopBackOff. Walk through your debugging steps.
```bash
# 1. Describe the pod — look at the Events section
kubectl describe pod <pod-name> -n <namespace>

# 2. Check current logs
kubectl logs <pod-name> -n <namespace>

# 3. Check the previous container's logs (if the pod has restarted)
kubectl logs <pod-name> -n <namespace> --previous

# 4. Check resource constraints — OOMKilled?
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState}'

# 5. Exec into the container (if it starts briefly)
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
```
Common causes: application error (exit code 1), OOMKilled (memory limit too low, exit code 137), missing secret/configmap, failed readiness probe preventing traffic but liveness probe killing it.
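If step 4 shows OOMKilled (exit code 137), the fix usually lives in the container's resource spec — a sketch with illustrative values; size them from observed usage, not from this example:

```yaml
# Container resources sketch — values are placeholders
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"   # exit code 137 usually means this limit was too low
```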
### What are the differences between Deployments, StatefulSets, and DaemonSets?
| Controller | Use case | Pod identity | Storage |
|---|---|---|---|
| Deployment | Stateless apps (API servers, web apps) | Interchangeable | Ephemeral or shared |
| StatefulSet | Stateful apps (databases, Kafka, Zookeeper) | Stable, ordered names (pod-0, pod-1) | Dedicated PVC per pod |
| DaemonSet | One pod per node (log collectors, monitoring agents) | Per-node identity | Node-local storage |
StatefulSet specifics: pods start/stop in order, each gets a stable network identity (pod-0.service.ns.svc), and PVCs are retained even when the pod is deleted.
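The PVC-per-pod behaviour comes from `volumeClaimTemplates` — a minimal StatefulSet sketch with illustrative names and sizes:

```yaml
# Minimal StatefulSet sketch (names, image, and storage size are illustrative)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db            # headless Service providing db-0.db.ns.svc identities
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: postgres:16
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:      # one dedicated PVC per pod, retained when the pod is deleted
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```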
## GitOps

### What is GitOps and how does it differ from push-based CI/CD?
GitOps treats Git as the single source of truth for both application code and infrastructure state. A GitOps operator (ArgoCD, Flux) continuously reconciles the cluster state with the desired state declared in Git.
| | Push-based CI/CD | GitOps |
|---|---|---|
| Trigger | Pipeline pushes to cluster on commit | Operator pulls from Git on a schedule/webhook |
| Cluster access | CI runner has kubectl/API credentials | Only the in-cluster operator has credentials |
| Drift | Not detected | Detected and auto-reconciled |
| Rollback | Re-run pipeline with old commit | git revert → operator reconciles |
| Audit | Pipeline logs | Git history |
Security advantage: no external system has cluster credentials — the cluster pulls its own desired state.
### How does ArgoCD work? Describe its sync mechanism.
ArgoCD runs as a controller inside the cluster. It continuously:
- Watches Git (polling or webhook) for changes to the configured repository/branch.
- Compares the live cluster state to the desired state in Git (using Kubernetes resource diffs).
- If OutOfSync → either auto-syncs (if configured) or raises an alert for manual sync.
- Applies the manifests (via kubectl apply or Helm/Kustomize rendering).
- Reports sync status, health status, and resource tree in the UI.
```yaml
# Example Application manifest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/gitops-repo
    targetRevision: main
    path: apps/my-app/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual cluster changes
```
### How do you promote a release from staging to production in a GitOps workflow?
Image update strategy (app repo → config repo):
```
Developer merges PR to main
  → CI builds image, tags with commit SHA
  → CI opens PR in config repo:
      apps/my-app/overlays/staging/kustomization.yaml
      newTag: abc1234
  → ArgoCD detects change → syncs to staging
  → Automated tests run against staging
```

Promotion to prod:

```
  → Platform engineer (or automated gate) opens PR:
      apps/my-app/overlays/prod/kustomization.yaml
      newTag: abc1234 (same SHA that passed staging)
  → PR approved → merged
  → ArgoCD syncs to prod
```
This creates an immutable, auditable trail — you can see exactly which commit is running in prod at any time.
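The overlay file touched by the promotion PR might look like this — a sketch; the registry name and base path are placeholders:

```yaml
# apps/my-app/overlays/prod/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: registry.example.com/my-app
    newTag: abc1234          # the same SHA that passed staging
```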
## Service Mesh

### What problem does a service mesh solve?
In a microservices architecture, many services communicate over the network. Cross-cutting concerns that every service needs — mTLS encryption, circuit breaking, retries, rate limiting, distributed tracing — would otherwise have to be implemented by each team in application code.
A service mesh (Istio, Linkerd, Cilium) moves these concerns into the infrastructure layer using sidecar proxies (or eBPF), so application code is free of networking logic.
Key capabilities:
- mTLS — automatic mutual TLS between all services (zero-trust)
- Traffic management — canary releases, A/B testing, traffic splitting at the mesh level
- Observability — automatic metrics, traces, and logs for every service-to-service call
- Circuit breaking — automatic failure isolation without code changes
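As an example of how little configuration this takes in Istio, mesh-wide mTLS can be enforced with a single resource — a sketch using Istio's PeerAuthentication API:

```yaml
# Enforce mTLS for the whole mesh by placing this in Istio's root namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT    # sidecars reject plaintext service-to-service traffic
```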
### Explain Istio's traffic management for a canary deployment.
```yaml
# VirtualService: route 90% to v1, 10% to v2
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service
            subset: v1
          weight: 90
        - destination:
            host: my-service
            subset: v2
          weight: 10
---
# DestinationRule: define subsets (v1, v2)
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service
spec:
  host: my-service
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
```
Monitor v2 error rates and latency in Kiali/Grafana → gradually shift weight to 100% v2 → delete v1 Deployment.
## Observability

### What is the difference between metrics, logs, and traces? How do you use each for troubleshooting?
| Signal | What it tells you | Tool examples |
|---|---|---|
| Metrics | Aggregated numerical data over time | Prometheus, Datadog, CloudWatch |
| Logs | Discrete events with context | ELK, Loki, CloudWatch Logs |
| Traces | End-to-end journey of a request across services | Jaeger, Zipkin, AWS X-Ray |
Troubleshooting workflow:
- Metrics alert fires — "p99 latency spike on checkout service."
- Logs — filter checkout service logs for errors around the spike time → find a timeout calling payment service.
- Traces — find the specific trace ID from a failed request → see the entire call chain → identify that payment service → fraud-check API is the bottleneck.
### What are the four golden signals and why do SRE/Platform teams monitor them?
From Google's SRE book:
| Signal | Metric | Example |
|---|---|---|
| Latency | Response time (p50, p99, p999) | 95th percentile API latency > 200ms |
| Traffic | Request rate | Requests per second |
| Errors | Error rate | HTTP 5xx % of total requests |
| Saturation | Resource utilisation headroom | CPU at 80%, memory at 90% |
These four signals catch almost every class of production problem. Alert on them before building custom metrics — they answer "is the service broken?" immediately.
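As a sketch, the latency and error signals translate into Prometheus alerting rules like these — the metric names assume standard HTTP histogram/counter instrumentation, so adjust them to your exporter:

```yaml
# Prometheus rule sketch for two golden signals (thresholds are illustrative)
groups:
  - name: golden-signals
    rules:
      - alert: HighErrorRate            # Errors: 5xx share of all requests > 5%
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
      - alert: HighLatencyP99           # Latency: p99 above 200ms
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.2
        for: 5m
```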
## Helm & Kustomize

### When would you choose Helm over Kustomize?
| | Helm | Kustomize |
|---|---|---|
| Templating | Go templates ({{ .Values.foo }}) | Patch overlays (no templates) |
| Package format | Chart (versioned, shareable) | Plain Kubernetes manifests |
| Values | values.yaml per environment | Overlays per environment |
| Use case | Distributing software (Prometheus, nginx ingress) | Environment-specific customisation of your own manifests |
| Secrets | Requires helm-secrets plugin | Use External Secrets Operator |
Recommendation: Use Helm for third-party dependencies (from ArtifactHub). Use Kustomize for managing your own applications across environments. Combine: Helm for the chart, Kustomize overlay for per-environment patches.
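The "combine" approach can be expressed with Kustomize's `helmCharts` field (rendered with `kustomize build --enable-helm`) — a sketch; the chart, repo, version, and patch filename are illustrative:

```yaml
# kustomization.yaml sketch: render a third-party chart, then apply local patches
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
helmCharts:
  - name: ingress-nginx
    repo: https://kubernetes.github.io/ingress-nginx
    version: 4.10.0
    releaseName: ingress-nginx
    valuesFile: values-prod.yaml   # Helm values for this environment
patches:
  - path: replica-patch.yaml       # Kustomize patch on top of the rendered chart
```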
### How do you handle Kubernetes secrets securely in a GitOps workflow?
Never commit plaintext secrets to Git. Strategies:
- External Secrets Operator — defines an `ExternalSecret` CR that reads from Vault, AWS Secrets Manager, or Azure Key Vault and creates a Kubernetes Secret. Only the reference is in Git.
- Sealed Secrets — encrypts secrets with a cluster-specific key; the encrypted `SealedSecret` is safe to commit. Only the cluster can decrypt.
- Vault Agent Injector — a Vault sidecar injects secrets as files into pods at runtime; no secrets in Kubernetes etcd at all.
```yaml
# ExternalSecret example
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: db-secret
  data:
    - secretKey: password
      remoteRef:
        key: prod/myapp/db
        property: password
```
## Developer Experience

### How do you measure the effectiveness of a platform?
Use the DORA metrics as the primary signal that your platform is improving developer productivity:
| Metric | Measures | Target (Elite) |
|---|---|---|
| Deployment frequency | How often teams deploy | Multiple times per day |
| Lead time for changes | Commit → production | Less than 1 hour |
| Change failure rate | % deploys causing incidents | Less than 5% |
| Mean time to restore (MTTR) | Recovery time from failure | Less than 1 hour |
Secondary platform-specific metrics: time to first deployment for a new service, support ticket volume from developers, NPS (Net Promoter Score) from developer surveys.
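As an illustration, the first three DORA metrics can be computed directly from raw deploy events — a minimal Python sketch over hypothetical data; real pipelines would pull these records from CI/CD and incident-management APIs:

```python
from datetime import datetime, timedelta

def dora_summary(deploys):
    """Compute simple DORA metrics from deploy records.

    Each record: (commit_time, deploy_time, caused_incident).
    Illustrative only — window handling is deliberately naive.
    """
    if not deploys:
        return None
    # Span of the observation window in whole days (at least 1)
    days = max((max(d[1] for d in deploys) - min(d[1] for d in deploys)).days, 1)
    # Lead time for changes: commit -> production, in hours
    lead_times = [(d[1] - d[0]).total_seconds() / 3600 for d in deploys]
    return {
        "deploys_per_day": len(deploys) / days,
        "avg_lead_time_hours": sum(lead_times) / len(lead_times),
        "change_failure_rate": sum(1 for d in deploys if d[2]) / len(deploys),
    }

# Hypothetical deploy events: two deploys, one caused an incident
t0 = datetime(2024, 1, 1)
deploys = [
    (t0, t0 + timedelta(minutes=30), False),
    (t0 + timedelta(days=1), t0 + timedelta(days=1, hours=1), True),
]
print(dora_summary(deploys))
# -> {'deploys_per_day': 2.0, 'avg_lead_time_hours': 0.75, 'change_failure_rate': 0.5}
```

MTTR would come from incident records (detection time to recovery time) rather than deploy events, so it is omitted here.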
