# Platform Engineer Interview Questions
35+ platform engineering interview questions covering internal developer platforms, Kubernetes, GitOps, service mesh, developer experience, and observability — for senior/staff-level roles.
Questions: 17+ · Topics: 7 · Est. time: 3 hours
## Internal Developer Platforms (IDP)

### What is a platform engineer and how does the role differ from a DevOps engineer?
A platform engineer builds and maintains the internal tooling, infrastructure, and self-service capabilities that enable product engineers to deploy and operate their services efficiently.
| | DevOps Engineer | Platform Engineer |
|---|---|---|
| Focus | CI/CD pipelines, delivery processes | Internal developer platform (IDP), paved roads |
| Customer | Operations team | Product/application engineers |
| Output | Working pipelines | Composable platform capabilities |
| Mindset | Process improvement | Product thinking for internal tools |
Platform engineering applies product management principles to internal infrastructure — the "customers" are developers, and the "product" is the platform itself.
### What is an Internal Developer Platform (IDP) and what are its core components?
An IDP is a self-service layer over infrastructure that lets developers deploy, operate, and observe their services without deep platform expertise.
Core components:
- Service catalog — inventory of services with ownership, dependencies, documentation (e.g., Backstage).
- Infrastructure self-service — golden path templates for provisioning databases, queues, storage (via Terraform modules or Crossplane).
- CI/CD workflows — standardised pipelines that developers customise without writing from scratch.
- Secrets management — centralised vault (Vault, AWS Secrets Manager, External Secrets Operator).
- Observability stack — pre-configured metrics, logs, traces with per-team dashboards.
- Environment management — ephemeral environments for PRs, staging promotion workflows.
The goal: reduce cognitive load. Developers focus on business logic; the platform handles the rest.
### How would you design a "golden path" for a new microservice?
A golden path is an opinionated, well-supported route that works for most teams. For a new microservice:
- Template in Backstage (or similar) → developer fills in service name, language, owner, SLA tier.
- Scaffolding → Backstage creates: GitHub repo with standard structure, CI pipeline (GitHub Actions), Dockerfile, Helm chart skeleton, Datadog dashboards, PagerDuty integration.
- GitOps registration → ArgoCD Application manifest added to the platform config repo → service auto-deploys to dev.
- Promotions — PR to the `staging` branch → auto-deploy. PR to `main` → requires approval + smoke tests → prod.
- Infrastructure — if the service needs a database, a Terraform module request via the catalog → infrastructure provisioned and connection string injected as a Kubernetes Secret via External Secrets Operator.
The developer never touches Kubernetes YAML, Terraform, or CI config directly — unless they need to escape the golden path.
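The scaffolding step could be driven by a Backstage Software Template along these lines — a minimal sketch; the template name, skeleton path, and GitHub org are placeholders:

```yaml
# Hypothetical Backstage scaffolder template (names and paths are illustrative)
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: microservice-golden-path
  title: New Microservice
spec:
  parameters:
    - title: Service details
      required: [name, owner]
      properties:
        name:
          type: string
        owner:
          type: string
  steps:
    - id: fetch
      action: fetch:template      # render the skeleton repo with the parameters
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          owner: ${{ parameters.owner }}
    - id: publish
      action: publish:github      # create the repo in the org
      input:
        repoUrl: github.com?owner=org&repo=${{ parameters.name }}
```

The skeleton directory would contain the standard repo structure, CI workflow, Dockerfile, and Helm chart described above, templated with the developer's inputs.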
## Kubernetes

### Explain the Kubernetes control plane components and what each does.
| Component | Role |
|---|---|
| kube-apiserver | REST API frontend — all kubectl commands, controllers, and nodes talk to this |
| etcd | Distributed key-value store — the source of truth for all cluster state |
| kube-scheduler | Assigns pods to nodes based on resources, affinity, taints/tolerations |
| kube-controller-manager | Runs control loops: Deployment, ReplicaSet, Node, Job controllers |
| cloud-controller-manager | Interfaces with the cloud provider API (provision load balancers, volumes) |
On worker nodes: kubelet (runs pods), kube-proxy (manages iptables/ipvs rules for Service networking), and the container runtime (containerd).
### How does Kubernetes networking work? Explain how a request reaches a pod.
Kubernetes networking requires every pod to have a unique IP and be reachable from every other pod without NAT — the CNI plugin (Calico, Cilium, Flannel) implements this.
Request flow for a Service:
```
External traffic
  → LoadBalancer Service (cloud LB)
  → NodePort on worker nodes
  → ClusterIP (virtual IP)
  → kube-proxy rewrites to a pod IP (iptables/IPVS rule)
  → Pod (matched by selector label)
```
For Ingress:
```
External traffic
  → Cloud LB → Ingress Controller (nginx, Traefik)
  → Ingress rules (host/path matching)
  → ClusterIP Service
  → Pod
```
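The Ingress path above corresponds to a manifest roughly like this — a sketch; the hostname, service name, and ingress class are placeholders:

```yaml
# Minimal Ingress sketch (host and service names are illustrative)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  ingressClassName: nginx        # selects the Ingress Controller
  rules:
    - host: app.example.com      # host matching
      http:
        paths:
          - path: /              # path matching
            pathType: Prefix
            backend:
              service:
                name: my-app     # the ClusterIP Service in front of the pods
                port:
                  number: 80
```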
### A pod is in CrashLoopBackOff. Walk through your debugging steps.
```bash
# 1. Describe the pod — look at the Events section
kubectl describe pod <pod-name> -n <namespace>

# 2. Check current logs
kubectl logs <pod-name> -n <namespace>

# 3. Check the previous container's logs (if the pod has restarted)
kubectl logs <pod-name> -n <namespace> --previous

# 4. Check resource constraints — OOMKilled?
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState}'

# 5. Exec into the container (if it starts briefly)
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
```
Common causes: application error (exit code 1), OOMKilled (memory limit too low, exit code 137), missing secret/configmap, failed readiness probe preventing traffic but liveness probe killing it.
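If step 4 shows OOMKilled (exit code 137), the fix usually lives in the container's resource spec — a sketch with illustrative values; size them from observed usage, not from this example:

```yaml
# Container resources sketch — values are placeholders
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"   # exit code 137 usually means this limit was too low
```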
### What are the differences between Deployments, StatefulSets, and DaemonSets?
| Controller | Use case | Pod identity | Storage |
|---|---|---|---|
| Deployment | Stateless apps (API servers, web apps) | Interchangeable | Ephemeral or shared |
| StatefulSet | Stateful apps (databases, Kafka, Zookeeper) | Stable, ordered names (pod-0, pod-1) | Dedicated PVC per pod |
| DaemonSet | One pod per node (log collectors, monitoring agents) | Per-node identity | Node-local storage |
StatefulSet specifics: pods start/stop in order, each gets a stable network identity (pod-0.service.ns.svc), and PVCs are retained even when the pod is deleted.
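The PVC-per-pod behaviour comes from `volumeClaimTemplates` — a minimal StatefulSet sketch with illustrative names and sizes:

```yaml
# Minimal StatefulSet sketch (names, image, and storage size are illustrative)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db            # headless Service providing db-0.db.ns.svc identities
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: postgres:16
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:      # one dedicated PVC per pod, retained when the pod is deleted
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```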
## GitOps

### What is GitOps and how does it differ from push-based CI/CD?
GitOps treats Git as the single source of truth for both application code and infrastructure state. A GitOps operator (ArgoCD, Flux) continuously reconciles the cluster state with the desired state declared in Git.
| | Push-based CI/CD | GitOps |
|---|---|---|
| Trigger | Pipeline pushes to cluster on commit | Operator pulls from Git on a schedule/webhook |
| Cluster access | CI runner has kubectl/API credentials | Only the in-cluster operator has credentials |
| Drift | Not detected | Detected and auto-reconciled |
| Rollback | Re-run pipeline with old commit | git revert → operator reconciles |
| Audit | Pipeline logs | Git history |
Security advantage: no external system has cluster credentials — the cluster pulls its own desired state.
### How does ArgoCD work? Describe its sync mechanism.
ArgoCD runs as a controller inside the cluster. It continuously:
- Watches Git (polling or webhook) for changes to the configured repository/branch.
- Compares the live cluster state to the desired state in Git (using Kubernetes resource diffs).
- If OutOfSync → either auto-syncs (if configured) or raises an alert for manual sync.
- Applies the manifests (via kubectl apply or Helm/Kustomize rendering).
- Reports sync status, health status, and resource tree in the UI.
```yaml
# Example Application manifest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/gitops-repo
    targetRevision: main
    path: apps/my-app/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual cluster changes
```
### How do you promote a release from staging to production in a GitOps workflow?
Image update strategy (app repo → config repo):
```
Developer merges PR to main
  → CI builds image, tags with commit SHA
  → CI opens PR in config repo:
      apps/my-app/overlays/staging/kustomization.yaml
      newTag: abc1234
  → ArgoCD detects change → syncs to staging
  → Automated tests run against staging
```

Promotion to prod:

```
  → Platform engineer (or automated gate) opens PR:
      apps/my-app/overlays/prod/kustomization.yaml
      newTag: abc1234 (same SHA that passed staging)
  → PR approved → merged
  → ArgoCD syncs to prod
```
This creates an immutable, auditable trail — you can see exactly which commit is running in prod at any time.
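The overlay file touched by the promotion PR might look like this — a sketch; the registry name and base path are placeholders:

```yaml
# apps/my-app/overlays/prod/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: registry.example.com/my-app
    newTag: abc1234          # the same SHA that passed staging
```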
## Service Mesh

### What problem does a service mesh solve?
In a microservices architecture, many services communicate over the network. Cross-cutting concerns that every service needs — mTLS encryption, circuit breaking, retries, rate limiting, distributed tracing — would otherwise have to be implemented by each team in application code.
A service mesh (Istio, Linkerd, Cilium) moves these concerns into the infrastructure layer using sidecar proxies (or eBPF), so application code is free of networking logic.
Key capabilities:
- mTLS — automatic mutual TLS between all services (zero-trust)
- Traffic management — canary releases, A/B testing, traffic splitting at the mesh level
- Observability — automatic metrics, traces, and logs for every service-to-service call
- Circuit breaking — automatic failure isolation without code changes
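As an example of how little configuration this takes in Istio, mesh-wide mTLS can be enforced with a single resource — a sketch using Istio's PeerAuthentication API:

```yaml
# Enforce mTLS for the whole mesh by placing this in Istio's root namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT    # sidecars reject plaintext service-to-service traffic
```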
### Explain Istio's traffic management for a canary deployment.
```yaml
# VirtualService: route 90% to v1, 10% to v2
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service
            subset: v1
          weight: 90
        - destination:
            host: my-service
            subset: v2
          weight: 10
---
# DestinationRule: define subsets (v1, v2)
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service
spec:
  host: my-service
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
```
Monitor v2 error rates and latency in Kiali/Grafana → gradually shift weight to 100% v2 → delete v1 Deployment.
## Observability

### What is the difference between metrics, logs, and traces? How do you use each for troubleshooting?
| Signal | What it tells you | Tool examples |
|---|---|---|
| Metrics | Aggregated numerical data over time | Prometheus, Datadog, CloudWatch |
| Logs | Discrete events with context | ELK, Loki, CloudWatch Logs |
| Traces | End-to-end journey of a request across services | Jaeger, Zipkin, AWS X-Ray |
Troubleshooting workflow:
- Metrics alert fires — "p99 latency spike on checkout service."
- Logs — filter checkout service logs for errors around the spike time → find a timeout calling payment service.
- Traces — find the specific trace ID from a failed request → see the entire call chain → identify that payment service → fraud-check API is the bottleneck.
### What are the four golden signals and why do SRE/Platform teams monitor them?
From Google's SRE book:
| Signal | Metric | Example |
|---|---|---|
| Latency | Response time (p50, p99, p999) | 95th percentile API latency > 200ms |
| Traffic | Request rate | Requests per second |
| Errors | Error rate | HTTP 5xx % of total requests |
| Saturation | Resource utilisation headroom | CPU at 80%, memory at 90% |
These four signals catch almost every class of production problem. Alert on them before building custom metrics — they answer "is the service broken?" immediately.
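As a sketch, the latency and error signals translate into Prometheus alerting rules like these — the metric names assume standard HTTP histogram/counter instrumentation, so adjust them to your exporter:

```yaml
# Prometheus rule sketch for two golden signals (thresholds are illustrative)
groups:
  - name: golden-signals
    rules:
      - alert: HighErrorRate            # Errors: 5xx share of all requests > 5%
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
      - alert: HighLatencyP99           # Latency: p99 above 200ms
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.2
        for: 5m
```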
## Helm & Kustomize

### When would you choose Helm over Kustomize?
| | Helm | Kustomize |
|---|---|---|
| Templating | Go templates ({{ .Values.foo }}) | Patch overlays (no templates) |
| Package format | Chart (versioned, shareable) | Plain Kubernetes manifests |
| Values | values.yaml per environment | Overlays per environment |
| Use case | Distributing software (Prometheus, nginx ingress) | Environment-specific customisation of your own manifests |
| Secrets | Requires helm-secrets plugin | Use External Secrets Operator |
Recommendation: Use Helm for third-party dependencies (from ArtifactHub). Use Kustomize for managing your own applications across environments. Combine: Helm for the chart, Kustomize overlay for per-environment patches.
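The "combine" approach can be expressed with Kustomize's `helmCharts` field (rendered with `kustomize build --enable-helm`) — a sketch; the chart, repo, version, and patch filename are illustrative:

```yaml
# kustomization.yaml sketch: render a third-party chart, then apply local patches
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
helmCharts:
  - name: ingress-nginx
    repo: https://kubernetes.github.io/ingress-nginx
    version: 4.10.0
    releaseName: ingress-nginx
    valuesFile: values-prod.yaml   # Helm values for this environment
patches:
  - path: replica-patch.yaml       # Kustomize patch on top of the rendered chart
```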
### How do you handle Kubernetes secrets securely in a GitOps workflow?
Never commit plaintext secrets to Git. Strategies:
- External Secrets Operator — defines an `ExternalSecret` CR that reads from Vault, AWS Secrets Manager, or Azure Key Vault and creates a Kubernetes Secret. Only the reference is in Git.
- Sealed Secrets — encrypts secrets with a cluster-specific key; the encrypted `SealedSecret` is safe to commit. Only the cluster can decrypt.
- Vault Agent Injector — a Vault sidecar injects secrets as files into pods at runtime; no secrets in Kubernetes etcd at all.
```yaml
# ExternalSecret example
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: db-secret
  data:
    - secretKey: password
      remoteRef:
        key: prod/myapp/db
        property: password
```
## Developer Experience

### How do you measure the effectiveness of a platform?
Use the DORA metrics as the primary signal that your platform is improving developer productivity:
| Metric | Measures | Target (Elite) |
|---|---|---|
| Deployment frequency | How often teams deploy | Multiple times per day |
| Lead time for changes | Commit → production | Less than 1 hour |
| Change failure rate | % deploys causing incidents | Less than 5% |
| Mean time to restore (MTTR) | Recovery time from failure | Less than 1 hour |
Secondary platform-specific metrics: time to first deployment for a new service, support ticket volume from developers, NPS (Net Promoter Score) from developer surveys.
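As an illustration, the first three DORA metrics can be computed directly from raw deploy events — a minimal Python sketch over hypothetical data; real pipelines would pull these records from CI/CD and incident-management APIs:

```python
from datetime import datetime, timedelta

def dora_summary(deploys):
    """Compute simple DORA metrics from deploy records.

    Each record: (commit_time, deploy_time, caused_incident).
    Illustrative only — window handling is deliberately naive.
    """
    if not deploys:
        return None
    # Span of the observation window in whole days (at least 1)
    days = max((max(d[1] for d in deploys) - min(d[1] for d in deploys)).days, 1)
    # Lead time for changes: commit -> production, in hours
    lead_times = [(d[1] - d[0]).total_seconds() / 3600 for d in deploys]
    return {
        "deploys_per_day": len(deploys) / days,
        "avg_lead_time_hours": sum(lead_times) / len(lead_times),
        "change_failure_rate": sum(1 for d in deploys if d[2]) / len(deploys),
    }

# Hypothetical deploy events: two deploys, one caused an incident
t0 = datetime(2024, 1, 1)
deploys = [
    (t0, t0 + timedelta(minutes=30), False),
    (t0 + timedelta(days=1), t0 + timedelta(days=1, hours=1), True),
]
print(dora_summary(deploys))
# -> {'deploys_per_day': 2.0, 'avg_lead_time_hours': 0.75, 'change_failure_rate': 0.5}
```

MTTR would come from incident records (detection time to recovery time) rather than deploy events, so it is omitted here.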
