
Platform Engineer Interview Questions

35+ platform engineering interview questions covering internal developer platforms, Kubernetes, GitOps, service mesh, developer experience, and observability — for senior/staff-level roles.

platform-engineering · kubernetes · gitops · argocd · helm · service-mesh · idp · developer-experience · backstage · terraform

Questions: 17+ · Topics: 7 · Est. time: 3 hours

Internal Developer Platforms (IDP)

What is a platform engineer and how does the role differ from a DevOps engineer?

A platform engineer builds and maintains the internal tooling, infrastructure, and self-service capabilities that enable product engineers to deploy and operate their services efficiently.

|          | DevOps Engineer                      | Platform Engineer                              |
|----------|--------------------------------------|------------------------------------------------|
| Focus    | CI/CD pipelines, delivery processes  | Internal developer platform (IDP), paved roads |
| Customer | Operations team                      | Product/application engineers                  |
| Output   | Working pipelines                    | Composable platform capabilities               |
| Mindset  | Process improvement                  | Product thinking for internal tools            |

Platform engineering applies product management principles to internal infrastructure — the "customers" are developers, and the "product" is the platform itself.


What is an Internal Developer Platform (IDP) and what are its core components?

An IDP is a self-service layer over infrastructure that lets developers deploy, operate, and observe their services without deep platform expertise.

Core components:

  1. Service catalog — inventory of services with ownership, dependencies, documentation (e.g., Backstage; see the sketch after this list).
  2. Infrastructure self-service — golden path templates for provisioning databases, queues, storage (via Terraform modules or Crossplane).
  3. CI/CD workflows — standardised pipelines that developers customise without writing from scratch.
  4. Secrets management — centralised vault (Vault, AWS Secrets Manager, External Secrets Operator).
  5. Observability stack — pre-configured metrics, logs, traces with per-team dashboards.
  6. Environment management — ephemeral environments for PRs, staging promotion workflows.

The goal: reduce cognitive load. Developers focus on business logic; the platform handles the rest.
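
In practice, the service catalog entry is a small YAML file committed alongside the service. A minimal sketch of a Backstage catalog entry, assuming a hypothetical service and team (the name, annotation, and owner values are illustrative):

# catalog-info.yaml: lives in the service repo, ingested by Backstage
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: checkout-service              # hypothetical service name
  description: Handles checkout and payment orchestration
  annotations:
    github.com/project-slug: org/checkout-service
spec:
  type: service
  lifecycle: production
  owner: team-payments                # ownership surfaced in the catalog
  dependsOn:
    - component:payment-service       # dependency tracking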


How would you design a "golden path" for a new microservice?

A golden path is an opinionated, well-supported route that works for most teams. For a new microservice:

  1. Template in Backstage (or similar) → developer fills in service name, language, owner, SLA tier.
  2. Scaffolding → Backstage creates: GitHub repo with standard structure, CI pipeline (GitHub Actions), Dockerfile, Helm chart skeleton, Datadog dashboards, PagerDuty integration (see the template sketch below).
  3. GitOps registration → ArgoCD Application manifest added to the platform config repo → service auto-deploys to dev.
  4. Promotions — PR to staging branch → auto-deploy. PR to main → requires approval + smoke tests → prod.
  5. Infrastructure — if the service needs a database, a Terraform module request via the catalog → infrastructure provisioned and connection string injected as a Kubernetes Secret via External Secrets Operator.

The developer never touches Kubernetes YAML, Terraform, or CI config directly — unless they need to escape the golden path.
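
The scaffolding step is typically driven by a Backstage Software Template. A heavily abbreviated sketch, where the template name, skeleton path, and GitHub org are hypothetical (fetch:template, publish:github, and catalog:register are standard scaffolder actions):

# template.yaml: hypothetical golden-path template (abbreviated)
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: golden-path-microservice
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name, owner]
      properties:
        name:
          type: string
        owner:
          type: string
  steps:
    - id: scaffold
      action: fetch:template          # render the repo skeleton (CI, Dockerfile, Helm chart)
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          owner: ${{ parameters.owner }}
    - id: publish
      action: publish:github          # create the GitHub repo
      input:
        repoUrl: github.com?owner=org&repo=${{ parameters.name }}
    - id: register
      action: catalog:register        # add the new service to the catalog
      input:
        repoContentsUrl: ${{ steps['publish'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml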


Kubernetes

Explain the Kubernetes control plane components and what each does.

| Component                | Role                                                                           |
|--------------------------|--------------------------------------------------------------------------------|
| kube-apiserver           | REST API frontend — all kubectl commands, controllers, and nodes talk to this  |
| etcd                     | Distributed key-value store — the source of truth for all cluster state       |
| kube-scheduler           | Assigns pods to nodes based on resources, affinity, taints/tolerations        |
| kube-controller-manager  | Runs control loops: Deployment, ReplicaSet, Node, Job controllers             |
| cloud-controller-manager | Interfaces with the cloud provider API (provision load balancers, volumes)    |

On worker nodes: kubelet (runs pods), kube-proxy (manages iptables/ipvs rules for Service networking), and the container runtime (containerd).


How does Kubernetes networking work? Explain how a request reaches a pod.

Kubernetes networking requires every pod to have a unique IP and be reachable from every other pod without NAT — the CNI plugin (Calico, Cilium, Flannel) implements this.

Request flow for a Service:

External traffic
  → LoadBalancer Service (cloud LB)
    → NodePort on worker nodes
      → ClusterIP (virtual IP)
        → kube-proxy rewrites to a pod IP (iptables/IPVS rule)
          → Pod (matched by selector label)

For Ingress:

External traffic
  → Cloud LB → Ingress Controller (nginx, Traefik)
    → Ingress rules (host/path matching)
      → ClusterIP Service
        → Pod
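
To make the Ingress path concrete, a minimal sketch of the two resources involved (hostnames, ports, and labels are hypothetical):

apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: ClusterIP
  selector:
    app: my-app                  # must match the pod template labels
  ports:
    - port: 80                   # Service (ClusterIP) port
      targetPort: 8080           # container port
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  ingressClassName: nginx        # handled by the nginx Ingress Controller
  rules:
    - host: my-app.example.com   # host matching
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app     # → ClusterIP Service → pod
                port:
                  number: 80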

A pod is in CrashLoopBackOff. Walk through your debugging steps.

# 1. Describe the pod — look at Events section
kubectl describe pod <pod-name> -n <namespace>

# 2. Check current logs
kubectl logs <pod-name> -n <namespace>

# 3. Check previous container's logs (if pod has restarted)
kubectl logs <pod-name> -n <namespace> --previous

# 4. Check resource constraints — OOMKilled?
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState}'

# 5. Exec into the container (if it starts briefly)
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

Common causes: an application error on startup (exit code 1), OOMKilled (memory limit too low, exit code 137), a missing Secret or ConfigMap reference, or a failing liveness probe repeatedly killing the container (a failing readiness probe only withholds traffic; it never restarts the pod).
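
A sketch of the container spec fields most often implicated, assuming an HTTP service with hypothetical endpoints and ports:

# Fragment of a container spec: the limits and probes behind most CrashLoopBackOffs
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"              # set too low → OOMKilled, exit code 137
livenessProbe:
  httpGet:
    path: /healthz               # hypothetical endpoint
    port: 8080
  initialDelaySeconds: 15        # too short for a slow-starting app → restart loop
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready                 # hypothetical endpoint
    port: 8080
  periodSeconds: 5               # failure withholds traffic but never restarts the pod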


What are the differences between Deployments, StatefulSets, and DaemonSets?

| Controller  | Use case                                             | Pod identity                         | Storage               |
|-------------|------------------------------------------------------|--------------------------------------|-----------------------|
| Deployment  | Stateless apps (API servers, web apps)               | Interchangeable                      | Ephemeral or shared   |
| StatefulSet | Stateful apps (databases, Kafka, Zookeeper)          | Stable, ordered names (pod-0, pod-1) | Dedicated PVC per pod |
| DaemonSet   | One pod per node (log collectors, monitoring agents) | Per-node identity                    | Node-local storage    |

StatefulSet specifics: pods start/stop in order, each gets a stable network identity (pod-0.service.ns.svc), and PVCs are retained even when the pod is deleted.
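
A minimal sketch of what provides those properties (the Kafka naming, image, and sizes are illustrative): serviceName wires up stable per-pod DNS, and volumeClaimTemplates creates a dedicated PVC for each pod:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-headless    # headless Service → stable DNS (kafka-0.kafka-headless...)
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: apache/kafka:3.7.0    # illustrative image/tag
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka
  volumeClaimTemplates:          # one dedicated PVC per pod, retained after pod deletion
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi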


GitOps

What is GitOps and how does it differ from push-based CI/CD?

GitOps treats Git as the single source of truth for both application code and infrastructure state. A GitOps operator (ArgoCD, Flux) continuously reconciles the cluster state with the desired state declared in Git.

|                | Push-based CI/CD                       | GitOps                                        |
|----------------|----------------------------------------|-----------------------------------------------|
| Trigger        | Pipeline pushes to cluster on commit   | Operator pulls from Git on a schedule/webhook |
| Cluster access | CI runner has kubectl/API credentials  | Only the in-cluster operator has credentials  |
| Drift          | Not detected                           | Detected and auto-reconciled                  |
| Rollback       | Re-run pipeline with old commit        | git revert → operator reconciles              |
| Audit          | Pipeline logs                          | Git history                                   |

Security advantage: no external system has cluster credentials — the cluster pulls its own desired state.


How does ArgoCD work? Describe its sync mechanism.

ArgoCD runs as a controller inside the cluster. It continuously:

  1. Watches Git (polling or webhook) for changes to the configured repository/branch.
  2. Compares the live cluster state to the desired state in Git (using Kubernetes resource diffs).
  3. If OutOfSync → either auto-syncs (if configured) or raises an alert for manual sync.
  4. Applies the manifests (via kubectl apply or Helm/Kustomize rendering).
  5. Reports sync status, health status, and resource tree in the UI.
# Example Application manifest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/gitops-repo
    targetRevision: main
    path: apps/my-app/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true       # delete resources removed from Git
      selfHeal: true    # revert manual cluster changes

How do you promote a release from staging to production in a GitOps workflow?

Image update strategy (app repo → config repo):

Developer merges PR to main
  → CI builds image, tags with commit SHA
  → CI opens PR in config repo:
      apps/my-app/overlays/staging/kustomization.yaml
        newTag: abc1234
  → ArgoCD detects change → syncs to staging
  → Automated tests run against staging

Promotion to prod:
  → Platform engineer (or automated gate) opens PR:
      apps/my-app/overlays/prod/kustomization.yaml
        newTag: abc1234  (same SHA that passed staging)
  → PR approved → merged
  → ArgoCD syncs to prod

This creates an immutable, auditable trail — you can see exactly which commit is running in prod at any time.
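
The promotion PR itself is typically a one-line diff. A sketch of the prod overlay being edited (the image name and paths are hypothetical):

# apps/my-app/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                   # shared manifests
images:
  - name: ghcr.io/org/my-app     # image reference as written in the base Deployment
    newTag: abc1234              # the promotion PR changes only this line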


Service Mesh

What problem does a service mesh solve?

In a microservices architecture, many services communicate over the network. Cross-cutting concerns that every service needs — mTLS encryption, circuit breaking, retries, rate limiting, distributed tracing — would otherwise have to be implemented by every team, in every language and framework they use.

A service mesh (Istio, Linkerd, Cilium) moves these concerns into the infrastructure layer using sidecar proxies (or eBPF), so application code is free of networking logic.

Key capabilities:

  • mTLS — automatic mutual TLS between all services (zero-trust); see the sketch after this list
  • Traffic management — canary releases, A/B testing, traffic splitting at the mesh level
  • Observability — automatic metrics, traces, and logs for every service-to-service call
  • Circuit breaking — automatic failure isolation without code changes
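
As a concrete example of the mTLS capability, Istio can enforce strict mutual TLS mesh-wide with a single resource; a minimal sketch:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system        # applying in the root namespace makes it mesh-wide
spec:
  mtls:
    mode: STRICT                 # plaintext service-to-service traffic is rejected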

Explain Istio's traffic management for a canary deployment.

# VirtualService: route 90% to v1, 10% to v2
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
        subset: v1
      weight: 90
    - destination:
        host: my-service
        subset: v2
      weight: 10
---
# DestinationRule: define subsets (v1, v2)
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service
spec:
  host: my-service
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2

Monitor v2 error rates and latency in Kiali/Grafana → gradually shift weight to 100% v2 → delete v1 Deployment.


Observability

What is the difference between metrics, logs, and traces? How do you use each for troubleshooting?

| Signal  | What it tells you                               | Tool examples                   |
|---------|-------------------------------------------------|---------------------------------|
| Metrics | Aggregated numerical data over time             | Prometheus, Datadog, CloudWatch |
| Logs    | Discrete events with context                    | ELK, Loki, CloudWatch Logs      |
| Traces  | End-to-end journey of a request across services | Jaeger, Zipkin, AWS X-Ray       |

Troubleshooting workflow:

  1. Metrics alert fires — "p99 latency spike on checkout service."
  2. Logs — filter checkout service logs for errors around the spike time → find a timeout calling payment service.
  3. Traces — find the specific trace ID from a failed request → see the entire call chain → identify that payment service → fraud-check API is the bottleneck.

What are the four golden signals and why do SRE/Platform teams monitor them?

From Google's SRE book:

| Signal     | Metric                         | Example                             |
|------------|--------------------------------|-------------------------------------|
| Latency    | Response time (p50, p99, p999) | 95th percentile API latency > 200ms |
| Traffic    | Request rate                   | Requests per second                 |
| Errors     | Error rate                     | HTTP 5xx % of total requests        |
| Saturation | Resource utilisation headroom  | CPU at 80%, memory at 90%           |

These four signals catch almost every class of production problem. Alert on them before building custom metrics — they answer "is the service broken?" immediately.
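
A sketch of how the latency and error signals might be encoded as alerts, assuming the Prometheus Operator and conventional HTTP histogram/counter metric names (the service label and thresholds are hypothetical):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-golden-signals
spec:
  groups:
    - name: checkout.golden-signals
      rules:
        - alert: CheckoutHighLatencyP99
          # p99 over 5 minutes, from a request-duration histogram
          expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le)) > 0.2
          for: 10m
          labels:
            severity: page
        - alert: CheckoutHighErrorRate
          # 5xx responses as a share of all requests
          expr: sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m])) / sum(rate(http_requests_total{service="checkout"}[5m])) > 0.01
          for: 10m
          labels:
            severity: page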


Helm & Kustomize

When would you choose Helm over Kustomize?

|                | Helm                                              | Kustomize                                                 |
|----------------|---------------------------------------------------|-----------------------------------------------------------|
| Templating     | Go templates ({{ .Values.foo }})                  | Patch overlays (no templates)                             |
| Package format | Chart (versioned, shareable)                      | Plain Kubernetes manifests                                |
| Values         | values.yaml per environment                       | Overlays per environment                                  |
| Use case       | Distributing software (Prometheus, nginx ingress) | Environment-specific customisation of your own manifests  |
| Secrets        | Requires Helm-secrets plugin                      | Use External Secrets Operator                             |

Recommendation: Use Helm for third-party dependencies (from ArtifactHub). Use Kustomize for managing your own applications across environments. Combine: Helm for the chart, Kustomize overlay for per-environment patches.
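
One way to combine them is Kustomize's built-in Helm support, which renders a chart and then applies overlay patches on top. A sketch, where the chart version and replica counts are illustrative (note kustomize build needs the --enable-helm flag):

# kustomization.yaml: Helm renders the chart, Kustomize patches the output
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
helmCharts:
  - name: ingress-nginx
    repo: https://kubernetes.github.io/ingress-nginx
    version: 4.11.3
    releaseName: ingress-nginx
    namespace: ingress
    valuesInline:
      controller:
        replicaCount: 2
patches:
  - target:
      kind: Deployment
      name: ingress-nginx-controller
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 3                 # per-environment override on top of the chart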


How do you handle Kubernetes secrets securely in a GitOps workflow?

Never commit plaintext secrets to Git. Strategies:

  1. External Secrets Operator — defines an ExternalSecret CR that reads from Vault, AWS Secrets Manager, or Azure Key Vault and creates a Kubernetes Secret. Only the reference is in Git.
  2. Sealed Secrets — encrypts secrets with a cluster-specific key; encrypted SealedSecret is safe to commit. Only the cluster can decrypt.
  3. Vault Agent Injector — Vault sidecar injects secrets as files into pods at runtime; no secrets in Kubernetes etcd at all.
# ExternalSecrets example
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: db-secret
  data:
  - secretKey: password
    remoteRef:
      key: prod/myapp/db
      property: password

Developer Experience

How do you measure the effectiveness of a platform?

Use the DORA metrics as the primary signal that your platform is improving developer productivity:

| Metric                       | Measures                        | Target (Elite)         |
|------------------------------|---------------------------------|------------------------|
| Deployment frequency         | How often teams deploy          | Multiple times per day |
| Lead time for changes        | Commit → production             | Less than 1 hour       |
| Change failure rate          | % of deploys causing incidents  | Less than 5%           |
| Mean time to restore (MTTR)  | Recovery time from failure      | Less than 1 hour       |

Secondary platform-specific metrics: time to first deployment for a new service, support ticket volume from developers, NPS (Net Promoter Score) from developer surveys.