Cloud Engineer Interview Questions
45+ cloud engineering interview questions covering multi-cloud architecture, networking, IAM, storage, cost optimisation, and high availability — with structured answers.
Questions: 45+ · Topics: 7 · Est. time: 3 hours
Cloud Fundamentals & Architecture
What is the difference between IaaS, PaaS, and SaaS? Give a real-world example of each.
IaaS (Infrastructure as a Service) — the provider manages the physical hardware, networking, and virtualisation. You manage the OS upward.
- Example: Azure Virtual Machines, AWS EC2. You spin up a VM and manage the OS, runtime, and application yourself.
PaaS (Platform as a Service) — the provider manages the infrastructure and OS. You manage only the application and data.
- Example: Azure App Service, AWS Elastic Beanstalk. You deploy your code; the platform handles patching, scaling, and load balancing.
SaaS (Software as a Service) — the provider manages everything including the application. You only consume it.
- Example: Microsoft 365, Salesforce. You log in and use the product — no infrastructure or application management.
Decision rule: Use IaaS for full control (legacy lift-and-shift), PaaS to focus on code rather than servers, SaaS when you just need the tool.
Explain the shared responsibility model in cloud computing.
The shared responsibility model defines what the cloud provider secures vs. what the customer secures.
| Layer | Azure / AWS responsibility | Customer responsibility |
|---|---|---|
| Physical datacentre | ✅ Provider | — |
| Host infrastructure / hypervisor | ✅ Provider | — |
| Network controls | ✅ Provider (backbone) | ✅ Customer (VNETs, NSGs, peering) |
| OS | ✅ Provider (PaaS/SaaS) | ✅ Customer (IaaS VMs) |
| Application | — | ✅ Customer (always) |
| Identity & Access | — | ✅ Customer (always) |
| Data | — | ✅ Customer (always) |
Follow-up trap: "So the cloud provider keeps your data safe?" — No. Encrypting, classifying, and protecting data is always the customer's job.
What is a region versus an availability zone?
- Region: A geographical area containing multiple Azure/AWS datacentres. Examples: East US, UK South.
- Availability Zone (AZ): Physically separate datacentres within a region, connected by low-latency links. Azure typically has 3 per region.
Use multiple AZs to protect against a single datacentre failure (power, cooling). Use multiple regions to protect against regional disasters or comply with data residency requirements.
What is a VNet (Virtual Network) and why is subnetting important?
A VNet is a logically isolated network in the cloud where you place Azure resources. IP ranges are defined using CIDR notation (e.g., 10.0.0.0/16).
Subnetting matters because:
- It segments resources by tier (web, app, database) for security boundary control.
- Each subnet can have its own NSG (Network Security Group) or route table.
- Azure reserves 5 IPs per subnet; plan sizes accordingly.
- A /24 subnet gives 251 usable IPs (256 minus the 5 reserved), which is appropriate for most tiers.
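The subnet arithmetic can be checked with a short sketch. It assumes only the 5-address reservation stated above and a /29 minimum subnet size; the function names are illustrative, not part of any Azure SDK.

```python
import ipaddress

AZURE_RESERVED_PER_SUBNET = 5  # addresses Azure reserves in every subnet


def usable_ips(cidr: str) -> int:
    """Return the number of addresses left for resources in an Azure subnet."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - AZURE_RESERVED_PER_SUBNET, 0)


def smallest_prefix_for(hosts: int) -> int:
    """Return the longest IPv4 prefix length whose subnet still fits `hosts` resources."""
    for prefix in range(29, -1, -1):  # assumption: /29 is the smallest subnet allowed
        if 2 ** (32 - prefix) - AZURE_RESERVED_PER_SUBNET >= hosts:
            return prefix
    raise ValueError("too many hosts for a single IPv4 subnet")


print(usable_ips("10.0.1.0/24"))  # 251
print(smallest_prefix_for(300))   # 23
```

A tier expected to grow past 251 instances therefore needs a /23 or larger from the start, since resizing a subnet that already holds resources is awkward.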
How do you design for high availability in the cloud?
Key pillars:
- Redundancy — deploy across multiple AZs or regions; use load balancers (Azure Load Balancer, AWS ALB).
- Auto-scaling — VMSS (Azure), Auto Scaling Groups (AWS) to handle demand spikes.
- Health probes — load balancers route away from unhealthy instances.
- Stateless applications — session state in Redis or a database, not on the VM, so any instance can serve any request.
- Database HA — Azure SQL geo-replication, read replicas; AWS RDS Multi-AZ.
- Graceful degradation — circuit breakers and fallbacks so partial failure doesn't cascade.
Target metrics: define RTO (Recovery Time Objective) and RPO (Recovery Point Objective) before designing.
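To make the graceful degradation pillar concrete, here is a minimal circuit breaker sketch. The class name, thresholds, and the injectable clock are assumptions chosen for illustration; production systems usually rely on a tested resilience library rather than hand-rolled code.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors, then retry after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        # While open and still cooling down, skip the dependency entirely.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        self.failures = 0  # a success closes the circuit fully
        return result
```

A caller would wrap a dependency call such as `breaker.call(fetch_prices, lambda: cached_prices)`, where both names are hypothetical, so that a failing downstream service yields stale data instead of a cascade of timeouts.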
What is the difference between vertical and horizontal scaling?
| | Vertical (Scale Up) | Horizontal (Scale Out) |
|---|---|---|
| Method | Increase VM size (more CPU/RAM) | Add more VM instances |
| Downtime | Often requires restart | Zero-downtime with load balancer |
| Limit | Single machine ceiling | Effectively unlimited |
| State | Simple — single instance | Requires stateless app or distributed state |
| Cost | Rises steeply at high tiers | Roughly linear |
Best practice: Design applications to scale horizontally from the start. Use vertical scaling only for quick fixes or databases that don't support sharding easily.
What is a CDN and when would you use it?
A Content Delivery Network (Azure CDN, AWS CloudFront) caches static content (images, JS, CSS, video) on edge nodes geographically close to users, reducing latency and origin server load.
Use a CDN when:
- You serve global users and latency matters.
- Your origin faces high static-content traffic (a CDN also absorbs some DDoS traffic at the edge).
- You want to add HTTPS, compression, and caching rules without changing the app.
Not useful for highly dynamic, personalised, or real-time content that can't be cached.
Describe the CAP theorem and how it applies to cloud databases.
CAP theorem: A distributed system can provide at most two of three guarantees simultaneously:
- Consistency — every read gets the most recent write.
- Availability — every request gets a (non-error) response.
- Partition tolerance — the system keeps working despite network partitions.
Since partition tolerance is mandatory in real networks, you choose CP or AP:
- CP (Consistent + Partition tolerant): Azure Cosmos DB (strong consistency mode), traditional RDBMS replicas. Sacrifices availability under partition.
- AP (Available + Partition tolerant): Cosmos DB (eventual consistency), DynamoDB default. Tolerates stale reads for higher availability.
Networking
What is the difference between a public IP and a private IP in Azure?
- Private IP: Routable only within your VNet — assigned from your CIDR range. Used for internal service-to-service communication.
- Public IP: Routable over the internet. Attached to a load balancer, Azure Bastion, or a NIC directly.
Best practice: expose only what must be publicly accessible (load balancers, API gateways). Databases, app servers, and internal services should have private IPs only.
What is VNet Peering and when do you use it over VPN Gateway?
VNet Peering connects two VNets at the Microsoft backbone level: low latency, high throughput, no encryption overhead, and no gateway bandwidth bottleneck. Traffic stays within the Azure network.
VPN Gateway connects VNets or on-premises networks using encrypted IPsec tunnels. Introduces gateway SKU bandwidth limits and latency.
| | VNet Peering | VPN Gateway |
|---|---|---|
| Use for | Azure-to-Azure (same or cross-region) | Azure-to-on-premises |
| Latency | Very low | Higher (encryption overhead) |
| Throughput | Network-speed | Limited by SKU |
| Cost | Data transfer charges | Gateway + data charges |
Explain NSGs (Network Security Groups) and how they differ from Azure Firewall.
NSGs are stateful, layer-4 (port/protocol) packet filters attached to subnets or NICs. They allow/deny traffic based on source IP, destination IP, port, and protocol. They are cheap, simple, and the first line of defence.
Azure Firewall is a managed, stateful layer-4/7 firewall with:
- FQDN filtering (block *.malicious.com).
- Threat intelligence feeds.
- Application rules (HTTP/S URL filtering).
- Centralised logging.
Use NSGs for subnet-level segmentation everywhere. Add Azure Firewall at the VNet perimeter for traffic between spokes in a hub-and-spoke topology or for egress filtering.
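The way an NSG decides on a packet can be sketched as a priority-ordered, first-match rule list. This is a simplified model for interview discussion: the `Rule` fields and the deny-by-default fallback are assumptions, and real NSGs also ship with default rules and match on destination address, which are omitted here.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int         # lower number is evaluated first
    source: str           # CIDR range, e.g. "10.0.2.0/24"
    port: Optional[int]   # None matches any destination port
    protocol: str         # "tcp", "udp" or "*"
    allow: bool


def evaluate(rules, src_ip, port, protocol):
    """Return True if the first matching rule (by priority) allows the traffic."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if (ip_address(src_ip) in ip_network(rule.source)
                and rule.port in (None, port)
                and rule.protocol in ("*", protocol)):
            return rule.allow
    return False  # assumption: traffic matching no rule is denied


# Hypothetical rules for a database subnet: only the app tier may reach SQL.
db_rules = [
    Rule(100, "10.0.2.0/24", 1433, "tcp", True),
    Rule(4096, "0.0.0.0/0", None, "*", False),
]
print(evaluate(db_rules, "10.0.2.5", 1433, "tcp"))     # True
print(evaluate(db_rules, "203.0.113.7", 1433, "tcp"))  # False
```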
What is Azure Private Link / AWS PrivateLink?
Private Link exposes Azure PaaS services (Storage, SQL, Cosmos DB) or your own services over a private endpoint in your VNet with a private IP. Traffic never traverses the public internet.
Before Private Link, accessing Azure Storage from a VM required either a public endpoint (internet) or a Service Endpoint (traffic stays on the Azure backbone but still targets the service's public endpoint). Private Link gives the service a private IP inside your VNet, so it can be reached without any public endpoint.
Use it for: compliance requirements, preventing data exfiltration, securing PaaS access from on-premises via ExpressRoute.
What is ExpressRoute and when is it preferred over VPN?
ExpressRoute is a dedicated private connection from on-premises to Azure, provisioned through a connectivity provider (not over the public internet). Bandwidths from 50 Mbps to 100 Gbps.
Prefer ExpressRoute over VPN when:
- You need guaranteed bandwidth and consistent latency (financial trading, SAP).
- You're migrating large data sets (terabytes).
- Regulatory requirements forbid internet traversal.
- You need > 1 Gbps sustained throughput.
VPN is adequate for smaller offices, remote access, and budget-constrained scenarios.
What is DNS and how does Azure DNS work?
DNS resolves human-readable names (e.g., api.example.com) to IP addresses. Azure DNS is a hosted DNS service backed by Azure's anycast network.
- Public DNS zones: Resolve names from the internet. Delegate your domain registrar's NS records to Azure DNS.
- Private DNS zones: Resolve names within VNets only. Link a private zone to one or more VNets for autoregistration of VM hostnames.
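Azure DNS lacked native DNSSEC support for years, which made it a common interview trap question. Microsoft has since added DNSSEC signing for public zones, so verify the current state before answering.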
Explain what a load balancer does and the types available in Azure.
A load balancer distributes incoming traffic across multiple backend instances to provide scale and resilience.
| | Azure Load Balancer | Application Gateway | Azure Front Door |
|---|---|---|---|
| Layer | L4 (TCP/UDP) | L7 (HTTP/S) | L7 (global) |
| SSL termination | ❌ | ✅ | ✅ |
| URL path routing | ❌ | ✅ | ✅ |
| WAF | ❌ | ✅ | ✅ |
| Global load balancing | ❌ | ❌ | ✅ |
| Use case | Internal VMs, non-HTTP | Single-region API / web apps | Multi-region web apps |
Storage & Databases
What are the Azure storage tiers and when do you use each?
Azure Blob Storage offers four access tiers:
| Tier | Access cost (read) | Storage cost | Use case |
|---|---|---|---|
| Hot | Low | High | Actively accessed data |
| Cool | Medium | Medium | Infrequently accessed (30+ days) |
| Cold | Higher | Lower | Rarely accessed (90+ days) |
| Archive | Highest + rehydration delay | Lowest | Long-term retention (180+ days), compliance |
Set lifecycle management policies to automatically move blobs between tiers based on last-modified date.
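The tiering decision a lifecycle policy makes can be approximated as follows. This is a sketch of the logic only, using the day thresholds from the table above; actual policies are declared as JSON rules on the storage account, and the function name is illustrative.

```python
from datetime import date

# Minimum blob age in days per tier, taken from the table above (oldest first).
TIER_THRESHOLDS = [(180, "Archive"), (90, "Cold"), (30, "Cool")]


def target_tier(last_modified: date, today: date) -> str:
    """Pick the cheapest storage tier whose age threshold the blob has reached."""
    age_days = (today - last_modified).days
    for min_age, tier in TIER_THRESHOLDS:
        if age_days >= min_age:
            return tier
    return "Hot"


print(target_tier(date(2024, 1, 1), date(2024, 1, 15)))  # Hot
print(target_tier(date(2023, 1, 1), date(2024, 1, 1)))   # Archive
```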
What is the difference between Azure SQL Database and Azure SQL Managed Instance?
| | Azure SQL Database (PaaS) | Azure SQL Managed Instance |
|---|---|---|
| Compatibility | ~99% SQL Server | ~100% SQL Server |
| SQL Agent | ❌ | ✅ |
| Cross-db queries | ❌ | ✅ |
| VNet injection | ❌ (Private Endpoint only) | ✅ (native VNet) |
| Use case | New cloud-native apps | Lift-and-shift legacy SQL Server |
When would you choose Cosmos DB over a relational database?
Choose Cosmos DB when:
- You need single-digit millisecond global latency with multi-region writes.
- Your data model is document, key-value, graph, or column-family (not strictly tabular).
- You need elastic, automatic scaling without managing partitions.
- Eventual consistency is acceptable, or you need per-operation consistency tuning.
Choose a relational database when:
- Strong ACID transactions across multiple tables are required.
- Your data is highly relational and normalised.
- You're joining many tables with complex queries.
Explain the difference between LRS, ZRS, GRS, and GZRS storage redundancy.
| Option | Copies | Resilience |
|---|---|---|
| LRS (Locally Redundant) | 3 in same datacentre | Lowest cost; no zone/region resilience |
| ZRS (Zone Redundant) | 3 across AZs in same region | Zone failure resilience |
| GRS (Geo-Redundant) | 3 local + 3 in paired region | Region failure; secondary is read-only unless failover |
| GZRS (Geo-Zone-Redundant) | 3 across AZs in primary region + 3 in paired region | Highest resilience (zone and region failure) |
For production: at minimum ZRS. For critical systems with data durability requirements: GZRS.
What is the difference between a relational and a NoSQL database?
| | Relational (SQL) | NoSQL |
|---|---|---|
| Schema | Fixed, defined upfront | Flexible / schemaless |
| ACID | Full cross-table | Usually per-document/operation |
| Query language | SQL (standard) | Varies (JSON, API, CQL) |
| Scaling | Typically vertical or read replicas | Horizontal sharding |
| Use case | Complex joins, reports, transactions | High-throughput, flexible schema, global scale |
How would you migrate an on-premises database to Azure?
- Assess: Azure Migrate Database Assessment — identify compatibility issues.
- Schema migration: Export schema, fix compatibility (deprecated features, data types).
- Data migration: Azure Database Migration Service for minimal-downtime migrations.
- Cutover: During a maintenance window, stop writes, apply the final change log, flip connection strings.
- Validate: Row counts, checksums, application smoke tests.
- Rollback plan: Keep source DB available for a minimum of 24–48 hours post-cutover.
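The validation step (row counts and checksums) can be sketched as below. In-memory SQLite databases stand in for the real source and target, and the table and helper names are hypothetical; against large production tables you would hash in chunks or use the engine's own checksum functions instead of loading every row.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, table):
    """Return (row_count, sha256) for a table, independent of row order."""
    # `table` is trusted input here; never interpolate user input into SQL.
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    digest = hashlib.sha256()
    for row in sorted(rows, key=repr):
        digest.update(repr(row).encode("utf-8"))
    return len(rows), digest.hexdigest()


def make_demo_db(rows):
    """Build an in-memory SQLite database standing in for a real server."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    return conn


source = make_demo_db([(1, "Ada"), (2, "Grace")])
target = make_demo_db([(2, "Grace"), (1, "Ada")])  # same rows, different order
print(table_fingerprint(source, "customers") == table_fingerprint(target, "customers"))  # True
```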
IAM & Security
What is RBAC and how does it work in Azure?
Role-Based Access Control (RBAC) grants access to Azure resources by assigning a role to a security principal at a scope.
- Principal: User, group, service principal, managed identity.
- Role: Collection of permissions (e.g., Contributor, Reader, Storage Blob Data Contributor).
- Scope: Management group → subscription → resource group → individual resource.
Assignments are inherited down the scope hierarchy. Best practice: assign roles to groups not individuals; use least privilege; prefer built-in roles over custom where possible.
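Scope inheritance can be illustrated with a prefix check on resource IDs. This is a sketch, not how Azure evaluates access internally: it covers subscription, resource group, and resource scopes only (management group scopes use a different ID shape), and the subscription ID, group names, and assignments are made up.

```python
def scope_applies(assignment_scope: str, resource_id: str) -> bool:
    """True if a role assigned at `assignment_scope` is inherited by `resource_id`."""
    scope = assignment_scope.rstrip("/").lower()
    resource = resource_id.rstrip("/").lower()
    return resource == scope or resource.startswith(scope + "/")


def effective_roles(assignments, resource_id):
    """Collect every role whose assignment scope covers the resource."""
    return {role for scope, role in assignments if scope_applies(scope, resource_id)}


SUB = "/subscriptions/00000000-0000-0000-0000-000000000000"  # hypothetical
assignments = [
    (SUB, "Reader"),
    (SUB + "/resourceGroups/rg-app", "Contributor"),
    (SUB + "/resourceGroups/rg-data", "Storage Blob Data Contributor"),
]
vm_id = SUB + "/resourceGroups/rg-app/providers/Microsoft.Compute/virtualMachines/web01"
print(sorted(effective_roles(assignments, vm_id)))  # ['Contributor', 'Reader']
```

The trailing-slash check matters: without it, an assignment on `rg-app` would wrongly match a resource group named `rg-app2`.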
What is the difference between a service principal and a managed identity?
Both are non-human identities applications use to authenticate to Azure services.
| | Service Principal | Managed Identity |
|---|---|---|
| Credentials | Client ID + secret or certificate (you manage rotation) | None — Azure manages credentials automatically |
| Lifecycle | Manual | Tied to the resource lifecycle |
| Use case | CI/CD pipelines, external systems | Azure-hosted resources (VMs, App Service, AKS pods) |
| Risk | Secret leakage if not rotated | None (no secret exposed) |
Best practice: Always prefer managed identity when the workload runs on Azure. Use service principals only for external systems or CI runners.
What is Azure Key Vault and what types of objects does it store?
Azure Key Vault is a managed hardware-security-module-backed service for storing secrets, keys, and certificates.
- Secrets: Connection strings, passwords, API tokens.
- Keys: RSA/EC cryptographic keys for encryption operations (never leave Key Vault if using HSM tier).
- Certificates: TLS certificates with automatic renewal via integrated CAs.
Access control: Key Vault access policies (legacy) or RBAC (recommended). Applications access Key Vault via managed identity, avoiding hardcoded credentials entirely.
What is Zero Trust security and how does it apply to cloud?
Zero Trust is the model: "Never trust, always verify." Instead of trusting anything inside the network perimeter, every request is authenticated and authorised regardless of origin.
Principles applied in cloud:
- Verify explicitly — Always authenticate and authorise; use MFA, conditional access.
- Least privilege access — JIT (just-in-time) access, RBAC, scope-restricted roles.
- Assume breach — Segment networks (microsegmentation), encrypt data in transit and at rest, log everything to a SIEM.
How do you secure secrets in a CI/CD pipeline?
- Store secrets in the pipeline platform's secret store (GitHub Secrets, Azure DevOps variable groups with Key Vault integration) — never in source code.
- Use workload identity federation (OIDC tokens) to authenticate GitHub Actions to Azure without storing a service principal secret.
- Audit secret access; rotate secrets regularly.
- Scan repos with tools like git-secrets, truffleHog, or GitHub Advanced Security to catch accidental commits.
What is Conditional Access in Azure AD / Entra ID?
Conditional Access creates policies that control access to applications based on signals:
- Who: User, group, role.
- What: Application (e.g., Azure Portal, Office 365).
- How: Device compliance, IP location, risk level.
- Result: Require MFA, block access, limit session, force password change.
Example: "Require MFA when accessing Azure Portal from outside trusted locations."
What is the principle of least privilege and how do you enforce it?
The principle of least privilege means giving principals only the minimum permissions needed for their function.
Enforcement in Azure:
- Assign built-in roles at the narrowest scope that works (prefer a single resource over a resource group, and a resource group over a subscription).
- Use custom roles for granular permission sets.
- Regularly audit role assignments with Azure AD Access Reviews.
- Remove unused service principals and guest accounts.
- Use PIM (Privileged Identity Management) for just-in-time elevation of admin roles.
Explain DDoS protection options in Azure.
- Azure DDoS Infrastructure Protection: Always-on, free, protects against volumetric and protocol attacks at the infrastructure level. Covers all Azure public IPs.
- Azure DDoS Network Protection (paid): Per-VNet, enhanced mitigation, attack analytics, rapid response support, cost protection.
- Azure Web Application Firewall (WAF): L7 protection against OWASP top 10, SQL injection, XSS — deployed on Application Gateway or Front Door.
For production web applications: DDoS Network Protection + WAF on Application Gateway or Front Door.
Cost Management
How do you reduce cloud costs without sacrificing reliability?
Right-sizing: Analyse CPU/memory metrics; downsize overprovisioned VMs.
Reserved Instances: Commit to 1–3 years for predictable workloads — up to 70% savings vs. pay-as-you-go.
Spot/Preemptible VMs: For fault-tolerant, interruptible workloads (batch jobs, ML training) — up to 90% savings.
Auto-scaling: Scale down during off-peak hours; don't run dev/test environments overnight.
Storage tiers: Use lifecycle policies to move cold data to cool/archive tiers automatically.
Savings Plans (AWS) / Azure Compute Savings Plan: Flexibility to apply discounts across instance types.
What is a cloud billing model and how does FinOps work?
FinOps (Financial Operations) is the practice bridging engineering, finance, and business to manage cloud costs collaboratively.
Key practices:
- Tagging: Apply consistent resource tags (team, project, environment) with Azure Policy enforcement so costs can be attributed.
- Budgets & alerts: Set Azure Cost Management budgets with alerts at the 80% and 100% thresholds.
- Showback/chargeback: Report costs per team to drive accountability.
- Cost reviews: Regular architecture reviews to identify zombie resources, idle VMs, orphaned disks.
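Tag-based showback and budget alerts reduce to a few lines of arithmetic. The sketch below assumes a list of resource records with `tags` and `monthly_cost` fields (a made-up shape, not the Cost Management export format) and the 80%/100% thresholds mentioned above.

```python
from collections import defaultdict


def cost_by_tag(resources, tag):
    """Sum monthly cost per value of `tag`; untagged spend is reported separately."""
    totals = defaultdict(float)
    for resource in resources:
        totals[resource["tags"].get(tag, "untagged")] += resource["monthly_cost"]
    return dict(totals)


def triggered_alerts(spend, budget, thresholds=(0.8, 1.0)):
    """Return the budget thresholds that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return [t for t in thresholds if spend / budget >= t]


# Hypothetical resources for illustration.
resources = [
    {"name": "vm-api", "tags": {"team": "payments"}, "monthly_cost": 400.0},
    {"name": "sql-main", "tags": {"team": "payments"}, "monthly_cost": 300.0},
    {"name": "disk-old", "tags": {}, "monthly_cost": 50.0},
]
print(cost_by_tag(resources, "team"))  # {'payments': 700.0, 'untagged': 50.0}
print(triggered_alerts(8500, 10000))   # [0.8]
```

The "untagged" bucket is the useful part in practice: it shows how much spend cannot yet be attributed, which is why tag enforcement comes first.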
What are Reserved Instances and when should you use them?
Reserved Instances (Azure) / Reserved Capacity commit you to a specific resource type and region for 1 or 3 years in exchange for a significant discount (up to 70%).
Use them when:
- The workload runs 24/7 (production databases, PaaS services, core VMs).
- Usage is predictable and won't change significantly.
- You have at least 6–12 months of usage history to analyse.
Avoid for: dev/test, non-committed projects, rapidly scaling workloads better suited to Savings Plans.
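Whether a reservation pays off comes down to a break-even utilisation. The sketch below assumes a simple model in which pay-as-you-go bills only for hours used while a reservation bills every hour at a discounted rate; the rate and discount values are placeholders, not published prices.

```python
def break_even_utilisation(discount: float) -> float:
    """Fraction of hours a VM must run for a reservation to match pay-as-you-go."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def reservation_saves_money(hourly_rate: float, discount: float, utilisation: float) -> bool:
    """Compare average hourly cost of both models for a given utilisation (0 to 1)."""
    pay_as_you_go = hourly_rate * utilisation    # billed only while running
    reserved = hourly_rate * (1 - discount)      # billed for every hour of the term
    return reserved < pay_as_you_go


print(reservation_saves_money(0.20, 0.40, 1.0))  # True: a 24/7 workload
print(reservation_saves_money(0.20, 0.40, 0.5))  # False: runs half the time
```

With a 40% discount the VM has to run more than 60% of the time before the commitment wins, which is why dev/test machines that shut down overnight rarely qualify.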
How do you identify and remove wasted cloud spend?
- Azure Advisor: Automatically surfaces under-utilised VMs, idle load balancers, unattached disks.
- Orphaned resources: IP addresses, managed disks, snapshots without attached resources.
- Oversized SKUs: VMs averaging ≤ 10% CPU are candidates for downsizing.
- Dev/test scheduling: Autostart/autostop VMs outside business hours.
- Data transfer costs: Audit egress charges; move inter-region traffic to VNet peering where possible.
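The oversized-SKU check can be sketched as a filter over CPU samples. The input shape (a mapping of VM name to CPU percentages) is an assumption; in practice the samples would come from Azure Monitor metrics, and memory and disk metrics deserve the same review before resizing.

```python
from statistics import mean


def downsizing_candidates(cpu_samples_by_vm, threshold=10.0):
    """Return VM names whose average CPU percentage is at or below `threshold`."""
    return sorted(
        vm
        for vm, samples in cpu_samples_by_vm.items()
        if samples and mean(samples) <= threshold  # skip VMs with no data
    )


# Hypothetical metrics for illustration.
metrics = {
    "vm-batch": [5, 8, 7],
    "vm-web": [40, 55, 61],
    "vm-new": [],
}
print(downsizing_candidates(metrics))  # ['vm-batch']
```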
What is the difference between CAPEX and OPEX in cloud context?
- CAPEX (Capital Expenditure): Upfront investment in physical hardware — on-premises servers, networking gear. Depreciated over years.
- OPEX (Operating Expenditure): Ongoing pay-as-you-go cloud consumption. No upfront cost, consumed as a service.
Cloud shifts the model from CAPEX to OPEX, reducing upfront risk and enabling elasticity. However, uncontrolled OPEX can exceed CAPEX if governance is absent — hence FinOps.
High Availability & Disaster Recovery
What is RPO and RTO and how do they influence your DR design?
- RPO (Recovery Point Objective): Maximum acceptable data loss. If RPO = 1 hour, the DR solution must replicate data at least hourly.
- RTO (Recovery Time Objective): Maximum acceptable downtime after a disaster. If RTO = 4 hours, the system must be restorable within 4 hours.
| Tier | RTO | RPO | Approach |
|---|---|---|---|
| Mission critical | < 1 min | ~0 | Active-active, synchronous replication |
| Business critical | < 1 hour | < 15 min | Warm standby, geo-replication |
| Standard | < 4 hours | < 1 hour | Pilot light, scheduled backups |
| Non-critical | < 24 hours | < 24 hours | Cold backup restore |
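The table can be read as a lookup: given the RTO and RPO the business will accept, pick the least demanding tier that still meets both. The sketch below encodes the table's bounds directly and assumes the tiers are ordered from most to least costly; the function name is illustrative.

```python
from datetime import timedelta

# (tier, RTO bound, RPO bound, approach), strictest first, copied from the table.
DR_TIERS = [
    ("Mission critical", timedelta(minutes=1), timedelta(0),
     "Active-active, synchronous replication"),
    ("Business critical", timedelta(hours=1), timedelta(minutes=15),
     "Warm standby, geo-replication"),
    ("Standard", timedelta(hours=4), timedelta(hours=1),
     "Pilot light, scheduled backups"),
    ("Non-critical", timedelta(hours=24), timedelta(hours=24),
     "Cold backup restore"),
]


def cheapest_tier(required_rto: timedelta, required_rpo: timedelta):
    """Return (tier, approach) for the least strict tier that meets both targets."""
    for name, rto, rpo, approach in reversed(DR_TIERS):
        if rto <= required_rto and rpo <= required_rpo:
            return name, approach
    # Nothing looser qualifies, so fall back to the strictest tier.
    return DR_TIERS[0][0], DR_TIERS[0][3]


print(cheapest_tier(timedelta(hours=2), timedelta(minutes=30)))
# ('Business critical', 'Warm standby, geo-replication')
```

A 2-hour RTO lands in the business critical tier because the standard tier only promises recovery within 4 hours.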
Explain active-active vs active-passive disaster recovery.
Active-Active: Traffic runs through multiple live regions simultaneously. On failure, the load balancer removes the affected region. Near-zero failover time (sub-second). Highest cost.
Active-Passive (Warm standby): Traffic runs through one region; a second region has live infrastructure but minimal/no traffic. On failure, traffic is redirected. Failover time: minutes. Moderate cost.
Active-Passive (Cold/Pilot light): Secondary region has just enough infrastructure to bring the rest up quickly. Resources are started only on failover. Lower cost; longer RTO.
What is Azure Site Recovery?
Azure Site Recovery (ASR) is a DRaaS (Disaster Recovery as a Service) that continuously replicates virtual machines from a primary site to a secondary Azure region (or Azure to on-premises). On a failover event:
- ASR promotes replicated disks and network config in the target region.
- VMs start in the recovery region.
- Failback is coordinated once the primary is restored.
Supports: Azure VMs, on-premises Hyper-V/VMware/physical servers to Azure.
How do you test a disaster recovery plan?
Testing DR is as important as building it. Steps:
- Failover drills: Execute planned test failovers in ASR or your DR tool — production unaffected, test VMs start in isolation.
- Runbook validation: Manually walk through each step of the DR runbook; time it.
- Schedule: At least annually, ideally quarterly for mission-critical systems.
- RTO validation: Measure actual recovery time vs. target; fix gaps.
- Communication test: Ensure incident contacts, escalation paths, and status page updates work.
- Document findings: Update DR runbook after each test.
What is a multi-region architecture and what are the trade-offs?
Multi-region deploys application components across two or more Azure regions, connected by Azure Front Door (L7 load balancing) or Traffic Manager (DNS-based).
Benefits: Near-zero RTO, data residency options, improved latency for global users.
Trade-offs:
- Cost: 2× infrastructure cost minimum.
- Consistency: Distributed transactions are hard; you either accept eventual consistency or pay a latency penalty for strong consistency.
- Operational complexity: More infrastructure to monitor, patch, and manage.
- Data sovereignty: Understand which regions data can replicate to for compliance.
How do you architect a zero-downtime deployment?
- Blue/Green deployment: Run two identical environments; switch traffic to green after validation; blue becomes the rollback target.
- Canary release: Route a small percentage of traffic (5–10%) to the new version; monitor error rates; gradually shift to 100%.
- Rolling deployment: Replace instances in batches behind a load balancer; health probes prevent sending traffic to unhealthy instances.
- Feature flags: Deploy code but keep features disabled; enable for percentage of users without redeployment.
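Canary releases and percentage-based feature flags both need a stable way to put a user in or out of the rollout. One common approach, sketched here under the assumption that a string user ID is available, is to hash the ID into one of 100 buckets; managed traffic-splitting features in gateways or deployment slots do the equivalent at the request level.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically assign a user to the canary for a given rollout percentage."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < canary_percent


# Hypothetical user IDs: measure the share routed to the canary at 10%.
users = [f"user-{n}" for n in range(10_000)]
share = sum(routes_to_canary(u, 10) for u in users) / len(users)
print(f"{share:.1%} of users see the new version")
```

Because the bucket depends only on the user ID, a user who sees the new version at 5% keeps seeing it as the rollout widens to 10% and beyond, which avoids flip-flopping between versions.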
Behavioural & Scenario Questions
Walk me through how you would migrate a mid-sized on-premises application to Azure.
Discovery phase: Use Azure Migrate to inventory VMs, map dependencies, profile performance. Identify databases, file shares, and external integrations.
Planning phase: Choose migration strategy per component (Rehost/Lift-and-shift, Replatform, Refactor). Set RTO/RPO. Define networking design. Plan identity federation.
Execution phase:
- Set up Azure landing zone (hub VNet, ExpressRoute or VPN, DNS, Azure AD Connect).
- Migrate databases first (Azure DMS), validate with application.
- Migrate application tiers with ASR replication; run test failovers.
- Cutover: brief maintenance window, DNS switch, final validation.
Post-migration: Right-size VMs based on actual usage, implement monitoring, enable backups, conduct DR test.
How do you handle a sudden 10× spike in traffic to a cloud application?
Immediate actions:
- Check auto-scaling configuration — is it configured? Increase max instance count if under-provisioned.
- Enable CDN caching to offload static assets.
- Check database — read replicas, connection pooling, caching layer (Redis).
- Identify bottlenecks: Application Insights live metrics, Azure Monitor.
Longer term:
- Design for horizontal auto-scaling from the start.
- Implement caching (Redis) for hot read paths.
- Use Azure Front Door for global load distribution.
- Load test regularly to find breaking points before production traffic does.
Describe a time you reduced cloud costs significantly.
Strong answer structure:
- Situation: Describe the environment (prod workloads running 24/7 on oversized VMs).
- Analysis: Show how you found the waste (Azure Advisor, cost management reports — e.g., 40 VMs running at < 10% CPU).
- Action: Right-sized VMs, purchased 1-year reservations for stable workloads, implemented auto-shutdown for dev/test environments.
- Result: Quantify the saving (e.g., "Reduced monthly cloud bill from £28,000 to £17,000 — a 39% reduction").
How do you approach a security incident on a cloud resource?
- Contain: Isolate the resource — remove from load balancer, apply deny-all NSG, revoke compromised credentials immediately.
- Assess: Review activity logs, Azure Defender alerts, and network flow logs to understand blast radius.
- Eradicate: Remove malicious access, patch vulnerability, rotate secrets.
- Recover: Restore from known-good backup; deploy clean replacement.
- Post-incident review: Root cause analysis; update runbook, improve detection (alerts, Sentinel rules).
- Communicate: Notify stakeholders per incident response plan; assess if breach notification obligations apply.
What would you check first if an Azure VM suddenly became unreachable?
Systematic triage:
- Azure Portal health: Check VM running status, any platform health events in the region.
- NSG rules: Is the NSG on the NIC or subnet blocking the port you're connecting on?
- OS-level firewall: Inside the VM (Windows Firewall, iptables).
- Route table: Is there a User Defined Route inadvertently null-routing traffic?
- Public IP: Is IP still assigned? Did IP change (dynamic allocation)?
- Console access: Use Azure Serial Console or Boot Diagnostics if SSH/RDP is blocked.
- Guest OS: Check system logs — OOM killer, disk full, crashed service.
