AWS CloudOps / SysOps Engineer Interview Questions
40+ AWS CloudOps / SysOps interview questions covering SSM, CloudWatch, CloudFormation, VPC, IAM, S3, RDS, and operational best practices — aligned to the SOA-C03 exam level.
Questions: 21+. Topics: 8. Est. time: 3 hours.
EC2 & Systems Manager
What is AWS Systems Manager and why would you use it instead of SSH?
AWS Systems Manager (SSM) is a management service that provides a secure channel to EC2 instances — without any open inbound ports or SSH keys.
Key reasons to prefer SSM over SSH:
- No port 22 open — Session Manager connects through HTTPS to the SSM endpoint, eliminating the attack surface of an exposed SSH port.
- Full audit trail — every session is logged to CloudWatch Logs or S3, giving you a complete record of who ran what.
- No key management — no `.pem` files to rotate, lose, or distribute. Access is controlled entirely through IAM.
- Works in private subnets — instances without public IPs or internet access can be reached via SSM VPC endpoints.
- Run Command at scale — execute scripts across hundreds of instances simultaneously by tag or resource group.
When SSH is still needed: legacy AMIs where the SSM agent isn't installed, or very low-level network/boot troubleshooting that Session Manager can't access.
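The two access patterns above can be sketched as small shell wrappers. This is a minimal sketch, assuming the AWS CLI is configured and, for interactive sessions, that the Session Manager plugin is installed. The function names, the `Env` tag key, and the instance ID in the example are hypothetical placeholders.

```shell
# Open an interactive shell on an instance without SSH
# (assumes the Session Manager plugin is installed locally).
# Usage: ssm_shell <instance-id>
ssm_shell() {
  aws ssm start-session --target "$1"
}

# Run one shell command on every managed instance carrying a given tag
# and print the resulting command ID.
# Usage: run_on_tag <tag-key> <tag-value> <command>
run_on_tag() {
  aws ssm send-command \
    --document-name "AWS-RunShellScript" \
    --targets "Key=tag:$1,Values=$2" \
    --parameters "commands=[\"$3\"]" \
    --query 'Command.CommandId' --output text
}

# Example (hypothetical values):
#   ssm_shell i-0123456789abcdef0
#   run_on_tag Env prod "uptime"
```

The shorthand `commands=[...]` form only suits simple commands; anything containing quotes or commas is easier to pass as a JSON parameters file.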
An EC2 instance doesn't appear as a managed node in SSM Fleet Manager. How do you troubleshoot?
Work through this checklist in order:
- IAM instance profile — the instance must have a role with `AmazonSSMManagedInstanceCore` attached. Check: `aws ec2 describe-instances --query 'Reservations[].Instances[].IamInstanceProfile'`.
- SSM Agent running — SSH in (if possible) or use EC2 Instance Connect to run `systemctl status amazon-ssm-agent`. If stopped: `sudo systemctl start amazon-ssm-agent`.
- Outbound HTTPS connectivity — the agent must reach the SSM endpoints on port 443. Check that security groups and NACLs both allow outbound 443.
- VPC endpoints (private subnet) — if there's no NAT gateway, check that three VPC interface endpoints exist: `ssm`, `ssmmessages`, `ec2messages`.
- Agent version — old agents may not support all features. Once the node is managed, update via `aws ssm send-command --document-name AWS-UpdateSSMAgent --targets "Key=InstanceIds,Values=<instance-id>"`.
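Two of the checks in this list are easy to script. The following is a rough sketch, assuming a configured AWS CLI; both function names are hypothetical, and the endpoint check only compares service names, so it does not verify security groups or private DNS on the endpoints.

```shell
# Print the agent's last reported status (for example Online or ConnectionLost),
# or "None" when the instance has never registered with Systems Manager.
# Usage: ssm_ping_status <instance-id>
ssm_ping_status() {
  aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=$1" \
    --query 'InstanceInformationList[0].PingStatus' --output text
}

# Print which of the three SSM interface endpoints are missing from a VPC.
# Usage: missing_ssm_endpoints <vpc-id> <region>
missing_ssm_endpoints() {
  raw=$(aws ec2 describe-vpc-endpoints \
    --region "$2" \
    --filters "Name=vpc-id,Values=$1" \
    --query 'VpcEndpoints[].ServiceName' --output text) || return 1
  # Collapse tabs/newlines to single spaces and pad so whole names can be matched.
  present=" $(echo $raw) "
  for svc in ssm ssmmessages ec2messages; do
    case "$present" in
      *" com.amazonaws.$2.$svc "*) ;;   # endpoint exists
      *) echo "$svc" ;;                 # endpoint missing
    esac
  done
}
```

An empty result from `missing_ssm_endpoints` means all three service names were found in the VPC.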
How does SSM Patch Manager work and what is a Patch Group?
Patch Manager automates OS patching across a fleet. The workflow:
- Patch Baseline — defines approval rules: which severities (Critical, Important), which classifications (SecurityUpdates), and an auto-approval delay (e.g., 7 days after release).
- Patch Group — a tag (`Patch Group = prod`) applied to instances that links them to a specific baseline.
- Maintenance Window — a scheduled time window (e.g., Sundays 2–4am) when patching runs.
- Run Command task — within the window, `AWS-RunPatchBaseline` executes on tagged instances.
After patching, Patch Manager reports compliance status per instance — which you can aggregate in Security Hub.
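Common mistake: the tag key is case-sensitive and must be `Patch Group` (two words, capital P and G) or `PatchGroup` (one word, required when tags in instance metadata are enabled). Any other spelling or case breaks the baseline association.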
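Linking an instance to a baseline takes two calls: one to tag the instance and one to register the group with the baseline. A minimal sketch, assuming a configured AWS CLI; the function name and the IDs in the example are hypothetical.

```shell
# Tag an instance into a patch group, then link that group to a patch baseline.
# Usage: assign_patch_group <instance-id> <group-name> <baseline-id>
assign_patch_group() {
  # The tag key must be spelled exactly "Patch Group" (or "PatchGroup").
  aws ec2 create-tags \
    --resources "$1" \
    --tags "Key=Patch Group,Value=$2" || return 1

  aws ssm register-patch-baseline-for-patch-group \
    --baseline-id "$3" \
    --patch-group "$2"
}

# Example (hypothetical IDs):
#   assign_patch_group i-0123456789abcdef0 prod pb-0123456789abcdef0
```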
What is the difference between SSM Run Command and SSM Automation?
| | Run Command | Automation |
|---|---|---|
| Executes | Commands on instances (via SSM Agent) | AWS API calls + Lambda + nested steps |
| Target | Instances only | Any AWS resource |
| Use case | Run a script on N servers | Multi-step operational workflow (patch AMI, remediate Config finding) |
| Trigger | Manual, SSM Maintenance Window | Manual, EventBridge, Config |
Example: Use Run Command to rotate an app's config file on 200 servers. Use Automation to: stop an EC2 instance → create an AMI → patch it → update the Auto Scaling Group launch template → restart instances.
CloudWatch & Monitoring
RAM utilisation is not showing in CloudWatch. How do you fix this?
RAM is not a default EC2 metric — AWS only pushes CPU, network, disk (instance store), and status checks.
Fix:
- Install the CloudWatch Agent on the instance.
- Configure it to collect `mem_used_percent` (and optionally `disk_used_percent`).
- The instance needs an IAM role that allows `cloudwatch:PutMetricData`; the managed policy `CloudWatchAgentServerPolicy` includes it.
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
# Select: metrics → mem, disk
# Confirm: output to CloudWatch
After running, metrics appear under the CWAgent namespace.
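If you would rather skip the interactive wizard, a hand-written config can produce the same two metrics. The sketch below is an assumption about a reasonable minimal config, not the wizard's exact output; the file name `cwagent-config.json` and the function name are hypothetical.

```shell
# Write a minimal agent config that reports memory and root-disk usage.
# The quoted 'EOF' stops the shell from trying to expand ${aws:InstanceId}.
cat > cwagent-config.json <<'EOF'
{
  "metrics": {
    "append_dimensions": { "InstanceId": "${aws:InstanceId}" },
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["/"] }
    }
  }
}
EOF

# Load the config and start the agent (run this on the EC2 instance itself).
load_cwagent_config() {
  sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config -m ec2 -s -c "file:$PWD/cwagent-config.json"
}
```

The `disk` measurement `used_percent` is published as `disk_used_percent`, matching the metric name mentioned above.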
Explain the difference between CloudWatch, CloudTrail, and AWS Config.
This trio is a classic interview question. Frame your answer around the question each service answers:
| Service | Answers | Example |
|---|---|---|
| CloudWatch | What is happening right now / over time? | CPU is at 90%, alarm triggered, logs show 500 errors |
| CloudTrail | Who did what, when, from where? | Alice called DeleteBucket at 14:32 from IP 203.0.113.1 |
| Config | Is my infrastructure compliant? | 3 security groups allow port 22 from 0.0.0.0/0 — non-compliant |
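They complement each other: CloudTrail identifies who made the change behind a Config violation, Config rules can trigger automated remediation through SSM Automation documents, and CloudWatch alarms alert you when the resulting metrics or logs cross a threshold.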
An EC2 instance is failing its system status check. What do you do?
A system status check failure indicates a problem with the underlying AWS hardware (host), not your VM.
Automated remediation: Create a CloudWatch Alarm on StatusCheckFailed_System ≥ 1 for 3 periods → action: EC2 Recover. This migrates the instance to a healthy host, preserving the instance ID, EIPs, and metadata.
Manual: Stop and start the instance (not reboot — a stop/start moves it to a new host). Note: the public IP changes unless you're using an Elastic IP.
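The automated remediation described above can be created from the CLI. This is a sketch under the stated thresholds (three 60-second periods); the function name and alarm name pattern are hypothetical.

```shell
# Create an alarm that recovers an instance after 3 consecutive minutes
# of failed system status checks.
# Usage: create_recover_alarm <instance-id> <region>
create_recover_alarm() {
  aws cloudwatch put-metric-alarm \
    --region "$2" \
    --alarm-name "recover-$1" \
    --namespace AWS/EC2 \
    --metric-name StatusCheckFailed_System \
    --dimensions "Name=InstanceId,Value=$1" \
    --statistic Maximum \
    --period 60 \
    --evaluation-periods 3 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions "arn:aws:automate:$2:ec2:recover"
}

# Example (hypothetical instance ID):
#   create_recover_alarm i-0123456789abcdef0 us-east-1
```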
CloudFormation
What is CloudFormation drift and how do you detect it?
Drift occurs when someone manually changes a resource (e.g., edits a security group in the console) after CloudFormation deployed it. The actual resource config no longer matches the template.
# Detect drift on a stack
aws cloudformation detect-stack-drift --stack-name my-stack
# Poll for completion
aws cloudformation describe-stack-drift-detection-status --stack-drift-detection-id <id>
# View drifted resources
aws cloudformation describe-stack-resource-drifts \
--stack-name my-stack \
--stack-resource-drift-status-filters MODIFIED DELETED
Remediation options:
- Update the template to match the manual change, then run a stack update.
- Revert the manual change back to the original config.
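The detect-then-poll sequence can be wrapped in one function so a script can wait for the result. A minimal sketch, assuming a configured AWS CLI; the function name and the default 5-second poll interval are assumptions.

```shell
# Start drift detection, wait for it to finish, then print the stack's
# drift status (for example DRIFTED or IN_SYNC).
# Usage: detect_drift <stack-name> [poll-seconds]
detect_drift() {
  detection_id=$(aws cloudformation detect-stack-drift \
    --stack-name "$1" \
    --query 'StackDriftDetectionId' --output text) || return 1

  while :; do
    # Text output of a two-element list is "DetectionStatus<TAB>StackDriftStatus".
    status=$(aws cloudformation describe-stack-drift-detection-status \
      --stack-drift-detection-id "$detection_id" \
      --query '[DetectionStatus, StackDriftStatus]' --output text) || return 1
    case "$status" in
      DETECTION_IN_PROGRESS*) sleep "${2:-5}" ;;
      DETECTION_COMPLETE*)    echo "$status" | cut -f2; return 0 ;;
      *) echo "drift detection failed: $status" >&2; return 1 ;;
    esac
  done
}
```

A CI job could call `detect_drift my-stack` and fail the build when the printed status is `DRIFTED`.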
A CloudFormation stack update fails and can't roll back. How do you recover?
When a stack gets stuck in UPDATE_ROLLBACK_FAILED, use:
aws cloudformation continue-update-rollback --stack-name my-stack
If a specific resource is blocking rollback, skip it:
aws cloudformation continue-update-rollback \
--stack-name my-stack \
--resources-to-skip MyProblematicResource
The skipped resource stays in its current state — you'll need to manually fix it afterwards.
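Before skipping anything, it helps to see which resources actually failed and why. A small sketch, assuming a configured AWS CLI; the function name is hypothetical, and the filter simply matches any event whose status ends in `FAILED`.

```shell
# List stack events whose status ends in FAILED, with the recorded reason.
# Usage: failed_resources <stack-name>
failed_resources() {
  aws cloudformation describe-stack-events \
    --stack-name "$1" \
    --query "StackEvents[?ends_with(ResourceStatus, 'FAILED')].[LogicalResourceId, ResourceStatus, ResourceStatusReason]" \
    --output text
}

# Example:
#   failed_resources my-stack
```

The logical IDs it prints are the values you would pass to `--resources-to-skip`.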
What is the difference between CloudFormation nested stacks and StackSets?
| | Nested Stacks | StackSets |
|---|---|---|
| Purpose | Modularise a single deployment | Deploy the same template to many accounts/regions |
| Scope | One account, one region | Multi-account, multi-region |
| Lifecycle | Child stack managed by parent | Stack instances are independent |
| Use case | Break a large template into reusable modules | Deploy security baselines, CloudTrail, Config across an org |
VPC & Networking
Walk me through how traffic flows from a private EC2 instance to the internet.
Private instance (10.0.2.10)
→ Route table: 0.0.0.0/0 → nat-xxxxxxxx
→ NAT Gateway (in public subnet, has Elastic IP)
→ Route table: 0.0.0.0/0 → igw-xxxxxxxx
→ Internet Gateway
→ Internet
For this to work: the NAT Gateway must be in a public subnet, the private subnet's route table must point to the NAT Gateway (not the IGW), and the public subnet's route table must point to the IGW.
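One way to confirm the routing described above is to check where each route table's default route points. This is a rough sketch, assuming a configured AWS CLI; the function name and the three labels it prints are hypothetical, and it only inspects the `0.0.0.0/0` route.

```shell
# Classify a route table by the target of its default route (0.0.0.0/0).
# Prints "private" (NAT gateway), "public" (internet gateway) or "isolated".
# Usage: classify_route_table <route-table-id>
classify_route_table() {
  target=$(aws ec2 describe-route-tables \
    --route-table-ids "$1" \
    --query "RouteTables[0].Routes[?DestinationCidrBlock=='0.0.0.0/0'] | [0].[NatGatewayId, GatewayId]" \
    --output text) || return 1
  case "$target" in
    *nat-*) echo private ;;
    *igw-*) echo public ;;
    *)      echo isolated ;;
  esac
}
```

For the flow above to work, the private subnet's table should report `private` and the NAT Gateway's subnet should report `public`.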
Explain the difference between Security Groups and NACLs.
| | Security Group | NACL |
|---|---|---|
| Level | Instance / ENI | Subnet |
| Stateful | Yes (return traffic automatic) | No (must allow both directions) |
| Rules | Allow only | Allow + Deny |
| Evaluation | All rules | Numbered order (lowest first) |
| Default | Deny all inbound, allow all outbound | Default NACL allows all; a custom NACL denies all until you add rules |
Key exam trap: NACLs are stateless. If you allow HTTPS inbound (443), you must also explicitly allow the ephemeral return ports (1024–65535) outbound, or responses won't reach the client.
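The pair of rules from the exam trap can be written out explicitly. A minimal sketch, assuming a configured AWS CLI; the function name, rule number 100, and the open `0.0.0.0/0` CIDR are illustrative assumptions.

```shell
# Allow inbound HTTPS and the matching outbound replies on a NACL.
# Usage: allow_https_through_nacl <nacl-id>
allow_https_through_nacl() {
  # Inbound: clients connect to port 443.
  aws ec2 create-network-acl-entry \
    --network-acl-id "$1" --ingress --rule-number 100 \
    --protocol tcp --port-range From=443,To=443 \
    --cidr-block 0.0.0.0/0 --rule-action allow || return 1

  # Outbound: replies go back to the client's ephemeral port.
  aws ec2 create-network-acl-entry \
    --network-acl-id "$1" --egress --rule-number 100 \
    --protocol tcp --port-range From=1024,To=65535 \
    --cidr-block 0.0.0.0/0 --rule-action allow
}
```

With a security group, only the inbound rule would be needed because return traffic is tracked automatically.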
How do you allow an EC2 instance in a private subnet to access S3 without internet?
Create a Gateway VPC endpoint for S3:
aws ec2 create-vpc-endpoint \
--vpc-id vpc-12345 \
--service-name com.amazonaws.us-east-1.s3 \
--route-table-ids rtb-private-12345
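This adds a route to the private subnet's route table pointing to the endpoint (no NAT Gateway required). Traffic to S3 stays entirely within the AWS network, which is more secure, and gateway endpoints carry no hourly or data processing charge, so you also avoid NAT Gateway processing fees for that traffic.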
S3 & Storage
You accidentally deleted a versioned object in S3. How do you restore it?
Deleting a versioned object creates a delete marker — the object isn't gone; it's just hidden behind the marker.
# List all versions including delete markers
aws s3api list-object-versions --bucket my-bucket --prefix myfile.txt
# Remove the delete marker to restore the object
aws s3api delete-object \
--bucket my-bucket \
--key myfile.txt \
--version-id <delete-marker-version-id>
The object reappears as the latest version. To permanently delete all versions, you must delete each version ID explicitly.
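The lookup and removal of the delete marker can be combined so you do not have to copy the version ID by hand. This is a sketch, assuming a configured AWS CLI and that the listing for the key fits in a single page; the function name is hypothetical.

```shell
# Restore a "deleted" object by removing its current delete marker.
# Usage: restore_object <bucket> <key>
restore_object() {
  # Find the delete marker that is currently the latest version of the key.
  marker_id=$(aws s3api list-object-versions \
    --bucket "$1" --prefix "$2" \
    --query "DeleteMarkers[?Key=='$2' && IsLatest].VersionId | [0]" \
    --output text) || return 1

  if [ -z "$marker_id" ] || [ "$marker_id" = "None" ]; then
    echo "no current delete marker for $2" >&2
    return 1
  fi

  aws s3api delete-object --bucket "$1" --key "$2" --version-id "$marker_id"
}

# Example:
#   restore_object my-bucket myfile.txt
```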
What S3 features would you use to protect an S3 bucket from accidental public access?
Layered approach:
- Block Public Access at account level — overrides any bucket policy or ACL granting public access.
- Bucket policy with `Deny` if `aws:SecureTransport = false` — enforce HTTPS only.
- S3 Object Lock (compliance mode) for critical data — prevents deletion until retention period expires.
- IAM Access Analyzer for S3 — automatically surfaces buckets that are public or cross-account shared.
- AWS Config rule `s3-bucket-public-read-prohibited` with auto-remediation.
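The first two layers can be sketched as follows. This assumes a configured AWS CLI; the function name, the policy file name, and the `Sid` are hypothetical, and `my-bucket` is a placeholder bucket name.

```shell
# Turn on all four Block Public Access settings for an account.
# Usage: block_public_access <account-id>
block_public_access() {
  aws s3control put-public-access-block \
    --account-id "$1" \
    --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
}

# Bucket policy that rejects any request not sent over TLS.
cat > https-only-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyInsecureTransport",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    "Condition": { "Bool": { "aws:SecureTransport": "false" } }
  }]
}
EOF

# Apply the policy with:
#   aws s3api put-bucket-policy --bucket my-bucket --policy file://https-only-policy.json
```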
IAM & Security
What is the difference between an IAM role and an IAM user?
| | IAM User | IAM Role |
|---|---|---|
| Credentials | Long-term (access key + secret) | Temporary (STS tokens, 15 minutes to 12 hours; 1 hour by default) |
| Assigned to | A specific person | AWS services, EC2 instances, Lambda, federated users |
| Key rotation | Manual | Automatic (STS handles it) |
| Best practice | Humans with MFA; prefer SSO | Always use for services (never access keys on EC2) |
Follow-up: Why not put access keys directly on an EC2 instance? Keys are long-lived and exposed if the instance is compromised. An IAM role gives temporary credentials that auto-rotate and never touch disk.
Explain the concept of least privilege. How do you implement it for an EC2 application?
Least privilege means granting only the exact permissions required — nothing more.
Implementation steps:
- Start from implicit deny — create the IAM role with no policies attached, so every request is denied until you add permissions.
- Identify access patterns — what AWS services does the app call? What actions on which resources?
- Write a specific policy — use resource ARNs (not `*`), specific actions, and condition keys where possible.
- Use IAM Access Advisor to see last accessed services and remove unused permissions.
- Review periodically — use IAM Access Analyzer to find over-permissive policies.
Example: an app that reads from one S3 bucket gets s3:GetObject on arn:aws:s3:::my-bucket/* — not s3:* on *.
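Written out as a policy document, that example might look like the sketch below. The file name, `Sid`, policy name, and function name are hypothetical; `my-bucket` is the placeholder bucket from the example.

```shell
# Policy granting read access to objects in exactly one bucket.
cat > read-one-bucket.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "ReadObjectsFromOneBucket",
    "Effect": "Allow",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-bucket/*"
  }]
}
EOF

# Attach the policy inline to the role used by the instance profile.
# Usage: attach_read_policy <role-name>
attach_read_policy() {
  aws iam put-role-policy \
    --role-name "$1" \
    --policy-name read-one-bucket \
    --policy-document file://read-one-bucket.json
}
```

If the app later needs to list the bucket, add `s3:ListBucket` on `arn:aws:s3:::my-bucket` as a separate statement rather than widening the action to `s3:*`.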
Disaster Recovery & Backup
What are the four DR strategies in AWS and when would you use each?
| Strategy | RTO | RPO | Cost | Description |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | Lowest | Periodic snapshots; restore when disaster strikes |
| Pilot Light | Tens of minutes | Minutes | Low | Core services (DB) always running; scale up compute on failover |
| Warm Standby | Minutes | Seconds | Medium | Reduced-capacity copy always running; scale to full on failover |
| Multi-Site Active/Active | Near zero | Near zero | Highest | Full capacity in both regions simultaneously |
Choose based on your RTO/RPO requirements and budget. Most applications use Backup & Restore or Warm Standby.
How does AWS Backup differ from EBS snapshots?
| | EBS Snapshots | AWS Backup |
|---|---|---|
| Scope | Single EBS volume | Multi-service (EC2, RDS, EFS, DynamoDB, S3, FSx) |
| Policy | Manual or Data Lifecycle Manager | Centralised backup policies |
| Cross-account | Manual (share or copy each snapshot) | Yes, natively |
| Audit | Limited | Full backup audit logs in CloudTrail |
| Retention | Manual | Policy-driven, lifecycle rules |
Use AWS Backup for enterprise-wide backup governance. Use direct EBS snapshots for simple, ad-hoc volume backups.
Account Management & Cost
What is an AWS Service Control Policy (SCP) and how does it differ from an IAM policy?
| | SCP | IAM Policy |
|---|---|---|
| Applied to | Accounts / OUs in Organizations | Users, roles, groups |
| Grants permissions | No — it sets the ceiling | Yes |
| Overrideable | No (even by a member account's root user) | Yes, by adding more policies |
| Purpose | Organisation-wide guardrails | Per-identity permissions |
Key point: If an SCP denies an action, no IAM policy in the account can allow it, even for a member account's root user. SCPs do not restrict the organization's management account, so keep workloads out of it. This makes SCPs ideal for enforcing non-negotiable security controls across all member accounts.
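As an illustration of a guardrail, the sketch below defines an SCP that stops anyone in a member account from turning off CloudTrail. The choice of actions, the file name, the policy name, and the function name are assumptions made for the example.

```shell
# SCP that denies stopping or deleting CloudTrail trails.
cat > deny-cloudtrail-tampering.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyCloudTrailTampering",
    "Effect": "Deny",
    "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
    "Resource": "*"
  }]
}
EOF

# Create the SCP from the management account. Attach it afterwards with:
#   aws organizations attach-policy --policy-id <policy-id> --target-id <ou-or-account-id>
create_scp() {
  aws organizations create-policy \
    --name deny-cloudtrail-tampering \
    --description "Prevent disabling or deleting CloudTrail trails" \
    --type SERVICE_CONTROL_POLICY \
    --content file://deny-cloudtrail-tampering.json
}
```

Because the statement is a `Deny`, it holds even if an administrator in the member account attaches `AdministratorAccess` to a role.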
An AWS cost spike appears. How do you investigate?
- Cost Explorer → filter by service, then by resource tag → identify the spike.
- Cost & Usage Report (CUR) in Athena for line-item detail.
- CloudTrail → search for large-scale resource creations around the spike time.
- Cost Allocation Tags → confirm tags are present on resources; missing tags make attribution hard.
- Trusted Advisor → check for idle resources (unattached EIPs, unused Reserved Instances).
- Compute Optimizer → check for over-provisioned EC2 instances.
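Prevention: set AWS Budgets alerts at 80% of expected monthly spend, and CloudWatch alarms on EstimatedCharges (billing metrics are published only in us-east-1 and only after billing alerts are enabled in the account).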
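The first investigation step also works from the CLI, which is handy when you want to compare days in a script. A minimal sketch, assuming a configured AWS CLI with Cost Explorer enabled; the function name and the dates in the example are hypothetical.

```shell
# Print daily unblended cost grouped by service for a date range.
# The end date is exclusive.
# Usage: daily_cost_by_service <start YYYY-MM-DD> <end YYYY-MM-DD>
daily_cost_by_service() {
  aws ce get-cost-and-usage \
    --time-period "Start=$1,End=$2" \
    --granularity DAILY \
    --metrics UnblendedCost \
    --group-by Type=DIMENSION,Key=SERVICE \
    --output json
}

# Example (hypothetical dates):
#   daily_cost_by_service 2024-01-01 2024-01-08
```

Once the service behind the spike is clear, swapping the `--group-by` key to a cost allocation tag (`Type=TAG,Key=<tag-key>`) narrows it down to a team or workload.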
