Level: intermediate · Est. time: 3 hours · 12 min read

AWS CloudOps / SysOps Engineer Interview Questions

40+ AWS CloudOps / SysOps interview questions covering SSM, CloudWatch, CloudFormation, VPC, IAM, S3, RDS, and operational best practices — aligned to the SOA-C03 exam level.

Tags: aws, cloudops, soa-c03, ssm, cloudwatch, cloudformation, vpc, iam, s3, ec2

Questions: 21+ · Topics: 8 · Est. time: 3 hours

EC2 & Systems Manager

What is AWS Systems Manager and why would you use it instead of SSH?

AWS Systems Manager (SSM) is a management service that provides a secure channel to EC2 instances — without any open inbound ports or SSH keys.

Key reasons to prefer SSM over SSH:

  • No port 22 open — Session Manager connects through HTTPS to the SSM endpoint, eliminating the attack surface of an exposed SSH port.
  • Full audit trail: session activity can be logged to CloudWatch Logs or S3 once logging is enabled in the Session Manager preferences, and CloudTrail records who started each session.
  • No key management — no .pem files to rotate, lose, or distribute. Access is controlled entirely through IAM.
  • Works in private subnets — instances without public IPs or internet access can be reached via SSM VPC endpoints.
  • Run Command at scale — execute scripts across hundreds of instances simultaneously by tag or resource group.

When SSH is still needed: legacy AMIs where the SSM agent isn't installed, or very low-level network/boot troubleshooting that Session Manager can't access.


An EC2 instance doesn't appear as a managed node in SSM Fleet Manager. How do you troubleshoot?

Work through this checklist in order:

  1. IAM instance profile — the instance must have a role with AmazonSSMManagedInstanceCore. Check: aws ec2 describe-instances --query 'Reservations[].Instances[].IamInstanceProfile'.
  2. SSM Agent running — SSH in (if possible) or use EC2 Instance Connect to run systemctl status amazon-ssm-agent. If stopped: systemctl start amazon-ssm-agent.
  3. Outbound HTTPS connectivity — the agent must reach SSM endpoints on port 443. Check security groups allow outbound 443, and NACLs allow it too.
  4. VPC endpoints (private subnet) — if there's no NAT gateway, check that three VPC interface endpoints exist: ssm, ssmmessages, ec2messages.
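  5. Agent version: old agents may not support all features. While the node is unregistered, Run Command cannot reach it, so update the agent on the instance itself with the OS package manager. Once it is managed, keep it current with aws ssm send-command --document-name AWS-UpdateSSMAgent --instance-ids <instance-id>.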

How does SSM Patch Manager work and what is a Patch Group?

Patch Manager automates OS patching across a fleet. The workflow:

  1. Patch Baseline — defines approval rules: which severities (Critical, Important), which classifications (SecurityUpdates), and an auto-approval delay (e.g., 7 days after release).
  2. Patch Group — a tag (Patch Group = prod) applied to instances that links them to a specific baseline.
  3. Maintenance Window — a scheduled time window (e.g., Sundays 2–4am) when patching runs.
  4. Run Command task — within the window, AWS-RunPatchBaseline executes on tagged instances.

After patching, Patch Manager reports compliance status per instance — which you can aggregate in Security Hub.

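Common mistake: the tag key must be exactly Patch Group (two words, capital P and G) or PatchGroup (no space, required when tags are exposed through instance metadata). Tag keys are case-sensitive, so any other spelling breaks the baseline association.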
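
The tag-key pitfall can be illustrated with a small sketch. This is not how Patch Manager is implemented; it only mimics the case-sensitive lookup, and the function name and the plain-dict tag format are assumptions made for the example.

```python
# Tag keys Patch Manager recognises for patch groups (case-sensitive).
PATCH_GROUP_KEYS = ("Patch Group", "PatchGroup")


def patch_group_of(tags):
    """Return the patch group value from an instance's tags, or None.

    `tags` is a plain dict of tag key -> value, a simplified stand-in
    for the tag list the EC2 API returns (an assumption of this sketch).
    """
    for key in PATCH_GROUP_KEYS:
        if key in tags:
            return tags[key]
    return None


print(patch_group_of({"Name": "web-1", "Patch Group": "prod"}))  # prod
print(patch_group_of({"Name": "web-2", "patch group": "prod"}))  # None
```

The second instance carries a lower-case key, so it never joins the prod patch group and would fall back to the default baseline.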


What is the difference between SSM Run Command and SSM Automation?

| | Run Command | Automation |
|---|---|---|
| Executes | Commands on instances (via SSM Agent) | AWS API calls + Lambda + nested steps |
| Target | Instances only | Any AWS resource |
| Use case | Run a script on N servers | Multi-step operational workflow (patch AMI, remediate Config finding) |
| Trigger | Manual, SSM Maintenance Window | Manual, EventBridge, Config |

Example: Use Run Command to rotate an app's config file on 200 servers. Use Automation to: stop an EC2 instance → create an AMI → patch it → update the Auto Scaling Group launch template → restart instances.


CloudWatch & Monitoring

RAM utilisation is not showing in CloudWatch. How do you fix this?

RAM is not a default EC2 metric — AWS only pushes CPU, network, disk (instance store), and status checks.

Fix:

  1. Install the CloudWatch Agent on the instance.
  2. Configure it to collect mem_used_percent (and optionally disk_used_percent).
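  3. The instance needs an IAM role with cloudwatch:PutMetricData (the managed policy CloudWatchAgentServerPolicy includes it).

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
# Select: metrics → mem, disk
# Confirm: output to CloudWatch
# Load the generated config and start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json

Once the agent is running with this config, metrics appear under the CWAgent namespace.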
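
If you would rather generate the agent configuration than click through the wizard, a minimal sketch might look like the following. The helper name and its defaults are assumptions for illustration; the JSON keys follow the CloudWatch Agent configuration schema, where the disk measurement used_percent is published as disk_used_percent.

```python
import json


def build_agent_config(interval_seconds=60, disk_paths=("/",)):
    """Build a minimal CloudWatch Agent config collecting memory and disk usage."""
    return {
        "metrics": {
            "namespace": "CWAgent",
            "metrics_collected": {
                "mem": {
                    "measurement": ["mem_used_percent"],
                    "metrics_collection_interval": interval_seconds,
                },
                "disk": {
                    # Reported in CloudWatch as disk_used_percent
                    "measurement": ["used_percent"],
                    "resources": list(disk_paths),
                    "metrics_collection_interval": interval_seconds,
                },
            },
        }
    }


# Print the JSON you would save as the agent's config file.
print(json.dumps(build_agent_config(), indent=2))
```

The printed JSON can be stored in a file or in SSM Parameter Store and loaded by the agent, which keeps the configuration identical across a fleet.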


Explain the difference between CloudWatch, CloudTrail, and AWS Config.

This trio is a classic interview question — always answer with what each one answers:

| Service | Answers | Example |
|---|---|---|
| CloudWatch | What is happening right now / over time? | CPU is at 90%, alarm triggered, logs show 500 errors |
| CloudTrail | Who did what, when, from where? | Alice called DeleteBucket at 14:32 from IP 203.0.113.1 |
| Config | Is my infrastructure compliant? | 3 security groups allow port 22 from 0.0.0.0/0 (non-compliant) |

They complement each other: CloudTrail identifies who made the change behind a Config violation, Config rules can trigger automated remediation through SSM Automation, and CloudWatch alarms alert on the operational symptoms.


An EC2 instance is failing its system status check. What do you do?

A system status check failure indicates a problem with the underlying AWS hardware (host), not your VM.

Automated remediation: Create a CloudWatch Alarm on StatusCheckFailed_System ≥ 1 for 3 periods → action: EC2 Recover. This migrates the instance to a healthy host, preserving the instance ID, EIPs, and metadata.

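Manual: Stop and start the instance (not reboot; a stop/start of an EBS-backed instance moves it to a new host, while a reboot stays on the same one). Note: the public IP changes unless you're using an Elastic IP, and any instance store data is lost.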
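
The recovery alarm described above can be expressed as a parameter set for the CloudWatch PutMetricAlarm API. The sketch below only builds the parameters and calls nothing; the helper name, the alarm naming convention, the 60-second period, and the example instance ID are assumptions.

```python
def recover_alarm_params(instance_id, region):
    """Parameters for an alarm that recovers an instance on system check failure.

    Keys follow the CloudWatch PutMetricAlarm API. The alarm name is a
    made-up convention for this sketch.
    """
    return {
        "AlarmName": f"recover-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,             # seconds per evaluation period (assumed)
        "EvaluationPeriods": 3,   # "for 3 periods"
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 action ARN that triggers instance recovery
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


params = recover_alarm_params("i-0123456789abcdef0", "us-east-1")
print(params["AlarmActions"][0])  # arn:aws:automate:us-east-1:ec2:recover
```

With an SDK such as boto3, a dictionary like this could be passed as keyword arguments to the put_metric_alarm call.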


CloudFormation

What is CloudFormation drift and how do you detect it?

Drift occurs when someone manually changes a resource (e.g., edits a security group in the console) after CloudFormation deployed it. The actual resource config no longer matches the template.

# Detect drift on a stack
aws cloudformation detect-stack-drift --stack-name my-stack

# Poll for completion
aws cloudformation describe-stack-drift-detection-status --stack-drift-detection-id <id>

# View drifted resources
aws cloudformation describe-stack-resource-drifts \
  --stack-name my-stack \
  --stack-resource-drift-status-filters MODIFIED DELETED

Remediation options:

  1. Update the template to match the manual change, then run a stack update.
  2. Revert the manual change back to the original config.
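
To make the idea of drift concrete, here is a toy comparison of a template's expected properties against the live resource. It is a simplification, not CloudFormation's algorithm: the function name and the flat property dicts are assumptions, though the status strings match the ones drift detection reports.

```python
def drift_status(expected, actual):
    """Classify one resource the way drift detection reports it.

    `expected` holds the template's property values; `actual` holds the
    live values, or None if the resource no longer exists. Only properties
    named in the template are compared.
    """
    if actual is None:
        return "DELETED", {}
    differences = {
        name: (value, actual.get(name))
        for name, value in expected.items()
        if actual.get(name) != value
    }
    return ("MODIFIED" if differences else "IN_SYNC"), differences


template_sg = {"Description": "web", "IngressPorts": [443]}
live_sg = {"Description": "web", "IngressPorts": [443, 22]}  # port 22 added by hand

print(drift_status(template_sg, live_sg))  # MODIFIED, with IngressPorts listed
print(drift_status(template_sg, None))     # DELETED
```

Each reported difference pairs the expected value with the actual one, which is the information you need to choose between the two remediation options.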

A CloudFormation stack update fails and can't roll back. How do you recover?

When a stack gets stuck in UPDATE_ROLLBACK_FAILED, use:

aws cloudformation continue-update-rollback --stack-name my-stack

If a specific resource is blocking rollback, skip it:

aws cloudformation continue-update-rollback \
  --stack-name my-stack \
  --resources-to-skip MyProblematicResource

The skipped resource stays in its current state — you'll need to manually fix it afterwards.


What is the difference between CloudFormation nested stacks and StackSets?

| | Nested Stacks | StackSets |
|---|---|---|
| Purpose | Modularise a single deployment | Deploy the same template to many accounts/regions |
| Scope | One account, one region | Multi-account, multi-region |
| Lifecycle | Child stack managed by parent | Stack instances are independent |
| Use case | Break a large template into reusable modules | Deploy security baselines, CloudTrail, Config across an org |

VPC & Networking

Walk me through how traffic flows from a private EC2 instance to the internet.

Private instance (10.0.2.10)
  → Route table: 0.0.0.0/0 → nat-xxxxxxxx
    → NAT Gateway (in public subnet, has Elastic IP)
      → Route table: 0.0.0.0/0 → igw-xxxxxxxx
        → Internet Gateway
          → Internet

For this to work: the NAT Gateway must be in a public subnet, the private subnet's route table must point to the NAT Gateway (not the IGW), and the public subnet's route table must point to the IGW.
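
The two route-table lookups in that path follow longest-prefix matching, which the sketch below imitates. The route tables and the 10.0.0.0/16 VPC range are hypothetical, reusing the placeholder IDs from the diagram.

```python
import ipaddress

# Hypothetical route tables: destination CIDR -> target
PRIVATE_RT = {"10.0.0.0/16": "local", "0.0.0.0/0": "nat-xxxxxxxx"}
PUBLIC_RT = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw-xxxxxxxx"}


def next_hop(route_table, destination_ip):
    """Return the target of the most specific route matching destination_ip."""
    ip = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table.items()
        if ip in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # Longest prefix wins
    return max(matches, key=lambda m: m[0].prefixlen)[1]


print(next_hop(PRIVATE_RT, "203.0.113.10"))  # nat-xxxxxxxx
print(next_hop(PUBLIC_RT, "203.0.113.10"))   # igw-xxxxxxxx
print(next_hop(PRIVATE_RT, "10.0.1.5"))      # local
```

Internet-bound traffic from the private subnet resolves to the NAT Gateway first, and the NAT Gateway's own subnet then resolves the same destination to the Internet Gateway.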


Explain the difference between Security Groups and NACLs.

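| | Security Group | NACL |
|---|---|---|
| Level | Instance / ENI | Subnet |
| Stateful | Yes (return traffic automatic) | No (must allow both directions) |
| Rules | Allow only | Allow + Deny |
| Evaluation | All rules | Numbered order (lowest first) |
| Default | Deny all inbound, allow all outbound | Default NACL allows all; a custom NACL denies all until rules are added |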

Key exam trap: NACLs are stateless. If you allow HTTPS inbound (443), you must also explicitly allow the ephemeral return ports (1024–65535) outbound, or responses won't reach the client.
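
The stateless behaviour is easier to see in a small simulation. The rule format (rule number, port range, CIDR, action) and the function name are assumptions made for this sketch; the evaluation order mirrors how NACLs work, with the lowest rule number matching first and an implicit deny at the end.

```python
import ipaddress


def nacl_allows(rules, port, peer_ip):
    """Evaluate NACL rules in ascending rule-number order; first match wins."""
    ip = ipaddress.ip_address(peer_ip)
    for _number, (low, high), cidr, action in sorted(rules, key=lambda r: r[0]):
        if low <= port <= high and ip in ipaddress.ip_network(cidr):
            return action == "allow"
    return False  # the implicit final "*" rule denies everything else


inbound = [(100, (443, 443), "0.0.0.0/0", "allow")]
outbound_missing = []
outbound_fixed = [(100, (1024, 65535), "0.0.0.0/0", "allow")]

client_ip, client_port = "203.0.113.1", 50000
print(nacl_allows(inbound, 443, client_ip))                   # True
print(nacl_allows(outbound_missing, client_port, client_ip))  # False
print(nacl_allows(outbound_fixed, client_port, client_ip))    # True
```

The request gets in on 443 either way, but the response leaves from the server toward the client's ephemeral port, so it is dropped until the outbound rule exists.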


How do you allow an EC2 instance in a private subnet to access S3 without internet?

Create a Gateway VPC endpoint for S3:

aws ec2 create-vpc-endpoint \
  --vpc-id vpc-12345 \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-private-12345

This adds a route to the private subnet's route table pointing to the endpoint (no NAT Gateway required). Traffic to S3 stays entirely within the AWS network, which is more secure, and Gateway endpoints carry no additional charge, so you also avoid NAT Gateway data processing fees.


S3 & Storage

You accidentally deleted a versioned object in S3. How do you restore it?

Deleting a versioned object creates a delete marker — the object isn't gone; it's just hidden behind the marker.

# List all versions including delete markers
aws s3api list-object-versions --bucket my-bucket --prefix myfile.txt

# Remove the delete marker to restore the object
aws s3api delete-object \
  --bucket my-bucket \
  --key myfile.txt \
  --version-id <delete-marker-version-id>

The object reappears as the latest version. To permanently delete all versions, you must delete each version ID explicitly.
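
A toy model of a single versioned key shows why removing the delete marker restores the object. The class and its v1, v2 style version IDs are invented for illustration; real S3 version IDs are opaque strings.

```python
import itertools


class VersionedKey:
    """Toy model of one key in a versioning-enabled bucket."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.versions = []  # oldest first; each entry is (version_id, body)

    def put(self, body):
        version_id = f"v{next(self._ids)}"
        self.versions.append((version_id, body))
        return version_id

    def delete(self, version_id=None):
        if version_id is None:
            # Simple DELETE: stack a delete marker (body None) on top.
            return self.put(None)
        # DELETE with a version ID: permanently remove that one version.
        self.versions = [v for v in self.versions if v[0] != version_id]
        return version_id

    def get(self):
        """Return the current body, or None if hidden by a delete marker."""
        if not self.versions:
            return None
        return self.versions[-1][1]


obj = VersionedKey()
obj.put("hello")
marker_id = obj.delete()   # adds a delete marker
print(obj.get())           # None
obj.delete(marker_id)      # removes the marker itself
print(obj.get())           # hello
```

The same delete call does two different things depending on whether a version ID is supplied, which is exactly the distinction the CLI commands rely on.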


What S3 features would you use to protect an S3 bucket from accidental public access?

Layered approach:

  1. Block Public Access at account level — overrides any bucket policy or ACL granting public access.
  2. Bucket policy with Deny if aws:SecureTransport = false — enforce HTTPS only.
  3. S3 Object Lock (compliance mode) for critical data — prevents deletion until retention period expires.
  4. IAM Access Analyzer for S3 — automatically surfaces buckets that are public or cross-account shared.
  5. AWS Config rule s3-bucket-public-read-prohibited with auto-remediation.
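
The first two layers can be written down as data. The sketch below builds the four Block Public Access settings and an HTTPS-only bucket policy as plain Python structures; the function name and bucket name are placeholders, and nothing here calls AWS.

```python
import json

# The four Block Public Access switches, all turned on.
BLOCK_PUBLIC_ACCESS = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def https_only_policy(bucket):
    """Bucket policy that denies any request not made over TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


print(json.dumps(https_only_policy("my-bucket"), indent=2))
```

The policy is a Deny, so it narrows access without granting anything, and it lists both the bucket ARN and the object ARN pattern so that bucket-level and object-level calls are covered.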

IAM & Security

What is the difference between an IAM role and an IAM user?

| | IAM User | IAM Role |
|---|---|---|
| Credentials | Long-term (access key + secret) | Temporary (STS tokens, 15 minutes to 12 hours) |
| Assigned to | A specific person | AWS services, EC2 instances, Lambda, federated users |
| Key rotation | Manual | Automatic (STS handles it) |
| Best practice | Humans with MFA; prefer SSO | Always use for services (never access keys on EC2) |

Follow-up: Why not put access keys directly on an EC2 instance? Keys are long-lived and exposed if the instance is compromised. An IAM role gives temporary credentials that auto-rotate and never touch disk.


Explain the concept of least privilege. How do you implement it for an EC2 application?

Least privilege means granting only the exact permissions required — nothing more.

Implementation steps:

  1. Start with a deny-all role — blank IAM role with no policies attached.
  2. Identify access patterns — what AWS services does the app call? What actions on which resources?
  3. Write a specific policy — use resource ARNs (not *), specific actions, and condition keys where possible.
  4. Use IAM Access Advisor to see last accessed services and remove unused permissions.
  5. Review periodically — use IAM Access Analyzer to find over-permissive policies.

Example: an app that reads from one S3 bucket gets s3:GetObject on arn:aws:s3:::my-bucket/* — not s3:* on *.
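
A rough way to spot the difference between those two policies is to scan for wildcards. The checker below is a deliberately naive sketch with an assumed function name; it is no substitute for IAM Access Analyzer and ignores conditions, NotAction, and partial wildcards inside ARNs.

```python
def broad_statements(policy):
    """Return Allow statements that use a wildcard action or resource."""
    def as_list(value):
        return value if isinstance(value, list) else [value]

    flagged = []
    for stmt in as_list(policy["Statement"]):
        if stmt.get("Effect") != "Allow":
            continue
        actions = as_list(stmt.get("Action", []))
        resources = as_list(stmt.get("Resource", []))
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        wildcard_resource = "*" in resources
        if wildcard_action or wildcard_resource:
            flagged.append(stmt)
    return flagged


scoped = {"Version": "2012-10-17", "Statement": [
    {"Effect": "Allow", "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::my-bucket/*"}]}
broad = {"Version": "2012-10-17", "Statement": [
    {"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}

print(len(broad_statements(scoped)))  # 0
print(len(broad_statements(broad)))   # 1
```

A check like this could run in a CI pipeline to catch the most obvious over-grants before a policy is deployed.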


Disaster Recovery & Backup

What are the four DR strategies in AWS and when would you use each?

| Strategy | RTO | RPO | Cost | Description |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | Lowest | Periodic snapshots; restore when disaster strikes |
| Pilot Light | Tens of minutes | Minutes | Low | Core services (DB) always running; scale up compute on failover |
| Warm Standby | Minutes | Seconds | Medium | Reduced-capacity copy always running; scale to full on failover |
| Multi-Site Active/Active | Near zero | Near zero | Highest | Full capacity in both regions simultaneously |

Choose based on your RTO/RPO requirements and budget. Most applications use Backup & Restore or Warm Standby.


How does AWS Backup differ from EBS snapshots?

| | EBS Snapshots | AWS Backup |
|---|---|---|
| Scope | Single EBS volume | Multi-service (EC2, RDS, EFS, DynamoDB, S3, FSx) |
| Policy | Manual or Data Lifecycle Manager | Centralised backup policies |
| Cross-account | Manual (share or copy the snapshot) | Yes, natively |
| Audit | Limited | Full backup audit logs in CloudTrail |
| Retention | Manual | Policy-driven, lifecycle rules |

Use AWS Backup for enterprise-wide backup governance. Use direct EBS snapshots for simple, ad-hoc volume backups.


Account Management & Cost

What is an AWS Service Control Policy (SCP) and how does it differ from an IAM policy?

| | SCP | IAM Policy |
|---|---|---|
| Applied to | Accounts / OUs in Organizations | Users, roles, groups |
| Grants permissions | No, it sets the ceiling | Yes |
| Overrideable | No (even by account root) | Yes, by adding more policies |
| Purpose | Organisation-wide guardrails | Per-identity permissions |

Key point: If an SCP denies an action, no IAM policy in a member account can allow it, even for that account's root user. Note that SCPs do not affect the Organizations management account. This makes SCPs ideal for enforcing non-negotiable security controls across member accounts.
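
The "ceiling" behaviour can be sketched as a heavily simplified evaluation: an explicit deny anywhere wins, and otherwise both the SCP and the IAM policy must allow the action. The function names and the flat pattern lists are assumptions; real policy evaluation also involves resources, conditions, permission boundaries, and resource policies.

```python
from fnmatch import fnmatchcase


def matches(patterns, action):
    """True if the action matches any pattern (IAM actions are case-insensitive)."""
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def is_allowed(action, scp_allow, scp_deny, iam_allow, iam_deny=()):
    """Explicit deny wins; otherwise SCP and IAM must both allow."""
    if matches(scp_deny, action) or matches(iam_deny, action):
        return False
    return matches(scp_allow, action) and matches(iam_allow, action)


# An administrator (IAM allows everything) under a guardrail SCP.
scp_allow, scp_deny = ["*"], ["cloudtrail:StopLogging"]
admin = ["*"]

print(is_allowed("s3:GetObject", scp_allow, scp_deny, admin))            # True
print(is_allowed("cloudtrail:StopLogging", scp_allow, scp_deny, admin))  # False
print(is_allowed("ec2:RunInstances", ["s3:*"], [], ["ec2:*"]))           # False
```

The last call shows the other half of the rule: an IAM allow is useless if the SCP never allowed that service in the first place.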


An AWS cost spike appears. How do you investigate?

  1. Cost Explorer → filter by service, then by resource tag → identify the spike.
  2. Cost & Usage Report (CUR) in Athena for line-item detail.
  3. CloudTrail → search for large-scale resource creations around the spike time.
  4. Cost Allocation Tags → confirm tags are present on resources; missing tags make attribution hard.
  5. Trusted Advisor → check for idle resources (unattached EIPs, unused Reserved Instances).
  6. Compute Optimizer → check for over-provisioned EC2 instances.

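Prevention: set AWS Budgets alerts at 80% of expected monthly spend, and CloudWatch alarms on the EstimatedCharges billing metric (published only in us-east-1, and only after billing alerts are enabled in the account's billing preferences).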
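
The first investigation step, finding which service drives the spike, boils down to a per-service comparison between two periods. The sketch below works on made-up numbers rather than real Cost Explorer output, and the function name is an assumption.

```python
def biggest_increase(previous, current):
    """Return (service, delta) for the largest cost increase between two periods."""
    deltas = {
        service: current.get(service, 0.0) - previous.get(service, 0.0)
        for service in set(previous) | set(current)
    }
    service = max(deltas, key=deltas.get)
    return service, deltas[service]


# Made-up daily costs in USD, for illustration only.
yesterday = {"EC2": 120.0, "S3": 14.0, "NAT Gateway": 9.0}
today = {"EC2": 125.0, "S3": 15.0, "NAT Gateway": 96.0}

print(biggest_increase(yesterday, today))  # ('NAT Gateway', 87.0)
```

Once the service is known, the same comparison can be repeated per tag or per resource ID to narrow the spike down further.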