AWS Monitoring, Auditing & Performance — CloudWatch, CloudTrail, Config
Complete monitoring stack for CloudOps. CloudWatch metrics, alarms, logs, dashboards, EventBridge, CloudTrail auditing, and AWS Config compliance — all heavy SOA-C03 exam topics.
What you'll learn
- Understand EC2 default metrics and what requires the CloudWatch Agent
- Create metric alarms and automated remediation actions
- Ship application and OS logs to CloudWatch Logs
- Build event-driven automations with EventBridge
- Audit API calls with CloudTrail
- Enforce compliance rules with AWS Config
CloudWatch Overview
Amazon CloudWatch is AWS's native monitoring and observability service. It collects metrics, logs, and events from AWS services and your applications.
EC2 instances, RDS, Lambda, etc.
→ push metrics to CloudWatch
→ you create alarms, dashboards, log queries
→ trigger notifications or automated actions
EC2 CloudWatch Metrics
EC2 pushes these metrics to CloudWatch automatically (no agent needed):
| Metric | Description |
|---|---|
| CPUUtilization | CPU usage % |
| NetworkIn / NetworkOut | Bytes transferred |
| StatusCheckFailed_System | AWS hardware/hypervisor issue |
| StatusCheckFailed_Instance | Guest OS issue |
| StatusCheckFailed_AttachedEBS | EBS volume reachability |
| DiskReadOps / DiskWriteOps | Instance store only (not EBS) |
Warning
RAM utilization is NOT included in default EC2 metrics. This is a classic exam trap. You need the CloudWatch Agent to collect memory metrics.
Monitoring intervals
| Mode | Interval | Cost |
|---|---|---|
| Basic monitoring | 5 minutes | Free |
| Detailed monitoring | 1 minute | Small charge |
Enable detailed monitoring when you need faster Auto Scaling reactions or finer-grained alarms.
CloudWatch Agent
The Unified CloudWatch Agent extends monitoring beyond what AWS pushes by default:
- Memory usage
- Disk space and inode utilization
- Custom application logs
- Swap usage
- Any OS-level metric
Installation and configuration
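# Amazon Linux (the package is in the default repositories)
sudo yum install -y amazon-cloudwatch-agent
# Ubuntu (the package is usually not in the default apt repositories, so install the .deb published by AWS; amd64 shown)
wget https://amazoncloudwatch-agent.s3.amazonaws.com/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb
# Run the wizard to generate a config (it writes /opt/aws/amazon-cloudwatch-agent/bin/config.json)
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
# Start the agent with the generated config
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json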
Agent config example (collect memory + nginx logs)
{
  "metrics": {
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["used_percent"],
        "resources": ["/"]
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "/ec2/nginx/access",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
Store config in Parameter Store
Store the CloudWatch Agent config in SSM Parameter Store and use SSM State Manager to push it to all instances. This way, every new instance auto-configures its monitoring.
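Before the config goes into Parameter Store, it can help to generate and validate it programmatically rather than hand-edit JSON. The following is a minimal Python sketch under that assumption; `build_agent_config` is a hypothetical helper (not part of any AWS SDK) that reproduces the structure of the example config above.

```python
import json


def build_agent_config(log_files, disk_mounts=("/",), interval=60):
    """Return a CloudWatch Agent config dict mirroring the example above.

    log_files maps a file path to the log group it should be shipped to.
    (Hypothetical helper for illustration only.)
    """
    collect_list = [
        {
            "file_path": path,
            "log_group_name": group,
            "log_stream_name": "{instance_id}",  # placeholder resolved by the agent
        }
        for path, group in log_files.items()
    ]
    return {
        "metrics": {
            "metrics_collected": {
                "mem": {
                    "measurement": ["mem_used_percent"],
                    "metrics_collection_interval": interval,
                },
                "disk": {
                    "measurement": ["used_percent"],
                    "resources": list(disk_mounts),
                },
            }
        },
        "logs": {"logs_collected": {"files": {"collect_list": collect_list}}},
    }


config = build_agent_config({"/var/log/nginx/access.log": "/ec2/nginx/access"})
config_json = json.dumps(config, indent=2)  # string suitable as a parameter value
print(config_json)
```

Once the JSON is stored as a parameter, the agent can be pointed at it with `-c ssm:<parameter-name>` in place of the `-c file:...` argument shown earlier.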
Custom Metrics
Push your own application metrics to CloudWatch:
# CLI — push a single custom metric
aws cloudwatch put-metric-data \
--namespace "MyApp/BusinessMetrics" \
--metric-name "ActiveUsers" \
--value 1337 \
--unit Count \
--dimensions Environment=Production
# High-resolution custom metric (1 second resolution)
aws cloudwatch put-metric-data \
--namespace "MyApp/Performance" \
--metric-name "RequestLatency" \
--value 45 \
--unit Milliseconds \
--storage-resolution 1 # 1 = high resolution, 60 = standard
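Metric retention (the same schedule applies to custom and AWS metrics):
- High resolution (period under 60s): kept for 3 hours
- 1-minute data points: kept for 15 days
- 5-minute data points: kept for 63 days
- 1-hour data points: kept for 455 days (15 months)
Older data is rolled up rather than dropped at once: a 1-minute data point is still available after 15 days, but only at 5-minute resolution.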
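To make the roll-up behaviour concrete, here is a small Python sketch that answers "what is the finest period I can still query for data of a given age?". The tier values are the ones published in the CloudWatch documentation (3 hours, 15 days, 63 days, 455 days); the function name and structure are illustrative assumptions.

```python
from datetime import timedelta

# (finest period in seconds, how long data points at that period are kept)
RETENTION_TIERS = [
    (1, timedelta(hours=3)),      # high-resolution data points (period < 60 s)
    (60, timedelta(days=15)),     # 1-minute data points
    (300, timedelta(days=63)),    # 5-minute data points
    (3600, timedelta(days=455)),  # 1-hour data points (about 15 months)
]


def finest_period_available(age, high_resolution=False):
    """Return the smallest period (seconds) still queryable for data of this
    age, or None if the data has expired. Hypothetical helper."""
    for period, kept_for in RETENTION_TIERS:
        if period < 60 and not high_resolution:
            continue  # standard-resolution metrics never have sub-minute data
        if age <= kept_for:
            return period
    return None


print(finest_period_available(timedelta(hours=1), high_resolution=True))
print(finest_period_available(timedelta(days=30)))
print(finest_period_available(timedelta(days=500)))
```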
CloudWatch Alarms
Alarms watch a metric and trigger actions when a threshold is crossed.
Alarm states
| State | Meaning |
|---|---|
| OK | Metric is within threshold |
| ALARM | Metric has breached threshold |
| INSUFFICIENT_DATA | Not enough data to evaluate |
Creating an alarm
# --statistic is required for a metric alarm; SNS ARNs use a 12-digit account ID
aws cloudwatch put-metric-alarm \
  --alarm-name "HighCPU-i-0abc123" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
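With `--period 300` and `--evaluation-periods 2`, the alarm above fires only after two consecutive 5-minute data points exceed 80, that is, roughly 10 minutes of sustained load. The Python sketch below is a simplified model of that evaluation, not CloudWatch's implementation: it assumes every evaluated data point must breach (no "M out of N" setting) and does not model missing-data treatment.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods):
    """Simplified model of a GreaterThanThreshold alarm.

    datapoints: one aggregated value per period, oldest first.
    Returns "INSUFFICIENT_DATA", "ALARM", or "OK".
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


# Three 5-minute CPU averages; the last two both exceed 80.
print(evaluate_alarm([45, 82, 91], threshold=80, evaluation_periods=2))
# A single spike followed by recovery does not trip the alarm.
print(evaluate_alarm([82, 91, 70], threshold=80, evaluation_periods=2))
# Only one data point so far: not enough to evaluate two periods.
print(evaluate_alarm([95], threshold=80, evaluation_periods=2))
```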
Alarm actions
Alarms can trigger:
- SNS notifications — email, SMS, Lambda, SQS
- EC2 actions — stop, terminate, reboot, recover an instance
- Auto Scaling actions — scale in/out
- SSM OpsCenter — create an OpsItem
EC2 recovery alarm
# Alarm that auto-recovers instance on system status check failure
aws cloudwatch put-metric-alarm \
  --alarm-name "EC2-AutoRecover" \
  --metric-name StatusCheckFailed_System \
  --namespace AWS/EC2 \
  --statistic Maximum \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:automate:us-east-1:ec2:recover
Recover vs Reboot vs Stop+Start
Reboot restarts the OS on the same host, so nothing moves. Recover migrates the instance to a new host but preserves the same Instance ID, private IP, EIP, and metadata. Stop+Start also moves hosts but gets a new public IP (unless an EIP is attached).
CloudWatch Logs
Key concepts
| Concept | Description |
|---|---|
| Log Group | Container for log streams (e.g., /ec2/nginx/access) |
| Log Stream | Sequence of events from one source (e.g., one instance) |
| Log Event | A single log entry with timestamp + message |
| Metric Filter | Pattern that extracts a metric from log data |
| Subscription Filter | Real-time stream of logs to Lambda, Kinesis, or OpenSearch |
Retention
By default, CloudWatch Logs never expire. Set retention to control costs:
aws logs put-retention-policy \
--log-group-name "/ec2/nginx/access" \
--retention-in-days 30
CloudWatch Logs Insights
Query your logs with a purpose-built query language in which commands are chained with pipes:
# Top 10 slowest API requests
fields @timestamp, @message
| filter @message like /api/
| parse @message "duration: * ms" as duration
| sort duration desc
| limit 10
# Count errors per minute
filter @message like /ERROR/
| stats count() as errors by bin(1m)
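For readers new to `stats ... by bin(1m)`, the second query is roughly equivalent to the following Python sketch: keep only events whose message contains `ERROR`, truncate each timestamp to the minute, and count per bucket. The sample events and the function name are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def errors_per_minute(events):
    """events: iterable of (iso_timestamp, message) pairs.

    Mirrors `filter @message like /ERROR/ | stats count() by bin(1m)`.
    """
    counts = Counter()
    for timestamp, message in events:
        if "ERROR" in message:
            minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
            counts[minute.isoformat()] += 1
    return dict(counts)


# Hypothetical sample data
sample_events = [
    ("2026-04-26T14:30:05", "ERROR db timeout"),
    ("2026-04-26T14:30:41", "INFO request ok"),
    ("2026-04-26T14:30:59", "ERROR db timeout"),
    ("2026-04-26T14:31:10", "ERROR upstream reset"),
]
per_minute = errors_per_minute(sample_events)
print(per_minute)
```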
Metric Filters (log → metric)
Extract a CloudWatch metric from log patterns:
# Create metric filter: count HTTP 500 errors
# (a space-delimited pattern needs one name per field; nine here, assuming nginx's default "combined" log format)
aws logs put-metric-filter \
  --log-group-name "/ec2/nginx/access" \
  --filter-name "HTTP500Errors" \
  --filter-pattern "[host, ident, user, timestamp, request, status=500, size, referer, agent]" \
  --metric-transformations \
    metricName=HTTP500Count,metricNamespace=NginxMetrics,metricValue=1
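What the filter computes can be approximated locally, which is handy for sanity-checking a pattern against real log lines before deploying it. The Python sketch below is only an emulation under stated assumptions: it splits each line into space-delimited fields (treating `[...]` and `"..."` as single fields) and counts lines whose sixth field is `500`. The sample lines and helper name are hypothetical.

```python
import re

# One field is a [bracketed] run, a "quoted" run, or a run of non-space characters.
FIELD = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')


def count_http_500(lines, status_index=5):
    """Count lines whose status field equals 500.

    Each match corresponds to one metricValue=1 contribution to HTTP500Count.
    """
    total = 0
    for line in lines:
        fields = FIELD.findall(line)
        if len(fields) > status_index and fields[status_index] == "500":
            total += 1
    return total


# Hypothetical access-log lines in nginx "combined" format
sample_lines = [
    '203.0.113.5 - alice [26/Apr/2026:14:30:00 +0000] "GET /api/orders HTTP/1.1" 500 512 "-" "curl/8.0"',
    '203.0.113.6 - - [26/Apr/2026:14:30:01 +0000] "GET /health HTTP/1.1" 200 2 "-" "curl/8.0"',
    '203.0.113.7 - bob [26/Apr/2026:14:30:02 +0000] "POST /api/pay HTTP/1.1" 500 0 "-" "curl/8.0"',
]
print(count_http_500(sample_lines))
```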
Live Tail
Real-time streaming of log events in the console — useful for debugging live issues without having to query.
Amazon EventBridge
EventBridge is the event bus for AWS — it routes events from AWS services, custom applications, and SaaS partners to targets.
Key concepts
| Concept | Description |
|---|---|
| Event Bus | Channel that receives events (default bus or custom) |
| Rule | Pattern matching on events → trigger target |
| Target | What executes when a rule matches (Lambda, SSM, SNS, SQS, Step Functions, etc.) |
| Schedule | Cron or rate expression to trigger targets on a timer |
Event-driven automation patterns
EC2 instance state change (running → stopped)
→ EventBridge rule
→ SNS: "Instance i-0abc123 stopped!"
→ Lambda: update inventory database
Config rule non-compliance
→ EventBridge rule
→ SSM Automation: auto-remediate the resource
CloudTrail: root account login
→ EventBridge rule
→ SNS: "ALERT: Root account used!"
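Each rule in the flows above is driven by an event pattern: a JSON document listing the values an event must carry. The Python sketch below illustrates the idea for the EC2 state-change case. It assumes the documented `aws.ec2` event shape, and `matches` is a hypothetical helper that models exact-value matching only (the real service also supports prefix, numeric, and other operators).

```python
def matches(pattern, event):
    """Return True if `event` satisfies `pattern`.

    Leaf values in the pattern are lists of allowed exact values;
    nested dicts recurse into the corresponding part of the event.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


STOPPED_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

# Trimmed-down sample event (real events carry more fields)
stopped_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0abc123", "state": "stopped"},
}
running_event = {**stopped_event, "detail": {"instance-id": "i-0abc123", "state": "running"}}

print(matches(STOPPED_PATTERN, stopped_event))
print(matches(STOPPED_PATTERN, running_event))
```

Fields that the pattern does not mention (such as `instance-id`) are ignored, which is why one rule can cover every instance in the account.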
Schedule example
# Run SSM automation every Sunday at 2am UTC
Schedule: cron(0 2 ? * SUN *)
Target: SSM Automation — patch all prod instances
EventBridge Input Transformation
Transform the event payload before it reaches the target:
{
  "InputPathsMap": {
    "instance": "$.detail.instance-id",
    "state": "$.detail.state"
  },
  "InputTemplate": "\"Instance <instance> changed to <state>\""
}
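The transformation is a two-step substitution: each entry in the paths map pulls a value out of the event, and each `<name>` placeholder in the template is replaced by that value. A rough Python emulation, assuming only simple dot-notation paths and a hypothetical sample event:

```python
import json
import re


def resolve_path(event, path):
    """Resolve a simple `$.a.b` JSON path (dot notation only)."""
    value = event
    for key in path.lstrip("$").strip(".").split("."):
        value = value[key]
    return value


def apply_input_transformer(event, input_paths_map, input_template):
    """Substitute <name> placeholders with values extracted from the event."""
    variables = {name: resolve_path(event, path) for name, path in input_paths_map.items()}
    return re.sub(r"<(\w+)>", lambda m: str(variables[m.group(1)]), input_template)


sample_event = {"detail": {"instance-id": "i-0abc123", "state": "stopped"}}
paths = {"instance": "$.detail.instance-id", "state": "$.detail.state"}
template = '"Instance <instance> changed to <state>"'

rendered = apply_input_transformer(sample_event, paths, template)
print(rendered)              # the JSON string handed to the target
print(json.loads(rendered))  # the text the target ends up with
```

The escaped quotes in the template matter: the target receives a JSON value, so a plain-text message has to be a quoted JSON string.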
Cross-account targets
EventBridge can send events to targets in other AWS accounts — useful for centralised operations:
Dev account event → EventBridge → Cross-account rule → Ops account Lambda
AWS CloudTrail
CloudTrail records API activity in your AWS account: who did what, when, and from where. Management events are logged by default; data events and Insights events must be enabled explicitly.
Event types
| Type | Description |
|---|---|
| Management events | Control plane operations (create, delete, modify resources) |
| Data events | Object-level operations (S3 GetObject, Lambda InvokeFunction) |
| Insights events | Unusual API activity patterns (anomaly detection) |
Key properties of a CloudTrail event
{
  "eventTime": "2026-04-26T14:30:00Z",
  "userIdentity": { "type": "IAMUser", "userName": "utsav" },
  "eventName": "TerminateInstances",
  "sourceIPAddress": "203.0.113.0",
  "requestParameters": { "instancesSet": { "items": [{"instanceId": "i-0abc123"}] } }
}
Exam-critical facts
- CloudTrail typically delivers log files to S3 within about 5 minutes of the API call (older material quotes 15 minutes); the delivery time is not guaranteed
- Logs are encrypted with SSE-S3 by default; use SSE-KMS for additional control
- Multi-region trail — single trail that captures events in all regions (recommended)
- Organization trail — captures events from all accounts in an AWS Organization
- Trails can be sent to CloudWatch Logs for real-time alerting
CloudTrail + EventBridge pattern
CloudTrail (management events)
→ EventBridge (filter: "DeleteBucket" by non-approved principal)
→ SNS alert to security team
→ SSM Automation to investigate
CloudTrail for CloudOps
- Detect unauthorised access: search for ConsoleLogin events from unknown IPs
- Investigate incidents: trace exact API calls leading up to an issue
- Compliance: prove who changed what for auditors
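The first use case can be sketched in a few lines once the events are available as parsed JSON (for example, read from the trail's S3 log files). The Python below is an illustrative filter, not a CloudTrail feature: the trusted CIDR list, the sample events, and the helper name are all assumptions.

```python
import ipaddress


def logins_from_unknown_ips(events, trusted_cidrs):
    """Return ConsoleLogin events whose source IP is outside every trusted CIDR."""
    networks = [ipaddress.ip_network(cidr) for cidr in trusted_cidrs]
    suspicious = []
    for event in events:
        if event.get("eventName") != "ConsoleLogin":
            continue
        try:
            source = ipaddress.ip_address(event["sourceIPAddress"])
        except (KeyError, ValueError):
            continue  # field missing, or a service name instead of an IP address
        if not any(source in network for network in networks):
            suspicious.append(event)
    return suspicious


# Hypothetical events, trimmed to the fields used here
sample_trail = [
    {"eventName": "ConsoleLogin", "sourceIPAddress": "203.0.113.10"},
    {"eventName": "ConsoleLogin", "sourceIPAddress": "198.51.100.7"},
    {"eventName": "TerminateInstances", "sourceIPAddress": "198.51.100.7"},
]
flagged = logins_from_unknown_ips(sample_trail, trusted_cidrs=["203.0.113.0/24"])
print(flagged)
```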
AWS Config
AWS Config continuously evaluates your resource configurations against compliance rules.
Key concepts
| Concept | Description |
|---|---|
| Configuration item | Point-in-time snapshot of a resource's config |
| Configuration history | Timeline of all changes to a resource |
| Config rule | Evaluation logic for compliance (AWS managed or custom Lambda) |
| Conformance pack | Collection of Config rules deployed together |
| Remediation | SSM Automation that fixes non-compliant resources |
| Aggregator | Collects Config data from multiple accounts/regions |
Common AWS managed rules
| Rule | Checks |
|---|---|
| restricted-ssh | Security groups should not allow unrestricted SSH (port 22) |
| s3-bucket-public-read-prohibited | S3 buckets should not be publicly readable |
| encrypted-volumes | Attached EBS volumes should be encrypted |
| iam-root-access-key-check | Root account should not have active access keys |
| mfa-enabled-for-iam-console-access | IAM users with a console password must have MFA enabled |
| rds-instance-public-access-check | RDS instances should not be publicly accessible |
Config + Auto-remediation
Config rule: "restricted-ssh" detects open port 22
→ Config triggers SSM Automation: AWS-DisablePublicAccessForSecurityGroup
→ Removes the 0.0.0.0/0 SSH rule automatically
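To see what "detects open port 22" means in practice, the Python sketch below approximates the kind of check such a rule performs. It is not the managed rule's actual code; it assumes ingress rules in the shape returned by the EC2 `DescribeSecurityGroups` API (`IpPermissions`), and the helper name is hypothetical.

```python
def open_ssh_rules(ip_permissions):
    """Return the ingress rules that expose port 22 to 0.0.0.0/0 or ::/0."""
    offending = []
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            covers_ssh = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            covers_ssh = rule.get("FromPort", 0) <= 22 <= rule.get("ToPort", 65535)
        else:
            covers_ssh = False
        cidrs = [r.get("CidrIp") for r in rule.get("IpRanges", [])]
        cidrs += [r.get("CidrIpv6") for r in rule.get("Ipv6Ranges", [])]
        if covers_ssh and ("0.0.0.0/0" in cidrs or "::/0" in cidrs):
            offending.append(rule)
    return offending


# Hypothetical security group ingress rules
sample_permissions = [
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22, "IpRanges": [{"CidrIp": "10.0.0.0/8"}]},
]
violations = open_ssh_rules(sample_permissions)
print(len(violations))
```

A remediation step would then revoke exactly the rules returned here and leave the restricted SSH rule and the public HTTPS rule alone.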
CloudWatch vs CloudTrail vs Config
| Service | Answers | Example |
|---|---|---|
| CloudWatch | What is happening right now? | CPU is at 95% |
| CloudTrail | Who did what and when? | Alice deleted the S3 bucket at 2pm |
| Config | Is my infrastructure compliant? | 3 security groups have port 22 open |
AWS Health Dashboard
Two views:
- Service health (public, formerly the Service Health Dashboard): AWS-wide service status
- Your account health (per account, formerly the Personal Health Dashboard): issues affecting your resources
Health events
- Scheduled maintenance (e.g., host retirement for your EC2 instance)
- Active issues affecting your services
- Notifications about upcoming deprecations
Automation with Health events + EventBridge
AWS Health event: EC2 instance host retirement
→ EventBridge rule
→ Lambda: automatically stop and start the instance
→ Instance migrates to new healthy host
Service Quotas
Service Quotas lets you view and manage limits for AWS services in one place.
# Check the current quota for running On-Demand Standard instances (L-1216C47A, measured in vCPUs)
aws service-quotas get-service-quota \
--service-code ec2 \
--quota-code L-1216C47A
# Request a quota increase
aws service-quotas request-service-quota-increase \
--service-code ec2 \
--quota-code L-1216C47A \
--desired-value 200
Set CloudWatch alarms on quota utilisation — get notified at 80% usage before you hit the limit.
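The arithmetic behind such an alarm is simply usage divided by the quota value. A minimal Python sketch, with hypothetical function names and numbers:

```python
def quota_utilisation(current_usage, quota_value):
    """Return usage as a percentage of the quota."""
    if quota_value <= 0:
        raise ValueError("quota_value must be positive")
    return 100.0 * current_usage / quota_value


def should_alert(current_usage, quota_value, threshold_percent=80.0):
    """True once utilisation reaches the alerting threshold."""
    return quota_utilisation(current_usage, quota_value) >= threshold_percent


# Example: 160 vCPUs in use against a quota of 200
print(quota_utilisation(160, 200))
print(should_alert(160, 200))
```

Inside CloudWatch the same ratio can be expressed with metric math; the `SERVICE_QUOTA()` function returns the quota for a usage metric, so an expression along the lines of `m1 / SERVICE_QUOTA(m1) * 100` yields the percentage to alarm on.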
CloudWatch Anomaly Detection
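CloudWatch can model the expected baseline of a metric using ML. The command below creates the detector; alerting on deviations additionally requires an alarm that compares the metric against the detector's band: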
aws cloudwatch put-anomaly-detector \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0abc123 \
--stat Average
Use this for metrics with seasonal patterns (more traffic during business hours) where a static threshold would generate false alarms.
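The intuition can be shown with a deliberately naive stand-in for the model: build a band from past values observed at the same time of day and flag anything outside it. This Python sketch is not CloudWatch's algorithm (which the text above does not describe); the mean plus or minus two standard deviations band and the sample data are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def anomaly_band(history, width=2.0):
    """Return (lower, upper) bounds around the mean of historical values.

    Needs at least two samples.
    """
    centre = mean(history)
    spread = stdev(history)
    return centre - width * spread, centre + width * spread


def is_anomalous(value, history, width=2.0):
    lower, upper = anomaly_band(history, width)
    return value < lower or value > upper


# Hypothetical CPU % seen on previous days at the same hour
business_hours = [60, 62, 58, 61, 59]  # 14:00
night = [10, 12, 9, 11, 8]             # 03:00

print(is_anomalous(61, business_hours))  # ordinary afternoon load
print(is_anomalous(61, night))           # same value at 03:00 is out of band
```

A static 80% threshold would stay silent in both cases, while the band flags the night-time reading without raising false alarms during the day.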
Hands-on: Full Monitoring Stack
Goal: Set up end-to-end observability for an EC2 instance.
1. Enable detailed monitoring on the instance
2. Install CloudWatch Agent:
- Collect memory, disk usage
- Ship /var/log/messages to log group /ec2/syslog
3. Create alarms:
- CPUUtilization > 80% for 5 min → SNS email
- mem_used_percent > 85% for 5 min → SNS email
- StatusCheckFailed_System ≥ 1 for 3 min → EC2 Recover action
4. Create metric filter on /ec2/syslog:
- Pattern: "Out of memory" (the phrase the kernel OOM killer logs; term matching is case-sensitive)
- Metric: OOMKillerEvents (count)
- Alarm: OOMKillerEvents ≥ 1 → SNS
5. Enable CloudTrail (multi-region):
- Deliver to S3: s3://mycompany-cloudtrail-logs
- Send to CloudWatch Logs: /cloudtrail/management-events
6. Create EventBridge rule:
- Source: aws.signin (detail-type: AWS Console Sign In via CloudTrail)
- Detail: eventName = "ConsoleLogin" AND additionalEventData.MFAUsed = "No"
- Target: SNS — "Console login without MFA!"
Common SOA-C03 Exam Questions
Q: RAM utilisation is not in CloudWatch. How do you monitor it?
Install the CloudWatch Agent on the instance and configure it to collect mem_used_percent. The instance needs an IAM role that allows cloudwatch:PutMetricData (the AWS managed policy CloudWatchAgentServerPolicy covers this).
Q: An alarm is in INSUFFICIENT_DATA state. What does this mean?
CloudWatch does not have enough data points to evaluate the alarm threshold. This happens when the alarm has just been created, when an instance has just started (with basic monitoring the first data point can take up to 5 minutes), or when the metric has stopped reporting.
Q: How do you get alerted when someone uses the root account?
Create a CloudTrail trail → deliver to CloudWatch Logs → create a metric filter with the pattern { $.userIdentity.type = "Root" } → alarm on the metric → SNS notification.
Q: What's the difference between CloudTrail and Config?
CloudTrail records API calls (who did what). Config continuously evaluates whether your resource configuration meets compliance rules (what state are resources in). Use CloudTrail to investigate incidents; use Config to enforce and report on compliance.
What to Learn Next
- AWS Systems Manager — use EventBridge + SSM Automation for auto-remediation
- AWS Security & Compliance — GuardDuty, Inspector, and Security Hub feed into CloudWatch/EventBridge
- AWS Account Management — Config Aggregators and centralized CloudTrail across accounts
