AWS Monitoring, Auditing & Performance — CloudWatch, CloudTrail, Config

Intermediate | Topic | 55 min | 11 min read | 26 Apr 2026 | AWS

Complete monitoring stack for CloudOps. CloudWatch metrics, alarms, logs, dashboards, EventBridge, CloudTrail auditing, and AWS Config compliance — all heavy SOA-C03 exam topics.

What you'll learn

  • Understand EC2 default metrics and what requires the CloudWatch Agent
  • Create metric alarms and automated remediation actions
  • Ship application and OS logs to CloudWatch Logs
  • Build event-driven automations with EventBridge
  • Audit API calls with CloudTrail
  • Enforce compliance rules with AWS Config

Relevant for certifications

SOA-C03

CloudWatch Overview

Amazon CloudWatch is AWS's native monitoring and observability service. It collects metrics, logs, and events from AWS services and your applications.

EC2 instances, RDS, Lambda, etc.
  → push metrics to CloudWatch
    → you create alarms, dashboards, log queries
      → trigger notifications or automated actions

EC2 CloudWatch Metrics

EC2 pushes these metrics to CloudWatch automatically (no agent needed):

| Metric | Description |
|---|---|
| CPUUtilization | CPU usage % |
| NetworkIn / NetworkOut | Bytes transferred |
| StatusCheckFailed_System | AWS hardware/hypervisor issue |
| StatusCheckFailed_Instance | Guest OS issue |
| StatusCheckFailed_AttachedEBS | EBS volume reachability |
| DiskReadOps / DiskWriteOps | Instance store only (not EBS) |

Warning

RAM utilization is NOT included in default EC2 metrics. This is a classic exam trap. You need the CloudWatch Agent to collect memory metrics.

Monitoring intervals

| Mode | Interval | Cost |
|---|---|---|
| Basic monitoring | 5 minutes | Free |
| Detailed monitoring | 1 minute | Small charge |

Enable detailed monitoring when you need faster Auto Scaling reactions or finer-grained alarms.


CloudWatch Agent

The Unified CloudWatch Agent extends monitoring beyond what AWS pushes by default:

  • Memory usage
  • Disk space and inode utilization
  • Custom application logs
  • Swap usage
  • Any OS-level metric

Installation and configuration

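# Amazon Linux (the package is in the default repositories)
sudo yum install -y amazon-cloudwatch-agent

# Ubuntu (not in the default apt repositories; install the .deb published by AWS)
wget https://amazoncloudwatch-agent.s3.amazonaws.com/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb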

# Run the wizard to generate config (by default it saves to bin/config.json)
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

# Start the agent with the generated config
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json

Agent config example (collect memory, root-disk usage, and nginx logs)

{
  "metrics": {
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["used_percent"],
        "resources": ["/"]
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "/ec2/nginx/access",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}

Store config in Parameter Store

Store the CloudWatch Agent config in SSM Parameter Store and use SSM State Manager to push it to all instances. This way, every new instance auto-configures its monitoring.


Custom Metrics

Push your own application metrics to CloudWatch:

# CLI — push a single custom metric
aws cloudwatch put-metric-data \
  --namespace "MyApp/BusinessMetrics" \
  --metric-name "ActiveUsers" \
  --value 1337 \
  --unit Count \
  --dimensions Environment=Production

# High-resolution custom metric (1 second resolution)
aws cloudwatch put-metric-data \
  --namespace "MyApp/Performance" \
  --metric-name "RequestLatency" \
  --value 45 \
  --unit Milliseconds \
  --storage-resolution 1    # 1 = high resolution, 60 = standard

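Custom metric retention (CloudWatch keeps progressively coarser aggregates as data ages):

  • High resolution (periods under 60s): available for 3 hours
  • 1-minute data points: available for 15 days
  • 5-minute data points: available for 63 days
  • 1-hour data points: available for 455 days (about 15 months)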
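
The roll-up behaviour can be easier to reason about as a lookup. The sketch below is illustrative only: it encodes CloudWatch's documented retention tiers (3 hours, 15 days, 63 days, 455 days) and the helper name `finest_available_period` is hypothetical, not part of any AWS SDK.

```python
from datetime import timedelta

# CloudWatch retention tiers: (data point period in seconds, how long it is kept).
RETENTION_TIERS = [
    (1, timedelta(hours=3)),      # high-resolution data points (period < 60 s)
    (60, timedelta(days=15)),     # 1-minute data points
    (300, timedelta(days=63)),    # 5-minute data points
    (3600, timedelta(days=455)),  # 1-hour data points (about 15 months)
]


def finest_available_period(age: timedelta):
    """Return the smallest period (seconds) still queryable for data of this age.

    Returns None once the data has aged out of every tier.
    """
    for period, kept_for in RETENTION_TIERS:
        if age <= kept_for:
            return period
    return None


if __name__ == "__main__":
    for age in (timedelta(hours=1), timedelta(days=2), timedelta(days=30)):
        print(age, "->", finest_available_period(age), "second granularity")
```

In practice this means a dashboard looking back 30 days cannot show 1-minute detail, even if the metric was published every minute.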

CloudWatch Alarms

Alarms watch a metric and trigger actions when a threshold is crossed.

Alarm states

| State | Meaning |
|---|---|
| OK | Metric is within threshold |
| ALARM | Metric has breached threshold |
| INSUFFICIENT_DATA | Not enough data to evaluate |

Creating an alarm

# --statistic is required when the alarm is defined on a single metric.
# Account IDs in ARNs are 12 digits; 123456789012 is a placeholder.
aws cloudwatch put-metric-alarm \
  --alarm-name "HighCPU-i-0abc123" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
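
To make the interaction between the threshold and `--evaluation-periods` concrete, here is a minimal sketch of how the three alarm states could be derived from a series of per-period values. It is a simplification, not CloudWatch's actual evaluator: it ignores the "M out of N" datapoints option and the configurable treatment of missing data, and the function name `evaluate_alarm` is hypothetical.

```python
from typing import Optional, Sequence


def evaluate_alarm(datapoints: Sequence[Optional[float]],
                   threshold: float,
                   evaluation_periods: int) -> str:
    """Simplified GreaterThanThreshold evaluation.

    `datapoints` holds one aggregated value per period, oldest first;
    None marks a period for which no data arrived.
    """
    window = list(datapoints[-evaluation_periods:])
    # Too few periods, or a gap in the window: the state cannot be decided.
    if len(window) < evaluation_periods or any(v is None for v in window):
        return "INSUFFICIENT_DATA"
    # Every period in the window must breach for the alarm to fire.
    if all(v > threshold for v in window):
        return "ALARM"
    return "OK"


if __name__ == "__main__":
    # Two consecutive 5-minute averages above 80% trigger the alarm.
    print(evaluate_alarm([50.0, 85.0, 90.0], threshold=80, evaluation_periods=2))
```

With a 300-second period and two evaluation periods, as in the command above, the metric has to stay above 80% for roughly ten minutes before the state changes.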

Alarm actions

Alarms can trigger:

  • SNS notifications — email, SMS, Lambda, SQS
  • EC2 actions — stop, terminate, reboot, recover an instance
  • Auto Scaling actions — scale in/out
  • SSM OpsCenter — create an OpsItem

EC2 recovery alarm

# Alarm that auto-recovers instance on system status check failure
aws cloudwatch put-metric-alarm \
  --alarm-name "EC2-AutoRecover" \
  --metric-name StatusCheckFailed_System \
  --namespace AWS/EC2 \
  --statistic Maximum \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:automate:us-east-1:ec2:recover

Recover vs Reboot vs Stop+Start

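Recover moves the instance to a new host but preserves the same instance ID, private IP addresses, Elastic IP, and instance metadata. Reboot restarts the operating system on the same host, so it does not help when the underlying hardware is at fault. Stop+Start normally moves the instance to a new host as well, but the instance gets a new public IP (unless an Elastic IP is attached).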


CloudWatch Logs

Key concepts

| Concept | Description |
|---|---|
| Log Group | Container for log streams (e.g., /ec2/nginx/access) |
| Log Stream | Sequence of events from one source (e.g., one instance) |
| Log Event | A single log entry with timestamp + message |
| Metric Filter | Pattern that extracts a metric from log data |
| Subscription Filter | Real-time stream of logs to Lambda, Kinesis, or OpenSearch |

Retention

By default, log data in CloudWatch Logs never expires. Set a retention period on each log group to control costs:

aws logs put-retention-policy \
  --log-group-name "/ec2/nginx/access" \
  --retention-in-days 30

CloudWatch Logs Insights

Query your logs with a purpose-built query language that chains commands with pipes:

# Top 10 slowest API requests
fields @timestamp, @message
| filter @message like /api/
| parse @message "duration: * ms" as duration
| sort duration desc
| limit 10

# Count errors per minute
filter @message like /ERROR/
| stats count() as errors by bin(1m)
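
For readers new to `bin()`, the second query is roughly equivalent to truncating each timestamp to the start of its minute and counting per bucket. The sketch below shows that idea in plain Python over in-memory sample events; the function name and the event shape (timestamp string, message) are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def errors_per_minute(events):
    """Count events containing 'ERROR' per one-minute bucket.

    `events` is an iterable of (ISO-8601 timestamp string, message) pairs.
    """
    counts = Counter()
    for timestamp, message in events:
        if "ERROR" in message:
            # Truncate to the start of the minute, which is what bin(1m) does.
            bucket = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
            counts[bucket.isoformat()] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [
        ("2026-04-26T14:30:05", "ERROR upstream timed out"),
        ("2026-04-26T14:30:59", "ERROR connection reset"),
        ("2026-04-26T14:31:01", "INFO request served"),
    ]
    print(errors_per_minute(sample))
```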

Metric Filters (log → metric)

Extract a CloudWatch metric from log patterns:

# Create metric filter: count HTTP 500 errors
# The trailing "..." lets the pattern match lines that carry extra fields
# (nginx's default combined format appends referer and user agent after size)
aws logs put-metric-filter \
  --log-group-name "/ec2/nginx/access" \
  --filter-name "HTTP500Errors" \
  --filter-pattern "[host, ident, user, timestamp, request, status=500, size, ...]" \
  --metric-transformations \
    metricName=HTTP500Count,metricNamespace=NginxMetrics,metricValue=1
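
A space-delimited filter pattern assigns names to positional fields, where anything in square brackets or double quotes counts as a single field. The following sketch imitates that behaviour locally so the pattern can be sanity-checked against sample lines before it is deployed. It approximates the matching rules and is not the CloudWatch implementation; `parse_access_line` and `count_http_500` are hypothetical helpers.

```python
import re

# One token per field: [bracketed], "quoted", or a run of non-space characters.
FIELD_RE = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')

FIELD_NAMES = ["host", "ident", "user", "timestamp", "request", "status", "size"]


def parse_access_line(line: str) -> dict:
    """Map the leading fields of an access-log line to names; extra fields are ignored."""
    tokens = FIELD_RE.findall(line)
    return dict(zip(FIELD_NAMES, tokens))


def count_http_500(lines) -> int:
    """Emit 1 for every line whose status field equals 500, then sum the result."""
    return sum(1 for line in lines if parse_access_line(line).get("status") == "500")


if __name__ == "__main__":
    line = ('203.0.113.7 - - [26/Apr/2026:14:30:00 +0000] '
            '"GET /api/users HTTP/1.1" 500 512 "-" "curl/8.0"')
    print(parse_access_line(line))
    print(count_http_500([line]))
```

Matching on the named `status` field rather than on the bare string 500 avoids counting lines where 500 happens to be the response size.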

Live Tail

Real-time streaming of log events in the console — useful for debugging live issues without having to query.


Amazon EventBridge

EventBridge is the event bus for AWS — it routes events from AWS services, custom applications, and SaaS partners to targets.

Key concepts

| Concept | Description |
|---|---|
| Event Bus | Channel that receives events (default bus or custom) |
| Rule | Pattern matching on events → trigger target |
| Target | What executes when a rule matches (Lambda, SSM, SNS, SQS, Step Functions, etc.) |
| Schedule | Cron or rate expression to trigger targets on a timer |

Event-driven automation patterns

EC2 instance state change (running → stopped)
  → EventBridge rule
    → SNS: "Instance i-0abc123 stopped!"
    → Lambda: update inventory database

Config rule non-compliance
  → EventBridge rule
    → SSM Automation: auto-remediate the resource

CloudTrail: root account login
  → EventBridge rule
    → SNS: "ALERT: Root account used!"
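
Each of these flows starts with a rule whose event pattern is compared against incoming events. As a rough mental model, the sketch below implements exact-value matching only: every key in the pattern must be present in the event, and leaf values are lists of accepted values. Real EventBridge patterns support more (prefix, numeric, and exists filters, array-valued event fields), so treat `matches` as a hypothetical teaching aid.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified EventBridge-style matching.

    A pattern maps each key to either a nested pattern (dict) or a list of
    accepted values. Keys that the pattern does not mention are ignored.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Recurse into nested objects such as "detail".
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


if __name__ == "__main__":
    pattern = {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
        "detail": {"state": ["stopped", "terminated"]},
    }
    event = {
        "source": "aws.ec2",
        "detail-type": "EC2 Instance State-change Notification",
        "detail": {"instance-id": "i-0abc123", "state": "stopped"},
    }
    print(matches(pattern, event))
```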

Schedule example

# Run SSM automation every Sunday at 2am UTC
Schedule: cron(0 2 ? * SUN *)
Target: SSM Automation — patch all prod instances

EventBridge Input Transformation

Transform the event payload before it reaches the target:

{
  "InputPathsMap": {
    "instance": "$.detail.instance-id",
    "state": "$.detail.state"
  },
  "InputTemplate": "\"Instance <instance> changed to <state>\""
}
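
To see what the target would receive, the transformation can be reproduced locally. This is a minimal sketch under stated assumptions: it resolves only simple dotted paths such as `$.detail.state` (no arrays or wildcards), and both `resolve_path` and `transform` are hypothetical helpers rather than AWS functions.

```python
import re


def resolve_path(event: dict, path: str):
    """Resolve a simple '$.a.b' path (dotted keys only)."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    value = event
    for key in path[2:].split("."):
        value = value[key]
    return value


def transform(event: dict, input_paths_map: dict, input_template: str) -> str:
    """Extract the mapped values, then substitute each <name> placeholder in the template."""
    values = {name: resolve_path(event, path) for name, path in input_paths_map.items()}
    # Unknown placeholders are left untouched.
    return re.sub(r"<(\w+)>", lambda m: str(values.get(m.group(1), m.group(0))), input_template)


if __name__ == "__main__":
    event = {"detail": {"instance-id": "i-0abc123", "state": "stopped"}}
    paths = {"instance": "$.detail.instance-id", "state": "$.detail.state"}
    print(transform(event, paths, '"Instance <instance> changed to <state>"'))
```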

Cross-account targets

EventBridge can send events to targets in other AWS accounts — useful for centralised operations:

Dev account event → EventBridge → Cross-account rule → Ops account Lambda

AWS CloudTrail

CloudTrail records every API call made in your AWS account — who did what, when, from where.

Event types

| Type | Description |
|---|---|
| Management events | Control plane operations (create, delete, modify resources) |
| Data events | Object-level operations (S3 GetObject, Lambda InvokeFunction) |
| Insights events | Unusual API activity patterns (anomaly detection) |

Key properties of a CloudTrail event

{
  "eventTime": "2026-04-26T14:30:00Z",
  "userIdentity": { "type": "IAMUser", "userName": "utsav" },
  "eventName": "TerminateInstances",
  "sourceIPAddress": "203.0.113.0",
  "requestParameters": { "instancesSet": { "items": [{"instanceId": "i-0abc123"}] } }
}

Exam-critical facts

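  • CloudTrail log files are typically delivered to S3 within about 5 minutes of the API call (older documentation and exam material cite 15 minutes); delivery time is not guaranteed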
  • Logs are encrypted with SSE-S3 by default; use SSE-KMS for additional control
  • Multi-region trail — single trail that captures events in all regions (recommended)
  • Organization trail — captures events from all accounts in an AWS Organization
  • Trails can be sent to CloudWatch Logs for real-time alerting

CloudTrail + EventBridge pattern

CloudTrail (management events)
  → EventBridge (filter: "DeleteBucket" by non-approved principal)
    → SNS alert to security team
    → SSM Automation to investigate

CloudTrail for CloudOps

  • Detect unauthorised access: search for ConsoleLogin events from unknown IPs
  • Investigate incidents: trace exact API calls leading up to an issue
  • Compliance: prove who changed what for auditors
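
The first use case can be sketched as a filter over parsed CloudTrail records. The example below assumes the events are already loaded as dictionaries with the `eventName` and `sourceIPAddress` fields shown earlier; the allow-list `KNOWN_NETWORKS` and the function name are hypothetical and would be replaced by your own office or VPN ranges.

```python
import ipaddress

# Hypothetical allow-list of networks considered "known".
KNOWN_NETWORKS = [ipaddress.ip_network("198.51.100.0/24")]


def suspicious_console_logins(events, known_networks=KNOWN_NETWORKS):
    """Return ConsoleLogin events whose sourceIPAddress is outside the known networks."""
    flagged = []
    for event in events:
        if event.get("eventName") != "ConsoleLogin":
            continue
        try:
            source = ipaddress.ip_address(event.get("sourceIPAddress", ""))
        except ValueError:
            # sourceIPAddress is not always an IP (it can be an AWS service name); skip those.
            continue
        if not any(source in network for network in known_networks):
            flagged.append(event)
    return flagged


if __name__ == "__main__":
    sample = [
        {"eventName": "ConsoleLogin", "sourceIPAddress": "198.51.100.10"},
        {"eventName": "ConsoleLogin", "sourceIPAddress": "203.0.113.5"},
        {"eventName": "TerminateInstances", "sourceIPAddress": "203.0.113.5"},
    ]
    for event in suspicious_console_logins(sample):
        print("Unrecognised login from", event["sourceIPAddress"])
```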

AWS Config

AWS Config continuously evaluates your resource configurations against compliance rules.

Key concepts

| Concept | Description |
|---|---|
| Configuration item | Point-in-time snapshot of a resource's config |
| Configuration history | Timeline of all changes to a resource |
| Config rule | Evaluation logic for compliance (AWS managed or custom Lambda) |
| Conformance pack | Collection of Config rules deployed together |
| Remediation | SSM Automation that fixes non-compliant resources |
| Aggregator | Collects Config data from multiple accounts/regions |

Common AWS managed rules

| Rule | Checks |
|---|---|
| restricted-ssh | Security groups should not allow unrestricted SSH (port 22) |
| s3-bucket-public-read-prohibited | S3 buckets should not be publicly readable |
| encrypted-volumes | EBS volumes should be encrypted |
| iam-root-access-key-check | Root account should not have active access keys |
| mfa-enabled-for-iam-console-access | IAM users must have MFA |
| rds-instance-public-access-check | RDS instances should not be publicly accessible |

Config + Auto-remediation

Config rule: "restricted-ssh" detects open port 22
  → Config triggers SSM Automation: AWS-DisablePublicAccessForSecurityGroup
    → Removes the 0.0.0.0/0 SSH rule automatically
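
The kind of check a rule like restricted-ssh performs can be sketched as a pure function over a security group description. The dictionary shape below follows the EC2 DescribeSecurityGroups response (`IpPermissions`, `FromPort`, `ToPort`, `IpRanges`, `Ipv6Ranges`); the evaluation logic itself is an illustrative assumption and may differ from the managed rule's exact behaviour.

```python
def ssh_open_to_world(security_group: dict) -> bool:
    """Return True if any ingress rule allows port 22 from 0.0.0.0/0 or ::/0."""
    for permission in security_group.get("IpPermissions", []):
        protocol = permission.get("IpProtocol")
        if protocol == "-1":
            covers_ssh = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            covers_ssh = permission.get("FromPort", 0) <= 22 <= permission.get("ToPort", 65535)
        else:
            covers_ssh = False
        if not covers_ssh:
            continue
        open_v4 = any(r.get("CidrIp") == "0.0.0.0/0" for r in permission.get("IpRanges", []))
        open_v6 = any(r.get("CidrIpv6") == "::/0" for r in permission.get("Ipv6Ranges", []))
        if open_v4 or open_v6:
            return True
    return False


def evaluate(security_group: dict) -> str:
    """Map the check onto the compliance labels AWS Config uses."""
    return "NON_COMPLIANT" if ssh_open_to_world(security_group) else "COMPLIANT"


if __name__ == "__main__":
    group = {"IpPermissions": [{"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
                                "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]}
    print(evaluate(group))
```

A custom Lambda-backed Config rule would wrap logic like this and report the result back to Config; the remediation step then acts on resources marked NON_COMPLIANT.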

CloudWatch vs CloudTrail vs Config

| Service | Answers | Example |
|---|---|---|
| CloudWatch | What is happening right now? | CPU is at 95% |
| CloudTrail | Who did what and when? | Alice deleted the S3 bucket at 2pm |
| Config | Is my infrastructure compliant? | 3 security groups have port 22 open |

AWS Health Dashboard

Two views:

  • Service health (public, formerly the Service Health Dashboard): AWS-wide service status
  • Your account health (per account, formerly the Personal Health Dashboard): issues affecting your resources

Health events

  • Scheduled maintenance (e.g., host retirement for your EC2 instance)
  • Active issues affecting your services
  • Notifications about upcoming deprecations

Automation with Health events + EventBridge

AWS Health event: EC2 instance host retirement
  → EventBridge rule
    → Lambda: automatically stop and start the instance
      → Instance migrates to new healthy host

Service Quotas

Service Quotas lets you view and manage limits for AWS services in one place.

# Check the current quota for running On-Demand Standard instances
# (quota code L-1216C47A is measured in vCPUs, not instance count)
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-1216C47A

# Request a quota increase
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --desired-value 200

Set CloudWatch alarms on quota utilisation — get notified at 80% usage before you hit the limit.
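
The 80% rule is simple arithmetic over two numbers: current usage and the quota value. A minimal sketch follows, with hypothetical function names and the threshold as a parameter; where the usage figure comes from (for example a CloudWatch usage metric) is outside the scope of the example.

```python
def quota_utilisation(current_usage: float, quota_value: float) -> float:
    """Return usage as a percentage of the quota."""
    if quota_value <= 0:
        raise ValueError("quota_value must be positive")
    return 100.0 * current_usage / quota_value


def should_alert(current_usage: float, quota_value: float,
                 threshold_percent: float = 80.0) -> bool:
    """True once utilisation reaches the alerting threshold."""
    return quota_utilisation(current_usage, quota_value) >= threshold_percent


if __name__ == "__main__":
    # 160 vCPUs in use against a quota of 200 is exactly at the 80% line.
    print(quota_utilisation(160, 200), should_alert(160, 200))
```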


CloudWatch Anomaly Detection

CloudWatch can model the expected baseline of a metric using ML and alert when it deviates:

aws cloudwatch put-anomaly-detector \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --stat Average

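Use this for metrics with seasonal patterns (for example, more traffic during business hours) where a static threshold would generate false alarms. The detector only trains the model; to be alerted, create an alarm that compares the metric against the ANOMALY_DETECTION_BAND expression.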
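
The idea of a band around an expected value can be illustrated with basic statistics. The sketch below is deliberately naive: it uses the mean and sample standard deviation of recent values and ignores the seasonality and trend that CloudWatch's model accounts for, so it shows the concept rather than the service's algorithm. The `width` parameter plays a similar role to the number of standard deviations passed to an anomaly detection band.

```python
from statistics import mean, stdev


def anomaly_band(history, width: float = 2.0):
    """Return (lower, upper) bounds: mean plus or minus `width` standard deviations."""
    centre = mean(history)
    spread = stdev(history)  # sample standard deviation; needs at least 2 values
    return centre - width * spread, centre + width * spread


def is_anomalous(value: float, history, width: float = 2.0) -> bool:
    """True if the value falls outside the band derived from history."""
    lower, upper = anomaly_band(history, width)
    return value < lower or value > upper


if __name__ == "__main__":
    cpu_history = [40, 42, 38, 41, 39]
    print(anomaly_band(cpu_history))
    print(is_anomalous(95, cpu_history))
```

A wider band tolerates more variation and produces fewer alerts; a narrower band reacts sooner but is noisier.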


Hands-on: Full Monitoring Stack

Goal: Set up end-to-end observability for an EC2 instance.

1. Enable detailed monitoring on the instance

2. Install CloudWatch Agent:
   - Collect memory, disk usage
   - Ship /var/log/messages to log group /ec2/syslog

3. Create alarms:
   - CPUUtilization > 80% for 5 min → SNS email
   - mem_used_percent > 85% for 5 min → SNS email
   - StatusCheckFailed_System ≥ 1 for 3 min → EC2 Recover action

4. Create metric filter on /ec2/syslog:
   - Pattern: "Out of memory" (a plain term match; the space-delimited form [timestamp, host, process, msg] does not fit multi-word syslog lines)
   - Metric: OOMKillerEvents (count)
   - Alarm: OOMKillerEvents ≥ 1 → SNS

5. Enable CloudTrail (multi-region):
   - Deliver to S3: s3://mycompany-cloudtrail-logs
   - Send to CloudWatch Logs: /cloudtrail/management-events

6. Create EventBridge rule:
   - Source: aws.signin (console sign-in events reach EventBridge from this source, not aws.cloudtrail)
   - Detail: eventName = "ConsoleLogin" AND additionalEventData.MFAUsed = "No"
   - Target: SNS — "Console login without MFA!"

Common SOA-C03 Exam Questions

Q: RAM utilisation is not in CloudWatch. How do you monitor it?
Install the CloudWatch Agent on the instance. Configure it to collect mem_used_percent. The agent requires the instance to have an IAM role with cloudwatch:PutMetricData permission.

Q: An alarm is in INSUFFICIENT_DATA state. What does this mean?
CloudWatch does not have enough data points to evaluate the alarm threshold. This happens when an instance first starts, or when detailed monitoring is disabled and the metric hasn't emitted yet.

Q: How do you get alerted when someone uses the root account?
Create a CloudTrail trail → deliver to CloudWatch Logs → create a metric filter for $.userIdentity.type = "Root" → alarm on the metric → SNS notification.

Q: What's the difference between CloudTrail and Config?
CloudTrail records API calls (who did what). Config continuously evaluates whether your resource configuration meets compliance rules (what state are resources in). Use CloudTrail to investigate incidents; use Config to enforce and report on compliance.


What to Learn Next

  1. AWS Systems Manager — use EventBridge + SSM Automation for auto-remediation
  2. AWS Security & Compliance — GuardDuty, Inspector, and Security Hub feed into CloudWatch/EventBridge
  3. AWS Account Management — Config Aggregators and centralized CloudTrail across accounts
