AWS Monitoring, Auditing & Performance — CloudWatch, CloudTrail, Config

Intermediate | Topic | 55 min | 11 min read | 26 Apr 2026 | AWS

Complete monitoring stack for CloudOps. CloudWatch metrics, alarms, logs, dashboards, EventBridge, CloudTrail auditing, and AWS Config compliance — all heavy SOA-C03 exam topics.

What you'll learn

Understand EC2 default metrics and what requires the CloudWatch Agent
Create metric alarms and automated remediation actions
Ship application and OS logs to CloudWatch Logs
Build event-driven automations with EventBridge
Audit API calls with CloudTrail
Enforce compliance rules with AWS Config

Prerequisites

aws/ec2-basics

Relevant for certifications

SOA-C03

#aws #cloudwatch #cloudtrail #config #eventbridge #alarms #logs #metrics #soa-c03

CloudWatch Overview

Amazon CloudWatch is AWS's native monitoring and observability service. It collects metrics, logs, and events from AWS services and your applications.

EC2 instances, RDS, Lambda, etc.
  → push metrics to CloudWatch
    → you create alarms, dashboards, log queries
      → trigger notifications or automated actions

EC2 CloudWatch Metrics

EC2 pushes these metrics to CloudWatch automatically (no agent needed):

Metric	Description
`CPUUtilization`	CPU usage %
`NetworkIn` / `NetworkOut`	Bytes transferred
`StatusCheckFailed_System`	AWS hardware/hypervisor issue
`StatusCheckFailed_Instance`	Guest OS issue
`StatusCheckFailed_AttachedEBS`	EBS volume reachability
`DiskReadOps` / `DiskWriteOps`	Instance store only (not EBS)

Warning

RAM utilization is NOT included in default EC2 metrics. This is a classic exam trap. You need the CloudWatch Agent to collect memory metrics.

Monitoring intervals

Mode	Interval	Cost
Basic monitoring	5 minutes	Free
Detailed monitoring	1 minute	Small charge

Enable detailed monitoring when you need faster Auto Scaling reactions or finer-grained alarms.

CloudWatch Agent

The Unified CloudWatch Agent extends monitoring beyond what AWS pushes by default:

Memory usage
Disk space and inode utilization
Custom application logs
Swap usage
Any OS-level metric

Installation and configuration

# Amazon Linux (the package is in the default repositories)
sudo yum install -y amazon-cloudwatch-agent

# Ubuntu (not in the default apt repositories; install the .deb published by AWS, x86-64 shown)
wget https://amazoncloudwatch-agent.s3.amazonaws.com/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb

# Run the wizard to generate the config (it is written to bin/config.json)
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

# Load the generated config and start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json

Agent config example (collect memory + nginx logs)

{
  "metrics": {
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["used_percent"],
        "resources": ["/"]
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "/ec2/nginx/access",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}

Store config in Parameter Store

Store the CloudWatch Agent config in SSM Parameter Store and use SSM State Manager to push it to all instances. The agent can load it directly with `-c ssm:<parameter-name>` instead of `-c file:...`. This way, every new instance auto-configures its monitoring.

Custom Metrics

Push your own application metrics to CloudWatch:

# CLI — push a single custom metric
aws cloudwatch put-metric-data \
  --namespace "MyApp/BusinessMetrics" \
  --metric-name "ActiveUsers" \
  --value 1337 \
  --unit Count \
  --dimensions Environment=Production

# High-resolution custom metric (1 second resolution)
aws cloudwatch put-metric-data \
  --namespace "MyApp/Performance" \
  --metric-name "RequestLatency" \
  --value 45 \
  --unit Milliseconds \
  --storage-resolution 1    # 1 = high resolution, 60 = standard

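Metric retention (applies to all metrics, custom or not):

High resolution (period under 60 s): available for 3 hours
1-minute data points: available for 15 days
5-minute data points: available for 63 days
1-hour data points: available for 15 months

Older data is kept only at the coarser aggregation, so a high-resolution metric is still queryable after 3 hours, just at 1-minute granularity.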
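The rollup behaviour can be pictured as a lookup from a data point's age to the finest period still retrievable. This is a minimal sketch, assuming CloudWatch's published retention tiers (3 hours, 15 days, 63 days, 15 months taken as 455 days); the function name `finest_period_seconds` is hypothetical and not part of any AWS SDK.

```python
from datetime import timedelta
from typing import Optional

# Retention tiers: (maximum age, finest period in seconds still available).
RETENTION_TIERS = [
    (timedelta(hours=3), 1),      # high-resolution data points
    (timedelta(days=15), 60),     # 1-minute data points
    (timedelta(days=63), 300),    # 5-minute data points
    (timedelta(days=455), 3600),  # 1-hour data points (15 months)
]


def finest_period_seconds(age: timedelta, high_resolution: bool = False) -> Optional[int]:
    """Return the smallest period (seconds) retrievable for a data point of
    this age, or None if the data point has expired."""
    for max_age, period in RETENTION_TIERS:
        if period < 60 and not high_resolution:
            continue  # sub-minute periods exist only for high-resolution metrics
        if age <= max_age:
            return period
    return None
```

For example, a 20-day-old data point can only be read back at 5-minute granularity, whichever resolution it was published with.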

CloudWatch Alarms

Alarms watch a metric and trigger actions when a threshold is crossed.

Alarm states

State	Meaning
`OK`	Metric is within threshold
`ALARM`	Metric has breached threshold
`INSUFFICIENT_DATA`	Not enough data to evaluate

Creating an alarm

# --statistic is required for a single-metric alarm; account IDs are 12 digits
aws cloudwatch put-metric-alarm \
  --alarm-name "HighCPU-i-0abc123def456" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
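How the period, evaluation periods, and threshold combine is easier to see in a small simulation. The sketch below is a deliberate simplification, not CloudWatch's implementation: it assumes every one of the last N data points must breach (no "M out of N" setting) and ignores missing-data treatment. The function name `evaluate_alarm` is hypothetical.

```python
from typing import Sequence


def evaluate_alarm(datapoints: Sequence[float], threshold: float,
                   evaluation_periods: int) -> str:
    """Simplified GreaterThanThreshold evaluation over the most recent periods.

    datapoints holds one aggregated value per period, oldest first.
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"
```

With the alarm above (300-second period, 2 evaluation periods, threshold 80), `evaluate_alarm([50, 85, 90], 80, 2)` returns `"ALARM"`, because the last two 5-minute averages both exceed 80, i.e. CPU stayed high for about 10 minutes.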

Alarm actions

Alarms can trigger:

SNS notifications — email, SMS, Lambda, SQS
EC2 actions — stop, terminate, reboot, recover an instance
Auto Scaling actions — scale in/out
SSM OpsCenter — create an OpsItem

EC2 recovery alarm

# Alarm that auto-recovers the instance on system status check failure
aws cloudwatch put-metric-alarm \
  --alarm-name "EC2-AutoRecover" \
  --metric-name StatusCheckFailed_System \
  --namespace AWS/EC2 \
  --statistic Maximum \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:automate:us-east-1:ec2:recover

Recover vs Reboot vs Stop+Start

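Recover moves the instance to a new host and preserves the instance ID, private IP addresses, Elastic IP, and metadata; data in memory is lost. Reboot restarts the OS on the same host and keeps its IP addresses and instance store data. Stop+Start also moves hosts, but the instance gets a new public IP (unless an EIP is attached) and loses any instance store data.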

CloudWatch Logs

Key concepts

Concept	Description
Log Group	Container for log streams (e.g., `/ec2/nginx/access`)
Log Stream	Sequence of events from one source (e.g., one instance)
Log Event	A single log entry with timestamp + message
Metric Filter	Pattern that extracts a metric from log data
Subscription Filter	Real-time stream of logs to Lambda, Kinesis, or OpenSearch

Retention

By default, log events in CloudWatch Logs never expire. Set a retention period on each log group to control costs:

aws logs put-retention-policy \
  --log-group-name "/ec2/nginx/access" \
  --retention-in-days 30

CloudWatch Logs Insights

Query your logs with a purpose-built query language in which commands are chained with pipes:

# Top 10 slowest API requests
fields @timestamp, @message
| filter @message like /api/
| parse @message "duration: * ms" as duration
| sort duration desc
| limit 10

# Count errors per minute
filter @message like /ERROR/
| stats count() as errors by bin(1m)
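To make the second query concrete, here is a rough local equivalent of `filter @message like /ERROR/ | stats count() by bin(1m)`. It is only an illustration of what the query computes, assuming events are available as `(timestamp, message)` tuples; the function name `errors_per_minute` is hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple


def errors_per_minute(events: Iterable[Tuple[datetime, str]]) -> Dict[datetime, int]:
    """Count messages containing ERROR, grouped by the minute they occurred in."""
    counts: Counter = Counter()
    for timestamp, message in events:
        if "ERROR" in message:
            # Truncating to the minute plays the role of bin(1m).
            counts[timestamp.replace(second=0, microsecond=0)] += 1
    return dict(counts)
```

Minutes with no matching events simply do not appear in the result, which mirrors how the query returns no row for an empty bin.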

Metric Filters (log → metric)

Extract a CloudWatch metric from log patterns:

# Create metric filter: count HTTP 500 errors
# A space-delimited pattern must name every field in the line;
# nginx's default combined format also ends with referer and user agent.
aws logs put-metric-filter \
  --log-group-name "/ec2/nginx/access" \
  --filter-name "HTTP500Errors" \
  --filter-pattern "[host, ident, user, timestamp, request, status=500, size, referer, agent]" \
  --metric-transformations \
    metricName=HTTP500Count,metricNamespace=NginxMetrics,metricValue=1
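A space-delimited filter pattern splits each log line into fields, treating `[...]` and `"..."` as single fields, and then tests the named field. The sketch below approximates that locally for nginx's default combined format so a pattern can be sanity-checked against sample lines before it is deployed. It is not the CloudWatch matcher, and the helper names are hypothetical.

```python
import re
from typing import Iterable, List

# One token per field: a [bracketed] group, a "quoted" group, or a run of non-space characters.
FIELD_RE = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')
STATUS_INDEX = 5  # position of the status field in nginx's combined format


def split_fields(line: str) -> List[str]:
    """Split an access-log line into fields the way a space-delimited pattern sees it."""
    return FIELD_RE.findall(line)


def is_http_500(line: str) -> bool:
    """True when the status field of the line is exactly 500."""
    fields = split_fields(line)
    return len(fields) > STATUS_INDEX and fields[STATUS_INDEX] == "500"


def http_500_count(lines: Iterable[str]) -> int:
    """Each matching line contributes metricValue=1, so the metric is a count."""
    return sum(1 for line in lines if is_http_500(line))
```

Running `split_fields` on a real line from the log group is a quick way to confirm how many fields the pattern has to name.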

Live Tail

Real-time streaming of log events in the console — useful for debugging live issues without having to query.

Amazon EventBridge

EventBridge is the event bus for AWS — it routes events from AWS services, custom applications, and SaaS partners to targets.

Key concepts

Concept	Description
Event Bus	Channel that receives events (default bus or custom)
Rule	Pattern matching on events → trigger target
Target	What executes when a rule matches (Lambda, SSM, SNS, SQS, Step Functions, etc.)
Schedule	Cron or rate expression to trigger targets on a timer

Event-driven automation patterns

EC2 instance state change (running → stopped)
  → EventBridge rule
    → SNS: "Instance i-0abc123 stopped!"
    → Lambda: update inventory database

Config rule non-compliance
  → EventBridge rule
    → SSM Automation: auto-remediate the resource

CloudTrail: root account login
  → EventBridge rule
    → SNS: "ALERT: Root account used!"

Schedule example

# Run SSM automation every Sunday at 2am UTC
Schedule: cron(0 2 ? * SUN *)
Target: SSM Automation (patch all prod instances)
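EventBridge cron expressions have six fields (minutes, hours, day of month, month, day of week, year), and one of the two day fields must be `?`. A minimal sketch that splits an expression into named fields and checks that rule; it does not validate the individual values, and the function name `parse_eventbridge_cron` is hypothetical.

```python
import re
from typing import Dict

FIELDS = ("minutes", "hours", "day_of_month", "month", "day_of_week", "year")


def parse_eventbridge_cron(expression: str) -> Dict[str, str]:
    """Split cron(...) into its six named fields, with basic structural checks."""
    match = re.fullmatch(r"cron\((.+)\)", expression.strip())
    if match is None:
        raise ValueError("expected the form cron(<six fields>)")
    values = match.group(1).split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected 6 fields, got {len(values)}")
    fields = dict(zip(FIELDS, values))
    # Day of month and day of week cannot both be specified.
    if "?" not in (fields["day_of_month"], fields["day_of_week"]):
        raise ValueError("day_of_month or day_of_week must be '?'")
    return fields
```

For the schedule above, the parsed fields read as minute 0, hour 2, any month, day of week `SUN`, any year, with day of month left unspecified.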

EventBridge Input Transformation

Transform the event payload before it reaches the target:

{
  "InputPathsMap": {
    "instance": "$.detail.instance-id",
    "state": "$.detail.state"
  },
  "InputTemplate": "\"Instance <instance> changed to <state>\""
}
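The transformation has two steps: each entry in the paths map pulls a value out of the event, and the template then substitutes those values for the `<name>` placeholders. The sketch below mimics that locally, assuming only simple dotted paths such as `$.detail.state` (no array indexing). The helper names are hypothetical and this is not EventBridge's implementation.

```python
import re
from typing import Any, Dict


def resolve_path(event: Dict[str, Any], path: str) -> Any:
    """Follow a simple dotted JSON path like $.detail.state through nested dicts."""
    if not path.startswith("$."):
        raise ValueError("only paths of the form $.a.b are supported in this sketch")
    value: Any = event
    for key in path[2:].split("."):
        value = value[key]
    return value


def transform(event: Dict[str, Any], input_paths_map: Dict[str, str],
              input_template: str) -> str:
    """Extract the mapped values, then fill the <name> placeholders in the template."""
    variables = {name: resolve_path(event, path)
                 for name, path in input_paths_map.items()}
    return re.sub(r"<([A-Za-z0-9_-]+)>",
                  lambda m: str(variables[m.group(1)]),
                  input_template)
```

A placeholder with no entry in the paths map raises `KeyError` here, which is a convenient way to catch typos in a template before deploying it.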

Cross-account targets

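EventBridge can deliver events across accounts, which is useful for centralised operations. A common pattern is a rule in the source account whose target is an event bus in the operations account; that bus's resource policy must allow the source account to put events, and a rule on it invokes the final target:

Dev account event → rule on dev bus → Ops account event bus → rule → Ops account Lambda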

AWS CloudTrail

CloudTrail records API activity in your AWS account: who did what, when, and from where. Management events are recorded by default; data events and Insights events must be enabled explicitly.

Event types

Type	Description
Management events	Control plane operations (create, delete, modify resources)
Data events	Object-level operations (S3 GetObject, Lambda InvokeFunction)
Insights events	Unusual API activity patterns (anomaly detection)

Key properties of a CloudTrail event

{
  "eventTime": "2026-04-26T14:30:00Z",
  "userIdentity": { "type": "IAMUser", "userName": "utsav" },
  "eventName": "TerminateInstances",
  "sourceIPAddress": "203.0.113.0",
  "requestParameters": { "instancesSet": { "items": [{"instanceId": "i-0abc123"}] } }
}
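During an investigation these fields are usually reduced to a one-line "who did what, when, from where" summary. A minimal sketch, assuming a record shaped like the example above; the function name `summarize` is hypothetical, and the fallback to `arn` or `type` is there because not every identity (the root user, for instance) carries a `userName`.

```python
from typing import Any, Dict


def summarize(record: Dict[str, Any]) -> str:
    """Build a one-line summary from a CloudTrail event record."""
    identity = record.get("userIdentity", {})
    who = identity.get("userName") or identity.get("arn") or identity.get("type", "unknown")
    source_ip = record.get("sourceIPAddress", "unknown")
    return f'{record["eventTime"]} {who} called {record["eventName"]} from {source_ip}'
```

Applied to the event above, this yields `2026-04-26T14:30:00Z utsav called TerminateInstances from 203.0.113.0`.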

Exam-critical facts

CloudTrail typically delivers events to S3 within about 5 minutes of the API call (an average, not a guarantee; older material cites 15 minutes)
Logs are encrypted with SSE-S3 by default; use SSE-KMS for additional control
Multi-region trail — single trail that captures events in all regions (recommended)
Organization trail — captures events from all accounts in an AWS Organization
Trails can be sent to CloudWatch Logs for real-time alerting

CloudTrail + EventBridge pattern

CloudTrail (management events)
  → EventBridge (filter: "DeleteBucket" by non-approved principal)
    → SNS alert to security team
    → SSM Automation to investigate

CloudTrail for CloudOps

Detect unauthorised access: search for ConsoleLogin events from unknown IPs
Investigate incidents: trace exact API calls leading up to an issue
Compliance: prove who changed what for auditors

AWS Config

AWS Config continuously evaluates your resource configurations against compliance rules.

Key concepts

Concept	Description
Configuration item	Point-in-time snapshot of a resource's config
Configuration history	Timeline of all changes to a resource
Config rule	Evaluation logic for compliance (AWS managed or custom Lambda)
Conformance pack	Collection of Config rules deployed together
Remediation	SSM Automation that fixes non-compliant resources
Aggregator	Collects Config data from multiple accounts/regions

Common AWS managed rules

Rule	Checks
`restricted-ssh`	Security groups should not allow unrestricted SSH (port 22)
`s3-bucket-public-read-prohibited`	S3 buckets should not be publicly readable
`encrypted-volumes`	EBS volumes should be encrypted
`iam-root-access-key-check`	Root account should not have active access keys
`mfa-enabled-for-iam-console-access`	IAM users with a console password must have MFA enabled
`rds-instance-public-access-check`	RDS instances should not be publicly accessible

Config + Auto-remediation

Config rule: "restricted-ssh" detects open port 22
  → Config triggers SSM Automation: AWS-DisablePublicAccessForSecurityGroup
    → Removes the 0.0.0.0/0 SSH rule automatically
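The detection half of this flow boils down to one question: does any inbound rule cover port 22 for the whole internet? The sketch below answers it for rules shaped like the `IpPermissions` list returned by `describe-security-groups`. It is an illustration of the check, not the managed rule's actual implementation, and the function name `allows_open_ssh` is hypothetical.

```python
from typing import Any, Dict, List

OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def allows_open_ssh(ip_permissions: List[Dict[str, Any]]) -> bool:
    """True if any inbound rule exposes TCP port 22 to every address."""
    for permission in ip_permissions:
        protocol = permission.get("IpProtocol")
        if protocol == "-1":  # all protocols, all ports
            covers_ssh = True
        elif protocol == "tcp":
            covers_ssh = permission.get("FromPort", 0) <= 22 <= permission.get("ToPort", 65535)
        else:
            covers_ssh = False
        if not covers_ssh:
            continue
        cidrs = [r.get("CidrIp") for r in permission.get("IpRanges", [])]
        cidrs += [r.get("CidrIpv6") for r in permission.get("Ipv6Ranges", [])]
        if any(cidr in OPEN_CIDRS for cidr in cidrs):
            return True
    return False
```

Note that a wide port range such as 0-1024 also counts, since it includes port 22.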

CloudWatch vs CloudTrail vs Config

Service	Answers	Example
CloudWatch	What is happening right now?	CPU is at 95%
CloudTrail	Who did what and when?	Alice deleted the S3 bucket at 2pm
Config	Is my infrastructure compliant?	3 security groups have port 22 open

AWS Health Dashboard

The AWS Health Dashboard has two views:

Service health (public, formerly the Service Health Dashboard): AWS-wide service status
Your account health (per account, formerly the Personal Health Dashboard): issues affecting your resources

Health events

Scheduled maintenance (e.g., host retirement for your EC2 instance)
Active issues affecting your services
Notifications about upcoming deprecations

Automation with Health events + EventBridge

AWS Health event: EC2 instance host retirement
  → EventBridge rule
    → Lambda: automatically stop and start the instance
      → Instance migrates to new healthy host

Service Quotas

Service Quotas lets you view and manage limits for AWS services in one place.

# Check the current quota for running On-Demand Standard instances (measured in vCPUs, not instance count)
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-1216C47A

# Request a quota increase
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --desired-value 200

Set CloudWatch alarms on quota utilisation — get notified at 80% usage before you hit the limit.
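The 80% rule is simple arithmetic on current usage and the applied quota value. A minimal sketch of that calculation; the function names are hypothetical and the numbers in the usage note are illustrative only.

```python
def quota_utilization_percent(usage: float, quota: float) -> float:
    """Current usage as a percentage of the applied quota."""
    if quota <= 0:
        raise ValueError("quota must be positive")
    return 100.0 * usage / quota


def should_alert(usage: float, quota: float, threshold_percent: float = 80.0) -> bool:
    """True once utilisation reaches the alert threshold."""
    return quota_utilization_percent(usage, quota) >= threshold_percent
```

With a quota of 200 vCPUs, for instance, `should_alert(160, 200)` is true while `should_alert(100, 200)` is not.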

CloudWatch Anomaly Detection

CloudWatch can model the expected baseline of a metric using ML and alert when it deviates:

aws cloudwatch put-anomaly-detector \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --stat Average

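Use this for metrics with seasonal patterns (for example, more traffic during business hours) where a static threshold would generate false alarms. The command above only creates the model; alerting requires an alarm whose threshold is the anomaly detection band.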
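The idea of a band around an expected value can be illustrated with a much cruder model than the one CloudWatch trains. The sketch below assumes a flat baseline (mean of recent history) and a band of `width` standard deviations on either side; CloudWatch's model additionally learns hourly, daily, and weekly patterns, so treat this purely as an illustration. The function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def band(history: Sequence[float], width: float = 2.0) -> Tuple[float, float]:
    """Lower and upper bounds: mean of the history plus or minus width standard deviations."""
    center = mean(history)
    spread = pstdev(history)
    return center - width * spread, center + width * spread


def is_anomalous(value: float, history: Sequence[float], width: float = 2.0) -> bool:
    """True when the value falls outside the band derived from the history."""
    lower, upper = band(history, width)
    return value < lower or value > upper
```

A larger `width` gives a wider band and fewer alerts, which is the same trade-off the band width setting controls on a real anomaly detection alarm.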

Hands-on: Full Monitoring Stack

Goal: Set up end-to-end observability for an EC2 instance.

1. Enable detailed monitoring on the instance

2. Install CloudWatch Agent:
   - Collect memory, disk usage
   - Ship /var/log/messages to log group /ec2/syslog

3. Create alarms:
   - CPUUtilization > 80% for 5 min → SNS email
   - mem_used_percent > 85% for 5 min → SNS email
   - StatusCheckFailed_System ≥ 1 for 3 min → EC2 Recover action

4. Create metric filter on /ec2/syslog:
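   - Pattern: "Out of memory" (the phrase the kernel logs when the OOM killer runs; a four-field space-delimited pattern would not match a syslog line, whose timestamp alone spans three fields)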
   - Metric: OOMKillerEvents (count)
   - Alarm: OOMKillerEvents ≥ 1 → SNS

5. Enable CloudTrail (multi-region):
   - Deliver to S3: s3://mycompany-cloudtrail-logs
   - Send to CloudWatch Logs: /cloudtrail/management-events

6. Create EventBridge rule:
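   - Source: aws.signin (detail-type "AWS Console Sign In via CloudTrail"; console sign-in events are not published under aws.cloudtrail)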
   - Detail: eventName = "ConsoleLogin" AND "additionalEventData.MFAUsed" = "No"
   - Target: SNS — "Console login without MFA!"

Q: RAM utilisation is not a default EC2 metric. How do you monitor it?
A: Install the CloudWatch Agent on the instance and configure it to collect mem_used_percent. The instance needs an IAM role with the cloudwatch:PutMetricData permission (the AWS managed policy CloudWatchAgentServerPolicy includes it).

Q: An alarm is in INSUFFICIENT_DATA state. What does this mean?
A: CloudWatch does not have enough data points to evaluate the alarm. This happens when the alarm or instance has just been created, when the metric is not being reported (for example, the instance is stopped), or when the alarm period is shorter than the metric's reporting interval (for example, a 60-second period with basic 5-minute monitoring).

Q: How do you get alerted when someone uses the root account?
A: Create a CloudTrail trail → deliver to CloudWatch Logs → create a metric filter with the pattern { $.userIdentity.type = "Root" } → alarm on the metric → SNS notification.

Q: What's the difference between CloudTrail and Config?
A: CloudTrail records API calls (who did what). Config continuously evaluates whether your resource configuration meets compliance rules (what state resources are in). Use CloudTrail to investigate incidents; use Config to enforce and report on compliance.

What to Learn Next

AWS Systems Manager — use EventBridge + SSM Automation for auto-remediation
AWS Security & Compliance — GuardDuty, Inspector, and Security Hub feed into CloudWatch/EventBridge
AWS Account Management — Config Aggregators and centralized CloudTrail across accounts
