AWS Observability Stack: Metrics, Logs & Tracing

Observability is central to keeping distributed systems highly available. The native **AWS Observability Stack** splits telemetry across four services:

Observability Channels:
  • Amazon CloudWatch Metrics: Tracks near-real-time metrics such as CPU utilization, disk throughput, and HTTP request counts, and raises alarms when thresholds are breached.
  • Amazon CloudWatch Logs: Collects, centralizes, and parses application and system logs, and supports ad hoc queries through Logs Insights.
  • AWS CloudTrail: Records API activity in the account (management events by default; data events must be enabled explicitly), answering: Who did what, from where, and when?
  • AWS X-Ray: Distributed tracing engine that maps request lifecycles across microservices, identifying network latencies and slow database calls.

CloudWatch Alarms & Metric Filters

Alarms turn passive dashboards into active operational awareness. Rather than watching graphs manually, you define **CloudWatch Alarms** that notify teams via SNS or trigger automated scaling actions (e.g. adding instances to an Auto Scaling group, or ASG).
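
As a rough sketch of what such an alarm definition can look like, the Python snippet below builds the keyword arguments for boto3's `put_metric_alarm` call. The alarm naming scheme, the ASG name, and the SNS topic ARN are hypothetical placeholders, and the 80% / 2-period values simply mirror the pipeline example in this lesson rather than a recommended setting. The dict is built separately from the API call so it can be inspected without AWS credentials.

```python
def build_cpu_alarm_params(asg_name, sns_topic_arn, threshold=80.0, periods=2):
    """Return keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"{asg_name}-high-cpu",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        # Aggregate CPU across every instance in the group.
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Statistic": "Average",
        "Period": 60,  # length of one evaluation period, in seconds
        "EvaluationPeriods": periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notified when the alarm enters ALARM
    }


params = build_cpu_alarm_params(
    asg_name="web-asg",  # placeholder
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:ops-alerts",  # placeholder
)

# With boto3 installed and credentials configured, the alarm could be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

To drive scaling as well as notification, the ARN of a scaling policy would be added to `AlarmActions` alongside the SNS topic.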

You can also configure **Metric Filters** that match patterns in the log events of a CloudWatch Logs log group, turning events like [ERROR] Database connection failed into numeric metrics. Alarms on those metrics then alert on application-level exceptions.
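
To make the idea concrete, here is a small local simulation in Python of what a metric filter does conceptually: it scans log events for a term and emits a count per one-minute bucket. This is not the CloudWatch API; the function name, the `(timestamp_ms, message)` event shape, and the sample events are assumptions made for illustration.

```python
from collections import Counter


def count_matches_per_minute(events, term="[ERROR]"):
    """Count events whose message contains `term`, keyed by minute index.

    `events` is an iterable of (timestamp_ms, message) pairs.
    """
    counts = Counter()
    for timestamp_ms, message in events:
        if term in message:
            counts[timestamp_ms // 60_000] += 1  # 60,000 ms per minute
    return dict(counts)


# Hypothetical log events; timestamps are milliseconds from an arbitrary start.
events = [
    (0, "[ERROR] Database connection failed"),
    (15_000, "[INFO] Request served"),
    (30_000, "[ERROR] Database connection failed"),
    (65_000, "[ERROR] Database connection failed"),
]

error_counts = count_matches_per_minute(events)
print(error_counts)
```

Each per-minute count plays the role of one metric datapoint, which is what an alarm would then compare against a threshold.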

Interactive Pipeline: CloudWatch Alarm & Self-Healing Loop

The pipeline below shows how a CloudWatch alarm automates notification and scaling. When the private EC2 instances sustain heavy CPU load, the alarm enters the ALARM state, notifies the team, and invokes a scaling policy on the ASG.

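Pipeline: CloudWatch Alarm Loop
  • Monitor (CPU Utilization): EC2 publishes metrics every 1 minute. This requires detailed monitoring; basic monitoring reports every 5 minutes.
  • Threshold (Alarm Trigger): CPU > 80% for 2 periods.
  • Action (SNS Alert & ASG): Email the team and invoke the scaling policy.
  • Healed (Scale Completed): The new instance takes its share of the load.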
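
The threshold stage can be sketched as a simplified Python simulation. It assumes the alarm fires only when the most recent 2 datapoints all exceed 80%; real CloudWatch alarms also support "M out of N" evaluation and configurable handling of missing data, which this sketch ignores. The function names and the action strings are hypothetical.

```python
def alarm_state(datapoints, threshold=80.0, periods=2):
    """Return the alarm state for a series of per-period CPU averages."""
    if len(datapoints) < periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-periods:]
    # Every one of the last `periods` datapoints must breach the threshold.
    return "ALARM" if all(value > threshold for value in recent) else "OK"


def actions_for(state):
    """Map an alarm state to the (hypothetical) actions of the pipeline."""
    if state == "ALARM":
        return ["publish SNS notification", "invoke ASG scaling policy"]
    return []


cpu_history = [45.0, 72.0, 85.0, 91.0]  # made-up per-minute CPU averages
state = alarm_state(cpu_history)
print(state, actions_for(state))
```

A single spike such as `[45.0, 85.0, 70.0]` leaves the state at OK, which is the point of requiring more than one breaching period.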

CloudWatch Logs Insights Query Syntax

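The following CloudWatch Logs Insights query finds the 20 request paths with the highest 95th-percentile latency. Because the filter runs before the aggregation, the request count, average, and 95th percentile are computed only over requests whose duration exceeds 1000 (milliseconds, if that is the unit the field is logged in):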

fields @timestamp, @message, request_path, duration
| filter duration > 1000
| stats count(*) as request_count, avg(duration) as avg_duration, pct(duration, 95) as p95_duration by request_path
| sort p95_duration desc
| limit 20
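
To show what the query computes, the Python sketch below applies the same filter, group, sort, and limit steps to a handful of made-up records. The percentile uses the nearest-rank method, which is an assumption; CloudWatch's `pct` may compute percentiles differently, so values on real data can differ slightly. The function names and sample records are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slowest_paths(records, min_duration=1000, limit=20):
    """Mirror the query: filter, aggregate by path, sort by p95, limit."""
    by_path = defaultdict(list)
    for record in records:
        if record["duration"] > min_duration:  # | filter duration > 1000
            by_path[record["request_path"]].append(record["duration"])
    rows = [
        {
            "request_path": path,
            "request_count": len(durations),
            "avg_duration": mean(durations),
            "p95_duration": nearest_rank_percentile(durations, 95),
        }
        for path, durations in by_path.items()
    ]
    rows.sort(key=lambda row: row["p95_duration"], reverse=True)  # | sort ... desc
    return rows[:limit]  # | limit 20


records = [  # made-up sample data
    {"request_path": "/checkout", "duration": 1200},
    {"request_path": "/checkout", "duration": 3000},
    {"request_path": "/checkout", "duration": 400},
    {"request_path": "/search", "duration": 1500},
    {"request_path": "/health", "duration": 20},
]

rows = slowest_paths(records)
for row in rows:
    print(row)
```

Note that the 400 ms `/checkout` request is dropped before aggregation, so the reported average for that path reflects only its slow requests.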