Advanced Monitoring
Implement comprehensive monitoring and observability to maintain system reliability and performance and to detect incidents quickly.
Monitoring Strategy
Effective monitoring provides visibility into system health, performance, and user experience through metrics, logs, traces, and events, surfaced to operators through alerting. The short sketch after the list below shows how a single request can emit each of these signals.
Four Pillars of Observability
- Metrics: Numerical measurements over time
- Logs: Detailed event records
- Traces: Request flow through systems
- Events: Significant state changes
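To make the pillars concrete, here is a minimal, illustrative Python sketch of one request producing a metric, a structured log line, a trace correlation ID, and an event. The metric names, the `handle_checkout` handler, the `publish_event` helper, and the use of a bare UUID in place of a real tracing SDK (such as OpenTelemetry) are assumptions for illustration only.

```python
import json
import logging
import time
import uuid

from prometheus_client import Counter, Histogram

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

REQUESTS = Counter("checkout_requests_total", "Checkout requests", ["status"])  # metric
LATENCY = Histogram("checkout_latency_seconds", "Checkout latency")             # metric


def publish_event(event: dict) -> None:
    # Hypothetical stand-in for an event bus or webhook publisher
    log.info(json.dumps({"message": "event emitted", **event}))


def handle_checkout(order_id: str) -> None:
    trace_id = uuid.uuid4().hex  # stand-in for a real trace ID from a tracing SDK
    start = time.time()
    try:
        # ... business logic would run here ...
        REQUESTS.labels(status="ok").inc()
        # Log: a detailed, structured record of what happened
        log.info(json.dumps({"message": "checkout completed",
                             "order_id": order_id, "trace_id": trace_id}))
        # Event: a significant state change other systems may care about
        publish_event({"type": "order_placed", "order_id": order_id})
    finally:
        LATENCY.observe(time.time() - start)  # metric: request duration


if __name__ == "__main__":
    handle_checkout("ord-123")
```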
Metrics Collection
Prometheus Setup
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'production'
region: 'us-east-1'
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Load rules
rule_files:
- "alerts/*.yml"
- "recording_rules/*.yml"
# Scrape configurations
scrape_configs:
# Node exporter
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: '([^:]+)(:[0-9]+)?'
replacement: '${1}'
# Kubernetes cluster monitoring
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# Service discovery for pods
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
Key Metrics to Monitor
# Recording rules for key metrics
groups:
- name: key_metrics
interval: 30s
rules:
# Request rate
- record: app:request_rate
expr: sum(rate(http_requests_total[5m])) by (service, method, status)
# Error rate
- record: app:error_rate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)
# P95 latency
- record: app:latency_p95
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))
# CPU usage
- record: instance:cpu_utilization
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
- record: instance:memory_utilization
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
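Recorded series are queried like any other metric, which is what keeps dashboards and alerts cheap. Below is a hedged sketch of reading them back through the Prometheus HTTP query API; the in-cluster URL is an assumption to adapt to your deployment.

```python
import requests

PROMETHEUS = "http://prometheus:9090"  # assumption: in-cluster service name


def instant_query(expr: str):
    # Prometheus instant-query endpoint: /api/v1/query?query=<expr>
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": expr}, timeout=5)
    resp.raise_for_status()
    return resp.json()["data"]["result"]


# The recording rule names match the rules defined above.
for series in instant_query("app:error_rate"):
    service = series["metric"].get("service", "unknown")
    value = float(series["value"][1])
    print(f"{service}: {value:.2%} errors")
```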
Logging Architecture
ELK Stack Configuration
# Elasticsearch cluster
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
name: production-es
spec:
version: 8.11.0
nodeSets:
- name: master
count: 3
config:
node.roles: ["master"]
podTemplate:
spec:
containers:
- name: elasticsearch
resources:
requests:
memory: 2Gi
cpu: 1
limits:
memory: 2Gi
- name: data
count: 5
config:
node.roles: ["data", "ingest"]
volumeClaimTemplates:
- metadata:
name: elasticsearch-data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
storageClassName: fast-ssd
---
# Logstash pipeline
apiVersion: v1
kind: ConfigMap
metadata:
name: logstash-pipeline
data:
logstash.conf: |
input {
beats {
port => 5044
}
kafka {
bootstrap_servers => "kafka:9092"
topics => ["application-logs"]
codec => json
}
}
filter {
# Parse JSON logs
if [message] =~ /^\{.*\}$/ {
json {
source => "message"
}
}
# Extract fields from log patterns
grok {
match => {
"message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"
}
}
# Add GeoIP information
if [client_ip] {
geoip {
source => "client_ip"
target => "geoip"
}
}
# Calculate response time
if [response_time] {
ruby {
code => "event.set('response_time_ms', event.get('response_time').to_f * 1000)"
}
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "logs-%{[@metadata][beat]}-%{+YYYY.MM.dd}"
}
# Send critical errors to alerting (note: Alertmanager expects its own alert
# JSON schema, so in practice the event is reshaped or routed through a
# webhook adapter rather than posted raw)
if [level] == "ERROR" or [level] == "FATAL" {
http {
url => "http://alertmanager:9093/api/v1/alerts"
http_method => "post"
format => "json"
}
}
}
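For completeness, here is a hedged sketch of the producer side: an application shipping structured JSON log events to the `application-logs` Kafka topic that the pipeline above consumes. It uses the kafka-python client; the broker address and the extra fields (`client_ip`, `response_time`) are assumptions chosen to line up with the filters above.

```python
import json
from datetime import datetime, timezone

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",           # same broker as the kafka input above
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)


def ship_log(level: str, message: str, **fields) -> None:
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **fields,
    }
    producer.send("application-logs", event)


# client_ip feeds the geoip filter; response_time feeds the ms conversion.
ship_log("INFO", "payment processed", client_ip="203.0.113.7", response_time=0.042)
producer.flush()
```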
Structured Logging
// Application logging configuration (Go, using logrus and OpenTracing)
import (
    "net/http"
    "time"

    "github.com/opentracing/opentracing-go"
    "github.com/sirupsen/logrus"
)

type Logger struct {
    *logrus.Logger
}

func NewLogger() *Logger {
    log := logrus.New()
    log.SetFormatter(&logrus.JSONFormatter{
        TimestampFormat: "2006-01-02T15:04:05.999Z07:00",
        FieldMap: logrus.FieldMap{
            logrus.FieldKeyTime:  "@timestamp",
            logrus.FieldKeyLevel: "level",
            logrus.FieldKeyMsg:   "message",
        },
    })
    return &Logger{log}
}

func (l *Logger) LogRequest(r *http.Request, statusCode int, duration time.Duration) {
    span := opentracing.SpanFromContext(r.Context())
    fields := logrus.Fields{
        "method":      r.Method,
        "path":        r.URL.Path,
        "status":      statusCode,
        "duration_ms": duration.Milliseconds(),
        "user_agent":  r.UserAgent(),
        "remote_addr": r.RemoteAddr,
        "request_id":  r.Header.Get("X-Request-ID"),
    }
    if span != nil {
        // Logs the span context; extract a concrete trace ID with a
        // tracer-specific type assertion (e.g. jaeger.SpanContext) if needed.
        fields["trace_id"] = span.Context()
    }
    if statusCode >= 500 {
        l.WithFields(fields).Error("Request failed")
    } else if statusCode >= 400 {
        l.WithFields(fields).Warn("Client error")
    } else {
        l.WithFields(fields).Info("Request completed")
    }
}
Distributed Tracing
Jaeger Configuration
# Jaeger all-in-one deployment
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: production-tracing
spec:
strategy: production
storage:
type: elasticsearch
options:
es:
server-urls: https://elasticsearch:9200
index-prefix: jaeger
tls:
ca: /es/certificates/ca.crt
collector:
replicas: 3
resources:
limits:
cpu: 1
memory: 1Gi
maxReplicas: 5
autoscale: true
query:
replicas: 2
resources:
limits:
cpu: 500m
memory: 512Mi
agent:
strategy: DaemonSet
daemonset:
hostNetwork: true
---
// Application instrumentation (Go)
import (
    "io"
    "log"

    "github.com/opentracing/opentracing-go"
    "github.com/uber/jaeger-client-go"
    jaegercfg "github.com/uber/jaeger-client-go/config"
)

func initTracing(serviceName string) (opentracing.Tracer, io.Closer) {
    cfg := jaegercfg.Configuration{
        ServiceName: serviceName,
        Sampler: &jaegercfg.SamplerConfig{
            Type:  jaeger.SamplerTypeConst,
            Param: 1,
        },
        Reporter: &jaegercfg.ReporterConfig{
            LogSpans:           true,
            LocalAgentHostPort: "jaeger-agent:6831",
        },
    }
    tracer, closer, err := cfg.NewTracer(
        jaegercfg.Logger(jaeger.StdLogger),
    )
    if err != nil {
        log.Fatalf("failed to initialize Jaeger tracer: %v", err)
    }
    opentracing.SetGlobalTracer(tracer)
    return tracer, closer
}
Alerting Configuration
Alert Rules
# alerts/application.yml
groups:
- name: application_alerts
rules:
# High error rate
- alert: HighErrorRate
expr: app:error_rate > 0.05
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "High error rate on {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
# Slow response time
- alert: SlowResponseTime
expr: app:latency_p95 > 1
for: 10m
labels:
severity: warning
team: backend
annotations:
summary: "Slow response time on {{ $labels.service }}"
description: "P95 latency is {{ $value }}s for {{ $labels.service }}"
# Pod crash looping
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
description: "Pod has restarted {{ $value }} times in the last 15 minutes"
- name: infrastructure_alerts
rules:
# High CPU usage
- alert: HighCPUUsage
expr: instance:cpu_utilization > 90
for: 15m
labels:
severity: warning
team: infrastructure
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
# Disk space low
- alert: DiskSpaceLow
expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes) < 0.1
for: 5m
labels:
severity: critical
team: infrastructure
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Only {{ $value | humanizePercentage }} disk space left on {{ $labels.mountpoint }}"
Alertmanager Configuration
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'YOUR_SLACK_WEBHOOK'
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
# Critical alerts go to PagerDuty
- match:
severity: critical
receiver: pagerduty
continue: true
# Team-specific routing
- match:
team: backend
receiver: backend-team
- match:
team: infrastructure
receiver: infra-team
receivers:
- name: 'default'
slack_configs:
- channel: '#alerts'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_KEY'
description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
- name: 'backend-team'
slack_configs:
- channel: '#backend-alerts'
send_resolved: true
- name: 'infra-team'
email_configs:
- to: '[email protected]'
headers:
Subject: 'Infrastructure Alert: {{ .GroupLabels.alertname }}'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'cluster', 'service']
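Routing trees are easy to get subtly wrong, so it helps to exercise them before an incident does. Below is a sketch that pushes a synthetic alert straight to Alertmanager's v2 API and lets you confirm it lands on the expected receiver; the URL matches the target in prometheus.yml, while the labels and service name are illustrative.

```python
from datetime import datetime, timedelta, timezone

import requests

ALERTMANAGER = "http://alertmanager:9093"


def fire_test_alert(service: str, severity: str = "warning") -> None:
    now = datetime.now(timezone.utc)
    alert = [{
        "labels": {
            "alertname": "SyntheticTestAlert",
            "service": service,
            "severity": severity,
            "team": "backend",
        },
        "annotations": {"summary": f"Routing test for {service}"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=5)).isoformat(),
    }]
    # POST /api/v2/alerts accepts a JSON array of alerts
    resp = requests.post(f"{ALERTMANAGER}/api/v2/alerts", json=alert, timeout=5)
    resp.raise_for_status()


fire_test_alert("payments", severity="critical")  # should follow the PagerDuty route
```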
Visualization with Grafana
Dashboard Configuration
# Service overview dashboard
{
"dashboard": {
"title": "Service Overview",
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (service)",
"legendFormat": "{{ service }}"
}
],
"type": "graph"
},
{
"title": "Error Rate",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
"legendFormat": "{{ service }}"
}
],
"type": "graph",
"yaxes": [
{
"format": "percentunit"
}
]
},
{
"title": "Response Time (P50, P95, P99)",
"targets": [
{
"expr": "histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))",
"legendFormat": "p50 {{ service }}"
},
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))",
"legendFormat": "p95 {{ service }}"
},
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))",
"legendFormat": "p99 {{ service }}"
}
],
"type": "graph"
},
{
"title": "Service Health",
"targets": [
{
"expr": "up{job=~\".*service.*\"}",
"format": "table",
"instant": true
}
],
"type": "table"
}
]
}
}
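Dashboards defined as JSON are easiest to keep in version control and push through the Grafana HTTP API rather than editing in the UI. The sketch below assumes a service-account token and a file containing the `{"dashboard": {...}}` document above; file-based provisioning or Terraform are equally valid alternatives.

```python
import json

import requests

GRAFANA_URL = "https://grafana.example.com"   # assumption: your Grafana endpoint
API_TOKEN = "YOUR_GRAFANA_API_TOKEN"          # placeholder, keep it in a secret store


def push_dashboard(path: str) -> None:
    with open(path) as fh:
        payload = json.load(fh)        # the {"dashboard": {...}} document above
    payload["overwrite"] = True        # replace an existing dashboard with the same UID/title
    resp = requests.post(
        f"{GRAFANA_URL}/api/dashboards/db",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()


push_dashboard("dashboards/service-overview.json")
```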
SLO Dashboard
# SLO tracking dashboard
- title: "SLO: 99.9% Availability"
panels:
- title: "Current Availability"
expr: |
(1 - (
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
)) * 100
thresholds:
- value: 99.9
color: green
- value: 99.0
color: yellow
- value: 95.0
color: red
- title: "Error Budget Remaining"
expr: |
(
(0.001 * 30 * 24 * 60) -
sum(increase(http_requests_total{status=~"5.."}[30d])) / sum(increase(http_requests_total[30d])) * 30 * 24 * 60
) / (0.001 * 30 * 24 * 60) * 100
- title: "Burn Rate"
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))
) / 0.001
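The panels above express the error budget in minutes; the sketch below walks through the same idea in the simpler request-ratio form, with made-up counts, so the arithmetic behind "budget remaining" and "burn rate" is explicit.

```python
SLO_TARGET = 0.999              # 99.9% availability
ERROR_BUDGET = 1 - SLO_TARGET   # 0.1% of requests may fail over the window

# Illustrative counts; in practice they come from the two increase() queries above.
total_requests_30d = 120_000_000
failed_requests_30d = 54_000

error_ratio = failed_requests_30d / total_requests_30d   # 0.00045
availability = (1 - error_ratio) * 100                    # 99.955%
budget_used = error_ratio / ERROR_BUDGET                  # 45% of the budget consumed
budget_remaining = (1 - budget_used) * 100                # 55% remaining

# Burn rate: how fast the budget is being spent right now.
# 1.0 means the budget lasts exactly the whole window; >1 exhausts it early.
error_ratio_1h = 0.0025
burn_rate = error_ratio_1h / ERROR_BUDGET                 # 2.5x

print(f"availability={availability:.3f}% "
      f"budget_remaining={budget_remaining:.1f}% "
      f"burn_rate={burn_rate:.1f}x")
```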
Application Performance Monitoring
Custom Metrics
# Application metrics (Flask + prometheus_client)
from flask import Flask
from prometheus_client import Counter, Histogram, Gauge, generate_latest

app = Flask(__name__)

# Define metrics
request_count = Counter('app_requests_total',
                        'Total requests',
                        ['method', 'endpoint', 'status'])
request_duration = Histogram('app_request_duration_seconds',
                             'Request duration',
                             ['method', 'endpoint'])
active_users = Gauge('app_active_users', 'Currently active users')
db_connections = Gauge('app_db_connections_active',
                       'Active database connections',
                       ['pool'])

# Instrument code
@app.route('/api/<endpoint>')
def api_handler(endpoint):
    # Time the request and count it by status; process_request is the
    # application's own handler.
    with request_duration.labels(method='GET', endpoint=endpoint).time():
        try:
            result = process_request(endpoint)
            request_count.labels(method='GET', endpoint=endpoint, status=200).inc()
            return result
        except Exception:
            request_count.labels(method='GET', endpoint=endpoint, status=500).inc()
            raise

# Expose metrics
@app.route('/metrics')
def metrics():
    return generate_latest()
Real User Monitoring (RUM)
// Browser-side monitoring
// Note: performance.timing (Navigation Timing Level 1) is deprecated but still
// widely supported; PerformanceNavigationTiming is the modern replacement.
window.addEventListener('load', function() {
  const perfData = window.performance.timing;
  const pageLoadTime = perfData.loadEventEnd - perfData.navigationStart;
  const connectTime = perfData.responseEnd - perfData.requestStart;
  const renderTime = perfData.domComplete - perfData.domLoading;

  // Send metrics to backend
  fetch('/api/rum', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({
      page: window.location.pathname,
      loadTime: pageLoadTime,
      connectTime: connectTime,
      renderTime: renderTime,
      userAgent: navigator.userAgent,
      timestamp: new Date().toISOString()
    })
  });
});

// Track JavaScript errors (registered at top level so errors thrown before
// 'load' fires are also captured)
window.addEventListener('error', function(e) {
  fetch('/api/rum/error', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({
      message: e.message,
      source: e.filename,
      line: e.lineno,
      column: e.colno,
      error: e.error ? e.error.stack : '',
      timestamp: new Date().toISOString()
    })
  });
});
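A possible server-side counterpart for the `/api/rum` beacon is sketched below in Python/Flask: it converts RUM payloads into Prometheus histograms so front-end latency can be graphed and alerted on alongside back-end metrics. The field names match the JavaScript above; the metric names and endpoint wiring are assumptions.

```python
from flask import Flask, request, jsonify
from prometheus_client import Counter, Histogram, generate_latest

app = Flask(__name__)

PAGE_LOAD = Histogram("rum_page_load_seconds", "Full page load time", ["page"])
RENDER = Histogram("rum_render_seconds", "DOM render time", ["page"])
JS_ERRORS = Counter("rum_js_errors_total", "JavaScript errors reported", ["source"])


@app.route("/api/rum", methods=["POST"])
def collect_rum():
    beacon = request.get_json(force=True)
    page = beacon.get("page", "unknown")
    # Timing API values arrive in milliseconds; convert to seconds for Prometheus.
    PAGE_LOAD.labels(page=page).observe(beacon.get("loadTime", 0) / 1000.0)
    RENDER.labels(page=page).observe(beacon.get("renderTime", 0) / 1000.0)
    return jsonify(status="ok"), 202


@app.route("/api/rum/error", methods=["POST"])
def collect_rum_error():
    beacon = request.get_json(force=True)
    JS_ERRORS.labels(source=beacon.get("source", "unknown")).inc()
    return jsonify(status="ok"), 202


@app.route("/metrics")
def metrics():
    return generate_latest()
```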
Infrastructure Monitoring
Node Monitoring
# Node exporter deployment
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
spec:
selector:
matchLabels:
app: node-exporter
template:
metadata:
labels:
app: node-exporter
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9100"
spec:
hostNetwork: true
hostPID: true
hostIPC: true
containers:
- name: node-exporter
image: prom/node-exporter:latest
args:
- --path.procfs=/host/proc
- --path.sysfs=/host/sys
- --path.rootfs=/host/root
- --collector.filesystem.ignored-mount-points
- ^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
volumeMounts:
- name: proc
mountPath: /host/proc
readOnly: true
- name: sys
mountPath: /host/sys
readOnly: true
- name: root
mountPath: /host/root
mountPropagation: HostToContainer
readOnly: true
volumes:
- name: proc
hostPath:
path: /proc
- name: sys
hostPath:
path: /sys
- name: root
hostPath:
path: /
Database Monitoring
-- PostgreSQL monitoring queries
-- Connection metrics
SELECT
datname,
count(*) as connections,
count(*) filter (where state = 'active') as active,
count(*) filter (where state = 'idle') as idle,
count(*) filter (where state = 'idle in transaction') as idle_in_transaction,
max(extract(epoch from (now() - state_change))) as max_connection_age
FROM pg_stat_activity
GROUP BY datname;
-- Query performance
SELECT
queryid,
query,
calls,
total_exec_time,
mean_exec_time,
stddev_exec_time,
rows
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_stat_statements%'
ORDER BY total_exec_time DESC
LIMIT 20;
-- Table statistics
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
n_live_tup as row_count,
n_dead_tup as dead_rows,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
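In most deployments the stock postgres_exporter covers these statistics, but the sketch below shows the general pattern of turning such a query into Prometheus gauges with a small custom collector. The connection string, port, and metric name are placeholders.

```python
import time

import psycopg2
from prometheus_client import Gauge, start_http_server

DB_CONNECTIONS = Gauge("pg_connections", "Connections by database and state",
                       ["datname", "state"])

QUERY = """
    SELECT datname, state, count(*)
    FROM pg_stat_activity
    WHERE datname IS NOT NULL
    GROUP BY datname, state
"""


def collect(conn) -> None:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for datname, state, count in cur.fetchall():
            DB_CONNECTIONS.labels(datname=datname, state=state or "unknown").set(count)


if __name__ == "__main__":
    start_http_server(9187)  # assumption: reuse postgres_exporter's port convention
    conn = psycopg2.connect("postgresql://monitor:password@db:5432/postgres")  # placeholder DSN
    while True:
        collect(conn)
        time.sleep(15)
```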
Synthetic Monitoring
Blackbox Exporter
# Blackbox exporter configuration
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
valid_status_codes: [200, 201, 202, 203, 204]
follow_redirects: true
fail_if_ssl: false
fail_if_not_ssl: true
tls_config:
insecure_skip_verify: false
http_api_check:
prober: http
timeout: 10s
http:
valid_status_codes: [200]
method: POST
headers:
Content-Type: application/json
body: '{"test": true}'
fail_if_body_not_matches_regexp:
- '"status":\s*"ok"'
tcp_connect:
prober: tcp
timeout: 5s
dns_check:
prober: dns
timeout: 5s
dns:
query_name: "example.com"
query_type: "A"
valid_rcodes:
- NOERROR
# Prometheus job configuration
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://example.com
- https://api.example.com/health
- https://admin.example.com
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
Cost Monitoring
Cloud Cost Tracking
# Kubernetes cost allocation
apiVersion: v1
kind: ConfigMap
metadata:
name: kubecost-values
data:
values.yaml: |
global:
prometheus:
enabled: false
fqdn: http://prometheus-server.monitoring.svc.cluster.local
kubecostModel:
allocation:
nodeLabels:
- team
- environment
- app
costModel:
cloudProvider: AWS
awsRegion: us-east-1
awsServiceKey: s3://cost-reports/cur/
reporting:
valuesReporting: true
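Once allocation data is being collected, per-team reports can be pulled programmatically. The sketch below queries Kubecost's Allocation API aggregated by the `team` label; the service URL is an assumption, and the endpoint path and response shape may differ between Kubecost versions, so treat this as a starting point rather than a reference client.

```python
import requests

KUBECOST = "http://kubecost-cost-analyzer.kubecost.svc:9090"  # assumption


def cost_by_team(window: str = "7d") -> dict:
    resp = requests.get(
        f"{KUBECOST}/model/allocation",
        params={"window": window, "aggregate": "label:team"},
        timeout=30,
    )
    resp.raise_for_status()
    allocations = resp.json().get("data", [])
    totals = {}
    for window_set in allocations:
        for team, alloc in window_set.items():
            totals[team] = totals.get(team, 0.0) + alloc.get("totalCost", 0.0)
    return totals


for team, cost in sorted(cost_by_team().items(), key=lambda kv: -kv[1]):
    print(f"{team}: ${cost:,.2f}")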
Monitoring Best Practices
- USE Method: Utilization, Saturation, Errors for resources
- RED Method: Rate, Errors, Duration for services
- Four Golden Signals: Latency, Traffic, Errors, Saturation (example queries for these methods follow this list)
- Alert Fatigue: Only alert on actionable issues
- Gradual Rollout: Test monitoring in staging first
- Retention: Balance detail vs. storage costs
- Security: Protect metrics and logs endpoints
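As a concrete illustration of the RED and USE methods, here are example PromQL expressions collected as plain strings, so they can be dropped into dashboards or run through the query-API sketch shown earlier. The metric names assume node_exporter and the http_* metrics used throughout this page; adjust them to your instrumentation. The Golden Signals are roughly RED plus a saturation signal.

```python
RED_METHOD = {
    # Rate: requests per second, per service
    "rate":     'sum(rate(http_requests_total[5m])) by (service)',
    # Errors: failing requests per second
    "errors":   'sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)',
    # Duration: p95 latency from the request histogram
    "duration": 'histogram_quantile(0.95, '
                'sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))',
}

USE_METHOD = {
    # Utilization: the recording rule defined earlier on this page
    "utilization": 'instance:cpu_utilization',
    # Saturation: 5-minute load average normalized by CPU count
    "saturation":  'node_load5 / count(node_cpu_seconds_total{mode="idle"}) without (cpu, mode)',
    # Errors: e.g. network receive errors per second
    "errors":      'rate(node_network_receive_errs_total[5m])',
}
```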
Troubleshooting
Common Issues
- Missing Metrics: Check service discovery, firewall rules
- High Cardinality: Review label usage, add recording rules (see the cardinality check after this list)
- Slow Queries: Optimize PromQL, use recording rules
- Alert Storms: Add inhibition rules, group alerts
- Storage Issues: Adjust retention, add more storage
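For the high-cardinality item above, a quick way to find offenders is Prometheus's TSDB status endpoint, which reports the metric names owning the most series. The sketch below assumes the in-cluster service URL and that the endpoint's response shape matches current Prometheus releases; it returns only the top entries, which is usually enough to know where to start.

```python
import requests

PROMETHEUS = "http://prometheus:9090"  # assumption: in-cluster service name


def top_series_counts():
    resp = requests.get(f"{PROMETHEUS}/api/v1/status/tsdb", timeout=10)
    resp.raise_for_status()
    # seriesCountByMetricName lists the heaviest metric names, highest first
    return resp.json()["data"]["seriesCountByMetricName"]


for entry in top_series_counts():
    print(f'{entry["name"]}: {entry["value"]} series')
```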
Note: This documentation is provided for reference purposes only. It reflects general best practices and industry-aligned guidelines, and any examples, claims, or recommendations are intended as illustrative—not definitive or binding.