Advanced Monitoring

Implement comprehensive monitoring and observability solutions to ensure system reliability, performance, and rapid incident detection.

Monitoring Strategy

Effective monitoring provides visibility into system health, performance, and user experience through metrics, logs, traces, and alerts.

Four Pillars of Observability

  • Metrics: Numerical measurements over time
  • Logs: Detailed event records
  • Traces: Request flow through systems
  • Events: Significant state changes

Metrics Collection

Prometheus Setup

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

# Load rules
rule_files:
  - "alerts/*.yml"
  - "recording_rules/*.yml"

# Scrape configurations
scrape_configs:
  # Node exporter
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+)(:[0-9]+)?'
        replacement: '${1}'

  # Kubernetes cluster monitoring
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Service discovery for pods
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
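
The kubernetes-pods job only scrapes pods that opt in through annotations. As an illustrative sketch (the pod name, image, and port are placeholders), a pod would expose its metrics like this:

# Example pod annotations consumed by the kubernetes-pods job (illustrative values)
apiVersion: v1
kind: Pod
metadata:
  name: payments-api                 # placeholder name
  annotations:
    prometheus.io/scrape: "true"     # matches the keep rule above
    prometheus.io/path: "/metrics"   # rewritten into __metrics_path__
    prometheus.io/port: "8080"       # joined into __address__
spec:
  containers:
    - name: app
      image: registry.example.com/payments-api:1.0   # placeholder image
      ports:
        - containerPort: 8080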

Key Metrics to Monitor

# Recording rules for key metrics
groups:
  - name: key_metrics
    interval: 30s
    rules:
      # Request rate
      - record: app:request_rate
        expr: sum(rate(http_requests_total[5m])) by (service, method, status)
      
      # Error rate
      - record: app:error_rate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)
      
      # P95 latency
      - record: app:latency_p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))
      
      # CPU usage
      - record: instance:cpu_utilization
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      
      # Memory usage
      - record: instance:memory_utilization
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

Logging Architecture

ELK Stack Configuration

# Elasticsearch cluster
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: production-es
spec:
  version: 8.11.0
  nodeSets:
  - name: master
    count: 3
    config:
      node.roles: ["master"]
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: 2Gi
              cpu: 1
            limits:
              memory: 2Gi
  - name: data
    count: 5
    config:
      node.roles: ["data", "ingest"]
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
        storageClassName: fast-ssd

---
# Logstash pipeline
apiVersion: v1
kind: ConfigMap
metadata:
  name: logstash-pipeline
data:
  logstash.conf: |
    input {
      beats {
        port => 5044
      }
      
      kafka {
        bootstrap_servers => "kafka:9092"
        topics => ["application-logs"]
        codec => json
      }
    }
    
    filter {
      # Parse JSON logs
      if [message] =~ /^\{.*\}$/ {
        json {
          source => "message"
        }
      }
      
      # Extract fields from log patterns
      grok {
        match => {
          "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"
        }
      }
      
      # Add GeoIP information
      if [client_ip] {
        geoip {
          source => "client_ip"
          target => "geoip"
        }
      }
      
      # Calculate response time
      if [response_time] {
        ruby {
          code => "event.set('response_time_ms', event.get('response_time').to_f * 1000)"
        }
      }
    }
    
    output {
      elasticsearch {
        hosts => ["elasticsearch:9200"]
        index => "logs-%{[@metadata][beat]}-%{+YYYY.MM.dd}"
      }
      
      # Send critical errors to alerting
      if [level] == "ERROR" or [level] == "FATAL" {
        http {
          url => "http://alertmanager:9093/api/v1/alerts"
          http_method => "post"
          format => "json"
        }
      }
    }
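
The beats input on port 5044 expects a shipper such as Filebeat running on each node. A minimal Filebeat sketch, assuming container logs are collected from the standard /var/log/containers path and that the logstash Service name resolves in-cluster:

# filebeat.yml (minimal sketch; paths and hosts are assumptions)
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
    processors:
      - add_kubernetes_metadata:          # enrich events with pod and namespace labels
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"

output.logstash:
  hosts: ["logstash:5044"]                # the beats input defined above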

Structured Logging

// Application logging configuration
import (
    "fmt"
    "net/http"
    "time"

    "github.com/opentracing/opentracing-go"
    "github.com/sirupsen/logrus"
)

type Logger struct {
    *logrus.Logger
}

func NewLogger() *Logger {
    log := logrus.New()
    log.SetFormatter(&logrus.JSONFormatter{
        TimestampFormat: "2006-01-02T15:04:05.999Z07:00",
        FieldMap: logrus.FieldMap{
            logrus.FieldKeyTime:  "@timestamp",
            logrus.FieldKeyLevel: "level",
            logrus.FieldKeyMsg:   "message",
        },
    })
    
    return &Logger{log}
}

func (l *Logger) LogRequest(r *http.Request, statusCode int, duration time.Duration) {
    span := opentracing.SpanFromContext(r.Context())
    
    fields := logrus.Fields{
        "method":        r.Method,
        "path":          r.URL.Path,
        "status":        statusCode,
        "duration_ms":   duration.Milliseconds(),
        "user_agent":    r.UserAgent(),
        "remote_addr":   r.RemoteAddr,
        "request_id":    r.Header.Get("X-Request-ID"),
    }
    
    if span != nil {
        // Render the span context (trace and span IDs) as a string for log correlation
        fields["trace_id"] = fmt.Sprintf("%v", span.Context())
    }
    
    if statusCode >= 500 {
        l.WithFields(fields).Error("Request failed")
    } else if statusCode >= 400 {
        l.WithFields(fields).Warn("Client error")
    } else {
        l.WithFields(fields).Info("Request completed")
    }
}
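
To feed LogRequest with the final status code and duration, the handler chain needs a small wrapper around http.ResponseWriter. One possible middleware sketch (statusRecorder is a hypothetical helper, not part of any library):

// Hypothetical middleware wiring for the Logger above
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (rec *statusRecorder) WriteHeader(code int) {
    rec.status = code
    rec.ResponseWriter.WriteHeader(code)
}

func (l *Logger) Middleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        next.ServeHTTP(rec, r)
        l.LogRequest(r, rec.status, time.Since(start))
    })
}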

Distributed Tracing

Jaeger Configuration

# Jaeger production deployment (separate collector/query, external storage)
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production-tracing
spec:
  strategy: production
  
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200
        index-prefix: jaeger
        tls:
          ca: /es/certificates/ca.crt
    
  collector:
    replicas: 3
    resources:
      limits:
        cpu: 1
        memory: 1Gi
    maxReplicas: 5
    autoscale: true
    
  query:
    replicas: 2
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
        
  agent:
    strategy: DaemonSet
    daemonset:
      hostNetwork: true

---
// Application instrumentation
import (
    "io"
    "log"

    "github.com/opentracing/opentracing-go"
    "github.com/uber/jaeger-client-go"
    jaegercfg "github.com/uber/jaeger-client-go/config"
    jaegerlib "github.com/uber/jaeger-lib/metrics"
)

func initTracing(serviceName string) (opentracing.Tracer, io.Closer) {
    cfg := jaegercfg.Configuration{
        ServiceName: serviceName,
        Sampler: &jaegercfg.SamplerConfig{
            Type:  jaeger.SamplerTypeConst,
            Param: 1,
        },
        Reporter: &jaegercfg.ReporterConfig{
            LogSpans:           true,
            LocalAgentHostPort: "jaeger-agent:6831",
        },
    }
    
    tracer, closer, err := cfg.NewTracer(
        jaegercfg.Logger(jaeger.StdLogger),
        jaegercfg.Metrics(jaegerlib.NullFactory),
    )
    if err != nil {
        log.Fatalf("failed to initialize tracer: %v", err)
    }
    
    opentracing.SetGlobalTracer(tracer)
    return tracer, closer
}
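
With the global tracer registered, handlers can create child spans around units of work so they appear in Jaeger under the request trace. A brief sketch (the operation name, header, and processOrder function are illustrative):

// Creating a child span from the incoming request context (illustrative names)
func handleCheckout(w http.ResponseWriter, r *http.Request) {
    span, ctx := opentracing.StartSpanFromContext(r.Context(), "checkout")
    defer span.Finish()

    span.SetTag("user.id", r.Header.Get("X-User-ID")) // hypothetical header
    processOrder(ctx)                                 // downstream calls reuse ctx to continue the trace
}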

Alerting Configuration

Alert Rules

# alerts/application.yml
groups:
  - name: application_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: app:error_rate > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
      
      # Slow response time
      - alert: SlowResponseTime
        expr: app:latency_p95 > 1
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "Slow response time on {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s for {{ $labels.service }}"
      
      # Pod crash looping
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod has restarted {{ $value }} times in the last 15 minutes"

  - name: infrastructure_alerts
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: instance:cpu_utilization > 90
        for: 15m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
      
      # Disk space low
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes) < 0.1
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space left on {{ $labels.mountpoint }}"
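
Rule files are easy to break with a stray indent or an invalid expression, so it helps to validate them before reloading Prometheus. promtool, which ships with Prometheus, can check both alerting and recording rules:

# Validate rule files before deploying (run from the Prometheus config directory)
promtool check rules alerts/application.yml recording_rules/*.yml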

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'YOUR_SLACK_WEBHOOK'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true
    
    # Team-specific routing
    - match:
        team: backend
      receiver: backend-team
    
    - match:
        team: infrastructure
      receiver: infra-team

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        
  - name: 'backend-team'
    slack_configs:
      - channel: '#backend-alerts'
        send_resolved: true
        
  - name: 'infra-team'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: 'Infrastructure Alert: {{ .GroupLabels.alertname }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
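
The routing tree above can be validated with amtool before rollout; the routes test subcommand also shows which receiver a given label set would reach (the label values below are examples):

# Validate the Alertmanager configuration
amtool check-config alertmanager.yml

# Show which receiver an example label set would be routed to
amtool config routes test --config.file=alertmanager.yml severity=critical team=backend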

Visualization with Grafana

Dashboard Configuration

# Service overview dashboard
{
  "dashboard": {
    "title": "Service Overview",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ],
        "type": "graph",
        "yaxes": [
          {
            "format": "percentunit"
          }
        ]
      },
      {
        "title": "Response Time (P50, P95, P99)",
        "targets": [
          {
            "expr": "histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))",
            "legendFormat": "p50 {{ service }}"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))",
            "legendFormat": "p95 {{ service }}"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))",
            "legendFormat": "p99 {{ service }}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Service Health",
        "targets": [
          {
            "expr": "up{job=~\".*service.*\"}",
            "format": "table",
            "instant": true
          }
        ],
        "type": "table"
      }
    ]
  }
}
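
Dashboards kept as JSON can be loaded automatically through Grafana's file-based provisioning instead of being imported by hand. A sketch of a provisioning provider, assuming the JSON above is written to /var/lib/grafana/dashboards:

# /etc/grafana/provisioning/dashboards/services.yaml (paths are assumptions)
apiVersion: 1
providers:
  - name: 'service-dashboards'
    orgId: 1
    folder: 'Services'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards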

SLO Dashboard

# SLO tracking dashboard
- title: "SLO: 99.9% Availability"
  panels:
    - title: "Current Availability"
      expr: |
        (1 - (
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
        )) * 100
      thresholds:
        - value: 99.9
          color: green
        - value: 99.0
          color: yellow
        - value: 95.0
          color: red
      
    - title: "Error Budget Remaining"
      expr: |
        (
          (0.001 * 30 * 24 * 60) -
          sum(increase(http_requests_total{status=~"5.."}[30d])) / sum(increase(http_requests_total[30d])) * 30 * 24 * 60
        ) / (0.001 * 30 * 24 * 60) * 100
      
    - title: "Burn Rate"
      expr: |
        (
          sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))
        ) / 0.001
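
The burn rate panel pairs naturally with multi-window burn-rate alerts: page only when both a long and a short window are burning the error budget quickly, which filters out brief spikes. A hedged sketch for the 99.9% objective, using the common 14.4x threshold for the 1h/5m window pair:

# Multi-window burn-rate alert for the 99.9% SLO (sketch)
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h])) > 14.4 * 0.001
    )
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 14.4 * 0.001
    )
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error budget is burning roughly 14x faster than sustainable"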

Application Performance Monitoring

Custom Metrics

# Application metrics
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST

# Define metrics
request_count = Counter('app_requests_total', 
                       'Total requests', 
                       ['method', 'endpoint', 'status'])

request_duration = Histogram('app_request_duration_seconds',
                           'Request duration',
                           ['method', 'endpoint'])

active_users = Gauge('app_active_users', 'Currently active users')

db_connections = Gauge('app_db_connections_active', 
                      'Active database connections',
                      ['pool'])

# Instrument code
@app.route('/api/<endpoint>')
def api_handler(endpoint):
    # Time the request with the histogram; the label value is only known at call time,
    # so use the context-manager form rather than a decorator
    with request_duration.labels(method='GET', endpoint=endpoint).time():
        try:
            result = process_request(endpoint)
            request_count.labels(method='GET', endpoint=endpoint, status=200).inc()
            return result
        except Exception:
            request_count.labels(method='GET', endpoint=endpoint, status=500).inc()
            raise

# Expose metrics
@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
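
Gauges such as active_users and db_connections are typically updated from the application's own lifecycle hooks rather than per request. A brief sketch (the session and pool objects are assumptions):

# Updating gauges from application events (session and pool objects are illustrative)
def on_user_login(session):
    active_users.inc()

def on_user_logout(session):
    active_users.dec()

def report_pool_stats(pool):
    # Called periodically, e.g. from a background thread
    db_connections.labels(pool='primary').set(pool.checked_out_connections)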

Real User Monitoring (RUM)

// Browser-side monitoring
window.addEventListener('load', function() {
    const perfData = window.performance.timing;
    const pageLoadTime = perfData.loadEventEnd - perfData.navigationStart;
    const connectTime = perfData.responseEnd - perfData.requestStart;
    const renderTime = perfData.domComplete - perfData.domLoading;
    
    // Send metrics to backend
    fetch('/api/rum', {
        method: 'POST',
        headers: {'Content-Type': 'application/json'},
        body: JSON.stringify({
            page: window.location.pathname,
            loadTime: pageLoadTime,
            connectTime: connectTime,
            renderTime: renderTime,
            userAgent: navigator.userAgent,
            timestamp: new Date().toISOString()
        })
    });
    
    // Track JavaScript errors
    window.addEventListener('error', function(e) {
        fetch('/api/rum/error', {
            method: 'POST',
            headers: {'Content-Type': 'application/json'},
            body: JSON.stringify({
                message: e.message,
                source: e.filename,
                line: e.lineno,
                column: e.colno,
                error: e.error ? e.error.stack : '',
                timestamp: new Date().toISOString()
            })
        });
    });
});
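
On the receiving side, the /api/rum endpoint can translate these payloads into Prometheus metrics so browser timings sit alongside server-side data. A minimal, self-contained Flask sketch (the metric and route names mirror the browser snippet but are otherwise assumptions):

# Hypothetical /api/rum receiver exposing browser timings as Prometheus metrics
from flask import Flask, request, jsonify
from prometheus_client import Histogram

app = Flask(__name__)

page_load_seconds = Histogram('rum_page_load_seconds',
                              'Browser-reported page load time',
                              ['page'])

@app.route('/api/rum', methods=['POST'])
def collect_rum():
    payload = request.get_json(force=True)
    # The browser reports milliseconds; convert to seconds per Prometheus conventions
    page_load_seconds.labels(page=payload.get('page', 'unknown')).observe(
        payload.get('loadTime', 0) / 1000.0)
    return jsonify({'status': 'ok'}), 202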

Infrastructure Monitoring

Node Monitoring

# Node exporter deployment
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
    spec:
      hostNetwork: true
      hostPID: true
      hostIPC: true
      containers:
      - name: node-exporter
        image: prom/node-exporter:latest
        args:
          - --path.procfs=/host/proc
          - --path.sysfs=/host/sys
          - --path.rootfs=/host/root
          - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        volumeMounts:
          - name: proc
            mountPath: /host/proc
            readOnly: true
          - name: sys
            mountPath: /host/sys
            readOnly: true
          - name: root
            mountPath: /host/root
            mountPropagation: HostToContainer
            readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: root
          hostPath:
            path: /

Database Monitoring

-- PostgreSQL monitoring queries
-- Connection metrics
SELECT 
    datname,
    count(*) as connections,
    count(*) filter (where state = 'active') as active,
    count(*) filter (where state = 'idle') as idle,
    count(*) filter (where state = 'idle in transaction') as idle_in_transaction,
    max(extract(epoch from (now() - state_change))) as max_connection_age
FROM pg_stat_activity
GROUP BY datname;

-- Query performance
SELECT 
    queryid,
    query,
    calls,
    total_exec_time,
    mean_exec_time,
    stddev_exec_time,
    rows
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_stat_statements%'
ORDER BY total_exec_time DESC
LIMIT 20;

-- Table statistics
SELECT 
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
    n_live_tup as row_count,
    n_dead_tup as dead_rows,
    last_vacuum,
    last_autovacuum
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
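
Rather than running these queries ad hoc, they can be wired into postgres_exporter's custom queries file so the results become Prometheus metrics. A sketch for the connection query (metric and label names are assumptions):

# queries.yaml for postgres_exporter (sketch)
pg_connections:
  query: |
    SELECT datname,
           count(*) AS total,
           count(*) FILTER (WHERE state = 'active') AS active
    FROM pg_stat_activity
    GROUP BY datname
  metrics:
    - datname:
        usage: "LABEL"
        description: "Database name"
    - total:
        usage: "GAUGE"
        description: "Total connections"
    - active:
        usage: "GAUGE"
        description: "Active connections"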

Synthetic Monitoring

Blackbox Exporter

# Blackbox exporter configuration
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200, 201, 202, 203, 204]
      follow_redirects: true
      fail_if_ssl: false
      fail_if_not_ssl: true
      tls_config:
        insecure_skip_verify: false
        
  http_api_check:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200]
      method: POST
      headers:
        Content-Type: application/json
      body: '{"test": true}'
      fail_if_body_not_matches_regexp:
        - '"status":\s*"ok"'
        
  tcp_connect:
    prober: tcp
    timeout: 5s
    
  dns_check:
    prober: dns
    timeout: 5s
    dns:
      query_name: "example.com"
      query_type: "A"
      valid_rcodes:
        - NOERROR

# Prometheus job configuration
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
      - https://example.com
      - https://api.example.com/health
      - https://admin.example.com
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter:9115
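
Probe results surface as the probe_success metric, so an endpoint-down alert on top of this job is straightforward; an illustrative rule:

# Alert when a synthetic probe fails (sketch)
- alert: EndpointDown
  expr: probe_success{job="blackbox"} == 0
  for: 3m
  labels:
    severity: critical
    team: infrastructure
  annotations:
    summary: "Synthetic probe failing for {{ $labels.instance }}"
    description: "Blackbox probe has been failing for 3 minutes"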

Cost Monitoring

Cloud Cost Tracking

# Kubernetes cost allocation
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubecost-values
data:
  values.yaml: |
    global:
      prometheus:
        enabled: false
        fqdn: http://prometheus-server.monitoring.svc.cluster.local
    
    kubecostModel:
      allocation:
        nodeLabels:
          - team
          - environment
          - app
    
    costModel:
      cloudProvider: AWS
      awsRegion: us-east-1
      awsServiceKey: s3://cost-reports/cur/
      
    reporting:
      valuesReporting: true

Monitoring Best Practices

  • USE Method: Utilization, Saturation, Errors for resources
  • RED Method: Rate, Errors, Duration for services
  • Four Golden Signals: Latency, Traffic, Errors, Saturation
  • Alert Fatigue: Only alert on actionable issues
  • Gradual Rollout: Test monitoring in staging first
  • Retention: Balance detail vs. storage costs
  • Security: Protect metrics and logs endpoints

Troubleshooting

Common Issues

  • Missing Metrics: Check service discovery, firewall rules
  • High Cardinality: Review label usage, add recording rules (see the query example after this list)
  • Slow Queries: Optimize PromQL, use recording rules
  • Alert Storms: Add inhibition rules, group alerts
  • Storage Issues: Adjust retention, add more storage
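
For the high-cardinality case, a quick way to find offending metrics is to count series per metric name directly in Prometheus (an ad-hoc query, not meant as a recording rule):

# Top 10 metric names by series count
topk(10, count by (__name__)({__name__=~".+"}))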