Troubleshooting Guide

Troubleshooting Methodology

Follow this systematic approach to efficiently identify and resolve issues:

1. Define the Problem: Clearly identify symptoms and scope
2. Gather Information: Collect logs, metrics, and error messages
3. Develop Hypothesis: Form theories about root causes
4. Test and Isolate: Systematically test each hypothesis
5. Implement Solution: Apply fixes based on findings
6. Verify Resolution: Confirm the issue is resolved
7. Document Findings: Record the solution for future reference

Common Infrastructure Issues

Server Unreachable

Symptoms

Cannot SSH to server
Website/application not responding
Monitoring shows server as down

Diagnostic Steps

# 1. Check network connectivity
ping server-ip-address
traceroute server-ip-address

# 2. Check DNS resolution
nslookup your-domain.com
dig your-domain.com

# 3. Test specific ports
telnet server-ip 22  # SSH
telnet server-ip 80  # HTTP
telnet server-ip 443 # HTTPS

# 4. Check from different locations
# Use online tools or different networks

# 5. Check cloud provider console
# - Instance status
# - Network security groups
# - VPC/subnet configuration
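
The port tests in step 3 can be scripted when telnet is not installed. The following is a minimal sketch, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command are available. check_port is a hypothetical helper name, and server-ip is the same placeholder used above.

```shell
# check_port HOST PORT [TIMEOUT_SECONDS]
# Hypothetical helper: attempts a TCP connect through bash's /dev/tcp
# redirection. Prints the result and returns non-zero when the port
# cannot be reached within the timeout (default 3 seconds).
check_port() {
  if timeout "${3:-3}" bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "open   $1:$2"
  else
    echo "closed $1:$2 (refused, filtered, or timed out)"
    return 1
  fi
}

# Usage (placeholder address):
#   for port in 22 80 443; do check_port server-ip "$port"; done
```

A "closed" result does not distinguish between a refused connection, a firewall drop, and a timeout, so it narrows the problem rather than identifying it.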

Common Causes & Solutions

Network Security Group: Check that inbound rules allow your IP
Server Crashed: Reboot via cloud console
Firewall Rules: Review iptables/ufw configuration
Full Disk: Check disk usage via console
Network Issues: Check VPC, subnet, routing tables

High CPU Usage

Diagnostic Commands

# Identify top CPU consumers
top -c
htop

# Check load average
uptime
w

# Process-specific CPU usage
ps aux --sort=-%cpu | head -20

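# Sample CPU utilization now (1-second interval, 10 samples)
# Run "sar -u" without an interval to see today's recorded history
sar -u 1 10

# Check for I/O wait (%iowait in the avg-cpu section)
iostat -x 1 5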

# Trace high CPU process
strace -p [PID] -c

# Profile application (Python example)
py-spy top --pid [PID]
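
A load average is only meaningful relative to the number of CPU cores. As a rough sketch, the hypothetical helper below (load_per_core is not a standard tool) normalizes a load value by a core count. The usage line assumes Linux, where /proc/loadavg and nproc are available.

```shell
# load_per_core LOAD CORES
# Hypothetical helper: divides a load average by the number of CPU cores.
# Sustained values above roughly 1.00 suggest more runnable work than cores.
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    if (cores <= 0) exit 1
    printf "%.2f\n", load / cores
  }'
}

# Usage on Linux (1-minute load average):
#   load_per_core "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```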

Common Causes

Runaway Process: Kill or restart the process
Insufficient Resources: Scale up instance
Inefficient Code: Profile and optimize
Memory Pressure: Check for swapping
Malware/Mining: Run security scan

Out of Memory (OOM)

Investigation

# Check memory usage
free -h
vmstat 1 5

# Find memory-hungry processes
ps aux --sort=-%mem | head -20

# Check OOM killer logs
dmesg | grep -i "killed process"
journalctl -u [service-name] | grep -i memory

# Memory breakdown
cat /proc/meminfo

# Check for memory leaks
valgrind --leak-check=full [program]

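# Java heap analysis
# (jmap -heap was removed in JDK 9; newer JDKs provide it through jhsdb)
jmap -heap [PID]               # JDK 8 and earlier
jhsdb jmap --heap --pid [PID]  # JDK 9 and later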
jstat -gcutil [PID] 1000
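
When the OOM killer has fired several times, it helps to see which processes were killed. The sketch below assumes the kernel log wording "Killed process PID (name)", which recent Linux kernels use but which can vary between versions. oom_victims is a hypothetical helper name.

```shell
# oom_victims: reads kernel log text on stdin and prints "PID NAME"
# for every line that reports a process killed by the OOM killer.
oom_victims() {
  sed -n 's/.*[Kk]illed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Usage: count kills per process name, most frequent first
#   dmesg | oom_victims | awk '{print $2}' | sort | uniq -c | sort -rn
```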

Solutions

Increase server memory
Optimize application memory usage
Configure swap space
Implement memory limits (cgroups)
Fix memory leaks in code

Application Issues

Slow Response Times

# 1. Check application logs
tail -f /var/log/application.log
journalctl -u myapp -f

# 2. Database query analysis
# MySQL
SHOW PROCESSLIST;
SHOW STATUS LIKE 'Slow_queries';
EXPLAIN SELECT ...;

# PostgreSQL
SELECT * FROM pg_stat_activity WHERE state != 'idle';
EXPLAIN ANALYZE SELECT ...;

# 3. Network latency
ping -c 10 database-server
traceroute api-endpoint

# 4. Application profiling
# Node.js
node --prof app.js
# Python
python -m cProfile -o profile.out app.py

# 5. Check resource limits
ulimit -a
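
Before digging into logs and queries, it can help to quantify how slow the application is. The following sketch computes a nearest-rank percentile over response times read from stdin. percentile is a hypothetical helper, and the URL in the usage line is a placeholder rather than an endpoint defined in this guide.

```shell
# percentile P
# Reads one number per line on stdin and prints the P-th percentile
# using the nearest-rank method (rank = ceil(P/100 * N)).
percentile() {
  LC_ALL=C sort -n | awk -v p="$1" '
    { value[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # integer ceiling of p*NR/100
      if (rank < 1) rank = 1
      print value[rank]
    }'
}

# Usage: 95th percentile of 20 requests (placeholder URL)
#   for i in $(seq 20); do
#     curl -s -o /dev/null -w '%{time_total}\n' http://localhost:8080/health
#   done | percentile 95
```

Comparing the median (percentile 50) with the 95th percentile shows whether every request is slow or only a tail of them, which points toward different causes.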

Application Won't Start

Checklist

# 1. Check service status
systemctl status myapp
journalctl -xeu myapp

# 2. Verify configuration
[app-binary] --validate-config
nginx -t  # for nginx

# 3. Check port availability
netstat -tulpn | grep :8080
lsof -i :8080

# 4. Verify permissions
ls -la /path/to/app
namei -l /path/to/app/binary

# 5. Check dependencies
ldd /path/to/binary  # Linux
otool -L binary      # macOS

# 6. Environment variables
env | grep APP_
printenv

Common Issues

Port already in use
Missing environment variables
Incorrect file permissions
Configuration syntax errors
Missing dependencies
Database connection failures
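
Missing environment variables are quick to rule out with a small check run in the same environment as the service. This is a sketch only: missing_env is a hypothetical helper, and the variable names in the usage line are illustrative, not names this guide defines.

```shell
# missing_env NAME...
# Prints the name of every listed variable that is not set in the
# current environment. Prints nothing when all of them are set.
missing_env() {
  for name in "$@"; do
    # ${VAR+set} expands to "set" when VAR is set, even if it is empty
    eval "_is_set=\${$name+set}"
    [ -n "$_is_set" ] || echo "$name"
  done
}

# Usage (illustrative variable names):
#   missing_env APP_DB_URL APP_SECRET_KEY APP_PORT
```

Only pass literal variable names to this helper, since it uses eval to look them up.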

Database Troubleshooting

Connection Issues

# Test connectivity
telnet db-host 5432  # PostgreSQL
telnet db-host 3306  # MySQL

# MySQL connection test
mysql -h db-host -u username -p -e "SELECT 1"

# PostgreSQL connection test
psql -h db-host -U username -d dbname -c "SELECT 1"

# Check connection limits
# MySQL
SHOW VARIABLES LIKE 'max_connections';
SHOW STATUS LIKE 'Threads_connected';

# PostgreSQL
SHOW max_connections;
SELECT count(*) FROM pg_stat_activity;

Slow Queries

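# Enable slow query log
# MySQL (log_output must include TABLE for mysql.slow_log to be populated)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;
SET GLOBAL log_output = 'FILE,TABLE';

# Find slow queries
# MySQL
SELECT * FROM mysql.slow_log ORDER BY query_time DESC LIMIT 10;

# PostgreSQL (requires the pg_stat_statements extension;
# on PostgreSQL 12 and earlier the columns are total_time and mean_time)
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;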

# Tables with no indexes at all
# MySQL
SELECT tables.table_schema, tables.table_name, tables.engine
FROM information_schema.tables
LEFT JOIN information_schema.statistics
ON tables.table_schema = statistics.table_schema
AND tables.table_name = statistics.table_name
WHERE statistics.index_name IS NULL
AND tables.table_type = 'BASE TABLE'
AND tables.table_schema NOT IN ('information_schema', 'mysql', 'performance_schema', 'sys');

# Query optimization (EXPLAIN ANALYZE actually executes the query)
EXPLAIN ANALYZE [YOUR QUERY];

Replication Lag

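# MySQL (8.0.22 and later: SHOW REPLICA STATUS replaces the deprecated SHOW SLAVE STATUS)
SHOW SLAVE STATUS\G
SELECT * FROM mysql.slave_relay_log_info;

# PostgreSQL
# On the primary: state and lag of each connected replica
SELECT * FROM pg_stat_replication;
# On the replica: bytes of WAL received but not yet replayed
SELECT pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn() AS replication_lag;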

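# Re-establish replication
# MySQL (RESET SLAVE discards the replica's position and relay logs;
# record the current source coordinates before running it)
STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO ...;
START SLAVE;

# PostgreSQL (these pause and resume WAL replay on a standby;
# they do not repair a broken replica)
SELECT pg_wal_replay_pause();
SELECT pg_wal_replay_resume();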
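
When LSNs are collected from the primary and the replica in separate sessions, the lag has to be computed outside the database. A PostgreSQL LSN is a 64-bit byte position printed as two hexadecimal halves separated by a slash. The sketch below converts that text form to bytes. Both function names are hypothetical helpers.

```shell
# lsn_to_bytes LSN
# Converts a PostgreSQL LSN such as 0/3000060 to an absolute byte
# position: high half * 2^32 + low half.
lsn_to_bytes() {
  echo $(( (0x${1%%/*} << 32) + 0x${1##*/} ))
}

# lsn_lag_bytes NEWER_LSN OLDER_LSN
# Prints how many bytes of WAL separate the two positions.
lsn_lag_bytes() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Usage: compare the primary's current LSN with the replica's replay LSN
#   lsn_lag_bytes 1/10 0/FFFFFFF0
```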

Kubernetes Troubleshooting

Pod Issues

# Pod not starting
kubectl describe pod [pod-name]
kubectl logs [pod-name] --previous
kubectl get events --sort-by='.lastTimestamp'

# Common issues and fixes
# ImagePullBackOff
kubectl get pod [pod-name] -o yaml | grep -A5 image
# Check image name, registry credentials

# CrashLoopBackOff
kubectl logs [pod-name] --previous
kubectl exec -it [pod-name] -- /bin/sh
# Check application logs, configuration

# Pending
kubectl describe pod [pod-name] | grep -A5 Events
kubectl get nodes
kubectl describe node [node-name]
# Check resource availability, node selectors

# Resource issues
kubectl top nodes
kubectl top pods
kubectl describe resourcequota
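
In a busy namespace, the pods worth describing are the ones that are not healthy. The following sketch filters the default table printed by kubectl get pods, assuming the usual column order (NAME, READY, STATUS, RESTARTS, AGE). unhealthy_pods is a hypothetical helper name.

```shell
# unhealthy_pods: reads `kubectl get pods` table output on stdin and
# prints "NAME STATUS" for every pod that is neither Running nor Completed.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage:
#   kubectl get pods | unhealthy_pods
```

A pod can report Running while its containers are not ready, so the READY column is still worth checking for pods that pass this filter.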

Service Discovery

# Service not reachable
kubectl get svc
kubectl get endpoints [service-name]
kubectl describe svc [service-name]

# DNS resolution
kubectl exec -it [pod-name] -- nslookup [service-name]
kubectl exec -it [pod-name] -- cat /etc/resolv.conf

# Test connectivity
kubectl exec -it [pod-name] -- curl [service-name]:[port]
kubectl exec -it [pod-name] -- nc -zv [service-name] [port]

# Check network policies
kubectl get networkpolicies
kubectl describe networkpolicy [policy-name]

Storage Issues

# PVC not binding
kubectl get pvc
kubectl describe pvc [pvc-name]
kubectl get pv

# Storage class issues
kubectl get storageclass
kubectl describe storageclass [class-name]

# Volume mount problems
kubectl describe pod [pod-name] | grep -A10 Volumes
kubectl exec [pod-name] -- df -h
kubectl exec [pod-name] -- ls -la /mount/path

Network Troubleshooting

DNS Issues

# Test DNS resolution
nslookup domain.com
dig domain.com
host domain.com

# Check DNS configuration
cat /etc/resolv.conf
systemd-resolve --status  # resolvectl status on newer systemd

# Flush DNS cache
# Linux (systemd-resolved)
systemctl restart systemd-resolved
# macOS
sudo killall -HUP mDNSResponder

# Test specific DNS server
dig @8.8.8.8 domain.com
nslookup domain.com 8.8.8.8

# Check DNS propagation
dig +trace domain.com

SSL/TLS Issues

# Test SSL certificate
openssl s_client -connect domain.com:443 -servername domain.com

# Check certificate details
openssl s_client -connect domain.com:443 < /dev/null | openssl x509 -text

# Verify certificate chain
openssl s_client -showcerts -connect domain.com:443

# Test specific TLS version
openssl s_client -connect domain.com:443 -tls1_2

# Check cipher suites
nmap --script ssl-enum-ciphers -p 443 domain.com

# Certificate expiration
echo | openssl s_client -connect domain.com:443 2>/dev/null | openssl x509 -noout -dates

Load Balancer Issues

# Check backend health
curl -I http://backend-server:port/health

# Test load balancer directly
curl -H "Host: domain.com" http://lb-ip/

# Check load balancer logs
# AWS ALB
aws elbv2 describe-target-health --target-group-arn [arn]

# NGINX
tail -f /var/log/nginx/error.log
nginx -t  # Test configuration

# HAProxy
haproxy -c -f /etc/haproxy/haproxy.cfg
echo "show stat" | socat stdio /var/run/haproxy.sock

Performance Troubleshooting

Disk I/O Issues

# Check disk usage
df -h
du -sh /* | sort -h

# I/O statistics
iostat -x 1 5
iotop -o

# Find large files
find / -type f -size +1G -exec ls -lh {} \;

# Check disk health
smartctl -a /dev/sda
badblocks -v /dev/sda

# File system issues
fsck -n /dev/sda1  # Dry run
dmesg | grep -i error
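
On hosts with many mounts, the df output can be filtered down to the filesystems that are close to full. This sketch parses the POSIX output format of df -P. full_filesystems is a hypothetical helper, and it assumes mount points contain no spaces.

```shell
# full_filesystems THRESHOLD
# Reads `df -P` output on stdin and prints "MOUNTPOINT USE%" for each
# filesystem whose usage is at or above THRESHOLD percent.
full_filesystems() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

# Usage:
#   df -P | full_filesystems 90
```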

Network Performance

# Bandwidth testing
iperf3 -s  # Server
iperf3 -c server-ip  # Client

# Packet loss
ping -c 100 destination | grep loss
mtr destination

# Network utilization
iftop
nethogs
vnstat

# TCP tuning check
sysctl net.ipv4.tcp_congestion_control
sysctl net.core.rmem_max
sysctl net.core.wmem_max

# Connection tracking
conntrack -L
ss -s
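
For scripted checks, the packet loss figure can be pulled out of the ping summary as a bare number. The sketch below assumes the summary line contains the text "N% packet loss", which is what the common Linux and macOS ping implementations print. packet_loss is a hypothetical helper name, and destination is the same placeholder used above.

```shell
# packet_loss: reads ping output on stdin and prints the loss percentage
# from the summary line, without the percent sign.
packet_loss() {
  sed -n 's/.*[ ,]\([0-9][0-9.]*\)% packet loss.*/\1/p'
}

# Usage (placeholder host):
#   ping -c 100 destination | packet_loss
```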

Security Incident Response

Suspected Breach

# 1. Isolate the system
# Update security groups/firewall rules

# 2. Capture evidence
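# Memory dump
# (/dev/mem is restricted on most modern kernels, so this often captures
# little or fails; a dedicated acquisition tool such as LiME or AVML is
# more reliable. Write evidence to external storage where possible.)
sudo dd if=/dev/mem of=/evidence/memory.dump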

# Network connections
netstat -tulpan > /evidence/netstat.txt
ss -tulpan > /evidence/ss.txt
lsof -i > /evidence/lsof.txt

# Process list
ps auxfww > /evidence/processes.txt
pstree -p > /evidence/pstree.txt

# 3. Check for unauthorized access
last -f /var/log/wtmp
lastb -f /var/log/btmp
grep "Accepted" /var/log/auth.log  # /var/log/secure on RHEL-based systems

# 4. Look for suspicious processes
ps aux | grep -v "\\[.*\\]" | awk '{print $11}' | sort | uniq -c | sort -n
lsof -n | grep ESTABLISHED

# 5. Check for rootkits
rkhunter --check
chkrootkit

# 6. Review logs
journalctl --since "2 hours ago"
find /var/log -type f -mtime -1 -exec ls -la {} \;
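
The successful logins found in step 3 are easier to review when grouped by user and source address. This sketch assumes the OpenSSH log wording "Accepted METHOD for USER from IP port N". accepted_logins is a hypothetical helper name.

```shell
# accepted_logins: reads sshd log lines on stdin and prints
# "COUNT USER IP" for each user/source pair, most frequent first.
accepted_logins() {
  awk '/sshd.*Accepted/ {
    user = ""; ip = ""
    for (i = 1; i < NF; i++) {
      if ($i == "for")  user = $(i + 1)
      if ($i == "from") ip   = $(i + 1)
    }
    count[user " " ip]++
  }
  END { for (key in count) print count[key], key }' | sort -rn
}

# Usage (log path varies by distribution):
#   accepted_logins < /var/log/auth.log
```

Pairs with a single login from an unfamiliar address are usually the first ones to check against known administrator activity.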

Cloud Provider Specific

AWS Troubleshooting

# EC2 instance issues
aws ec2 describe-instance-status --instance-ids i-xxxxx
aws ec2 get-console-output --instance-id i-xxxxx

# Check CloudWatch logs
aws logs tail /aws/lambda/function-name --follow

# RDS issues
aws rds describe-db-instances --db-instance-identifier mydb
aws rds describe-events --source-identifier mydb --source-type db-instance

# S3 access issues
aws s3api get-bucket-policy --bucket my-bucket
aws s3api get-bucket-acl --bucket my-bucket

Azure Troubleshooting

# VM diagnostics
az vm boot-diagnostics get-boot-log --name MyVM --resource-group MyRG

# Check VM status
az vm get-instance-view --name MyVM --resource-group MyRG

# Storage issues
az storage account show --name mystorageaccount
az storage account keys list --account-name mystorageaccount

GCP Troubleshooting

# Instance serial console
gcloud compute instances get-serial-port-output instance-name

# Check instance status
gcloud compute instances describe instance-name

# View logs
gcloud logging read "resource.type=gce_instance"

Emergency Procedures

System Unresponsive

Try to access via out-of-band management (IPMI/iLO)
Check cloud provider console for instance status
Attempt forced reboot via provider console
If critical, initiate failover to standby system
Create snapshot before any destructive actions

Data Recovery

# Attempt file recovery
testdisk /dev/sda
photorec /dev/sda

# MySQL binary log recovery
mysqlbinlog mysql-bin.000001 | mysql -u root -p

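# PostgreSQL: reset a corrupted WAL (last resort)
# pg_resetwal does not replay WAL; it discards it and can lose committed data.
# Stop the server and copy the data directory before running it.
pg_resetwal -f /var/lib/postgresql/data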

# Emergency backup
dd if=/dev/sda of=/backup/emergency.img bs=64K conv=noerror,sync

Troubleshooting Tools

Essential Tools

System: htop, iotop, iftop, sysstat, dstat
Network: tcpdump, wireshark, nmap, netcat, mtr
Performance: perf, strace, ltrace, systemtap
Debugging: gdb, valgrind, dtrace
Logs: multitail, lnav, stern (Kubernetes)
Cloud: aws-cli, azure-cli, gcloud
