Troubleshooting Guide
A guide to diagnosing and resolving common infrastructure issues, performance problems, and system failures.
Troubleshooting Methodology
Follow this systematic approach to efficiently identify and resolve issues:
- Define the Problem: Clearly identify symptoms and scope
- Gather Information: Collect logs, metrics, and error messages
- Develop Hypothesis: Form theories about root causes
- Test and Isolate: Systematically test each hypothesis
- Implement Solution: Apply fixes based on findings
- Verify Resolution: Confirm the issue is resolved
- Document Findings: Record the solution for future reference
Common Infrastructure Issues
Server Unreachable
Symptoms
- Cannot SSH to server
- Website/application not responding
- Monitoring shows server as down
Diagnostic Steps
# 1. Check network connectivity
ping server-ip-address
traceroute server-ip-address
# 2. Check DNS resolution
nslookup your-domain.com
dig your-domain.com
# 3. Test specific ports
telnet server-ip 22 # SSH
telnet server-ip 80 # HTTP
telnet server-ip 443 # HTTPS
# 4. Check from different locations
# Use online tools or different networks
# 5. Check cloud provider console
# - Instance status
# - Network security groups
# - VPC/subnet configuration
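telnet is often not installed on minimal servers, so the port tests in step 3 can also be sketched with bash's built-in /dev/tcp redirection. The function name check_port and the 3-second default timeout are assumptions made for illustration; the sketch assumes bash and the coreutils timeout command are available.

```shell
# check_port HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection can be opened, non-zero otherwise.
# Relies on bash's /dev/tcp pseudo-device, so neither telnet nor nc is needed.
check_port() {
    timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example: probe the three ports from step 3 (server-ip is a placeholder).
# for p in 22 80 443; do
#     if check_port server-ip "$p"; then
#         echo "port $p open"
#     else
#         echo "port $p closed or filtered"
#     fi
# done
```

A non-zero result does not distinguish a refused connection from a filtered one; a timeout usually points at a firewall or security group rather than a stopped service.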
Common Causes & Solutions
- Network Security Group: Check inbound rules allow your IP
- Server Crashed: Reboot via cloud console
- Firewall Rules: Review iptables/ufw configuration
- Full Disk: Check disk usage via console
- Network Issues: Check VPC, subnet, routing tables
High CPU Usage
Diagnostic Commands
# Identify top CPU consumers
top -c
htop
# Check load average
uptime
w
# Process-specific CPU usage
ps aux --sort=-%cpu | head -20
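# Sample CPU utilization (10 one-second samples); run "sar -u" with no interval for today's recorded history
sar -u 1 10
# Check for I/O wait (%iowait in the avg-cpu section)
iostat -x 1 5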
# Trace high CPU process
strace -p [PID] -c
# Profile application (Python example)
py-spy top --pid [PID]
Common Causes
- Runaway Process: Kill or restart the process
- Insufficient Resources: Scale up instance
- Inefficient Code: Profile and optimize
- Memory Pressure: Check for swapping
- Malware/Mining: Run security scan
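To spot a runaway process quickly, the ps output can be filtered against a CPU threshold. This is a minimal sketch: the function name high_cpu and the 80 percent threshold in the usage line are illustrative assumptions, and only the first word of the command name is kept.

```shell
# high_cpu THRESHOLD
# Reads "PID %CPU COMMAND" rows on stdin and prints the rows whose
# CPU usage is at or above THRESHOLD percent.
high_cpu() {
    awk -v limit="$1" '$2 + 0 >= limit + 0 { print $1, $2, $3 }'
}

# Live usage with procps ps on Linux (the trailing "=" suppresses the headers):
# ps -eo pid=,pcpu=,comm= --sort=-pcpu | high_cpu 80
```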
Out of Memory (OOM)
Investigation
# Check memory usage
free -h
vmstat 1 5
# Find memory-hungry processes
ps aux --sort=-%mem | head -20
# Check OOM killer logs
dmesg | grep -i "killed process"
journalctl -u [service-name] | grep -i memory
# Memory breakdown
cat /proc/meminfo
# Check for memory leaks
valgrind --leak-check=full [program]
# Java heap analysis
jmap -heap [PID] # JDK 8 and earlier; on JDK 9+ use: jhsdb jmap --heap --pid [PID]
jstat -gcutil [PID] 1000
Solutions
- Increase server memory
- Optimize application memory usage
- Configure swap space
- Implement memory limits (cgroups)
- Fix memory leaks in code
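Before choosing between these solutions it helps to know how much headroom is left. The sketch below reads a meminfo-format file and reports MemAvailable as a percentage of MemTotal. The function name mem_available_pct is hypothetical, and the optional file argument exists only so the function can be run against a saved copy of /proc/meminfo.

```shell
# mem_available_pct [MEMINFO_FILE]
# Prints MemAvailable as a whole-number percentage of MemTotal.
# Defaults to /proc/meminfo (Linux only).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END {
             if (total > 0) printf "%d\n", avail * 100 / total
             else exit 1
         }' "${1:-/proc/meminfo}"
}

# Example: warn when less than 10 percent of memory is available.
# pct=$(mem_available_pct) && [ "$pct" -lt 10 ] && echo "low memory: ${pct}% available"
```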
Application Issues
Slow Response Times
# 1. Check application logs
tail -f /var/log/application.log
journalctl -u myapp -f
# 2. Database query analysis
# MySQL
SHOW PROCESSLIST;
SHOW STATUS LIKE 'Slow_queries';
EXPLAIN SELECT ...;
# PostgreSQL
SELECT * FROM pg_stat_activity WHERE state != 'idle';
EXPLAIN ANALYZE SELECT ...;
# 3. Network latency
ping -c 10 database-server
traceroute api-endpoint
# 4. Application profiling
# Node.js
node --prof app.js
# Python
python -m cProfile -o profile.out app.py
# 5. Check resource limits
ulimit -a
Application Won't Start
Checklist
# 1. Check service status
systemctl status myapp
journalctl -xeu myapp
# 2. Verify configuration
[app-binary] --validate-config
nginx -t # for nginx
# 3. Check port availability
netstat -tulpn | grep :8080
lsof -i :8080
# 4. Verify permissions
ls -la /path/to/app
namei -l /path/to/app/binary
# 5. Check dependencies
ldd /path/to/binary # Linux
otool -L binary # macOS
# 6. Environment variables
env | grep APP_
printenv
Common Issues
- Port already in use
- Missing environment variables
- Incorrect file permissions
- Configuration syntax errors
- Missing dependencies
- Database connection failures
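Missing environment variables are easy to check for before starting the service. The following sketch prints every listed variable that is unset or empty. The function name missing_env and the variable names APP_PORT and APP_DB_URL are placeholders, not names any particular application requires.

```shell
# missing_env NAME...
# Prints each listed variable that is unset or empty in the environment,
# one per line, and returns non-zero if any are missing.
missing_env() {
    rc=0
    for name in "$@"; do
        if [ -z "$(printenv "$name")" ]; then
            echo "$name"
            rc=1
        fi
    done
    return "$rc"
}

# Example (hypothetical variable names):
# missing_env APP_PORT APP_DB_URL || echo "set the variables listed above before starting the app"
```

printenv only sees exported variables, which matches what a child process such as the application would actually receive.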
Database Troubleshooting
Connection Issues
# Test connectivity
telnet db-host 5432 # PostgreSQL
telnet db-host 3306 # MySQL
# MySQL connection test
mysql -h db-host -u username -p -e "SELECT 1"
# PostgreSQL connection test
psql -h db-host -U username -d dbname -c "SELECT 1"
# Check connection limits
# MySQL
SHOW VARIABLES LIKE 'max_connections';
SHOW STATUS LIKE 'Threads_connected';
# PostgreSQL
SHOW max_connections;
SELECT count(*) FROM pg_stat_activity;
Slow Queries
# Enable slow query log
# MySQL
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;
# Find slow queries
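# MySQL (the mysql.slow_log table is populated only when log_output includes TABLE)
SELECT * FROM mysql.slow_log ORDER BY query_time DESC LIMIT 10;
# PostgreSQL (requires the pg_stat_statements extension; on PostgreSQL 13+ the columns are total_exec_time and mean_exec_time)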
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
# Tables with no indexes at all
# MySQL
SELECT tables.table_schema, tables.table_name, tables.engine
FROM information_schema.tables
LEFT JOIN information_schema.statistics
ON tables.table_schema = statistics.table_schema
AND tables.table_name = statistics.table_name
WHERE statistics.index_name IS NULL
AND tables.table_type = 'BASE TABLE'
AND tables.table_schema NOT IN ('information_schema', 'mysql', 'performance_schema', 'sys');
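# Query optimization (EXPLAIN ANALYZE executes the query; run data-modifying statements inside a transaction and roll back)
EXPLAIN ANALYZE [YOUR QUERY];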
Replication Lag
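# MySQL (on 8.0.22+ the preferred form is SHOW REPLICA STATUS\G)
SHOW SLAVE STATUS\G
SELECT * FROM mysql.slave_relay_log_info;
# PostgreSQL
# On the primary: connected standbys and their replay positions
SELECT * FROM pg_stat_replication;
# On a standby: bytes received but not yet replayed
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replication_lag_bytes;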
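# Fix replication
# MySQL (RESET SLAVE discards the relay logs and the replication position; record the current coordinates first)
STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO ...;
START SLAVE;
# PostgreSQL (pause and resume WAL replay on a standby; this controls replay but does not repair a broken standby)
SELECT pg_wal_replay_pause();
SELECT pg_wal_replay_resume();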
Kubernetes Troubleshooting
Pod Issues
# Pod not starting
kubectl describe pod [pod-name]
kubectl logs [pod-name] --previous
kubectl get events --sort-by='.lastTimestamp'
# Common issues and fixes
# ImagePullBackOff
kubectl get pod [pod-name] -o yaml | grep -A5 image
# Check image name, registry credentials
# CrashLoopBackOff
kubectl logs [pod-name] --previous
kubectl exec -it [pod-name] -- /bin/sh
# Check application logs, configuration
# Pending
kubectl describe pod [pod-name] | grep -A5 Events
kubectl get nodes
kubectl describe node [node-name]
# Check resource availability, node selectors
# Resource issues
kubectl top nodes
kubectl top pods
kubectl describe resourcequota
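When a namespace holds many pods, it is quicker to list only the ones in a problem state before running describe on each. A minimal sketch, assuming the default column layout of kubectl get pods (NAME READY STATUS RESTARTS AGE); the function name unhealthy_pods is made up for this example.

```shell
# unhealthy_pods
# Reads `kubectl get pods --no-headers` rows on stdin and prints
# "NAME STATUS" for every pod whose status is not Running or Completed.
unhealthy_pods() {
    awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Example:
# kubectl get pods --no-headers | unhealthy_pods
```

This only looks at the STATUS column, so a pod that is Running but not Ready (for example 0/1) is not reported.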
Service Discovery
# Service not reachable
kubectl get svc
kubectl get endpoints [service-name]
kubectl describe svc [service-name]
# DNS resolution
kubectl exec -it [pod-name] -- nslookup [service-name]
kubectl exec -it [pod-name] -- cat /etc/resolv.conf
# Test connectivity
kubectl exec -it [pod-name] -- curl [service-name]:[port]
kubectl exec -it [pod-name] -- nc -zv [service-name] [port]
# Check network policies
kubectl get networkpolicies
kubectl describe networkpolicy [policy-name]
Storage Issues
# PVC not binding
kubectl get pvc
kubectl describe pvc [pvc-name]
kubectl get pv
# Storage class issues
kubectl get storageclass
kubectl describe storageclass [class-name]
# Volume mount problems
kubectl describe pod [pod-name] | grep -A10 Volumes
kubectl exec [pod-name] -- df -h
kubectl exec [pod-name] -- ls -la /mount/path
Network Troubleshooting
DNS Issues
# Test DNS resolution
nslookup domain.com
dig domain.com
host domain.com
# Check DNS configuration
cat /etc/resolv.conf
resolvectl status # older systemd releases: systemd-resolve --status
# Flush DNS cache
# Linux (systemd-resolved)
resolvectl flush-caches
# or
systemctl restart systemd-resolved
# macOS
sudo killall -HUP mDNSResponder
# Test specific DNS server
dig @8.8.8.8 domain.com
nslookup domain.com 8.8.8.8
# Check DNS propagation
dig +trace domain.com
SSL/TLS Issues
# Test SSL certificate
openssl s_client -connect domain.com:443 -servername domain.com
# Check certificate details
openssl s_client -connect domain.com:443 < /dev/null | openssl x509 -text
# Verify certificate chain
openssl s_client -showcerts -connect domain.com:443
# Test specific TLS version
openssl s_client -connect domain.com:443 -tls1_2
# Check cipher suites
nmap --script ssl-enum-ciphers -p 443 domain.com
# Certificate expiration
echo | openssl s_client -connect domain.com:443 2>/dev/null | openssl x509 -noout -dates
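The expiration date printed above can be turned into a number of days remaining, which is easier to alert on. This sketch assumes GNU date (the -d option; BSD and macOS date use different flags). The function name days_until and its optional second argument, a fixed "now" in epoch seconds, are assumptions added for illustration.

```shell
# days_until EXPIRY [NOW_EPOCH]
# EXPIRY is the date part of openssl's "notAfter=" line,
# for example "Jan 11 00:00:00 2030 GMT".
# Prints the whole number of days between now and EXPIRY (negative if expired).
days_until() {
    expiry_epoch=$(date -u -d "$1" +%s) || return 1
    now_epoch=${2:-$(date -u +%s)}
    echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Example (domain.com is a placeholder):
# end=$(echo | openssl s_client -connect domain.com:443 2>/dev/null \
#       | openssl x509 -noout -enddate | cut -d= -f2)
# days_until "$end"
```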
Load Balancer Issues
# Check backend health
curl -I http://backend-server:port/health
# Test load balancer directly
curl -H "Host: domain.com" http://lb-ip/
# Check load balancer logs
# AWS ALB
aws elbv2 describe-target-health --target-group-arn [arn]
# NGINX
tail -f /var/log/nginx/error.log
nginx -t # Test configuration
# HAProxy
haproxy -c -f /etc/haproxy/haproxy.cfg
echo "show stat" | socat stdio /var/run/haproxy.sock
Performance Troubleshooting
Disk I/O Issues
# Check disk usage
df -h
du -sh /* | sort -h
# I/O statistics
iostat -x 1 5
iotop -o
# Find large files
find / -type f -size +1G -exec ls -lh {} \;
# Check disk health
smartctl -a /dev/sda
badblocks -v /dev/sda
# File system issues
fsck -n /dev/sda1 # Dry run
dmesg | grep -i error
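The df output can be filtered so that only filesystems close to full are shown. A minimal sketch, assuming the POSIX output format of df -P and mount points without spaces; the function name full_filesystems and the 90 percent threshold are illustrative.

```shell
# full_filesystems THRESHOLD
# Reads `df -P` output on stdin and prints "MOUNTPOINT USE%" for every
# filesystem whose usage is at or above THRESHOLD percent.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Example:
# df -P | full_filesystems 90
```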
Network Performance
# Bandwidth testing
iperf3 -s # Server
iperf3 -c server-ip # Client
# Packet loss
ping -c 100 destination | grep loss
mtr destination
# Network utilization
iftop
nethogs
vnstat
# TCP tuning check
sysctl net.ipv4.tcp_congestion_control
sysctl net.core.rmem_max
sysctl net.core.wmem_max
# Connection tracking
conntrack -L
ss -s
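For scripted checks, the loss figure can be pulled out of the ping summary line instead of being read by eye. The function name packet_loss_pct is hypothetical; the sketch assumes the summary contains the text "% packet loss", as the Linux iputils and macOS ping implementations print it.

```shell
# packet_loss_pct
# Reads ping output on stdin and prints the loss percentage
# found in the summary line (digits only, without the % sign).
packet_loss_pct() {
    sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# Example (destination is a placeholder):
# ping -c 100 destination | packet_loss_pct
```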
Security Incident Response
Suspected Breach
# 1. Isolate the system
# Update security groups/firewall rules
# 2. Capture evidence
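# Memory dump (most modern kernels restrict /dev/mem, so this often fails or yields a partial image; a dedicated acquisition tool such as LiME or AVML is usually needed)
sudo dd if=/dev/mem of=/evidence/memory.dump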
# Network connections
netstat -tulpan > /evidence/netstat.txt
ss -tulpan > /evidence/ss.txt
lsof -i > /evidence/lsof.txt
# Process list
ps auxfww > /evidence/processes.txt
pstree -p > /evidence/pstree.txt
# 3. Check for unauthorized access
last -f /var/log/wtmp
lastb -f /var/log/btmp
grep "Accepted" /var/log/auth.log # Debian/Ubuntu; on RHEL-family systems the file is /var/log/secure
# 4. Look for suspicious processes
ps aux | grep -v "\\[.*\\]" | awk '{print $11}' | sort | uniq -c | sort -n
lsof -n | grep ESTABLISHED
# 5. Check for rootkits
rkhunter --check
chkrootkit
# 6. Review logs
journalctl --since "2 hours ago"
find /var/log -type f -mtime -1 -exec ls -la {} \;
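Evidence files captured in step 2 are more useful later if their integrity can be shown. One possible approach, sketched here with the coreutils sha256sum command, is to record a checksum manifest next to the files. The function name hash_evidence and the manifest name SHA256SUMS are assumptions, not part of any standard procedure.

```shell
# hash_evidence DIR
# Writes SHA-256 checksums for every file under DIR to DIR/SHA256SUMS.
# The manifest itself is excluded from the listing.
hash_evidence() {
    ( cd "$1" && find . -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS )
}

# Example:
# hash_evidence /evidence
# Verify later with:
# (cd /evidence && sha256sum -c SHA256SUMS)
```

Copying the manifest to a separate system keeps it useful even if the original host is later modified.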
Cloud Provider Specific
AWS Troubleshooting
# EC2 instance issues
aws ec2 describe-instance-status --instance-ids i-xxxxx
aws ec2 get-console-output --instance-id i-xxxxx
# Check CloudWatch logs
aws logs tail /aws/lambda/function-name --follow
# RDS issues
aws rds describe-db-instances --db-instance-identifier mydb
aws rds describe-events --source-identifier mydb --source-type db-instance
# S3 access issues
aws s3api get-bucket-policy --bucket my-bucket
aws s3api get-bucket-acl --bucket my-bucket
Azure Troubleshooting
# VM diagnostics
az vm boot-diagnostics get-boot-log --name MyVM --resource-group MyRG
# Check VM status
az vm get-instance-view --name MyVM --resource-group MyRG
# Storage issues
az storage account show --name mystorageaccount
az storage account keys list --account-name mystorageaccount
GCP Troubleshooting
# Instance serial console
gcloud compute instances get-serial-port-output instance-name
# Check instance status
gcloud compute instances describe instance-name
# View logs
gcloud logging read "resource.type=gce_instance"
Emergency Procedures
System Unresponsive
- Try to access via out-of-band management (IPMI/iLO)
- Check cloud provider console for instance status
- Attempt forced reboot via provider console
- If critical, initiate failover to standby system
- Create snapshot before any destructive actions
Data Recovery
# Attempt file recovery
testdisk /dev/sda
photorec /dev/sda
# MySQL binary log recovery
mysqlbinlog mysql-bin.000001 | mysql -u root -p
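# PostgreSQL: reset a corrupted WAL as a last resort. This does not replay WAL; it discards it and can lose committed data. Copy the data directory first.
pg_resetwal -f /var/lib/postgresql/data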
# Emergency backup
dd if=/dev/sda of=/backup/emergency.img bs=64K conv=noerror,sync
Troubleshooting Tools
Essential Tools
- System: htop, iotop, iftop, sysstat, dstat
- Network: tcpdump, wireshark, nmap, netcat, mtr
- Performance: perf, strace, ltrace, systemtap
- Debugging: gdb, valgrind, dtrace
- Logs: multitail, lnav, stern (Kubernetes)
- Cloud: aws-cli, azure-cli, gcloud
Note: This documentation is provided for reference purposes only. It reflects general best practices and industry-aligned guidelines, and any examples, claims, or recommendations are intended as illustrative—not definitive or binding.