
Container Orchestration

Master container orchestration platforms and patterns for deploying, scaling, and managing containerized applications in production.

Container Orchestration Overview

Container orchestration automates the deployment, management, scaling, and networking of containers, enabling efficient operation of containerized applications at scale.

Key Orchestration Features

  • Service Discovery: Automatic container location and communication
  • Load Balancing: Distribute traffic across containers
  • Scaling: Automatic scaling based on demand
  • Self-Healing: Restart failed containers automatically
  • Rolling Updates: Zero-downtime deployments (see the Deployment sketch after this list)
  • Secret Management: Secure handling of sensitive data
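
The sketch below ties several of these features together in Kubernetes terms: the replica count gives self-healing (failed pods are recreated automatically) and the rolling-update strategy gives zero-downtime deployments. This is a minimal illustration; the myapp name, image, and port are placeholders.

# Deployment illustrating self-healing and rolling updates (placeholder names)
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3                  # the controller keeps 3 pods running (self-healing)
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0        # never drop below full capacity during an update
      maxSurge: 1
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: myapp:v1
        ports:
        - containerPort: 8080
EOF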

Orchestration Platforms Comparison

Platform        | Best For               | Complexity | Key Features
----------------|------------------------|------------|------------------------------------------
Kubernetes      | Large-scale production | High       | Extensible, rich ecosystem, multi-cloud
Docker Swarm    | Simple deployments     | Low        | Native Docker integration
Amazon ECS      | AWS workloads          | Medium     | Deep AWS integration, Fargate serverless
HashiCorp Nomad | Mixed workload types   | Medium     | Supports non-container workloads

Docker Swarm

Swarm Initialization

# Initialize swarm manager
docker swarm init --advertise-addr 10.0.1.10

# Join worker nodes
docker swarm join --token SWMTKN-1-xxxxx 10.0.1.10:2377

# Print the join command (and token) for additional managers
docker swarm join-token manager

# Deploy stack
docker stack deploy -c docker-compose.yml myapp

# Scale service
docker service scale myapp_web=5

# Update service
docker service update --image myapp:v2 myapp_web

Docker Compose for Swarm

# docker-compose.yml
version: '3.8'

services:
  web:
    image: myapp:latest
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
      placement:
        constraints:
          - node.role == worker
          - node.labels.zone == us-east
    ports:
      - "80:8080"
    networks:
      - webnet
    secrets:
      - db_password
    configs:
      - source: app_config
        target: /app/config.json

  db:
    image: postgres:14
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.type == database
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    volumes:
      - db-data:/var/lib/postgresql/data
    networks:
      - webnet
    secrets:
      - db_password

networks:
  webnet:
    driver: overlay
    driver_opts:
      encrypted: "true"

volumes:
  db-data:
    driver: local

secrets:
  db_password:
    external: true

configs:
  app_config:
    file: ./config.json
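
Before deploying this stack, the external secret must exist and the nodes must carry the labels used in the placement constraints. One possible flow (node names and the secret value are placeholders):

# Label nodes to satisfy the placement constraints above
docker node update --label-add zone=us-east worker-1
docker node update --label-add type=database worker-2

# Create the external secret referenced by the stack (value read from stdin)
printf 'changeme' | docker secret create db_password -

# Deploy and inspect
docker stack deploy -c docker-compose.yml myapp
docker stack services myapp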

Amazon ECS

ECS Task Definition

{
  "family": "myapp",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/myapp:latest",
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "essential": true,
      "environment": [
        {
          "name": "NODE_ENV",
          "value": "production"
        }
      ],
      "secrets": [
        {
          "name": "DB_PASSWORD",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789:secret:db-password"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/myapp",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      }
    }
  ],
  "taskRoleArn": "arn:aws:iam::123456789:role/ecsTaskRole",
  "executionRoleArn": "arn:aws:iam::123456789:role/ecsTaskExecutionRole"
}
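
To put this task definition to work, register it with ECS and confirm the active revision. A short sketch, assuming the JSON above is saved as taskdef.json:

# Register the task definition and verify the registered revision
aws ecs register-task-definition --cli-input-json file://taskdef.json
aws ecs describe-task-definition --task-definition myapp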

ECS Service with Auto Scaling

# Create ECS service
aws ecs create-service \
  --cluster production \
  --service-name myapp \
  --task-definition myapp:1 \
  --desired-count 3 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-xxxxx],securityGroups=[sg-xxxxx],assignPublicIp=ENABLED}" \
  --load-balancers targetGroupArn=arn:aws:elasticloadbalancing:...,containerName=app,containerPort=8080

# Configure auto scaling
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/production/myapp \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 3 \
  --max-capacity 20

# Add scaling policy
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/production/myapp \
  --policy-name cpu-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    }
  }'

HashiCorp Nomad

Nomad Job Specification

# myapp.nomad
job "myapp" {
  datacenters = ["dc1"]
  type = "service"

  group "web" {
    count = 3

    update {
      max_parallel     = 1
      min_healthy_time = "30s"
      healthy_deadline = "5m"
      auto_revert      = true
      canary           = 1
    }

    network {
      mode = "bridge"  # bridge networking is required for the Consul Connect sidecar below
      port "http" {
        to = 8080
      }
    }

    service {
      name = "myapp-web"
      port = "http"
      
      check {
        type     = "http"
        path     = "/health"
        interval = "10s"
        timeout  = "2s"
      }

      connect {
        sidecar_service {}
      }
    }

    task "app" {
      driver = "docker"

      config {
        image = "myapp:latest"
        ports = ["http"]
        
        auth {
          username = "${DOCKER_USER}"
          password = "${DOCKER_PASS}"
        }
      }

      env {
        NODE_ENV = "production"
        PORT     = "${NOMAD_PORT_http}"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      template {
        data = <<EOH
DB_HOST={{ with service "postgres" }}{{ with index . 0 }}{{ .Address }}{{ end }}{{ end }}
DB_PORT={{ with service "postgres" }}{{ with index . 0 }}{{ .Port }}{{ end }}{{ end }}
DB_PASSWORD={{ with secret "database/creds/myapp" }}{{ .Data.password }}{{ end }}
EOH
        destination = "local/env"
        env         = true
      }
    }
  }
}
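
A typical workflow for this job file is to validate and dry-run it before submitting. The file name follows the comment at the top of the spec:

# Validate syntax, preview scheduling decisions, then submit
nomad job validate myapp.nomad
nomad job plan myapp.nomad
nomad job run myapp.nomad

# Watch allocations and deployment health
nomad job status myapp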

Service Mesh Integration

Istio Service Mesh

# Install Istio
istioctl install --set profile=default

# Enable sidecar injection
kubectl label namespace production istio-injection=enabled

# Virtual Service configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp
  http:
  - match:
    - headers:
        x-version:
          exact: v2
    route:
    - destination:
        host: myapp
        subset: v2
  - route:
    - destination:
        host: myapp
        subset: v1
      weight: 90
    - destination:
        host: myapp
        subset: v2
      weight: 10

---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 10
        http2MaxRequests: 100
    loadBalancer:
      simple: LEAST_REQUEST
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
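
Once both resources are applied, the header match can be exercised from any pod inside the mesh. A hedged sketch, assuming the manifests above are saved as myapp-routing.yaml and a client deployment named sleep runs in the mesh (both names are assumptions):

# Apply the routing rules, then send a request pinned to the v2 subset
kubectl apply -f myapp-routing.yaml
kubectl exec deploy/sleep -- curl -s -H "x-version: v2" http://myapp/

Requests without the header split roughly 90/10 between v1 and v2 per the weighted route.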

Multi-Cluster Orchestration

Federation Setup

# Kubernetes Federation v2 (KubeFed)
# Install KubeFed
helm repo add kubefed-charts https://raw.githubusercontent.com/kubernetes-sigs/kubefed/master/charts
helm install kubefed kubefed-charts/kubefed --namespace kube-federation-system --create-namespace

# Join clusters
kubefedctl join cluster1 --cluster-context cluster1 --host-cluster-context cluster1
kubefedctl join cluster2 --cluster-context cluster2 --host-cluster-context cluster1

# Federated deployment
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: myapp
  namespace: production
spec:
  template:
    metadata:
      labels:
        app: myapp
    spec:
      replicas: 6
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp
        spec:
          containers:
          - name: app
            image: myapp:latest
  placement:
    clusters:
    - name: cluster1
    - name: cluster2
  overrides:
  - clusterName: cluster1
    clusterOverrides:
    - path: "/spec/replicas"
      value: 4
  - clusterName: cluster2
    clusterOverrides:
    - path: "/spec/replicas"
      value: 2
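
After the FederatedDeployment propagates, each member cluster should converge on its overridden replica count. A quick check, assuming kubectl contexts named as in the join commands above:

# Verify the per-cluster replica overrides took effect
kubectl --context cluster1 -n production get deployment myapp   # expect 4 replicas
kubectl --context cluster2 -n production get deployment myapp   # expect 2 replicas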

Container Security

Runtime Security

# Falco runtime security rules
# (allowed_processes and trusted_repos are Falco lists defined elsewhere in the ruleset)
- rule: Unauthorized Process
  desc: Detect unauthorized process execution
  condition: >
    spawned_process and container and
    not proc.name in (allowed_processes) and
    not container.image.repository in (trusted_repos)
  output: >
    Unauthorized process started
    (user=%user.name command=%proc.cmdline container=%container.name image=%container.image.repository)
  priority: WARNING

# Pod Security Policy (removed in Kubernetes v1.25; applies to v1.24 and earlier)
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'downwardAPI'
    - 'persistentVolumeClaim'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
  readOnlyRootFilesystem: true
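
On Kubernetes v1.25 and later, PodSecurityPolicy is gone; the built-in Pod Security Admission controller enforces a comparable baseline through namespace labels:

# Enforce the "restricted" Pod Security Standard on a namespace (v1.23+)
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted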

Monitoring and Observability

Container Metrics

# Prometheus configuration
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__

# Grafana dashboard query examples
# Container CPU usage
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Container memory usage
sum(container_memory_usage_bytes) by (pod)

# Container restart count
sum(increase(kube_pod_container_status_restarts_total[1h])) by (pod)
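
The relabel rules above only keep pods that opt in via annotations. A minimal sketch of an opted-in pod (name, image, and port are placeholders):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  annotations:
    prometheus.io/scrape: "true"    # matched by the "keep" relabel rule
    prometheus.io/path: "/metrics"  # overrides __metrics_path__
    prometheus.io/port: "8080"      # rewrites __address__ to this port
spec:
  containers:
  - name: app
    image: myapp:latest
    ports:
    - containerPort: 8080
EOF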

Disaster Recovery

Backup and Restore

# Velero backup configuration
velero backup create prod-backup \
  --include-namespaces production \
  --include-resources deployments,services,configmaps,secrets \
  --ttl 720h

# Disaster recovery runbook
1. Verify backup integrity
   velero backup describe prod-backup

2. Prepare recovery cluster
   kubectl create namespace production

3. Restore application
   velero restore create --from-backup prod-backup

4. Verify restoration
   kubectl get all -n production

5. Update DNS/load balancer
   kubectl patch service myapp -n production -p '{"spec":{"type":"LoadBalancer"}}'

6. Validate application
   curl https://recovery.example.com/health
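
One-off backups cover migrations; for ongoing protection, a schedule keeps fresh restore points. A sketch using Velero's cron-style schedules (time and retention are illustrative):

# Nightly backup of the production namespace, retained for 30 days
velero schedule create prod-daily \
  --schedule "0 2 * * *" \
  --include-namespaces production \
  --ttl 720h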

Performance Optimization

Resource Optimization

  • Right-sizing: Use VPA to optimize resource requests
  • Node affinity: Place containers on appropriate nodes
  • Pod disruption budgets: Maintain availability during updates (see the sketch after this list)
  • Horizontal scaling: Scale based on actual metrics
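
A PodDisruptionBudget puts a floor under voluntary disruptions such as node drains during upgrades. A minimal sketch, reusing the app label from the earlier examples:

kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2        # voluntary evictions never drop ready pods below 2
  selector:
    matchLabels:
      app: myapp
EOF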

Network Optimization

# Optimize container networking
# Use host networking for high-performance apps
apiVersion: v1
kind: Pod
metadata:
  name: high-perf-app
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: app
    image: myapp:latest

# Configure service mesh for efficient routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp
  http:
  - timeout: 30s
    retries:
      attempts: 3
      perTryTimeout: 10s
      retryOn: 5xx
    route:
    - destination:
        host: myapp

Cost Management

Cost Optimization Strategies

  • Use spot instances for non-critical workloads (see the sketch after this list)
  • Implement cluster autoscaling
  • Schedule batch jobs during off-peak
  • Use multi-tenant clusters
  • Optimize container images
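
One way to apply the spot-instance strategy is to taint spot nodes and let only tolerant workloads land there. A hedged sketch, assuming spot nodes carry a node-type=spot label and a spot=true:NoSchedule taint (both are cluster-specific assumptions):

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        node-type: spot            # assumed label on spot nodes
      tolerations:
      - key: "spot"                # assumed taint on spot nodes
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      containers:
      - name: worker
        image: batch-worker:latest
EOF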

Best Practices

  • Image Management: Use minimal base images, scan for vulnerabilities
  • Configuration: Externalize config, use secrets management
  • Monitoring: Implement comprehensive observability
  • Security: Apply least privilege, use network policies
  • Updates: Use rolling updates, test in staging
  • Documentation: Maintain runbooks and architecture diagrams