
Container Orchestration

Master container orchestration platforms and patterns for deploying, scaling, and managing containerized applications in production.

Container Orchestration Overview

Container orchestration automates the deployment, management, scaling, and networking of containers, enabling efficient operation of containerized applications at scale.

Key Orchestration Features

  • Service Discovery: Automatic container location and communication
  • Load Balancing: Distribute traffic across containers
  • Scaling: Automatic scaling based on demand
  • Self-Healing: Restart failed containers automatically
  • Rolling Updates: Zero-downtime deployments (see the Deployment sketch after this list)
  • Secret Management: Secure handling of sensitive data
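
The sketch below ties several of these features together in Kubernetes terms: the replica count gives self-healing (failed pods are recreated automatically) and the rolling-update strategy gives zero-downtime deployments. This is a minimal illustration; the myapp name, image, and port are placeholders.

# Deployment illustrating self-healing and rolling updates (placeholder names)
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3                  # the controller keeps 3 pods running (self-healing)
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0        # never drop below full capacity during an update
      maxSurge: 1
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: myapp:v1
        ports:
        - containerPort: 8080
EOF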

Orchestration Platforms Comparison

Platform        | Best For               | Complexity | Key Features
----------------|------------------------|------------|------------------------------------------
Kubernetes      | Large-scale production | High       | Extensible, rich ecosystem, multi-cloud
Docker Swarm    | Simple deployments     | Low        | Native Docker integration
Amazon ECS      | AWS workloads          | Medium     | Deep AWS integration, Fargate serverless
HashiCorp Nomad | Mixed workload types   | Medium     | Supports non-container workloads

Docker Swarm

Swarm Initialization

# Initialize swarm manager
docker swarm init --advertise-addr 10.0.1.10

# Join worker nodes
docker swarm join --token SWMTKN-1-xxxxx 10.0.1.10:2377

# Print the join command (and token) for additional managers
docker swarm join-token manager

# Deploy stack
docker stack deploy -c docker-compose.yml myapp

# Scale service
docker service scale myapp_web=5

# Update service
docker service update --image myapp:v2 myapp_web

Docker Compose for Swarm

# docker-compose.yml
version: '3.8'

services:
  web:
    image: myapp:latest
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
      placement:
        constraints:
          - node.role == worker
          - node.labels.zone == us-east
    ports:
      - "80:8080"
    networks:
      - webnet
    secrets:
      - db_password
    configs:
      - source: app_config
        target: /app/config.json

  db:
    image: postgres:14
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.type == database
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    volumes:
      - db-data:/var/lib/postgresql/data
    networks:
      - webnet
    secrets:
      - db_password

networks:
  webnet:
    driver: overlay
    driver_opts:
      encrypted: "true"

volumes:
  db-data:
    driver: local

secrets:
  db_password:
    external: true

configs:
  app_config:
    file: ./config.json
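
Before deploying this stack, the external secret must exist and the nodes must carry the labels used in the placement constraints. One possible flow (node names and the secret value are placeholders):

# Label nodes to satisfy the placement constraints above
docker node update --label-add zone=us-east worker-1
docker node update --label-add type=database worker-2

# Create the external secret referenced by the stack (value read from stdin)
printf 'changeme' | docker secret create db_password -

# Deploy and inspect
docker stack deploy -c docker-compose.yml myapp
docker stack services myapp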

Amazon ECS

ECS Task Definition

{
  "family": "myapp",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/myapp:latest",
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "essential": true,
      "environment": [
        {
          "name": "NODE_ENV",
          "value": "production"
        }
      ],
      "secrets": [
        {
          "name": "DB_PASSWORD",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789:secret:db-password"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/myapp",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      }
    }
  ],
  "taskRoleArn": "arn:aws:iam::123456789:role/ecsTaskRole",
  "executionRoleArn": "arn:aws:iam::123456789:role/ecsTaskExecutionRole"
}
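
To put this task definition to work, register it with ECS and confirm the active revision. A short sketch, assuming the JSON above is saved as taskdef.json:

# Register the task definition and verify the registered revision
aws ecs register-task-definition --cli-input-json file://taskdef.json
aws ecs describe-task-definition --task-definition myapp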

ECS Service with Auto Scaling

# Create ECS service
aws ecs create-service \
  --cluster production \
  --service-name myapp \
  --task-definition myapp:1 \
  --desired-count 3 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-xxxxx],securityGroups=[sg-xxxxx],assignPublicIp=ENABLED}" \
  --load-balancers targetGroupArn=arn:aws:elasticloadbalancing:...,containerName=app,containerPort=8080

# Configure auto scaling
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/production/myapp \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 3 \
  --max-capacity 20

# Add scaling policy
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/production/myapp \
  --policy-name cpu-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    }
  }'

HashiCorp Nomad

Nomad Job Specification

# myapp.nomad
job "myapp" {
  datacenters = ["dc1"]
  type = "service"

  group "web" {
    count = 3

    update {
      max_parallel     = 1
      min_healthy_time = "30s"
      healthy_deadline = "5m"
      auto_revert      = true
      canary           = 1
    }

    network {
      mode = "bridge"  # bridge networking is required for the Consul Connect sidecar below
      port "http" {
        to = 8080
      }
    }

    service {
      name = "myapp-web"
      port = "http"
      
      check {
        type     = "http"
        path     = "/health"
        interval = "10s"
        timeout  = "2s"
      }

      connect {
        sidecar_service {}
      }
    }

    task "app" {
      driver = "docker"

      config {
        image = "myapp:latest"
        ports = ["http"]
        
        auth {
          username = "${DOCKER_USER}"
          password = "${DOCKER_PASS}"
        }
      }

      env {
        NODE_ENV = "production"
        PORT     = "${NOMAD_PORT_http}"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      template {
        data = <<EOH
DB_HOST={{ with service "postgres" }}{{ with index . 0 }}{{ .Address }}{{ end }}{{ end }}
DB_PORT={{ with service "postgres" }}{{ with index . 0 }}{{ .Port }}{{ end }}{{ end }}
DB_PASSWORD={{ with secret "database/creds/myapp" }}{{ .Data.password }}{{ end }}
EOH
        destination = "local/env"
        env         = true
      }
    }
  }
}
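
A typical workflow for this job file is to validate and dry-run it before submitting. The file name follows the comment at the top of the spec:

# Validate syntax, preview scheduling decisions, then submit
nomad job validate myapp.nomad
nomad job plan myapp.nomad
nomad job run myapp.nomad

# Watch allocations and deployment health
nomad job status myapp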

Service Mesh Integration

Istio Service Mesh

# Install Istio
istioctl install --set profile=default

# Enable sidecar injection
kubectl label namespace production istio-injection=enabled

# Virtual Service configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp
  http:
  - match:
    - headers:
        x-version:
          exact: v2
    route:
    - destination:
        host: myapp
        subset: v2
  - route:
    - destination:
        host: myapp
        subset: v1
      weight: 90
    - destination:
        host: myapp
        subset: v2
      weight: 10

---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 10
        http2MaxRequests: 100
    loadBalancer:
      simple: LEAST_REQUEST
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
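
Once both resources are applied, the header match can be exercised from any pod inside the mesh. A hedged sketch, assuming the manifests above are saved as myapp-routing.yaml and a client deployment named sleep runs in the mesh (both names are assumptions):

# Apply the routing rules, then send a request pinned to the v2 subset
kubectl apply -f myapp-routing.yaml
kubectl exec deploy/sleep -- curl -s -H "x-version: v2" http://myapp/

Requests without the header split roughly 90/10 between v1 and v2 per the weighted route.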

Multi-Cluster Orchestration

Federation Setup

# Kubernetes Federation v2 (KubeFed)
# Install KubeFed
helm repo add kubefed-charts https://raw.githubusercontent.com/kubernetes-sigs/kubefed/master/charts
helm install kubefed kubefed-charts/kubefed --namespace kube-federation-system --create-namespace

# Join clusters
kubefedctl join cluster1 --cluster-context cluster1 --host-cluster-context cluster1
kubefedctl join cluster2 --cluster-context cluster2 --host-cluster-context cluster1

# Federated deployment
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: myapp
  namespace: production
spec:
  template:
    metadata:
      labels:
        app: myapp
    spec:
      replicas: 6
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp
        spec:
          containers:
          - name: app
            image: myapp:latest
  placement:
    clusters:
    - name: cluster1
    - name: cluster2
  overrides:
  - clusterName: cluster1
    clusterOverrides:
    - path: "/spec/replicas"
      value: 4
  - clusterName: cluster2
    clusterOverrides:
    - path: "/spec/replicas"
      value: 2
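
After the FederatedDeployment propagates, each member cluster should converge on its overridden replica count. A quick check, assuming kubectl contexts named as in the join commands above:

# Verify the per-cluster replica overrides took effect
kubectl --context cluster1 -n production get deployment myapp   # expect 4 replicas
kubectl --context cluster2 -n production get deployment myapp   # expect 2 replicas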

Container Security

Runtime Security

# Falco runtime security rules
# (allowed_processes and trusted_repos are Falco lists defined elsewhere in the ruleset)
- rule: Unauthorized Process
  desc: Detect unauthorized process execution
  condition: >
    spawned_process and container and
    not proc.name in (allowed_processes) and
    not container.image.repository in (trusted_repos)
  output: >
    Unauthorized process started
    (user=%user.name command=%proc.cmdline container=%container.name image=%container.image.repository)
  priority: WARNING

# Pod Security Policy (removed in Kubernetes v1.25; applies to v1.24 and earlier)
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'downwardAPI'
    - 'persistentVolumeClaim'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
  readOnlyRootFilesystem: true
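
On Kubernetes v1.25 and later, PodSecurityPolicy is gone; the built-in Pod Security Admission controller enforces a comparable baseline through namespace labels:

# Enforce the "restricted" Pod Security Standard on a namespace (v1.23+)
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted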

Monitoring and Observability

Container Metrics

# Prometheus configuration
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__

# Grafana dashboard query examples
# Container CPU usage
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Container memory usage
sum(container_memory_usage_bytes) by (pod)

# Container restart count
sum(increase(kube_pod_container_status_restarts_total[1h])) by (pod)
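
The relabel rules above only keep pods that opt in via annotations. A minimal sketch of an opted-in pod (name, image, and port are placeholders):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  annotations:
    prometheus.io/scrape: "true"    # matched by the "keep" relabel rule
    prometheus.io/path: "/metrics"  # overrides __metrics_path__
    prometheus.io/port: "8080"      # rewrites __address__ to this port
spec:
  containers:
  - name: app
    image: myapp:latest
    ports:
    - containerPort: 8080
EOF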

Disaster Recovery

Backup and Restore

# Velero backup configuration
velero backup create prod-backup \
  --include-namespaces production \
  --include-resources deployments,services,configmaps,secrets \
  --ttl 720h

# Disaster recovery runbook
1. Verify backup integrity
   velero backup describe prod-backup

2. Prepare recovery cluster
   kubectl create namespace production

3. Restore application
   velero restore create --from-backup prod-backup

4. Verify restoration
   kubectl get all -n production

5. Update DNS/load balancer
   kubectl patch service myapp -n production -p '{"spec":{"type":"LoadBalancer"}}'

6. Validate application
   curl https://recovery.example.com/health
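
One-off backups cover migrations; for ongoing protection, a schedule keeps fresh restore points. A sketch using Velero's cron-style schedules (time and retention are illustrative):

# Nightly backup of the production namespace, retained for 30 days
velero schedule create prod-daily \
  --schedule "0 2 * * *" \
  --include-namespaces production \
  --ttl 720h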

Performance Optimization

Resource Optimization

  • Right-sizing: Use VPA to optimize resource requests
  • Node affinity: Place containers on appropriate nodes
  • Pod disruption budgets: Maintain availability during updates (see the sketch after this list)
  • Horizontal scaling: Scale based on actual metrics
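
A PodDisruptionBudget puts a floor under voluntary disruptions such as node drains during upgrades. A minimal sketch, reusing the app label from the earlier examples:

kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2        # voluntary evictions never drop ready pods below 2
  selector:
    matchLabels:
      app: myapp
EOF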

Network Optimization

# Optimize container networking
# Use host networking for high-performance apps
apiVersion: v1
kind: Pod
metadata:
  name: high-perf-app
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: app
    image: myapp:latest

# Configure service mesh for efficient routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp
  http:
  - timeout: 30s
    retries:
      attempts: 3
      perTryTimeout: 10s
      retryOn: 5xx
    route:
    - destination:
        host: myapp

Cost Management

Cost Optimization Strategies

  • Use spot instances for non-critical workloads (see the sketch after this list)
  • Implement cluster autoscaling
  • Schedule batch jobs during off-peak
  • Use multi-tenant clusters
  • Optimize container images
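
One way to apply the spot-instance strategy is to taint spot nodes and let only tolerant workloads land there. A hedged sketch, assuming spot nodes carry a node-type=spot label and a spot=true:NoSchedule taint (both are cluster-specific assumptions):

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        node-type: spot            # assumed label on spot nodes
      tolerations:
      - key: "spot"                # assumed taint on spot nodes
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      containers:
      - name: worker
        image: batch-worker:latest
EOF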

Best Practices

  • Image Management: Use minimal base images, scan for vulnerabilities
  • Configuration: Externalize config, use secrets management
  • Monitoring: Implement comprehensive observability
  • Security: Apply least privilege, use network policies
  • Updates: Use rolling updates, test in staging
  • Documentation: Maintain runbooks and architecture diagrams