Failover mechanisms

1. Introduction

In distributed systems and microservices architectures, failover mechanisms are critical for ensuring high availability, fault tolerance, and minimal downtime. This tutorial explains two common failover strategies: Active-Active and Active-Passive.

2. What is Failover?

Failover refers to the process of automatically transferring workloads to a backup system when the primary system fails. The goal is to ensure continuous availability of services even in the case of component or system failures.

2.1 Active-Passive Failover Architecture

In an Active-Passive setup, there is one active node (primary instance) handling all the traffic and one or more passive nodes (standby instances) that remain idle or in a warm state.

When the active node fails, one of the passive nodes takes over.

Under normal operating conditions, every read, write, and user request is directed to the primary instance. A standby does not serve traffic, but it is kept ready to take over: a warm standby is already running and is typically kept in sync with the primary, while an idle (cold) standby must be started and brought up to date before it can serve requests.

The failover process is initiated when a failure is detected in the primary instance. This typically involves automated health checks or monitoring systems that continuously verify the availability and performance of the active node. Upon detecting a failure, one of the standby nodes, usually the most up-to-date or prioritized passive instance, is promoted to active status and starts handling all the traffic.

There is usually a brief downtime during this transition period, known as the recovery time. This happens because the failover mechanism takes some time to detect the failure, update system configurations or routing, and fully activate the standby instance. Depending on the infrastructure setup, this downtime can range from a few seconds to a few minutes, but it is generally minimized through automation and optimized failover strategies.

2.2 Active-Passive in Kubernetes:

In Kubernetes, Active-Passive failover is typically implemented using:

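Deployments with Replica Count 1: Only one replica of the pod runs at a time. There is no standby pod; recovery relies on Kubernetes starting a replacement.
Leader Election Mechanisms: Several replicas run, but only the elected leader does the work. The election can be built on Kubernetes Lease objects (for example with the client-go leaderelection package) or on an external coordination store such as etcd.
Readiness and Liveness Probes: A failed liveness probe makes the kubelet restart the container, and a failed readiness probe removes the pod from Service endpoints until it recovers. If the pod is lost, for example because its node fails, the Deployment's ReplicaSet creates a replacement on another node.
Failover Controllers: Kubernetes Operators, often managing a StatefulSet, can implement the failover logic that promotes a standby pod to active when needed.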
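
To make the leader election and promotion items above more concrete, here is a minimal sketch of the two objects such a setup often revolves around. The names my-app-leader, my-app-0, and my-app-active, and the role: active label, are hypothetical. In practice the Lease is created and renewed by the election library rather than written by hand, and something (an operator or the elected leader itself) has to set the label on the active pod.

```yaml
# Illustrative only: election libraries create and renew this object themselves.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: my-app-leader          # hypothetical lock name
  namespace: default
spec:
  holderIdentity: my-app-0     # pod currently acting as the active instance
  leaseDurationSeconds: 15     # standbys may take over if the lease is not renewed in this window
  renewTime: "2024-01-01T00:00:10.000000Z"
  leaseTransitions: 2          # how many times leadership has changed hands
---
# Service that sends traffic only to the pod labelled as active.
apiVersion: v1
kind: Service
metadata:
  name: my-app-active          # hypothetical Service name
spec:
  selector:
    app: my-app
    role: active               # hypothetical label, moved to the new leader on failover
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
```

With this pattern, promoting a standby amounts to moving the role: active label to another pod; the Service then routes to the new leader without clients changing the address they use.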

2.3 Sample YAML for Active-Passive:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-active-passive
spec:
  replicas: 1  # Only one active instance
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app-container
          image: my-app-image:latest
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10

This Deployment creates a single replica pod, representing the active instance. The livenessProbe monitors whether the application is healthy; when it fails repeatedly, the kubelet restarts the container. The readinessProbe checks whether the application is ready to accept traffic, so the pod receives traffic through a Service only when it is fully operational. If the pod is lost, for example because its node fails, Kubernetes creates a replacement on a healthy node. Note that this manifest has no standby pod: it approximates failover through restart and rescheduling, so the service is unavailable until the replacement passes its readiness probe.
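
How quickly a failure is detected depends on the probe settings. The fragment below is a sketch of a tighter liveness probe for the same /health endpoint; the values are illustrative rather than recommendations, and overly aggressive settings can cause unnecessary restarts.

```yaml
# Fragment of a container spec; values are illustrative, not recommendations.
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 5       # probe every 5 seconds
  timeoutSeconds: 2      # a probe with no response within 2 seconds counts as failed
  failureThreshold: 3    # restart the container after 3 consecutive failures
  # Detection time is roughly periodSeconds * failureThreshold = 15 seconds,
  # compared with about 20 * 3 = 60 seconds for the settings in the sample above
  # (failureThreshold defaults to 3 when omitted).
```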

3. Active-Active Failover Architecture

In an Active-Active setup, multiple active nodes are running simultaneously and share the traffic load. All nodes serve requests at the same time.

If one node fails, the remaining active nodes continue to serve requests without noticeable disruption.

3.1 Characteristics:

Multiple Active Instances: All instances handle traffic concurrently.
Load Balancer: Distributes incoming traffic across all nodes.
Failover Process: Automatic redistribution of traffic; no need to promote a passive node.
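Near-Zero Downtime: Healthy nodes keep serving while the failed node is taken out of rotation; only requests in flight on the failed node are affected.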

3.2 Active-Active in Kubernetes:

In Kubernetes, Active-Active failover is commonly achieved via:

Deployments with Multiple Replicas: Multiple pods are running and serving traffic simultaneously.
Service Abstraction: Kubernetes Service acts as a load balancer, routing requests to available pods.
Horizontal Pod Autoscaler (HPA): Automatically scales pods up or down based on demand.
Ingress Controllers and Service Meshes: Ingress controllers (like NGINX, Traefik) or service meshes (like Istio, Linkerd) distribute traffic across active instances.

This model avoids the promotion step entirely, so a single pod failure causes little or no downtime, and it scales horizontally by increasing the replica count.
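
The autoscaling item in the list above can be sketched as follows. This is a minimal example, assuming a Deployment named my-app-active-active (the name used in the sample in section 3.3), a metrics source such as metrics-server installed in the cluster, and a CPU request set on the container, which the sample does not include. The name my-app-hpa and the thresholds are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa                 # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-active-active     # Deployment from the sample in section 3.3
  minReplicas: 3                   # never drop below three active pods
  maxReplicas: 10                  # illustrative upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU exceeds 70% of requests
```

Keeping minReplicas at 3 or more preserves the active-active property: even at low load, several pods remain available to absorb the loss of one.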

3.3 Sample YAML for Active-Active

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-active-active
spec:
  replicas: 3  # Multiple active instances
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app-container
          image: my-app-image:latest
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  selector:
    app: my-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  type: ClusterIP

This Deployment runs three replicas of the application, all of them active. The Kubernetes Service load-balances traffic across all pods that are ready. The readiness probe keeps a pod out of the Service's endpoints until it can serve traffic, and the liveness probe restarts a container that has stopped responding. If any pod fails, Kubernetes restarts it automatically while the Service continues routing traffic to the remaining healthy instances, so failover needs no promotion step and causes little or no downtime.
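
One caveat to the description above: if all three replicas are scheduled onto the same worker node, a node failure takes them down together. The fragment below is a sketch of a pod template addition that asks the scheduler to spread replicas across nodes, assuming the app: my-app label from the sample. It is meant to be merged into the Deployment, not applied as a complete manifest.

```yaml
# Fragment: merge under spec.template.spec of the Deployment above.
topologySpreadConstraints:
  - maxSkew: 1                            # replica counts per node may differ by at most 1
    topologyKey: kubernetes.io/hostname   # treat each node as a separate failure domain
    whenUnsatisfiable: ScheduleAnyway     # prefer spreading, but do not block scheduling
    labelSelector:
      matchLabels:
        app: my-app
```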

4. Active-Active vs. Active-Passive Comparison Table

Feature	Active-Passive	Active-Active
Availability	High (after failover)	Very High (continuous availability)
Failover Time	Seconds to minutes (detection plus promotion)	Near zero (traffic shifts to the remaining nodes)
Resource Usage	Passive nodes are underutilized	All nodes serve traffic (headroom is still needed to absorb a failed node)
Complexity	Lower complexity	Higher complexity
Scalability	Limited (mostly vertical scaling)	Highly scalable (horizontal scaling)
Cost	More cost-effective for basic setups	Higher costs due to active resources
Kubernetes Implementation	Deployment with leader election and readiness probes	Multiple replicas behind a Service and Ingress
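
The Ingress mentioned in the last row of the table could look like the sketch below, which routes external HTTP traffic to my-app-service from section 3.3. It assumes an ingress controller is installed and registered under the class name nginx; the host my-app.example.com and the name my-app-ingress are placeholders.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress             # hypothetical name
spec:
  ingressClassName: nginx          # assumes an NGINX ingress controller with this class
  rules:
    - host: my-app.example.com     # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-service   # Service from the Active-Active sample
                port:
                  number: 80
```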

5. How to Choose Between Active-Active and Active-Passive

Consideration	Recommended Architecture
Cost Sensitivity	Active-Passive
Zero Downtime Needed	Active-Active
Simple Setup Preferred	Active-Passive
High Traffic Applications	Active-Active
Backup/Disaster Recovery Only	Active-Passive