High Availability
Deploy Parallax with multi-node clustering for automatic failover and zero-downtime operations.
Overview
High Availability (HA) mode runs multiple control plane instances that coordinate through a shared state store. If any instance fails, others automatically take over without service interruption.
Components
Control Plane Cluster
Multiple instances share state and coordinate leadership:
| Component | Role |
|---|---|
| Primary | Handles coordination, leader election, garbage collection |
| Standbys | Process requests, ready for failover |
| All nodes | Accept API requests, handle WebSocket connections |
Redis (Required)
Redis provides:
- State synchronization: Share execution state across nodes
- Leader election: Coordinate primary selection via locks
- Pub/Sub: Real-time event distribution
- Session storage: Maintain WebSocket sessions across failover
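As a rough sketch of the first and last roles (shared state and session hashes), here is how the writes might look with ioredis, using the key and field layout shown under State Synchronization below. This illustrates the pattern, not Parallax's actual internals; leader election and pub/sub are sketched in their own sections.

```typescript
// Sketch: shared state in Redis (key/field names from "State Synchronization").
import Redis from 'ioredis';

const redis = new Redis('redis://redis-cluster:6379');

// State synchronization: any node can read or update an execution's record.
await redis.hset('parallax:execution:exec_abc123', {
  status: 'running',
  pattern: 'content-classifier',
  nodeId: 'parallax-control-plane-0',
});

// Session storage: record which node holds an agent's WebSocket connection.
await redis.hset('parallax:agent:agent_xyz789', {
  nodeId: 'parallax-control-plane-1',
  capabilities: JSON.stringify(['classification', 'analysis']),
});
```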
Load Balancer
Distributes traffic across all healthy nodes:
- Health check endpoint: /health/ready
- WebSocket sticky sessions: Optional but recommended
- Graceful drain: 30-second connection drain on removal
Configuration
Enable HA Mode
```yaml
# parallax.config.yaml
highAvailability:
  enabled: true

  # Cluster identification
  clusterId: production

  # Leader election
  leaderElection:
    enabled: true
    leaseDuration: 15s
    renewDeadline: 10s
    retryPeriod: 2s

  # Health monitoring
  memberHealth:
    checkInterval: 5s
    timeout: 3s
    unhealthyThreshold: 3

# Redis connection (required for HA)
redis:
  url: redis://redis-cluster:6379

  # Or for Redis Cluster
  cluster:
    nodes:
      - redis-1:6379
      - redis-2:6379
      - redis-3:6379
```
Helm Values
```yaml
# values.yaml
controlPlane:
  replicas: 3

  config:
    highAvailability:
      enabled: true
      clusterId: production

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: parallax-control-plane
          topologyKey: kubernetes.io/hostname

  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule

redis:
  enabled: true
  architecture: replication
  replica:
    replicaCount: 3
  sentinel:
    enabled: true
```
Environment Variables
| Variable | Description | Default |
|---|---|---|
| PARALLAX_HA_ENABLED | Enable HA mode | false |
| PARALLAX_HA_CLUSTER_ID | Cluster identifier | default |
| PARALLAX_REDIS_URL | Redis connection URL | - |
| PARALLAX_HA_LEASE_DURATION | Leader lease duration | 15s |
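These can be set in a shell, container spec, or unit file; the values below simply mirror the YAML example above.

```bash
export PARALLAX_HA_ENABLED=true
export PARALLAX_HA_CLUSTER_ID=production
export PARALLAX_REDIS_URL=redis://redis-cluster:6379
export PARALLAX_HA_LEASE_DURATION=15s
```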
Leader Election
How It Works
- Startup: Each node attempts to acquire a Redis lock
- Winner: First to acquire becomes primary
- Renewal: Primary renews lock periodically
- Failover: If renewal fails, other nodes compete for leadership
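Under the hood this is the standard expiring-lock pattern. The sketch below illustrates it with ioredis and the parallax:ha:leader key from the Troubleshooting section; it shows the mechanism, not Parallax's actual implementation.

```typescript
// Illustrative lock-based election (not Parallax internals). Assumes ioredis.
import Redis from 'ioredis';

const redis = new Redis(process.env.PARALLAX_REDIS_URL ?? 'redis://localhost:6379');
const nodeId = process.env.HOSTNAME ?? 'parallax-control-plane-0';
const LEASE_MS = 15_000; // leaseDuration
const RENEW_MS = 10_000; // renewDeadline
const RETRY_MS = 2_000;  // retryPeriod

async function campaign(): Promise<void> {
  // Startup: SET ... NX only succeeds if no leader key exists yet.
  const won = await redis.set('parallax:ha:leader', nodeId, 'PX', LEASE_MS, 'NX');

  if (won === 'OK') {
    // Winner: keep extending the lease while we still hold the key.
    const renew = setInterval(async () => {
      const holder = await redis.get('parallax:ha:leader');
      if (holder === nodeId) {
        await redis.pexpire('parallax:ha:leader', LEASE_MS);
      } else {
        // Lost the key: drop back to standby and keep campaigning.
        clearInterval(renew);
        setTimeout(campaign, RETRY_MS);
      }
    }, RENEW_MS);
  } else {
    // Standby: if the leader stops renewing, the key expires and a later
    // attempt wins the election (failover).
    setTimeout(campaign, RETRY_MS);
  }
}

void campaign();
```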
Leader Responsibilities
The primary node handles:
- Garbage collection: Clean up expired executions
- Agent rebalancing: Redistribute agents across nodes
- Metrics aggregation: Consolidate cluster-wide metrics
- Background tasks: Run periodic maintenance
Viewing Leader Status
```bash
# Via CLI
parallax ha status

# Output:
# Cluster: production
# Members: 3
# Leader: parallax-control-plane-0
#
# Node                       Status    Role     Last Seen
# parallax-control-plane-0   healthy   leader   2s ago
# parallax-control-plane-1   healthy   standby  1s ago
# parallax-control-plane-2   healthy   standby  1s ago

# Via API
curl http://localhost:8080/ha/status
```
```json
{
  "clusterId": "production",
  "members": [
    {
      "id": "parallax-control-plane-0",
      "address": "10.0.0.1:8080",
      "role": "leader",
      "status": "healthy",
      "lastHeartbeat": "2024-01-15T10:30:00Z"
    },
    {
      "id": "parallax-control-plane-1",
      "address": "10.0.0.2:8080",
      "role": "standby",
      "status": "healthy",
      "lastHeartbeat": "2024-01-15T10:30:01Z"
    }
  ]
}
```
State Synchronization
Execution State
Executions are tracked in Redis for cross-node visibility:
```
redis> HGETALL parallax:execution:exec_abc123
1) "status"
2) "running"
3) "pattern"
4) "content-classifier"
5) "startedAt"
6) "2024-01-15T10:30:00Z"
7) "nodeId"
8) "parallax-control-plane-0"
```
Agent Sessions
WebSocket connections are tracked for session affinity:
```
redis> HGETALL parallax:agent:agent_xyz789
1) "nodeId"
2) "parallax-control-plane-1"
3) "capabilities"
4) "[\"classification\",\"analysis\"]"
5) "connectedAt"
6) "2024-01-15T10:00:00Z"
```
Pub/Sub Events
Real-time events distributed via Redis pub/sub:
- parallax:events:execution - Execution state changes
- parallax:events:agent - Agent connect/disconnect
- parallax:events:cluster - Cluster membership changes
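For illustration, a subscriber for these channels might look like the ioredis sketch below; the channel names come from the list above, while the payload shape is an assumption.

```typescript
// Sketch: watching cluster events over Redis pub/sub (payload format assumed).
import Redis from 'ioredis';

// ioredis connections in subscriber mode are dedicated to pub/sub.
const sub = new Redis('redis://redis-cluster:6379');

await sub.subscribe(
  'parallax:events:execution',
  'parallax:events:agent',
  'parallax:events:cluster',
);

sub.on('message', (channel, message) => {
  const event = JSON.parse(message);
  console.log(`[${channel}]`, event);
});
```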
Failover
Automatic Failover
When a node fails:
- Detection: Other nodes notice missing heartbeats (default: 15s)
- Election: Remaining nodes compete for leadership
- Promotion: New leader takes over coordination duties
- Recovery: In-flight executions are resumed or retried
Agent Reconnection
Agents automatically reconnect to healthy nodes:
```typescript
const agent = new ParallaxAgent({
  controlPlaneUrl: 'http://load-balancer:8080',
  reconnect: {
    enabled: true,
    maxRetries: 10,
    backoff: {
      initial: 1000,
      max: 30000,
      multiplier: 2,
    },
  },
});
```
Execution Recovery
In-flight executions during failover:
| Execution State | Recovery Action |
|---|---|
| Pending | Re-queued automatically |
| Running (agents working) | Continues on agents |
| Aggregating | Resumed by new node |
| Failed | Reported as failed |
Manual Failover
Trigger manual leader election:
```bash
# Step down current leader
parallax ha stepdown

# Force election
parallax ha elect --force
```
Scaling
Horizontal Scaling
Add nodes to handle more load:
```bash
# Kubernetes
kubectl scale statefulset parallax-control-plane --replicas=5

# Docker
docker compose up -d --scale control-plane=5
```
Recommendations
| Workload | Nodes | Redis |
|---|---|---|
| Development | 1 | Single |
| Small (< 100 agents) | 3 | Single |
| Medium (< 1000 agents) | 3-5 | Sentinel |
| Large (< 10000 agents) | 5-7 | Cluster |
| Very Large | 7+ | Cluster |
Node Sizing
| Node Size | CPU (cores) | Memory | Agents | Executions/sec |
|---|---|---|---|---|
| Small | 1 | 1GB | 100 | 10 |
| Medium | 2 | 2GB | 500 | 50 |
| Large | 4 | 4GB | 2000 | 200 |
| XLarge | 8 | 8GB | 5000 | 500 |
Load Balancing
Kubernetes Service
```yaml
apiVersion: v1
kind: Service
metadata:
  name: parallax-control-plane
spec:
  type: ClusterIP
  selector:
    app: parallax-control-plane
  ports:
    - port: 8080
      targetPort: 8080
```
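If clients reach this Service directly rather than through an Ingress, Kubernetes' built-in client-IP affinity is one way to get the recommended WebSocket stickiness (optional; shown as an addition to the spec above):

```yaml
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
```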
Ingress with Session Affinity
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: parallax
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "PARALLAX_AFFINITY"
    nginx.ingress.kubernetes.io/session-cookie-expires: "172800"
spec:
  rules:
    - host: parallax.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: parallax-control-plane
                port:
                  number: 8080
```
AWS ALB
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: parallax
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/healthcheck-path: /health/ready
    alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.type=lb_cookie
```
Health Checks
Endpoints
| Endpoint | Purpose | Use For |
|---|---|---|
| /health | Basic health | General monitoring |
| /health/live | Liveness | Kubernetes liveness probe |
| /health/ready | Readiness | Load balancer, readiness probe |
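You can also query these endpoints by hand when debugging a node (port 8080 as in the examples above); a healthy, ready node is expected to answer with a success status:

```bash
# Liveness: is the process up?
curl -i http://localhost:8080/health/live

# Readiness: can this node take traffic?
curl -i http://localhost:8080/health/ready
```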
Liveness Probe
Checks if the process is running:
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
```
Readiness Probe
Checks if the node can accept traffic:
```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
```
Monitoring
Cluster Metrics
```
# Leader status (1 = leader, 0 = standby)
parallax_ha_is_leader{node="parallax-control-plane-0"} 1

# Cluster members
parallax_ha_cluster_members 3

# Healthy members
parallax_ha_healthy_members 3

# Leader elections
parallax_ha_elections_total 5

# State sync latency
parallax_ha_sync_latency_seconds{quantile="0.99"} 0.005
```
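A couple of PromQL queries over these metrics that are handy on a dashboard (metric names taken from the exposition above):

```
# Which node currently holds leadership?
parallax_ha_is_leader == 1

# Leader elections over the last hour; a sudden rise suggests flapping
increase(parallax_ha_elections_total[1h])
```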
Alerts
```yaml
groups:
  - name: parallax-ha
    rules:
      - alert: ParallaxNoLeader
        expr: sum(parallax_ha_is_leader) == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: No leader elected

      - alert: ParallaxSplitBrain
        expr: sum(parallax_ha_is_leader) > 1
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: Multiple leaders detected

      - alert: ParallaxUnhealthyMember
        expr: parallax_ha_healthy_members < parallax_ha_cluster_members
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Cluster member unhealthy
```
Troubleshooting
No Leader Elected
Symptoms: Cluster has no leader, coordination tasks not running
Check:
```bash
parallax ha status
redis-cli GET parallax:ha:leader
```
Causes:
- Redis connection issues
- Network partition
- All nodes unhealthy
Fix:
- Check Redis connectivity
- Verify network between nodes
- Restart unhealthy nodes
Split Brain
Symptoms: Multiple nodes think they're leader
Check:
```bash
parallax ha status
# Shows multiple leaders
```
Causes:
- Network partition
- Redis cluster issues
- Clock skew
Fix:
- Check network connectivity
- Verify Redis cluster health
- Force election:

  ```bash
  parallax ha elect --force
  ```
State Sync Lag
Symptoms: Executions appear on wrong node, stale data
Check:
```
parallax_ha_sync_latency_seconds{quantile="0.99"} > 1
```
Causes:
- Redis overloaded
- Network latency
- High execution volume
Fix:
- Scale Redis
- Check network latency
- Increase sync batch size
Best Practices
- Always use 3+ nodes - Allows for one failure while maintaining quorum
- Spread across zones - Use topology constraints to survive zone failures
- Monitor cluster health - Set up alerts for leader elections and member health
- Use Redis Sentinel/Cluster - Match Redis HA to control plane HA
- Test failover regularly - Practice chaos engineering (see the drill after this list)
- Set appropriate timeouts - Balance between fast failover and stability
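A simple failover drill on Kubernetes, as referenced above: delete the current leader pod and confirm that a standby takes over. The StatefulSet recreates the deleted pod, which rejoins as a standby.

```bash
# Find the current leader
parallax ha status

# Delete the leader pod; the StatefulSet recreates it
kubectl delete pod parallax-control-plane-0

# Within roughly the lease duration (15s by default) a standby is promoted
parallax ha status
```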
Next Steps
- Persistence - Durable storage
- Multi-Region - Geographic distribution
- Security - Secure the cluster