Version: Release 25.1

Digital.ai Release – Multi-Datacenter High Availability (HA) Setup

Digital.ai Release Multi-DC High Availability (HA) empowers organizations to achieve true enterprise-grade uptime, disaster recovery, and operational agility across geographically distributed datacenters. This reference guide showcases the advanced capabilities and architecture of our Multi-DC HA solution, designed to inspire and inform potential customers about what’s possible with Digital.ai Release. For a tailored implementation, our expert Professional Services team is ready to help you realize this vision in your environment.

Why Multi-DC HA with Digital.ai Release?

  • Uninterrupted Business Operations: Seamless failover and rapid recovery ensure your critical release pipelines and applications remain available—even in the face of datacenter outages or disasters.
  • Enterprise-Grade Automation: The Cluster Manager orchestrates health monitoring, failover, and recovery with minimal manual intervention, reducing operational risk and complexity.
  • Flexible Architecture: Supports both multi-master and master-slave database topologies, with robust session and file persistence options for diverse enterprise needs.
  • Observability & Control: Real-time dashboards, logs, and integration with Prometheus/Grafana provide deep visibility and actionable insights.
  • Proven at Scale: Designed for large, regulated enterprises with demanding uptime, compliance, and audit requirements.

Key Capabilities

  • Multi-Master or Master-Slave Support
  • Prometheus/Grafana Compatibility via health and metrics endpoints
  • Quorum and Locking to avoid split-brain
  • Graceful Shutdowns with queued and in-flight task handling
  • Session Persistence Options and external file support
  • Automated and Manual Failover with configurable thresholds
  • Comprehensive Monitoring (Release clusters, DB, load balancers, synthetic checks)
  • Real-Time UI for status, logs, and failover history
  • Enterprise Security: Role-based access, audit trails, and compliance-ready architecture

Architecture Overview

Figure: Multi-DC HA architecture diagram

  • Deployment Model: Active-Passive between two geographically separated datacenters (DCs; for example, US and EU).
  • Digital.ai Release Clusters: Each datacenter (DC) hosts a cluster of (at least) 3 Release nodes.
  • Load Balancers:
    • Local: F5/NGINX within each DC
    • Global: GSLB (for example, DNS-based routing) redirects users to the active DC; a conceptual routing sketch follows this list
  • Database:
    • Preferred: Multi-master (for example, EnterpriseDB Postgres Distributed (PGD))
    • Alternative: Master-slave (requires orchestration for promotion)
  • Orchestration: Cluster Manager automates and coordinates failover
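
The routing decision described above can be pictured as follows. This is a conceptual sketch only: a production GSLB is typically DNS-based (for example, an F5 global traffic manager) rather than application code, and the hostnames and health endpoint shown here are illustrative.

```python
# Conceptual sketch of the Active-Passive routing decision: send all traffic to the
# one datacenter whose local load balancer reports healthy, preferring the designated
# active DC. Hostnames and the /health path are placeholders.
import urllib.request

DATACENTERS = {
    "us-stl": "https://release-lb.us-stl.example.com/health",
    "eu-bel": "https://release-lb.eu-bel.example.com/health",
}

def healthy(url, timeout=2.0):
    """Return True if the local load balancer answers 200 on its health endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def active_datacenter(preferred="us-stl"):
    """Prefer the designated active DC; fall back to the passive DC only if it is down."""
    ordered = [preferred] + [dc for dc in DATACENTERS if dc != preferred]
    for dc in ordered:
        if healthy(DATACENTERS[dc]):
            return dc
    return None  # both DCs unreachable: raise an alert instead of routing

if __name__ == "__main__":
    print("Route traffic to:", active_datacenter())
```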

Cluster Manager Highlights

The Digital.ai Release Cluster Manager is a purpose-built orchestration and monitoring component that ensures high availability, automated failover, and operational transparency across multi-datacenter Release deployments. Key features and capabilities include:

  • Comprehensive Monitoring: Continuously monitors the health of Release clusters, databases, load balancers, and synthetic monitoring endpoints (for example, Dynatrace). Health checks leverage Spring Boot Actuator endpoints and custom probes for real-time status.
  • Automated and Manual Failover: Supports both automatic failover (triggered by configurable health thresholds) and manual failover (for planned maintenance or controlled recovery). Automated failover logic includes cool-down intervals and quorum-based decision-making to prevent split-brain scenarios; a simplified sketch of this decision logic appears at the end of this section.
  • Quorum and Distributed Locking: Utilizes distributed locking and quorum mechanisms to coordinate failover actions and ensure only one datacenter is active at a time, maintaining data consistency and service integrity.
  • Real-Time UI and Observability: Provides a web-based dashboard for real-time cluster status, logs, failover history, and actionable alerts. Integrates with enterprise observability tools (for example, Prometheus, Grafana) for unified monitoring.
  • Role-Based Access and Audit Trails: Enforces enterprise security with role-based access control (RBAC), detailed audit logs, and compliance-ready architecture. All failover actions and configuration changes are logged for traceability.
  • Extensible Configuration: Centralized configuration via the monitor.yaml file allows fine-tuning of health thresholds, failover logic, notification channels, and integration points.
  • Operational Best Practices: Includes support for dry-run failover testing, RTO/RPO validation, and alerting to ensure readiness and minimize downtime during real incidents.
  • Resilience and Recovery: Designed to handle a wide range of failure scenarios, including node, database, network, and quorum loss, with automated recovery and business continuity as primary goals.

The Cluster Manager is a critical enabler for enterprise-grade multi-DC HA, providing the automation, visibility, and control required for robust, compliant, and resilient Release operations.
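
The following sketch illustrates the kind of quorum-based, cool-down-guarded failover decision described above. It is not the Cluster Manager implementation; the threshold, quorum, and cool-down values are placeholders that would normally be tuned in monitor.yaml.

```python
# Illustrative failover decision logic: act only when the failure threshold,
# quorum, and cool-down conditions are all satisfied.
import time

QUORUM = 2               # minimum number of Cluster Manager nodes that must agree
FAILURE_THRESHOLD = 3    # consecutive failed health checks before acting
COOL_DOWN_SECONDS = 300  # minimum time between failovers to avoid flapping

class FailoverCoordinator:
    def __init__(self):
        self.consecutive_failures = 0
        self.last_failover = 0.0

    def record_health(self, active_dc_healthy):
        """Track consecutive failures of the active datacenter."""
        self.consecutive_failures = 0 if active_dc_healthy else self.consecutive_failures + 1

    def should_fail_over(self, votes_for_failover):
        """Fail over only when threshold, quorum, and cool-down all allow it."""
        threshold_reached = self.consecutive_failures >= FAILURE_THRESHOLD
        quorum_reached = votes_for_failover >= QUORUM
        cooled_down = (time.time() - self.last_failover) >= COOL_DOWN_SECONDS
        return threshold_reached and quorum_reached and cooled_down

    def fail_over(self):
        # In a real deployment this step would acquire a distributed lock,
        # repoint the GSLB/load balancers, and promote the passive DC.
        self.last_failover = time.time()
        self.consecutive_failures = 0
```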

Configuration Requirements (Release 25.1 and later)

  • Cluster Awareness:
    • Nodes tagged per DC (for example, us-stl, eu-bel)
    • Uses Pekko-based clustering
  • Health Checks: Based on Spring Boot Actuator endpoints (/health, /readiness, and so on); a probe sketch follows this list
  • Datacenter State Awareness:
    • Automated detection and transition
    • Cool-down intervals to avoid flapping
  • Session Handling:
    • Sticky sessions recommended
    • Optional DB-backed session store (performance trade-off)
  • File System:
    • Use shared or object storage if file access is required post-failover
  • Messaging:
    • For Release 25.1: Embedded or external JMS (for example, RabbitMQ with quorum queues) is required for webhook reliability
    • For Release 25.3 and later: No external JMS is required. Webhook events are processed directly using a database-backed queue, simplifying the architecture and reducing operational overhead.
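
As a minimal illustration of the health-check requirement, the probe below polls a Spring Boot Actuator health endpoint on each node and treats a datacenter as healthy when a majority of its nodes report UP. The node hostnames, DC tags, and exact endpoint path are assumptions; adjust them to match your deployment.

```python
# Per-DC health aggregation sketch: a DC counts as healthy when a majority of its
# Release nodes return an Actuator payload with status "UP".
import json
import urllib.request

NODES_BY_DC = {
    "us-stl": ["https://release-1.us-stl.example.com/health",
               "https://release-2.us-stl.example.com/health",
               "https://release-3.us-stl.example.com/health"],
    "eu-bel": ["https://release-1.eu-bel.example.com/health",
               "https://release-2.eu-bel.example.com/health",
               "https://release-3.eu-bel.example.com/health"],
}

def actuator_status(url, timeout=2.0):
    """Return the Actuator health status string (for example, "UP"), or "DOWN" on error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
            return payload.get("status", "UNKNOWN")
    except OSError:
        return "DOWN"

def dc_healthy(dc):
    """A DC is healthy when a majority of its Release nodes report UP."""
    statuses = [actuator_status(url) for url in NODES_BY_DC[dc]]
    return sum(status == "UP" for status in statuses) > len(statuses) // 2

if __name__ == "__main__":
    for dc in NODES_BY_DC:
        print(dc, "healthy:", dc_healthy(dc))
```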

Database Configuration - EnterpriseDB Postgres Distributed (PGD)

Note

The following configuration considerations use EnterpriseDB Postgres Distributed (PGD) as an illustrative example. These recommendations may vary depending on the specific database clustering or replication solution in use. Please consult your platform-specific guidance.

  • Sequences: Use globally allocated (galloc) sequences to avoid conflicts across nodes
  • Schema Handling: Use Liquibase hotfix to avoid DDL replication issues
  • Replica Identity:
    • Prefer DEFAULT over FULL for performance and WAL efficiency
  • RPO Assurance: Regular replication lag checks are required; one way to automate them is sketched below
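
One way to automate the replication lag check is shown below. This is a minimal sketch that assumes the standard PostgreSQL pg_stat_replication view is queryable on the write leader; PGD also provides its own monitoring views, which may be preferable. The connection details and the 30-second threshold are illustrative.

```python
# RPO check sketch: flag any replica whose replay lag exceeds the allowed window.
# Requires the third-party psycopg2 driver.
import psycopg2

MAX_LAG_SECONDS = 30  # illustrative RPO budget

conn = psycopg2.connect(host="pgd-node-1.example.com", dbname="release", user="monitor")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name,
               COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) AS lag_seconds
        FROM pg_stat_replication
    """)
    for name, lag in cur.fetchall():
        status = "OK" if lag <= MAX_LAG_SECONDS else "RPO AT RISK"
        print(f"{name}: replay lag {lag:.1f}s [{status}]")
conn.close()
```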

Failover Scenarios

This section outlines the types of failover events and disruptions that the Digital.ai Release Multi-DC HA solution is designed to handle, ensuring business continuity and resilience:

  • Node crashes
  • Database (DB) or Load Balancer failures
  • Network partitions
  • Cluster Manager quorum loss
  • Stress and repeat failover events
  • Recovery Time Objective (RTO) and Recovery Point Objective (RPO) measurements (a simple RTO timing sketch follows this list)
    • RTO (Recovery Time Objective): The maximum acceptable time to restore service after a disruption
    • RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time
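
A simple way to measure RTO during a dry-run failover is to trigger the failover and time how long it takes for the standby datacenter's endpoint to start answering healthily, as in the sketch below. The URL, polling interval, and timeout are placeholders.

```python
# RTO timing sketch: trigger the failover (outside this script), then measure how long
# until the standby DC's health endpoint responds with 200.
import time
import urllib.request

STANDBY_URL = "https://release-lb.eu-bel.example.com/health"  # illustrative

def wait_until_healthy(url, poll_seconds=5, timeout_seconds=1800):
    start = time.monotonic()
    while time.monotonic() - start < timeout_seconds:
        try:
            with urllib.request.urlopen(url, timeout=3) as resp:
                if resp.status == 200:
                    return time.monotonic() - start
        except OSError:
            pass  # standby not serving yet; keep polling
        time.sleep(poll_seconds)
    raise TimeoutError("Standby DC did not become healthy within the timeout")

if __name__ == "__main__":
    rto = wait_until_healthy(STANDBY_URL)
    print(f"Measured RTO: {rto:.0f} seconds")
```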

Business Benefits

  • Minimize Downtime: Achieve industry-leading RTO/RPO for mission-critical release operations.
  • Reduce Operational Overhead: Automation and self-healing reduce the need for manual intervention and lower TCO.
  • Enhance Compliance: Built-in auditability and security controls support regulatory requirements.
  • Future-Proof Scalability: Easily extend to new regions or cloud providers as your business grows.

Professional Services Engagement

While this document illustrates the possibilities with Digital.ai Release Multi-DC HA, a successful implementation requires careful planning, integration, and validation. Our Professional Services team brings deep expertise to ensure your deployment is robust, secure, and tailored to your unique needs. Contact us to learn how we can help you realize the full value of Multi-DC HA.

Procedure to Set up Multi-DC HA

Note

The following procedure uses EnterpriseDB Postgres Distributed (PGD) as an illustrative example. These recommendations may vary depending on the specific database clustering or replication solution in use. Please consult your platform-specific guidance.

  1. Install Release Clusters on both DCs with Pekko configuration
  2. Set up EnterpriseDB PGD for multi-master sync (or master-slave with failover hooks)
  3. Configure Cluster Manager across at least 3 nodes
  4. Set up Load Balancers (F5 local + GSLB global)
  5. Deploy Dynatrace or synthetic monitoring
  6. Configure failover thresholds and checks in monitor.yaml (a sanity-check sketch follows these steps)
  7. Perform dry-run failover tests and verify RTO/RPO
  8. Set up alerts, logging, and notifications
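
Because the monitor.yaml schema is defined by the Cluster Manager, the sketch below uses hypothetical key names purely to show how a pre-flight sanity check of the failover settings (steps 6 and 7) might look before running dry-run tests.

```python
# Hypothetical example only: the key names below are NOT the real monitor.yaml schema.
# The sketch simply loads the file and sanity-checks a few failover settings.
import yaml  # PyYAML

REQUIRED_KEYS = ["failureThreshold", "coolDownSeconds", "quorumSize"]  # illustrative names

with open("monitor.yaml") as fh:
    config = yaml.safe_load(fh) or {}

failover = config.get("failover", {})
missing = [key for key in REQUIRED_KEYS if key not in failover]
if missing:
    raise SystemExit(f"monitor.yaml is missing failover settings: {missing}")

if failover["coolDownSeconds"] < 60:
    print("Warning: a very short cool-down can cause failover flapping")
print("monitor.yaml failover settings look sane:", failover)
```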