
How to deploy a standby cluster for Disaster Recovery

Disaster recovery is not optional for businesses operating in the digital age. With the ever-increasing reliance on data, system outages or data loss can be catastrophic, causing significant business disruptions and financial losses.

With multi-cloud or multi-regional PostgreSQL deployments, the complexity of managing disaster recovery only increases. This is where the Percona Operators come in, providing a solution to streamline disaster recovery for PostgreSQL clusters running on Kubernetes. With the Percona Operators, businesses can manage multi-cloud or hybrid-cloud PostgreSQL deployments with ease, ensuring that critical data is always available and secure, no matter what happens.

Solution overview

Operators automate routine tasks and remove toil. For standby clusters, the Percona Operator for PostgreSQL version 2 provides the following options:

  1. pgBackRest repo-based standby
  2. Streaming replication
  3. Combination of (1) and (2)

This document describes the pgBackRest repo-based standby, as it is the simplest option. The following is the architecture diagram:

Architecture diagram: the Main and Disaster Recovery (DR) Kubernetes clusters, each running the Percona Operator, a PostgreSQL cluster, pgBackRest, and pgBouncer, connected through shared object storage.

  1. This solution describes two Kubernetes clusters in different regions, in different clouds, or running in hybrid mode (on-premises and cloud). One cluster is Main and the other is Disaster Recovery (DR).

  2. Each cluster includes the following components:

    1. Percona Operator
    2. PostgreSQL cluster
    3. pgBackRest
    4. pgBouncer
  3. pgBackRest on the Main site streams backups and Write-Ahead Logs (WALs) to the object storage.

  4. pgBackRest on the DR site fetches these backups and WALs from the object storage and feeds them to the standby cluster.

Deploy disaster recovery for PostgreSQL on Kubernetes

Configure Main site

  1. Deploy the Operator using your favorite method (one minimal example is shown after this list). Once installed, configure the Custom Resource manifest so that pgBackRest starts using the object storage of your choice. Skip this step if you already have it configured.

  2. Configure the backups.pgbackrest.repos section by adding the necessary configuration. The example below is for Google Cloud Storage (GCS):

    spec:
      backups:
        configuration:
          - secret:
              name: main-pgbackrest-secrets
        pgbackrest:
          repos:
          - name: repo1
            gcs:
              bucket: MY-BUCKET
    

    The main-pgbackrest-secrets Secret contains the keys for GCS (see the example after this list for one way to create it). Read more about the configuration in the backup and restore tutorial.

  3. Once configured, apply the custom resource:

    $ kubectl apply -f deploy/cr.yaml 
    
    Expected output
    perconapgcluster.pgv2.percona.com/main created
    

    The backups should appear in the object storage. By default, pgBackRest puts them into the pgbackrest folder (a quick way to verify this is shown after this list).
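
One common way to deploy the Operator for step 1 is to apply the bundle manifest from the percona-postgresql-operator repository. This is only a sketch and assumes the repository is checked out in the current directory; use the installation guide for your preferred deployment method:

# installs the CRDs, RBAC, and the Operator deployment;
# --server-side is used because the bundled CRDs are large
$ kubectl apply --server-side -f deploy/bundle.yaml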
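
For step 2, the main-pgbackrest-secrets Secret has to carry the GCS credentials. A minimal sketch, assuming the Operator expects the service account key in a file named gcs-key.json (the upstream convention); check the backup and restore tutorial for the exact key names your version expects:

# my-service-account-key.json is a service account key downloaded from Google Cloud;
# the in-Secret file name gcs-key.json is an assumption based on the upstream convention
$ kubectl create secret generic main-pgbackrest-secrets \
    --from-file=gcs-key.json=my-service-account-key.json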
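
For step 3, a quick way to confirm that backups and WAL archives reach the bucket is to list its contents, for example with the gsutil CLI (MY-BUCKET is the bucket from the manifest above):

# backup sets and WAL archives should appear under the pgbackrest/ prefix
$ gsutil ls -r gs://MY-BUCKET/pgbackrest/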

Configure DR site

The configuration of the Disaster Recovery site is similar to that of the Main site, the only difference being the standby settings.

The following manifest has standby.enabled set to true and points to the repoName where the backups are stored (GCS in our case):

metadata:
  name: standby
spec: 
...
  backups:
    configuration:
      - secret:
          name: standby-pgbackrest-secrets
    pgbackrest:
      repos:
      - name: repo1
        gcs:
          bucket: MY-BUCKET
  standby:
    enabled: true
    repoName: repo1

Deploy the standby cluster by applying the manifest:

$ kubectl apply -f deploy/cr.yaml
Expected output
perconapgcluster.pgv2.percona.com/standby created
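
To verify that the standby cluster started on the DR site and is consuming data from the repository, check the custom resource and the pods. A minimal check, assuming kubectl is pointed at the DR cluster:

# the custom resource should eventually report a ready state
$ kubectl get perconapgcluster standby

# PostgreSQL, pgBouncer, and pgBackRest pods should be Running
$ kubectl get pods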

Failover

In case of a Main site failure, or in other cases when you need to switch over, you can promote the standby cluster. Promotion allows writes to the cluster, which in turn pushes Write-Ahead Logs (WALs) to the pgBackRest repository. If the old primary is still active, this can create a split-brain situation where two primary instances attempt to write to the same repository. To avoid this, make sure the primary cluster is either deleted or shut down before you promote the standby cluster.
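
How you stop the Main cluster depends on your setup. A sketch of both options mentioned above, assuming the Main cluster's custom resource is named main and your kubeconfig has a main-site context (both names are illustrative); the spec.pause field is an assumption and may not be available in every Operator version:

# option 1: delete the Main cluster entirely
$ kubectl --context main-site delete perconapgcluster main

# option 2: shut the cluster down but keep its custom resource,
# assuming your Operator version supports spec.pause
$ kubectl --context main-site patch perconapgcluster main \
    --type merge -p '{"spec":{"pause":true}}'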

Once the primary is down or inactive, promote the standby by changing the corresponding section of its Custom Resource:

spec:
  standby:
    enabled: false
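
Alternatively, the same change can be applied in place with a patch; a one-liner sketch, assuming the standby cluster's custom resource is named standby:

# flip standby.enabled to false to promote the cluster
$ kubectl patch perconapgcluster standby --type merge \
    -p '{"spec":{"standby":{"enabled":false}}}'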

Now you can start writing to the cluster.

Split brain

There might be a case where your old primary comes back up and starts writing to the repository. To recover from this situation, do the following:

  1. Keep running only the primary that has the latest data
  2. Stop the writes on the other one
  3. Take a new full backup from the remaining primary and upload it to the repository (see the example manifest below)
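
For the last step, a new full backup can be requested through the Operator's on-demand backup custom resource. The sketch below assumes the surviving cluster is named main and uses the repo1 repository; field names can differ between Operator versions, so check the backup and restore tutorial:

# request a full backup of the surviving cluster into repo1
apiVersion: pgv2.percona.com/v2
kind: PerconaPGBackup
metadata:
  name: post-failover-full-backup
spec:
  pgCluster: main
  repoName: repo1
  options:
  - --type=full

Apply the manifest with kubectl apply -f in the namespace of the surviving cluster.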

Automate the failover

Automated failover consists of multiple steps and is outside of the Operator’s scope. Still, there are a few steps you can take to reduce the Recovery Time Objective (RTO). To detect the failure, we recommend having a third site that monitors both the DR and Main sites. This way you can be sure that the Main site has really failed and that it is not a network split situation.
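
What such a check looks like is deployment-specific. A minimal sketch run from the third site, assuming kubeconfig contexts named main-site and dr-site and cluster names main and standby (all of these names are illustrative):

# query both Kubernetes clusters from the monitoring host; a healthy DR cluster
# combined with an unreachable or failed Main cluster is the signal to promote
$ kubectl --context main-site get perconapgcluster main
$ kubectl --context dr-site get perconapgcluster standby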

Another aspect of automation is switching the application traffic from Main to Standby after promotion. This can be done through various Kubernetes configurations and heavily depends on how your networking and application are designed. The following options are quite common:

  1. Global Load Balancer - various clouds and vendors provide their own solutions
  2. Multi-Cluster Services (MCS) - available on most public clouds
  3. Federation or other multi-cluster solutions

Get expert help

If you need assistance, visit the community forum for comprehensive and free database knowledge, or contact our Percona Database Experts for professional support and services. Join K8S Squad to benefit from early access to features and “ask me anything” sessions with the Experts.

