
Sleep until promoted – Using Zalando Postgres Operator for Standby Postgres setup

Want an (off-site) standby replica of your Postgres instance? Patroni and the Zalando Postgres Operator can help you here, either by constantly streaming from your primary cluster, or by continuously applying WAL archives as you back them up to S3 storage.

Lately I found out about another nice feature of Patroni and the Zalando Postgres Operator. It’s called “standby cluster” and enables you to run a (probably off-site) standby replica of your Postgres cluster on Kubernetes. I admit this seems like a niche feature at first sight, but it has its relevance when you think about securing your Postgres setup across multiple regions. There are some caveats to be aware of; more on them toward the end of this article.

If you haven’t deployed Zalando Postgres Operator or backups using WAL-G, you can find instructions on my blog here and here. You will need both in order to follow along.

So why use a standby cluster instead of “stretching” your “normal” cluster setup over multiple regions? Kubernetes clusters may operate across multiple availability zones in a region, but if the Zombie Apocalypse strikes, you may still want your Postgres cluster available in that abandoned mine in Norway (where it’s too cold for zombies, as you know). So think of a standby cluster as another complete cluster (possibly consisting of more than one node) that runs only to apply new WAL information from its source cluster, like a normal cluster replica would. Zalando has built in two possible methods to get that WAL information:

  1. Cloning from your source primary instance to build an initial standby instance, and then using direct replication from that primary source to apply WAL information. I see a potential issue here: when, for whatever reason, the connection between your source and standby is lost, and during that time WALs have been backed up and removed from the primary instance, I don’t know how the standby would close that gap automatically (maybe by taking a new basebackup of the source?). I haven’t been able to test this scenario so far.
  2. Cloning from the S3 backup location of your source instance: getting an initial clone from those backups and then applying new WAL archives as they arrive on S3. This method seemed more useful to me at first, because you don’t have a hard dependency on connectivity to the source primary. You just need connectivity to the S3 bucket, and there a temporary loss of connection is not an issue, since the WAL archives stay available on S3 much longer.


I decided to use the second option. This seemed quite straightforward, but I soon stumbled over a gap in the Zalando Operator documentation. If you want to follow along with working manifests, you can find them in my Github repository here and here. The first step is to add a standby section to the manifest of the standby cluster. The cluster must not be deployed already; adding the standby section to an already running cluster will have no effect. A standby cluster can itself consist of more than one replica, with the difference that the standby replicas will follow the standby leader, not the primary leader.

```yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: postgres-demo-standby-cluster-target
  namespace: postgres
spec:
  standby:
    s3_wal_path: "s3://<bucket_name>/spilo/<source_cluster_name>/<UID>/wal/<pg_version>"
```

If you have followed my guides to configure S3 backups with WAL-G in the other posts mentioned above, you will NOT need the <UID> segment in the path, since it is left out there.
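For a setup like mine, the standby section would therefore look roughly like the following. Note that the bucket name, cluster name and Postgres version here are examples only; adjust them to your environment:

```yaml
# Hypothetical example: s3_wal_path without the <UID> segment, matching a
# WAL-G setup where the UID is not part of the S3 prefix
spec:
  standby:
    s3_wal_path: "s3://postgresql/spilo/postgres-demo-standby-cluster-source/wal/15"
```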

In the above snippet, we tell the standby cluster where to find the base backups and WAL archives of its source cluster. That’s not much information, right? I first assumed that the Zalando Operator would take the environment variables needed for cloning from S3 storage from the configmap I’m using in my setup (see here). But this turned out not to be the case, as I read in another piece of documentation in the Zalando Operator repo here: for every WAL-G backup environment variable / parameter, there is a STANDBY_ prefixed equivalent which is used for bootstrapping the standby cluster. In my testing, I didn’t add those STANDBY_ prefixed variables to my pod-config configmap. I decided that this time they would be better placed in the standby cluster manifest itself, so I added the following environment variables.

The variable names below mirror the WAL-G settings of my primary cluster with the STANDBY_ prefix added; adjust names and values to match your own S3 / WAL-G configuration.

```yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: postgres-demo-standby-cluster-target
  namespace: postgres
spec:
  env:
    - name: STANDBY_AWS_ACCESS_KEY_ID
      value: postgresql
    - name: STANDBY_USE_WALG_BACKUP
      value: "true"
    - name: STANDBY_USE_WALG_RESTORE
      value: "true"
    - name: STANDBY_WALG_DISABLE_S3_SSE
      value: "true"
    - name: STANDBY_AWS_SECRET_ACCESS_KEY
      value: Supersecret
    - name: STANDBY_AWS_ENDPOINT
      value: "http://minio.home.lab:9000"
    - name: STANDBY_AWS_REGION
      value: de01
    - name: STANDBY_WALG_S3_SSE
      value: ""
    - name: STANDBY_WAL_S3_BUCKET
      value: postgresql
    - name: STANDBY_AWS_S3_FORCE_PATH_STYLE
      value: "true"
    - name: STANDBY_WALE_ENV_DIR
      value: "/run/etc/wal-e.d/env-standby"
```

I haven’t tested out if really all of those variables are needed. Especially the variable STANDBY_WALE_ENV_DIR did cause me some trouble to find out. I wasn’t aware of this configuration option. But without that parameter, my standby cluster bootstrap would fail with an error like Key 'STANDBY_WALE_ENV_DIR' not found: first part.


Before jumping into the creation of the standby cluster, I want to share a word about secrets. Since the standby instance is a 1:1 replica of your primary instance, all passwords are replicated as well. But the Zalando Operator (which can’t know those passwords in every situation, since the primary is probably running on a different Kubernetes cluster) will create new secrets for the standby cluster. They obviously will differ from the primary passwords. You could ignore this fact and it will never become a problem, until you promote the standby cluster so that it is no standby cluster anymore. The system and application users (like standby, postgres etc.) all have a password that does not match the credentials stored in the secrets created by the operator. Replication on the newly promoted standby cluster will throw errors and everyone will become unhappy.
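If you want to see the mismatch for yourself, you can decode the password the operator generated, for example for the postgres user, and compare it to the one on the primary. The secret name below follows the operator’s usual naming scheme for this demo cluster; adjust it to yours:

```shell
# Decode the operator-generated password for the postgres user
# (secret name follows the operator's <user>.<cluster>.credentials... scheme)
kubectl get secret -n postgres \
  postgres.postgres-demo-standby-cluster-target.credentials.postgresql.acid.zalan.do \
  -o 'jsonpath={.data.password}' | base64 -d
```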

So I decided to create the secrets of the primary cluster as part of the standby cluster deployment, as you can see here. Your secret names and contents will obviously differ when you do this in real life, but if you follow the two manifests in my repo, you should be good.

So apply the source / primary manifest to your cluster and wait for it to bootstrap and create a fresh backup on S3.

kubectl apply -f demo-standby-cluster-source.yaml

secret/ created
secret/ created
secret/ created

Then, apply the target / standby manifest:

kubectl apply -f demo-standby-cluster-target.yaml

secret/ created
secret/ created
secret/ created

After some time, you should see logs in your standby leader pod telling you that it’s bootstrapping from a backup found on S3.

2023-10-07 18:57:05,965 INFO: trying to bootstrap a new standby leader
INFO: 2023/10/07 18:57:06.048426 Selecting the latest backup...
INFO: 2023/10/07 18:57:06.070941 LATEST backup is: 'base_000000010000000000000002'
INFO: 2023/10/07 18:57:06.112007 Finished extraction of part_003.tar.lz4
INFO: 2023/10/07 18:57:13.507365 Finished extraction of part_001.tar.lz4
INFO: 2023/10/07 18:57:13.523374 Finished extraction of pg_control.tar.lz4
INFO: 2023/10/07 18:57:13.523608
Backup extraction complete.
2023-10-07 18:57:13,534 INFO: replica has been created using bootstrap_standby_with_wale

In my case, the standby cluster also has a backup configuration in general. But standby clusters don’t take backups as long as they are not promoted.


There is no automated way to trigger a failover / promotion of the standby cluster using the Zalando Operator. You have to edit the Patroni cluster configuration manually and then remove the standby section (and probably also the configured STANDBY_ env vars) from the postgresql CRD.

To edit the Patroni configuration, open a shell into the standby leader pod and edit the configuration with the command patronictl edit-config. Remove the following lines from the configuration, then save and exit.

```yaml
standby_cluster:
  create_replica_methods:
  - bootstrap_standby_with_wale
  - basebackup_fast_xlog
  restore_command: envdir "/run/etc/wal-e.d/env-standby" timeout "0" /scripts/ "%f" "%p"
```
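Opening that editor session can be done in one step, assuming the standby leader pod name and namespace from this demo:

```shell
# Start an interactive patronictl edit-config session inside the
# standby leader pod (opens an editor in the pod; names from this demo)
kubectl exec -it -n postgres postgres-demo-standby-cluster-target-0 -- \
  patronictl edit-config
```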

You will see the effect in the pod logs right away. It should look something like this:

2023-10-08 12:44:27,786 INFO: no action. I am (postgres-demo-standby-cluster-target-0), the standby leader with the lock
2023-10-08 12:44:37,515 INFO: Lock owner: postgres-demo-standby-cluster-target-0; I am postgres-demo-standby-cluster-target-0
2023-10-08 12:44:37,569 INFO: promoted self to leader because I had the session lock
2023-10-08 12:44:37,573 INFO: Changed tcp_keepalives_idle from 0 to 900
2023-10-08 12:44:37,573 INFO: Changed tcp_keepalives_interval from 0 to 100
2023-10-08 12:44:37,575 INFO: Reloading PostgreSQL configuration.
server promoting
2023-10-08 12:44:37,577 INFO: cleared rewind state after becoming the leader
server signaled
2023-10-08 12:44:38,789 INFO: no action. I am (postgres-demo-standby-cluster-target-0), the leader with the lock
2023-10-08 12:44:47.493 UTC [36] LOG Starting pgqd 3.5
2023-10-08 12:44:47.493 UTC [36] LOG auto-detecting dbs ...
2023-10-08 12:44:49,454 INFO: no action. I am (postgres-demo-standby-cluster-target-0), the leader with the lock

Your former standby cluster is now detached from the primary and is writable. As mentioned above, if you configured backups for all your Postgres clusters, the leader will now also write backups and WAL archives to S3. But be aware that this doesn’t happen right after the promotion. You’d better trigger a manual full backup immediately after promotion, or restart the leader pod to get a backup triggered. Otherwise, the next full backup will be written at the next scheduled backup timestamp at the earliest. This leaves a risk that you can’t restore / recover the instance between the promotion and the first full backup.
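One way to trigger that manual full backup is to call Spilo’s backup script inside the new leader pod. The script path and data directory below are Spilo defaults and may differ in your image, so verify them before relying on this:

```shell
# Trigger an immediate WAL-G base backup inside the promoted leader pod
# (script path and data directory are Spilo defaults; verify in your image)
kubectl exec -n postgres postgres-demo-standby-cluster-target-0 -- \
  su postgres -c "envdir /run/etc/wal-e.d/env /scripts/postgres_backup.sh /home/postgres/pgdata/pgroot/data"
```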

As written above, there is no automated way to fail over to / promote a standby cluster. This could be an issue, depending on your disaster recovery plan. It’s not reasonable, when a region / datacenter site is lost, to promote hundreds of Postgres clusters by hand.

No way back?

This is where things start to feel a bit “not thought through” by Zalando, in my opinion. For me, promoting a standby cluster is part of a disaster recovery scenario. As long as there is hope left for mankind, you will plan to switch the workload back at some point. As far as my research goes, this is not transparently possible. You can still build a new standby cluster at your original location, replicate everything back and then do the promotion again at the original source. But there is no automatism behind it. Also, this will be a new instance which starts a new backup history after promotion, so old backups will be lost or orphaned. This is also not solvable automatically, since the primary doesn’t know about a standby.


Update 07.05.2024: Zalando introduced a change in Operator version 1.11.0. It enables a more automated way to do a standby cluster promotion, by simply removing the standby section from the postgresql CRD. This leads the Operator to remove the corresponding configuration in Patroni, and the standby does a role change to primary. I haven’t tested this change yet, so I can’t tell for sure, but Zalando states in the README that it will also remove the STANDBY_* configuration from the pod. I would assume that it will be added again after a pod restart, so removing it from the CRD or configmap will still be your todo.
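With 1.11.0 or newer, the promotion should then boil down to removing the standby section from the CRD, for example with a JSON patch. I haven’t verified this myself; the cluster name and namespace below are from this demo:

```shell
# Remove the standby section from the postgresql CRD; with operator >= 1.11.0
# this should trigger the promotion (cluster name/namespace from this demo)
kubectl patch postgresql postgres-demo-standby-cluster-target -n postgres \
  --type=json -p '[{"op": "remove", "path": "/spec/standby"}]'
```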
