Restore and Clone from S3 – Configure Zalando Postgres Operator Restore with WAL-G

In an earlier blog post, I wrote about backing up Postgres to S3 storage using WAL-G. This time we take a look at either cloning or restoring a Postgres instance "in place" from these backups, all of it using the Zalando Postgres Operator. If you haven't done so yet, please read the backup post before proceeding with this one.

There are several scenarios for cloning from an earlier or already running Postgres instance using the Zalando Postgres Operator; they are documented here. We take a deeper look at two of these scenarios, both of which restore from WAL-G backups. Just for the record: if you clone a running Postgres instance using the same manifest syntax shown below, in the same namespace as the running cluster, the operator will clone via pg_basebackup directly from the existing cluster instead of using your S3 backups.

One general prerequisite for creating clones with WAL-G is to specify the S3 bucket configuration for cloning. We did this for the backup in the blog post named above, but it has to be done for cloning as well. This is done by adding the same environment variables you specified for the backup, but with a CLONE_ prefix. I did so via a pod-config configmap, whose variables get added to all Postgres clusters managed by the Zalando Operator (see here). You can also specify all of these settings in the postgresql CRD. If you do so, here is the list of variables you probably need to set in order to get cloning to work:

YAML
  CLONE_USE_WALG_RESTORE: "true"
  CLONE_AWS_ACCESS_KEY_ID: postgresql
  CLONE_AWS_SECRET_ACCESS_KEY: supersecret
  CLONE_AWS_ENDPOINT: http://minio.home.lab:9000 # Endpoint URL to your S3 Endpoint; MinIO in this example
  CLONE_AWS_REGION: de01
  CLONE_METHOD: CLONE_WITH_WALE
  CLONE_WAL_BUCKET_SCOPE_PREFIX: ""
  CLONE_WAL_S3_BUCKET: postgresql
  CLONE_AWS_S3_FORCE_PATH_STYLE: "true"
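Once the configmap (or CRD) carries these settings, you can verify that they actually reach the cluster pods. A quick check (the pod name is the one from the demo cluster used later in this post; adjust it to yours):

```shell
# List all CLONE_-prefixed variables inside the Spilo pod; the pod name
# postgres-demo-cluster-0 is an example and depends on your cluster name.
kubectl exec -n postgres postgres-demo-cluster-0 -- env | grep '^CLONE_'
```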

Inplace restore

When speaking of an inplace restore, we mean restoring a Postgres instance to an earlier point in time while keeping the instance name the same. You cannot overwrite a running cluster, so you first need to delete it and then recreate it with the clone configuration shown below. You can also change some of the configuration during an inplace restore, but I would advise you not to do so; it's already a "risky" process. So if you're unsure about the specific point in time, or you want to try it out without deleting the actual instance, use the clone restore method described in the next chapter.

Let's assume we've created a Postgres cluster named postgres-demo-cluster with the following definition (you can also find the manifest in my GitHub repository):

demo-cluster.yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: postgres-demo-cluster
  namespace: postgres
spec:
  teamId: "postgres"
  volume:
    size: 2Gi
  numberOfInstances: 2
  users:
    demouser:  # database owner
    - superuser
    - createdb
  databases:
    demo: demo_owner  # dbname: owner
  preparedDatabases:
    demo: {}
  postgresql:
    version: "14"

Ensure that you have a valid full backup and recent WAL archives in your S3 bucket covering the point in time you want to restore the instance to. Then delete the cluster from your Kubernetes environment.

kubectl delete postgresqls -n postgres postgres-demo-cluster
postgresql.acid.zalan.do "postgres-demo-cluster" deleted

Wait for the cluster to be removed. If you then check the S3 bucket again, you will see that a final WAL archive is created during the shutdown / termination of the instance.
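If you want to verify this, you can list the newest WAL segments in the bucket. A sketch using the AWS CLI with the MinIO endpoint from this setup (the wal_005/ subfolder is where WAL-G stores WAL segments under the prefix; adjust bucket, path, and endpoint to your environment):

```shell
# Show the five most recent WAL archives of the demo cluster; endpoint,
# bucket and path match the example setup from the backup post.
aws s3 ls s3://postgresql/spilo/postgres-demo-cluster/wal/14/wal_005/ \
  --endpoint-url http://minio.home.lab:9000 | sort -k1,2 | tail -n 5
```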

So now let’s update the CRD manifest and add a clone section to it (again, you can find the manifest also here).

demo-cluster-inplace-restore.yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: postgres-demo-cluster
  namespace: postgres
spec:
  teamId: "postgres"
  volume:
    size: 2Gi
  numberOfInstances: 2
  users:
    demouser:  # database owner
    - superuser
    - createdb
  databases:
    demo: demo_owner  # dbname: owner
  preparedDatabases:
    demo: {}
  postgresql:
    version: "14"
  clone:
    cluster: "postgres-demo-cluster"  # Inplace restore when having the same cluster name as the source
    timestamp: "2022-05-03T19:50:00+00:00"  # timezone required (offset relative to UTC, see RFC 3339 section 5.6)
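The timestamp must carry an explicit UTC offset (RFC 3339). If you want to derive such a value on the command line, a small sketch (assumes GNU date):

```shell
# Build a recovery target timestamp 10 minutes in the past, in the
# RFC 3339 format the clone section expects (explicit +00:00 offset).
RECOVERY_TS=$(date -u -d '10 minutes ago' '+%Y-%m-%dT%H:%M:%S+00:00')
echo "$RECOVERY_TS"
```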

If you've read the official documentation, you might wonder why I don't use the uid parameter in the clone section. It's because in my backup setup, I set WAL_BUCKET_SCOPE_PREFIX and WAL_BUCKET_SCOPE_SUFFIX both to an empty string. This leaves the cluster uid out of the S3 backup path, which makes it more straightforward in my opinion.
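To illustrate what this does to the layout: with prefix and suffix empty, the WAL prefix is composed purely of bucket, cluster name, and Postgres major version (the variable names below are taken from this demo setup):

```shell
# Sketch: how the S3 prefix is composed when WAL_BUCKET_SCOPE_PREFIX and
# WAL_BUCKET_SCOPE_SUFFIX are empty.
WAL_S3_BUCKET="postgresql"
SCOPE="postgres-demo-cluster"   # the cluster name
PGVERSION="14"
echo "s3://${WAL_S3_BUCKET}/spilo/${SCOPE}/wal/${PGVERSION}/"
# -> s3://postgresql/spilo/postgres-demo-cluster/wal/14/
# With a non-empty prefix/suffix, the cluster uid would become part of the
# scope, and the clone section would need the uid parameter to match it.
```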

Let's apply the updated manifest (kubectl apply -f demo-cluster-inplace-restore.yaml) and check the logs of the Postgres pods. You will find something like this:

...
2022-05-03 20:05:04,811 - bootstrapping - INFO - Configuring wal-e
2022-05-03 20:05:04,811 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2022-05-03 20:05:04,812 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2022-05-03 20:05:04,812 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2022-05-03 20:05:04,812 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2022-05-03 20:05:04,812 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2022-05-03 20:05:04,813 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2022-05-03 20:05:04,813 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_REGION
2022-05-03 20:05:04,813 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2022-05-03 20:05:04,813 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2022-05-03 20:05:04,813 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2022-05-03 20:05:04,813 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2022-05-03 20:05:04,814 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2022-05-03 20:05:04,814 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_BACKUP
2022-05-03 20:05:04,814 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2022-05-03 20:05:04,814 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2022-05-03 20:05:04,814 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2022-05-03 20:05:04,815 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2022-05-03 20:05:04,815 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2022-05-03 20:05:04,815 - bootstrapping - INFO - Configuring pam-oauth2
2022-05-03 20:05:04,815 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2022-05-03 20:05:04,815 - bootstrapping - INFO - Configuring certificate
2022-05-03 20:05:04,815 - bootstrapping - INFO - Generating ssl self-signed certificate
2022-05-03 20:05:04,847 - bootstrapping - INFO - Configuring bootstrap
2022-05-03 20:05:04,847 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALE_S3_PREFIX
2022-05-03 20:05:04,848 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALG_S3_PREFIX
2022-05-03 20:05:04,848 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_ACCESS_KEY_ID
2022-05-03 20:05:04,848 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_SECRET_ACCESS_KEY
2022-05-03 20:05:04,848 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALE_S3_ENDPOINT
2022-05-03 20:05:04,848 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_ENDPOINT
2022-05-03 20:05:04,848 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_REGION
2022-05-03 20:05:04,849 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALE_DISABLE_S3_SSE
2022-05-03 20:05:04,849 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALG_DISABLE_S3_SSE
2022-05-03 20:05:04,849 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_S3_FORCE_PATH_STYLE
2022-05-03 20:05:04,849 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/USE_WALG_BACKUP
2022-05-03 20:05:04,849 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/USE_WALG_RESTORE
2022-05-03 20:05:04,849 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALE_LOG_DESTINATION
2022-05-03 20:05:04,849 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/TMPDIR
2022-05-03 20:05:06,051 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-05-03 20:05:06,065 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-05-03 20:05:06,067 INFO: Lock owner: None; I am postgres-demo-cluster-0
2022-05-03 20:05:06,097 INFO: trying to bootstrap a new cluster
2022-05-03 20:05:06,098 INFO: Running custom bootstrap script: envdir "/run/etc/wal-e.d/env-clone-postgres-demo-cluster" python3 /scripts/clone_with_wale.py --recovery-target-time="2022-05-03T19:50:00+00:00"
2022-05-03 20:05:06,230 INFO: Trying s3://postgresql/spilo/postgres-demo-cluster/wal/14/ for clone
2022-05-03 20:05:06,290 INFO: cloning cluster postgres-demo-cluster using wal-g backup-fetch /home/postgres/pgdata/pgroot/data base_000000010000000000000004
INFO: 2022/05/03 20:05:06.313280 Selecting the backup with name base_000000010000000000000004...
INFO: 2022/05/03 20:05:06.409909 Finished decompression of part_003.tar.lz4
INFO: 2022/05/03 20:05:06.409930 Finished extraction of part_003.tar.lz4
2022-05-03 20:05:16,568 INFO: Lock owner: None; I am postgres-demo-cluster-0
2022-05-03 20:05:16,568 INFO: not healthy enough for leader race
2022-05-03 20:05:16,587 INFO: bootstrap in progress
...

There are two things worth mentioning. In the upper part of the log, you can see that the Postgres Operator sets some environment variables in the envdir /run/etc/wal-e.d/env-clone-postgres-demo-cluster/. These are the settings WAL-G needs to find the backups of your source instance (where source and target are the same in this scenario). The second thing is the actual restore that takes place: first the line with our specified target time for the point-in-time restore, and then the line showing that it has found the backup of our source database.

When you shell into your database pod and issue a patronictl list command, you can also see that the timeline (the incarnation of the instance) has increased. We took the backup on the 1st timeline; after the restore has finished, we're now on timeline 3.

root@postgres-demo-cluster-0:/home/postgres# patronictl list
+-------------------------+--------------+---------+---------+----+-----------+
| Member                  | Host         | Role    | State   | TL | Lag in MB |
+-------------------------+--------------+---------+---------+----+-----------+
| postgres-demo-cluster-0 | 10.244.1.11  | Leader  | running |  3 |           |
...

Let's proceed to the next scenario. This time, we clone to a new instance / cluster in a different namespace.

Clone restore

This time, our manifest for cloning postgres-demo-cluster looks like this. It's mostly the same as in the first scenario, except that we use a different namespace and a different cluster name. (Again, you can find the manifest here.)

demo-cluster-clone.yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: postgres-demo-cluster-clone
  namespace: postgres-clone
spec:
  teamId: "postgres"
  volume:
    size: 2Gi
  numberOfInstances: 1
  users:
    demouser:  # database owner
    - superuser
    - createdb
  databases:
    demo: demo_owner  # dbname: owner
  preparedDatabases:
    demo: {}
  postgresql:
    version: "14"
  clone:
    cluster: "postgres-demo-cluster"  # Source instance name; Instance name to clone from
    timestamp: "2022-05-03T19:50:00+00:00"  # timezone required (offset relative to UTC, see RFC 3339 section 5.6)

When doing a restore as a clone, you have more freedom in changing the configuration of your database. I would still advise you not to change too much. On the positive side: if there is a problem with the cloned restore, you can just delete it and start fresh. Your source instance also does not need to be deleted, so you can double-check first whether the clone delivers the expected result. Create the target namespace if it doesn't exist yet (kubectl create namespace postgres-clone), then apply the manifest:

Fish
kubectl apply -f demo-cluster-clone.yaml
postgresql.acid.zalan.do/postgres-demo-cluster-clone created

...
2022-05-03 20:20:04,356 - bootstrapping - INFO - Configuring bootstrap
2022-05-03 20:20:04,357 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALE_S3_PREFIX
2022-05-03 20:20:04,357 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALG_S3_PREFIX
2022-05-03 20:20:04,357 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_ACCESS_KEY_ID
2022-05-03 20:20:04,357 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_SECRET_ACCESS_KEY
2022-05-03 20:20:04,357 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALE_S3_ENDPOINT
2022-05-03 20:20:04,358 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_ENDPOINT
2022-05-03 20:20:04,358 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_REGION
2022-05-03 20:20:04,358 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALE_DISABLE_S3_SSE
2022-05-03 20:20:04,358 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALG_DISABLE_S3_SSE
2022-05-03 20:20:04,358 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_S3_FORCE_PATH_STYLE
2022-05-03 20:20:04,358 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/USE_WALG_BACKUP
2022-05-03 20:20:04,359 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/USE_WALG_RESTORE
2022-05-03 20:20:04,359 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALE_LOG_DESTINATION
2022-05-03 20:20:04,359 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/TMPDIR
2022-05-03 20:20:04,359 - bootstrapping - INFO - Configuring certificate
2022-05-03 20:20:04,359 - bootstrapping - INFO - Generating ssl self-signed certificate
2022-05-03 20:20:04,417 - bootstrapping - INFO - Configuring standby-cluster
2022-05-03 20:20:04,417 - bootstrapping - INFO - Configuring log
2022-05-03 20:20:04,417 - bootstrapping - INFO - Configuring pgbouncer
2022-05-03 20:20:04,417 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2022-05-03 20:20:04,418 - bootstrapping - INFO - Configuring crontab
2022-05-03 20:20:04,418 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2022-05-03 20:20:04,431 - bootstrapping - INFO - Configuring pam-oauth2
2022-05-03 20:20:04,431 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2022-05-03 20:20:04,431 - bootstrapping - INFO - Configuring wal-e
2022-05-03 20:20:04,431 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2022-05-03 20:20:04,431 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2022-05-03 20:20:04,432 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2022-05-03 20:20:04,432 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2022-05-03 20:20:04,432 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2022-05-03 20:20:04,432 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2022-05-03 20:20:04,432 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_REGION
2022-05-03 20:20:04,432 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2022-05-03 20:20:04,432 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2022-05-03 20:20:04,432 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2022-05-03 20:20:04,433 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2022-05-03 20:20:04,433 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2022-05-03 20:20:04,433 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_BACKUP
2022-05-03 20:20:04,433 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2022-05-03 20:20:04,433 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2022-05-03 20:20:04,433 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2022-05-03 20:20:04,433 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2022-05-03 20:20:04,434 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2022-05-03 20:20:05,608 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-05-03 20:20:05,620 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-05-03 20:20:05,623 INFO: Lock owner: None; I am postgres-demo-cluster-clone-0
2022-05-03 20:20:05,643 INFO: trying to bootstrap a new cluster
2022-05-03 20:20:05,644 INFO: Running custom bootstrap script: envdir "/run/etc/wal-e.d/env-clone-postgres-demo-cluster" python3 /scripts/clone_with_wale.py --recovery-target-time="2022-05-03T19:50:00+00:00"
2022-05-03 20:20:05,776 INFO: Trying s3://postgresql/spilo/postgres-demo-cluster/wal/14/ for clone
2022-05-03 20:20:05,884 INFO: cloning cluster postgres-demo-cluster-clone using wal-g backup-fetch /home/postgres/pgdata/pgroot/data base_000000010000000000000004
INFO: 2022/05/03 20:20:05.906757 Selecting the backup with name base_000000010000000000000004...
INFO: 2022/05/03 20:20:05.989340 Finished decompression of part_003.tar.lz4
INFO: 2022/05/03 20:20:05.989414 Finished extraction of part_003.tar.lz4
INFO: 2022/05/03 20:20:15.138568 Finished decompression of part_001.tar.lz4
INFO: 2022/05/03 20:20:15.138590 Finished extraction of part_001.tar.lz4
INFO: 2022/05/03 20:20:15.153621 Finished decompression of pg_control.tar.lz4
INFO: 2022/05/03 20:20:15.153764 Finished extraction of pg_control.tar.lz4
INFO: 2022/05/03 20:20:15.154067
Backup extraction complete.
...

Again, you will find much the same output as before. Shell into the database pod and check the cluster status with patronictl list.

root@postgres-demo-cluster-clone-0:/home/postgres# patronictl list
+ Cluster: postgres-demo-cluster-clone (7093595735000756305) ----+----+-----------+
| Member                        | Host        | Role   | State   | TL | Lag in MB |
+-------------------------------+-------------+--------+---------+----+-----------+
| postgres-demo-cluster-clone-0 | 10.244.1.13 | Leader | running |  4 |           |
+-------------------------------+-------------+--------+---------+----+-----------+

All seems fine. One more word regarding the cloned instance: if you've followed my backup setup instructions from this blog post, the cloned instance will take a backup right after the restore. Whether this is desired behavior depends entirely on your use case, so keep it on your list.
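If you want to check whether that immediate backup happened, list the basebackups under the clone's own prefix (the basebackups_005/ subfolder is where WAL-G places basebackups; adjust names and endpoint to your environment):

```shell
# List basebackups the clone has written under its own prefix; the path
# mirrors the source layout, but with the clone's cluster name.
aws s3 ls --recursive s3://postgresql/spilo/postgres-demo-cluster-clone/wal/14/basebackups_005/ \
  --endpoint-url http://minio.home.lab:9000 | grep 'base_'
```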

Philip

41 thoughts on “Restore and Clone from S3 – Configure Zalando Postgres Operator Restore with WAL-G”

  1. One question… On restore, I need to set a timestamp. Does this need to be the exact timestamp for an existing backup, or something relative to a possible backup? How can I know the exact timestamp once the deployment was destroyed, for example?

    1. Philip

      Hello Michael,

      the timestamp is a point in time between the oldest existing backup and the newest WAL archive, so it doesn't need to be the exact timestamp of a basebackup. There are some issues when you want to restore to a timeline before the current one, but this topic is addressed and actively being worked on.

      In case you have lost the complete deployment, the only way to know which backups exist is to check the S3 bucket, or to recreate the deployment as a single instance with the same name. What will happen then is that a new timeline gets created, but you can then check the existing backups with the wal-g commands. (At least if you followed the instructions in the blog post and unset prefix and suffix.)
      I would suggest you check the S3 storage, though.

      Philip

  2. Hello, I think I did everything correctly, but when I try the clone restore, it fails with a credentials error. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY don't appear in the env-clone- directory. Backup works correctly. Is there any additional configuration that needs to be done, perhaps?

    1. Philip

      Hello Alexander,

      Does the env dir for the clone exist? If yes, can you provide an ls -l of the directory?
      Have you ensured that you set the CLONE_* properties in the configmap according to the article?
      Are the env variables set correctly? Can you provide the output of env | grep -i clone (please hide the actual values when you do)?

      Kind regards
      Philip

  3. Mr. TheDatabaseMe, good day. I want to ask you a question. Could you provide me any advice on how to backup and restore a very sizable PostgreSQL database running on k8s cluster? 11 TB is the current size. There isn’t any initial backup for it. I appreciate your good article.

    1. Philip

      Hello LukaPham,
      this depends a bit on which container image and / or Kubernetes operator (like the Zalando Postgres Operator) you're using. These normally come with one or two backup tools which should be used.
      In general, a database of this size is hard to back up using a conventional backup method (like pg_basebackup, WAL-G…). With WAL-G you can write incremental basebackups, which speeds up the backups after the initial one. But restoring 11 TB will easily take multiple hours, so I'm not sure whether this is reasonable for your needs.

      Another possible way of backing up Postgres is to use storage snapshots. But this can be hard on Kubernetes for two reasons:
      1. Your storage provider needs to support snapshots.
      2. If you are using an operator (like Zalando) or cluster software included in your container image (like Patroni in the Spilo images used by Zalando), a restore may be hard to implement. You probably have to "patch" some code in these components because there is no implementation of snapshot restores.

      For me, a snapshot backup would still be the only reasonable way to go. But you need to ensure that you also back up WAL archives, else you will not be able to do a point-in-time restore, and it may also conflict with bootstrapping the container or your operator. Without having tried it, I would test "restoring" a storage snapshot to the original volume (Kubernetes PV); in theory, the database should do a crash recovery (if all WAL information is also part of the snapshot) and be back up at the point in time you took the snapshot.
      I know of some Kubernetes components which act as a storage provider in your cluster and also support snapshots (e.g. Rancher Longhorn). I would consider this if your underlying storage does not support snapshots.

      More information on using storage snapshots in combination with Postgres can be found here.

      Hope this helps.
      Kind regards
      Philip

  4. Charles Thomas Johnson

    In the article you mention that "the cloned instance will do a backup right after the restore". Is there a way to configure the clone not to do this? I found that this causes issues if you want to clone a live database which may be written to during the cloning process; this causes a fork in timelines, which in turn prevents me from automatically creating another clone.

    1. Philip

      Hello,

      yes, this is possible by not setting the needed WAL* env vars for the clone instance. But then you need to specifically unset them all in the `postgresql` custom resource if you set them via a configmap like in this tutorial.
      This prevents the backup, because the pod can't connect to any S3 bucket.

      Hope this helps.
      Philip

  5. hello, I have a problem: for some reason the backup does not want to be restored if the bucket is located in another region. I have the variable AWS_REGION: eu-central-1, but it seems to get ignored. Any ideas?

    ERROR: 2023/03/09 17:52:27.978381 failed to list s3 folder: '*********/12/basebackups_005/': BucketRegionError: incorrect region, the bucket is not in 'eu-west-1' region at endpoint ''

  6. TheDatabaseMe
    I found the reason: in the configmap (or in the secret) it is necessary to specify these variables for the clone:
    CLONE_USE_WALG_RESTORE: "true"
    CLONE_AWS_REGION: "REGION"
    CLONE_WALG_S3_PREFIX: "s3://YOUR-BUCKET/$(CLONE_SCOPE)/$(PGVERSION)"

  7. TheDatabaseMe
    Do you have any idea how I can create multiple configmaps (secrets) for different clusters using one Zalando pg operator?

    Thank you very much <3

    1. Philip

      Hi,

      you can do so by specifying the parameter pod_environment_secret within your postgres-operator configmap.
      This leads to the following: the Zalando Operator will add all env vars specified within this secret to the Postgres cluster pods.
      You have to create separate namespaces for every Postgres cluster, and the specified secret must be located within that namespace.
      Be aware that having the same env var specified within both a pod_environment_configmap and the pod_environment_secret will lead to a conflict.

      Another option is to specify the dedicated env vars for every cluster within the actual postgresql CRD, e.g.:


      env:
      - name: BACKUP_SCHEDULE
        value: '0 2 * * *'

  8. Hi!
    I did everything like you listed in this post and the one before regarding backup (at least I hope so).
    My Postgres cluster is up and running, creating a backup into S3 every day.
    The problem came when I tried to restore it with the first method you listed: delete the whole cluster in the same namespace and reapply it.
    You mentioned that the uid would not be necessary, but it seems like my secondary Postgres node can't bootstrap from the backup because of some ID mismatch.

    apiVersion: acid.zalan.do/v1
    kind: postgresql
    metadata:
      name: paas-db-cluster
    spec:
      postgresql:
        version: "14"
        parameters: # Expert section
          password_encryption: scram-sha-256
          shared_buffers: "32MB"
          max_connections: "100"
          log_statement: "all"
      env:
      - name: WAL_S3_BUCKET
        value: "postgres"
      - name: WAL_BUCKET_SCOPE_PREFIX
        value: ""
      - name: WAL_BUCKET_SCOPE_SUFFIX
        value: ""
      - name: USE_WALG_BACKUP
        value: "true"
      - name: USE_WALG_RESTORE
        value: "true"
      - name: BACKUP_SCHEDULE
        value: '00 04 * * *'
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            key: id
            name: hcp-credentials
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            key: key
            name: hcp-credentials
      - name: AWS_ENDPOINT
        value: "https://*****-sandbox.hcp.intra"
      - name: AWS_S3_FORCE_PATH_STYLE
        value: "false"
      - name: WALG_DISABLE_S3_SSE
        value: "true"
      - name: WALG_S3_CA_CERT_FILE
        value: "/tlsca/MT_CA_FILE"
      clone:
        cluster: "paas-db-cluster"
        timestamp: "2023-05-18T13:50:33+02:00"

    And the error that the secondary Postgres node gives:
    2023-05-19 06:32:26,311 INFO: No PostgreSQL configuration items changed, nothing to reload.
    2023-05-19 06:32:26,353 CRITICAL: system ID mismatch, node paas-harbor-db-cluster-1 belongs to a different cluster: 7234777133454319697 != 7234523812773580881

    The cluster is up and running; it's the secondary, paas-harbor-db-cluster-1, that can't bootstrap.

    Any idea what I did wrong?
    Thanks!

    1. Philip

      Hello,
      but your leader pod got restored as it should? Can you share the Spilo image version and the Operator version you're using?

      As mentioned in the article, an inplace restore is quite tricky. What seems to have happened in your case: the secondary node got bootstrapped as a new cluster, so it didn't take over the same Patroni cluster ID as the primary. So it can't bootstrap itself from the leader, because it considers it a different cluster.

      To fix this situation, you can force the secondary to be recreated by deleting the PVC of the secondary pod (so the volume), then deleting the pod itself. It will come back up and get recreated (with a new volume). There will be no data left when it comes back, so it will bootstrap freshly either from the leader or from the latest backup.

      Please check whether backups are getting written correctly by your restored cluster in general. If something seems fishy to you, I would recommend using the second scenario (restore as clone) from the article.

      Kind regards
      Philip

      1. Thanks for the very fast reply! 🙂

        Spilo: spilo-14:2.1-p6
        Operator: postgres-operator:v1.10.0

        I was testing this out today in our sandbox cluster.
        Your suggestion fixes the issue with the ID mismatch: I deleted the Postgres cluster after I loaded some data into it and it wrote the first backup into S3, then I reapplied the postgres.yaml just like before. I got the ID mismatch, deleted the PVC, then the secondary pod, and the new pod joined the Postgres cluster as a secondary.
        The problem came when I checked and no real data had been loaded from the backup. So my ID problem seems to be fixed by the workaround, but the restore doesn't work as intended when trying the in-place restore.
        It would be nice if it worked though, because we use Postgres as the DB for a Harbor image repository. They are both located in the same namespace, so restore as clone wouldn't be a viable option for us, as they need to be in the same NS.

        Master Postgres log after deletion and recreation:
        2023-05-19 11:44:11.910 - /scripts/postgres_backup.sh - producing a new backup
        INFO: 2023/05/19 11:44:12.105522 Calling pg_start_backup()
        INFO: 2023/05/19 11:44:12.403942 Starting a new tar bundle
        INFO: 2023/05/19 11:44:12.404031 Walking ...
        INFO: 2023/05/19 11:44:12.404663 Starting part 1 ...
        2023-05-19 11:44:12.810 46 LOG Starting pgqd 3.3
        2023-05-19 11:44:12.810 46 LOG auto-detecting dbs ...
        INFO: 2023/05/19 11:44:13.215011 Packing ...
        INFO: 2023/05/19 11:44:13.223221 Finished writing part 1.
        INFO: 2023/05/19 11:44:13.665593 Starting part 2 ...
        INFO: 2023/05/19 11:44:13.665658 /global/pg_control
        INFO: 2023/05/19 11:44:13.667669 Finished writing part 2.
        INFO: 2023/05/19 11:44:13.667718 Calling pg_stop_backup()
        INFO: 2023/05/19 11:44:14.690746 Starting part 3 ...
        INFO: 2023/05/19 11:44:14.707151 backup_label
        INFO: 2023/05/19 11:44:14.707177 tablespace_map
        INFO: 2023/05/19 11:44:14.707222 Finished writing part 3.
        INFO: 2023/05/19 11:44:14.961771 Wrote backup with name base_000000010000000000000002

        Secondary Postgres logs:
        2023-05-19 11:48:23,327 INFO: trying to bootstrap from leader 'paas-db-cluster-0'
        Current cluster size: 26814537
        Wals generated since the last backup: 16933064
        not restoring from backup because of amount of generated wals exceeds 30% of cluster_size
        2023-05-19 11:48:23,788 ERROR: Error creating replica using method wal_e: envdir /run/etc/wal-e.d/env bash /scripts/wale_restore.sh exited with code=1
        1024+0 records in
        1024+0 records out
        16777216 bytes (17 MB, 16 MiB) copied, 0.0149227 s, 1.1 GB/s
        NOTICE: all required WAL segments have been archived
        2023-05-19 11:48:31,188 INFO: replica has been created using basebackup_fast_xlog
        2023-05-19 11:48:31,191 INFO: bootstrapped from leader 'paas-harbor-db-cluster-0'
        2023-05-19 11:48:31 UTC [117]: [1-1] 6467620f.75 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
        2023-05-19 11:48:31 UTC [117]: [2-1] 6467620f.75 0 LOG: pg_stat_kcache.linux_hz is set to 199999
        2023-05-19 11:48:31,653 INFO: postmaster pid=117
        2023-05-19 11:48:31 UTC [117]: [3-1] 6467620f.75 0 LOG: redirecting log output to logging collector process
        2023-05-19 11:48:31 UTC [117]: [4-1] 6467620f.75 0 HINT: Future log output will appear in directory "../pg_log".
        /var/run/postgresql:5432 - no response
        /var/run/postgresql:5432 - rejecting connections
        /var/run/postgresql:5432 - rejecting connections
        /var/run/postgresql:5432 - accepting connections

        Best,
        Martin

        1.

          Hello Martin,

          OK, this makes sense to me now. I've checked the postgresql CRD from your first comment and saw that you haven't specified the env vars needed for cloning: all env vars you set to get the backup running (e.g. WAL_S3_BUCKET) also need to be specified with a CLONE_ prefix.

          So here is my assumption of what happened. You deleted and recreated the postgresql CRD. The cluster spun up. You had specified a clone section within the CRD, but without the CLONE_ env vars set, no restore actually happened, so the primary came up as a brand new cluster, just with the old name. Now something nasty happens: since the new cluster has the same settings for how to make backups and where to store them, the primary created a backup (with a new cluster ID) on timeline ID 1, maybe even overwriting existing backups of the old instance if there were any on that timeline.
          Then the secondary pod comes up and needs to be bootstrapped, either from the leader or from a backup. Since the primary had created a backup, the secondary chose to bootstrap from the backup with the highest timeline ID and timestamp it found on S3 (which are the backups of your original instance). It tried to bootstrap from those, but found that they belong to another instance ID / cluster (the original one) which differs from the new leader's instance ID. So it couldn't bootstrap.

          To prevent this mess with your backups, Spilo by default uses a prefix and suffix in the "path" of the S3 backups. This ensures that no backups get overwritten by a new instance with the same name, but it makes cloning harder (because you need to remember and save the original UID of the postgresql cluster).

          So I would now recommend double-checking that at least one timeline / timestamp still exists on S3 and hasn't been overwritten, or (since you mentioned it's a sandbox system) starting from scratch. Ensure you have set all the needed CLONE_ env vars. For the next try, before you do an in-place restore, try to restore as a real clone with a different name. This ensures you have the settings right. When this works as "restore as clone", then proceed to restore in-place.

          I double-checked my blog post, and I really haven't written about the CLONE_ variables. I assumed that people would use my backup configuration from my GitHub repo, where I set all env vars within one configmap which is used as an "overlay" by all Postgres instances deployed by the Zalando Operator. I will update the blog post to at least mention this configmap specifically. You can find the configmap (and therefore all env vars you need to set in your postgresql CRD) here.
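
          To make this concrete, here is a minimal sketch of how such a clone manifest could look. The cluster and bucket names are placeholders, not taken from your setup, and only a few of the CLONE_ variables are shown:

          ```yaml
          apiVersion: "acid.zalan.do/v1"
          kind: postgresql
          metadata:
            name: acid-minimal-cluster-clone
          spec:
            # ... teamId, volume, numberOfInstances etc. as usual ...
            env:
              # every WAL-G env var used for backup, repeated with a CLONE_ prefix
              - name: CLONE_USE_WALG_RESTORE
                value: "true"
              - name: CLONE_WAL_S3_BUCKET
                value: "postgresql"
              - name: CLONE_METHOD
                value: "CLONE_WITH_WALE"
              # ... plus the CLONE_AWS_* credentials and endpoint ...
            clone:
              cluster: "acid-minimal-cluster"           # name of the source cluster
              timestamp: "2023-05-22T10:50:33+02:00"    # point in time to restore to
          ```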

          Kind regards
          Philip

  9. Hi!
    Thanks for the great advice again!
    I think I was able to set up the cloning, but it doesn't spin up from the backup; it just gets stuck bootstrapping.
    What I did:
    – set up Postgres in namespace A (db-cluster) with backup variables
    – made sure there was a backup created on my S3
    – set up Postgres in namespace B (db-cluster-clone) with the now-correct env variables for cloning
    postgres logs in namespace B:
    2023-05-22 09:16:59,816 INFO: No PostgreSQL configuration items changed, nothing to reload.
    2023-05-22 09:16:59,820 INFO: Lock owner: None; I am paas-harbor-db-cluster-clone-0
    2023-05-22 09:16:59,869 INFO: trying to bootstrap a new cluster
    2023-05-22 09:16:59,870 INFO: Running custom bootstrap script: envdir "/run/etc/wal-e.d/env-clone-paas-harbor-db-cluster" python3 /scripts/clone_with_wale.py --recovery-target-time="2023-05-22T10:50:33+02:00"
    2023-05-22 09:17:00,243 INFO: Trying s3://postgres-harbor/spilo/paas-harbor-db-cluster/wal/14/ for clone
    2023-05-22 09:17:15,321 INFO: Lock owner: None; I am paas-harbor-db-cluster-clone-0
    2023-05-22 09:17:15,322 INFO: not healthy enough for leader race
    2023-05-22 09:17:15,350 INFO: bootstrap in progress

    And this bootstrap goes on for forever.

    postgres env variables in namespace B:
    - name: WAL_S3_BUCKET
      value: "postgres-harbor"
    - name: WAL_BUCKET_SCOPE_PREFIX
      value: ""
    - name: WAL_BUCKET_SCOPE_SUFFIX
      value: ""
    - name: USE_WALG_BACKUP
      value: "true"
    - name: USE_WALG_RESTORE
      value: "true"
    - name: BACKUP_SCHEDULE
      value: '00 04 * * *'
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          key: id
          name: hcp-credentials
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: key
          name: hcp-credentials
    - name: AWS_ENDPOINT
      value: "https://*******.hcp.intra"
    - name: AWS_S3_FORCE_PATH_STYLE
      value: "false"
    - name: WALG_DISABLE_S3_SSE
      value: "true"
    - name: WALG_S3_CA_CERT_FILE
      value: "/tlsca/MT_CA_FILE"
    # - name: BACKUP_NUM_TO_RETAIN
    #   value: "5"
    - name: CLONE_USE_WALG_RESTORE
      value: "true"
    - name: CLONE_AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          key: id
          name: hcp-credentials
    - name: CLONE_AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: key
          name: hcp-credentials
    - name: CLONE_AWS_ENDPOINT
      value: "https://********.hcp.intra"
    - name: CLONE_METHOD
      value: "CLONE_WITH_WALE"
    - name: CLONE_WAL_BUCKET_SCOPE_PREFIX
      value: ""
    - name: CLONE_WAL_S3_BUCKET
      value: "postgres-harbor"
    - name: CLONE_AWS_S3_FORCE_PATH_STYLE
      value: "false"

    clone:
      cluster: "db-cluster"
      timestamp: "2023-05-22T10:50:33+02:00"

    Maybe I should provide another path to search for the backups?
    Thanks,
    Martin

    1.

      Hello Martin,

      I think something got mixed up. Can you verify that there are backups under s3://postgres-harbor/spilo/paas-harbor-db-cluster/wal/14/basebackups_005/? Also, I'm confused why it searches for a backup there at all, because you specified to clone from db-cluster, not from paas-harbor-db-cluster, which is what it is looking for.
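
      For reference, the prefix Spilo tries during a clone is assembled from the bucket variables plus the source cluster name taken from clone.cluster. Here is a rough sketch of that path construction, using the values from your manifest; the exact composition is an assumption based on the layout visible in your bootstrap log:

      ```shell
      # Rebuild the S3 prefix a clone would search, from the manifest values.
      WAL_S3_BUCKET="postgres-harbor"
      WAL_BUCKET_SCOPE_PREFIX=""
      WAL_BUCKET_SCOPE_SUFFIX=""
      CLONE_SCOPE="db-cluster"   # taken from clone.cluster
      PG_MAJOR="14"

      CLONE_PREFIX="s3://${WAL_S3_BUCKET}/spilo/${WAL_BUCKET_SCOPE_PREFIX}${CLONE_SCOPE}${WAL_BUCKET_SCOPE_SUFFIX}/wal/${PG_MAJOR}/"
      echo "${CLONE_PREFIX}"
      # → s3://postgres-harbor/spilo/db-cluster/wal/14/
      ```

      If this prints a different path than the one in the bootstrap log (as it does here, since the log searched paas-harbor-db-cluster), the clone section and the actual source cluster name don't match.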

      Kind Regards
      Philip

  10. Hi Philip!

    Sorry for the late reply; I worked on it all afternoon.
    I had some issues with TLS certificates that only got logged after 10 minutes of bootstrapping.
    I loaded the S3 CA into the container and it found the backup instantly. I tried both cloning and in-place restore; both work like a charm.
    Thanks for the great advice!

    Best,
    Martin

  11. Hi Philip!

    I have been using the Postgres backups and have made a full recovery on one of my clusters.
    It has been working fine, and I have tested multiple times what happens when one of my two nodes restarts. It always spun up again in a minute or two and worked fine.
    However, yesterday one of my secondary nodes got restarted and went into an infinite bootstrap loop.
    After an hour or so it got itself together, but it's very odd to me.
    Have you ever encountered this issue? There is a GitHub issue on this, but it didn't help at all.

    2023-07-21 09:39:42,776 INFO: Lock owner: paas-harbor-db-cluster-0; I am paas-harbor-db-cluster-1
    2023-07-21 09:39:42,777 INFO: restarting after failure in progress
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    2023-07-21 09:39:57,836 INFO: Lock owner: paas-harbor-db-cluster-0; I am paas-harbor-db-cluster-1
    2023-07-21 09:39:57,836 INFO: Still starting up as a standby.
    2023-07-21 09:39:57,838 INFO: Lock owner: paas-harbor-db-cluster-0; I am paas-harbor-db-cluster-1
    2023-07-21 09:39:57,839 INFO: establishing a new patroni connection to the postgres cluster
    2023-07-21 09:39:58,191 INFO: establishing a new patroni connection to the postgres cluster
    2023-07-21 09:39:58,194 WARNING: Retry got exception: 'connection problems'
    2023-07-21 09:39:58,196 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
    2023-07-21 09:39:58,198 INFO: no action. I am (paas-harbor-db-cluster-1), a secondary, and following a leader (paas-harbor-db-cluster-0)
    /var/run/postgresql:5432 - rejecting connections
    2023-07-21 09:40:12,793 INFO: Lock owner: paas-harbor-db-cluster-0; I am paas-harbor-db-cluster-1
    2023-07-21 09:40:12,793 INFO: Still starting up as a standby.
    2023-07-21 09:40:12,794 INFO: Lock owner: paas-harbor-db-cluster-0; I am paas-harbor-db-cluster-1
    2023-07-21 09:40:12,795 INFO: establishing a new patroni connection to the postgres cluster
    2023-07-21 09:40:13,219 INFO: establishing a new patroni connection to the postgres cluster
    2023-07-21 09:40:13,223 WARNING: Retry got exception: 'connection problems'
    2023-07-21 09:40:13,223 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
    2023-07-21 09:40:13,226 INFO: no action. I am (paas-harbor-db-cluster-1), a secondary, and following a leader (paas-harbor-db-cluster-0)
    /var/run/postgresql:5432 - rejecting connections

    Sorry for the slightly out of context comment, but I couldn’t find help anywhere else. Maybe you have seen this issue before?
    Thanks in advance,
    Martin

    1.

      Hello Martin,
      you are right, this is not a restore topic. I haven't encountered something like this myself, but I do have a guess at what it might be.

      Some questions / remarks in general:
      – Do you know why the secondary became unavailable / got restarted in the first place?
      – Do you have logs from the time when the secondary came back up in the end?
      – This is not exactly bootstrapping: the secondary is still aware that it is part of a cluster and that its last role was secondary.

      What the logs tell me is that the secondary can't connect to the primary for an unknown reason. It still knows that it was a secondary
      node and tries to get a connection to the primary node again and again. So this could just be a network issue in your Kubernetes cluster.
      Was the primary still running without any log errors / warnings, and were you still able to connect to it?

      Kind regards
      Philip

  12. Hi Philip!

    Thanks for the lightning-fast reply, and again I'm sorry for my out-of-context question.
    – the secondary got restarted because the node it was on got drained.
    – logs from when it became available again:

    2023-07-21 10:30:13,004 INFO: no action. I am (paas-harbor-db-cluster-1), a secondary, and following a leader (paas-harbor-db-cluster-0)
    2023-07-21 10:30:27,838 INFO: no action. I am (paas-harbor-db-cluster-1), a secondary, and following a leader (paas-harbor-db-cluster-0)
    2023-07-21 10:30:42,840 INFO: no action. I am (paas-harbor-db-cluster-1), a secondary, and following a leader (paas-harbor-db-cluster-0)
    2023-07-21 10:30:57,882 INFO: no action. I am (paas-harbor-db-cluster-1), a secondary, and following a leader (paas-harbor-db-cluster-0)
    2023-07-21 10:31:12,840 INFO: no action. I am (paas-harbor-db-cluster-1), a secondary, and following a leader (paas-harbor-db-cluster-0)

    I don't know of any issues with the Kubernetes networking, but I will look into it.

    Thanks,
    Martin

  13. Yes just like that.

    I suspect that it has something to do with the restore function.
    After I tested the restore on these clusters, I left the env variables and the clone section in the Postgres config. Two days ago there was a cert change on the storage we use for the backups. After this change, the clusters configured with restore stopped producing backups and only managed to start after an hour of crashlooping.

    I also have a Postgres cluster which isn't configured with restore and only makes backups, which we would use in case of a disaster.
    This cluster works fine, producing backups as it should, and it could handle Postgres pod restarts.

    So I’m gonna go into this direction, thanks a lot for the input.

    Best,
    Martin

  14. Hello @TheDatabaseMe

    Did you try to clone the cluster from the backup using IAM access instead of an access ID and key?
    I've tried it in multiple ways without success. My backup pod sees the backups:

    root@test-clone-postgresql-0:/home/postgres# envdir "/run/etc/wal-e.d/env-clone-test-postgresql/" wal-g backup-list
    name modified wal_segment_backup_start
    base_000000010000000000000002 2023-08-17T12:47:14Z 000000010000000000000002
    root@test-clone-postgresql-0:/home/postgres#

    But the cloning procedure failed with an error:
    2023-08-17 13:53:17,613 INFO: Trying s3://el-postgres-backup-int/spilo/postgresql-test-postgresql/wal/14/ for clone
    ERROR: 2023/08/17 13:53:20.815160 failed to list s3 folder: 'spilo/postgresql-test-postgresql/wal/14/basebackups_005/': AccessDenied: Access Denied
    status code: 403, request id: P9FAGGKSC9YJ7B47, host id: 9m+4YFj46reDSATMmO76kpMcpdUuzfLb9LiZlTHnYfefcNC6zZ0qIx7WQe6xSqSl9iNw2JihxGcJGYn+KqMx2w==
    2023-08-17 13:53:20,818 ERROR: Clone failed

    1.

      Hello,
      no, I haven't tried IAM for cloning.

      I found your GitHub issue at Zalando and checked that information as well. In my opinion, something is wrong in your clone section. Specifying the UID there can lead to issues, since the UID is not appended to the path during backup, so my first suggestion would be to leave it out. Also, remove the s3_wal_path parameter and try again.
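
      For illustration, a clone section reduced to that would look like this (cluster name and timestamp are placeholders):

      ```yaml
      clone:
        cluster: "test-postgresql"                # source cluster name only
        timestamp: "2023-08-17T14:00:00+02:00"    # restore target time
        # no uid and no s3_wal_path - let Spilo derive the S3 path itself
      ```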

      Kind regards
      Philip

  15. Hi Philip,
    I am fairly new to Kubernetes and followed your steps. I was able to get the backups onto S3 using WAL-G. However, I am not able to clone the cluster to a different location. I need to run bi-weekly backup/restore tests to ensure that the backups we have are OK. I am getting the warning below, and when I check my cluster, it has not restored the databases or the files inside. Was I supposed to specify the names of the databases I want restored anywhere?

    2023-09-06 11:24:01,044 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
    2023-09-06 11:24:01,100 - bootstrapping - WARNING - Cloning with WAL-E is only possible when CLONE_WALE_*_PREFIX or CLONE_WALG_*_PREFIX or CLONE_WAL_*_BUCKET and CLONE_SCOPE are set.
    2023-09-06 11:24:01,101 - bootstrapping - INFO - No meta-data available for this provider
    2023-09-06 11:24:01,101 - bootstrapping - INFO - Looks like you are running unsupported
    2023-09-06 11:24:01,146 - bootstrapping - INFO - Configuring bootstrap
    2023-09-06 11:24:01,146 - bootstrapping - INFO - Configuring certificate
    2023-09-06 11:24:01,147 - bootstrapping - INFO - Generating ssl self-signed certificate
    2023-09-06 11:24:01,293 - bootstrapping - INFO - Configuring crontab
    2023-09-06 11:24:01,293 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
    2023-09-06 11:24:01,309 - bootstrapping - INFO - Configuring pam-oauth2
    2023-09-06 11:24:01,309 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
    2023-09-06 11:24:01,310 - bootstrapping - INFO - Configuring wal-e
    2023-09-06 11:24:01,311 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
    2023-09-06 11:24:01,311 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
    2023-09-06 11:24:01,312 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
    2023-09-06 11:24:01,312 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
    2023-09-06 11:24:01,312 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
    2023-09-06 11:24:01,313 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
    2023-09-06 11:24:01,313 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_REGION
    2023-09-06 11:24:01,313 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
    2023-09-06 11:24:01,313 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
    2023-09-06 11:24:01,314 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
    2023-09-06 11:24:01,314 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
    2023-09-06 11:24:01,314 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
    2023-09-06 11:24:01,315 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_BACKUP
    2023-09-06 11:24:01,315 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
    2023-09-06 11:24:01,315 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
    2023-09-06 11:24:01,316 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
    2023-09-06 11:24:01,316 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
    2023-09-06 11:24:01,316 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
    2023-09-06 11:24:01,317 - bootstrapping - INFO - Configuring log
    2023-09-06 11:24:01,317 - bootstrapping - INFO - Configuring pgbouncer
    2023-09-06 11:24:01,317 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
    2023-09-06 11:24:01,317 - bootstrapping - INFO - Configuring patroni
    2023-09-06 11:24:01,329 - bootstrapping - INFO - Writing to file /run/postgres.yml
    2023-09-06 11:24:01,330 - bootstrapping - INFO - Configuring standby-cluster
    2023-09-06 11:24:01,330 - bootstrapping - INFO - Configuring pgqd

    1.

      Hello,

      can you please share the manifest of the postgresql cluster to be cloned?

      From the log I can see that you haven't specified the environment variables needed for cloning with WAL-G (CLONE_*).
      Keep in mind that for every WAL-G environment variable, there is also one with a CLONE_ prefix which covers the cloning.

      Kind regards
      Philip

    Thanks, Philip, for your quick response.
    I checked what you suggested and edited my pod-config.yaml as below:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: pod-config
    data:
      WAL_S3_BUCKET: postgresql
      WAL_BUCKET_SCOPE_PREFIX: ""
      WAL_BUCKET_SCOPE_SUFFIX: ""
      USE_WALG_BACKUP: "true"
      USE_WALG_RESTORE: "true"
      BACKUP_SCHEDULE: '00 10 * * *'
      AWS_ACCESS_KEY_ID: *****
      AWS_SECRET_ACCESS_KEY: ******
      AWS_S3_FORCE_PATH_STYLE: "true" # needed for MinIO
      AWS_ENDPOINT: http://****:9000 # Endpoint URL to your S3 endpoint; MinIO in this example
      AWS_REGION: nyc3
      WALG_DISABLE_S3_SSE: "true"
      BACKUP_NUM_TO_RETAIN: "5"
      CLONE_USE_WALG_RESTORE: "true"
      CLONE_AWS_ACCESS_KEY_ID: ******
      CLONE_AWS_SECRET_ACCESS_KEY: *****
      CLONE_AWS_ENDPOINT: http://***:9000 # Endpoint URL to your S3 endpoint; MinIO in this example
      CLONE_AWS_REGION: nyc3
      CLONE_METHOD: CLONE_WITH_WALE
      CLONE_WAL_BUCKET_SCOPE_PREFIX: ""
      CLONE_WAL_S3_BUCKET: postgresql
      CLONE_AWS_S3_FORCE_PATH_STYLE: "true"
    It is working now. Thank you.
    Regards,
    Liz

  17. Hey man! How are you doing? Amazing tutorial :D

    One quick question: whenever I spin up a new Zalando cluster, it fails. I'm trying to do a clone directly from cluster to cluster.

    Here’s the error:

    2023-09-06 19:24:25,024 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
    2023-09-06 19:24:25,050 INFO: No PostgreSQL configuration items changed, nothing to reload.
    2023-09-06 19:24:25,051 INFO: Lock owner: None; I am postgres-demo-cluster-0
    2023-09-06 19:24:25,181 INFO: trying to bootstrap a new cluster
    2023-09-06 19:24:25,181 INFO: Running custom bootstrap script: python3 /scripts/clone_with_basebackup.py --pgpass=/home/postgres/.pgpass_postgresql-1-backoffice-postgresql-1 --host=172.20.12.231 --port=5432 --user="standby"
    2023-09-06 19:24:25,210 INFO: cloning cluster postgres-demo-cluster from "host=172.20.12.231 port=5432 user=standby dbname=postgres"
    pg_basebackup: error: connection to server at "172.20.12.231", port 5432 failed: FATAL: password authentication failed for user "standby"
    password retrieved from file "/home/postgres/.pgpass_postgresql-1-backoffice-postgresql-1"
    connection to server at "172.20.12.231", port 5432 failed: FATAL: password authentication failed for user "standby"
    password retrieved from file "/home/postgres/.pgpass_postgresql-1-backoffice-postgresql-1"
    2023-09-06 19:24:25,275 ERROR: Clone failed

    What I wanted to ask is: do you need to create the standby user on the source database/cluster? For example, when I create this cluster, it always tries to find a secret that doesn't exist; maybe it's related to that.

    Thanks!

    1.

      Hello,
      I'm not entirely sure I understood what you want to do. Can you please share the manifests of
      the source and the clone cluster?

      In general, the standby user and also its secret are managed by the Zalando Operator and need
      to be created by it.

      Kind regards
      Philip

  18. Hi Philip,

    I am wondering why restores are streamed from the leader rather than taken from the S3 backups. Any idea how I can enforce using S3?

    S3 restore fails with following error:
    >> not restoring from backup because of amount of generated wals exceeds 30% of cluster_size
    >> ERROR: Error creating replica using method wal_e: envdir /run/etc/wal-e.d/env bash /scripts/wale_restore.sh exited with code=1

    I see THRESHOLD_PERCENTAGE=30 in /scripts/wale_restore.sh but don't understand the reasoning behind it. There's a supported parameter threshold_backup_size_percentage that could probably be used to pass a jumbo number to suppress this behaviour, but I don't know how to set it.

    Cheers
    Remo

    1.

      Hello Remo,
      you can configure the threshold for when to restore from S3 rather than create a basebackup from the leader by setting the environment variable WALE_BACKUP_THRESHOLD_PERCENTAGE in the Spilo container.
      You can do so via a pod-config configmap or by adding the env var directly to the postgresql custom resource manifest.

      The idea behind this threshold is that the more WAL has been written relative to the database size, the faster it is to restore from a fresh basebackup instead of replaying all that WAL. For very small instances, the
      default setting will almost always lead to a new basebackup being replicated, where it doesn't really matter whether a clone takes a few seconds more or less. For large databases, this approach makes total sense: think of a DB size of 500 GB, where 30% would be 150 GB of WAL alone that would have to be fetched from S3 and applied. Creating a new basebackup from the leader will probably be faster, because you save the amount of WAL that has to be transferred and replayed.
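
      A simplified model of the check wale_restore.sh performs, using the numbers from the log earlier in this thread (the real script measures the sizes itself and is more involved; this only illustrates the 30% rule):

      ```shell
      # Simplified sketch of the 30% threshold decision in wale_restore.sh.
      CLUSTER_SIZE=26814537        # bytes, "Current cluster size" from the log
      WALS_SINCE_BACKUP=16933064   # bytes, "Wals generated since the last backup"
      THRESHOLD_PERCENTAGE="${WALE_BACKUP_THRESHOLD_PERCENTAGE:-30}"

      # Restore from S3 only if the WAL volume stays below the threshold.
      if [ $((WALS_SINCE_BACKUP * 100)) -gt $((CLUSTER_SIZE * THRESHOLD_PERCENTAGE)) ]; then
        METHOD="basebackup_from_leader"
      else
        METHOD="restore_from_s3"
      fi
      echo "$METHOD"
      # → basebackup_from_leader (roughly 63% of the cluster size in WAL, above 30%)
      ```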

      I hope this helps.
      Philip

        1.

          AFAIK, Patroni / Spilo always pulls from the leader. I would say this also makes sense most of the time, since you can't rely on a secondary node being in the most current state; there might be an apply gap between the primary and an already existing secondary.

  19. Hi TheDatabaseMe,

    First of all, thanks for this article; it was very helpful in getting backups configured with the Zalando operator. My colleagues and I find that the documentation is either plain woeful or written by people who assume you already know the operator somewhat, and, having never used it before, how could you?

    Anyway, we are working on an open question: is it possible to do individual database backups/restores? I haven't seen it mentioned in the documentation, and it certainly looks like backups are only cluster/instance-wide. What would your thoughts be on achieving per-DB backups? A cluster per DB? It seems wasteful, but it's k8s... they could be lean pods...

    Cheers

  20. Hey Philip,
    Thank you for these amazing articles. Could you please advise on how to migrate a zalando production database to another managed database like RDS PostgreSQL? Thank you once again.

    1.

      Hello,
      I'm sorry, but I can't give you general advice without knowing the environment and all dependencies, only some
      high-level pointers. I assume the simplest way would be using pg_dump. If you want really minimal downtime,
      you could try to set up replication to RDS PostgreSQL, but this will involve a lot of preparation and conceptual
      work first, and I can't guarantee that it will even work.

      Kind regards
      Philip
