Restore and Clone from S3 – Configure Zalando Postgres Operator Restore with WAL-G

In an earlier blog post, I wrote about backing up Postgres to S3 storage using WAL-G. This time we take a look at either cloning or restoring a Postgres instance "in place" from these backups, all of it using the Zalando Postgres Operator. If you haven't done so yet, please read the backup post before proceeding with this one.

There are several scenarios for cloning from an earlier or already running Postgres instance using the Zalando Postgres Operator; they are documented here. We take a deeper look at two of these scenarios, both of which restore from WAL-G backups. Just for the record: if you clone a running Postgres instance using the same manifest syntax shown below, in the same namespace as the running cluster, the operator will clone via pg_basebackup directly from the existing cluster instead of using your S3 backups.

One general prerequisite for creating clones with WAL-G is to specify the S3 bucket configuration for cloning. We did this for the backup in the blog post named above, but it has to be done for cloning as well. This is done by adding the same environment variables you specified for the backup, but with a CLONE_ prefix. I did so via a pod-config configmap, whose variables get added to all Postgres clusters managed by the Zalando Operator (see here). You can also specify all of these settings in the postgresql CRD. If you do so, here is the list of variables you probably need to set in order to get cloning to work:

YAML
  CLONE_USE_WALG_RESTORE: "true"
  CLONE_AWS_ACCESS_KEY_ID: postgresql
  CLONE_AWS_SECRET_ACCESS_KEY: supersecret
  CLONE_AWS_ENDPOINT: http://minio.home.lab:9000 # Endpoint URL to your S3 Endpoint; MinIO in this example
  CLONE_AWS_REGION: de01
  CLONE_METHOD: CLONE_WITH_WALE
  CLONE_WAL_BUCKET_SCOPE_PREFIX: ""
  CLONE_WAL_S3_BUCKET: postgresql
  CLONE_AWS_S3_FORCE_PATH_STYLE: "true"
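Once the configmap (or CRD) carries these settings, you can verify that they actually reach the cluster pods. A quick check (the pod name is the one from the demo cluster used later in this post; adjust it to yours):

```shell
# List all CLONE_-prefixed variables inside the Spilo pod; the pod name
# postgres-demo-cluster-0 is an example and depends on your cluster name.
kubectl exec -n postgres postgres-demo-cluster-0 -- env | grep '^CLONE_'
```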

Inplace restore

When speaking of an inplace restore, we mean restoring a Postgres instance to an earlier point in time while keeping the instance name the same. You cannot overwrite a running cluster, so you first need to delete it and then recreate it with the clone configuration shown below. You can also change some of the configuration during an inplace restore, but I would advise you not to do so; it's already a "risky" process. So if you're unsure about the specific point in time, or you want to try it out without deleting the actual instance, use the clone restore method described in the next chapter.

Let's assume we've created a Postgres cluster named postgres-demo-cluster with the following definition (you can also find the manifest in my GitHub repository):

demo-cluster.yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: postgres-demo-cluster
  namespace: postgres
spec:
  teamId: "postgres"
  volume:
    size: 2Gi
  numberOfInstances: 2
  users:
    demouser:  # database owner
    - superuser
    - createdb
  databases:
    demo: demo_owner  # dbname: owner
  preparedDatabases:
    demo: {}
  postgresql:
    version: "14"

Ensure that you have a valid full backup and recent WAL archives in your S3 bucket covering the point in time you want to restore the instance to. Then delete the cluster from your Kubernetes environment.

kubectl delete postgresqls -n postgres postgres-demo-cluster
postgresql.acid.zalan.do "postgres-demo-cluster" deleted

Wait for the cluster to be removed. If you then check the S3 bucket again, you will see that a final WAL archive is created during the shutdown / termination of the instance.
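If you want to verify this, you can list the newest WAL segments in the bucket. A sketch using the AWS CLI with the MinIO endpoint from this setup (the wal_005/ subfolder is where WAL-G stores WAL segments under the prefix; adjust bucket, path, and endpoint to your environment):

```shell
# Show the five most recent WAL archives of the demo cluster; endpoint,
# bucket and path match the example setup from the backup post.
aws s3 ls s3://postgresql/spilo/postgres-demo-cluster/wal/14/wal_005/ \
  --endpoint-url http://minio.home.lab:9000 | sort -k1,2 | tail -n 5
```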

So now let’s update the CRD manifest and add a clone section to it (again, you can find the manifest also here).

demo-cluster-inplace-restore.yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: postgres-demo-cluster
  namespace: postgres
spec:
  teamId: "postgres"
  volume:
    size: 2Gi
  numberOfInstances: 2
  users:
    demouser:  # database owner
    - superuser
    - createdb
  databases:
    demo: demo_owner  # dbname: owner
  preparedDatabases:
    demo: {}
  postgresql:
    version: "14"
  clone:
    cluster: "postgres-demo-cluster"  # Inplace restore when having the same cluster name as the source
    timestamp: "2022-05-03T19:50:00+00:00"  # timezone required (offset relative to UTC, see RFC 3339 section 5.6)
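The timestamp must carry an explicit UTC offset (RFC 3339). If you want to derive such a value on the command line, a small sketch (assumes GNU date):

```shell
# Build a recovery target timestamp 10 minutes in the past, in the
# RFC 3339 format the clone section expects (explicit +00:00 offset).
RECOVERY_TS=$(date -u -d '10 minutes ago' '+%Y-%m-%dT%H:%M:%S+00:00')
echo "$RECOVERY_TS"
```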

If you've read the official documentation, you might wonder why I don't use the uid parameter in the clone section. It's because in my backup setup, I set WAL_BUCKET_SCOPE_PREFIX and WAL_BUCKET_SCOPE_SUFFIX both to an empty string. This leaves the cluster uid out of the S3 backup path, which makes it more straightforward in my opinion.
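To illustrate what this does to the layout: with prefix and suffix empty, the WAL prefix is composed purely of bucket, cluster name, and Postgres major version (the variable names below are taken from this demo setup):

```shell
# Sketch: how the S3 prefix is composed when WAL_BUCKET_SCOPE_PREFIX and
# WAL_BUCKET_SCOPE_SUFFIX are empty.
WAL_S3_BUCKET="postgresql"
SCOPE="postgres-demo-cluster"   # the cluster name
PGVERSION="14"
echo "s3://${WAL_S3_BUCKET}/spilo/${SCOPE}/wal/${PGVERSION}/"
# -> s3://postgresql/spilo/postgres-demo-cluster/wal/14/
# With a non-empty prefix/suffix, the cluster uid would become part of the
# scope, and the clone section would need the uid parameter to match it.
```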

Let's apply the updated manifest (kubectl apply -f demo-cluster-inplace-restore.yaml) and check the logs of the Postgres pods. You will find something like this:

...
2022-05-03 20:05:04,811 - bootstrapping - INFO - Configuring wal-e
2022-05-03 20:05:04,811 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2022-05-03 20:05:04,812 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2022-05-03 20:05:04,812 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2022-05-03 20:05:04,812 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2022-05-03 20:05:04,812 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2022-05-03 20:05:04,813 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2022-05-03 20:05:04,813 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_REGION
2022-05-03 20:05:04,813 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2022-05-03 20:05:04,813 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2022-05-03 20:05:04,813 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2022-05-03 20:05:04,813 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2022-05-03 20:05:04,814 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2022-05-03 20:05:04,814 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_BACKUP
2022-05-03 20:05:04,814 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2022-05-03 20:05:04,814 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2022-05-03 20:05:04,814 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2022-05-03 20:05:04,815 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2022-05-03 20:05:04,815 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2022-05-03 20:05:04,815 - bootstrapping - INFO - Configuring pam-oauth2
2022-05-03 20:05:04,815 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2022-05-03 20:05:04,815 - bootstrapping - INFO - Configuring certificate
2022-05-03 20:05:04,815 - bootstrapping - INFO - Generating ssl self-signed certificate
2022-05-03 20:05:04,847 - bootstrapping - INFO - Configuring bootstrap
2022-05-03 20:05:04,847 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALE_S3_PREFIX
2022-05-03 20:05:04,848 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALG_S3_PREFIX
2022-05-03 20:05:04,848 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_ACCESS_KEY_ID
2022-05-03 20:05:04,848 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_SECRET_ACCESS_KEY
2022-05-03 20:05:04,848 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALE_S3_ENDPOINT
2022-05-03 20:05:04,848 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_ENDPOINT
2022-05-03 20:05:04,848 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_REGION
2022-05-03 20:05:04,849 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALE_DISABLE_S3_SSE
2022-05-03 20:05:04,849 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALG_DISABLE_S3_SSE
2022-05-03 20:05:04,849 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_S3_FORCE_PATH_STYLE
2022-05-03 20:05:04,849 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/USE_WALG_BACKUP
2022-05-03 20:05:04,849 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/USE_WALG_RESTORE
2022-05-03 20:05:04,849 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALE_LOG_DESTINATION
2022-05-03 20:05:04,849 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/TMPDIR
2022-05-03 20:05:06,051 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-05-03 20:05:06,065 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-05-03 20:05:06,067 INFO: Lock owner: None; I am postgres-demo-cluster-0
2022-05-03 20:05:06,097 INFO: trying to bootstrap a new cluster
2022-05-03 20:05:06,098 INFO: Running custom bootstrap script: envdir "/run/etc/wal-e.d/env-clone-postgres-demo-cluster" python3 /scripts/clone_with_wale.py --recovery-target-time="2022-05-03T19:50:00+00:00"
2022-05-03 20:05:06,230 INFO: Trying s3://postgresql/spilo/postgres-demo-cluster/wal/14/ for clone
2022-05-03 20:05:06,290 INFO: cloning cluster postgres-demo-cluster using wal-g backup-fetch /home/postgres/pgdata/pgroot/data base_000000010000000000000004
INFO: 2022/05/03 20:05:06.313280 Selecting the backup with name base_000000010000000000000004...
INFO: 2022/05/03 20:05:06.409909 Finished decompression of part_003.tar.lz4
INFO: 2022/05/03 20:05:06.409930 Finished extraction of part_003.tar.lz4
2022-05-03 20:05:16,568 INFO: Lock owner: None; I am postgres-demo-cluster-0
2022-05-03 20:05:16,568 INFO: not healthy enough for leader race
2022-05-03 20:05:16,587 INFO: bootstrap in progress
...

There are two things worth mentioning. In the upper part of the log, you can see that the Postgres Operator sets some environment variables in the envdir /run/etc/wal-e.d/env-clone-postgres-demo-cluster/. These are the settings WAL-G needs to find the backups of your source instance (where source and target are the same in this scenario). The second thing is the actual restore that takes place: first the line with our specified target time for the point-in-time restore, and then the line showing that it has found the backup of our source database.

When you shell into your database pod and issue a patronictl list command, you can also see that the timeline (the incarnation of the instance) has increased. We took the backup on the 1st timeline; after the restore has finished, we're now on timeline 3.

root@postgres-demo-cluster-0:/home/postgres# patronictl list
+-------------------------+--------------+---------+---------+----+-----------+
| Member                  | Host         | Role    | State   | TL | Lag in MB |
+-------------------------+--------------+---------+---------+----+-----------+
| postgres-demo-cluster-0 | 10.244.1.11  | Leader  | running |  3 |           |
...

Let's proceed to the next scenario. This time, we clone to a new instance / cluster in a different namespace.

Clone restore

This time, our manifest for cloning postgres-demo-cluster looks like this. It's mostly the same as in the first scenario, except that we use a different namespace and a different cluster name. (Again, you can find the manifest here.)

demo-cluster-clone.yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: postgres-demo-cluster-clone
  namespace: postgres-clone
spec:
  teamId: "postgres"
  volume:
    size: 2Gi
  numberOfInstances: 1
  users:
    demouser:  # database owner
    - superuser
    - createdb
  databases:
    demo: demo_owner  # dbname: owner
  preparedDatabases:
    demo: {}
  postgresql:
    version: "14"
  clone:
    cluster: "postgres-demo-cluster"  # Source instance name; Instance name to clone from
    timestamp: "2022-05-03T19:50:00+00:00"  # timezone required (offset relative to UTC, see RFC 3339 section 5.6)

When doing a restore as a clone, you have more freedom in changing the configuration of your database. I would still advise you not to change too much. On the positive side: if there is a problem with the cloned restore, you can just delete it and start fresh. Your source instance also does not need to be deleted, so you can double-check first whether the clone delivers the expected result. Create the target namespace if it doesn't exist yet (kubectl create namespace postgres-clone), then apply the manifest:

Fish
kubectl apply -f demo-cluster-clone.yaml
postgresql.acid.zalan.do/postgres-demo-cluster-clone created

...
2022-05-03 20:20:04,356 - bootstrapping - INFO - Configuring bootstrap
2022-05-03 20:20:04,357 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALE_S3_PREFIX
2022-05-03 20:20:04,357 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALG_S3_PREFIX
2022-05-03 20:20:04,357 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_ACCESS_KEY_ID
2022-05-03 20:20:04,357 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_SECRET_ACCESS_KEY
2022-05-03 20:20:04,357 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALE_S3_ENDPOINT
2022-05-03 20:20:04,358 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_ENDPOINT
2022-05-03 20:20:04,358 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_REGION
2022-05-03 20:20:04,358 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALE_DISABLE_S3_SSE
2022-05-03 20:20:04,358 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALG_DISABLE_S3_SSE
2022-05-03 20:20:04,358 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/AWS_S3_FORCE_PATH_STYLE
2022-05-03 20:20:04,358 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/USE_WALG_BACKUP
2022-05-03 20:20:04,359 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/USE_WALG_RESTORE
2022-05-03 20:20:04,359 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/WALE_LOG_DESTINATION
2022-05-03 20:20:04,359 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env-clone-postgres-demo-cluster/TMPDIR
2022-05-03 20:20:04,359 - bootstrapping - INFO - Configuring certificate
2022-05-03 20:20:04,359 - bootstrapping - INFO - Generating ssl self-signed certificate
2022-05-03 20:20:04,417 - bootstrapping - INFO - Configuring standby-cluster
2022-05-03 20:20:04,417 - bootstrapping - INFO - Configuring log
2022-05-03 20:20:04,417 - bootstrapping - INFO - Configuring pgbouncer
2022-05-03 20:20:04,417 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2022-05-03 20:20:04,418 - bootstrapping - INFO - Configuring crontab
2022-05-03 20:20:04,418 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2022-05-03 20:20:04,431 - bootstrapping - INFO - Configuring pam-oauth2
2022-05-03 20:20:04,431 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2022-05-03 20:20:04,431 - bootstrapping - INFO - Configuring wal-e
2022-05-03 20:20:04,431 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2022-05-03 20:20:04,431 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2022-05-03 20:20:04,432 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2022-05-03 20:20:04,432 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2022-05-03 20:20:04,432 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2022-05-03 20:20:04,432 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2022-05-03 20:20:04,432 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_REGION
2022-05-03 20:20:04,432 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2022-05-03 20:20:04,432 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2022-05-03 20:20:04,432 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2022-05-03 20:20:04,433 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2022-05-03 20:20:04,433 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2022-05-03 20:20:04,433 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_BACKUP
2022-05-03 20:20:04,433 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2022-05-03 20:20:04,433 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2022-05-03 20:20:04,433 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2022-05-03 20:20:04,433 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2022-05-03 20:20:04,434 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2022-05-03 20:20:05,608 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-05-03 20:20:05,620 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-05-03 20:20:05,623 INFO: Lock owner: None; I am postgres-demo-cluster-clone-0
2022-05-03 20:20:05,643 INFO: trying to bootstrap a new cluster
2022-05-03 20:20:05,644 INFO: Running custom bootstrap script: envdir "/run/etc/wal-e.d/env-clone-postgres-demo-cluster" python3 /scripts/clone_with_wale.py --recovery-target-time="2022-05-03T19:50:00+00:00"
2022-05-03 20:20:05,776 INFO: Trying s3://postgresql/spilo/postgres-demo-cluster/wal/14/ for clone
2022-05-03 20:20:05,884 INFO: cloning cluster postgres-demo-cluster-clone using wal-g backup-fetch /home/postgres/pgdata/pgroot/data base_000000010000000000000004
INFO: 2022/05/03 20:20:05.906757 Selecting the backup with name base_000000010000000000000004...
INFO: 2022/05/03 20:20:05.989340 Finished decompression of part_003.tar.lz4
INFO: 2022/05/03 20:20:05.989414 Finished extraction of part_003.tar.lz4
INFO: 2022/05/03 20:20:15.138568 Finished decompression of part_001.tar.lz4
INFO: 2022/05/03 20:20:15.138590 Finished extraction of part_001.tar.lz4
INFO: 2022/05/03 20:20:15.153621 Finished decompression of pg_control.tar.lz4
INFO: 2022/05/03 20:20:15.153764 Finished extraction of pg_control.tar.lz4
INFO: 2022/05/03 20:20:15.154067
Backup extraction complete.
...

Again, you will find much the same output as before. Shell into the database pod and check the cluster status with patronictl list.

root@postgres-demo-cluster-clone-0:/home/postgres# patronictl list
+ Cluster: postgres-demo-cluster-clone (7093595735000756305) ----+----+-----------+
| Member                        | Host        | Role   | State   | TL | Lag in MB |
+-------------------------------+-------------+--------+---------+----+-----------+
| postgres-demo-cluster-clone-0 | 10.244.1.13 | Leader | running |  4 |           |
+-------------------------------+-------------+--------+---------+----+-----------+

All seems fine. One more word regarding the cloned instance: if you've followed my backup setup instructions from this blog post, the cloned instance will take a backup right after the restore. Whether this is desired behavior depends entirely on your use case, so keep it on your list.
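If you want to check whether that immediate backup happened, list the basebackups under the clone's own prefix (the basebackups_005/ subfolder is where WAL-G places basebackups; adjust names and endpoint to your environment):

```shell
# List basebackups the clone has written under its own prefix; the path
# mirrors the source layout, but with the clone's cluster name.
aws s3 ls --recursive s3://postgresql/spilo/postgres-demo-cluster-clone/wal/14/basebackups_005/ \
  --endpoint-url http://minio.home.lab:9000 | grep 'base_'
```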

Philip

41 thoughts on “Restore and Clone from S3 – Configure Zalando Postgres Operator Restore with WAL-G”

  1. One question… On restore, I need to set a timestamp. Does this need to be the exact timestamp for an existing backup, or something relative to a possible backup? How can I know the exact timestamp once the deployment was destroyed, for example?

    1. Philip

      Hello Michael,

      the timestamp is a point in time between the oldest existing backup and the newest WAL archive, so it doesn't need to be the exact timestamp of a basebackup. There are some issues when you want to restore to a timeline before the current one, but this topic is addressed and actively being worked on.

      In case you have lost the complete deployment, the only way to know which backups exist is to check the S3 bucket, or to recreate the deployment as a single instance with the same name. What will happen then is that a new timeline gets created, but you can then check the existing backups with the wal-g commands. (At least if you followed the instructions in the blog post and unset prefix and suffix.)
      I would suggest you check the S3 storage, though.

      Philip

  2. Hello, I think I did everything correctly, but when I try the clone restore, it fails with a credentials error. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY don't appear in the env-clone- directory. Backup works correctly. Is there any additional configuration that needs to be done, perhaps?

    1. Philip

      Hello Alexander,

      Does the env dir for the clone exist? If yes, can you provide an ls -l of the directory?
      Have you ensured that you set the CLONE_* properties in the configmap according to the article?
      Are the env variables set correctly? Can you provide the output of env | grep -i clone (please hide the actual values when you do)?

      Kind regards
      Philip

  3. Mr. TheDatabaseMe, good day. I want to ask you a question. Could you provide me any advice on how to backup and restore a very sizable PostgreSQL database running on k8s cluster? 11 TB is the current size. There isn’t any initial backup for it. I appreciate your good article.

    1. Philip

      Hello LukaPham,
      this depends a bit on which container image and / or Kubernetes operator (like the Zalando Postgres Operator) you're using. These normally come with one or two backup tools which should be used.
      In general, a database of this size is hard to back up using a conventional backup method (like pg_basebackup, WAL-G…). With WAL-G you can write incremental basebackups, which speeds up the backups after the initial one. But restoring 11 TB will easily take multiple hours, so I'm not sure whether this is reasonable for your needs.

      Another possible way of backing up Postgres is to use storage snapshots. But this can be hard on Kubernetes for two reasons:
      1. Your storage provider needs to support snapshots.
      2. If you are using an operator (like Zalando) or cluster software included in your container image (like Patroni in the Spilo images used by Zalando), a restore may be hard to implement. You probably have to "patch" some code in these components because there is no implementation of snapshot restores.

      For me, a snapshot backup would still be the only reasonable way to go. But you need to ensure that you also back up WAL archives, else you will not be able to do a point-in-time restore, and it may also conflict with bootstrapping the container or your operator. Without having tried it, I would test "restoring" a storage snapshot to the original volume (Kubernetes PV); in theory, the database should do a crash recovery (if all WAL information is also part of the snapshot) and be back up at the point in time you took the snapshot.
      I know of some Kubernetes components which act as a storage provider in your cluster and also support snapshots (e.g. Rancher Longhorn). I would consider this if your underlying storage does not support snapshots.

      More information on using storage snapshots in combination with Postgres can be found here.

      Hope this helps.
      Kind regards
      Philip

  4. Charles Thomas Johnson

    In the article you mention that "the cloned instance will do a backup right after the restore". Is there a way to configure the clone not to do this? I found that this causes issues if you want to clone a live database which may be written to during the cloning process; this causes a fork in timelines, which in turn prevents me from automatically creating another clone.

    1. Philip

      Hello,

      yes, this is possible by not setting the needed WAL* env vars for the clone instance. But then you need to specifically unset them all in the `postgresql` custom resource if you set them via a configmap like in this tutorial.
      This prevents the backup, because the pod can't connect to any S3 bucket.

      Hope this helps.
      Philip

  5. hello, I have a problem: for some reason the backup does not want to be restored if the bucket is located in another region. I have the variable AWS_REGION: eu-central-1, but it seems to get ignored. Any ideas?

    ERROR: 2023/03/09 17:52:27.978381 failed to list s3 folder: '*********/12/basebackups_005/': BucketRegionError: incorrect region, the bucket is not in 'eu-west-1' region at endpoint ''

  6. TheDatabaseMe
    I found the reason: in the configmap (or in the secret) it is necessary to specify these variables for the clone:
    CLONE_USE_WALG_RESTORE: "true"
    CLONE_AWS_REGION: "REGION"
    CLONE_WALG_S3_PREFIX: "s3://YOUR-BUCKET/$(CLONE_SCOPE)/$(PGVERSION)"

  7. TheDatabaseMe
    Do you have any idea how I can create multiple configmaps (secrets) for different clusters using one Zalando pg operator?

    Thank you very much <3

    1. Philip

      Hi,

      you can do so by specifying the parameter pod_environment_secret within your postgres-operator configmap.
      This leads to the following: the Zalando Operator will add all env vars specified within this secret to the Postgres cluster pods.
      You have to create separate namespaces for every Postgres cluster, and the specified secret must be located within that namespace.
      Be aware that having the same env var specified within both a pod_environment_configmap and the pod_environment_secret will lead to a conflict.

      Another option is to specify the dedicated env vars for every cluster within the actual postgresql CRD, e.g.:


      env:
      - name: BACKUP_SCHEDULE
        value: '0 2 * * *'

  8. Hi!
    I did everything like you listed in this post and the one before regarding backup (at least I hope so).
    My Postgres cluster is up and running, creating a backup into S3 every day.
    The problem came when I tried to restore it with the first method you listed: delete the whole cluster in the same namespace and reapply it.
    You mentioned that the uid would not be necessary, but it seems like my secondary Postgres node can't bootstrap from the backup because of some ID mismatch.

    apiVersion: acid.zalan.do/v1
    kind: postgresql
    metadata:
      name: paas-db-cluster
    spec:
      postgresql:
        version: "14"
        parameters: # Expert section
          password_encryption: scram-sha-256
          shared_buffers: "32MB"
          max_connections: "100"
          log_statement: "all"
      env:
      - name: WAL_S3_BUCKET
        value: "postgres"
      - name: WAL_BUCKET_SCOPE_PREFIX
        value: ""
      - name: WAL_BUCKET_SCOPE_SUFFIX
        value: ""
      - name: USE_WALG_BACKUP
        value: "true"
      - name: USE_WALG_RESTORE
        value: "true"
      - name: BACKUP_SCHEDULE
        value: '00 04 * * *'
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            key: id
            name: hcp-credentials
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            key: key
            name: hcp-credentials
      - name: AWS_ENDPOINT
        value: "https://*****-sandbox.hcp.intra"
      - name: AWS_S3_FORCE_PATH_STYLE
        value: "false"
      - name: WALG_DISABLE_S3_SSE
        value: "true"
      - name: WALG_S3_CA_CERT_FILE
        value: "/tlsca/MT_CA_FILE"
      clone:
        cluster: "paas-db-cluster"
        timestamp: "2023-05-18T13:50:33+02:00"

    And the error that the secondary Postgres node gives:
    2023-05-19 06:32:26,311 INFO: No PostgreSQL configuration items changed, nothing to reload.
    2023-05-19 06:32:26,353 CRITICAL: system ID mismatch, node paas-harbor-db-cluster-1 belongs to a different cluster: 7234777133454319697 != 7234523812773580881

    The cluster is up and running; it's the secondary, paas-harbor-db-cluster-1, that can't bootstrap.

    Any idea what I did wrong?
    Thanks!

    1. Philip

      Hello,
      but your leader pod got restored as it should? Can you share the Spilo image version and the Operator version you're using?

      As mentioned in the article, an inplace restore is quite tricky. What seems to have happened in your case: the secondary node got bootstrapped as a new cluster, so it didn't take over the same Patroni cluster ID as the primary. So it can't bootstrap itself from the leader, because it considers it a different cluster.

      To fix this situation, you can force the secondary to be recreated by deleting the PVC of the secondary pod (so the volume), then deleting the pod itself. It will come back up and get recreated (with a new volume). There will be no data left when it comes back, so it will bootstrap freshly either from the leader or from the latest backup.

      Please check whether backups are getting written correctly by your restored cluster in general. If something seems fishy to you, I would recommend using the second scenario (restore as clone) from the article.

      Kind regards
      Philip

      1. Thanks for the very fast reply! 🙂

        Spilo: spilo-14:2.1-p6
        Operator: postgres-operator:v1.10.0

        I was testing this out today in our sandbox cluster.
        Your suggestion fixes the issue with the ID mismatch: I deleted the Postgres cluster after I loaded some data into it and it wrote the first backup into S3, then I reapplied the postgres.yaml just like before. I got the ID mismatch, deleted the PVC, then the secondary pod, and the new pod joined the Postgres cluster as a secondary.
        The problem came when I checked and no real data had been loaded from the backup. So my ID problem seems to be fixed by the workaround, but the restore doesn't work as intended when trying the in-place restore.
        It would be nice if it worked though, because we use Postgres as the DB for a Harbor image repository. They are both located in the same namespace, so restore as clone wouldn't be a viable option for us, as they need to be in the same NS.

        Master Postgres log after deletion and recreation:
        2023-05-19 11:44:11.910 - /scripts/postgres_backup.sh - producing a new backup
        INFO: 2023/05/19 11:44:12.105522 Calling pg_start_backup()
        INFO: 2023/05/19 11:44:12.403942 Starting a new tar bundle
        INFO: 2023/05/19 11:44:12.404031 Walking ...
        INFO: 2023/05/19 11:44:12.404663 Starting part 1 ...
        2023-05-19 11:44:12.810 46 LOG Starting pgqd 3.3
        2023-05-19 11:44:12.810 46 LOG auto-detecting dbs ...
        INFO: 2023/05/19 11:44:13.215011 Packing ...
        INFO: 2023/05/19 11:44:13.223221 Finished writing part 1.
        INFO: 2023/05/19 11:44:13.665593 Starting part 2 ...
        INFO: 2023/05/19 11:44:13.665658 /global/pg_control
        INFO: 2023/05/19 11:44:13.667669 Finished writing part 2.
        INFO: 2023/05/19 11:44:13.667718 Calling pg_stop_backup()
        INFO: 2023/05/19 11:44:14.690746 Starting part 3 ...
        INFO: 2023/05/19 11:44:14.707151 backup_label
        INFO: 2023/05/19 11:44:14.707177 tablespace_map
        INFO: 2023/05/19 11:44:14.707222 Finished writing part 3.
        INFO: 2023/05/19 11:44:14.961771 Wrote backup with name base_000000010000000000000002

        Secondary Postgres logs:
        2023-05-19 11:48:23,327 INFO: trying to bootstrap from leader 'paas-db-cluster-0'
        Current cluster size: 26814537
        Wals generated since the last backup: 16933064
        not restoring from backup because of amount of generated wals exceeds 30% of cluster_size
        2023-05-19 11:48:23,788 ERROR: Error creating replica using method wal_e: envdir /run/etc/wal-e.d/env bash /scripts/wale_restore.sh exited with code=1
        1024+0 records in
        1024+0 records out
        16777216 bytes (17 MB, 16 MiB) copied, 0.0149227 s, 1.1 GB/s
        NOTICE: all required WAL segments have been archived
        2023-05-19 11:48:31,188 INFO: replica has been created using basebackup_fast_xlog
        2023-05-19 11:48:31,191 INFO: bootstrapped from leader 'paas-harbor-db-cluster-0'
        2023-05-19 11:48:31 UTC [117]: [1-1] 6467620f.75 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
        2023-05-19 11:48:31 UTC [117]: [2-1] 6467620f.75 0 LOG: pg_stat_kcache.linux_hz is set to 199999
        2023-05-19 11:48:31,653 INFO: postmaster pid=117
        2023-05-19 11:48:31 UTC [117]: [3-1] 6467620f.75 0 LOG: redirecting log output to logging collector process
        2023-05-19 11:48:31 UTC [117]: [4-1] 6467620f.75 0 HINT: Future log output will appear in directory "../pg_log".
        /var/run/postgresql:5432 - no response
        /var/run/postgresql:5432 - rejecting connections
        /var/run/postgresql:5432 - rejecting connections
        /var/run/postgresql:5432 - accepting connections

        Best,
        Martin

        1.

          Hello Martin,

          OK, this makes sense to me now. I've checked the postgresql CRD from your first comment and saw that you haven't specified the env vars needed for cloning: all env vars you set to get the backup running (e.g. WAL_S3_BUCKET) also need to be specified with a CLONE_ prefix.

          So here is my assumption of what happened. You deleted and recreated the postgresql CRD. The cluster spun up. You had specified a clone section within the CRD, but without the CLONE_ env vars set, no restore actually happened, so the primary came up as a brand new cluster, just with the old name. Now something nasty happens: since the new cluster has the same settings for how to make backups and where to store them, the primary created a backup (with a new cluster ID) on timeline ID 1, maybe even overwriting existing backups of the old instance if there were any on that timeline.
          Then the secondary pod comes up and needs to be bootstrapped, either from the leader or from a backup. Since the primary had created a backup, the secondary chose to bootstrap from the backup with the highest timeline ID and timestamp it found on S3 (which are the backups of your original instance). It tried to bootstrap from those, but found that they belong to another instance ID / cluster (the original one) which differs from the new leader's instance ID. So it couldn't bootstrap.

          To prevent this mess with your backups, Spilo by default uses a prefix and suffix in the "path" of the S3 backups. This ensures that no backups get overwritten by a new instance with the same name, but it makes cloning harder (because you need to remember and save the original UID of the postgresql cluster).

          So I would now recommend double-checking that at least one timeline / timestamp still exists on S3 and hasn't been overwritten, or (since you mentioned it's a sandbox system) starting from scratch. Ensure you have set all the needed CLONE_ env vars. For the next try, before you do an in-place restore, try to restore as a real clone with a different name. This ensures you have the settings right. When this works as "restore as clone", then proceed to restore in-place.

          I double-checked my blog post, and I really haven't written about the CLONE_ variables. I assumed that people would use my backup configuration from my GitHub repo, where I set all env vars within one configmap which is used as an "overlay" by all Postgres instances deployed by the Zalando Operator. I will update the blog post to at least mention this configmap specifically. You can find the configmap (and therefore all env vars you need to set in your postgresql CRD) here.
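
          To make this concrete, here is a minimal sketch of how such a clone manifest could look. The cluster and bucket names are placeholders, not taken from your setup, and only a few of the CLONE_ variables are shown:

          ```yaml
          apiVersion: "acid.zalan.do/v1"
          kind: postgresql
          metadata:
            name: acid-minimal-cluster-clone
          spec:
            # ... teamId, volume, numberOfInstances etc. as usual ...
            env:
              # every WAL-G env var used for backup, repeated with a CLONE_ prefix
              - name: CLONE_USE_WALG_RESTORE
                value: "true"
              - name: CLONE_WAL_S3_BUCKET
                value: "postgresql"
              - name: CLONE_METHOD
                value: "CLONE_WITH_WALE"
              # ... plus the CLONE_AWS_* credentials and endpoint ...
            clone:
              cluster: "acid-minimal-cluster"           # name of the source cluster
              timestamp: "2023-05-22T10:50:33+02:00"    # point in time to restore to
          ```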

          Kind regards
          Philip

  9. Hi!
    Thanks for the great advice again!
    I think I was able to set up the cloning, but it doesn't spin up from the backup; it just gets stuck bootstrapping.
    What I did:
    – set up Postgres in namespace A (db-cluster) with backup variables
    – made sure there was a backup created on my S3
    – set up Postgres in namespace B (db-cluster-clone) with the now-correct env variables for cloning
    postgres logs in namespace B:
    2023-05-22 09:16:59,816 INFO: No PostgreSQL configuration items changed, nothing to reload.
    2023-05-22 09:16:59,820 INFO: Lock owner: None; I am paas-harbor-db-cluster-clone-0
    2023-05-22 09:16:59,869 INFO: trying to bootstrap a new cluster
    2023-05-22 09:16:59,870 INFO: Running custom bootstrap script: envdir "/run/etc/wal-e.d/env-clone-paas-harbor-db-cluster" python3 /scripts/clone_with_wale.py --recovery-target-time="2023-05-22T10:50:33+02:00"
    2023-05-22 09:17:00,243 INFO: Trying s3://postgres-harbor/spilo/paas-harbor-db-cluster/wal/14/ for clone
    2023-05-22 09:17:15,321 INFO: Lock owner: None; I am paas-harbor-db-cluster-clone-0
    2023-05-22 09:17:15,322 INFO: not healthy enough for leader race
    2023-05-22 09:17:15,350 INFO: bootstrap in progress

    And this bootstrap goes on for forever.

    postgres env variables in namespace B:
    - name: WAL_S3_BUCKET
      value: "postgres-harbor"
    - name: WAL_BUCKET_SCOPE_PREFIX
      value: ""
    - name: WAL_BUCKET_SCOPE_SUFFIX
      value: ""
    - name: USE_WALG_BACKUP
      value: "true"
    - name: USE_WALG_RESTORE
      value: "true"
    - name: BACKUP_SCHEDULE
      value: '00 04 * * *'
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          key: id
          name: hcp-credentials
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: key
          name: hcp-credentials
    - name: AWS_ENDPOINT
      value: "https://*******.hcp.intra"
    - name: AWS_S3_FORCE_PATH_STYLE
      value: "false"
    - name: WALG_DISABLE_S3_SSE
      value: "true"
    - name: WALG_S3_CA_CERT_FILE
      value: "/tlsca/MT_CA_FILE"
    # - name: BACKUP_NUM_TO_RETAIN
    #   value: "5"
    - name: CLONE_USE_WALG_RESTORE
      value: "true"
    - name: CLONE_AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          key: id
          name: hcp-credentials
    - name: CLONE_AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: key
          name: hcp-credentials
    - name: CLONE_AWS_ENDPOINT
      value: "https://********.hcp.intra"
    - name: CLONE_METHOD
      value: "CLONE_WITH_WALE"
    - name: CLONE_WAL_BUCKET_SCOPE_PREFIX
      value: ""
    - name: CLONE_WAL_S3_BUCKET
      value: "postgres-harbor"
    - name: CLONE_AWS_S3_FORCE_PATH_STYLE
      value: "false"

    clone:
      cluster: "db-cluster"
      timestamp: "2023-05-22T10:50:33+02:00"

    Maybe I should provide another path to search for the backups?
    Thanks,
    Martin

    1.

      Hello Martin,

      I think something got mixed up. Can you verify that there are backups under s3://postgres-harbor/spilo/paas-harbor-db-cluster/wal/14/basebackups_005/? Also, I'm confused why it searches for a backup there at all, because you specified to clone from db-cluster, not from paas-harbor-db-cluster, which is what it is looking for.
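
      For reference, the prefix Spilo tries during a clone is assembled from the bucket variables plus the source cluster name taken from clone.cluster. Here is a rough sketch of that path construction, using the values from your manifest; the exact composition is an assumption based on the layout visible in your bootstrap log:

      ```shell
      # Rebuild the S3 prefix a clone would search, from the manifest values.
      WAL_S3_BUCKET="postgres-harbor"
      WAL_BUCKET_SCOPE_PREFIX=""
      WAL_BUCKET_SCOPE_SUFFIX=""
      CLONE_SCOPE="db-cluster"   # taken from clone.cluster
      PG_MAJOR="14"

      CLONE_PREFIX="s3://${WAL_S3_BUCKET}/spilo/${WAL_BUCKET_SCOPE_PREFIX}${CLONE_SCOPE}${WAL_BUCKET_SCOPE_SUFFIX}/wal/${PG_MAJOR}/"
      echo "${CLONE_PREFIX}"
      # → s3://postgres-harbor/spilo/db-cluster/wal/14/
      ```

      If this prints a different path than the one in the bootstrap log (as it does here, since the log searched paas-harbor-db-cluster), the clone section and the actual source cluster name don't match.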

      Kind Regards
      Philip

  10. Hi Philip!

    Sorry for the late reply; I worked on it all afternoon.
    I had some issues with TLS certificates that only got logged after 10 minutes of bootstrapping.
    I loaded the S3 CA into the container and it found the backup instantly. I tried both cloning and in-place restore; both work like a charm.
    Thanks for the great advice!

    Best,
    Martin

  11. Hi Philip!

    I have been using the Postgres backups and have made a full recovery on one of my clusters.
    It has been working fine, and I have tested multiple times what happens when one of my two nodes restarts. It always spun up again in a minute or two and worked fine.
    However, yesterday one of my secondary nodes got restarted and went into an infinite bootstrap loop.
    After an hour or so it got itself together, but it's very odd to me.
    Have you ever encountered this issue? There is a GitHub issue on this, but it didn't help at all.

    2023-07-21 09:39:42,776 INFO: Lock owner: paas-harbor-db-cluster-0; I am paas-harbor-db-cluster-1
    2023-07-21 09:39:42,777 INFO: restarting after failure in progress
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    /var/run/postgresql:5432 - rejecting connections
    2023-07-21 09:39:57,836 INFO: Lock owner: paas-harbor-db-cluster-0; I am paas-harbor-db-cluster-1
    2023-07-21 09:39:57,836 INFO: Still starting up as a standby.
    2023-07-21 09:39:57,838 INFO: Lock owner: paas-harbor-db-cluster-0; I am paas-harbor-db-cluster-1
    2023-07-21 09:39:57,839 INFO: establishing a new patroni connection to the postgres cluster
    2023-07-21 09:39:58,191 INFO: establishing a new patroni connection to the postgres cluster
    2023-07-21 09:39:58,194 WARNING: Retry got exception: 'connection problems'
    2023-07-21 09:39:58,196 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
    2023-07-21 09:39:58,198 INFO: no action. I am (paas-harbor-db-cluster-1), a secondary, and following a leader (paas-harbor-db-cluster-0)
    /var/run/postgresql:5432 - rejecting connections
    2023-07-21 09:40:12,793 INFO: Lock owner: paas-harbor-db-cluster-0; I am paas-harbor-db-cluster-1
    2023-07-21 09:40:12,793 INFO: Still starting up as a standby.
    2023-07-21 09:40:12,794 INFO: Lock owner: paas-harbor-db-cluster-0; I am paas-harbor-db-cluster-1
    2023-07-21 09:40:12,795 INFO: establishing a new patroni connection to the postgres cluster
    2023-07-21 09:40:13,219 INFO: establishing a new patroni connection to the postgres cluster
    2023-07-21 09:40:13,223 WARNING: Retry got exception: 'connection problems'
    2023-07-21 09:40:13,223 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
    2023-07-21 09:40:13,226 INFO: no action. I am (paas-harbor-db-cluster-1), a secondary, and following a leader (paas-harbor-db-cluster-0)
    /var/run/postgresql:5432 - rejecting connections

    Sorry for the slightly out of context comment, but I couldn’t find help anywhere else. Maybe you have seen this issue before?
    Thanks in advance,
    Martin

    1.

      Hello Martin,
      you are right, this is not a restore topic. I haven't encountered something like this myself, but I do have a guess at what it might be.

      Some questions / remarks in general:
      – Do you know why the secondary became unavailable / got restarted in the first place?
      – Do you have logs from the time when the secondary came back up in the end?
      – This is not exactly bootstrapping: the secondary is still aware that it is part of a cluster and that its last role was secondary.

      What the logs tell me is that the secondary can't connect to the primary for an unknown reason. It still knows that it was a secondary
      node and tries to get a connection to the primary node again and again. So this could just be a network issue in your Kubernetes cluster.
      Was the primary still running without any log errors / warnings, and were you still able to connect to it?

      Kind regards
      Philip

  12. Hi Philip!

    Thanks for the lightning-fast reply, and again I'm sorry for my out-of-context question.
    – the secondary got restarted because the node it was on got drained.
    – logs from when it became available again:

    2023-07-21 10:30:13,004 INFO: no action. I am (paas-harbor-db-cluster-1), a secondary, and following a leader (paas-harbor-db-cluster-0)
    2023-07-21 10:30:27,838 INFO: no action. I am (paas-harbor-db-cluster-1), a secondary, and following a leader (paas-harbor-db-cluster-0)
    2023-07-21 10:30:42,840 INFO: no action. I am (paas-harbor-db-cluster-1), a secondary, and following a leader (paas-harbor-db-cluster-0)
    2023-07-21 10:30:57,882 INFO: no action. I am (paas-harbor-db-cluster-1), a secondary, and following a leader (paas-harbor-db-cluster-0)
    2023-07-21 10:31:12,840 INFO: no action. I am (paas-harbor-db-cluster-1), a secondary, and following a leader (paas-harbor-db-cluster-0)

    I don't know of any issues with the Kubernetes networking, but I will look into it.

    Thanks,
    Martin

  13. Yes just like that.

    I suspect that it has something to do with the restore function.
    After I tested the restore on these clusters, I left the env variables and the clone section in the Postgres config. Two days ago there was a cert change on the storage we use for the backups. After this change, the clusters configured with restore stopped producing backups and only managed to start after an hour of crashlooping.

    I also have a Postgres cluster which isn't configured with restore and only makes backups, which we would use in case of a disaster.
    This cluster works fine, producing backups as it should, and it could handle Postgres pod restarts.

    So I’m gonna go into this direction, thanks a lot for the input.

    Best,
    Martin

  14. Hello @TheDatabaseMe

    Did you try to clone the cluster from the backup using IAM access instead of an access ID and key?
    I've tried it in multiple ways without success. My backup pod sees the backups:

    root@test-clone-postgresql-0:/home/postgres# envdir "/run/etc/wal-e.d/env-clone-test-postgresql/" wal-g backup-list
    name modified wal_segment_backup_start
    base_000000010000000000000002 2023-08-17T12:47:14Z 000000010000000000000002
    root@test-clone-postgresql-0:/home/postgres#

    But the cloning procedure failed with an error:
    2023-08-17 13:53:17,613 INFO: Trying s3://el-postgres-backup-int/spilo/postgresql-test-postgresql/wal/14/ for clone
    ERROR: 2023/08/17 13:53:20.815160 failed to list s3 folder: 'spilo/postgresql-test-postgresql/wal/14/basebackups_005/': AccessDenied: Access Denied
    status code: 403, request id: P9FAGGKSC9YJ7B47, host id: 9m+4YFj46reDSATMmO76kpMcpdUuzfLb9LiZlTHnYfefcNC6zZ0qIx7WQe6xSqSl9iNw2JihxGcJGYn+KqMx2w==
    2023-08-17 13:53:20,818 ERROR: Clone failed

    1.

      Hello,
      no, I haven't tried IAM for cloning.

      I found your GitHub issue at Zalando and checked that information as well. In my opinion, something is wrong in your clone section. Specifying the UID there can lead to issues, since the UID is not appended to the path during backup, so my first suggestion would be to leave it out. Also, remove the s3_wal_path parameter and try again.
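
      For illustration, a clone section reduced to that would look like this (cluster name and timestamp are placeholders):

      ```yaml
      clone:
        cluster: "test-postgresql"                # source cluster name only
        timestamp: "2023-08-17T14:00:00+02:00"    # restore target time
        # no uid and no s3_wal_path - let Spilo derive the S3 path itself
      ```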

      Kind regards
      Philip

  15. Hi Philip,
    I am fairly new to Kubernetes and followed your steps. I was able to get the backups onto S3 using WAL-G. However, I am not able to clone the cluster to a different location. I need to run bi-weekly backup/restore tests to ensure that the backups we have are OK. I am getting the warning below, and when I check my cluster, it has not restored the databases or the files inside. Was I supposed to specify the names of the databases I want restored anywhere?

    2023-09-06 11:24:01,044 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
    2023-09-06 11:24:01,100 - bootstrapping - WARNING - Cloning with WAL-E is only possible when CLONE_WALE_*_PREFIX or CLONE_WALG_*_PREFIX or CLONE_WAL_*_BUCKET and CLONE_SCOPE are set.
    2023-09-06 11:24:01,101 - bootstrapping - INFO - No meta-data available for this provider
    2023-09-06 11:24:01,101 - bootstrapping - INFO - Looks like you are running unsupported
    2023-09-06 11:24:01,146 - bootstrapping - INFO - Configuring bootstrap
    2023-09-06 11:24:01,146 - bootstrapping - INFO - Configuring certificate
    2023-09-06 11:24:01,147 - bootstrapping - INFO - Generating ssl self-signed certificate
    2023-09-06 11:24:01,293 - bootstrapping - INFO - Configuring crontab
    2023-09-06 11:24:01,293 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
    2023-09-06 11:24:01,309 - bootstrapping - INFO - Configuring pam-oauth2
    2023-09-06 11:24:01,309 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
    2023-09-06 11:24:01,310 - bootstrapping - INFO - Configuring wal-e
    2023-09-06 11:24:01,311 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
    2023-09-06 11:24:01,311 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
    2023-09-06 11:24:01,312 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
    2023-09-06 11:24:01,312 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
    2023-09-06 11:24:01,312 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
    2023-09-06 11:24:01,313 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
    2023-09-06 11:24:01,313 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_REGION
    2023-09-06 11:24:01,313 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
    2023-09-06 11:24:01,313 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
    2023-09-06 11:24:01,314 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
    2023-09-06 11:24:01,314 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
    2023-09-06 11:24:01,314 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
    2023-09-06 11:24:01,315 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_BACKUP
    2023-09-06 11:24:01,315 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
    2023-09-06 11:24:01,315 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
    2023-09-06 11:24:01,316 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
    2023-09-06 11:24:01,316 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
    2023-09-06 11:24:01,316 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
    2023-09-06 11:24:01,317 - bootstrapping - INFO - Configuring log
    2023-09-06 11:24:01,317 - bootstrapping - INFO - Configuring pgbouncer
    2023-09-06 11:24:01,317 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
    2023-09-06 11:24:01,317 - bootstrapping - INFO - Configuring patroni
    2023-09-06 11:24:01,329 - bootstrapping - INFO - Writing to file /run/postgres.yml
    2023-09-06 11:24:01,330 - bootstrapping - INFO - Configuring standby-cluster
    2023-09-06 11:24:01,330 - bootstrapping - INFO - Configuring pgqd

    1.

      Hello,

      can you please share the manifest of the postgresql cluster to be cloned?

      From the log I can see that you haven't specified the environment variables needed for cloning with WAL-G (CLONE_*).
      Keep in mind that for every WAL-G environment variable, there is also one with a CLONE_ prefix which covers the cloning.

      Kind regards
      Philip

    Thanks, Philip, for your quick response.
    I checked what you suggested and edited my pod-config.yaml as below:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: pod-config
    data:
      WAL_S3_BUCKET: postgresql
      WAL_BUCKET_SCOPE_PREFIX: ""
      WAL_BUCKET_SCOPE_SUFFIX: ""
      USE_WALG_BACKUP: "true"
      USE_WALG_RESTORE: "true"
      BACKUP_SCHEDULE: '00 10 * * *'
      AWS_ACCESS_KEY_ID: *****
      AWS_SECRET_ACCESS_KEY: ******
      AWS_S3_FORCE_PATH_STYLE: "true" # needed for MinIO
      AWS_ENDPOINT: http://****:9000 # Endpoint URL to your S3 endpoint; MinIO in this example
      AWS_REGION: nyc3
      WALG_DISABLE_S3_SSE: "true"
      BACKUP_NUM_TO_RETAIN: "5"
      CLONE_USE_WALG_RESTORE: "true"
      CLONE_AWS_ACCESS_KEY_ID: ******
      CLONE_AWS_SECRET_ACCESS_KEY: *****
      CLONE_AWS_ENDPOINT: http://***:9000 # Endpoint URL to your S3 endpoint; MinIO in this example
      CLONE_AWS_REGION: nyc3
      CLONE_METHOD: CLONE_WITH_WALE
      CLONE_WAL_BUCKET_SCOPE_PREFIX: ""
      CLONE_WAL_S3_BUCKET: postgresql
      CLONE_AWS_S3_FORCE_PATH_STYLE: "true"
    It is working now. Thank you.
    Regards,
    Liz

  17. Hey man! How are you doing? Amazing tutorial :D

    One quick question: whenever I spin up a new Zalando cluster, it fails. I'm trying to do a clone directly from cluster to cluster.

    Here’s the error:

    2023-09-06 19:24:25,024 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
    2023-09-06 19:24:25,050 INFO: No PostgreSQL configuration items changed, nothing to reload.
    2023-09-06 19:24:25,051 INFO: Lock owner: None; I am postgres-demo-cluster-0
    2023-09-06 19:24:25,181 INFO: trying to bootstrap a new cluster
    2023-09-06 19:24:25,181 INFO: Running custom bootstrap script: python3 /scripts/clone_with_basebackup.py --pgpass=/home/postgres/.pgpass_postgresql-1-backoffice-postgresql-1 --host=172.20.12.231 --port=5432 --user="standby"
    2023-09-06 19:24:25,210 INFO: cloning cluster postgres-demo-cluster from "host=172.20.12.231 port=5432 user=standby dbname=postgres"
    pg_basebackup: error: connection to server at "172.20.12.231", port 5432 failed: FATAL: password authentication failed for user "standby"
    password retrieved from file "/home/postgres/.pgpass_postgresql-1-backoffice-postgresql-1"
    connection to server at "172.20.12.231", port 5432 failed: FATAL: password authentication failed for user "standby"
    password retrieved from file "/home/postgres/.pgpass_postgresql-1-backoffice-postgresql-1"
    2023-09-06 19:24:25,275 ERROR: Clone failed

    What I wanted to ask is: do you need to create the standby user on the source database/cluster? For example, when I create this cluster, it always tries to find a secret that doesn't exist; maybe it's related to that.

    Thanks!

    1.

      Hello,
      I'm not entirely sure I understood what you want to do. Can you please share the manifests of
      the source and the clone cluster?

      In general, the standby user and also its secret are managed by the Zalando Operator and need
      to be created by it.

      Kind regards
      Philip

  18. Hi Philip,

    I am wondering why restores are streamed from the leader rather than taken from the S3 backups. Any idea how I can enforce using S3?

    S3 restore fails with following error:
    >> not restoring from backup because of amount of generated wals exceeds 30% of cluster_size
    >> ERROR: Error creating replica using method wal_e: envdir /run/etc/wal-e.d/env bash /scripts/wale_restore.sh exited with code=1

    I see THRESHOLD_PERCENTAGE=30 in /scripts/wale_restore.sh but don't understand the reasoning behind it. There's a supported parameter threshold_backup_size_percentage that could probably be used to pass a jumbo number to suppress this behaviour, but I don't know how to set it.

    Cheers
    Remo

    1.

      Hello Remo,
      you can configure the threshold for when to restore from S3 rather than create a basebackup from the leader by setting the environment variable WALE_BACKUP_THRESHOLD_PERCENTAGE in the Spilo container.
      You can do so via a pod-config configmap or by adding the env var directly to the postgresql custom resource manifest.

      The idea behind this threshold is that the more WAL has been written relative to the database size, the faster it is to restore from a fresh basebackup instead of replaying all that WAL. For very small instances, the
      default setting will almost always lead to a new basebackup being replicated, where it doesn't really matter whether a clone takes a few seconds more or less. For large databases, this approach makes total sense: think of a DB size of 500 GB, where 30% would be 150 GB of WAL alone that would have to be fetched from S3 and applied. Creating a new basebackup from the leader will probably be faster, because you save the amount of WAL that has to be transferred and replayed.
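
      A simplified model of the check wale_restore.sh performs, using the numbers from the log earlier in this thread (the real script measures the sizes itself and is more involved; this only illustrates the 30% rule):

      ```shell
      # Simplified sketch of the 30% threshold decision in wale_restore.sh.
      CLUSTER_SIZE=26814537        # bytes, "Current cluster size" from the log
      WALS_SINCE_BACKUP=16933064   # bytes, "Wals generated since the last backup"
      THRESHOLD_PERCENTAGE="${WALE_BACKUP_THRESHOLD_PERCENTAGE:-30}"

      # Restore from S3 only if the WAL volume stays below the threshold.
      if [ $((WALS_SINCE_BACKUP * 100)) -gt $((CLUSTER_SIZE * THRESHOLD_PERCENTAGE)) ]; then
        METHOD="basebackup_from_leader"
      else
        METHOD="restore_from_s3"
      fi
      echo "$METHOD"
      # → basebackup_from_leader (roughly 63% of the cluster size in WAL, above 30%)
      ```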

      I hope this helps.
      Philip

        1.

          AFAIK, Patroni / Spilo always pulls from the leader. I would say this also makes sense most of the time, since you can't rely on a secondary node being in the most current state; there might be an apply gap between the primary and an already existing secondary.

  19. Hi TheDatabaseMe,

    First of all, thanks for this article; it was very helpful in getting backups configured with the Zalando operator. My colleagues and I find that the documentation is either plain woeful or written by people who assume you already know the operator somewhat, and, having never used it before, how could you?

    Anyway, we are working on an open question: is it possible to do individual database backups/restores? I haven't seen it mentioned in the documentation, and it certainly looks like backups are only cluster/instance-wide. What would your thoughts be on achieving per-DB backups? A cluster per DB? It seems wasteful, but it's k8s... they could be lean pods...

    Cheers

  20. Hey Philip,
    Thank you for these amazing articles. Could you please advise on how to migrate a zalando production database to another managed database like RDS PostgreSQL? Thank you once again.

    1.

      Hello,
      I'm sorry, but I can't give you general advice without knowing the environment and all dependencies, only some
      high-level pointers. I assume the simplest way would be using pg_dump. If you want really minimal downtime,
      you could try to set up replication to RDS PostgreSQL, but this will involve a lot of preparation and conceptual
      work first, and I can't guarantee that it will even work.

      Kind regards
      Philip
