Backup to S3 – Configure Zalando Postgres Operator Backup with WAL-G

Taking base backups of your Postgres databases, together with WAL archiving, enables you to do a point-in-time recovery (PITR) of your database. Backing up databases that have been deployed using the Zalando Postgres Operator is a different beast, though. Here I show you how it’s done with a self-hosted, S3-compatible MinIO Object Storage.

The so-called Spilo images that are deployed when using the Zalando Postgres Operator can do backups and WAL archiving to S3 (compatible) storage using WAL-E or its successor WAL-G. The problem for me is that the Zalando documentation on the WAL-G integration is not very good. You have to put quite a few puzzle pieces together in order to get it running. Since I went through this process, I thought it might come in handy for you too.

I assume that you have the following prerequisites set up:

The Zalando Postgres Operator is deployed on your Kubernetes cluster. If you are looking for a tutorial on that topic, you can find it here.
You have a working S3 Object Storage up and running. If you want to set up your own self-hosted Object Storage using MinIO, you can find the instructions for it here.

Attention: I use my own MinIO Object Storage in this example. Your configuration might differ if you use a different provider. I will try to point out MinIO-specific parameters inline, but your mileage may vary.

If you want to quickstart the whole process, you can head directly to my GitHub repository and apply the kustomization overlay under overlays/enabled-backup.

What the overlay does first is patch the central ConfigMap of the Zalando Operator with the parameter pod_environment_configmap. This is the reference to a Pod-specific ConfigMap which holds the environment variables that configure WAL-G to use our S3 Object Storage.

configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
data:
  pod_environment_configmap: "postgres-operator/pod-config"

If your Pod-specific ConfigMap resides in a Namespace other than default, you need to put the Namespace in front of the name of the ConfigMap, separated by a slash (postgres-operator in my case).
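
The namespace/name notation can be sketched in a few lines of shell. This is only an illustration of how I read the format; the helper name split_configmap_ref is made up for this example, and falling back to the default Namespace when no slash is given is my reading of the rule above, not something I checked in the Operator source.

```shell
# Split a pod_environment_configmap value into Namespace and name.
# Assumption: a value without "/" refers to the "default" Namespace.
# split_configmap_ref is a hypothetical helper, not part of the Operator.
split_configmap_ref() {
  ref="$1"
  case "$ref" in
    */*) echo "namespace=${ref%%/*} name=${ref#*/}" ;;
    *)   echo "namespace=default name=$ref" ;;
  esac
}

split_configmap_ref "postgres-operator/pod-config"
split_configmap_ref "pod-config"
```

The first call prints namespace=postgres-operator name=pod-config, the second one falls back to namespace=default.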

In my case, the Pod-specific ConfigMap holds the following configuration.

pod-config.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-config
data:
  WAL_S3_BUCKET: postgresql
  WAL_BUCKET_SCOPE_PREFIX: ""
  WAL_BUCKET_SCOPE_SUFFIX: ""
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
  BACKUP_SCHEDULE: '00 10 * * *'
  AWS_ACCESS_KEY_ID: postgresql
  AWS_SECRET_ACCESS_KEY: supersecret
  AWS_S3_FORCE_PATH_STYLE: "true" # needed for MinIO
  AWS_ENDPOINT: http://minio.home.lab:9000 # Endpoint URL to your S3 Endpoint; MinIO in this example
  AWS_REGION: de01
  WALG_DISABLE_S3_SSE: "true"
  BACKUP_NUM_TO_RETAIN: "5"
  CLONE_USE_WALG_RESTORE: "true"

Let’s have a look at the parameters.

`AWS_ENDPOINT`	Specifies the S3 Object Storage API endpoint. In my case it’s a MinIO service in my homelab, listening on port 9000.
`AWS_REGION`	The region of your S3 storage.
`AWS_S3_FORCE_PATH_STYLE`	Controls whether S3 path style or virtual-hosted style is used. With my MinIO setup, I can’t use virtual-hosted style. In the end, it determines what the endpoint URL looks like. Path style looks like this: `http://minio.home.lab:9000/<WAL_S3_BUCKET>`, whereas virtual-hosted style looks like this: `http://<WAL_S3_BUCKET>.minio.home.lab:9000`.
`WAL_S3_BUCKET`	The S3 bucket name where your Postgres backups should be stored. You have to create the bucket before you can use it though.
`AWS_ACCESS_KEY_ID`	You can think of this as your “username” to access the Object Storage. Both access key and secret access key have to be created on your Object Storage.
`AWS_SECRET_ACCESS_KEY`	This is the secret “password” to your Object Storage. Both access key and secret access key have to be created on your Object Storage.
`USE_WALG_BACKUP` and `USE_WALG_RESTORE`	By default, the Spilo images that the Zalando Operator deploys use WAL-E instead of its successor WAL-G. WAL-G is way faster than WAL-E, but they advise against using WAL-G for production workloads for now. For my homelab, I’m pretty sure WAL-G will suit me well.
`WALG_DISABLE_S3_SSE`	Disables S3 server-side encryption (SSE) of the backups. Encryption is not possible in my MinIO setup, so I switch it off.
`WAL_BUCKET_SCOPE_PREFIX` and `WAL_BUCKET_SCOPE_SUFFIX`	By default, all backups will be stored under a path which includes the cluster UID and the namespace of the cluster. I decided to blank both parameters because they cause trouble when trying to restore the cluster later.
`BACKUP_NUM_TO_RETAIN`	Controls the number of WAL-G backups which should reside on your S3 storage. At least it should, but at the time of writing, there are known issues regarding this topic. I advise you to use a lifecycle policy on your S3 storage in order to get some kind of housekeeping running.
`CLONE_USE_WALG_RESTORE`	Makes cloning of your Postgres cluster use WAL-G instead of WAL-E. Restore and cloning are a topic of their own, so I will not go into detail here.
`BACKUP_SCHEDULE`	The schedule, in cron format, for when a base backup should be made. In my case, every day at 10 am.
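
To make the difference between the two addressing styles from `AWS_S3_FORCE_PATH_STYLE` more tangible, here is a small shell sketch that builds both URL variants from the endpoint and the bucket name. The function s3_bucket_url is a made-up helper for illustration only; WAL-G does this internally and may handle edge cases differently.

```shell
# Build the bucket URL for both S3 addressing styles.
# Usage: s3_bucket_url <force_path_style> <endpoint> <bucket>
# s3_bucket_url is a hypothetical helper, not part of WAL-G or Spilo.
s3_bucket_url() {
  force_path_style="$1"; endpoint="$2"; bucket="$3"
  scheme="${endpoint%%://*}"   # e.g. "http"
  host="${endpoint#*://}"      # e.g. "minio.home.lab:9000"
  if [ "$force_path_style" = "true" ]; then
    # Path style: the bucket is the first path element
    echo "${scheme}://${host}/${bucket}"
  else
    # Virtual-hosted style: the bucket becomes part of the hostname
    echo "${scheme}://${bucket}.${host}"
  fi
}

s3_bucket_url true  http://minio.home.lab:9000 postgresql
s3_bucket_url false http://minio.home.lab:9000 postgresql
```

Virtual-hosted style needs a DNS entry for every bucket hostname, which is the reason I stay with path style in my homelab.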

The kustomization overlay looks like this (see the GitHub repository mentioned above):

kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: postgres-operator

resources:
  - pod-config.yaml
  - ../../base
  - ../../ui

patchesStrategicMerge:
  - configmap.yaml

You may now apply the kustomization overlay like this:

kubectl apply -k overlays/enabled-backup/

This will deploy the Zalando Operator with all needed adjustments regarding backup. If you had already deployed the Operator, this will patch the postgres-operator ConfigMap. You will need to restart the Operator Pod for the Pod-specific ConfigMap to be applied.
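
One way to restart the Operator Pod is a rollout restart of its Deployment. The following is a minimal sketch, assuming the Deployment is named postgres-operator and lives in the postgres-operator Namespace; both names are assumptions about your installation, and restart_operator is just a helper I made up for this example.

```shell
# Restart the Operator so that it re-reads its ConfigMap.
# Assumptions: Deployment "postgres-operator" in Namespace "postgres-operator".
# Usage: restart_operator [namespace] [deployment]
restart_operator() {
  namespace="${1:-postgres-operator}"
  deployment="${2:-postgres-operator}"
  # Trigger a new rollout, then wait until the new Pod is ready
  kubectl rollout restart "deployment/${deployment}" -n "$namespace" &&
    kubectl rollout status "deployment/${deployment}" -n "$namespace"
}
```

Call restart_operator without arguments to use the defaults, or pass your own Namespace and Deployment name.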

Now let’s quickly build a Postgres cluster with the Operator. You can find some examples in my GitHub repository under the manifests folder.

demo-cluster.yaml

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: postgres-demo-cluster
  namespace: postgres
spec:
  teamId: "postgres"
  volume:
    size: 2Gi
  numberOfInstances: 2
  users:
    demouser:  # database owner
    - superuser
    - createdb
  databases:
    demo: demouser  # dbname: owner
  preparedDatabases:
    demo: {}
  postgresql:
    version: "14"

Let’s deploy it:

kubectl apply -f demo-cluster.yaml

After a short time, your cluster should start, and directly after the cluster has been created, a backup will be made (regardless of the schedule you’ve configured). Exec into the container; if your configuration works, you should see all environment variables set as an envdir under /run/etc/wal-e.d/env.

> /run/etc/wal-e.d/env# ls -ltr
total 68
-rw-r--r-- 1 postgres root  1 Mar 26 15:59 WALG_UPLOAD_CONCURRENCY
-rw-r--r-- 1 postgres root 50 Mar 26 15:59 WALG_S3_PREFIX
-rw-r--r-- 1 postgres root  1 Mar 26 15:59 WALG_DOWNLOAD_CONCURRENCY
-rw-r--r-- 1 postgres root  4 Mar 26 15:59 WALG_DISABLE_S3_SSE
-rw-r--r-- 1 postgres root 50 Mar 26 15:59 WALE_S3_PREFIX
-rw-r--r-- 1 postgres root 31 Mar 26 15:59 WALE_S3_ENDPOINT
-rw-r--r-- 1 postgres root  4 Mar 26 15:59 USE_WALG_RESTORE
-rw-r--r-- 1 postgres root  4 Mar 26 15:59 USE_WALG_BACKUP
-rw-r--r-- 1 postgres root  8 Mar 26 15:59 AWS_SECRET_ACCESS_KEY
-rw-r--r-- 1 postgres root  4 Mar 26 15:59 AWS_S3_FORCE_PATH_STYLE
-rw-r--r-- 1 postgres root  4 Mar 26 15:59 AWS_REGION
-rw-r--r-- 1 postgres root 26 Mar 26 15:59 AWS_ENDPOINT
-rw-r--r-- 1 postgres root 10 Mar 26 15:59 AWS_ACCESS_KEY_ID
-rw-r--r-- 1 postgres root  6 Mar 26 15:59 WALE_LOG_DESTINATION
-rw-r--r-- 1 root     root 25 Mar 26 15:59 TMPDIR
-rw-r--r-- 1 postgres root  4 Mar 26 15:59 PGPORT
-rw-r--r-- 1 postgres root  1 Mar 26 15:59 BACKUP_NUM_TO_RETAIN
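
If you don’t want to open an interactive shell, a listing like the one above can also be fetched with a single kubectl exec. This is a sketch under the assumption that the first Pod of the cluster is named after the cluster with the suffix -0 and runs in the postgres Namespace; list_backup_env is a hypothetical helper name.

```shell
# List the WAL-G envdir of a Spilo Pod without an interactive shell.
# Assumptions: Pod "postgres-demo-cluster-0" in Namespace "postgres".
# Usage: list_backup_env [pod] [namespace]
list_backup_env() {
  pod="${1:-postgres-demo-cluster-0}"
  namespace="${2:-postgres}"
  kubectl exec "$pod" -n "$namespace" -- ls -ltr /run/etc/wal-e.d/env
}
```

If the directory does not exist at all, the environment variables have most likely not been injected into the Pod.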

You can check for created backups using the WAL-G client. Issue the following command as root user from within your Postgres container.

> envdir "/run/etc/wal-e.d/env" wal-g backup-list
name                          modified             wal_segment_backup_start
base_000000010000000000000004 2022-03-26T16:00:13Z 000000010000000000000004
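
The backup name carries the name of the WAL segment at which the backup started. A PostgreSQL WAL file name consists of 24 hexadecimal digits: 8 for the timeline ID, 8 for the log number and 8 for the segment number. The following sketch splits a backup name into these parts; decode_backup_name is a helper made up for this example.

```shell
# Decode a WAL-G backup name of the form base_<24 hex digits>.
# The 24 digits are three 8-digit hex fields:
# timeline ID, log number, segment number.
# decode_backup_name is a hypothetical helper, not part of WAL-G.
decode_backup_name() {
  wal="${1#base_}"
  timeline=$(printf '%s' "$wal" | cut -c1-8)
  log=$(printf '%s' "$wal" | cut -c9-16)
  segment=$(printf '%s' "$wal" | cut -c17-24)
  # printf %d converts the 0x-prefixed hex fields to decimal
  printf 'timeline=%d log=%d segment=%d\n' "0x$timeline" "0x$log" "0x$segment"
}

decode_backup_name base_000000010000000000000004
```

For the backup shown above, this prints timeline=1 log=0 segment=4.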

If you are not able to see any backups or you get error messages when using the backup-list command, check for more information in the Pod logs of your Postgres cluster Pods.
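
A quick way to scan the Pod logs for backup-related messages is to filter them with grep. This is only a sketch: the Pod name and Namespace are assumptions matching the demo cluster, the search pattern is my own guess at useful keywords, and backup_log_lines is a hypothetical helper.

```shell
# Show only log lines that mention errors, WAL-G or backups.
# Assumptions: Pod "postgres-demo-cluster-0" in Namespace "postgres";
# the keyword pattern is a guess and may need tuning.
# Usage: backup_log_lines [pod] [namespace]
backup_log_lines() {
  pod="${1:-postgres-demo-cluster-0}"
  namespace="${2:-postgres}"
  kubectl logs "$pod" -n "$namespace" | grep -iE 'error|wal-g|backup'
}
```

If the filter returns nothing, read the full log; the pattern above will not catch every possible message.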

In my case, I can use the MinIO CLI client to view the objects stored on my S3 storage.

> mc ls minio/postgresql/spilo/postgres-demo-cluster/wal/14/basebackups_005/
[2022-03-26 17:00:13 CET] 174KiB STANDARD base_000000010000000000000004_backup_stop_sentinel.json
[2022-03-26 17:19:06 CET]     0B base_000000010000000000000004/
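
The path in this listing can be reconstructed from the bucket name, the cluster name and the Postgres major version. The sketch below is inferred from the listing and from the size of the WALG_S3_PREFIX file in the envdir; I have not verified it against the Spilo source, and the position of the scope prefix and suffix is an assumption. walg_s3_prefix is a made-up helper.

```shell
# Rebuild the storage prefix under which WAL-G stores backups and WAL.
# Assumption: layout inferred from the listing in this article; where the
# scope prefix/suffix are placed is a guess.
# Usage: walg_s3_prefix <bucket> <cluster> <pg_version> [scope_prefix] [scope_suffix]
walg_s3_prefix() {
  bucket="$1"; cluster="$2"; pg_version="$3"
  scope_prefix="${4:-}"; scope_suffix="${5:-}"
  echo "s3://${bucket}/spilo/${scope_prefix}${cluster}${scope_suffix}/wal/${pg_version}"
}

walg_s3_prefix postgresql postgres-demo-cluster 14
```

The result for my demo cluster is 50 characters long, which matches the 50 bytes of the WALG_S3_PREFIX file in the envdir listing.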

But WAL-G does more than just base backups. Because we set up WAL-G, the Zalando Operator configured our PostgreSQL database to use WAL-G for WAL archiving too. You can find the WAL archives under the following path on your S3:

> mc ls minio/postgresql/spilo/postgres-demo-cluster/wal/14/wal_005/
[2022-03-26 16:59:19 CET] 4.2MiB STANDARD 000000010000000000000001.lz4
[2022-03-26 16:59:27 CET]   255B STANDARD 000000010000000000000002.00000028.backup.lz4
[2022-03-26 16:59:27 CET]  65KiB STANDARD 000000010000000000000002.lz4
[2022-03-26 17:00:11 CET] 184KiB STANDARD 000000010000000000000003.lz4
[2022-03-26 17:00:12 CET]   266B STANDARD 000000010000000000000004.00000028.backup.lz4
[2022-03-26 17:00:12 CET]  65KiB STANDARD 000000010000000000000004.lz4

So our backup is working. As mentioned above, restore and cloning are a different topic, and I promise to write about it soon.

Philip


28 thoughts on “Backup to S3 – Configure Zalando Postgres Operator Backup with WAL-G”

Sidharth 2. May 2022 at 10:13

Can’t wait for the restore and cloning part! Any documents that are already available on how to do it?
1. TheDatabaseMe 2. May 2022 at 12:23
  
  Thanks for the feedback. I’m working on it. Apart from the “official” documentation in the Zalando GitHub repository here, I’m only aware of this one here. But neither is really usable in my opinion.
  
  Keep checking back for updates on this topic.
  Philip
2. TheDatabaseMe 3. May 2022 at 22:34
  
  You can find the restore part of this topic here.
Lima 4. August 2022 at 9:49

Hi!

I’m having trouble setting up WAL-G, and the documentation does not help me whatsoever.
When I start the Postgres cluster with the env variables required for WAL-G, my standby cluster does not start and gives me this error in the logs:
2022-08-03 09:02:01,501 INFO: Lock owner: paas-test-db-cluster-0; I am test-test-db-cluster-1
2022-08-03 09:02:01,501 INFO: bootstrap from leader 'test-test-db-cluster-0' in progress
If I start the cluster without WAL-G backup, it works fine.
Did you ever encounter this error while setting up WAL-G?
1. TheDatabaseMe 4. August 2022 at 10:10
  
  Hi Lima,
  
  it’s hard to tell from these two lines what exactly the problem is. I assume that either your PostgreSQL manifest has an error or your S3 configuration is bad in some way. Keep in mind that when the Zalando Operator recognises that a WAL-G configuration exists, it will try to bootstrap the secondary nodes from a backup in the S3 bucket. If it can’t access the backup, that bootstrap will not work. In my understanding, however, it should then fall back to bootstrapping the secondary directly from the primary. There should be more log information from the secondary apart from these two lines.
  
  Have you double-checked that a backup was made from the primary when you applied the WAL-G config? If so, can you share the path to the backup with me here? Can you also share the cluster manifest with me?
  
  Kind regards
  Philip
2. 1. Lima 11. August 2022 at 15:36
    
    Hi Philip!
    
    Thank you for your fast reply. We were able to solve the problem. There were some minor issues in our manifest file, but we were able to fix them.
    Sorry for the late update.
    
    Best,
    Lima
  2. 1. Roman 26. August 2022 at 22:26
      
      Can you just please please always include details about what was the issue?
Rafael 9. August 2022 at 2:34

Hi, does this ensure that the database is restored when the cluster has been recreated using ArgoCD? Will my database be restored if I apply the same manifests using ArgoCD?

I have set everything up using your excellent tutorial. Still, if I destroy the cluster using kubectl and redeploy it, it comes back empty, which makes it impossible for me to use this operator to have persistence outside the k8s cluster.
1. TheDatabaseMe 17. August 2022 at 17:13
  
  Hello Rafael,
  
  sorry for the late reply. I’m not sure if I understand your problem 100%. As I understand it, you manage the PostgreSQL CRD with ArgoCD. In this case, you either have to tell Argo to ignore the clone section within the CRD (see here) or you have to specify the clone section within the ArgoCD-managed App repository.
  
  I hope this helps you further. If this is not your problem, then please contact me again.
  
  Kind regards
  Philip
Soulimane Mammar 29. January 2023 at 21:15

Hi,
I have a working database and I want to configure backup as detailed by this post. My question is:
Once the operator is reconfigured, will the backup start for the already installed cluster, or do I have to create a new one for the configuration to take effect?
Regards
1. TheDatabaseMe 29. January 2023 at 23:42
  
  Hi,
  as soon as you have configured the backup parameters, the Operator will restart your database pods and apply the environment configuration for the backups. From then on, a first full backup will be made soon after the first start and then recurring as defined by your backup schedule. WAL archives will be stored automatically as well when you do WAL-E / WAL-G backups.
  
  Kind regards
  Philip
Gajendra D Ambi 10. May 2023 at 15:52

I am using the latest Zalando PGO. When I ssh into the operator pods or the Postgres pods, I see no /run/etc/wal-e.d/env as you mentioned here. I have configured
1. TheDatabaseMe 10. May 2023 at 16:48
  
  Hello,
  this can probably only be one of two things.
  
  1. Check that you have specified the WAL-G relevant environment variables (e.g. AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) somewhere. This could be on the postgresql CRD itself, or via a pod-config ConfigMap (as I did in the article).
  2. If you’ve done it via the ConfigMap, ensure that you’ve specified the pod_environment_configmap: "postgres-operator/pod-config" parameter within the postgres-operator ConfigMap. Also ensure that the pod-config ConfigMap exists in the namespace you’ve specified in the postgres-operator ConfigMap (in my example it’s all in the postgres-operator namespace). I would assume that you’ve created the pod-config ConfigMap in the default namespace by mistake.
  
  Kind Regards
  Philip
Gajendra D Ambi 10. May 2023 at 19:00

Appreciate the response. I have mimicked your config here.
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
  namespace: backend
data:
  pod_environment_configmap: "postgres-operator/pod-config"
  aws_region: ap-south-1
  kube_iam_role: postgres-pod-role
  wal_s3_bucket:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-config
  namespace: backend
data:
  WAL_S3_BUCKET:
  WAL_BUCKET_SCOPE_PREFIX: ""
  WAL_BUCKET_SCOPE_SUFFIX: ""
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
  BACKUP_SCHEDULE: "11 11 * * *"
  # Access key
  AWS_ACCESS_KEY_ID:
  # Secret access key
  AWS_SECRET_ACCESS_KEY:
  AWS_S3_FORCE_PATH_STYLE: "true" # needed for MinIO
  AWS_ENDPOINT:
  AWS_REGION:
  WALG_DISABLE_S3_SSE: "true"
  BACKUP_NUM_TO_RETAIN: "5"
  CLONE_USE_WALG_RESTORE: "true"

I have installed the operator in my namespace ‘backend’ with helm chart.
1. TheDatabaseMe 10. May 2023 at 20:50
  
  Hello,
  
  I’ve redacted the secret information from your comment, just FYI. The error lies within the postgres-operator ConfigMap. In the parameter pod_environment_configmap, you specified that the operator should merge configuration parameters from a ConfigMap named pod-config in the namespace postgres-operator. So it searches for the ConfigMap in the wrong place. Changing that to pod_environment_configmap: "backend/pod-config" should do the trick.
  
  Also, I’m not 100% sure whether an Operator deployed via Helm uses the OperatorConfiguration CRD instead of the ConfigMap. But you will find that out. Check for a custom resource of type OperatorConfiguration.
  
  Philip
2. 1. Gajendra D Ambi 10. May 2023 at 21:27
    
    Even after that modification, the logs of the postgres pod hippo-0 say nothing about any backup. Cron is set to BACKUP_SCHEDULE: "1 * * * *", but no go. Nothing in my AWS S3. I even changed the
    AWS_ENDPOINT: s3://arn:aws:s3:ap-south-1:xxxxxxx:accesspoint/yyyyyyy
    2023-05-10 19:23:35,866 INFO: no action. I am (hippo-0), the leader with the lock
    2023-05-10 19:23:45,862 INFO: no action. I am (hippo-0), the leader with the lock
    2023-05-10 19:23:52.598 UTC [32] LOG {ticks: 0, maint: 0, retry: 0}
    2023-05-10 19:23:55,863 INFO: no action. I am (hippo-0), the leader with the lock
    
    the operator too has no logs about any backup.
    time="2023-05-10T19:15:27Z" level=info msg="found pod: \"backend/hippo-0\" (uid: \"57859770-7506-48a7-8d22-86de9b2a30dd\")" cluster-name=backend/hippo pkg=cluster worker=1
    time="2023-05-10T19:15:27Z" level=info msg="found PVC: \"backend/pgdata-hippo-0\" (uid: \"be749801-269b-401e-8781-a91c1900dc18\")" cluster-name=backend/hippo pkg=cluster worker=1
    time="2023-05-10T19:15:27Z" level=debug msg="syncing connection pooler (master, replica) from (false, nil) to (false, nil)" cluster-name=backend/hippo pkg=cluster worker=1
    time="2023-05-10T19:15:27Z" level=info msg="cluster has been created" cluster-name=backend/hippo pkg=controller worker=1
    
    nothing about backup
  2. 1. TheDatabaseMe 10. May 2023 at 21:47
      
      You have restarted the Zalando Operator pod after changing its configuration, right? Also, have you checked what I’ve written regarding the OperatorConfiguration CRD?
    2. 1. Gajendra D Ambi 10. May 2023 at 22:33
        
        Yes, the CRD does exist. I restarted the PGO every time I made changes. Since it is for testing, I have even made the buckets public, but no go.
      3. TheDatabaseMe 10. May 2023 at 22:43
        
        When the CRD exists, the ConfigMap is ignored. Place the pod_environment_configmap parameter there instead. It has nothing to do with your S3. The only problem is that the operator does not inject the env vars properly into the Spilo pod.
      4. TheDatabaseMe 11. May 2023 at 7:43
        
        Or, even better, redeploy Zalando as described here and not via Helm.
        https://thedatabaseme.de/2022/03/13/keep-the-elefants-in-line-deploy-zalando-operator-on-your-kubernetes-cluster
Wee Sritippho 3. July 2023 at 10:16

Thank you. Your post made configuring the backup for Postgres Operator much easier.

In my case, I used Rook Ceph RGW, which provides a Secret and a ConfigMap upon bucket creation. Therefore, I added custom pod environment variables [via the Postgres cluster manifest][1] instead since it allows referencing existing Secrets/ConfigMaps.

[1]: https://postgres-operator.readthedocs.io/en/stable/administrator/#via-postgres-cluster-manifest
1. TheDatabaseMe 3. July 2023 at 12:46
  
  Glad you liked it.
  
  Philip
lee 9. October 2023 at 5:03

hello
I’m having trouble with the settings

ERROR: 2023/10/09 02:58:50.512386 failed to upload 'spilo/postgres-demo-cluster/wal/15/basebackups_005/base_000000010000000000000003/tar_partitions/part_001.tar.lz4' to bucket 'postgresql': InvalidArgument: S3 API Requests must be made to API port. status code: 400, request id: , host id:
ERROR: 2023/10/09 02:58:50.512391 upload: could not upload 'base_000000010000000000000003/tar_partitions/part_001.tar.lz4'
ERROR: 2023/10/09 02:58:50.512393 failed to upload 'spilo/postgres-demo-cluster/wal/15/basebackups_005/base_000000010000000000000003/tar_partitions/part_001.tar.lz4' to bucket 'postgresql': InvalidArgument: S3 API Requests must be made to API port. status code: 400, request id: , host id:
ERROR: 2023/10/09 02:58:50.512394 Unable to continue the backup process because of the loss of a part 1.

My config
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-config
data:
  WAL_S3_BUCKET: postgresql
  WAL_BUCKET_SCOPE_PREFIX: ""
  WAL_BUCKET_SCOPE_SUFFIX: ""
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
  BACKUP_SCHEDULE: '00 10 * * *'
  AWS_ACCESS_KEY_ID: xxx
  AWS_SECRET_ACCESS_KEY: xxx
  AWS_S3_FORCE_PATH_STYLE: "true"
  AWS_ENDPOINT: http://172.30.31.12:31794
  AWS_REGION: de01
  WALG_DISABLE_S3_SSE: "true"
  BACKUP_NUM_TO_RETAIN: "5"
  CLONE_USE_WALG_RESTORE: "true"
1. TheDatabaseMe 9. October 2023 at 19:41
  
  Hello Lee,
  
  most likely, the issue is a misconfigured AWS_ENDPOINT. The setting AWS_ENDPOINT: http://172.30.31.12:31794 looks wrong. It seems to be a cluster IP combined with a nodePort, which will not work. If the service is of type NodePort, then you need to insert the IP of one of your Kubernetes nodes here. I don’t know your S3 setup, but you don’t need to use a public-facing service / IP when the Postgres cluster runs on the same Kubernetes cluster. In that case, you can specify the service DNS name + MinIO API port, e.g. http://minio-svc.minio.svc.cluster.local:9000. With that, you can also use a service of type ClusterIP.
  
  Long story short, the error tells you that the container can’t communicate with the S3 endpoint using the given setting.
  
  Kind regards
  Philip
lee 10. October 2023 at 4:01

172.30.31.12 is my nodeip
1. TheDatabaseMe 10. October 2023 at 7:05
  
  OK, can you access the bucket with an S3 client from another system (e.g. your client machine)?
  
  Kind regards
2. 1. lee 12. October 2023 at 8:01
    
    MinIO was not working properly. It works fine now, thank you.
Azad 6. February 2024 at 20:13

Hi, really great documentation and explanation! I appreciate your work a lot. Thank you! It took me a while to understand everything but i managed it.
