Taking basebackups of Postgres databases enables you to do a point-in-time recovery (PITR) of your database. Doing backups for databases that have been deployed using the Zalando Postgres Operator is a different beast. Here I show you how it’s done with MinIO, a (self-)hosted S3-compatible Object Storage.
The so-called Spilo images that are deployed when using the Zalando Postgres Operator can do backups and WAL archiving to S3 (compatible) storage using WAL-E or its successor WAL-G. For me, the problem is that the documentation on the WAL-G integration on the Zalando side is not very good. You have to put quite some puzzle pieces together in order to get it running. Because I went through this process, I thought it might come in handy for you too.
I assume that you have the following prerequisites set up:
- Zalando Postgres Operator is deployed on your Kubernetes cluster. If you search for a tutorial on that topic, you can find it here.
- You have a working S3 Object Storage up and running. If you want to setup your own, self-hosted Object Storage using MinIO, you can find the instructions for it here.
Attention: I use my own MinIO Object Storage in this example. There might be some different configuration for you if you use a different provider. I will try to mention MinIO-specific parameters inline, but your mileage may vary.
If you want to quickstart the whole process, you can directly head to my Github repository and apply the kustomization overlay under overlays/enabled-backup.
What the overlay does first is patch the Zalando Operator’s central Configmap with the parameter pod_environment_configmap. This is a reference to a Pod-specific Configmap which holds the environment variables that configure WAL-G to use our S3 Object Storage.
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
data:
  pod_environment_configmap: "postgres-operator/pod-config"
If your Pod-specific Configmap resides in a Namespace other than default, you need to specify the Namespace before the name of the Configmap (postgres-operator in my case).
The Pod-specific Configmap holds this configuration in my case:
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-config
data:
  WAL_S3_BUCKET: postgresql
  WAL_BUCKET_SCOPE_PREFIX: ""
  WAL_BUCKET_SCOPE_SUFFIX: ""
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
  BACKUP_SCHEDULE: '00 10 * * *'
  AWS_ACCESS_KEY_ID: postgresql
  AWS_SECRET_ACCESS_KEY: supersecret
  AWS_S3_FORCE_PATH_STYLE: "true" # needed for MinIO
  AWS_ENDPOINT: http://minio.home.lab:9000 # URL to your S3 endpoint; MinIO in this example
  AWS_REGION: de01
  WALG_DISABLE_S3_SSE: "true"
  BACKUP_NUM_TO_RETAIN: "5"
  CLONE_USE_WALG_RESTORE: "true"
Let’s have a look at the parameters.
AWS_ENDPOINT | Specifies the S3 Object Storage API endpoint. In my case it’s a MinIO service in my homelab, listening on port 9000.
AWS_REGION | The region of your S3 storage.
AWS_S3_FORCE_PATH_STYLE | Controls whether S3 path style or virtual-hosted style is used, i.e. how the endpoint URL will look. Path style looks like http://minio.home.lab:9000/<WAL_S3_BUCKET>, while virtual-hosted style would look like http://<WAL_S3_BUCKET>.minio.home.lab:9000. With my MinIO setup, I can’t use virtual-hosted style.
WAL_S3_BUCKET | The S3 bucket name where your Postgres backups should be stored. You have to create the bucket before you can use it though (see the example after this table).
AWS_ACCESS_KEY_ID | You can think of this as your “username” to access the Object Storage. Both access key and secret access key have to be created on your Object Storage.
AWS_SECRET_ACCESS_KEY | This is the secret “password” to your Object Storage.
USE_WALG_BACKUP and USE_WALG_RESTORE | By default, the Spilo images that the Zalando Operator deploys use WAL-E instead of its successor WAL-G. WAL-G is way faster than WAL-E, but Zalando advises against using WAL-G for production workloads yet. In my homelab, I’m pretty sure WAL-G will suit me well.
WALG_DISABLE_S3_SSE | Disables the backup encryption, which is not possible in my MinIO setup anyway.
WAL_BUCKET_SCOPE_PREFIX and WAL_BUCKET_SCOPE_SUFFIX | By default, all backups will be stored under a path which includes the cluster UID and the namespace of the cluster. I decided to blank both parameters because they cause trouble when trying to restore the cluster later.
BACKUP_NUM_TO_RETAIN | Controls the number of WAL-G backups which should be kept on your S3 storage. At least it should, but at the time of writing, there are known issues around this. I advise you to use a lifecycle policy on your S3 storage instead to get some kind of housekeeping running (also covered in the example after this table).
CLONE_USE_WALG_RESTORE | Controls whether WAL-G is used instead of WAL-E when cloning your Postgres cluster. Restore and cloning is a topic of its own, so I will not go into details here.
BACKUP_SCHEDULE | The schedule in cron format for when a basebackup should be made. In my case, every day at 10am.
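If you run MinIO like me, the bucket, the credentials, and a lifecycle rule for housekeeping can all be created with the MinIO client mc. A minimal sketch, assuming you have admin credentials for your MinIO instance; the 90-day expiry is an arbitrary example, and the exact subcommand names may differ slightly between mc versions:

# Register the MinIO endpoint under the alias "minio" (admin credentials assumed)
mc alias set minio http://minio.home.lab:9000 <admin-access-key> <admin-secret-key>

# Create the bucket referenced by WAL_S3_BUCKET
mc mb minio/postgresql

# Create the access key / secret access key pair used in the pod-config Configmap
mc admin user add minio postgresql supersecret
mc admin policy attach minio readwrite --user postgresql

# Housekeeping via lifecycle rule, since BACKUP_NUM_TO_RETAIN is unreliable
mc ilm rule add --expire-days 90 minio/postgresql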
The kustomization overlay looks like this (see the Github repository mentioned above):
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: postgres-operator
resources:
  - pod-config.yaml
  - ../../base
  - ../../ui
patchesStrategicMerge:
  - configmap.yaml
You may now apply the kustomization overlay like this:
kubectl apply -k overlays/enabled-backup/
This will deploy the Zalando Operator with all needed adjustments regarding backup. If you had already deployed the Operator before, it will patch the postgres-operator Configmap. You will need to restart the Operator Pod in order to get the Pod-specific Configmap applied.
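Assuming the Operator runs as a Deployment named postgres-operator in the postgres-operator namespace (which is what my kustomize setup creates), the restart can be done like this:

kubectl -n postgres-operator rollout restart deployment postgres-operator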
Now let’s quickly build a Postgres cluster with the Operator. You can find some examples in my Github repository under the manifests folder.
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
name: postgres-demo-cluster
namespace: postgres
spec:
teamId: "postgres"
volume:
size: 2Gi
numberOfInstances: 2
users:
demouser: # database owner
- superuser
- createdb
databases:
demo: demouser # dbname: owner
preparedDatabases:
demo: {}
postgresql:
version: "14"
Let’s deploy it:
kubectl apply -f demo-cluster.yaml
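You can watch the cluster come up via the postgresql custom resource; namespace and name match the manifest above:

kubectl -n postgres get postgresql postgres-demo-cluster --watch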
After a short time, your cluster should be started, and directly after the cluster has been created, a backup will be made (regardless of the schedule you’ve configured). Exec into the container; if your configuration works, you should see all environment variables set as an envdir under /run/etc/wal-e.d/env.
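A shell inside the container can be opened with kubectl exec; the Pod name below is derived from the demo cluster above:

kubectl -n postgres exec -it postgres-demo-cluster-0 -- bash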
> /run/etc/wal-e.d/env# ls -ltr
total 68
-rw-r--r-- 1 postgres root 1 Mar 26 15:59 WALG_UPLOAD_CONCURRENCY
-rw-r--r-- 1 postgres root 50 Mar 26 15:59 WALG_S3_PREFIX
-rw-r--r-- 1 postgres root 1 Mar 26 15:59 WALG_DOWNLOAD_CONCURRENCY
-rw-r--r-- 1 postgres root 4 Mar 26 15:59 WALG_DISABLE_S3_SSE
-rw-r--r-- 1 postgres root 50 Mar 26 15:59 WALE_S3_PREFIX
-rw-r--r-- 1 postgres root 31 Mar 26 15:59 WALE_S3_ENDPOINT
-rw-r--r-- 1 postgres root 4 Mar 26 15:59 USE_WALG_RESTORE
-rw-r--r-- 1 postgres root 4 Mar 26 15:59 USE_WALG_BACKUP
-rw-r--r-- 1 postgres root 8 Mar 26 15:59 AWS_SECRET_ACCESS_KEY
-rw-r--r-- 1 postgres root 4 Mar 26 15:59 AWS_S3_FORCE_PATH_STYLE
-rw-r--r-- 1 postgres root 4 Mar 26 15:59 AWS_REGION
-rw-r--r-- 1 postgres root 26 Mar 26 15:59 AWS_ENDPOINT
-rw-r--r-- 1 postgres root 10 Mar 26 15:59 AWS_ACCESS_KEY_ID
-rw-r--r-- 1 postgres root 6 Mar 26 15:59 WALE_LOG_DESTINATION
-rw-r--r-- 1 root root 25 Mar 26 15:59 TMPDIR
-rw-r--r-- 1 postgres root 4 Mar 26 15:59 PGPORT
-rw-r--r-- 1 postgres root 1 Mar 26 15:59 BACKUP_NUM_TO_RETAIN
You can check for created backups using the WAL-G client. Issue the following command as the root user from within your Postgres container.
> envdir "/run/etc/wal-e.d/env" wal-g backup-list
name modified wal_segment_backup_start
base_000000010000000000000004 2022-03-26T16:00:13Z 000000010000000000000004
If you are not able to see any backups or you get error messages when using the backup-list command, check for more information in the Pod logs of your Postgres cluster Pods.
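Fetching the Pod logs and grepping for backup-related lines is a quick first check; the grep pattern is just an example:

kubectl -n postgres logs postgres-demo-cluster-0 | grep -iE "wal-g|backup"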
In my case, I can use the MinIO CLI client to view the objects stored on my S3 storage.
> mc ls minio/postgresql/spilo/postgres-demo-cluster/wal/14/basebackups_005/
[2022-03-26 17:00:13 CET] 174KiB STANDARD base_000000010000000000000004_backup_stop_sentinel.json
[2022-03-26 17:19:06 CET] 0B base_000000010000000000000004/
But WAL-G does more than only basebackups. Because we set up WAL-G, the Zalando Operator configured our PostgreSQL database to use WAL-G for WAL archiving too. You can find the WAL archives under the following path on your S3 storage:
> mc ls minio/postgresql/spilo/postgres-demo-cluster/wal/14/wal_005/
[2022-03-26 16:59:19 CET] 4.2MiB STANDARD 000000010000000000000001.lz4
[2022-03-26 16:59:27 CET] 255B STANDARD 000000010000000000000002.00000028.backup.lz4
[2022-03-26 16:59:27 CET] 65KiB STANDARD 000000010000000000000002.lz4
[2022-03-26 17:00:11 CET] 184KiB STANDARD 000000010000000000000003.lz4
[2022-03-26 17:00:12 CET] 266B STANDARD 000000010000000000000004.00000028.backup.lz4
[2022-03-26 17:00:12 CET] 65KiB STANDARD 000000010000000000000004.lz4
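By the way, if you don’t want to wait for the next schedule, you can trigger a basebackup manually from within the Postgres container. A sketch, assuming the Spilo image ships its backup wrapper script under /scripts/postgres_backup.sh; calling WAL-G directly works as well:

# Via the Spilo wrapper script (path assumed)
envdir /run/etc/wal-e.d/env /scripts/postgres_backup.sh "$PGDATA"

# Or directly via WAL-G
envdir /run/etc/wal-e.d/env wal-g backup-push "$PGDATA"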
So our backup is working. As mentioned above, restore and cloning is a different topic, and I promise to write about it soon.
Philip
Can’t wait for the restore and cloning part! Are there any documents already available on how to do it?
Thanks for the feedback. I’m working on it. Apart from the “official” documentation on the Zalando Github repository here, I’m only aware of this one here. But both are not really usable in my opinion.
Keep checking back for updates on this topic.
Philip
You can find the restore part of this topic here.
Hi!
I’m having trouble with setting up WAL-G, and the documentation does not help me whatsoever.
When I start the Postgres cluster with the env variables required for WAL-G, my standby cluster does not start and gives me this error in the logs:
2022-08-03 09:02:01,501 INFO: Lock owner: paas-test-db-cluster-0; I am test-test-db-cluster-1
2022-08-03 09:02:01,501 INFO: bootstrap from leader 'test-test-db-cluster-0' in progress
If I start the cluster without the WAL-G backup, it works fine.
Did you ever encounter this error while setting up WAL-G?
Hi Lima,
hard to tell from these two lines what exactly the problem is. I assume that either your PostgreSQL manifest has an error or your S3 configuration is bad in some way. Keep in mind: when the Zalando Operator recognises that a WAL-G configuration exists, it will try to bootstrap the secondary nodes from a backup made to the S3 bucket. If it can’t access it, the bootstrap will not work. In my understanding, it should then fall back to bootstrapping the secondary directly from the primary, however. There should be more log information from the secondary cluster apart from these two lines.
Have you double-checked that a backup is made from the primary when you applied the WAL-G config? If so, can you share the path to the backup here? Also, can you share the cluster manifest?
Kind regards
Philip
Hi Philip!
Thank you for your fast reply. We were able to solve the problem. There were some minor issues in our manifest file, but we were able to fix them.
Sorry for the late update.
Best,
Lima
Can you just please, please always include details about what the issue was?
Hi, does this preserve the database manifest in case the cluster has been recreated using ArgoCD? Will my database be restored if I apply the same manifests using ArgoCD?
I have set it up using your excellent tutorial. Still, if I destroy the cluster using kubectl and redeploy, it is restored empty, making it impossible for me to use this operator to have persistence outside the k8s cluster.
Hello Rafael,
sorry for the late reply. I’m not sure if I understand your problem 100%. I understood that you manage the PostgreSQL CRD with ArgoCD. In this case, you either have to tell Argo to ignore the clone section within the CRD (see here), or you have to specify the clone section within the ArgoCD-managed app repository.
I hope this helps you further. If this is not your problem, then please contact me again.
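For illustration, ignoring the clone section could look like this in the ArgoCD Application spec (the Application name is a placeholder, source / destination omitted):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: postgres-demo-cluster
spec:
  # source / destination / project omitted
  ignoreDifferences:
    - group: acid.zalan.do
      kind: postgresql
      jsonPointers:
        - /spec/clone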
Kind regards
Philip
Hi,
I have a working database, and I want to configure backups as detailed in this post. My question is:
Once the operator is reconfigured, will the backup start with the already installed cluster, or do I have to create a new one for the configuration to take effect?
Regards
Hi,
as soon as you have configured the backup parameters, the Operator will restart your database pods and apply the environment configuration for the backups. From then on, a first full backup will be done soon after the first start, with recurring backups following as defined by your backup schedule. WAL archives will be stored automatically as well when you do WAL-E / WAL-G backups.
Kind regards
Philip
I am using the latest Zalando PGO. When I ssh into the operator pods or the postgres pods, I see no /run/etc/wal-e.d/env as you mentioned here. I have configured
Hello,
this can probably be only one of two things:
1. Check that you have specified the WAL-G relevant environment variables (e.g. AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) somewhere. This could be on the postgresql CRD itself, or via a pod-config Configmap (like I’ve done it in the article).
2. If you’ve done it via the Configmap, ensure that you’ve specified the pod_environment_configmap: "postgres-operator/pod-config" parameter within the postgres-operator Configmap. Also ensure that the pod-config Configmap exists in the namespace you’ve specified in the postgres-operator Configmap (in my example it’s all in the postgres-operator namespace). I would assume that you’ve created the pod-config Configmap in the default namespace by mistake.
Kind Regards
Philip
Appreciate the response. I have mimicked your config here.
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
  namespace: backend
data:
  pod_environment_configmap: "postgres-operator/pod-config"
  aws_region: ap-south-1
  kube_iam_role: postgres-pod-role
  wal_s3_bucket:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-config
  namespace: backend
data:
  WAL_S3_BUCKET:
  WAL_BUCKET_SCOPE_PREFIX: ""
  WAL_BUCKET_SCOPE_SUFFIX: ""
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
  BACKUP_SCHEDULE: "11 11 * * *"
  # Access key
  AWS_ACCESS_KEY_ID:
  # Secret access key
  AWS_SECRET_ACCESS_KEY:
  AWS_S3_FORCE_PATH_STYLE: "true" # needed for MinIO
  AWS_ENDPOINT:
  AWS_REGION:
  WALG_DISABLE_S3_SSE: "true"
  BACKUP_NUM_TO_RETAIN: "5"
  CLONE_USE_WALG_RESTORE: "true"
I have installed the operator in my namespace "backend" with the Helm chart.
Hello,
I’ve redacted the secret information from your comment, just FYI. The error lies within the postgres-operator Configmap. You specified in the parameter pod_environment_configmap that the operator should merge configuration parameters from a Configmap named pod-config in the namespace postgres-operator. So it searches for the Configmap in the wrong place. Fixing that to pod_environment_configmap: "backend/pod-config" should do the trick.
Also, I’m not 100% sure that deploying the Operator via Helm won’t use the operatorConfiguration CRD instead of the Configmap. But you will find that out. Check for a custom resource of type OperatorConfiguration.
Philip
Even after that modification, the postgres pod hippo-0 logs say nothing about any backup. Cron is set to BACKUP_SCHEDULE: "1 * * * *" but no go. Nothing in my AWS S3. I even changed the
AWS_ENDPOINT: s3://arn:aws:s3:ap-south-1:xxxxxxx:accesspoint/yyyyyyy
2023-05-10 19:23:35,866 INFO: no action. I am (hippo-0), the leader with the lock
2023-05-10 19:23:45,862 INFO: no action. I am (hippo-0), the leader with the lock
2023-05-10 19:23:52.598 UTC [32] LOG {ticks: 0, maint: 0, retry: 0}
2023-05-10 19:23:55,863 INFO: no action. I am (hippo-0), the leader with the lock
The operator, too, has no logs about any backup.
time="2023-05-10T19:15:27Z" level=info msg="found pod: \"backend/hippo-0\" (uid: \"57859770-7506-48a7-8d22-86de9b2a30dd\")" cluster-name=backend/hippo pkg=cluster worker=1
time="2023-05-10T19:15:27Z" level=info msg="found PVC: \"backend/pgdata-hippo-0\" (uid: \"be749801-269b-401e-8781-a91c1900dc18\")" cluster-name=backend/hippo pkg=cluster worker=1
time="2023-05-10T19:15:27Z" level=debug msg="syncing connection pooler (master, replica) from (false, nil) to (false, nil)" cluster-name=backend/hippo pkg=cluster worker=1
time="2023-05-10T19:15:27Z" level=info msg="cluster has been created" cluster-name=backend/hippo pkg=controller worker=1
Nothing about a backup.
You have restarted the Zalando Operator pod after changing its configuration, right? Also, have you checked what I’ve written regarding the OperatorConfiguration CRD?
Yes, the CRD does exist. I did restart the PGO every time I made changes. Since it is for testing, I have even made the buckets public, but no go.
When the CRD exists, the configmap is ignored. Place the pod_environment_configmap parameter there. It has nothing to do with your S3. The only problem is that the operator does not inject the env vars properly into the Spilo pod.
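In the OperatorConfiguration custom resource, the parameter lives under the kubernetes section; a minimal sketch (resource name and namespace assumed from your Helm deployment):

apiVersion: acid.zalan.do/v1
kind: OperatorConfiguration
metadata:
  name: postgres-operator
  namespace: backend
configuration:
  kubernetes:
    pod_environment_configmap: backend/pod-config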
Or, even better, redeploy the Zalando Operator as described here and not via Helm:
https://thedatabaseme.de/2022/03/13/keep-the-elefants-in-line-deploy-zalando-operator-on-your-kubernetes-cluster
Thank you. Your post made configuring the backup for Postgres Operator much easier.
In my case, I used Rook Ceph RGW, which provides a Secret and a ConfigMap upon bucket creation. Therefore, I added custom pod environment variables [via the Postgres cluster manifest][1] instead since it allows referencing existing Secrets/ConfigMaps.
[1]: https://postgres-operator.readthedocs.io/en/stable/administrator/#via-postgres-cluster-manifest
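For reference, referencing an existing Secret from the cluster manifest could look roughly like this (the Secret name is hypothetical; see the linked documentation for details):

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: postgres-demo-cluster
spec:
  # remaining spec as shown earlier in the article
  env:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: backup-bucket-credentials # hypothetical Secret, e.g. created by Rook Ceph RGW
          key: AWS_ACCESS_KEY_ID
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: backup-bucket-credentials
          key: AWS_SECRET_ACCESS_KEY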
Glad you liked it.
Philip
Hello,
I’m having trouble with the setup.
ERROR: 2023/10/09 02:58:50.512386 failed to upload 'spilo/postgres-demo-cluster/wal/15/basebackups_005/base_000000010000000000000003/tar_partitions/part_001.tar.lz4' to bucket 'postgresql': InvalidArgument: S3 API Requests must be made to API port.
status code: 400, request id: , host id:
ERROR: 2023/10/09 02:58:50.512391 upload: could not upload 'base_000000010000000000000003/tar_partitions/part_001.tar.lz4'
ERROR: 2023/10/09 02:58:50.512393 failed to upload 'spilo/postgres-demo-cluster/wal/15/basebackups_005/base_000000010000000000000003/tar_partitions/part_001.tar.lz4' to bucket 'postgresql': InvalidArgument: S3 API Requests must be made to API port.
status code: 400, request id: , host id:
ERROR: 2023/10/09 02:58:50.512394 Unable to continue the backup process because of the loss of a part 1.
My config
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-config
data:
  WAL_S3_BUCKET: postgresql
  WAL_BUCKET_SCOPE_PREFIX: ""
  WAL_BUCKET_SCOPE_SUFFIX: ""
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
  BACKUP_SCHEDULE: '00 10 * * *'
  AWS_ACCESS_KEY_ID: xxx
  AWS_SECRET_ACCESS_KEY: xxx
  AWS_S3_FORCE_PATH_STYLE: "true"
  AWS_ENDPOINT: http://172.30.31.12:31794
  AWS_REGION: de01
  WALG_DISABLE_S3_SSE: "true"
  BACKUP_NUM_TO_RETAIN: "5"
  CLONE_USE_WALG_RESTORE: "true"
Hello Lee,
most likely, the issue is a misconfigured AWS_ENDPOINT. The setting AWS_ENDPOINT: http://172.30.31.12:31794 seems wrong. This looks like a cluster IP combined with a nodePort, which will not work. If the service is of type nodePort, then you need to insert the IP of one of your Kubernetes nodes here. I don’t know your S3 setup, but you don’t need to use a public-facing service / IP when the Postgres cluster runs on the same Kubernetes cluster. In that case, you can specify the service DNS plus the MinIO API port, e.g. http://minio-svc.minio.svc.cluster.local:9000. With that, you can also use a service of type clusterIP.
Long story short, the error tells you that the container can’t communicate with the S3 bucket endpoint with the given setting.
Kind regards
Philip
172.30.31.12 is my node IP
Ok, can you access the bucket with an S3 client from another system (e.g. your client machine)?
Kind regards
MinIO was not working properly. It works fine now, thank you.
Hi, really great documentation and explanation! I appreciate your work a lot. Thank you! It took me a while to understand everything, but I managed it.
Hi, @TheDatabaseMe!
Thank you so much for your work and this manual. It really helped me understand how to use WAL-G for making backups.
Thanks for the lovely docu!
One problem I have is that I cannot find the WAL-G logs anywhere; they’re neither in the cluster pods’ logs nor in the operator logs (which makes sense).
Is there a way to find these somewhere on the pods?
Another question I have: why wouldn’t you simply set those ENVs directly in the postgresql manifest?
Kind regards
Hello Jamal,
thanks for your feedback.
The logs of WAL-G are a bit complicated. The actual full backup is wrapped by a Python script in the Spilo image. This script writes to STDOUT / STDERR, so the logs from WAL-G are only visible in case of an error, AFAIR. In case of success, you will see a backup success message from the Python script in the pod logs, no more information. This is especially tricky since the script runs nearly endlessly without handing back an error if a WAL-G backup can’t be written. So it’s not really talkative.
As for the WAL archives, those are triggered by Postgres itself. If there is an error within the archive command, I would guess the logs for it can be found in the actual Postgres logs under pg_log.
I had no special reason not to enter the env variables within the postgresql manifest. In more customized / bigger environments, it will be a mixture of entering some ENVs in the Configmap and some in the manifest, since some information would be redundant in every manifest, while other settings are not customizable enough in the Configmap. As for my lab environment, I like to have everything in one configuration / Configmap, so I don’t need to remember to add env variables to enable a backup. Backups are enabled directly when I create a cluster.
Hope this helped
Philip