Backup to S3 – Configure Zalando Postgres Operator Backup with WAL-G

Base backups of a Postgres database, together with the archived WAL, enable point-in-time recovery (PITR). Backing up databases that have been deployed with the Zalando Postgres Operator, however, is a different beast. Here I show you how it’s done with a self-hosted, S3-compatible MinIO Object Storage.

The so-called Spilo images that are deployed by the Zalando Postgres Operator can do backups and WAL archiving to S3 (compatible) storage using WAL-E or its successor WAL-G. The problem for me is that the documentation of the WAL-G integration on the Zalando side is not very good. You have to put quite a few puzzle pieces together to get it running. Since I went through this process, I thought it might come in handy for you too.

I assume that you have the following prerequisites set up:

  • Zalando Postgres Operator is deployed on your Kubernetes cluster. If you are looking for a tutorial on that topic, you can find it here.
  • You have a working S3 Object Storage up and running. If you want to set up your own self-hosted Object Storage using MinIO, you can find the instructions for it here.

Attention: I use my own MinIO Object Storage in this example. Some settings may differ for you if you use a different provider. I will try to mention MinIO-specific parameters inline, but your mileage may vary.

If you want to quickstart the whole process, you can head directly to my GitHub repository and apply the kustomization overlay under overlays/enabled-backup.

The overlay first patches the central ConfigMap of the Zalando Operator to add the parameter pod_environment_configmap. This parameter references a Pod-specific ConfigMap which holds the environment variables that configure WAL-G to use our S3 Object Storage.

configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
data:
  pod_environment_configmap: "postgres-operator/pod-config"

If your Pod specific Configmap resides in a Namespace other than default, you need to specify the Namespace before the name of the Configmap (postgres-operator in my case).
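
To make the namespace/name format a bit more tangible, here is a small shell sketch that splits such a reference and prints the kubectl command you could use to check that the referenced ConfigMap really exists. The helper names are made up for this illustration, and the fallback to the default namespace is an assumption taken from the sentence above.

```shell
#!/bin/sh
# Split a pod_environment_configmap value into "<namespace> <name>".
# Assumption: a value without a namespace refers to the "default" namespace.
split_configmap_ref() {
  ref=$1
  case "$ref" in
    */*) printf '%s %s\n' "${ref%%/*}" "${ref#*/}" ;;
    *)   printf 'default %s\n' "$ref" ;;
  esac
}

# Print (not run) the kubectl command that checks the referenced ConfigMap.
check_command() {
  # Word splitting is intended here: $1 becomes the namespace, $2 the name.
  set -- $(split_configmap_ref "$1")
  printf 'kubectl get configmap %s --namespace %s\n' "$2" "$1"
}

check_command "postgres-operator/pod-config"
# prints: kubectl get configmap pod-config --namespace postgres-operator
```

If the printed command returns a NotFound error on your cluster, the Operator most likely cannot find the ConfigMap either.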

In my case, the Pod-specific ConfigMap holds the following configuration.

pod-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-config
data:
  WAL_S3_BUCKET: postgresql
  WAL_BUCKET_SCOPE_PREFIX: ""
  WAL_BUCKET_SCOPE_SUFFIX: ""
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
  BACKUP_SCHEDULE: '00 10 * * *'
  AWS_ACCESS_KEY_ID: postgresql
  AWS_SECRET_ACCESS_KEY: supersecret
  AWS_S3_FORCE_PATH_STYLE: "true" # needed for MinIO
  AWS_ENDPOINT: http://minio.home.lab:9000 # Endpoint URL to your S3 Endpoint; MinIO in this example
  AWS_REGION: de01
  WALG_DISABLE_S3_SSE: "true"
  BACKUP_NUM_TO_RETAIN: "5"
  CLONE_USE_WALG_RESTORE: "true"

Let’s have a look at the parameters.

AWS_ENDPOINT: Specifies the S3 Object Storage API endpoint. In my case it’s a MinIO service in my homelab, listening on port 9000.
AWS_REGION: The region of your S3 storage.
AWS_S3_FORCE_PATH_STYLE: Controls whether S3 path style or virtual hosted style is used, which in the end determines what the bucket URL looks like. With my MinIO setup, I can’t use virtual hosted style. Path style looks like this: http://minio.home.lab:9000/<WAL_S3_BUCKET>, whereas virtual hosted style would look like this: http://<WAL_S3_BUCKET>.minio.home.lab:9000.
WAL_S3_BUCKET: The name of the S3 bucket where your Postgres backups should be stored. You have to create the bucket before you can use it.
AWS_ACCESS_KEY_ID: You can think of this as your “username” to access the Object Storage. Both access key and secret access key have to be created on your Object Storage.
AWS_SECRET_ACCESS_KEY: This is the secret “password” to your Object Storage. Both access key and secret access key have to be created on your Object Storage.
USE_WALG_BACKUP and USE_WALG_RESTORE: By default, the Spilo images that the Zalando Operator deploys use WAL-E instead of its successor WAL-G. WAL-G is much faster than WAL-E, but they advise against using WAL-G for production workloads for now. For my homelab, I’m pretty sure WAL-G will suit me well.
WALG_DISABLE_S3_SSE: Disables S3 server-side encryption (SSE) of the backups. Encryption is not possible in my MinIO setup, so I turn it off.
WAL_BUCKET_SCOPE_PREFIX and WAL_BUCKET_SCOPE_SUFFIX: By default, all backups are stored under a path which includes the cluster UID and the namespace of the cluster. I decided to blank both parameters because they cause trouble when trying to restore the cluster later.
BACKUP_NUM_TO_RETAIN: Controls the number of WAL-G backups that should be kept on your S3 storage. At least it should, but at the time of writing there are known issues regarding this topic. I advise you to use a lifecycle policy on your S3 storage in order to get some kind of housekeeping running.
CLONE_USE_WALG_RESTORE: Makes the Operator use WAL-G instead of WAL-E when cloning your Postgres cluster. Restore and cloning is a topic of its own, so I will not go into details here.
BACKUP_SCHEDULE: The schedule, in cron format, on which a base backup should be made. In my case, every day at 10am.
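
To illustrate what AWS_S3_FORCE_PATH_STYLE changes, the following shell sketch builds the bucket URL for both addressing styles from the endpoint and bucket name used in this article. The function bucket_url is a made-up helper for illustration only. WAL-G does this internally, so you never have to construct these URLs yourself.

```shell
#!/bin/sh
# Build the bucket URL for a given addressing style.
# usage: bucket_url <path|virtual> <endpoint> <bucket>
bucket_url() {
  style=$1 endpoint=$2 bucket=$3
  scheme=${endpoint%%://*}     # e.g. "http"
  hostport=${endpoint#*://}    # e.g. "minio.home.lab:9000"
  case "$style" in
    path)    printf '%s://%s/%s\n' "$scheme" "$hostport" "$bucket" ;;
    virtual) printf '%s://%s.%s\n' "$scheme" "$bucket" "$hostport" ;;
    *)       echo "unknown style: $style" >&2; return 1 ;;
  esac
}

bucket_url path    http://minio.home.lab:9000 postgresql
# prints: http://minio.home.lab:9000/postgresql
bucket_url virtual http://minio.home.lab:9000 postgresql
# prints: http://postgresql.minio.home.lab:9000
```

Virtual hosted style needs a DNS name per bucket, which is why path style is usually the easier choice for a self-hosted setup like mine.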

The kustomization overlay looks like this (see the GitHub repository mentioned above):

kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: postgres-operator

resources:
  - pod-config.yaml
  - ../../base
  - ../../ui

patchesStrategicMerge:
  - configmap.yaml

You may now apply the kustomization overlay like this:

kubectl apply -k overlays/enabled-backup/

This will deploy the Zalando Operator with all adjustments needed for backups. If you had already deployed the Operator, it will patch the postgres-operator ConfigMap. In that case you need to restart the Operator Pod so that the Pod-specific ConfigMap gets applied.
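
One way to restart the Operator Pod is a rollout restart of its Deployment. The sketch below wraps that in a small function. Note that the Deployment name postgres-operator and the namespace postgres-operator are assumptions based on my setup, and restart_operator is just a made-up helper name, so adjust them to your installation.

```shell
#!/bin/sh
# Restart the Operator so that it re-reads its ConfigMap, then wait for it.
# Assumption: Deployment "postgres-operator" in namespace "postgres-operator".
# usage: restart_operator [namespace] [deployment]
restart_operator() {
  namespace=${1:-postgres-operator}
  deployment=${2:-postgres-operator}
  kubectl --namespace "$namespace" rollout restart "deployment/$deployment" &&
    kubectl --namespace "$namespace" rollout status "deployment/$deployment"
}

# Usage:
#   restart_operator                              # defaults as above
#   restart_operator backend postgres-operator    # Operator in another namespace
```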

Now let’s quickly build a Postgres cluster with the Operator. You can find some examples in my GitHub repository under the manifests folder.

demo-cluster.yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: postgres-demo-cluster
  namespace: postgres
spec:
  teamId: "postgres"
  volume:
    size: 2Gi
  numberOfInstances: 2
  users:
    demouser:  # database owner
    - superuser
    - createdb
  databases:
    demo: demouser  # dbname: owner
  preparedDatabases:
    demo: {}
  postgresql:
    version: "14"

Let’s deploy it:

kubectl apply -f demo-cluster.yaml

After a short time, your cluster should start, and directly after it has been created, a backup will be made (regardless of the schedule you’ve configured). Exec into the container. If your configuration works, you should see all environment variables set as an envdir under /run/etc/wal-e.d/env.

> /run/etc/wal-e.d/env# ls -ltr
total 68
-rw-r--r-- 1 postgres root  1 Mar 26 15:59 WALG_UPLOAD_CONCURRENCY
-rw-r--r-- 1 postgres root 50 Mar 26 15:59 WALG_S3_PREFIX
-rw-r--r-- 1 postgres root  1 Mar 26 15:59 WALG_DOWNLOAD_CONCURRENCY
-rw-r--r-- 1 postgres root  4 Mar 26 15:59 WALG_DISABLE_S3_SSE
-rw-r--r-- 1 postgres root 50 Mar 26 15:59 WALE_S3_PREFIX
-rw-r--r-- 1 postgres root 31 Mar 26 15:59 WALE_S3_ENDPOINT
-rw-r--r-- 1 postgres root  4 Mar 26 15:59 USE_WALG_RESTORE
-rw-r--r-- 1 postgres root  4 Mar 26 15:59 USE_WALG_BACKUP
-rw-r--r-- 1 postgres root  8 Mar 26 15:59 AWS_SECRET_ACCESS_KEY
-rw-r--r-- 1 postgres root  4 Mar 26 15:59 AWS_S3_FORCE_PATH_STYLE
-rw-r--r-- 1 postgres root  4 Mar 26 15:59 AWS_REGION
-rw-r--r-- 1 postgres root 26 Mar 26 15:59 AWS_ENDPOINT
-rw-r--r-- 1 postgres root 10 Mar 26 15:59 AWS_ACCESS_KEY_ID
-rw-r--r-- 1 postgres root  6 Mar 26 15:59 WALE_LOG_DESTINATION
-rw-r--r-- 1 root     root 25 Mar 26 15:59 TMPDIR
-rw-r--r-- 1 postgres root  4 Mar 26 15:59 PGPORT
-rw-r--r-- 1 postgres root  1 Mar 26 15:59 BACKUP_NUM_TO_RETAIN
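
If you prefer to run such checks from your workstation instead of an interactive shell, a thin wrapper around kubectl exec does the job. This is only a sketch: spilo_exec is a made-up helper name, and the Pod name postgres-demo-cluster-0 is an assumption (the first Pod of the demo cluster above), so check the real name with kubectl get pods first.

```shell
#!/bin/sh
# Run a command inside a Spilo container.
# usage: spilo_exec <pod> <namespace> <command> [args...]
spilo_exec() {
  pod=$1; namespace=$2; shift 2
  kubectl exec --namespace "$namespace" "$pod" -- "$@"
}

# Usage (Pod name is an assumption, see above):
#   spilo_exec postgres-demo-cluster-0 postgres ls -ltr /run/etc/wal-e.d/env
#   spilo_exec postgres-demo-cluster-0 postgres cat /run/etc/wal-e.d/env/WALG_S3_PREFIX
```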

You can check for created backups using the WAL-G client. Issue the following command as root user from within your Postgres container.

> envdir "/run/etc/wal-e.d/env" wal-g backup-list
name                          modified             wal_segment_backup_start
base_000000010000000000000004 2022-03-26T16:00:13Z 000000010000000000000004

If you are not able to see any backups or you get error messages when using the backup-list command, check for more information in the Pod logs of your Postgres cluster Pods.

In my case, I can use the MinIO CLI client to view the objects stored on my S3 storage.

> mc ls minio/postgresql/spilo/postgres-demo-cluster/wal/14/basebackups_005/
[2022-03-26 17:00:13 CET] 174KiB STANDARD base_000000010000000000000004_backup_stop_sentinel.json
[2022-03-26 17:19:06 CET]     0B base_000000010000000000000004/

But WAL-G does more than just base backups. Because we set up WAL-G, the Zalando Operator also configured our PostgreSQL database to use WAL-G for WAL archiving. You can find the WAL archives under the following path on your S3 storage:

> mc ls minio/postgresql/spilo/postgres-demo-cluster/wal/14/wal_005/
[2022-03-26 16:59:19 CET] 4.2MiB STANDARD 000000010000000000000001.lz4
[2022-03-26 16:59:27 CET]   255B STANDARD 000000010000000000000002.00000028.backup.lz4
[2022-03-26 16:59:27 CET]  65KiB STANDARD 000000010000000000000002.lz4
[2022-03-26 17:00:11 CET] 184KiB STANDARD 000000010000000000000003.lz4
[2022-03-26 17:00:12 CET]   266B STANDARD 000000010000000000000004.00000028.backup.lz4
[2022-03-26 17:00:12 CET]  65KiB STANDARD 000000010000000000000004.lz4
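
Both listings sit under one common prefix. The sketch below shows how that prefix appears to be composed from the bucket, the two WAL_BUCKET_SCOPE_* parameters, the cluster name and the Postgres major version. The layout is an assumption that I inferred from the mc output above, and wal_s3_prefix is a made-up helper, not something Spilo ships.

```shell
#!/bin/sh
# Compose the S3 prefix that holds base backups and WAL segments.
# Assumed layout: s3://<bucket>/spilo/<prefix><cluster><suffix>/wal/<pg major>
wal_s3_prefix() {
  bucket=$1 scope_prefix=$2 cluster=$3 scope_suffix=$4 pg_version=$5
  printf 's3://%s/spilo/%s%s%s/wal/%s\n' \
    "$bucket" "$scope_prefix" "$cluster" "$scope_suffix" "$pg_version"
}

# Values from this article: both scope parameters are blank.
prefix=$(wal_s3_prefix postgresql "" postgres-demo-cluster "" 14)
echo "$prefix"                    # s3://postgresql/spilo/postgres-demo-cluster/wal/14
echo "$prefix/basebackups_005/"   # where the base backups live
echo "$prefix/wal_005/"           # where the archived WAL segments live
echo "${#prefix}"                 # 50
```

The resulting string is 50 characters long, which matches the 50-byte WALG_S3_PREFIX file in the envdir listing further up.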

So our backup is working. As mentioned above, restore and cloning are a different topic, and I promise to write about it soon.

Philip

28 thoughts on “Backup to S3 – Configure Zalando Postgres Operator Backup with WAL-G”

  1. Can’t wait for the restore and cloning part! Any documents that are already available on how to do it?

    1. Thanks for the feedback. I’m working on it. Besides the “official” documentation in the Zalando GitHub repository here, I’m only aware of this one here. But neither is really usable in my opinion.

      Keep checking back for updates on this topic.
      Philip

  2. Hi!

    I’m having trouble with setting up WAL-G, and the documentation does not help me whatsoever.
    When I start the Postgres cluster with the env variables required for WAL-G, my standby cluster does not start and gives me this error in the logs:
    2022-08-03 09:02:01,501 INFO: Lock owner: paas-test-db-cluster-0; I am test-test-db-cluster-1
    2022-08-03 09:02:01,501 INFO: bootstrap from leader 'test-test-db-cluster-0' in progress

    If I start the cluster without WAL-G backup it works fine.
    Did you ever encounter this error while setting up WAL-G?

    1. Hi Lima,

      it’s hard to tell from these two lines what exactly the problem is. I assume that either your PostgreSQL manifest has an error or your S3 configuration is bad in some way. Keep in mind that when the Zalando Operator recognises an existing WAL-G configuration, it will try to bootstrap the secondary nodes from a backup stored in the S3 bucket. If it can’t access the bucket, that bootstrap will not work. In my understanding, it should then try to bootstrap the secondary directly from the primary, however. There should be more log information from the secondary cluster apart from these two lines.

      Have you double-checked that a backup is made from the primary when you applied the WAL-G config? If so, can you share the path to the backup with me here? Can you also share the cluster manifest with me?

      Kind regards
      Philip

      1. Hi Philip!

        Thank you for your fast reply. We were able to solve the problem. There were some minor issues in our manifest file, but we were able to fix them.
        Sorry for the late update.

        Best,
        Lima

  3. Hi, does this also cover the database when the cluster has been recreated using ArgoCD? Will my database be restored if I apply the same manifests using ArgoCD?

    I have set up using your excellent tutorial. Still, if I destroy the cluster using kubectl and redeploy, it’s being restored empty, making it impossible for me to use this operator to have persistence outside the k8s cluster.

    1. Hello Rafael,

      sorry for the late reply. I’m not sure I understand your problem 100%. As I understand it, you manage the PostgreSQL CRD with ArgoCD. In this case, you either have to tell Argo to ignore the clone section within the CRD (see here) or you have to specify the clone section within the ArgoCD-managed App repository.

      I hope this helps you further. If this is not your problem, then please contact me again.

      Kind regards
      Philip

  4. Hi,
    I have a working database and I want to configure backup as detailed by this post. My question is:
    Once the operator is reconfigured, will the backup start with the already installed cluster, or do I have to create a new one for the configuration to take effect?
    Regards

    1. Hi,
      as soon as you have configured the backup parameters, the Operator will restart your database pods and apply the environment configuration for the backups. A first full backup will be made soon after the first start, and from then on backups recur as defined by your backup schedule. WAL archives will be stored automatically as well when you do WAL-E / WAL-G backups.

      Kind regards
      Philip

    1. Hello,
      this can probably only be one of two things.

      1. Check that you have specified the WAL-G-relevant environment variables (e.g. AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) somewhere. This could be on the postgresql CRD itself, or via a pod-config configmap (like I’ve done in the article).
      2. If you’ve done it via the configmap, ensure that you’ve specified the pod_environment_configmap: "postgres-operator/pod-config" parameter within the postgres-operator configmap. Also ensure that the pod-config configmap exists in the namespace you’ve specified in the postgres-operator configmap (in my example it’s all in the postgres-operator namespace). I would assume that you’ve created the pod-config configmap in the default namespace by mistake.

      Kind Regards
      Philip

  5. Appreciate the response. I have mimicked your config here.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: postgres-operator
      namespace: backend
    data:
      pod_environment_configmap: "postgres-operator/pod-config"
      aws_region: ap-south-1
      kube_iam_role: postgres-pod-role
      wal_s3_bucket:
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: pod-config
      namespace: backend
    data:
      WAL_S3_BUCKET:
      WAL_BUCKET_SCOPE_PREFIX: ""
      WAL_BUCKET_SCOPE_SUFFIX: ""
      USE_WALG_BACKUP: "true"
      USE_WALG_RESTORE: "true"
      BACKUP_SCHEDULE: "11 11 * * *"
      # Access key
      AWS_ACCESS_KEY_ID:
      # Secret access key
      AWS_SECRET_ACCESS_KEY:
      AWS_S3_FORCE_PATH_STYLE: "true" # needed for MinIO
      AWS_ENDPOINT:
      AWS_REGION:
      WALG_DISABLE_S3_SSE: "true"
      BACKUP_NUM_TO_RETAIN: "5"
      CLONE_USE_WALG_RESTORE: "true"

    I have installed the operator in my namespace ‘backend’ with helm chart.

    1. Hello,

      I’ve redacted the secret information from your comment, just FYI. The error lies within the postgres-operator configmap. In the parameter pod_environment_configmap you specified that the operator should merge configuration parameters from a configmap named pod-config in the namespace postgres-operator. So it searches for the configmap in the wrong place. Changing it to pod_environment_configmap: "backend/pod-config" should do the trick.

      Also, I’m not 100% sure whether an Operator deployed via Helm uses the OperatorConfiguration CRD instead of the configmap. But you will find that out. Check for a custom resource of type OperatorConfiguration.

      Philip

      1. even after that modification, the postgres pod hippo-0 logs say nothing about any backup, cron set to BACKUP_SCHEDULE: "1 * * * *" but no go. nothing in my aws s3. I even changed the
        AWS_ENDPOINT: s3://arn:aws:s3:ap-south-1:xxxxxxx:accesspoint/yyyyyyy
        2023-05-10 19:23:35,866 INFO: no action. I am (hippo-0), the leader with the lock
        2023-05-10 19:23:45,862 INFO: no action. I am (hippo-0), the leader with the lock
        2023-05-10 19:23:52.598 UTC [32] LOG {ticks: 0, maint: 0, retry: 0}
        2023-05-10 19:23:55,863 INFO: no action. I am (hippo-0), the leader with the lock

        the operator too has no logs about any backup.
        time="2023-05-10T19:15:27Z" level=info msg="found pod: \"backend/hippo-0\" (uid: \"57859770-7506-48a7-8d22-86de9b2a30dd\")" cluster-name=backend/hippo pkg=cluster worker=1
        time="2023-05-10T19:15:27Z" level=info msg="found PVC: \"backend/pgdata-hippo-0\" (uid: \"be749801-269b-401e-8781-a91c1900dc18\")" cluster-name=backend/hippo pkg=cluster worker=1
        time="2023-05-10T19:15:27Z" level=debug msg="syncing connection pooler (master, replica) from (false, nil) to (false, nil)" cluster-name=backend/hippo pkg=cluster worker=1
        time="2023-05-10T19:15:27Z" level=info msg="cluster has been created" cluster-name=backend/hippo pkg=controller worker=1

        nothing about backup

        1. You have restarted the Zalando Operator pod after changing its configuration, right? Also, have you checked what I’ve written regarding the OperatorConfiguration CRD?

            1. When the CRD exists, the configmap is ignored. Place the pod_environment_configmap parameter there. It has nothing to do with your S3. The only problem is that the operator does not inject the env vars properly into the Spilo pod.

    2. hello
      I’m having trouble with setting

      ERROR: 2023/10/09 02:58:50.512386 failed to upload 'spilo/postgres-demo-cluster/wal/15/basebackups_005/base_000000010000000000000003/tar_partitions/part_001.tar.lz4' to bucket 'postgresql': InvalidArgument: S3 API Requests must be made to API port.
      status code: 400, request id: , host id:
      ERROR: 2023/10/09 02:58:50.512391 upload: could not upload 'base_000000010000000000000003/tar_partitions/part_001.tar.lz4'
      ERROR: 2023/10/09 02:58:50.512393 failed to upload 'spilo/postgres-demo-cluster/wal/15/basebackups_005/base_000000010000000000000003/tar_partitions/part_001.tar.lz4' to bucket 'postgresql': InvalidArgument: S3 API Requests must be made to API port.
      status code: 400, request id: , host id:
      ERROR: 2023/10/09 02:58:50.512394 Unable to continue the backup process because of the loss of a part 1.

      My config
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: pod-config
      data:
        WAL_S3_BUCKET: postgresql
        WAL_BUCKET_SCOPE_PREFIX: ""
        WAL_BUCKET_SCOPE_SUFFIX: ""
        USE_WALG_BACKUP: "true"
        USE_WALG_RESTORE: "true"
        BACKUP_SCHEDULE: '00 10 * * *'
        AWS_ACCESS_KEY_ID: xxx
        AWS_SECRET_ACCESS_KEY: xxx
        AWS_S3_FORCE_PATH_STYLE: "true"
        AWS_ENDPOINT: http://172.30.31.12:31794
        AWS_REGION: de01
        WALG_DISABLE_S3_SSE: "true"
        BACKUP_NUM_TO_RETAIN: "5"
        CLONE_USE_WALG_RESTORE: "true"

      1. Hello Lee,

        most likely the issue is a misconfigured AWS_ENDPOINT. The setting AWS_ENDPOINT: http://172.30.31.12:31794 looks wrong. This seems to be a cluster IP combined with a NodePort, which will not work. If the service is of type NodePort, you need to use the IP of one of your Kubernetes nodes here. I don’t know your S3 setup, but you don’t need a public-facing service / IP when the Postgres cluster runs on the same Kubernetes cluster. In that case you can specify the service DNS name plus the MinIO API port, e.g. http://minio-svc.minio.svc.cluster.local:9000. With that, you can also use a service of type ClusterIP.

        Long story short, the error tells you that the container can’t communicate with the S3 bucket endpoint using the given setting.

        Kind regards
        Philip

    3. Hi, really great documentation and explanation! I appreciate your work a lot. Thank you! It took me a while to understand everything but I managed it.
