Cluster Backup

Learn how to back up and restore the state of a production cluster.

April 4, 2024

This page walks you through the main steps required to manually back up and restore the state of an HPE ML Data Management cluster in production. The details of each step might vary depending on your infrastructure and your cloud provider or on-premises setup; refer to your provider’s documentation.

Overview #

HPE ML Data Management state is stored in two main places:

  • An object store holding HPE ML Data Management’s data.
  • A PostgreSQL instance made up of one or two databases: pachyderm holding HPE ML Data Management’s metadata and dex holding authentication data.

Backing up an HPE ML Data Management cluster involves snapshotting both the object store and the PostgreSQL database(s), in a consistent state, at a given point in time.

Restoring it involves re-populating the database(s) and the object store using those backups, then recreating an HPE ML Data Management cluster.

ℹ️
  • Make sure that you have a bucket for backup use, separate from the object store used by your cluster.
  • Depending on the reasons behind your cluster recovery, you might choose between using an existing or a new instance of PostgreSQL and/or the object store.

Manual Back Up Of An HPE ML Data Management Cluster #

Before any manual backup:

  • Make sure to retain a copy of the Helm values used to deploy your cluster.
  • Then, suspend any state-mutating operations.
ℹ️
  • Backups incur downtime until operations are resumed.
  • Operational best practices include notifying HPE ML Data Management users of the outage and providing an estimated time when downtime will cease.
  • Downtime duration is a function of the size of the data to be backed up and the networks involved; testing before going into production and monitoring backup times on an ongoing basis can help you make accurate predictions.

Suspend Operations #

  • Pause any external automated processes that ingress data into HPE ML Data Management input repos, or queue/divert them, as they will fail to connect to the cluster while the backup occurs.

  • Suspend all mutation of state by scaling pachd and the worker pods down:

⚠️

Before starting, make sure that your context points to the server you want to pause by running pachctl config get active-context.

To pause HPE ML Data Management:

  • If you are an Enterprise user: Run the pachctl enterprise pause command.

  • Alternatively, you can use kubectl:

    Before starting, make sure that kubectl points to the right cluster.

    Run kubectl config get-contexts to list all available clusters and contexts (the current context is marked with a *), then kubectl config use-context <your-context-name> to set the proper active context.

    kubectl scale deployment pachd --replicas 0 
    kubectl scale rc --replicas 0 -l suite=pachyderm,component=worker

    Note that it takes some time for the scale-down to take effect.

    Run the watch command to monitor the pachd and worker pods as they terminate:

    watch -n 5 kubectl get pods

Back Up The Databases And The Object Store #

This step is specific to your database and object store hosting.

  • If your PostgreSQL instance is solely dedicated to HPE ML Data Management, you can use PostgreSQL’s tools, like pg_dumpall, to dump your entire PostgreSQL state.

    Alternatively, you can use targeted pg_dump commands to dump the pachyderm and dex databases individually, or use your cloud provider’s backup product. In either case, make sure to connect over TLS.

⚠️

A production setting of HPE ML Data Management implies that you are running a managed PostgreSQL instance.

📖

For on-premises Kubernetes deployments, check the vendor documentation for your on-premises PostgreSQL for details on backing up and restoring your databases.
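As a sketch of the targeted approach, assuming a libpq-reachable instance (the host, user, and file names below are hypothetical), each database could be dumped individually with pg_dump over TLS:

```shell
# Hypothetical connection details -- substitute your instance's values.
PGHOST=postgres.example.internal
PGUSER=pachyderm
STAMP=$(date +%Y%m%d-%H%M%S)   # timestamp for the dump files

# "sslmode=require" enforces TLS; --format=custom produces archives
# that pg_restore can later restore selectively.
pg_dump "host=$PGHOST user=$PGUSER dbname=pachyderm sslmode=require" \
  --format=custom --file="pachyderm-$STAMP.dump"
pg_dump "host=$PGHOST user=$PGUSER dbname=dex sslmode=require" \
  --format=custom --file="dex-$STAMP.dump"
```

Keeping the two dumps timestamped together makes it easier to restore a consistent pair later.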

  • To back up the object store, you can either download all objects or use the object store provider’s backup method.
    The latter is preferable since it will typically not incur egress costs.
📖

For on-premises Kubernetes deployments, check the vendor documentation for your on-premises object store for details on backing up and restoring a bucket.
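For example, if both buckets live in Amazon S3 (the bucket names below are hypothetical), a bucket-to-bucket sync copies objects server-side, avoiding a local download:

```shell
# Hypothetical bucket names -- substitute your own.
SOURCE_BUCKET=s3://my-pachyderm-data     # bucket used by the cluster
BACKUP_BUCKET=s3://my-pachyderm-backup   # separate backup bucket

# S3-to-S3 sync copies objects server-side rather than through your machine.
aws s3 sync "$SOURCE_BUCKET" "$BACKUP_BUCKET"
```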

Resume Operations #

Once your backup is completed, resume your normal operations by scaling pachd back up. It will take care of restoring the worker pods:

  • Enterprise users: run pachctl enterprise unpause.

  • Alternatively, if you used kubectl:

    kubectl scale deployment pachd --replicas 1

Restore HPE ML Data Management #

There are two primary use cases for restoring a cluster:

  1. Your data have been corrupted, preventing your cluster from functioning correctly. You want the same version of HPE ML Data Management re-installed on the latest uncorrupted data set.
  2. You have upgraded a cluster and are encountering problems. You decide to uninstall the current version and restore the latest backup of a previous version of HPE ML Data Management.

Depending on your scenario, pick all or a subset of the following steps:

  • Populate new pachyderm and dex (if required) databases on your PostgreSQL instance
  • Populate a new bucket or use the backed-up object store (note that, in that case, it will no longer be a backup)
  • Create a new empty Kubernetes cluster and give it access to your databases and bucket
  • Deploy HPE ML Data Management into your new cluster

Restore The Databases And Objects #

  • Restore PostgreSQL backups into your new databases using the appropriate method (this is most straightforward when using a cloud provider).
  • Copy the objects from the backed-up object store to your new bucket or re-use your backup.
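As a sketch, assuming the backups are pg_dump custom-format archives and an S3 backup bucket (all hosts, file names, and bucket names below are hypothetical), the restore might look like:

```shell
# Hypothetical connection details and file names -- substitute your own.
PGHOST=new-postgres.example.internal
PGUSER=pachyderm

# Restore each database from its custom-format pg_dump archive;
# --clean --if-exists drops existing objects before recreating them.
pg_restore --dbname="host=$PGHOST user=$PGUSER dbname=pachyderm sslmode=require" \
  --clean --if-exists pachyderm-backup.dump
pg_restore --dbname="host=$PGHOST user=$PGUSER dbname=dex sslmode=require" \
  --clean --if-exists dex-backup.dump

# Copy the backed-up objects into the new bucket (server-side for S3).
aws s3 sync s3://my-pachyderm-backup s3://my-new-pachyderm-data
```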

Deploy HPE ML Data Management Into The New Cluster #

Finally, update the copy of your original Helm values to point HPE ML Data Management to the new databases and the new object store, then use Helm to install HPE ML Data Management into the new cluster.
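Assuming the public Pachyderm Helm repository and a hypothetical values-restored.yaml (your original values file, edited to point at the new databases and bucket), the deployment might look like:

```shell
# Add the Pachyderm chart repository and deploy with the updated values.
helm repo add pachyderm https://helm.pachyderm.com
helm repo update
helm install pachyderm pachyderm/pachyderm -f values-restored.yaml
```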

Connect ‘pachctl’ To Your Restored Cluster #

Finally, point pachctl at the restored cluster and check that it is up and running.
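One way to do this, sketched with a hypothetical context name, is to import your active kubectl context into pachctl and verify connectivity:

```shell
# Create a pachctl context from the active kubectl context
# ("restored" is an arbitrary name) and make it active.
pachctl config import-kube restored --overwrite
pachctl config set active-context restored

# Verify that pachd responds.
pachctl version
```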