Restore Platform System Data and Storage

You can perform a system restore (controllers, workers, including or excluding storage nodes) of a StarlingX cluster from a previous system backup and bring it back to the operational state it was in when the backup was taken.

There are two restore modes: optimized restore and legacy restore. Optimized restore must be used on AIO-SX systems, and legacy restore must be used on systems that are not AIO-SX.

About this task

The Kubernetes configuration is restored, and pods that are started from repositories accessible from the internet or from external repositories start immediately. StarlingX-specific applications must be re-applied once a storage cluster is configured.

Everything is restored as it was when the backup was created, except for optional data that was not included in the backup.

See Back Up System Data for more details on the backup.

Warning

The system backup file can only be used to restore the system from which the backup was made. You cannot use this backup file to restore the system to different hardware.

To restore the backup, use the same version of the boot image (ISO) and patches that were installed at the time of the backup.

The StarlingX restore supports the following optional modes:

  • To keep the Ceph cluster data intact (false, the default option), pass the following parameter in the extra arguments to the Ansible Restore playbook command:

    wipe_ceph_osds=false
    
  • To wipe the Ceph cluster entirely (true), use the following parameter. Use this option when the Ceph cluster needs to be recreated, or when the Ceph partition was previously wiped, for example during a fresh install between backup and restore or during a reinstall:

    wipe_ceph_osds=true
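
As a rough illustration, the extra arguments can be assembled in a small shell helper before they are passed to ansible-playbook. This is a sketch only: build_restore_extra_vars is a hypothetical helper name, and the directory and file name passed to it are placeholders. The parameter names initial_backup_dir, backup_filename and wipe_ceph_osds are the ones used in this document.

```shell
# Hypothetical helper: build the extra-vars string for the restore playbook.
# Usage: build_restore_extra_vars <backup_dir> <backup_filename> [true|false]
build_restore_extra_vars() {
    local backup_dir="$1" backup_file="$2" wipe="${3:-false}"

    # wipe_ceph_osds accepts only true or false; false is the default.
    case "$wipe" in
        true|false) ;;
        *) echo "wipe_ceph_osds must be true or false" >&2; return 1 ;;
    esac

    printf 'initial_backup_dir=%s backup_filename=%s wipe_ceph_osds=%s\n' \
        "$backup_dir" "$backup_file" "$wipe"
}
```

The printed string is meant to be placed inside the -e "..." argument of the restore playbook command, together with any other arguments that command needs, such as ansible_become_pass.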
    

Restoring a StarlingX cluster from a backup file is done by re-installing the ISO on controller-0, applying updates (patches), running the Ansible Restore Playbook, unlocking controller-0, and then powering on and unlocking the remaining hosts, one host at a time: first the controllers, then the storage hosts (ONLY if required), and lastly the compute (worker) hosts. Finally, run the system restore-complete command.
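
The host ordering described above can be sketched as a small filter over the system host-list table, whose layout appears later in this procedure. restore_order is a hypothetical helper, not a StarlingX command. It only prints a suggested order (controllers, then storage hosts, then workers) and does not power on or unlock anything.

```shell
# Hypothetical helper: read `system host-list` table output on stdin and
# print hostnames in restore order: controllers, storage hosts, workers.
# Hosts keep their listed order within each group.
restore_order() {
    awk -F'|' '
        # Data rows have a numeric id in the first table column.
        NF >= 4 && $2 ~ /^ *[0-9]+ *$/ {
            name = $3; pers = $4
            gsub(/ /, "", name); gsub(/ /, "", pers)
            if (pers == "controller")   c = c name "\n"
            else if (pers == "storage") s = s name "\n"
            else                        w = w name "\n"
        }
        END { printf "%s%s%s", c, s, w }'
}

# Example (on a live system): system host-list | restore_order
```

Storage hosts appear in the output whenever they are listed; whether they are actually restored still depends on the wipe_ceph_osds setting.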

Prerequisites

Before you start the restore procedure you must ensure the following conditions are in place:

  • All cluster hosts must be prepared for network boot and then powered down. You can prepare a host for network boot.

    Note

    If you are restoring system data only, do not lock, power off or prepare the storage hosts to be reinstalled.

  • The backup file is accessible locally if the restore is done by running the Ansible Restore playbook locally on the controller, or accessible remotely if the restore is done by running the Ansible Restore playbook remotely.

  • If backup encryption was enabled during the platform backup, you have the encryption passphrase needed to decrypt the archive file.

  • You have the original StarlingX ISO installation image available on a USB flash drive. It is mandatory that you use the exact same version of the software used during the original installation, otherwise the restore procedure will fail.

  • The restore procedure requires all hosts but controller-0 to boot over the internal management network using the PXE protocol. Ideally, the old boot images are no longer present, so that the hosts boot from the network when powered on. If this is not the case, you must configure each host manually for network boot immediately after powering it on.

  • If you are restoring a Distributed Cloud subcloud, first ensure that it is in an unmanaged state on the Central Cloud (SystemController) by using the following commands:

    $ source /etc/platform/openrc
    ~(keystone_admin)]$ dcmanager subcloud unmanage <subcloud-name>
    

    where <subcloud-name> is the name of the subcloud to be unmanaged.

    For more information, see:

Procedure

  1. Power down all hosts.

    If you have a storage host and want to retain Ceph data, then power down all the nodes except the storage hosts; the cluster has to be functional during a restore operation.

    Caution

    Do not use wipedisk before a restore operation. This will lead to data loss on your Ceph cluster. It is safe to use wipedisk during an initial installation, while reinstalling a host, or during an upgrade.

  2. Install the StarlingX ISO software on controller-0 from the USB flash drive.

    You can now log in using the host’s console.

  3. Log in to the console as user sysadmin with password sysadmin.

  4. Install network connectivity required for the subcloud.

  5. Ensure that the system is at the same patch level as it was when the backup was taken. You must manually reinstall any previous patches and reboot the system (for reboot-required patches) to prevent restore failures due to mismatched patch levels.

    Note

    Restoring patches is done automatically if exclude_sw_deployments=false is used during backup and restore. By default, exclude_sw_deployments will be true for AIO-SX and subclouds and false for AIO-DX systems.

    It is recommended to restore subclouds only when there is an existing backup taken at the same patch level as the system controller.

    For steps on how to install patches manually, see Deployment of Patch Releases before Bootstrap and Commissioning of initial Installation.

    After the reboot, you can verify that the updates were applied.

    Note

    On systems that are not AIO-SX, you can skip this step if exclude_sw_deployments=true is not used.

  6. Ensure that the backup files are available on the controller. Run both Ansible Restore playbooks, restore_platform.yml and restore_user_images.yml. For more information on restoring the backup file, see Run Restore Playbook Locally on the Controller, and Run Ansible Restore Playbook Remotely.

    Note

    The backup files contain the system data and updates.

    The restore operation will pull missing images from the upstream registries.

  7. Restore the local registry using the file restore_user_images.yml.

    Example:

    ~(keystone_admin)]$ ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_user_images.yml -e "initial_backup_dir=/home/sysadmin backup_filename=localhost_user_images_backup_2023_07_15_21_24_22.tgz ansible_become_pass=St8rlingXCloud*"
    

    Note

    • This step applies only if the user images backup file was created during the backup operation.

    • The user_images_backup*.tgz file is created during backup only if backup_user_images is true.

    This must be done before unlocking controller-0.
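
Because the user images archive exists only when backup_user_images was true, it can help to look for the file before running the playbook. The following is a minimal sketch: find_user_images_backup is a hypothetical helper, and the default directory /home/sysadmin is taken from the example above. It assumes the timestamp in the file name sorts lexically, as in the example file name.

```shell
# Hypothetical helper: print the name of the newest *user_images_backup*.tgz
# archive in a directory, or return 1 if there is none.
# Usage: find_user_images_backup [backup_dir]
find_user_images_backup() {
    local dir="${1:-/home/sysadmin}" f latest=""

    # Glob results are sorted, so the last match carries the latest timestamp.
    for f in "$dir"/*user_images_backup*.tgz; do
        if [ -e "$f" ]; then
            latest="$f"
        fi
    done

    [ -n "$latest" ] || return 1
    basename "$latest"
}
```

If the helper returns a non-zero status, no user images archive was found and this step does not apply. Otherwise the printed name can be used as the backup_filename value.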

  8. Unlock controller-0.

    ~(keystone_admin)]$ system host-unlock controller-0
    

    After you unlock controller-0, storage nodes become available and Ceph becomes operational.

  9. If the system is a Distributed Cloud system controller, restore the dc-vault using the restore_dc_vault.yml playbook. Perform this step after unlocking controller-0:

    $ ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_dc_vault.yml -e "initial_backup_dir=/home/sysadmin backup_filename=localhost_dc_vault_backup_2020_07_15_21_24_22.tgz ansible_become_pass=St0rlingX*"
    

    Note

    The dc-vault backup archive is created by the backup.yml playbook.

  10. Authenticate the system as Keystone user admin.

    Source the admin user environment as follows:

    $ source /etc/platform/openrc
    
  11. Wait for the apps to transition from the ‘restore-requested’ state to the ‘applying’ state, and then from ‘applying’ to ‘applied’.

    If an app transitions from ‘applying’ back to ‘restore-requested’, ensure there is network access and access to the docker registry.

    The process is repeated once per minute until all apps have transitioned to ‘applied’.
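
One way to watch this transition is to filter the application table for apps that are not yet applied. The sketch below assumes that system application-list prints a table whose first column is the application name and whose header row contains a column named status. That layout is an assumption and should be checked against the actual output. pending_apps is a hypothetical helper name.

```shell
# Hypothetical helper: read an application table on stdin and print the names
# of applications whose status is not "applied".
# Exits with status 2 if no "status" column is found in the header row.
pending_apps() {
    awk -F'|' '
        function trim(s) { gsub(/^ +| +$/, "", s); return s }
        NF < 3 { next }                  # skip +---+ border lines
        !header_seen {                   # first table row is the header
            header_seen = 1
            for (i = 2; i < NF; i++) if (trim($i) == "status") col = i
            next
        }
        col && trim($col) != "applied" { print trim($2) }
        END { if (!col) exit 2 }'
}

# Example polling loop (run manually on a live system):
#   while [ -n "$(system application-list | pending_apps)" ]; do sleep 60; done
```

The 60-second sleep mirrors the once-per-minute retry described above, but any interval works because the loop only observes the state.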

  12. If you have a Duplex system, restore the controller-1 host.

    1. List the current state of the hosts.

      ~(keystone_admin)]$ system host-list
      +----+-------------+------------+---------------+-----------+------------+
      | id | hostname    | personality| administrative|operational|availability|
      +----+-------------+------------+---------------+-----------+------------+
      | 1  | controller-0| controller | unlocked      |enabled    |available   |
      | 2  | controller-1| controller | locked        |disabled   |offline     |
      | 3  | storage-0   | storage    | locked        |disabled   |offline     |
      | 4  | storage-1   | storage    | locked        |disabled   |offline     |
      | 5  | compute-0   | worker     | locked        |disabled   |offline     |
      | 6  | compute-1   | worker     | locked        |disabled   |offline     |
      +----+-------------+------------+---------------+-----------+------------+
      
    2. Power on the host.

      Ensure that the host boots from the network, and not from any disk image that may be present.

      The software is installed on the host, and then the host is rebooted. Wait for the host to be reported as locked, disabled, and online.

    3. Unlock controller-1.

      ~(keystone_admin)]$ system host-unlock controller-1
      +-----------------+--------------------------------------+
      | Property        | Value                                |
      +-----------------+--------------------------------------+
      | action          | none                                 |
      | administrative  | locked                               |
      | availability    | online                               |
      | ...             | ...                                  |
      | uuid            | 5fc4904a-d7f0-42f0-991d-0c00b4b74ed0 |
      +-----------------+--------------------------------------+
      
    4. Verify the state of the hosts.

      ~(keystone_admin)]$ system host-list
      +----+-------------+------------+---------------+-----------+------------+
      | id | hostname    | personality| administrative|operational|availability|
      +----+-------------+------------+---------------+-----------+------------+
      | 1  | controller-0| controller | unlocked      |enabled    |available   |
      | 2  | controller-1| controller | unlocked      |enabled    |available   |
      | 3  | storage-0   | storage    | locked        |disabled   |offline     |
      | 4  | storage-1   | storage    | locked        |disabled   |offline     |
      | 5  | compute-0   | worker     | locked        |disabled   |offline     |
      | 6  | compute-1   | worker     | locked        |disabled   |offline     |
      +----+-------------+------------+---------------+-----------+------------+
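
The wait for a host to be reported as locked, disabled, and online can be sketched as a check on the system host-list table shown above. host_state is a hypothetical helper that only parses the table text; it does not change any host.

```shell
# Hypothetical helper: read `system host-list` table output on stdin and print
# "administrative operational availability" for the named host.
# Returns 1 if the host is not listed.
# Usage: system host-list | host_state <hostname>
host_state() {
    awk -F'|' -v host="$1" '
        NF >= 7 {
            name = $3; gsub(/ /, "", name)
            if (name == host) {
                for (i = 5; i <= 7; i++) gsub(/ /, "", $i)
                print $5, $6, $7
                found = 1
            }
        }
        END { exit (found ? 0 : 1) }'
}

# Example: unlock controller-1 only once it reports as installed and online.
#   [ "$(system host-list | host_state controller-1)" = "locked disabled online" ] \
#       && system host-unlock controller-1
```

The same check with "unlocked enabled available" can be used to confirm the final state of a host after it is unlocked.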
      
  13. Restore storage configuration. If wipe_ceph_osds is set to true, follow the same procedure used to restore controller-1, beginning with host storage-0 and proceeding in sequence.

    Note

    This step should be performed ONLY if you are restoring storage hosts.

    1. For storage hosts, there are two options:

      With the controller software installed and updated to the same level that was in effect when the backup was performed, you can perform the restore procedure without interruption.

      For a Standard with Controller Storage system, whether to install or reinstall depends on the wipe_ceph_osds configuration:

      1. If wipe_ceph_osds is set to true, reinstall the storage hosts.

      2. If wipe_ceph_osds is set to false (default option), do not reinstall the storage hosts.

        Caution

        Do not reinstall or power off the storage hosts if you want to keep previous Ceph cluster data. A reinstall of storage hosts will lead to data loss.

    2. Ensure that the Ceph cluster is healthy. Verify that the three Ceph monitors (controller-0, controller-1, storage-0) are running in quorum.

      ~(keystone_admin)]$ ceph -s
        cluster:
          id:     3361e4ef-b0b3-4f94-97c6-b384f416768d
          health: HEALTH_OK
      
        services:
          mon: 3 daemons, quorum controller-0,controller-1,storage-0
          mgr: controller-0(active), standbys: controller-1
          osd: 10 osds: 10 up, 10 in
      
        data:
          pools:   5 pools, 600 pgs
          objects: 636  objects, 2.7 GiB
          usage:   6.5 GiB used, 2.7 TiB / 2.7 TiB avail
          pgs:     600 active+clean
      
        io:
          client:   85 B/s rd, 336 KiB/s wr, 0 op/s rd, 67 op/s wr
      

      Caution

      Do not proceed until the Ceph cluster is healthy and the message HEALTH_OK appears.

      If the message HEALTH_WARN appears, wait a few minutes and then try again. If the warning condition persists, consult the public documentation for troubleshooting Ceph monitors (for example http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/).
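
The health check above can be scripted by extracting the health line from the ceph -s output. This is a minimal sketch; ceph_health is a hypothetical helper that reads the same text layout as the example output in this step.

```shell
# Hypothetical helper: read `ceph -s` output on stdin and print the value of
# the "health:" line (for example HEALTH_OK or HEALTH_WARN).
# Returns 1 if no health line is present.
ceph_health() {
    awk '$1 == "health:" { print $2; found = 1; exit }
         END { exit (found ? 0 : 1) }'
}

# Example: proceed only when the cluster reports HEALTH_OK.
#   [ "$(ceph -s | ceph_health)" = "HEALTH_OK" ] || echo "Ceph not healthy yet"
```

The helper reports only the overall health value; the monitor quorum still needs to be checked in the services section of the output.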

  14. Restore the compute (worker) hosts, one at a time.

    Restore the compute (worker) hosts following the same procedure used to restore controller-1.

  15. Allow the Calico and CoreDNS pods to be recovered by Kubernetes. They should all be in the ‘N/N Running’ state.

    The state of the pods when the restore operation is complete is as follows:

    ~(keystone_admin)]$ kubectl get pods -n kube-system | grep -e calico -e coredns
    calico-kube-controllers-5cd4695574-d7zwt  1/1     Running
    calico-node-6km72                         1/1     Running
    calico-node-c7xnd                         1/1     Running
    coredns-6d64d47ff4-99nhq                  1/1     Running
    coredns-6d64d47ff4-nhh95                  1/1     Running
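
The ‘N/N Running’ condition can be checked with a small filter over the pod listing. The sketch below assumes the three-column layout shown above (name, ready count, status). pods_not_ready is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get pods` lines on stdin and print the
# names of pods that are not in the "N/N Running" state.
pods_not_ready() {
    awk '$1 == "NAME" { next }           # skip the header row if present
         NF >= 3 {
             split($2, ready, "/")       # "1/1" -> ready[1], ready[2]
             if ($3 != "Running" || ready[1] != ready[2]) print $1
         }'
}

# Example:
#   kubectl get pods -n kube-system | grep -e calico -e coredns | pods_not_ready
```

An empty result means every listed pod is running with all of its containers ready.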
    
  16. If wipe_ceph_osds is set to true and all the system hosts are in an unlocked/enabled/available state, do the following:

    1. Remove and reapply platform-integ-apps. This step re-creates the default Ceph pools, which were deleted:

      $ system application-remove platform-integ-apps
      $ system application-apply platform-integ-apps
      
    2. Completely delete and reapply all the applications that have persistent volumes (OpenStack or custom apps). For example, for OpenStack, run the following commands:

      $ system application-remove stx-openstack
      $ system application-delete stx-openstack
      $ system application-upload stx-openstack-20.12-0.tgz
      $ system application-apply stx-openstack
  17. If Hashicorp Vault data was also backed up, it can be restored using the vault_restore playbook. For more information on Hashicorp Vault restore see:

  18. Run the system restore-complete command.

    ~(keystone_admin)]$ system restore-complete
    
  19. The 750.006 alarms disappear one at a time as the apps are automatically applied.

Postrequisites

  • Passwords for local user accounts must be restored manually since they are not included as part of the backup and restore procedures.

  • After restoring a Distributed Cloud subcloud, you need to bring it back to the managed state on the Central Cloud (SystemController), by using the following commands:

    $ source /etc/platform/openrc
    ~(keystone_admin)]$ dcmanager subcloud manage <subcloud-name>
    

    where <subcloud-name> is the name of the subcloud to be managed.