Restore a Subcloud/Group of Subclouds from Backup Data Using DCManager CLI

A subcloud can be restored from backup data previously stored centrally on the System Controller or locally on the subcloud, using the dcmanager command line interface (CLI). The subcloud install data must be available for this operation to proceed. If remote installation is required, the subcloud must support Redfish Virtual Media Service (version 1.2 or higher).

About this task

The dcmanager subcloud-backup restore command restores a subcloud or a group of subclouds. By default, the restore uses the subcloud backup data stored centrally on the System Controller. The command accepts the following parameters and options:

--with-install

Perform remote installation of the subcloud before the restore procedure runs. The subcloud must support Redfish Virtual Media Service (version 1.2 or higher) to use this option.

--local-only

Use the local backup archive (default local storage /opt/platform-backup/backups/<release-version>). If not specified, the subcloud backup archive on the central System Controller is used.

--registry-images

Restore saved container images to registry.local as part of the restore procedure (local storage only).

--subcloud <subcloud-name>

The subcloud to restore.

--group <subcloud-group-name>

The group of subclouds to restore.

--release <release>

Software release with which to install, bootstrap, and/or deploy the subcloud. If not specified, the current software release of the System Controller is used.

--auto

Triggers the restore process locally on the subcloud using only BMC connectivity. It is used to perform a fully autonomous restore when the OAM network connectivity between the System Controller and subcloud is not available before the restore. It can be combined with --local-only and/or --with-install. The subcloud must have software and the container images prestaged before auto restore.

--factory

Performs factory default restore via remote installation and automatic restore using prestaged software, container images, and factory backup from the subcloud’s platform-backup partition. This option overrides the --auto, --with-install, --registry-images, and --local-only options.

--restore-values <yaml-file>

The YAML file containing the customization parameters, such as:

  • wipe_ceph_osds: false: To keep the Ceph cluster data intact.

  • wipe_ceph_osds: true: To wipe the Ceph cluster entirely.

  • on_box_data: true: To indicate that the backup data file is under /opt/platform-backup directory on the local machine.

  • ipmi_sel_event_monitoring: true|false: To enable or disable IPMI SEL (System Event Log) event monitoring for auto and factory-restore operations (defaults to true). Set to false for BMC systems that do not support custom SEL events (example: certain variants of OpenBMC).

  • bootstrap_address: List of subclouds and their corresponding bootstrap addresses for connectivity.

    bootstrap_address:
      <subcloud_name1>: <subcloud_bootstrap_address1>
      <subcloud_name2>: <subcloud_bootstrap_address2>
    

    Note

    The bootstrap_address key is only necessary for the restore of manually installed subclouds. For subclouds installed via Redfish, the bootstrap_address is already available in the install values.

  • add_docker_prefix: For more details, see Install a Subcloud Using Redfish Platform Management Service.

See Run Restore Playbook Locally on the Controller for the list of configurable system restore parameters.
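
As an illustration, a restore values file that combines some of the parameters above could be generated as follows. This is a minimal sketch: the helper name write_restore_values, the file name restore_overrides.yaml, the subcloud name subcloud1, and the address 192.0.2.10 are placeholders chosen for the example and must be replaced with real values.

```shell
# Write a restore values file to the given path.
# The keys come from the parameter list above; the subcloud name and
# bootstrap address are placeholders (assumptions for this example).
write_restore_values() {
    # $1: output path of the YAML file
    cat > "$1" <<'EOF'
wipe_ceph_osds: false
bootstrap_address:
  subcloud1: 192.0.2.10
EOF
}

# Generate the file in a scratch location and display it.
write_restore_values "${TMPDIR:-/tmp}/restore_overrides.yaml"
cat "${TMPDIR:-/tmp}/restore_overrides.yaml"
```

The resulting file is then passed to the restore command with --restore-values.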

--sysadmin-password <sysadmin-password>

If not specified, the user is prompted for the password. Use this option only for automation. For interactive use, omit the option and enter the password at the prompt, so that the sysadmin password does not end up in log files. For factory-restore operations with ipmi_sel_event_monitoring set to false, provide the factory default sysadmin password.

Either --subcloud or --group must be specified.

When the --registry-images option is applied, the entire registry filesystem, which contains both platform and user container images, is restored.

After the subcloud has been re-installed with the desired release version, the backup archive for that release is transferred to the subcloud for the restore operation by default. If the --local-only option is specified, the local backup archive for the release is used instead.

A custom location for a backup file that resides on the subcloud can be specified with the --restore-values option by setting initial_backup_dir and backup_filename in the restore values YAML file. Ensure that this custom backup file is not corrupted and is compatible with the software release the subcloud was installed with.

To restore images from a custom backup file on the subcloud using the --restore-values <yaml-file> option, registry_backup_filename must also be set in the restore values YAML file.

Restore a Single Subcloud

Prerequisites

  • The System Controller is healthy and ready to accept dcmanager related commands.

  • The subcloud is unmanaged and is in a valid state for restore operation (i.e. not being restored, installed, bootstrapped, deployed or rehomed).

  • The subcloud install data is available.

  • The backup file or files exist and are compatible with the software release the subcloud is being restored to.

Note

When a vCSR application running on the subcloud provides the network routing, the standard restore operation (without --auto or --factory) is not supported. For this use case, see the Auto-Restore a Subcloud or Factory Restore a Subcloud sections below.

Procedure

To restore a subcloud, including remote installation, from system backup data in central storage:

~(keystone_admin)]$ dcmanager subcloud-backup restore --subcloud <subcloud-name> --with-install --sysadmin-password <sysadmin-password>

To restore a pre-installed subcloud from system backup data in central storage:

~(keystone_admin)]$ dcmanager subcloud-backup restore --subcloud <subcloud-name> --sysadmin-password <sysadmin-password>

To restore a subcloud, including remote installation, from system backup data stored in default local storage:

~(keystone_admin)]$ dcmanager subcloud-backup restore --subcloud <subcloud-name> --with-install --local-only --sysadmin-password <sysadmin-password>

To restore a subcloud from system backup and images backup data in default local storage (add --with-install to include remote installation):

~(keystone_admin)]$ dcmanager subcloud-backup restore --subcloud <subcloud-name> --local-only --registry-images --sysadmin-password <sysadmin-password>

Note

The --registry-images option can only be used with --local-only option.
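
The option rules described on this page can be checked before the command is submitted, for example in an automation script. The following is a sketch only: check_restore_options is a hypothetical helper that inspects an argument list and never calls dcmanager. It encodes two rules from this page (a target given by --subcloud or --group is required, and --registry-images needs --local-only) and skips the second rule when --factory is present, since that option overrides the others.

```shell
# Hypothetical pre-flight check for restore options (does not run dcmanager).
check_restore_options() {
    cro_target=0; cro_local=0; cro_images=0; cro_factory=0
    for cro_arg in "$@"; do
        case "$cro_arg" in
            --subcloud|--group) cro_target=1 ;;
            --local-only)       cro_local=1 ;;
            --registry-images)  cro_images=1 ;;
            --factory)          cro_factory=1 ;;
        esac
    done
    if [ "$cro_target" -eq 0 ]; then
        echo "error: --subcloud or --group is required" >&2
        return 1
    fi
    # --factory overrides --registry-images and --local-only, so skip the check.
    if [ "$cro_factory" -eq 0 ] && [ "$cro_images" -eq 1 ] && [ "$cro_local" -eq 0 ]; then
        echo "error: --registry-images requires --local-only" >&2
        return 1
    fi
    return 0
}

# The first call passes; the second is rejected because --local-only is missing.
check_restore_options --subcloud subcloud1 --local-only --registry-images && echo "accepted"
check_restore_options --subcloud subcloud1 --registry-images || echo "rejected"
```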

To restore a pre-installed subcloud from system and images backup data stored at a custom location on the subcloud:

  1. Create a YAML file, for example restore_overrides.yaml, with the following content:

    initial_backup_dir: /home/sysadmin/mybackup_dir
    backup_filename: test_platform_backup.tgz
    registry_backup_filename: test_images_backup.tgz
    
  2. Then, run the command:

    ~(keystone_admin)]$ dcmanager subcloud-backup restore --subcloud subcloud1 --local-only --registry-images --restore-values restore_overrides.yaml --sysadmin-password <sysadmin-password>
    

Sample response to a single subcloud restore:

+-----------------------------+----------------------------+
| Field                       | Value                      |
+-----------------------------+----------------------------+
| id                          | 8                          |
| name                        | subcloud1                  |
| description                 | None                       |
| location                    | None                       |
| software_version            | 22.12                      |
| management                  | unmanaged                  |
| availability                | offline                    |
| deploy_status               | restore-failed             |
| management_subnet           | fd01:15::0/64              |
| management_start_ip         | fd01:15::2                 |
| management_end_ip           | fd01:15::11                |
| management_gateway_ip       | fd01:15::1                 |
| systemcontroller_gateway_ip | fd01:1::1                  |
| group_id                    | 2                          |
| created_at                  | 2022-12-12 05:29:23.807243 |
| updated_at                  | 2022-12-13 16:39:48.904037 |
| backup_status               | unknown                    |
| backup_datetime             | None                       |
+-----------------------------+----------------------------+

Note

The subcloud can be restored or restored again while in a failed deploy state such as:

  • data-migration-failed (upgrade failure)

  • restore-failed (previous restore attempt failed due to a bad backup file)

  • rehome-failed

To view the progress of a subcloud restore, use the dcmanager subcloud show or dcmanager subcloud list command:

~(keystone_admin)]$ dcmanager subcloud show subcloud2

+-----------------------------+----------------------------+
| Field                       | Value                      |
+-----------------------------+----------------------------+
| id                          | 9                          |
| name                        | subcloud2                  |
| description                 | None                       |
| location                    | None                       |
| software_version            | 22.12                      |
| management                  | unmanaged                  |
| availability                | offline                    |
| deploy_status               | restoring                  |
| management_subnet           | fd01:176::0/64             |
| management_start_ip         | fd01:176::2                |
| management_end_ip           | fd01:176::11               |
| management_gateway_ip       | fd01:176::1                |
| systemcontroller_gateway_ip | fd01:1::1                  |
| group_id                    | 2                          |
| created_at                  | 2022-12-13 00:09:44.543494 |
| updated_at                  | 2022-12-13 18:23:20.659138 |
| backup_status               | unknown                    |
| backup_datetime             | None                       |
| dc-cert_sync_status         | unknown                    |
| firmware_sync_status        | unknown                    |
| identity_sync_status        | unknown                    |
| kubernetes_sync_status      | unknown                    |
| kube-rootca_sync_status     | unknown                    |
| load_sync_status            | unknown                    |
| patching_sync_status        | unknown                    |
| platform_sync_status        | unknown                    |
+-----------------------------+----------------------------+

If the restore operation completes successfully, the subcloud comes online and the deploy_status is set to complete.

Continue with the Post Restore procedure.

If the restore operation fails, use the dcmanager subcloud errors command to view the error.
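
For scripted monitoring, the deploy_status field can be pulled out of the dcmanager subcloud show table. The sketch below assumes the table layout shown above. Both function names are hypothetical, and the 60-second polling interval and the set of terminal states (complete, factory-restore-complete, and any state ending in -failed) are assumptions rather than documented behaviour.

```shell
# Extract the deploy_status value from 'dcmanager subcloud show' output on stdin.
deploy_status_from_show() {
    awk -F'|' '$2 ~ /^ *deploy_status *$/ {
        v = $3
        gsub(/^ +| +$/, "", v)   # trim the padding around the value
        print v
    }'
}

# Hypothetical polling loop; interval and terminal states are assumptions.
wait_for_restore() {
    # $1: subcloud name
    while :; do
        wfr_status=$(dcmanager subcloud show "$1" | deploy_status_from_show)
        echo "deploy_status: $wfr_status"
        case "$wfr_status" in
            complete|factory-restore-complete) return 0 ;;
            *-failed) return 1 ;;
            *) sleep 60 ;;
        esac
    done
}

# Demonstration on a row copied from the sample output above.
printf '%s\n' '| deploy_status               | restoring                  |' | deploy_status_from_show
```

Calling wait_for_restore <subcloud-name> on the System Controller would then block until the restore reaches one of the assumed terminal states.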

Auto-Restore a Subcloud

The auto-restore feature enables fully autonomous subcloud restoration using only BMC connectivity, without requiring network communication between the System Controller and subcloud during the restore process. This is particularly useful when the subcloud network is unavailable until the restore completes (example: when a vCSR application provides network routing).

Note

This feature is available only for AIO-SX subclouds.

Note

For non-vCSR systems, use the standard restore operation (without --auto). Standard restore runs remotely through Ansible and provides more detailed progress information than monitoring via BMC SEL events.

Prerequisites

Before performing an auto-restore operation, ensure that the following conditions are met.

  • The System Controller is healthy and ready to accept dcmanager commands.

  • The subcloud is unmanaged and is in a valid state for restore operation.

  • The subcloud is an AIO-SX.

  • The subcloud supports Redfish Virtual Media Service (version 1.2 or higher).

  • BMC access is available throughout the auto-restore operation.

  • For --local-only auto-restore, the specified backup file exists in the subcloud’s platform-backup partition and the subcloud has been prestaged with software (ostree repo) and container images.

  • For central storage auto-restore, the specified backup file exists on the System Controller and the subcloud has been prestaged with software (ostree repo) and container images.

Note

Auto-restore is only supported on subclouds running release r12 or later.

Procedure

You can auto-restore a subcloud using one of the following methods:

  • Auto-restore with remote installation using backup data from central storage.

    ~(keystone_admin)]$ dcmanager subcloud-backup restore --subcloud <subcloud-name> --auto --with-install --sysadmin-password <sysadmin-password>
    

    This command performs the following operations:

    • Creates a miniboot ISO with the backup data embedded

    • Triggers remote installation via Redfish

    • Automatically triggers the restore process locally on the subcloud after the installation completes

    • Monitors the restore progress via IPMI SEL events (if supported)

    Note

    Use auto-restore only when the subcloud network is unavailable until the restore completes (for example, when a vCSR application provides network routing). Otherwise, use the standard restore operation.

  • Auto-restore with remote installation using local backup data.

    ~(keystone_admin)]$ dcmanager subcloud-backup restore --subcloud <subcloud-name> --auto --with-install --local-only --sysadmin-password <sysadmin-password>
    

    This command uses prestaged ostree repository and backup data already present on the subcloud’s platform-backup partition.

  • Auto-restore to a specific release.

    ~(keystone_admin)]$ dcmanager subcloud-backup restore --subcloud <subcloud-name> --auto --local-only --release 26.03 --sysadmin-password <sysadmin-password>
    
  • Auto-restore without remote installation (pre-installed subcloud).

    ~(keystone_admin)]$ dcmanager subcloud-backup restore --subcloud <subcloud-name> --auto --sysadmin-password <sysadmin-password>
    

    When --with-install is not specified, the backup data is transferred to the subcloud via a cloud-init seed ISO, and the restore is triggered automatically upon boot. The cloud-init service must be enabled on the subcloud. This is done automatically if the subcloud is installed with a prestaged ISO.

    Note

    For BMC systems that do not support custom IPMI SEL events (example: OpenBMC), set ipmi_sel_event_monitoring to false in the restore values yaml file.

    ~(keystone_admin)]$ cat restore_overrides.yaml
    ipmi_sel_event_monitoring: false
    
    ~(keystone_admin)]$ dcmanager subcloud-backup restore --subcloud <subcloud-name> --auto --with-install --restore-values restore_overrides.yaml --sysadmin-password <sysadmin-password>
    

    When IPMI SEL event monitoring is disabled, the System Controller waits for the subcloud to become reachable via the OAM network to validate restore completion.

Factory-Restore a Subcloud

The factory-restore feature restores a subcloud to its factory default state using prestaged software and factory backup data stored locally on the subcloud. This is useful for disaster recovery scenarios where the subcloud needs to be completely reset to its initial factory-installed state.

Note

This feature is available only for AIO-SX subclouds.

Prerequisites

Before performing a factory-restore operation, ensure that the following conditions are met:

  • The System Controller is healthy and ready to accept dcmanager commands.

  • The subcloud is unmanaged and is in a valid state for restore operation.

  • The subcloud is an AIO-SX.

  • The subcloud supports Redfish Virtual Media Service (version 1.2 or higher).

  • BMC access is available throughout the factory-restore operation.

  • The factory backup data and prestaged software must exist on the subcloud’s platform-backup partition in the following structure:

    /opt/platform-backup/factory/<sw-version>/
        ostree_repo/
        local_registry_filesystem.tgz (and/or container image tarballs)
        factory_backup.tgz
        miniboot.cfg
    

    These files are created automatically during the factory install process described in Enroll a Factory Installed Non Distributed Standalone System as a Subcloud.
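
The expected layout can be verified on the subcloud before a factory-restore is attempted. The following is a minimal sketch: check_factory_layout is a hypothetical helper, and it treats local_registry_filesystem.tgz as optional because the structure above allows container image tarballs instead.

```shell
# Check that a factory backup directory contains the entries listed above.
# $1: directory such as /opt/platform-backup/factory/<sw-version>
check_factory_layout() {
    cfl_missing=0
    [ -d "$1/ostree_repo" ]        || { echo "missing: ostree_repo/" >&2; cfl_missing=1; }
    [ -f "$1/factory_backup.tgz" ] || { echo "missing: factory_backup.tgz" >&2; cfl_missing=1; }
    [ -f "$1/miniboot.cfg" ]       || { echo "missing: miniboot.cfg" >&2; cfl_missing=1; }
    # Informational only: container image tarballs may be present instead.
    [ -f "$1/local_registry_filesystem.tgz" ] || \
        echo "note: local_registry_filesystem.tgz not found" >&2
    return "$cfl_missing"
}

# Demonstration against a scratch directory that mimics the layout.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/ostree_repo"
: > "$demo_dir/factory_backup.tgz"
: > "$demo_dir/miniboot.cfg"
check_factory_layout "$demo_dir" && echo "layout looks complete"
rm -rf "$demo_dir"
```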

Note

Factory-restore only supports systems that were factory-installed with the r12 release or later.

Procedure

To factory-restore a subcloud, use the following command:

~(keystone_admin)]$ dcmanager subcloud-backup restore --subcloud <subcloud-name> --factory --sysadmin-password <factory-sysadmin-password>

This command performs the following operations:

  • Creates a miniboot ISO configured to use the local kickstart from the platform-backup partition.

  • Triggers remote installation via Redfish using prestaged ostree repository.

  • Automatically triggers the factory-restore process using the factory backup data after installation completes.

  • Monitors the restore progress via IPMI SEL events (if supported).

Note

When using --factory with ipmi_sel_event_monitoring as false, provide the factory default sysadmin password (the password that was set during the original factory installation).

Note

The --factory option overrides the --auto, --with-install, --registry-images, and --local-only options. If any of them is specified together with --factory, it is ignored.

Note

For factory-restore, the --release option must specify the same release version with which the subcloud was factory installed.

Factory-Restore Completion State

Upon successful factory-restore, the subcloud deploy_status will be set to factory-restore-complete.

At this state, the subcloud can be enrolled or re-enrolled using the following commands:

~(keystone_admin)]$ dcmanager subcloud delete <subcloud-name>
~(keystone_admin)]$ dcmanager subcloud add --enroll ...

For more information, see Enroll a Factory Installed Non Distributed Standalone System as a Subcloud.

BMC Systems Without IPMI SEL Event Support

For BMC systems that do not support custom IPMI SEL events (example: OpenBMC), factory-restore completion cannot be detected automatically when a vCSR application is involved, because the subcloud network is not available until after the restore.

In this scenario:

  • The monitoring playbook will time out and set the subcloud to the install-failed state, even if the restore was successful.

  • Manual verification of restore success is required via the BMC serial console: check /var/log/auto-restore.log on the subcloud. If it contains the message System restore-complete executed successfully, the restore completed successfully.

  • If the restore was successful, delete the subcloud on the System Controller and re-enroll it.
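
The manual log check can be reduced to a single grep. This sketch assumes the completion message appears in the log exactly as quoted above; restore_completed is a hypothetical helper name, and the demonstration runs against a scratch file rather than the real /var/log/auto-restore.log.

```shell
# Succeed if the given log contains the completion message quoted above.
# $1: path to the log, for example /var/log/auto-restore.log on the subcloud
restore_completed() {
    grep -q 'System restore-complete executed successfully' "$1"
}

# Demonstration with a scratch file standing in for the real log.
demo_log=$(mktemp)
echo 'System restore-complete executed successfully' > "$demo_log"
if restore_completed "$demo_log"; then
    echo "restore completed"
else
    echo "completion message not found"
fi
rm -f "$demo_log"
```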

For factory-restore of a non-vCSR system, always set ipmi_sel_event_monitoring to false. This enables remote factory-restore execution through Ansible, which provides more detailed progress information than monitoring via BMC SEL events.

~(keystone_admin)]$ cat restore_overrides.yaml
ipmi_sel_event_monitoring: false

~(keystone_admin)]$ dcmanager subcloud-backup restore --subcloud <subcloud-name> --factory --restore-values restore_overrides.yaml --sysadmin-password <factory-sysadmin-password>

Note

The ipmi_sel_event_monitoring restore value is ignored for the standard restore operations (without --auto or --factory), which do not monitor progress through SEL events.

Restore a Group of Subclouds

The subcloud-backup restore operations above can be performed for a group of subclouds simultaneously by replacing the --subcloud option with the --group option.

For example, to restore a group of subclouds with remote installation from their system data in central storage, run the following command:

~(keystone_admin)]$ dcmanager subcloud-backup restore --group <group> --with-install --sysadmin-password <sysadmin-password>

To auto-restore a group of subclouds, run the following command:

~(keystone_admin)]$ dcmanager subcloud-backup restore --group <group> --auto --with-install --sysadmin-password <sysadmin-password>

To factory-restore a group of subclouds, run the following command:

~(keystone_admin)]$ dcmanager subcloud-backup restore --group <group> --factory --sysadmin-password <factory-sysadmin-password>

If none of the subclouds in the group is in a valid state for restore, an error message is displayed. If at least some of the subclouds in the group meet the restore criteria, a list of those subclouds is displayed.

Sample group restore response:

+----+-----------+-------------+----------+------------------+------------+--------------+---------------+-------------------+---------------------+-------------------+-----------------------+-----------------------------+----------+----------------------------+----------------------------+----------------+----------------------------+
| id | name      | description | location | software_version | management | availability | deploy_status | management_subnet | management_start_ip | management_end_ip | management_gateway_ip | systemcontroller_gateway_ip | group_id | created_at                 | updated_at                 | backup_status  | backup_datetime            |
+----+-----------+-------------+----------+------------------+------------+--------------+---------------+-------------------+---------------------+-------------------+-----------------------+-----------------------------+----------+----------------------------+----------------------------+----------------+----------------------------+
|  8 | subcloud6 | None        | None     | 22.12            | unmanaged  | online       | complete      | fd01:15::0/64     | fd01:15::2          | fd01:15::11       | fd01:15::1            | fd01:1::1                   |        2 | 2022-12-13 18:23:03.883068 | 2022-12-13 22:14:39.331199 | complete-local | 2022-12-13 22:04:06.232043 |
|  9 | subcloud8 | None        | None     | 22.12            | unmanaged  | online       | complete      | fd01:176::0/64    | fd01:176::2         | fd01:176::11      | fd01:176::1           | fd01:1::1                   |        2 | 2022-12-13 19:27:55.115604 | 2022-12-13 22:15:09.287665 | complete-local | 2022-12-13 22:05:03.785280 |
+----+-----------+-------------+----------+------------------+------------+--------------+---------------+-------------------+---------------------+-------------------+-----------------------+-----------------------------+----------+----------------------------+----------------------------+----------------+----------------------------+
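
Because the group response is a wide table, it can help to reduce it to the subcloud name and deploy_status columns when checking many subclouds. The sketch below assumes the table layout shown above and looks up column positions from the header row instead of hard-coding them. The function name name_and_deploy_status is hypothetical.

```shell
# Print "name deploy_status" for each data row of a dcmanager table on stdin.
name_and_deploy_status() {
    awk -F'|' '
        function trim(s) { gsub(/^ +| +$/, "", s); return s }
        /^[+]/ || NF < 3 { next }            # skip border and blank lines
        !header_seen {                       # first remaining row is the header
            for (i = 2; i < NF; i++) {
                col = trim($i)
                if (col == "name") name_col = i
                if (col == "deploy_status") status_col = i
            }
            header_seen = 1
            next
        }
        name_col && status_col { print trim($name_col), trim($status_col) }
    '
}

# Demonstration on a shortened table with the same layout as the sample above.
printf '%s\n' \
    '+----+-----------+---------------+' \
    '| id | name      | deploy_status |' \
    '+----+-----------+---------------+' \
    '|  8 | subcloud6 | complete      |' \
    '|  9 | subcloud8 | complete      |' \
    '+----+-----------+---------------+' | name_and_deploy_status
```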

After the group restore is complete, continue with the Post Restore procedure for each subcloud in the group.

Post Factory-Restore

After a successful factory-restore, the subcloud deploy_status will be set to factory-restore-complete.

Delete the subcloud and re-enroll it.

Delete the subcloud using the dcmanager subcloud delete <subcloud-name> command.

Re-add and enroll the subcloud using the dcmanager subcloud add --enroll command.

Post Restore

AIO-SX subcloud

Resume subcloud audit with the command:

~(keystone_admin)]$ dcmanager subcloud manage <subcloud-name>

AIO-DX/Standard subcloud

If the restore playbook completes successfully, the subcloud will be online and deploy_status will be set to complete. Only controller-0 will be in unlocked and online state. To complete the restore operation, follow the procedure available in Restore Platform System Data and Storage for restoring the remaining subcloud nodes.

Resume subcloud audit with the command:

~(keystone_admin)]$ dcmanager subcloud manage <subcloud-name>

Troubleshooting Auto-Restore and Factory-Restore

  • Diagnosing install failures

    If the auto-restore or factory-restore fails during installation (detected via install timeout), perform the following:

    • Enable serial logs by setting rvmc_debug_level in the install values.

      rvmc_debug_level: 1
      

      Note

      Some BMCs require specific cipher suites to allow serial console log capture. Use the bmc_ciphersuite parameter in the install values YAML file to configure the cipher suites.

    • Update the subcloud install values.

      ~(keystone_admin)]$ dcmanager subcloud update --install-values <install-values> --sysadmin-password <sysadmin-password> --bmc-password <bmc-password> <subcloud-name>
      

      Next time the restore is executed, check the logs in /var/log/dcmanager/ansible on the System Controller.

    • Alternatively, access /root/install.log via BMC serial console on the subcloud.

  • Common failure scenarios

    • Missing prestaged ostree data: The install will abort with the message Installation Failed: ERROR: ostree_repo must be prestaged for auto-restore operation in /root/install.log.

    • Missing backup file: Detected via IPMI SEL event (if supported) or during restore execution. Check that the backup file exists at the expected location.

    • Missing container images: Detected via IPMI SEL event (if supported) or during restore execution. Ensure that container images are prestaged or backed up.

  • Viewing restore logs

    For auto-restore and factory-restore operations, restore logs are captured in /var/log/auto-restore.log on the subcloud and can be viewed via BMC serial console.