Configure Distributed Cloud System Controller GEO Redundancy

About this task

You can configure Distributed Cloud System Controller GEO Redundancy using dcmanager CLI commands.

System administrators can follow the procedures below to enable and disable the GEO Redundancy feature.

Note

In this release, the GEO Redundancy feature supports only two distributed clouds in one protection group.

Enable GEO Redundancy

Set up a protection group for two distributed clouds so that the two clouds operate in 1+1 active GEO Redundancy mode.

For example, assume two distributed clouds, site A and site B. When an operation is performed on site A, the local site is site A and the peer site is site B; when an operation is performed on site B, the local site is site B and the peer site is site A.

Prerequisites

The peer system controllers' OAM networks must be reachable from each other, and each system controller must be able to access the subclouds via both the OAM and management networks.
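
As a quick sanity check, you can verify this reachability from each site, for example by querying the peer's Keystone endpoint. The URL below is the site B example used later in this procedure, and the availability of curl on the controller is assumed:

~(keystone_admin)]$ curl -sS http://10.10.10.2:5000

A JSON version document in the response indicates that the peer endpoint is reachable.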

For the security of a production system, it is important that peer-site queries are authenticated and encrypted, which requires an HTTPS-based system API. This in turn requires a well-known, trusted CA so that the peers can establish secure HTTPS communication with each other. If you are using an internally trusted CA, ensure that the system trusts the CA by installing its certificate with the following command.

~(keystone_admin)]$ system certificate-install --mode ssl_ca <trusted-ca-bundle-pem-file>

where:

<trusted-ca-bundle-pem-file>

is the path to the Intermediate or Root CA certificate associated with the StarlingX REST API's Intermediate or Root CA-signed certificate.
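
After installing the certificate, you can confirm that the CA is in place by listing the installed certificates; a minimal check using the system CLI:

~(keystone_admin)]$ system certificate-list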

Procedure

You can enable the GEO Redundancy feature between site A and site B from the command line. In this procedure, the subclouds managed by site A are configured to be managed by a GEO Redundancy protection group that consists of site A and site B. When site A goes offline for any reason, an alarm notifies the administrator, who initiates the group-based batch migration to rehome the subclouds of site A to site B for centralized management.

Similarly, you can configure the subclouds managed by site B to be taken over by site A when site B is offline by following the same procedure with site B as the local site and site A as the peer site.

  1. Log in to the active controller node of site B and collect the following information about site B, which is required to create a protection group:

    • Unique UUID of the central cloud of the peer system controller

    • URL of the Keystone public endpoint of the peer system controller

    • Gateway IP address of the management network of peer system controller

    For example:

    # On site B
    sysadmin@controller-0:~$ source /etc/platform/openrc
    ~(keystone_admin)]$ system show | grep -i uuid
    | uuid | 223fcb30-909d-4edf-8c36-1aebc8e9bd4a |
    
    ~(keystone_admin)]$ openstack endpoint list --service keystone \
        --interface public --region RegionOne -c URL
    +-----------------------------+
    | URL                         |
    +-----------------------------+
    | http://10.10.10.2:5000      |
    +-----------------------------+
    
    ~(keystone_admin)]$ system host-route-list controller-0 | awk '{print $10}' | grep -v "^$"
    gateway
    10.10.27.1
    
  2. Log in to the active controller node of the central cloud of site A. Create a System Peer instance for site B on site A so that site A can access information about site B.

    # On site A
    ~(keystone_admin)]$ dcmanager system-peer add \
        --peer-uuid 223fcb30-909d-4edf-8c36-1aebc8e9bd4a \
        --peer-name siteB \
        --manager-endpoint http://10.10.10.2:5000 \
        --peer-controller-gateway-address 10.10.27.1
    Enter the admin password for the system peer:
    Re-enter admin password to confirm:
    
    +----+--------------------------------------+-----------+-----------------------------+----------------------------+
    | id | peer uuid                            | peer name | manager endpoint            | controller gateway address |
    +----+--------------------------------------+-----------+-----------------------------+----------------------------+
    |  2 | 223fcb30-909d-4edf-8c36-1aebc8e9bd4a | siteB     | http://10.10.10.2:5000      | 10.10.27.1                 |
    +----+--------------------------------------+-----------+-----------------------------+----------------------------+
    
  3. Collect the same information from site A.

    # On site A
    sysadmin@controller-0:~$ source /etc/platform/openrc
    ~(keystone_admin)]$ system show | grep -i uuid
    ~(keystone_admin)]$ openstack endpoint list --service keystone --interface public --region RegionOne -c URL
    ~(keystone_admin)]$ system host-route-list controller-0 | awk '{print $10}' | grep -v "^$"
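    # Expected output (the site A example values shown here are the ones
    # reused in step 4):
    #   uuid:     3963cb21-c01a-49cc-85dd-ebc1d142a41d
    #   endpoint: http://10.10.11.2:5000
    #   gateway:  10.10.25.1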
    
  4. Log in to the active controller node of the central cloud of site B. Create a System Peer instance of site A on site B so that site B has information about site A.

    # On site B
    ~(keystone_admin)]$ dcmanager system-peer add \
        --peer-uuid 3963cb21-c01a-49cc-85dd-ebc1d142a41d \
        --peer-name siteA \
        --manager-endpoint http://10.10.11.2:5000 \
        --peer-controller-gateway-address 10.10.25.1
    Enter the admin password for the system peer:
    Re-enter admin password to confirm:
    
  5. Create a subcloud peer group (SPG) for site A.

    # On site A
    ~(keystone_admin)]$ dcmanager subcloud-peer-group add --peer-group-name group1
    
  6. Add the subclouds needed for redundancy protection on site A.

    Ensure that the subcloud's bootstrap data is up to date. The bootstrap data is the data used to bootstrap the subcloud; it includes the OAM and management network information, the system controller gateway information, and the docker registry information needed to pull the images required to bootstrap the system.

    For an example of a typical bootstrap file, see Install and Provision a Subcloud.
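
    For illustration only, a bootstrap-values file typically carries entries along the following lines. The keys are those of a standard subcloud bootstrap file and all values are placeholders; see the guide above for the authoritative format.

      # bootstrap-values.yaml (illustrative placeholder values)
      name: subcloud1
      system_mode: simplex
      management_subnet: 192.168.101.0/24
      management_start_address: 192.168.101.2
      management_end_address: 192.168.101.50
      management_gateway_address: 192.168.101.1
      external_oam_subnet: 10.10.20.0/24
      external_oam_gateway_address: 10.10.20.1
      external_oam_floating_address: 10.10.20.12
      systemcontroller_gateway_address: 192.168.204.101
      docker_registries:
        k8s.gcr.io:
          url: registry.central:9001/k8s.gcr.io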

    1. Update the subcloud information with the bootstrap values.

      ~(keystone_admin)]$ dcmanager subcloud update subcloud1 \
         --bootstrap-address <Subcloud_OAM_IP_Address> \
         --bootstrap-values <Path_of_Bootstrap-Value-File>
      
    2. Update the subcloud information with the SPG created locally.

      ~(keystone_admin)]$ dcmanager subcloud update <SiteA-Subcloud1-Name> \
          --peer-group <SiteA-Subcloud-Peer-Group-ID-or-Name>
      

      For example,

      ~(keystone_admin)]$ dcmanager subcloud update subcloud1 --peer-group group1
      
    3. If you want to remove one subcloud from the SPG, run the following command:

      ~(keystone_admin)]$ dcmanager subcloud update <SiteA-Subcloud-Name> --peer-group none
      

      For example,

      ~(keystone_admin)]$ dcmanager subcloud update subcloud1 --peer-group none
      
    4. Check the subclouds that are under the SPG.

      ~(keystone_admin)]$ dcmanager subcloud-peer-group list-subclouds <SiteA-Subcloud-Peer-Group-ID-or-Name>
      
  7. Create an association between the System Peer and the SPG.

    # On site A
    ~(keystone_admin)]$ dcmanager peer-group-association add \
        --system-peer-id <SiteB-System-Peer-ID> \
        --peer-group-id <SiteA-Subcloud-Peer-Group-ID> \
        --peer-group-priority <priority>
    

    The peer-group-priority parameter accepts an integer value greater than 0. It sets the priority of the SPG that is created on the peer site, through the peer site's dcmanager API, during association synchronization.

    • The default priority in the SPG is 0 when it is created in the local site.

    • The smallest integer has the highest priority.
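
    For example, using the values from the earlier outputs (system peer ID 2 from step 2, SPG ID 1, and priority 2, matching the association shown below), a representative invocation is:

    # On site A
    ~(keystone_admin)]$ dcmanager peer-group-association add \
        --system-peer-id 2 \
        --peer-group-id 1 \
        --peer-group-priority 2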

    During association creation, the SPG in the association, together with the subclouds belonging to it, is synchronized from the local site to the peer site.

    Confirm that the local SPG and its subclouds have been synchronized into site B with the same name.

    • Show the association information just created in site A and ensure that sync_status is in-sync.

      # On site A
      ~(keystone_admin)]$ dcmanager peer-group-association show <Association-ID>
      
      +----+---------------+----------------+---------+-----------------+---------------------+
      | id | peer_group_id | system_peer_id | type    | sync_status     | peer_group_priority |
      +----+---------------+----------------+---------+-----------------+---------------------+
      |  1 |             1 |              2 | primary | in-sync         | 2                   |
      +----+---------------+----------------+---------+-----------------+---------------------+
      
    • Show the subcloud-peer-group on site B and ensure that it has been created.

    • List the subclouds in the subcloud-peer-group on site B and ensure that all of them have been synchronized as secondary subclouds.

      # On site B
      ~(keystone_admin)]$ dcmanager subcloud-peer-group show <SiteA-Subcloud-Peer-Group-Name>
      ~(keystone_admin)]$ dcmanager subcloud-peer-group list-subclouds <SiteA-Subcloud-Peer-Group-Name>
      

    When you create the primary association on site A, a non-primary association on site B will automatically be created to associate the synchronized SPG from site A and the system peer pointing to site A.

    You can check the association list to confirm if the non-primary association was created on site B.

    # On site B
    ~(keystone_admin)]$ dcmanager peer-group-association list
    +----+---------------+----------------+-------------+-------------+---------------------+
    | id | peer_group_id | system_peer_id | type        | sync_status | peer_group_priority |
    +----+---------------+----------------+-------------+-------------+---------------------+
    |  2 |            26 |              1 | non-primary | in-sync     | None                |
    +----+---------------+----------------+-------------+-------------+---------------------+
    
  8. (Optional) Update the protection-group-related configuration.

    After the peer group association has been created, you can still update the related resources configured in the protection group:

    • Update subcloud with bootstrap values

    • Add subcloud(s) into the SPG

    • Remove subcloud(s) from the SPG

    After any of the above operations, sync_status is changed to out-of-sync.

    After the update has been completed, use the sync command to push the SPG changes to the peer site so that the SPG stays in the same state on both sites.

    # On site A
    ~(keystone_admin)]$ dcmanager peer-group-association sync <SiteA-Peer-Group-Association1-ID>
    

    Warning

    The dcmanager peer-group-association sync command must be run after any of the following changes:

    • A subcloud is removed from the SPG in order to change the subcloud name.

    • A subcloud is removed from the SPG in order to reconfigure the subcloud management network.

    • A subcloud is updated with one or both of the --bootstrap-address and --bootstrap-values parameters.

    Similarly, verify that the information has been synchronized by showing the association on site A and ensuring that sync_status is in-sync.

    # On site A
    ~(keystone_admin)]$ dcmanager peer-group-association show <Association-ID>
    
     +----+---------------+----------------+---------+-----------------+---------------------+
     | id | peer_group_id | system_peer_id | type    | sync_status     | peer_group_priority |
     +----+---------------+----------------+---------+-----------------+---------------------+
     |  1 |             1 |              2 | primary | in-sync         | 2                   |
     +----+---------------+----------------+---------+-----------------+---------------------+
    

Results

You have configured a GEO Redundancy protection group between site A and site B. If site A goes offline, the subclouds configured in the SPG can be manually migrated in a batch to site B for centralized management.

Health Monitor and Migration

Peer Monitoring and Alarming

After the peer protection group is formed, if site B loses connectivity to site A, an alarm is raised on site B.

For example:

# On site B
~(keystone_admin)]$ fm alarm-list
+----------+--------------------------------------------------------------------------------------------------------------------------+--------------------------------------+----------+--------------------------+
| Alarm ID | Reason Text                                                                                                              | Entity ID                            | Severity | Time Stamp               |
+----------+--------------------------------------------------------------------------------------------------------------------------+--------------------------------------+----------+--------------------------+
| 280.004  | Peer siteA is in disconnected state. Following subcloud peer groups are impacted: group1.                                | peer=223fcb30-909d-4edf-             | major    | 2023-08-18T10:25:29.     |
|          |                                                                                                                          | 8c36-1aebc8e9bd4a                    |          | 670977                   |
|          |                                                                                                                          |                                      |          |                          |
+----------+--------------------------------------------------------------------------------------------------------------------------+--------------------------------------+----------+--------------------------+

The administrator can suppress the alarm with the following command:

# On site B
~(keystone_admin)]$ fm event-suppress --alarm_id 280.004
+----------+------------+
| Event ID | Status     |
+----------+------------+
| 280.004  | suppressed |
+----------+------------+
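
Once connectivity to the peer is restored and the underlying condition clears, the suppression can be lifted with the matching unsuppress command:

# On site B
~(keystone_admin)]$ fm event-unsuppress --alarm_id 280.004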

Migration

If site A is down, after receiving the alarm the administrator can choose to perform the migration on site B, which migrates the subclouds under the SPG from site A to site B.

Note

Before initiating the migration operation, ensure that the sync_status of the peer group association is in-sync so that the latest updates from site A have been successfully synchronized to site B. If sync_status is not in-sync, the migration may fail.

# On site B
~(keystone_admin)]$ dcmanager subcloud-peer-group migrate <Subcloud-Peer-Group-ID-or-Name>

# For example:
~(keystone_admin)]$ dcmanager subcloud-peer-group migrate group1

During the batch migration, you can check the status of the migration of each subcloud in the SPG by showing the details of the SPG being migrated.

# On site B
~(keystone_admin)]$ dcmanager subcloud-peer-group status <Subcloud-Peer-Group-ID-or-Name>

After successful migration, the subcloud(s) should be in managed/online/complete status on site B.

For example:

# On site B
~(keystone_admin)]$ dcmanager subcloud list
+----+---------------------------------+------------+--------------+---------------+-------------+---------------+-----------------+
| id | name                            | management | availability | deploy status | sync        | backup status | backup datetime |
+----+---------------------------------+------------+--------------+---------------+-------------+---------------+-----------------+
| 45 | subcloud3-node2                 | managed    | online       | complete      | in-sync     | None          | None            |
| 46 | subcloud1-node6                 | managed    | online       | complete      | in-sync     | None          | None            |
+----+---------------------------------+------------+--------------+---------------+-------------+---------------+-----------------+

Post Migration

If site A is restored, the subclouds on site A should be adjusted to unmanaged/secondary status. The administrator receives an alarm on site A indicating that the SPG is managed by a peer site (site B), because the SPG on site A has the higher priority.

~(keystone_admin)]$ fm alarm-list
+----------+--------------------------------------------------------------------+-----------------------+----------+----------------------------+
| Alarm ID | Reason Text                                                        | Entity ID             | Severity | Time Stamp                 |
+----------+--------------------------------------------------------------------+-----------------------+----------+----------------------------+
| 280.005  | Subcloud peer group (peer_group_name=group1) is managed by remote  | subcloud_peer_group=7 | warning  | 2023-09-04T04:51:58.435539 |
|          | system (peer_uuid=223fcb30-909d-4edf-8c36-1aebc8e9bd4a) with lower |                       |          |                            |
|          | priority.                                                          |                       |          |                            |
+----------+--------------------------------------------------------------------+-----------------------+----------+----------------------------+

Then, the administrator can decide if and when to migrate the subcloud(s) back.

# On site A
~(keystone_admin)]$ dcmanager subcloud-peer-group migrate <Subcloud-Peer-Group-ID-or-Name>

# For example:
~(keystone_admin)]$ dcmanager subcloud-peer-group migrate group1

After successful migration, the subcloud status should return to managed/online/complete.

For example:

+----+---------------------------------+------------+--------------+---------------+---------+---------------+-----------------+
| id | name                            | management | availability | deploy status | sync    | backup status | backup datetime |
+----+---------------------------------+------------+--------------+---------------+---------+---------------+-----------------+
| 33 | subcloud3-node2                 | managed    | online       | complete      | in-sync | None          | None            |
| 34 | subcloud1-node6                 | managed    | online       | complete      | in-sync | None          | None            |
+----+---------------------------------+------------+--------------+---------------+---------+---------------+-----------------+

Also, the alarm mentioned above will be cleared after migrating back.

# On site A
~(keystone_admin)]$ fm alarm-list

Disable GEO Redundancy

You can disable the GEO Redundancy feature from the command line.

Before disabling the GEO Redundancy feature, ensure that the environment is stable and that the subclouds are managed by the expected site.
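
For example, before tearing down the protection group, you can confirm the current state with the commands used earlier in this procedure:

# On site A
~(keystone_admin)]$ dcmanager peer-group-association list
~(keystone_admin)]$ dcmanager subcloud-peer-group list-subclouds group1
~(keystone_admin)]$ dcmanager subcloud list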

Procedure

  1. Delete the primary association on both sites.

    # On site A
    ~(keystone_admin)]$ dcmanager peer-group-association delete <SiteA-Peer-Group-Association1-ID>
    
  2. Delete the SPG.

    # On site A
    ~(keystone_admin)]$ dcmanager subcloud-peer-group delete group1
    
  3. Delete the system peer.

    # On site A
    ~(keystone_admin)]$ dcmanager system-peer delete siteB
    # On site B
    ~(keystone_admin)]$ dcmanager system-peer delete siteA
    

Results

You have torn down the protection group between site A and site B.

Backup and Restore Subcloud

You can back up and restore a subcloud in a distributed cloud environment. However, GEO Redundancy does not support replicating subcloud backup files from one site to another.

A subcloud backup is valid only for the current system controller. When a subcloud is migrated from site A to site B, the existing backup becomes unavailable. In this case, you can create a new backup of that subcloud on site B. Subsequently, you can restore the subcloud from this newly created backup when it is managed under site B.
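
For example, to create a fresh backup of a migrated subcloud on site B, a minimal sketch using the dcmanager subcloud-backup facility described in the guides referenced below (available options vary by release):

# On site B
~(keystone_admin)]$ dcmanager subcloud-backup create --subcloud subcloud1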

For information on how to back up and restore a subcloud, see Backup a Subcloud/Group of Subclouds using DCManager CLI and Restore a Subcloud/Group of Subclouds from Backup Data Using DCManager CLI.

Operations Performed on Protected Subclouds

The table below lists the operations that can and cannot be performed on protected subclouds.

Primary site: The site where the SPG was created.

Secondary site: The peer site where the subclouds in the SPG can be migrated to.

Protected subcloud: The subcloud that belongs to an SPG.

Local/Unprotected subcloud: The subcloud that does not belong to any SPG.

Operation

Allow (Y/N/Maybe)

Note

Unmanage

N

Subcloud must be removed from the SPG before it can be manually unmanaged.

Manage

N

Subcloud must be removed from the SPG before it can be manually managed.

Delete

N

Subcloud must be removed from the SPG before it can be manually unmanaged and deleted.

Update

Maybe

Subcloud can only be updated while it is managed in the primary site because the sync command can only be issued from the system controller where the SPG was created.

Warning

The subcloud network cannot be reconfigured while it is being managed by the secondary site. If this operation is necessary, perform the following steps:

  1. Remove the subcloud from the SPG to make it a local/unprotected subcloud.

  2. Update the subcloud.

  3. (Optional) Manually rehome the subcloud to the primary site after it is restored.

  4. (Optional) Re-add the subcloud to the SPG.

Rename

Y

  • If the subcloud in the primary site is already part of an SPG, remove it from the SPG; then unmanage, rename, and manage the subcloud; add it back to the SPG; and perform the sync operation (see the sketch after this list).

  • If the subcloud is in the secondary site, perform the following steps:

    1. Remove the subcloud from the SPG to make it a local/unprotected subcloud.

    2. Unmanage the subcloud.

    3. Rename the subcloud.

    4. (Optional) Manually rehome the subcloud to the primary site after it is restored.

    5. (Optional) Re-add the subcloud to the SPG.
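
A minimal sketch of the primary-site rename flow described above, assuming the example names subcloud1, subcloud1-new, and group1, and assuming that your release supports renaming through the --name option of dcmanager subcloud update:

# On site A (primary site)
~(keystone_admin)]$ dcmanager subcloud update subcloud1 --peer-group none
~(keystone_admin)]$ dcmanager subcloud unmanage subcloud1
~(keystone_admin)]$ dcmanager subcloud update subcloud1 --name subcloud1-new
~(keystone_admin)]$ dcmanager subcloud manage subcloud1-new
~(keystone_admin)]$ dcmanager subcloud update subcloud1-new --peer-group group1
~(keystone_admin)]$ dcmanager peer-group-association sync <Association-ID>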

Patch

Y

Warning

There may be a patch out-of-sync alarm when the subcloud is migrated to another site.

Upgrade

Y

All the system controllers in the protection group must be upgraded before any of the subclouds are upgraded.

Rehome

N

Subcloud cannot be manually rehomed while it is part of the SPG.

Backup

Y

Restore

Maybe

  • If the subcloud in the primary site is already part of an SPG, remove it from the SPG; then unmanage and restore the subcloud; add it back to the SPG; and perform the sync operation.

  • If the subcloud is in the secondary site, perform the following steps (see the sketch after this list):

    1. Remove the subcloud from the SPG to make it a local/unprotected subcloud.

    2. Unmanage the subcloud.

    3. Restore the subcloud from the backup.

    4. (Optional) Manually rehome the subcloud to the primary site after it is restored.

    5. (Optional) Re-add the subcloud to the SPG.
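
A minimal sketch of the secondary-site restore flow described above, assuming the example name subcloud1 and that your release provides the dcmanager subcloud-backup restore command (the exact options are covered in the restore guide referenced earlier):

# On site B (secondary site)
~(keystone_admin)]$ dcmanager subcloud update subcloud1 --peer-group none
~(keystone_admin)]$ dcmanager subcloud unmanage subcloud1
~(keystone_admin)]$ dcmanager subcloud-backup restore --subcloud subcloud1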

Prestage

Y

Warning

The prestage data will get overwritten because it is not guaranteed that both the system controllers always run on the same patch level (ostree repo) and/or have the same images list.

Reinstall

Maybe

If the subcloud in the primary site is already part of an SPG, remove it from the SPG; unmanage and reinstall the subcloud; add it back to the SPG; and perform the sync operation.

If the subcloud is in the secondary site, perform the following steps:

  1. Remove the subcloud from the SPG to make it a local/unprotected subcloud.

  2. Unmanage the subcloud.

  3. Reinstall the subcloud.

  4. (Optional) Manually rehome the subcloud to the primary site after it is restored.

  5. (Optional) Re-add the subcloud to the SPG.

Remove from SPG

Maybe

Subcloud can be removed from the SPG in the primary site. Subcloud can only be removed from the SPG in the secondary site if the primary site is currently down.

Add to SPG

Maybe

Subcloud can only be added to the SPG in the primary site as manual sync is required.

Note

After migrating the subcloud, kube-rootca_sync_status may become out-of-sync if it is not synchronized with the new system controller. To update the root CA certificate of the subcloud, run the dcmanager kube-rootca-update-strategy command and pass the kube root CA certificate from the new system controller, as sketched below. Note that if you update the certificate and later migrate the subcloud back to the primary site, the certificate needs to be updated again.
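
A minimal sketch of that certificate update, assuming the strategy follows the usual dcmanager create/apply pattern and that the create step in your release accepts a certificate file and a subcloud name (both placeholders here):

# On the new system controller
~(keystone_admin)]$ dcmanager kube-rootca-update-strategy create --cert-file <new-rootca-cert-file> <subcloud-name>
~(keystone_admin)]$ dcmanager kube-rootca-update-strategy apply
~(keystone_admin)]$ dcmanager kube-rootca-update-strategy show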