Subcloud GEO Redundancy Error Root Causes and Corrective Actions

This section describes error scenarios that can occur while using the GEO Redundancy feature. The scenarios assume two distributed clouds, site A and site B, with the GEO Redundancy feature activated, site A designated as the primary site, and site B designated as the non-primary site. The feature allows subclouds to be migrated to the non-primary site when the primary site becomes unavailable, and migrated back to the primary site when it becomes available again.

The error scenarios are divided into the following categories:

Protection group setup

This category covers errors and issues detected during setup of the protection group.

Error scenarios

Recovery mechanism

Site A goes down temporarily in the middle of association.

Upon site A recovery, the peer group association will automatically change its sync status to failed.

The administrator can trigger re-sync from the primary site if sync_status is either failed or out-of-sync.

Possible values of sync_status include syncing, in-sync, out-of-sync, failed, and unknown.

Possible values of association_type are primary and non-primary.
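The re-sync rule above can be expressed as a small shell check. This is a minimal sketch, not part of dcmanager: needs_resync is a hypothetical helper name, and the status string is assumed to have been read from the association already.

```shell
# Hypothetical helper: succeed (exit 0) only when the given sync_status
# is one that the administrator can re-sync from the primary site.
needs_resync() {
    case "$1" in
        failed|out-of-sync) return 0 ;;  # re-sync can be triggered
        *) return 1 ;;                   # syncing, in-sync, unknown, or anything else
    esac
}

# Example, run on the primary site (status and ASSOCIATION_ID are placeholders):
#   needs_resync "$status" && dcmanager peer-group-association sync "$ASSOCIATION_ID"
```

Keeping the decision in one function makes it easy to reuse the same rule in scripts that monitor several associations.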

Site A is down in the middle of synchronization and remains offline for an extended period of time.

How does the user check the syncing status from site B to initiate the migration?

The administrator can check the peer group association sync status in the non-primary site to decide the next step. If the sync status is in-sync, migration can be initiated.

After initial sync is completed, site B goes down. How does site A sync to site B after site B comes back online?

Site A needs to keep track of subcloud group updates while site B is down. The sync status in site A changes to unknown.

The peer group association sync status in site A will change to unknown as soon as site B becomes unavailable. Upon the recovery of site B, the sync status will become in-sync on both sites again.

If changes are made to the peer group while site B is offline, the sync status in site A will change to failed. Upon the recovery of site B, the sync status in site A will change to out-of-sync. The administrator will need to re-initiate the sync in site A using the dcmanager peer-group-association sync <SiteA-Peer-Group-Association-ID> command.

Site B is offline when a peer group association is created to associate the peer with a subcloud peer group (SPG).

The creation of the association is accepted, but its sync_status is set to failed. The protection group cannot be created.

The administrator can re-sync the association after site B is online using the dcmanager peer-group-association sync <SiteA-Peer-Group-Association-ID> command.

Swact occurs in site A while a peer group association is syncing.

The expected behavior is similar to an abrupt shutdown of site A during sync. A re-sync is required.

Swact occurs in site B while a peer group association is syncing.

The expected behavior is similar to an abrupt shutdown of site B during sync. A re-sync is required.

In the event of either site going down or swact occurring:

  1. How can you track which secondary subclouds have been added to site B and which are yet to be added?

  2. How can you track which new subclouds have been added to the peer group and which are yet to be added?

  1. Use the dcmanager peer-group-association show <association-id> command to view the sync status in the available site. If the status is in-sync, all the subclouds have been added. Otherwise, synchronization has not finished and needs to be re-initiated in the primary site when both sites are online.

  2. Run the dcmanager subcloud-peer-group list-subclouds <peer-group> command on site B to check the total number of secondary subclouds and the subcloud details.
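When the check in step 1 is scripted, the status has to be pulled out of the command output. The following is a rough sketch only: first_sync_status is a hypothetical helper, the exact output layout of dcmanager peer-group-association show is not assumed, and the helper simply prints the first known sync_status value it finds. It therefore assumes that no other field in the output contains one of these words, and that the installed grep supports the -o and -w options.

```shell
# Hypothetical helper: read command output on stdin and print the first
# token that is a known sync_status value. Prints nothing if none is found.
first_sync_status() {
    grep -Eow 'syncing|in-sync|out-of-sync|failed|unknown' | head -n 1
}

# Example, run on the available site (the association ID is a placeholder):
#   dcmanager peer-group-association show <association-id> | first_sync_status
```

If the real output turns out to include these words in other fields, a parser tied to the actual column layout would be a safer choice.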

Migration

Assumption: Subclouds will be migrated to site B if site A goes down.

The following are the error scenarios that can occur during peer group migration.

Error scenarios

Recovery mechanism

What will be the status of the SPG if some subclouds failed to migrate?

After the migration, use dcmanager subcloud-peer-group list-subclouds to check the status of the subclouds in this SPG, and use dcmanager subcloud-peer-group status to check the status of the SPG itself.

Re-run the dcmanager subcloud-peer-group migrate PEER_GROUP command after fixing the failure.

How can you recover when a subcloud rehome fails because of an incorrect bootstrap address or incorrect bootstrap values, and site A cannot be recovered for some time?

When site A goes down, migrate the SPG to site B. A subcloud with the wrong bootstrap address or bootstrap values goes to the rehome-failed deploy status. If the subcloud migration fails and the primary site is down, you can update the bootstrap address and bootstrap values using the dcmanager subcloud update --bootstrap-address and dcmanager subcloud update --bootstrap-values commands. You do not need to remove the rehome-failed subcloud from the SPG.
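As an illustration of the two update commands, the sketch below prints them for review instead of running them. print_bootstrap_fix is a hypothetical helper, the subcloud name, address, and file path in the example are placeholders, and passing the subcloud as the trailing argument is an assumption about the command's argument order.

```shell
# Hypothetical helper: print (do not run) the update commands for a
# subcloud in rehome-failed status so they can be reviewed first.
# Assumption: the subcloud is given as the last argument of each command.
print_bootstrap_fix() {
    subcloud=$1
    address=$2
    values_file=$3
    echo "dcmanager subcloud update --bootstrap-address $address $subcloud"
    echo "dcmanager subcloud update --bootstrap-values $values_file $subcloud"
}

# Example with placeholder values:
#   print_bootstrap_fix subcloud1 10.10.10.12 /home/sysadmin/bootstrap-values.yml
```

After the values are corrected, the SPG migration can be re-run as described for partially failed migrations.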

How can you fix a subcloud that has an incorrect bootstrap address or incorrect bootstrap values in the following situations during the SPG migration to site B?

  • Site A is recovered during migration.

  • Site A is recovered post migration.

  • Site A is online before the migration process.

Check the SPG migration status using the dcmanager subcloud-peer-group status command to confirm whether any subcloud is in rehoming status. If no subcloud is in rehoming status, the SPG migration has completed and you need to migrate the SPG back to site A. You can update the subcloud after the migration failure and try again. To recover the subcloud, follow the instructions below:

  • When site A is recovered during migration, you can update the subcloud on site A. After the update, you need to wait for the SPG migration process to finish. You can then migrate SPG back to site A to recover the subcloud.

  • When site A is recovered post migration, you can migrate the SPG back to site A. If the subcloud rehome fails again in site A, you can update the subcloud.

  • When site A is online before the migration process, you can update the subcloud on site A and sync the updated subcloud to site B.

Use the dcmanager subcloud update --bootstrap-address and dcmanager subcloud update --bootstrap-values commands to update the subcloud. You do not need to remove the rehome-failed subcloud from the SPG.
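The three situations above can be summarized as a lookup, which may help when writing a runbook. This is only a sketch: recovery_steps is a hypothetical helper, and the keywords during, post, and before are labels chosen for this example, not dcmanager values.

```shell
# Hypothetical helper: print the recovery steps for the moment at which
# site A became available relative to the SPG migration to site B.
recovery_steps() {
    case "$1" in
        during) echo "update subcloud on site A; wait for SPG migration to finish; migrate SPG back to site A" ;;
        post)   echo "migrate SPG back to site A; if rehome fails again, update subcloud" ;;
        before) echo "update subcloud on site A; sync updated subcloud to site B" ;;
        *)      echo "unknown situation: $1" >&2; return 1 ;;
    esac
}

# Example:
#   recovery_steps during
```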

Site B goes down during SPG migration.

Re-execute the SPG migration if there is any subcloud with rehome-failed deploy status after site B is online.
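A script can decide whether the migration needs to be re-executed by counting subclouds in rehome-failed status. The sketch below is illustrative only: count_rehome_failed is a hypothetical helper, and it assumes that the output of dcmanager subcloud-peer-group list-subclouds shows each subcloud on its own line together with its deploy status.

```shell
# Hypothetical helper: read list-subclouds output on stdin and print the
# number of lines that mention the rehome-failed deploy status.
# "|| true" keeps the exit status at 0 when the count is zero.
count_rehome_failed() {
    grep -c 'rehome-failed' || true
}

# Example, run on site B once it is back online (PEER_GROUP is a placeholder):
#   failed=$(dcmanager subcloud-peer-group list-subclouds PEER_GROUP | count_rehome_failed)
#   [ "$failed" -gt 0 ] && dcmanager subcloud-peer-group migrate PEER_GROUP
```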

Post migration

Audit operations are triggered when the network is restored or when the retrieved migration_status of the peer group changes to complete.

Error scenarios

Recovery mechanism

Site B goes down after the SPG has been migrated to its site.

Upon site A recovery, the administrator can trigger the migration of the SPG back to site A.