StarlingX Platform Upgrades

Storyboard: https://storyboard.openstack.org/#!/story/2007403

This story will provide a mechanism to upgrade the platform components on a running StarlingX system. This is required to allow upgrades between StarlingX versions.

The platform upgrade includes the Host OS and the StarlingX components (e.g. flock services).

A maintenance release for stx3 is required in order to upgrade to stx4.0.

Problem Description

StarlingX must provide a mechanism to allow migration to a new StarlingX release.

In order to provide a robust and simple upgrade experience for users of StarlingX, the upgrade process must be automated as much as possible and controls must be in place to ensure the steps are followed in the right order.

Release-over-release compatibility of the platform components is affected by inter-node messaging between components, configuration migration requirements, and kubernetes control plane compatibility.

Downtime during an upgrade must be minimized:

  • controller upgrade - impact limited to the time required for a host-swact

  • worker upgrade - application impact limited to the time it takes to migrate applications off a worker node before it is upgraded

  • storage upgrade - no loss of storage over an upgrade

Upgrades must be done in-service. The platform and applications must continue to provide service during the upgrade. This does not apply to simplex deployments.

Upgrades must be done without any additional hardware.

Background

Three types of StarlingX upgrades will be supported:

  • Platform Upgrade, which includes Host OS and StarlingX components (e.g. flock services)

  • Kubernetes Upgrade

  • Application upgrade, which includes: StarlingX applications (e.g. platform-integ-apps, stx-openstack), User applications and kubernetes workloads

These three types of upgrades are done independently. For example, the Platform is upgraded to a new release of StarlingX without changing the kubernetes version. However, there are dependencies which determine the order in which these upgrades can be done. For example, kubernetes must be upgraded to a particular version before a platform upgrade can be done.

Use Cases

  • Administrator wants to upgrade to a new StarlingX platform version with minimal impact to running applications.

  • Administrator wants to abort an upgrade in progress prior to upgrading all controllers. Note: downgrade to the previous release version is not supported.

Proposed Process

StarlingX will only support upgrades from release N to release N+1. For example, maintenance release stx3.x can be upgraded to stx4.0, but not directly to stx5.0. Changes required for kubernetes configuration compatibility are delivered in a maintenance release to enable the upgrade from stx3 to stx4.

The administrator must ensure that their deployment has enough extra capacity (e.g. worker hosts) to allow one (or more) hosts to be temporarily taken out of service for the upgrade.

For each supported platform version, the versions it can be upgraded from are tracked in metadata (in metal/common-bsp/files/upgrades/metadata.xml). The metadata handling is extended to support multiple from-versions.

A maintenance release will enable stx3 to stx4 upgrades, and includes the configuration updates required to enable compatibility with the kubernetes control plane during the upgrade.

The following is a summary of the steps the user will take when performing a platform upgrade. For each step, a summary of the actions the system will perform is provided.

The software_upgrade table tracks the upgrade state (see the upgrade-show example after the state lists below), and includes:
  • upgrade-started

  • data-migration

  • data-migration-complete

  • upgrading-controllers

  • upgrading-hosts

  • activation-requested

  • activation-complete

  • completing

  • completed

When an upgrade is aborted the following state transitions occur:
  • aborting

  • abort-completing

  • aborting-reinstall
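
For reference, the current upgrade state recorded in this table can be viewed at any time with the upgrade-show command. The output below is indicative only; the exact fields and values may vary slightly by release:

  # system upgrade-show
  +--------------+--------------------------+
  | Property     | Value                    |
  +--------------+--------------------------+
  | uuid         | <uuid>                   |
  | state        | data-migration-complete  |
  | from_release | 19.12                    |
  | to_release   | 20.06                    |
  +--------------+--------------------------+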

  1. Import release N+1 load

    # system load-import <bootimage.iso> <bootimage.sig>
    
    # system load-list
    +----+----------+------------------+
    | id | state    | software_version |
    +----+----------+------------------+
    | 1  | active   | 19.12            |
    | 2  | imported | 20.06            |
    +----+----------+------------------+
    

    The fields are:

    • software_version: comes from metadata in the load image.

    • states:

      • active: the current version, version N

      • importing: image is being uploaded to load repository

      • error: the load is in an error state (e.g. the import failed)

      • deleting: load is being deleted from repository

      • imported: version that can be upgraded to, i.e. version N+1

  2. Perform Health checks for upgrade

    # system health-query-upgrade
    

    This will perform health checks to ensure the system is at a state ready for upgrade.

    These health checks are also performed as part of upgrade-start.

    These include checks for:

    • upgrade target load is imported

    • all hosts provisioned

    • all hosts load current

    • all hosts unlocked/enabled

    • all hosts have matching configs

    • no management affecting alarms

    • for ceph systems: storage cluster is healthy

    • verifies kubernetes nodes are ready

    • verifies kubernetes control plane pods are ready

    • verifies that the kubernetes control plane is at a version and configuration required for upgrade. If not, the kubernetes upgrade [1] method must be performed in order to bring it to baseline.
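
    The output below is indicative only; the exact wording of the checks varies by release, but each check is typically reported as [OK] or [Fail]:

    # system health-query-upgrade
    System Health:
    All hosts are provisioned: [OK]
    All hosts are unlocked/enabled: [OK]
    All hosts have current configurations: [OK]
    No alarms: [OK]
    All kubernetes nodes are ready: [OK]
    All kubernetes control plane pods are ready: [OK]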

  3. Start the upgrade

    # system upgrade-start
    

    This performs semantic checks and the health checks as per the ‘health-query-upgrade’ step.

    This will make a copy of the system data (e.g. postgres databases, armada, helm, kubernetes, and puppet hiera data) to be used for the data migration in the upgrade.

    Note that the /opt/etcd cluster data may continue to be dynamically updated until the host-swact, when service management brings the active-standby etcd down on the release N side and up on the release N+1 side.

    Configuration changes are not allowed after this point, until the upgrade is completed.

  4. Lock and upgrade controller-1

    # system host-upgrade controller-1
    
    • upgrade state is set to ‘data-migration’

    • update the upgrade_controller_1 flag so that controller-1 can determine whether it is in an upgrade

    • host controller-1 is reinstalled with N+1 load

    • Migrate data and configuration from release N to release N+1

    • A special release N+1 puppet upgrade manifest is applied, based on the hiera data that was migrated from release N. This allows for one-time actions similar to what was done on the initial install of controller-0 (e.g. configuring rabbit, postgres, keystone).

    • Generate hiera data for release N+1, to be used to apply the regular puppet manifest when controller-1 is unlocked

    • sync replicated (DRBD) filesystems

    • upgrade state is set to ‘data-migration-complete’

    • system data is present in both release N and release N+1 versioned directories (e.g. /opt/platform/config/<release>, /var/lib/postgresql/<release>)
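
    As an illustration (using the release numbers from the load-list example above; the actual directories depend on the releases involved), both versioned copies would be visible at this point:

    # ls /opt/platform/config/
    19.12  20.06
    # ls /var/lib/postgresql/
    19.12  20.06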

  5. Unlock controller-1
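
    The unlock uses the standard host-unlock command:

    # system host-unlock controller-1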

    This includes generating configuration data for controller-1 that must be produced from the active controller.

    • the join_cmd for the kubernetes control plane is generated on the release N side for the N+1 hiera data

    The N+1 hiera data drives the puppet manifest apply.

  6. Swact to controller-1

    # system host-swact controller-0
    
    • controller-1 becomes active and runs release N+1 while rest of the system is running release N

    • Any release N+1 components that do inter-node communications must be backwards compatible to ensure that communication with release N works correctly

    • back up /opt/etcd/version_N and restore it to /opt/etcd/version_N+1 for the target version on host-swact. This must be performed at a time when data loss can be avoided. As part of the host-swact startup on controller-1 during an upgrade, the etcd data is copied from the release N etcd directory to the release N+1 etcd directory.

  7. Lock and upgrade controller-0

    # system host-upgrade controller-0
    
    • install the N+1 load; on host-unlock, apply the upgrade manifest and the puppet host configuration

    • after controller-0 is upgraded, upgrade state is set to ‘upgrading-hosts’

  8. If applicable, Lock and upgrade storage hosts

    # system host-upgrade storage-0
    
    • If provisioned, all storage hosts must be upgraded prior to proceeding with workers

    • Install N+1 load; up to half of the storage hosts can be done in parallel

    • Ceph data sync

  9. Lock and upgrade worker hosts

    # system host-upgrade worker-x
    
    • Migrate workloads from worker node (triggered by host-lock)

    • Install N+1 load

    • Can be done in parallel, depending upon excess capacity. Each worker host will first be locked using the existing “system host-lock” CLI (worker hosts can be done in any order). This results in services being migrated off the host and applies the NoExecute taint, which will evict any pods that can be evicted.
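
    For example, assuming a worker host named worker-0, each worker host follows the same lock, upgrade and unlock sequence:

    # system host-lock worker-0
    # system host-upgrade worker-0
    # system host-unlock worker-0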

  10. Activate the upgrade

    # system upgrade-activate
    
    • Perform any additional configuration which may be required after all hosts have been upgraded.

  11. host-swact to controller-0

    # system host-swact controller-1
    
  12. Complete the upgrade

    # system upgrade-complete
    
    • Run post-checks to ensure the upgrade has been completed

    • Remove release N data

Failure Handling

  • When a failure happens and cannot be resolved without manual intervention, the upgrade state will be set to data-migration-failed or activation-failed.

  • To recover, the user will need to resolve the issue that caused the upgrade step to fail.

  • An upgrade-abort is only possible before controller-0 has been upgraded. In other cases, the user would need to resolve the issue and reattempt the step.
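
For example, if controller-0 has not yet been upgraded, the abort is initiated with the following command and the upgrade then transitions through the applicable abort states listed earlier (aborting, abort-completing, aborting-reinstall):

  # system upgrade-abort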

Health Checks

  • In order to ensure the health and stability of the system, we will perform health checks both before allowing a platform upgrade to start and as each upgrade CLI is run.

  • The health checks will include:

    • basic system health (i.e. system health-query)

    • new kubernetes specific checks - for example:

      • verify that all kubernetes control plane pods are running

      • verify that all kubernetes applications are fully applied

      • verify that kubernetes control plane version and configuration is at baseline required for platform upgrade.
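
Equivalent information can also be inspected manually; for example (illustrative commands, not part of the automated checks):

  # kubectl get nodes
  # kubectl get pods -n kube-system
  # system application-list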

Interactions with container applications

  • The kubernetes join_cmd must be created from the N side running the active kubernetes control plane.

  • The platform upgrade and the kubernetes upgrade are mutually exclusive: a kubernetes upgrade is not allowed while a platform upgrade is in progress, and vice-versa.

  • Before starting a platform upgrade, we also need to check that the kubernetes configuration is at a baseline suitable for upgrade. The N+1 load metadata enforces the configuration baseline required on the release N (from) side.

  • If the N+1 version is at a newer kubernetes version, then the kubernetes upgrade procedure must be completed first in order to align the kubernetes version.

  • After a platform upgrade has started, helm-override operations will be prevented as these configuration changes will not be preserved after upgrade-start and can also trigger applications to be reapplied.

Alternatives

Update the kubernetes configuration to the N+1 configuration after the upgrade. However, this would require coordinating the activation of features such as the control plane address and encryption at rest during the upgrade (for example, in the upgrade-activate step), and it would require release N+1 to be backwards compatible with release N.

A mechanism is required to upgrade etcd [3]; keeping the etcd database versioned will allow an upgrade to a newer etcd version.

etcd Upgrades

  • host-swact to controller-1 during upgrade. As part of the host-swact during an upgrade, the kubernetes etcd is copied from N side. This would take a copy at a time when the etcd data is not allowed to change, as etcd would be brought down on controller-0, and prior to service management bringing up etcd on controller-1. After the new version of etcd runs with the migrated etcd, it is no longer possible to run the old version of etcd against it. Therefore, the release N version of the data must be maintained in the event of a host-swact back or upgrade-abort prior to upgrade of controller-0. This is the chosen alternative.

  • Alternative: migrate etcd at upgrade-start instead of on host-swact. Configuration changes that affect the cluster state information could still occur in this scenario. Kubernetes state changes that occur after the snapshot would be lost and have the potential to put the kubernetes cluster into a bad state.

  • Alternative: /opt/etcd is unversioned so that the N and N+1 sides both reference the same directory. This is based on the premise that kubernetes control plane is upgraded independently and does not require a versioned directory. However, as noted in the host-swact alternative, this would not be compatible with upgrade-abort or host-swact back to the N release.

Data Model Impact

The following tables in the sysinv database are required. The data model required to support platform upgrades already exists in the stx3.0 data model and includes the following platform-upgrade-focused tables:

  • loads: represents the load versions (e.g. N and N+1), the load state, and compatible versions

  • software_upgrade: represents the software upgrade state, from_load and to_load

  • host_upgrade: represents the software_load and target_load for each host

REST API Impact

The v1 load, health, and upgrade resources implement the platform-upgrade-specific URLs used for the upgrade. The api-ref-sysinv-v1-config.rst doc in the config repo is updated accordingly.

The sysinv REST API supports the following upgrade-related methods:

  • The existing resource /loads

    • URLS:

      • /v1/loads

    • Request Methods:

      • GET /v1/loads

        • Returns all platform loads known to the system

      • POST /v1/loads/import_load

        • Imports the new load passed into the body of the POST request

  • The existing resource /upgrade

    • URLS:

      • /v1/upgrade

  • The existing resource /ihosts supports upgrade actions.

    • URLS:

      • /v1/ihosts/<hostid>

    • Request Methods:

      • POST /v1/ihosts/<hostid>/upgrade

        • Upgrades the platform load on the specified host
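
As an illustration (the token and endpoint are placeholders; the sysinv API typically listens on port 6385), the loads could also be queried directly via the REST API:

  # curl -s -H "X-Auth-Token: ${TOKEN}" http://<sysinv-api-endpoint>:6385/v1/loads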

Security Impact

This story provides a mechanism to upgrade the platform from one version to another. It does not introduce any additional security impacts beyond those of the initial deployment.

Other End User Impact

End users will typically perform upgrades using the sysinv (i.e. system) CLI. The CLI commands used for the upgrade are as noted in the Proposed Process section above.

Performance Impact

When a platform upgrade is in progress, each host must be taken out of service in order to install the new load. The user must ensure that there is enough capacity in the system to handle the removal from service of one (or more) hosts as the load on each host is upgraded.

Other Deployer Impact

Deployers will now be able to upgrade StarlingX platform on a running system.

Developer Impact

Developers working on the StarlingX components that manage container applications may need to be aware that certain operations should be prevented when a platform upgrade is in progress. This is discussed in the Proposed Process section above.

Upgrade Impact

StarlingX platform upgrades are independent from the Kubernetes upgrade [1]. However, when StarlingX platform upgrades are supported, checks must be put in place to ensure that the kubernetes version is not allowed to change due to a platform upgrade. In effect, the system must be upgraded to the same version of kubernetes as is packaged in the new platform release, to ensure this is the case. This will be enforced through semantic checking in the platform upgrade APIs.

The platform upgrade excludes the upgrade of applications. Applications will need to be compatible with the new version of the platform/kubernetes. Any upgrade of hosted applications is independent of the platform upgrade.

Simplex Platform Upgrades

At a high level the simplex upgrade process involves the following steps.

  • Taking a backup of the platform data.

  • Installing the new StarlingX software.

  • Restoring and migrating the platform data.

Simplex Upgrade Process

  1. Import release N+1 load

    # system load-import <bootimage.iso> <bootimage.sig>
    
    # system load-list
    +----+----------+------------------+
    | id | state    | software_version |
    +----+----------+------------------+
    | 1  | active   | 19.12            |
    | 2  | imported | 20.06            |
    +----+----------+------------------+
    
  2. Start the upgrade

    # system upgrade-start
    

    This performs semantic checks and the health checks as per the ‘health-query-upgrade’ command.

    This will make a copy of the system platform data similar to a platform backup. The upgrade data will be placed under /opt/backups.

    Any changes made after this point will be lost.

  3. Copy the upgrade data

    During the upgrade process the rootfs will be wiped, and the upgrade data deleted. The upgrade data must be copied from the system to an alternate safe location (such as a USB drive or remote server).
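
    For example (the destination and the exact file name under /opt/backups are illustrative only):

    # scp /opt/backups/upgrade_data_*.tgz sysadmin@<remote-server>:/home/sysadmin/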

  4. Lock and upgrade controller-0

    # system host-upgrade controller-0
    

    This will wipe the rootfs and reboot the host.

  5. Install the new release of StarlingX

    Install the new release of StarlingX software via network or USB.

  6. Restore the upgrade data

    # ansible-playbook /usr/share/ansible/stx-ansible/playbooks/upgrade.yml
    

    The upgrade playbook will migrate the upgrade data to the current release and restore it to the system.

    This playbook requires the following parameters:

    • ansible_become_pass

    • admin_password

    • upgrade_data_file
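
    An illustrative invocation (the upgrade data path is a placeholder; the values correspond to the parameters listed above):

    # ansible-playbook /usr/share/ansible/stx-ansible/playbooks/upgrade.yml \
        -e "ansible_become_pass=<sysadmin password>" \
        -e "admin_password=<admin password>" \
        -e "upgrade_data_file=/home/sysadmin/<upgrade data file>.tgz"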

  7. Unlock controller-0

    # system host-unlock controller-0
    
  8. Activate the upgrade

    # system upgrade-activate
    

    Perform any additional configuration which may be required after the host is unlocked.

  9. Complete the upgrade

    # system upgrade-complete
    

    Remove data from the previous release.

Implementation

Assignee(s)

Primary assignee:

Other contributors:

Repos Impacted

  • config

  • update

  • integ

  • metal

  • stx-puppet

  • ansible-playbooks

Work Items

Please refer to the Story [2] for the complete list of tasks.

The following are prerequisites prior to the upgrade:

  • update the kubernetes configuration to the features configured on N+1. This will be enabled by a software-delivered increment that provides the required configuration baseline.

    • update kubernetes control_plane_address

    • kubernetes encryption at rest

    Updating the kubernetes version is covered by [1] and is performed independently.

  • This is enforced by the N+1 upgrade load metadata, which specifies the supported from-load.

  • The etcd directory is unversioned so that it can be referenced by the N and N+1 kubernetes control plane

The following steps in the upgrade require changes:

  • load-import: the metadata handling is extended to support multiple from-versions.

  • health-query-upgrade: health checks are added to ensure the kubernetes version and configuration are at the correct baseline for upgrade.

upgrade-start

  • upgrade-start-pkg-extract: update to reference dnf rather than the superseded repoquery tool

  • migrate puppet hiera data

  • Export armada, helm, kubernetes configuration to N+1

  • export the databases for N+1

host-upgrade

  • create /etc/platform/.upgrade_controller_1 so that controller-1 (via RPC) can determine that a controller upgrade is required

host-unlock

  • Create join command from the N side for the N+1 side

  • run the upgrades playbook for docker. This will push the required docker images.

host-swact

  • update to back up /opt/etcd/from_version and restore it to /opt/etcd/to_version for the target version on host-swact. This is performed at a time when data loss can be avoided. During an upgrade, after host-swact and before etcd has started on controller-1, the etcd data is copied from controller-0. Normally an etcdctl snapshot is required when data is still dynamically changing; however, as service management manages etcd in active-standby and the copy occurs as part of etcd startup, it is possible to use a direct copy.
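
    A minimal sketch of the copy, assuming the /opt/etcd/<version> directory convention described above (the platform performs this internally during swact, before etcd is started on the N+1 side; it is not run by the administrator):

    # cp -a /opt/etcd/<from_version> /opt/etcd/<to_version>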

Ansible:

  • upgrade playbook for docker: push_k8s_images.yml is updated to handle the platform upgrade case.

Integ:

  • Update registry-token-server to continue to support GET for token. This is performed as part of Story 2006145, Task 38763: https://review.opendev.org/#/c/707283/

  • Add semantic checks to existing APIs

    • application-apply/remove/etc… - prevent when platform upgrade in progress

    • helm-override-update/etc… - prevent when platform upgrade in progress

Miscellaneous:

  • Update metadata for upgrade versions

  • Remove openstack service and database references in the upgrade code

  • Update supported from version checks

Dependencies

None

Testing

Upgrades must be tested in the following StarlingX configurations:

  • AIO-DX

  • Standard with controller storage

  • Standard with dedicated storage

  • AIO-SX

The testing can be performed on hardware or virtual environments.

Documentation Impact

New end user documentation will be required to describe how platform upgrades should be done.

The config API reference will also need updates.

References

History

Revisions

  +--------------+-------------+
  | Release Name | Description |
  +--------------+-------------+
  | stx-4.0      | Introduced  |
  +--------------+-------------+