StarlingX 11.0 Release Notes¶

About this task

StarlingX is a fully integrated edge cloud software stack that provides everything needed to deploy an edge cloud on one, two, or up to 100 servers.

This section describes the new capabilities, Known Limitations and Procedural Changes, Defects fixed and deprecated information in StarlingX 11.0.

ISO image ¶

The pre-built ISO (Debian) for StarlingX 11.0 is located at the StarlingX mirror repo:

https://mirror.starlingx.windriver.com/mirror/starlingx/release/11.0.0/debian/bullseye/amd64/monolithic/outputs/iso/

Source Code for StarlingX 11.0 ¶

The source code for StarlingX 11.0 is available on the r/stx.11.0 branch in the StarlingX repositories.

Deployment ¶

To deploy StarlingX 11.0, see Consuming StarlingX.

For detailed installation instructions, see StarlingX 11.0 Installation Guides.

New Features / Enhancements / Updates ¶

The sections below provide a detailed list of new features and links to the associated user guides (if applicable).

Platform Component Upversion¶

The following platform component versions have been updated in StarlingX Release Stx 11.0

kernel version 6.12.40

Supported Kubernetes versions in StarlingX 11.0:

1.29.2
1.30.6
1.31.5
1.32.2
nginx-ingress-controller
- ingress-nginx 4.13.3
cert-manager 1.17.2
platform-integ-apps 3.11.0
- ceph-csi-rbd-3.13.1
- ceph-csi-cephfs-3.13.1
- ceph-pools-audit-1.0.1
Note

The Ceph pools audit chart is now disabled by default. It can be enabled through user-overrides based on user preference, if required.
rook-ceph
- rook-ceph-1.16.6
- rook-ceph-cluster-1.16.6
- rook-ceph-provisioner-2.1.0
- rook-ceph-floating-monitor-2.1.0
oidc-auth-apps 2.42.0
- dex-0.23.0
- secret-observer-0.1.8
- oidc-client-0.1.24
Helm chart metrics-server: 3.12.2 (deploys Metrics Server 0.7.2)
kubevirt-app 1.5.0
node-feature-discovery 0.17.3
sriov-fec-operator 2.11.1
node-interface-metrics-exporter 0.1.4
security-profiles-operator 0.8.7
dell-storage
- csi-powerflex 2.13.0
- csi-powermax 2.13.0
- csi-powerscale 2.13.0
- csi-powerstore 2.13.0
- csi-unity 2.13.0
- csm-observability 1.11.0
- csm-replication 1.11.0
- csm-resiliency 1.12.0
oran-o2 2.2.1
snmp 1.0.5
auditd 1.0.3
portieris 0.13.28

Warning

Kubernetes upgrade fails if Portieris is applied.
intel-device-plugins-operator
- intel-device-plugins-operator-0.32.5
- intel-device-plugins-qat-0.32.1
- intel-device-plugins-gpu-0.32.1
- intel-device-plugins-dsa-0.32.1
- secret-observer-0.1-1
kubernetes-power-manager 2.5.1

Note

Intel has stopped support for the kubernetes-power-manager application. This is still being supported by StarlingX and will be removed in a future release. For more information, see Configurable Power Manager.

cpu_busy_cycles metric is deprecated and must be replaced with cpu_c0_state_residency_percent for continued usage (if the metrics are customized via helm overrides).
power-metrics
- cadvisor 0.52.1
- telegraf 1.34.4
app-istio
- Istio 1.26.2
- Kiali 2.11.0
FluxCD helm-controller 1.2.0
FluxCD source-controller 1.5.0
FluxCD notification-controller 1.5.0
FluxCD kustomize-controller 1.5.1
Helm 3.17.1 for Kubernetes 1.29-1.32
volume-snapshot-controller
- snapshot-controller 6.1.0 for K8s 1.29.2
- snapshot-controller 6.3.3 for K8s 1.30.6
- snapshot-controller 8.0.0 for K8s 1.31.5 - 1.32.2
- snapshot-controller 8.1.0 for K8s 1.33.0
ptp-notification 2.0.75
app-netapp-storage (NetApp Trident CSI) 25.02.1
Mellanox (OFED) ConnectX 24.10-2.1.8
Mellanox ConnectX-6 DX firmware 22.43.2566
ice: 2.3.10
- Intel E810 - Required NVM/firmware: 4.80
- Intel E825 - Required NVM/firmware: 4.02
- Intel E830 - Required NVM/firmware: 1.11
i40e: 2.28.9 / Required NVM/firmware: 9.20

See: Application Reference

OpenBao is not supported¶

Warning

OpenBao is not supported in StarlingX Stx 11.0. Do not upload/apply this application on a production system.

Secure Pod-to-Pod Communication of Inter-Host Network Traffic¶

To strengthen security across the StarlingX, new measures have been implemented to protect selected pod-to-pod network traffic from both passive and active network attackers including those with access to the cluster host network.

On StarlingX, inter-host pod-to-pod traffic for a service can be configured to be protected by IPsec in tunnel mode over cluster host network. The configurations are defined as IPsec policies and managed by the ipsec-policy-operator Kubernetes system application.

See:

Threat Mitigation¶

Passive attackers: Defend against traffic snooping and unauthorized data observation
Active attackers: Blocked from attempting unauthorized connections to StarlingX cluster hosts

Secure Pod-to-Pod Communication¶

The StarlingX now supports encryption of Calico-based inter-hosts networking using IPsec, ensuring secure Pod-to-Pod traffic across the cluster-host network.

Applies to application pod-to-pod traffic on the cluster-host network
Applications, and applications’ pod-to-pod traffic can selectively be protected
Excludes SR-IOV VF interface traffic
Configuring IPSec policies on pod-to-pod traffic may degrade the CPU performance. Ensure that adequate resources are available to support sustained and peak inter-node traffic

See:

Install IPsec Policy Operator System Application¶

The ipsec-policy-operator system application is managed by the system application framework and will be automatically uploaded once the system is ready. Subsequently, the application can be installed by applying its manifest.

See:

Platform Networks Address Reduction for AIO-SX¶

To reduce the number of IP addresses required for Distributed Cloud AIO-SX Subcloud deployments, platform networks are updated to allocate only a single IP address per subcloud, removing the need for additional unit-specific addresses that are no longer required.

However, the platform network IP address must be assigned from a shared subnet, allowing multiple subclouds to use the same network address range. This enables more efficient IP management across large-scale deployments. The OAM network serves as a reference model, as it already supports the necessary capabilities and expected behavior for this configuration.

See:

Intermediate CA Support for Kubernetes Root CA¶

StarlingX now supports the use of server certificates signed by an Intermediate Certificate Authority (CA) for the external kube-apiserver endpoint. This enhancement ensures that external access to the Kubernetes API can be validated under the same root of trust as other platform certificates, improving consistency and security across the system.

Intermediate CA Support for External Connections to kube-apiserver¶

External connections to kube-apiserver are now routed through HAProxy, which listens on port 6443. HAProxy uses the REST API / GUI certificate issued by system-local-ca, supporting Intermediate CAs, to perform SSL termination with the external client. It then initiates a new SSL connection to kube-apiserver, now operating on port 16443 behind the firewall, on behalf of the client. External clients must recognize and trust the public certificate of system-local-ca’s Root CA.

See: Kubernetes Certificates.

Unified PTP Notification Overall Sync State¶

The overall sync state notification (sync-state) describes the health of the timing chain on the local system. A locked state is reported when the system has reference to an external time source (GNSS or PTP) and the system clock is synchronized to that time source.

See: PTP Notification Status Conditions

New Default/Static Platform API/CLI/GUI Access-Control Roles for Configurator and Operator¶

In StarlingX, 5 different keystone roles are supported: admin, reader, configurator, operator and member.

In StarlingX Release 11.0, the following new keystone roles are introduced:

configurator
operator

See: Keystone Account Roles

Multi-Node Upgrades¶

In StarlingX Release 11.0, the restriction on K8s multi-node orchestrated upgrades has been removed. You can now perform upgrades across multiple nodes in a single orchestration strategy.

Example: Upgrading from v1.29.2 to v1.32.2

See: About Kubernetes Upgrade Cloud Orchestration

PTP Netlink API Integration¶

The following new interface parameters have been added in StarlingX Release 11.0:

ts2phc.pin_index = 1
ts2phc.channel = 1

See: Instance Specific Considerations

Docker Size updates¶

In StarlingX Release 11.0 the default Docker filesystem size is 30GB. Resize the Docker filesystem on all controllers to a minimum 50GB or more prior to upgrading the system using the following command:

system host-fs-modify <controller-name> docker=<GB>

A new deploy precheck script is added to ensure the docker filesystem size is not less than 50GB.

VIM Rollback Orchestration¶

StarlingX Release 11.0 introduces expanded rollback capabilities to improve system recovery during software deployments:

Manual Rollback is supported across all configurations, including AIO-SX, AIO-DX, Standard, and Standard with dedicated storage.

VIM Orchestrated Rollback is supported on duplex configurations (AIO-DX, AIO-DX+, Standard, and Standard with dedicated storage) for the following scenarios:

Rollback of Major Release software deployments
Rollback of Patch Release software deployments
Rollback of Patched Major Release deployments
Recovery from aborted or failed deployments

These enhancements aim to streamline recovery workflows and reduce downtime across a broader range of deployment scenarios.

See:

Upgrade / Rollback Process Optimization¶

To accelerate recovery from failed operations during software updates and upgrades, a new snapshot-based restore capability is introduced in StarlingX Release 11.0. Unlike traditional backup and restore, this feature leverages OSTree deployment management and LVM volume snapshots to revert the system to a previously saved state without requiring a full reinstall. Snapshots will be created for select LVM volumes, excluding directories such as /opt/backup, /var/log, and /scratch, as outlined in the “Filesystem Summary” below. This capability is currently limited to Simplex systems (AIO-SX).

FileSystem Summary¶
LVM Name	Mount Path	DRBD	Versioned**	Snapshot
root-lv	/sysroot		N	N*
var-lv	/var		N	Y
log-lv	/var/log		N	N
backup-lv	/var/rootdirs/opt/backups		N	N
ceph-mon-lv	/var/lib/ceph/mon		N	N
docker-lv	/var/lib/docker		N	Y
kubelet-lv	/var/lib/kubelet		N	Y
pgsql-lv	/var/lib/postgresql	drbd0	Y	Y
rabbit-lv	/var/lib/rabbitmq	drbd1	Y	Y
dockerdistribution-lv	/var/lib/docker-distribution	drbd8	N	N
platform-lv	/var/rootdirs/opt/platform	drbd2	Y	Y
etcd-lv	/var/rootdirs/opt/etcd	drbd7	Y	Y
extension-lv	/var/rootdirs/opt/extension	drbd5	N	N
dc-vault-lv	/var/rootdirs/opt/dc-vault	drbd6	Y	N
scratch-lv	/var/rootdirs/scratch		N	N

Managed by OSTree

** Versioned subpaths

See:

Platform Real Time Kernel Robustness¶

Stalld can be configured to use the queue_track backend, which is based on eBPF. Stalld protects lower priority tasks from starvation.

Unlike other backends, queue_track reduces CPU usage and more accurately identifies which tasks can be excecuted even if they are currently blocked waiting for a lock.

See: Configure stall daemon.

Enable CONFIG_GENEVE Kernel Configuration¶

StarlingX Release 11.0 supports geneve.ko kernel module, controlled by the CONFIG_GENEVE kernel config option.

Cloud User Management GUI/CLI/RESTAPI Enhancements; Deletion Restriction¶

In StarlingX Release 11.0, existing Local LDAP users in the sudo group do not need to be migrated to the sys_admin group.

Administrators may retain their existing configuration if required. However, to better align with the platform’s security and access control standards, it is recommended to assign restricted sudo privileges through the sys_admin group.

Administrators may optionally update their configurations by transitioning Local LDAP users from the sudo group to the sys_admin group. This can be done using ONLY the following method:

via pam_group & /etc/security/group.conf to map users into additional groups

See: Local LDAP Linux User Accounts.

In-tree and Out-of-tree drivers¶

In StarlingX Release 11.0 only the out-of-tree versions of the Intel ice, i40e, and iavf drivers are supported. Switching between in-tree and out-of-tree driver versions are not supported.

See:

Intel Driver Versions

CaaS Traffic Bandwidth Configuration¶

Previously, the max_tx_rate parameter was used to set the maximum transmission rate for a VF interface, with the short form -r. With the introduction of the max_rx_rate parameter that is used to configure the maximum receiving rate, both max_tx_rate and max_rx_rate can now be applied to define bandwidth limits for platform interfaces. To align with naming conventions:

-t short form for max_tx_rate parameter allows the configuration of the maximum transmission rate for both VF and platform interfaces.
r short form for max_rx_rate parameter is used to set the maximum receiving rate for platform interfaces.

See:

Configure Platform Network Bandwidth using the CLI.

Rook Ceph Updates and Enhancements¶

Rook Ceph is an orchestrator that provides a containerized solution for Ceph Storage with a specialized Kubernetes Operator to automate the management of the cluster. It is an alternative solution for the bare-metal Ceph storage. See https://rook.io/docs/rook/latest-release/Getting-Started/intro/ for more details.

ECblock pools are renamed: Both data and metadata pools for ECblock on Rook Ceph changed names to comply with the new standards for upstream Rook Ceph.

Data pool was renamed from ec-data-pool to kube-ecblock
Metadata pool was renamed from ec-metadata-pool to kube-ecblock-metadata

Ceph version upgrade

Ceph version is upgraded from 18.2.2 to 18.2.5, with minimal impact on the upgrade.

Rook Ceph OSDs Management

To add, remove or replace OSDs in a Rook Container-based Ceph, see Remove a Host From a Rook Ceph Cluster

Note

Host-based Ceph is deprecated in StarlingX Release 11.0.

For any new StarlingX deployments Rook Ceph is mandatory in order to prevent any service disruptions during migration procedures.

User Management GUI/CLI Enhancements¶

For critical operations performed via the StarlingX CLI or GUI such as delete actions or operations that may impact services the system will display a warning indicating that the operation is critical, irreversible, or may affect service availability. Also the system will prompt the user to confirm before proceeding with the execution of the operation.

A user confirmation request can optionally be used to safeguard critical operations performed via the CLI. When the user CLI confirmation request is enabled, CLI users are prompted to explicitly confirm a potentially critical or destructive CLI command, before proceeding with the execution of the CLI command.

See:

Optimized Platform Processing and Memory Usage- Ph1¶

StarlingX Release 11.0 requires approximately 1 GB less memory, enabling more efficient deployment in resource-constrained environments.

This feature is designed to optimize platform resource utilization, specifically targeting processing and memory efficiency. This enables greater flexibility for StarlingX deployments in use cases with tighter footprint constraints.

Kubernetes Upgrade Procedure Optimization - Multi-Node¶

This feature enhances Kubernetes version upgrades across all StarlingX configurations including AIO-DX, AIO-DX with worker nodes, standard configurations with controller storage, and standard configurations with dedicated storage extending beyond AIO-SX.

The following enhancements are introduced:

Pre-caching of container images for all relevant versions during the upgrade’s preliminary phase.
The upgrade system now supports multi-node multi-K8s-version K8s upgrade (both manual, and orchestrated), ie.:
- it supports multi-node upgrades of multiple Kubernetes versions in a single manual upgrade
- it supports multi-node upgrades of multiple Kubernetes versions in a single orchestration

Previously, for multi-node environments, the Kubernetes upgrade process had to be repeated end-to-end for each version in sequence.

Now, the upgrade system checks for kubelet version skew, allowing kubelet components to run up to three minor versions behind the control plane. This enhancement enables multi-version upgrades in a single cycle, eliminating the need to upgrade kubelet through each intermediate version. As a result, the overall number of upgrade steps is significantly reduced.

See:

Hardware Updates ¶

See:

Kubernetes Verified Commercial Hardware

Bug status ¶

Fixed bugs¶

This release provides fixes for a number of defects. Refer to the StarlingX bug database to review the R11.0 Fixed Bugs.

Known Limitations and Procedural Changes ¶

The following are known limitations you may encounter with your StarlingX 11.0 and earlier releases. Workarounds are suggested where applicable.

Note

These limitations are considered temporary and will likely be resolved in a future release.

Stx 11.0 Limitations¶

Security Limitations¶

RSA required to be the platform issuer private key ¶

The system-local-ca issuer needs to use RSA type certificate/key. The usage of other types of private keys is currently not supported during bootstrap or with the Update system-local-ca or Migrate Platform Certificates to use Cert Manager procedures. However, if the platform certificates were migrated to cert-manager in previous versions (StarlingX Releases 9.0 and earlier), it is possible that a non RSA private key was used. In this case, the procedure needs to be rerun providing an RSA certificate/private key pair before upgrading to StarlingX Release 10.0.

Procedural Changes: N/A.

Multiple trusted CA certificates with same Distinguished Name are not supported ¶

The presence of multiple trusted CA (ssl_ca) certificates with the same Distinguished Name (DN) is invalid. An attempt to install a new trusted CA (ssl_ca) certificate with the same DN as another already installed as trusted will not succeed and a warning message will be returned.

Procedural Changes: N/A.

Kubernetes Root CA Certificates ¶

Kubernetes does not properly support k8s_root_ca_cert and k8s_root_ca_key being an Intermediate CA.

Procedural Changes: Accept internally generated k8s_root_ca_cert/key or customize only with a Root CA certificate and key.

For external access to kube-apiserver, the proxy (HAproxy) uses the Rest API / GUI certificate, which supports Intermediate CAs. The issuer (system-local-ca) can be customized at bootstrap. See Ansible Bootstrap Configurations for more information.

External Authentication to kube-apiserver Using Client Certificates ¶

SSL termination for external connections to kube-apiserver is now handled by HAProxy, which establishes a new connection to the API server on behalf of the external client. As a result, client certificate authentication is now restricted to the admin user (kubernetes-admin). Token-based authentication remains fully supported and unchanged.

Procedural Changes: N/A.

Password Expiry does not work on LDAP user login ¶

On Debian, the warning message is not being displayed for Active Directory users, when a user logs in and the password is nearing expiry. Similarly, on login when a user’s password has already expired, the password change prompt is not being displayed.

Procedural Changes: It is recommended that users rely on Directory administration tools for “Windows Active Directory” servers to handle password updates, reminders and expiration. It is also recommended that passwords should be updated every 3 months.

Note

The expired password can be reset via Active Directory by IT administrators.

Upgrade activation: cert-manager does not start issuing certificates after upversion ¶

During upgrade activation and upversioning, cert-manager usually takes less than a minute to be available and start issuing certificates. Occassionally, cert-manager can take more time than expected. This behavior is associated with an Open source issue. For more details see https://github.com/cert-manager/cert-manager/issues/7138#issuecomment-2422983418.

Since the cert-manager application is required, the upgrade activation will fail if the app takes too long to be available after the upversion. The following log will be displayed in /var/log/software.log

Error from server (NotFound): secrets "stx-test-cm" not found
certificate.cert-manager.io "stx-test-cm" deleted
software-controller-daemon: software_controller.py(837): INFO: 15 received
from deploy-activate with deploy-state activate-failed
software-controller-daemon: software_controller.py(870): INFO: Received
deploy state changed to DEPLOY_STATES.ACTIVATE_FAILED, agent deploy-activate

Procedural Changes: Cert-manager should recover by itself after a few minutes. If required, the following certificate used for test purposes in the upgrade activation can be created manually to ensure cert-manager is ready before reattempting the upgrade.

cat <<eof> cm_test_cert.yml
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  creationTimestamp: null
  name: system-local-ca
spec:
  ca:
    secretName: system-local-ca
status: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  creationTimestamp: null
  name: stx-test-cm
  namespace: cert-manager
spec:
  commonName: stx-test-cm
  issuerRef:
    kind: ClusterIssuer
    name: system-local-ca
  secretName: stx-test-cm
status: {}
eof

$ kubectl apply -f cm_test_cert.yml

$ rm cm_test_cert.yml

$ kubectl wait certificate -n cert-manager stx-test-cm --for=condition=Ready --timeout 20m

# Verify that the TLS secret associated with the cert was created, using the following:

$ kubectl get secret -n cert-manager stx-test-cm

cert-manager cm-acme-http-solver pod fails ¶

On a multinode setup, when you deploy an acme issuer to issue a certificate, the cm-acme-http-solver pod might fail and stays in “ImagePullBackOff” state due to the following defect https://github.com/cert-manager/cert-manager/issues/5959.

Procedural Changes:

If you are using the namespace “test”, create a docker-registry secret “testkey” with local registry credentials in the “test” namespace.

~(keystone_admin)]$ kubectl create secret docker-registry testkey --docker-server=registry.local:9001 --docker-username=admin --docker-password=Password*1234 -n test

Use the secret “testkey” in the issuer spec as follows:

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
 name: stepca-issuer
 namespace: test
spec:
 acme:
   server: https://test.com:8080/acme/acme/directory
   skipTLSVerify: true
   email: test@test.com
   privateKeySecretRef:
     name: stepca-issuer
   solvers:
   - http01:
       ingress:
         podTemplate:
           spec:
             imagePullSecrets:
             - name: testkey
         class:  nginx

Vault application is not supported during bootstrap ¶

The Vault application cannot be configured during bootstrap.

Procedural Changes:

The application must be configured after the platform nodes are unlocked / enabled / available, a storage backend is configured, and platform-integ-apps is applied. If Vault is to be run in HA configuration (3 vault server pods) then at least three controller / worker nodes must be unlocked / enabled / available.

Vault application support for running on application cores ¶

By default the Vault application’s pods will run on platform cores. When changing the core selection from platform cores to application cores the following additional procedure is required for the vault application.

Procedural Changes:

“If static kube-cpu-mgr-policy is selected and when overriding the label app.starlingx.io/component for Vault namespace or pods, there are two requirements:

The Vault server pods need to be restarted as directed by Hashicorp Vault documentation. Restart each of the standby server pods in turn, then restart the active server pod.
Ensure that sufficient hosts with worker function are available to run the Vault server pods on application cores.

See: Kubernetes CPU Manager Policies.

Restart the Vault Server pods¶

The Vault server pods do not restart automatically.

Procedural Changes: If the pods are to be re-labelled to switch execution from platform to application cores, or vice-versa, then the pods need to be restarted.

Under kubernetes the pods are restarted using the kubectl delete pod command. See, Hashicorp Vault documentation for the recommended procedure for restarting server pods in HA configuration, https://support.hashicorp.com/hc/en-us/articles/23744227055635-How-to-safely-restart-a-Vault-cluster-running-on-Kubernetes.

Ensure that sufficient hosts are available to run the server pods on application cores¶

The standard cluster with less than 3 worker nodes does not support Vault HA on the application cores. In this configuration (less than three cluster hosts with worker function):

Procedural Changes:

When setting label app.starlingx.io/component=application with the Vault app already applied in HA configuration (3 vault server pods), ensure that there are 3 nodes with worker function to support the HA configuration.

When applying Vault for the first time and with app.starlingx.io/component set to “application”: ensure that the server replicas is also set to 1 for non-HA configuration. The replicas for Vault server are overriden both for the Vault Helm chart and the Vault manager Helm chart:

cat <<EOF > vault_overrides.yaml
server:
  extraLabels:
    app.starlingx.io/component: application
  ha:
    replicas: 1
injector:
  extraLabels:
    app.starlingx.io/component: application
EOF

cat <<EOF > vault-manager_overrides.yaml
manager:
  extraLabels:
    app.starlingx.io/component: application
server:
  ha:
    replicas: 1
EOF

$ system helm-override-update vault vault vault --values vault_overrides.yaml

$ system helm-override-update vault vault-manager vault --values vault-manager_overrides.yaml

Kubernetes upgrade fails if Portieris is applied ¶

Kubernetes upgrade fails if Portieris is applied prior to the upgrade.

Procedural Changes: Remove the Portieris application prior to the kubernetes upgrade. Perform the kubernetes upgrade and apply the Portieris application.

Portieris Helm override ‘caCert’ is renamed and moved to Portieris Helm chart ¶

The ‘caCert’ Helm override of portieris-certs Helm chart is moved to the Portieris Helm chart as ‘TrustedCACert’.

Procedural Changes: Before upgrading from StarlingX 10.0 to 11.0, if ‘caCert’ Helm override is applied to the portieris-certs Helm chart to trust a custom CA certificate, apply the ‘TrustedCACert’ Helm override to ‘portieris’ Helm chart to trust the certificate. See, Install Portieris for information on TrustedCACert Helm override.

Authorization based on Local LDAP Groups is not supported for Harbor ¶

When using Local LDAP for authentication of the Harbor system application, you cannot use Local LDAP Groups for authorization; you can only use individual Local LDAP users for authorization.

Procedural Changes: Use only individual Local LDAP users for specifying authorization.

Harbor cannot be deployed during bootstrap ¶

The Harbor application cannot be deployed during bootstrap due to the bootstrap deployment dependencies such as early availability of storage class.

Procedural Changes: N/A.

Windows Active Directory ¶

Limitation: The Kubernetes API does not support uppercase IPv6 addresses.

Procedural Changes: The issuer_url IPv6 address must be specified as lowercase.
Limitation: The refresh token does not work.

Procedural Changes: If the token expires, manually replace the ID token. For more information, see, Configure Kubernetes Client Access.
Limitation: TLS error logs are reported in the oidc-dex container on subclouds. These logs should not have any system impact.

Procedural Changes: NA

Security Audit Logging for K8s API ¶

A custom policy file can only be created at bootstrap in apiserver_extra_volumes. If a custom policy file was configured at bootstrap, then after bootstrap the user has the option to configure the parameter audit-policy-file to either this custom policy file (/etc/kubernetes/my-audit-policy-file.yml) or the default policy file /etc/kubernetes/default-audit-policy.yaml. If no custom policy file was configured at bootstrap, then the user can only configure the parameter audit-policy-file to the default policy file.

Only the parameter audit-policy-file is configurable after bootstrap, so the other parameters (audit-log-path, audit-log-maxsize, audit-log-maxage and audit-log-maxbackup) cannot be changed at runtime.

Procedural Changes: NA

See: Kubernetes Operator Command Logging.

Networking Limitations¶

Controller-0/1 PXEboot Network Communication Failure(200.003) Alarm Raised After Upgrade ¶

Alarm triggered: Controller-0/1 PXE boot network communication failure (Error Alarm 200.003) following system upgrade.

Procedural Changes:

Identify the PXEboot file.
```
grep -l "net:pxeboot" "/etc/network/interfaces.d/"/* 2>/dev/null
/etc/network/interfaces.d//ifcfg-enp0s8:9
```
If the label differs from ‘:2’ (e.g., displays as ifcfg-enp0s8:9), proceed with the following step

Copy the file in the same directory.

cp /etc/network/interfaces.d//ifcfg-enp0s8:9 /etc/network/interfaces.d//ifcfg-enp0s8:2

Restart mtcClient.
```
systemctl restart mtcClient.service
```

Wait up to one minute for the alarm to clear. Repeat this process for all nodes.

Add / delete operations on pods results in errors ¶

Under some circumstances, add / delete operations on pods results in error getting ClusterInformation: connection is unauthorized: Unauthorized and also results in pods staying in ContainerCreating/Terminating state. This error may also prevent users from locking a host.

Procedural Changes: If this error occurs run the following kubectl describe pod -n <namespace> <pod name> command. The following message is displayed:

error getting ClusterInformation: connection is unauthorized: Unauthorized

Limitation: There is also a known issue with the Calico CNI that may occur in rare occasions if the Calico token required for communication with the kube-apiserver becomes out of sync due to NTP skew or issues refreshing the token.

Procedural Changes: Delete the calico-node pod (causing it to automatically restart) using the following commands:

$ kubectl get pods -n kube-system --show-labels | grep calico

$ kubectl delete pods -n kube-system -l k8s-app=calico-node

Application Pods with SRIOV Interfaces ¶

Application Pods with SR-IOV Interfaces require a restart-on-reboot: “true” label in their pod spec template.

Pods with SR-IOV interfaces may fail to start after a platform restore or Simplex upgrade and persist in the Container Creating state due to missing PCI address information in the CNI configuration.

Procedural Changes: Application pods that require|SRIOV| should add the label restart-on-reboot: “true” to their pod spec template metadata. All pods with this label will be deleted and recreated after system initialization, therefore all pods must be restartable and managed by a Kubernetes controller (i.e. DaemonSet, Deployment or StatefulSet) for auto recovery.

Pod Spec template example:

template:
    metadata:
      labels:
        tier: node
        app: sriovdp
        restart-on-reboot: "true"

PTP O-RAN Spec Compliant Timing API Notification ¶

The v1 API only supports monitoring a single ptp4l + phc2sys instance.

Procedural Changes: Ensure the system is not configured with multiple instances when using the v1 API.
The O-RAN Cloud Notification defines a /././sync API v2 endpoint intended to allow a client to subscribe to all notifications from a node. This endpoint is not supported StarlingX Release 9.0.

Procedural Changes: A specific subscription for each resource type must be created instead.
v1 / v2
- v1: Support for monitoring a single ptp4l instance per host - no other services can be queried/subscribed to.
- v2: The API conforms to O-RAN.WG6.O-Cloud Notification API-v02.01 with the following exceptions, that are not supported in StarlingX Release 9.0.
  - O-RAN SyncE Lock-Status-Extended notifications
  - O-RAN SyncE Clock Quality Change notifications
  - O-RAN Custom cluster names
  - /././sync endpoint
Procedural Changes: See the respective PTP-notification v1 and v2 document subsections for further details.

v1: https://docs.starlingx.io/api-ref/ptp-notification-armada-app/api_ptp_notifications_definition_v1.html

v2: https://docs.starlingx.io/api-ref/ptp-notification-armada-app/api_ptp_notifications_definition_v2.html

`ptp4l` error “timed out while polling for tx timestamp” reported for NICs using the Intel ice driver ¶

NICs using the Intel® ice driver may report the following error in the ptp4l logs, which results in a PTP port switching to FAULTY before re-initializing.

Note

PTP ports frequently switching to FAULTY may degrade the accuracy of the PTP timing.

ptp4l[80330.489]: timed out while polling for tx timestamp
ptp4l[80330.489]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug

Note

This is due to a limitation with the Intel® ice driver as the driver cannot guarantee the time interval to return the timestamp to the ptp4l user space process which results in the occasional timeout error message.

Procedural Changes: The Procedural Changes recommended by Intel is to increase the tx_timestamp_timeout parameter in the ptp4l config. The increased timeout value gives more time for the ice driver to provide the timestamp to the ptp4l user space process. Timeout values of 50ms and 700ms have been validated. However, the user can use a different value if it is more suitable for their system.

~(keystone_admin)]$ system ptp-instance-parameter-add <instance_name> tx_timestamp_timeout=700
~(keystone_admin)]$ system ptp-instance-apply

Note

The ptp4l timeout error log may also be caused by other underlying issues, such as NIC port instability. Therefore, it is recommended to confirm the NIC port is stable before adjusting the timeout values.

PTP is not supported on Broadcom 57504 NIC ¶

PTP is not supported on the Broadcom 57504 NIC.

Procedural Changes: None. Do not configure PTP instances on the Broadcom 57504 NIC.

synce4l CLI options are not supported ¶

The SyncE configuration using the synce4l is not supported in StarlingX Release 24.09.

The service type of synce4l in the ptp-instance-add command is not supported in StarlingX Release 24.09.

Procedural Changes: N/A.

ptp-notification application is not supported during bootstrap ¶

Deployment of ptp-notification during bootstrap time is not supported due to dependencies on the system PTP configuration which is handled post-bootstrap.

Procedural Changes: N/A.
The helm-chart-attribute-modify command is not supported for ptp-notification because the application consists of a single chart. Disabling the chart would render ptp-notification non-functional.

Procedural Changes: N/A.

The ptp-notification-demo App is Not a System-Managed Application ¶

The ptp-notification-demo app is provided for demonstration purposes only. Therefore, it is not supported on typical platform operations such as Upgrades and Backup and Restore.

Procedural Changes: NA

Silicom TimeSync (STS) card limitations ¶

Silicom and Intel based Time Sync NICs may not be deployed on the same system due to conflicting time sync services and operations.

PTP configuration for Silicom TimeSync (STS) cards is handled separately from StarlingX host PTP configuration and may result in configuration conflicts if both are used at the same time.

The sts-silicom application provides a dedicated phc2sys instance which synchronizes the local system clock to the Silicom TimeSync (STS) card. Users should ensure that phc2sys is not configured via StarlingX PTP Host Configuration when the sts-silicom application is in use.

Additionally, if StarlingX PTP Host Configuration is being used in parallel for non-STS NICs, users should ensure that all ptp4l instances do not use conflicting domainNumber values.
When the Silicom TimeSync (STS) card is configured in timing mode using the sts-silicom application, the card goes through an initialization process on application apply and server reboots. The ports will bounce up and down several times during the initialization process, causing network traffic disruption. Therefore, configuring the platform networks on the Silicom TimeSync (STS) card is not supported since it will cause platform instability.

Procedural Changes: N/A.

N3000 Image in the containerd cache ¶

The StarlingX system without an N3000 image in the containerd cache fails to configure during a reboot cycle, and results in a failed / disabled node.

The N3000 device requires a reset early in the startup sequence. The reset is done by the n3000-opae image. The image is automatically downloaded on bootstrap and is expected to be in the cache to allow the reset to succeed. If the image is not in the cache for any reason, the image cannot be downloaded as registry.local is not up yet at this point in the startup. This will result in the impacted host going through multiple reboot cycles and coming up in an enabled/degraded state. To avoid this issue:

Ensure that the docker filesystem is properly engineered to avoid the image being automatically removed by the system if flagged as unused. For instructions to resize the filesystem, see Increase Controller Filesystem Storage Allotments Using the CLI
Do not manually prune the N3000 image.

Procedural Changes: Use the procedure below.

Procedure

Lock the node.

~(keystone_admin)]$ system host-lock controller-0

Pull the (N3000) required image into the containerd cache.

~(keystone_admin)]$ crictl pull registry.local:9001/docker.io/starlingx/n3000-opae:stx.8.0-v1.0.2

Unlock the node.

~(keystone_admin)]$ system host-unlock controller-0

Deploying an App using nginx controller fails with internal error after controller.name override ¶

An Helm override of controller.name to the nginx-ingress-controller app may result in errors when creating ingress resources later on.

Example of Helm override:

Procedural Changes: NA

Distributed Cloud Limitations¶

Subcloud Restore to N-1 Release with Additional Patches ¶

If a subcloud is required to be restored to N-1 (Stx 10.0) release beyond the N-1 ISO patch (prepatched ISO) level, use the following prestage and deploy steps:

Restore the subcloud with install to the N-1 release using the following command:
```
$ dcmanager subcloud-backup restore --subcloud <subcloud> --with-install --release <YY.MM>
```
Note

The subcloud will be reinstalled with the N-1 (pre-patched) ISO
Prestage the subcloud with additional N-1 patches if applicable after running the following command:
```
restore dcmanager subcloud prestage --for-sw-deploy --release <N-1> <subcloud>
```
Use dcmanager prestage-strategy create/apply command to prestage more than one subcloud.

Apply the N-1 patches on the subcloud using the following command:

$ dcmanager sw-deploy-strategy create <subcloud> --release <N-1-highest-patch-release-available>

$ dcmanager sw-deploy-strategy apply

Use --group option to create a strategy for more than one subcloud.

Subcloud install or restore to the previous release ¶

If the System Controller is on StarlingX Release 11.0, subclouds can be deployed or restored to either StarlingX Release 10.0 or StarlingX Release 11.0.

The following operations have limited support for subclouds of the previous release:

Subcloud error reporting

The following operations are not supported for subclouds of the previous release:

Orchestrated subcloud kubernetes upgrade

Procedural Changes: N/A.

See: Subclouds Previous Major Release Management.

Subcloud Upgrade with Kubernetes Versions ¶

Before upgrading a cluster, ensure that the Kubernetes version is updated to the latest one supported by the current (older) platform version. This step is necessary because the new platform version only supports that specific Kubernetes version. Orchestrated Kubernetes upgrades are not supported for N-1 subclouds. Therefore, before upgrading the System Controller to Stx 11.0, verify that both the System Controller and all existing subclouds are running Kubernetes version v1.29.2; the latest version supported by Stx 10.0.

Procedural Changes: N/A.

Enhanced Parallel Operations for Distributed Cloud ¶

No parallel operation should be performed while the System Controller is being patched.
Only one type of parallel operation can be performed at a time. For example, subcloud prestaging or upgrade orchestration should be postponed while batch subcloud deployment is still in progress.

Examples of parallel operation:

any type of dcmanager orchestration (prestage, sw-deploy, kube-upgrade, kube-rootca-update)
concurrent dcmanager subcloud add
dcmanager subcloud-backup/subcloud-backup restore with –group option

Procedural Changes: N/A.

Container-Infrastructure Limitations¶

Kubernetes Memory Manager Policies ¶

The interaction between the kube-memory-mgr-policy=static and the Topology Manager policy “restricted” can result in pods failing to be scheduled or started even when there is sufficient memory. This occurs due to the restrictive design of the NUMA-aware memory manager, which prevents the same NUMA node from being used for both single and multi-NUMA allocations.

Procedural Changes: It is important for users to understand the implications of these memory management policies and configure their systems accordingly to avoid unexpected failures.

For detailed configuration options and examples, refer to the Kubernetes documentation at https://kubernetes.io/docs/tasks/administer-cluster/memory-manager/.

Alarm 900.024 Raised When Uploading N-1 Patch Release to the System Controller ¶

When uploading an N-1 patch release to the System Controller, alarm 900.024 (Obsolete Patch) will be triggered.

This behavior is specific to the System Controller and occurs only when uploading an N-1 patch

Procedural Changes: This warning can be safely ignored.

Kubevirt Limitations ¶

The following limitations apply to Kubevirt in StarlingX Release 24.09:

Limitation: Kubernetes does not provide CPU Manager detection.

Procedural Changes: Add cpumanager to Kubevirt:

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - LiveMigration
        - Macvtap
        - Snapshot
              - CPUManager

Check the label, using the following command:

~(keystone_admin)]$ kubectl describe node | grep cpumanager

where `cpumanager=true`

Limitation: Huge pages do not show up under cat /proc/meminfo inside a guest VM. Although, resources are being consumed on the host. For example, if a VM is using 4GB of Huge pages, the host shows the same 4GB of huge pages used. The huge page memory is exposed as normal memory to the VM.

Procedural Changes: You need to configure Huge pages inside the guest OS.

See the Installation Guides at https://docs.starlingx.io/ for more details.

Limitation: Virtual machines using Persistent Volume Claim (PVC) must have a shared ReadWriteMany (RWX) access mode to be live migrated.

Procedural Changes: Ensure PVC is created with RWX.
```
$ class=cephfs --access-mode=ReadWriteMany

$ virtctl image-upload --pvc-name=cirros-vm-disk-test-2 --pvc-size=500Mi --storage-class=cephfs --access-mode=ReadWriteMany --image-path=/home/sysadmin/Kubevirt-GA-testing/latest-manifest/kubevirt-GA-testing/cirros-0.5.1-x86_64-disk.img --uploadproxy-url=https://10.111.54.246 -insecure
```
Note
- Live migration is not allowed with a pod network binding of bridge interface type ()
- Live migration requires ports 49152, 49153 to be available in the virt-launcher pod. If these ports are explicitly specified in the masquarade interface, live migration will not function.

For live migration with SR-IOV interface:

specify networkData: in cloudinit, so when the VM moves to another node it will not loose the IP config
specify nameserver and internal FQDNs to connect to cluster metadata server otherwise cloudinit will not work
fix the MAC address otherwise when the VM moves to another node the MAC address will change and cause a problem establishing the link

Example:

cloudInitNoCloud:
         networkData: |
           ethernets:
             sriov-net1:
               addresses:
               - 128.224.248.152/23
               gateway: 128.224.248.1
               match:
                 macAddress: "02:00:00:00:00:01"
               nameservers:
                 addresses:
                 - 10.96.0.10
                 search:
                 - default.svc.cluster.local
                 - svc.cluster.local
                 - cluster.local
               set-name: sriov-link-enabled
           version: 2

Limitation: Snapshot CRDs and controllers are not present by default and needs to be installed on StarlingX.

Procedural Changes: To install snapshot CRDs and controllers on Kubernetes, see:
Additionally, create VolumeSnapshotClass for Cephfs and RBD:
```
cat <<EOF>cephfs-storageclass.yaml
—
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-cephfsplugin-snapclass
driver: cephfs.csi.ceph.com
parameters:
  clusterID: 60ee9439-6204-4b11-9b02-3f2c2f0a4344
  csi.storage.k8s.io/snapshotter-secret-name: ceph-pool-kube-cephfs-data
  csi.storage.k8s.io/snapshotter-secret-namespace: default deletionPolicy: Delete

EOF
```
```
    cat <<EOF>rbd-storageclass.yaml
    —
    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotClass
    metadata:
      name: csi-rbdplugin-snapclass
    driver: rbd.csi.ceph.com
    parameters:
      clusterID: 60ee9439-6204-4b11-9b02-3f2c2f0a4344
      csi.storage.k8s.io/snapshotter-secret-name: ceph-pool-kube-rbd
      csi.storage.k8s.io/snapshotter-secret-namespace: default deletionPolicy: Delete
    EOF

.. note::

    Get the cluster ID from : ``kubectl describe sc cephfs, rbd``
```
Limitation: Live migration is not possible when using configmap as a filesystem. Currently, virtual machine instances (VMIs) cannot be live migrated as virtiofs does not support live migration.

Procedural Changes: N/A.
Limitation: Live migration is not possible when a VM is using secret exposed as a filesystem. Currently, virtual machine instances cannot be live migrated since virtiofs does not support live migration.

Procedural Changes: N/A.
Limitation: Live migration will not work when a VM is using ServiceAccount exposed as a file system. Currently, VMIs cannot be live migrated since virtiofs does not support live migration.

Procedural Changes: N/A.

Docker Network Bridge Not Supported ¶

The Docker Network Bridge, previously created by default, is removed and no longer supported in StarlingX Release 9.0 as the default bridge IP address collides with addresses already in use.

As a result, docker can no longer be used for running containers. This impacts building docker images directly on the host.

Procedural Changes: Create a Kubernetes pod that has network access, log in to the container, and build the docker images.

Upper case characters in host names cause issues with kubernetes labelling ¶

Upper case characters in host names cause issues with kubernetes labelling.

Procedural Changes: Host names should be lower case.

Kubernetes Taint on Controllers for Standard Systems ¶

In Standard systems, a Kubernetes taint is applied to controller nodes in order to prevent application pods from being scheduled on those nodes; since controllers in Standard systems are intended ONLY for platform services. If application pods MUST run on controllers, a Kubernetes toleration of the taint can be specified in the application’s pod specifications.

Procedural Changes: Customer applications that need to run on controllers on Standard systems will need to be enabled/configured for Kubernetes toleration in order to ensure the applications continue working after an upgrade from StarlingX Release 6.0 to StarlingX future Releases. It is suggested to add the Kubernetes toleration to your application prior to upgrading to StarlingX 9.0 Release.

You can specify toleration for a pod through the pod specification (PodSpec). For example:

spec:
....
template:
....
    spec
      tolerations:
        - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
        - key: "node-role.kubernetes.io/control-plane"
        operator: "Exists"
        effect: "NoSchedule"

See: Taints and Tolerations.

Application Fails After Host Lock/Unlock ¶

In some situations, application may fail to apply after host lock/unlock due to previously evicted pods.

Procedural Changes: Use the kubectl delete command to delete the evicted pods and reapply the application.

Application Apply Failure if Host Reset ¶

If an application apply is in progress and a host is reset it will likely fail. A re-apply attempt may be required once the host recovers and the system is stable.

Procedural Changes: Once the host recovers and the system is stable, a re-apply may be required.

Platform CPU Usage Alarms ¶

Alarms may occur indicating platform cpu usage is greater than 90% if a large number of pods are configured using liveness probes that run every second.

Procedural Changes: To mitigate either reduce the frequency for the liveness probes or increase the number of platform cores.

Pods Using isolcpus ¶

The isolcpus feature currently does not support allocation of thread siblings for cpu requests (i.e. physical thread +HT sibling).

Procedural Changes: For optimal results, if hyperthreading is enabled then isolcpus should be allocated in multiples of two in order to ensure that both SMT siblings are allocated to the same container.

Procedural Changes: N/A.

Deleting image tags in registry.local may delete tags under the same name ¶

When deleting image tags in the registry.local docker registry, you should be aware that the deletion of an <image-name:tag-name> will delete all tags under the specified <image-name> that have the same ‘digest’ as the specified <image-name:tag-name>. For more information, see, Delete Image Tags in the Docker Registry.

Procedural Changes: NA

Distributed Cloud Limitations¶

Limitation for Day-2 Deployment Manager operations ¶

After completing Day-1 operations and initiating a Day-2 update for the Host resource, a system config update strategy is generated. Consequently, alarms indicating the presence of this strategy in the system are triggered.

If a new Day-2 update is executed immediately after another update, and before the previous strategy is created, it may lead to unexpected results .

Before proceeding with Day-2 operations, use the following Procedural Changes:

Procedural Changes: Wait for any alarms related to the system config update strategy to clear, which indicates the completion of the strategy. Once the alarms are cleared execute a new Day-2 update using either reconfiguration or playbook re-application to apply new changes that were not applied in the previous update.

Software Management Limitations¶

Deploy does not fail after a system reboot ¶

Deploy does not fail after a system reboot.

Procedural Changes: Run the sudo software-deploy-set-failed --hostname/-h <hostname> --confirm utility to manually move the deploy and deploy host to a failed state which is caused by a failover, lost power, network outage etc. You can only run this utility with root privileges on the active controller.

The utility displays the current state and warns the user about the next steps to be taken in case the user needs to continue executing the utility. It also displays the new states and the next operation to be executed.

ISO/SIG Upload to Central Cloud Fails when Using sudo ¶

To upload a software patch or major release to the System Controller region using the --os-region-name SystemController option, the upload command must be authenticated with Keystone.

Procedural Changes: Do not use sudo with the --os-region-name SystemController option. For example, avoid using sudo software upload <software release> command.

Note

When using the -local option, you must provide the absolute path to the release files.

Note

When using software upload commands with --os-region-name SystemController to upload a software patch or major release to the System Controller region, Keystone authentication is required.

Important

Do not use sudo in combination with the --os-region-name SystemController option. For example, avoid using:

$ sudo software --os-region-name SystemController upload <software-release>

Instead, ensure the command is executed with proper authentication and without sudo.

For more information see, Upload Software Releases Using the CLI

RT Throttling Service not running after Lock/Unlock on Upgraded Subclouds ¶

During the upgrade process, the USM post-upgrade script modifies systemd presets to define which services should be automatically enabled or disabled. As part of this process, any user-enabled custom services may be set to “disabled” after the upgrade completes.

Since this change occurs post-upgrade, systemd will not automatically re-enable the affected service during subsequent lock / unlock operations. By default, USM disables custom services not explicitly listed in the systemd presets. Since service definitions can vary between releases, USM relies on these presets to determine enablement status per host during the upgrade. If a custom service is not included in the presets, it will be marked as disabled and remain inactive after lock / unlock even following a successful upgrade.

Log message during the upgrade:

controller-0 usm-initialize[3061]: info Removed
/etc/systemd/system/multi-user.target.wants/sysctl-rt-sched-apply.service

Procedural Changes: Once the upgrade to StarlingX Release 11.0 completes, run the service-enable and service-start commands for all custom / user services before issuing the first lock / unlock (or reboot).

The enable and start commands for this service are required only once prior to the initial lock / unlock operation. After this step is completed, there is no further need to manually start or enable custom services, as the USM post-upgrade script has already run during the upgrade process.

sw-manager sw-deploy-strategy apply fails ¶

sw-manager apply fails to apply the patch.

Note

The Procedural Changes is applicable only if the sw-manager sw-deploy-strategy fails with the following issues.

To show the operation is in an aborted state due to a timeout, run the following command.

~(keystone_admin)]$ sw-manager sw-deploy-strategy show

Strategy Patch Strategy:
  strategy-uuid:                          2082ab5e-a387-4b6a-be23-50ac23317725
  controller-apply-type:                  serial
  storage-apply-type:                     serial
  worker-apply-type:                      serial
  default-instance-action:                stop-start
  alarm-restrictions:                     strict
  current-phase:                          abort
  current-phase-completion:               100%
  state:                                  aborted
  apply-result:                           timed-out
  apply-reason:
  abort-result:                           success
  abort-reason:

If step 1 fails with ‘timed-out’ results, check if the timeout has occurred due to step-name ‘wait-alarms-clear’ using the command below.

To display results ‘wait for alarm’ that has timed out and run the following command.

~(keystone_admin)]$ sw-manager sw-deploy-strategy show --details

step-name:                    wait-alarms-clear
timeout:                      2400 seconds
start-date-time:              2024-03-27 19:21:15
end-date-time:                2024-03-27 20:01:16
result:                       timed-out

To list the 750.006 alarm, use the following command.

~(keystone_admin)]$ fm alarm-list

+----------+---------------------------+--------------------+----------+---------------+
| Alarm ID | Reason Text               |      Entity ID     | Severity | Time Stamp    |
+----------+---------------------------+--------------------+----------+---------------+
| 750.006  | A configuration change    | platform-integ-apps| warning  |    2024-03-27T|
|          | requires a reapply of the |                    |          |    19:21:15.  |
|          | platform-k8s_application= |                    |          |    471422     |
|          | integ-apps application.   |                    |          |               |
+----------+---------------------------+--------------------+----------+---------------+

VIM orchestrated patch strategy failed with the 900.103 alarm being triggered.

~(keystone_admin)]$ fm alarm-list

+----------+---------------------------+--------------------+----------+---------------+
| Alarm ID | Reason Text               |      Entity ID     | Severity | Time Stamp    |
+----------+---------------------------+--------------------+----------+---------------+
| 900.103  | Software patch auto-apply | orchestration=sw-  | critical | 2024-03-26T03T|
|          | failed                    |                    |          |               |
+----------+---------------------------+--------------------+----------+---------------+

Procedural Changes - Option 1

Check the system for existing alarms using the fm alarm-list command. If the existing alarms can be ignored then use the sw-manager sw-deploy-strategy create --alarm-restrictions relaxed command to ignore any alarms during patch orchestration
If the alarm was not ignored in using the command in step 1 and the issue is seen when you encounter patch apply failure, check if alarm ‘750.006’ is present on the system.

Delete the failed strategy using the following command.

~(keystone_admin)]$ sw-manager sw-deploy-strategy delete

Create a new strategy.

~(keystone_admin)]$ sw-manager sw-deploy-strategy create --alarm-restrictions relaxed

Apply the strategy.

~(keystone_admin)]$ sw-manager sw-deploy-strategy apply

Procedural Changes - Option 2

Create a new strategy (alarm-restrictions are not relaxed).
```
~(keystone_admin)]$ sw-manager sw-deploy-strategy create
```
Apply the strategy.
```
~(keystone_admin)]$ sw-manager sw-deploy-strategy apply
```
When the sw-deploy-strategy is in progress, and when at ‘wait-alarms-clear’ step (this can be found from ‘sw-manager patch strategy show –details | grep “step-name”’), check if alarm 750.006 is present, then execute the below steps.
Execute the command.
```
~(keystone_admin)]$ system application-apply platform-integ-apps
```
This will re-apply the application and clear the alarm ‘750.006’.
If the alarm still persists after step 3, manually delete the alarm using fm alarm-delete <uuid of alarm 750.006> command.

Platform Services Limitations¶

Kubernetes Pod Core Dump Handler may fail due to a missing Kubernetes token ¶

In certain cases the Kubernetes Pod Core Dump Handler may fail due to a missing Kubernetes token resulting in disabling configuration of the coredump on a per pod basis and limiting namespace access. If application coredumps are not being generated, verify if the k8s-coredump token is empty on the configuration file: /etc/k8s-coredump-conf.json using the following command:

~(keystone_admin)]$ ~$ sudo cat /etc/k8s-coredump-conf.json
{
    "k8s_coredump_token": ""

}

Procedural Changes: If the k8s-coredump token is empty in the configuration file and the kube-apiserver is verified to be responsive, users can re-execute the create-k8s-account.sh script in order to generate the appropriate token after a successful connection to kube-apiserver using the following commands:

~(keystone_admin)]$ :/home/sysadmin$ sudo chmod +x /etc/k8s-coredump/create-k8s-account.sh

~(keystone_admin)]$ :/home/sysadmin$ sudo /etc/k8s-coredump/create-k8s-account.sh

Uploaded Applications Show Incorrect Progress During Platform Upgrade ¶

The outputs of the system application-list and system application-show commands may display status messages indicating that dependencies for uploaded applications are missing even after those dependencies have been applied or updated.

Note

If the required dependencies are actually met, this does not prevent the applications from being applied.

Procedural Changes: N/A.

Restart Required for containerd to Apply Config Changes for AIO-SX ¶

On AIO-SX systems, certain container images were removed from the registry due to the image garbage collector and changes introduced during the Kubernetes upgrade. This may impact workloads that rely on specific image versions.

Procedural Changes: Increasing the Docker filesystem size will help retain the image in the containerd cache. Additionally, only for AIO-SX it is recommended to restart containerd after the Kubernetes upgrade. For more details, see “Docker Size updates”.

Limitation Using Regular Expressions in some Parameters while Configuring Stalld ¶

Stalld supports regular expressions in some parameters such as:

ignore_threads
ignore_processes

For example, Stalld can be instructed to ignore all threads that start with the keyword runner; stalld –ignore_threads=”runner.*””

Procedural Changes: In StarlingX Release 24.09.300 the above functionality is not available when using the system host-label api, therefore, the user will have to explicitly specify the threads to ignore.

system host-label-assign controller-0 starlingx.io/stalld.ignore_threads="runnerA"

BMC Password ¶

The BMC password cannot be updated.

Procedural Changes: In order to update the BMC password, de-provision the BMC, and then re-provision it again with the new password.

Configure Stalld ¶

It is recommended to configure Stalld during initial setup. If the workload is high, runtime stalld configuration may not take effect till the node is rebooted

Procedural Changes: Stalld should be configured during initial system setup.

Sub-Numa Cluster Configuration not Supported on Skylake Servers ¶

Sub-Numa cluster configuration is not supported on Skylake servers.

Procedural Changes: For servers with Skylake Gold or Platinum CPUs, Sub-NUMA clustering must be disabled in the BIOS.

Debian Bootstrap ¶

On CentOS bootstrap worked even if dns_servers were not present in the localhost.yml. This does not work for Debian bootstrap.

Procedural Changes: You need to configure the dns_servers parameter in the localhost.yml, as long as no FQDNs were used in the bootstrap overrides in the localhost.yml file for Debian bootstrap.

Installing a Debian ISO ¶

The disks and disk partitions need to be wiped before the install. Installing a Debian ISO may fail with a message that the system is in emergency mode if the disks and disk partitions are not completely wiped before the install, especially if the server was previously running a CentOS ISO.

Procedural Changes: When installing a lab for any Debian install, the disks must first be completely wiped using the following procedure before starting an install.

Use the following wipedisk commands to run before any Debian install for each disk (eg: sda, sdb, etc):

sudo wipedisk
# Show
sudo sgdisk -p /dev/sda
# Clear part table
sudo sgdisk -o /dev/sda

Note

The above commands must be run before any Debian install. The above commands must also be run if the same lab is used for CentOS installs after the lab was previously running a Debian ISO.

Metrics Server Update across Upgrades ¶

After a platform upgrade, the Metrics Server will NOT be automatically updated.

Procedural Changes: To update the Metrics Server, See: Install Metrics Server

Backup and Restore Playbook fails due to self-triggered “backup in progress”/”restore in progress” flag ¶

Backup and Restore causes the Playbook to fail due to self-triggered “backup in progress” / “restore in progress” flag.

Procedural Changes: Retry the backup after manually removing the flag /etc/platform/.backup_in_progress if it has been more than 10 minutes based on the error message:

"backup has already been started less than x minutes ago.
Wait to start a new backup or manually remove the backup flag in
/etc/platform/.backup_in_progress "

For a “restore in progress” flag, reinstall and retry the restore operation.

Optimized-Edge Limitations¶

Data Streaming Accelerator Error During a USM Upgrade ¶

During the upgrade from StarlingX Release 10.0 to 11.0, DSA init container fails and will remain in CrashLoopBack until DSA is fully upgraded to StarlingX Release 11.0.

The issue occurs because a new parameter driver_name is required to configure the workqueues in idxd driver in kernel 6.12.40. This behavior should not impact platform upgrade but DSA may not be configured until intel-device-plugins-operator is successfully upgraded.

Procedural Changes: To overcome this behavior, the new parameters can be added by applying the following Helm overrides before the upgrade.

For example, create the following override file:

$ cat << 'EOF' > dsa-override.yml
overrideConfig:
dsa.conf: |
    [
       {
        "dev":"dsaX",
        "read_buffer_limit":0,
        "groups":[
          {
            "dev":"groupX.0",
            "read_buffers_reserved":0,
            "use_read_buffer_limit":0,
            "read_buffers_allowed":8,
            "grouped_workqueues":[
              {
                "dev":"wqX.0",
                "mode":"dedicated",
                "size":16,
                "group_id":0,
                "priority":10,
                "block_on_fault":1,
                "type":"user",
                "name":"dpdk_appX0",
                "driver_name":"user",
                "threshold":15
              }
            ],
            "grouped_engines":[
               {
                "dev":"engineX.0",
                "group_id":0
              },
            ]
          },
          {
            "dev":"groupX.1",
            "read_buffers_reserved":0,
            "use_read_buffer_limit":0,
            "read_buffers_allowed":8,
            "grouped_workqueues":[
              {
                "dev":"wqX.1",
                "mode":"dedicated",
                "size":16,
                "group_id":1,
                "priority":10,
                "block_on_fault":1,
                "type":"user",
                "name":"dpdk_appX1",
                "driver_name":"user",
                "threshold":15
              }
            ],
            "grouped_engines":[
              {
                "dev":"engineX.1",
                "group_id":1
              },
            ]
          },
          {
            "dev":"groupX.2",
            "read_buffers_reserved":0,
            "use_read_buffer_limit":0,
            "read_buffers_allowed":8,
            "grouped_workqueues":[
              {
                "dev":"wqX.2",
                "mode":"dedicated",
                "size":16,
                "group_id":2,
                "priority":10,
                "block_on_fault":1,
                "type":"user",
                "name":"dpdk_appX2",
                "driver_name":"user",
                "threshold":15
              }
            ],
            "grouped_engines":[
              {
                "dev":"engineX.2",
                "group_id":2
              },
            ]
          },
          {
            "dev":"groupX.3",
            "read_buffers_reserved":0,
            "use_read_buffer_limit":0,
            "read_buffers_allowed":8,
            "grouped_workqueues":[
              {
                "dev":"wqX.3",
                "mode":"dedicated",
                "size":16,
                "group_id":3,
                "priority":10,
                "block_on_fault":1,
                "type":"user",
                "name":"dpdk_appX3",
                "driver_name":"user",
                "threshold":15
              }
            ],
            "grouped_engines":[
              {
                "dev":"engineX.3",
                "group_id":3
              },
            ]
          },
        ]
       }
     ]
EOF

Then apply the override file:

$ system helm-override-update intel-device-plugins-operator intel-device-plugins-dsa intel-device-plugins-operator --values dsa-override.yml

Apply the intel-device-plugins-operator application.

$ system application-apply intel-device-plugins-operator

Console Session Issues during Installation ¶

After bootstrap and before unlocking the controller, if the console session times out (or the user logs out), systemd does not work properly. fm, sysinv and mtcAgent do not initialize.

Procedural Changes: If the console times out or the user logs out between bootstrap and unlock of controller-0, then, to recover from this issue, you must re-install the ISO.

Power Metrics Application in Real Time Kernels ¶

When executing Power Metrics application in Real Time kernels, the overall scheduling latency may increase due to inter-core interruptions caused by the MSR (Model-specific Registers) reading.

Due to intensive workloads the kernel may not be able to handle the MSR reading interruptions resulting in stalling data collection due to not being scheduled on the affected core.

Storage Limitations¶

Limitations of the Rook Ceph Application During Upgrade from Version 1.13 on AIO-DX ¶

During the upgrade from v1.13 to v1.16 on a AIO-DX platform, mon quorum may be temporarily disrupted for a few minutes. Once the upgrade completes, all monitors are expected to come back online and quorum should re-establish successfully.

Procedural Changes: N/A.

Rook Ceph Application Limitation During Floating Monitor Removal ¶

On a AIO-DX system, removing the floating monitor using “system controllerfs-modify ceph-float –functions=”” may lead to temporary system instability, including the possibility of uncontrolled swacts.

Procedural Changes: To avoid this issue, ensure that all finalizers are removed from the floating monitor Rook Ceph chart after its deletion, using the following command:

$ kubectl patch hr rook-ceph-floating-monitor -p '{"metadata":{"finalizers":[]}}' --type=merge

Host fails to lock during an upgrade ¶

After adding multiple OSDs simultaneously configured in the Ceph cluster, some OSDs may remain in a configuring state even though the cluster is healthy and the OSD is deployed. This is an intermittent issue that only occurs on systems with Ceph storage backend configured with more than one OSD per host. This causes the system host-lock command to fail with the following error:

$ system host-lock controller-<id>
controller-<id> : Rejected: Can not lock a controller with storage devices
in 'configuring' state.

Since system host-lock on the controller fails and the OSD is still in the configuring state, the upgrade is blocked from proceeding.

Procedural Changes: Use the following steps to proceed with the upgrade.

List the OSDID in the ‘configuring’ state using the following command:
```
$ system host-stor-list <hostname>
```
Identify the OSD using the following command:
```
$ ceph osd find osd.<OSDID>
```

If the OSD is found manually update the database inventory using the stor uuid:

$ sudo -u postgres psql -U postgres -d sysinv -c "UPDATE i_istor SET state='configured' WHERE uuid='<STOR_UUID>';";

Ceph Daemon Crash and Health Warning ¶

After a Ceph daemon crash, an alarm is displayed to verify Ceph health.

Run ceph -s to display the following message:

cluster:
    id:     <id>
    health: HEALTH_WARN
            1 daemons have recently crashed

One or more Ceph daemons have crashed, and the crash has not yet been archived or acknowledged by the administrator.

Procedural Changes: To archive the crash, clear the health check warning and the alarm.

List the timestamp/uuid crash-ids for all newcrash information:
```
[sysadmin@controller-0 ~(keystone_admin)]$ ceph crash ls-new
```

Display details of a saved crash.

[sysadmin@controller-0 ~(keystone_admin)]$ ceph crash info <crash-id>

Archive the crash so it no longer appears in ceph crash ls-new output.

[sysadmin@controller-0 ~(keystone_admin)]$ ceph crash archive <crash-id>

After archiving the crash, make sure the recent crash is not displayed.
```
[sysadmin@controller-0 ~(keystone_admin)]$ ceph crash ls-new
```
If more than one crash needs to be archived run the following command.
```
[sysadmin@controller-0 ~(keystone_admin)]$ ceph crash archive-all
```

Rook Ceph Application Limitation ¶

After applying Rook Ceph application in an AIO-DX configuration the 800.001 - Storage Alarm Condition: HEALTH_WARN alarm may be triggered.

Procedural Changes: Restart the pod of the monitor associated with the slow operations detected by Ceph. Check ceph -s.

Avoid host lock/unlock during application apply ¶

Host lock and unlock operations may interfere with applications that are in the applying state.

Procedural Changes: Re-applying or removing / installing applications may be required. Application status can be checked using the system application-list command.

Perform a host lock during application apply ¶

Host lock and unlock operations may interfere with applications that are in the applying state.

Procedural Changes: Re-applying or removing / installing applications may be required. Application status can be checked using the system application-list command.

Rook-ceph Application Limitations ¶

This section documents the following known limitations you may encounter with the rook-ceph application and procedural changes that you can use to resolve the issue.

Remove all OSDs in a host

The procedure to remove OSDs will not work as expected when removing all OSDs from a host. The Ceph cluster gets stuck in HEALTH_WARN state.

Note

Use the Procedural change only if the cluster is stuck in HEALTH_WARN state after removing all OSDs on a host.

Procedural Changes:

Check the cluster health status.
Check crushmap tree.
Remove the host(s) that are empty in the command executed before
Check the cluster health status.

Use the rook-ceph apply command when a host with OSD is in offline state

The rook-ceph apply will not allocate the OSDs correctly if the host is offline.

Note

Use either of the procedural changes below only if the OSDs are not allocated in the Ceph cluster.

Procedural Changes 1:

Check if the OSD is not in crushmap tree.
Restart the rook-ceph operator pod.

Note

Wait for about 5 minutes to let the operator to try to recoever the OSDs.
Check if the OSDs have been added in crushmap tree.

Procedural Changes 2:

Check if the OSD is not in the crushmap tree OR it is in the crushmap tree but not allocated in the correct location (within a host).
Lock the host

Wait for the host to be locked.
Get the list from the OSDs inventory from the host.
Remove the OSDs from the inventory.
Reapply the rook-ceph application.

Wait for OSDs prepare pods to be recreated.
Add the OSDs in the inventory.
Reapply the rook-ceph application.

Wait for new OSD pods to be created and running.

Critical alarm 800.001 after Backup and Restore on AIO-SX Systems ¶

A Critical alarm 800.001 may be triggered after running the Restore Playbook. The alarm details are as follows:

~(keystone_admin)]$ fm alarm-list
+-------+----------------------------------------------------------------------+--------------------------------------+----------+---------------+
| Alarm | Reason Text                                                          | Entity ID                            | Severity | Time Stamp    |
| ID    |                                                                      |                                      |          |               |
+-------+----------------------------------------------------------------------+--------------------------------------+----------+---------------+
| 800.  | Storage Alarm Condition: HEALTH_ERR. Please check 'ceph -s' for more | cluster=                             | critical | 2024-08-29T06 |
| 001   | details.                                                             | 96ebcfd4-3ea5-4114-b473-7fd0b4a65616 |          | :57:59.701792 |
|       |                                                                      |                                      |          |               |
+-------+----------------------------------------------------------------------+--------------------------------------+----------+---------------+

Procedural Changes: To clear this alarm run the following commands:

Note

Applies only to AIO-SX systems.

FS_NAME=kube-cephfs
METADATA_POOL_NAME=kube-cephfs-metadata
DATA_POOL_NAME=kube-cephfs-data

# Ensure that the Ceph MDS is stopped
sudo rm -f /etc/pmon.d/ceph-mds.conf
sudo /etc/init.d/ceph stop mds

# Recover MDS state from filesystem
ceph fs new ${FS_NAME} ${METADATA_POOL_NAME} ${DATA_POOL_NAME} --force

# Try to recover from some common errors
sudo ceph fs reset ${FS_NAME} --yes-i-really-mean-it

cephfs-journal-tool --rank=${FS_NAME}:0 event recover_dentries summary
cephfs-journal-tool --rank=${FS_NAME}:0 journal reset
cephfs-table-tool ${FS_NAME}:0 reset session
cephfs-table-tool ${FS_NAME}:0 reset snap
cephfs-table-tool ${FS_NAME}:0 reset inode
sudo /etc/init.d/ceph start mds

Error installing Rook Ceph on AIO-DX with host-fs-add before controllerfs-add ¶

When you provision controller-0 manually prior to unlock, the following sequence of commands fail:

~(keystone_admin)]$ system storage-backend-add ceph-rook --confirmed
~(keystone_admin)]$ system host-fs-add controller-0 ceph=20
~(keystone_admin)]$ system controllerfs-add ceph-float=20

The following error occurs when you run the controllerfs-add command:

“Failed to create controller filesystem ceph-float: controllers have pending LVG updates, please retry again later”.

Procedural Changes: To avoid this issue, run the commands in the following sequence:

~(keystone_admin)]$ system storage-backend-add ceph-rook --confirmed
~(keystone_admin)]$ controllerfs-add ceph-float=20
~(keystone_admin)]$ system host-fs-add controller-0 ceph=20

Intermittent installation of Rook-Ceph on Distributed Cloud ¶

While installing rook-ceph, if the installation fails, this is due to ceph-mgr-provision not being provisioned correctly.

Procedural Changes: It is recommended to use the system application-remove rook-ceph --force to initiate rook-ceph installation.

Storage Nodes are not considered part of the Kubernetes cluster ¶

When running the system kube-host-upgrade-list command the output must only display controller and worker hosts that have control-plane and kubelet components. Storage nodes do not have any of those components and so are not considered a part of the Kubernetes cluster.

Procedural Changes: Do not include Storage nodes as part of the Kubernetes upgrade.

Optimization with a Large number of OSDs ¶

As Storage nodes are not optimized, you may need to optimize your Ceph configuration for balanced operation across deployments with a high number of OSDs. This results in an alarm being generated even if the installation succeeds.

800.001 - Storage Alarm Condition: HEALTH_WARN. Please check ‘ceph -s’

Procedural Changes: To optimize your storage nodes with a large number of OSDs, it is recommended to use the following commands:

~(keystone_admin)]$ ceph osd pool set kube-rbd pg_num 256
~(keystone_admin)]$ ceph osd pool set kube-rbd pgp_num 256

Storage Nodes Recovery on Power Outage ¶

Storage nodes take 10-15 minutes longer to recover in the event of a full power outage.

Procedural Changes: NA

Ceph Recovery on an AIO-DX System ¶

In certain instances Ceph may not recover on an AIO-DX system, and remains in the down state when viewed using the :command”ceph -s command; for example, if an OSD comes up after a controller reboot and a swact occurs, or other possible causes for example, hardware failure of the disk or the entire host, power outage, or switch down.

Procedural Changes: There is no specific command or procedure that solves the problem for all possible causes. Each case needs to be analyzed individually to find the root cause of the problem and the solution.

Restrictions on the Size of Persistent Volume Claims (PVCs)¶

There is a limitation on the size of Persistent Volume Claims (PVCs) that can be used for all StarlingX Releases.

Procedural Changes: It is recommended that all PVCs should be a minimum size of 1GB. For more information, see, https://bugs.launchpad.net/starlingx/+bug/1814595.

platform-integ-apps application update aborted after removing StarlingX 9.0 ¶

When StarlingX 9.0 is removed, the platform-integ-apps application is downgraded, and a message will be displayed:

ceph-csi failure:release rbd-provisioner: Failed during apply :Helm upgrade
failed: cannot patch "rbd.csi.ceph.com" with kind CSIDriver: CSIDriver.storage.k8s.io
"rbd.csi.ceph.com" is invalid: spec.fsGroupPolicy: Invalid value:
"ReadWriteOnceWithFSType": field is immutable.

Procedural Changes: To resolve this problem do the following:

Remove the Container Storage Interface (CSI) drivers using the following commands:

~(keystone_admin)]$ kubectl delete csidriver cephfs.csi.ceph.com

~(keystone_admin)]$ kubectl delete csidriver rbd.csi.ceph.com

Update the application so that the correct version is installed.

~(keystone_admin)]$ system application-update /usr/local/share/applications/helm/platform-integ-apps-22.12-72.tgz

NetApp Permission Error ¶

When installing/upgrading to Trident 20.07.1 and later, and Kubernetes version 1.17 or higher, new volumes created will not be writable if:

The storageClass does not specify parameter.fsType
The pod using the requested PVC has an fsGroup enforced as part of a
Security constraint

Procedural Changes: Specify parameter.fsType in the localhost.yml file under netapp_k8s_storageclasses parameters as below.

The following example shows a minimal configuration in localhost.yml:

ansible_become_pass: xx43U~a96DN*m.?
trident_setup_dir: /tmp/trident
netapp_k8s_storageclasses:
    - metadata:
        name: netapp-nas-backend
    provisioner: netapp.io/trident
    parameters:
        backendType: "ontap-nas"
        fsType: "nfs"

netapp_k8s_snapshotstorageclasses:
    - metadata:
        name: csi-snapclass

See: Configure an External NetApp Deployment as the Storage Backend

Restrictions on the Minimum Size of Persistent Volume Claims (PVCs)¶

There is a limitation on the size of Persistent Volume Claims (PVCs) that can be used for all StarlingX Releases.

Procedural Changes: It is recommended that all PVCs should be a minimum size of 1GB. For more information, see, https://bugs.launchpad.net/starlingx/+bug/1814595.

Failure to clean up platform-integ-apps files/Helm release ¶

The System Controller does not have Ceph configured, so the platform-integ-apps is not installed and the images are not automatically downloaded to registry.central when upgrading the platform.

The missing images on the subclouds are:

registry.central:9001/docker.io/openstackhelm/ceph-config-helper:ubuntu_focal_18.2.0-1-20231013
registry.central:9001/quay.io/cephcsi/cephcsi:v3.10.1
registry.central:9001/registry.k8s.io/sig-storage/csi-attacher:v4.4.2
registry.central:9001/registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.1
registry.central:9001/registry.k8s.io/sig-storage/csi-provisioner:v3.6.2
registry.central:9001/registry.k8s.io/sig-storage/csi-resizer:v1.9.2
registry.central:9001/registry.k8s.io/sig-storage/csi-snapshotter:v6.3.2

If the System Controller does not have Ceph configured and the subclouds have Ceph configured, then the images need to be manually uploaded to the registry.central before starting the upgrade of the subclouds.

To push the images to the registry.central, run the following commands on the System Controller:

# Change the variables according to the setup
REGISTRY_PREFIX="server:port/path"
REGISTRY_USERNAME="admin"
REGISTRY_PASSWORD="password"

sudo docker login registry.local:9001 --username ${REGISTRY_USERNAME} --password ${REGISTRY_PASSWORD}
for image in\
docker.io/openstackhelm/ceph-config-helper:ubuntu_focal_18.2.0-1-20231013 \
registry.k8s.io/sig-storage/csi-attacher:v4.4.2 \
registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.1 \
registry.k8s.io/sig-storage/csi-provisioner:v3.6.2 \
registry.k8s.io/sig-storage/csi-resizer:v1.9.2 \
registry.k8s.io/sig-storage/csi-snapshotter:v6.3.2 \
quay.io/cephcsi/cephcsi:v3.10.1

do
sudo docker pull ${REGISTRY_PREFIX}/${image}
sudo docker tag ${REGISTRY_PREFIX}/${image} registry.local:9001/${image}
sudo docker push registry.local:9001/${image}
done

Procedural Changes: In case the subcloud upgrade finishes without the correct images pushed to the registry.central, it is still possible to recover the system following the steps below.

After pushing the images to the registry.central, each subcloud must be recovered with the following steps (these commands should be run on the Subcloud):

source /etc/platform/openrc

# Remove old app manually
sudo rm -rf /opt/platform/helm/22.12/platform-integ-apps;
sudo rm -rf /opt/platform/fluxcd/22.12/platform-integ-apps;
sudo -u postgres psql postgres -d sysinv -c "DELETE from kube_app WHERE name = 'platform-integ-apps';";
sudo sm-restart service sysinv-inv && sudo sm-restart service sysinv-conductor;
sleep 15; # Wait services to restart
system application-upload /usr/local/share/applications/helm/platform-integ-apps-22.12-72.tgz;
sleep 15; # Wait upload to fail (It is expcected to fail here)
system application-delete platform-integ-apps;
system application-upload /usr/local/share/applications/helm/platform-integ-apps-22.12-72.tgz;
sleep 10; # Wait for the upload to succeed
system application-apply platform-integ-apps;

Note

The images need to be pushed to the registry.central registry before upgrading the subclouds.

Operating System Limitations¶

BPF is disabled ¶

BPF cannot be used in the PREEMPT_RT/low latency kernel, due to the inherent incompatibility between PREEMPT_RT and BPF, see, https://lwn.net/Articles/802884/.

Some packages might be affected when PREEMPT_RT and BPF are used together. This includes the following, but not limited to these packages.

libpcap
libnet
dnsmasq
qemu
nmap-ncat
libv4l
elfutils
iptables
tcpdump
iproute
gdb
valgrind
kubernetes
cni
strace
mariadb
libvirt
dpdk
libteam
libseccomp
binutils
libbpf
dhcp
lldpd
containernetworking-plugins
golang
i40e
ice

Procedural Changes: It is recommended not to use BPF with real time kernel. If required it can still be used, for example, debugging only.

Control Group parameter ¶

The control group (cgroup) parameter kmem.limit_in_bytes has been deprecated, and results in the following message in the kernel’s log buffer (dmesg) during boot-up and/or during the Ansible bootstrap procedure: “kmem.limit_in_bytes is deprecated and will be removed. Please report your use case to linux-mm@kvack.org if you depend on this functionality.” This parameter is used by a number of software packages in StarlingX, including, but not limited to, systemd, docker, containerd, libvirt etc.

Procedural Changes: NA. This is only a warning message about the future deprecation of an interface.

Subcloud Reconfig may fail due to missing inventory file ¶

The dcmanager subcloud reconfig command may fail due to a missing file /var/opt/dc/ansible/<subcloud_name>_inventory.yml.

Procedural Changes: Provide the floating OAM IP address of the subcloud using the “–bootstrap-address” argument. For example:

~(keystone_admin)]$ dcmanager subcloud reconfig --sysadmin-password <password> --deploy-config deployment-config.yaml --bootstrap-address <floating_OAM_IP_address> <subcloud_name>

Horizon GUI Limitations¶

Unable to create Kubernetes Upgrade Strategy for Subclouds using Horizon GUI ¶

When creating a Kubernetes Upgrade Strategy for a subcloud using the Horizon GUI, it fails and displays the following error:

kube upgrade pre-check: Invalid kube version(s), left: (v1.24.4), right:
(1.24.4)

Procedural Changes: Use the following steps to create the strategy:

Procedure

Create a strategy for subcloud Kubernetes upgrade using the dcmanager kube-upgrade-strategy create --to-version <version> command.
Apply the strategy using the Horizon GUI or the CLI using the command dcmanager kube-upgrade-strategy apply.

Apply a Kubernetes Upgrade Strategy using Horizon

Procedural Changes: N/A.

k8s-coredump only supports lowercase annotation ¶

Creating K8s pod core dump fails when setting the starlingx.io/core_pattern parameter in upper case characters on the pod manifest. This results in the pod being unable to find the target directory and fails to create the coredump file.

Procedural Changes: The starlingx.io/core_pattern parameter only accepts lower case characters for the path and file name where the core dump is saved.

See: Kubernetes Pod Core Dump Handler.

Huge Page Limitation on Postgres ¶

Debian postgres version supports huge pages, and by default uses 1 huge page if it is available on the system, decreasing by 1 the number of huge pages available.

Procedural Changes: The huge page setting must be disabled by setting /etc/postgresql/postgresql.conf: "huge_pages = off". The postgres service needs to be restarted using the Service Manager sudo sm-restart service postgres command.

Warning

The Procedural Changes is not persistent, therefore, if the host is rebooted it will need to be applied again. This will be fixed in a future release.

Quartzville Tools ¶

The following celo64e and nvmupdate64e commands are not supported in StarlingX, Release 9.0 due to a known issue in Quartzville tools that crashes the host.

Procedural Change: Reboot the host using the boot screen menu.

Deprecated Notices in Stx 11.0 ¶

In-tree and Out-of-tree drivers¶

In StarlingX Release 11.0 only the out-of-tree versions of the Intel ice i40e, and iavf drivers are supported. Switching between in-tree and out-of-tree driver versions are not supported.

The out_of_tree_drivers service parameter and the out-of-tree-drivers boot parameter are deprecated and should not be modified to switch to in-tree driver versions. The values will be ignored, and the system will always use the out-of-tree versions of the Intel ice, i40e, and iavf drivers.

See: Intel Driver Versions

Kubernetes Root CA boostrap overrides¶

The overrides k8s_root_ca_cert, k8s_root_ca_key, and apiserver_cert_sans will be deprecated in a future release. External connections to kube-apiserver are now routed through a proxy that identifies itself using the REST API/GUI certificate issued by the platform issuer (system-local-ca).

See: Ansible Bootstrap Configurations

kubernetes-power-manager¶

Intel has stopped support for the kubernetes-power-manager application. This is still being supported by StarlingX and will be removed in a future release.

cpu_busy_cycles metric is deprecated and must be replaced with cpu_c0_state_residency_percent for continued usage (if the metrics are customized via helm overrides).

For more information, see Configurable Power Manager.

Bare metal Ceph¶

Host-based Ceph is deprecated in StarlingX Release 11.0. Adoption of Rook-Ceph is recommended for new deployments to avoid service disruption introduced by Bare Metal Ceph to Rook migration.

Static Configuration for Hardware Accelerator Cards¶

Static configuration for hardware accelerator cards is deprecated in StarlingX Release 24.09.00 and will be discontinued in future releases. Use FEC operator instead.

See Switch between Static Method Hardware Accelerator and SR-IOV FEC Operator

N3000 FPGA Firmware Update Orchestration¶

The N3000 FPGA Firmware Update Orchestration has been deprecated in StarlingX Release 24.09.00. For more information, see N3000 FPGA Overview for more information.

show-certs.sh Script¶

The show-certs.sh script that is available when you ssh to a controller is deprecated in StarlingX Release 11.0.

The new response format of the ‘system certificate-list’ RESTAPI / CLI now provides the same information as provided by show-certs.sh.

Kubernetes APIs¶

Kubernetes APIs that will be removed in K8s 1.27 are listed below:

See: https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-27

ptp-notification v1 API¶

The ptp-notification v1 API can still be used in StarlingX Release 11.0. The v1 API will be removed in a future release and only the O-RAN Compliant Notification API (ptp-notification v2 API) will be supported.

Note

It is recommended that all new deployments use the O-RAN Compliant Notification API (ptp-notification v2 API).

Removed in Stx 11.0 ¶

MacVTap Interfaces¶

MacVTap interfaces for KubeVirt VMs are not supported in StarlingX Release 11.0 and future releases.

Release Information for other versions ¶

You can find details about a release on the specific release page at: https://wiki.openstack.org/wiki/StarlingX/Release_Plan#List_of_Releases.

Version	Release Date	Notes	Status
StarlingX R11.0	2025-11	https://docs.starlingx.io/r/stx.11.0/releasenotes/index.html	Maintained
StarlingX R10.0	2025-02	https://docs.starlingx.io/r/stx.10.0/releasenotes/index.html	Maintained
StarlingX R9.0	2024-03	https://docs.starlingx.io/r/stx.9.0/releasenotes/index.html	EOL
StarlingX R8.0	2023-02	https://docs.starlingx.io/r/stx.8.0/releasenotes/index.html	EOL
StarlingX R7.0	2022-07	https://docs.starlingx.io/r/stx.7.0/releasenotes/index.html	EOL
StarlingX R6.0	2021-12	https://docs.starlingx.io/r/stx.6.0/releasenotes/index.html	EOL
StarlingX R5.0.1	2021-09	https://docs.starlingx.io/r/stx.5.0/releasenotes/index.html	EOL
StarlingX R5.0	2021-05	https://docs.starlingx.io/r/stx.5.0/releasenotes/index.html	EOL
StarlingX R4.0	2020-08		EOL
StarlingX R3.0	2019-12		EOL
StarlingX R2.0.1	2019-10		EOL
StarlingX R2.0	2019-09		EOL
StarlingX R12.0	2018-10		EOL

StarlingX follows the release maintenance timelines in the StarlingX Release Plan.

The Status column uses OpenStack maintenance phase definitions.