StarlingX 11.0 Release Notes¶
About this task
StarlingX is a fully integrated edge cloud software stack that provides everything needed to deploy an edge cloud on one, two, or up to 100 servers.
This section describes the new capabilities, known limitations and procedural changes, fixed defects, and deprecated information in StarlingX 11.0.
ISO image¶
The pre-built ISO (Debian) for StarlingX 11.0 is located at the
StarlingX mirror repo:
Source Code for StarlingX 11.0¶
The source code for StarlingX 11.0 is available on the r/stx.11.0 branch in the StarlingX repositories.
Deployment¶
To deploy StarlingX 11.0, see Consuming StarlingX.
For detailed installation instructions, see StarlingX 11.0 Installation Guides.
New Features / Enhancements / Updates¶
The sections below provide a detailed list of new features and links to the associated user guides (if applicable).
Platform Component Upversion¶
The following platform component versions have been updated in StarlingX Release 11.0:
kernel version 6.12.40
Supported Kubernetes versions in StarlingX 11.0:
1.29.2
1.30.6
1.31.5
1.32.2
nginx-ingress-controller
ingress-nginx 4.13.3
cert-manager 1.17.2
platform-integ-apps 3.11.0
ceph-csi-rbd-3.13.1
ceph-csi-cephfs-3.13.1
ceph-pools-audit-1.0.1
Note
The Ceph pools audit chart is now disabled by default. It can be enabled through user-overrides based on user preference, if required.
rook-ceph
rook-ceph-1.16.6
rook-ceph-cluster-1.16.6
rook-ceph-provisioner-2.1.0
rook-ceph-floating-monitor-2.1.0
oidc-auth-apps 2.42.0
dex-0.23.0
secret-observer-0.1.8
oidc-client-0.1.24
Helm chart metrics-server: 3.12.2 (deploys Metrics Server 0.7.2)
kubevirt-app 1.5.0
node-feature-discovery 0.17.3
sriov-fec-operator 2.11.1
node-interface-metrics-exporter 0.1.4
security-profiles-operator 0.8.7
dell-storage
csi-powerflex 2.13.0
csi-powermax 2.13.0
csi-powerscale 2.13.0
csi-powerstore 2.13.0
csi-unity 2.13.0
csm-observability 1.11.0
csm-replication 1.11.0
csm-resiliency 1.12.0
oran-o2 2.2.1
snmp 1.0.5
auditd 1.0.3
portieris 0.13.28
Warning
Kubernetes upgrade fails if Portieris is applied.
intel-device-plugins-operator
intel-device-plugins-operator-0.32.5
intel-device-plugins-qat-0.32.1
intel-device-plugins-gpu-0.32.1
intel-device-plugins-dsa-0.32.1
secret-observer-0.1-1
kubernetes-power-manager 2.5.1
Note
Intel has stopped support for the kubernetes-power-manager application. It is still supported by StarlingX but will be removed in a future release. For more information, see Configurable Power Manager. The cpu_busy_cycles metric is deprecated and must be replaced with cpu_c0_state_residency_percent for continued usage (if the metrics are customized via Helm overrides).
power-metrics
cadvisor 0.52.1
telegraf 1.34.4
app-istio
Istio 1.26.2
Kiali 2.11.0
FluxCD helm-controller 1.2.0
FluxCD source-controller 1.5.0
FluxCD notification-controller 1.5.0
FluxCD kustomize-controller 1.5.1
Helm 3.17.1 for Kubernetes 1.29-1.32
volume-snapshot-controller
snapshot-controller 6.1.0 for K8s 1.29.2
snapshot-controller 6.3.3 for K8s 1.30.6
snapshot-controller 8.0.0 for K8s 1.31.5 - 1.32.2
snapshot-controller 8.1.0 for K8s 1.33.0
ptp-notification 2.0.75
app-netapp-storage (NetApp Trident CSI) 25.02.1
Mellanox (OFED) ConnectX 24.10-2.1.8
Mellanox ConnectX-6 DX firmware 22.43.2566
ice: 2.3.10
Intel E810 - Required NVM/firmware: 4.80
Intel E825 - Required NVM/firmware: 4.02
Intel E830 - Required NVM/firmware: 1.11
i40e: 2.28.9 / Required NVM/firmware: 9.20
OpenBao is not supported¶
Warning
OpenBao is not supported in StarlingX Release 11.0. Do not upload or apply this application on a production system.
Secure Pod-to-Pod Communication of Inter-Host Network Traffic¶
To strengthen security across the StarlingX platform, new measures have been implemented to protect selected pod-to-pod network traffic from both passive and active network attackers, including those with access to the cluster host network.
On StarlingX, inter-host pod-to-pod traffic for a service can be configured to be protected by IPsec in tunnel mode over the cluster host network. The configurations are defined as IPsec policies and managed by the ipsec-policy-operator Kubernetes system application.
See:
Threat Mitigation¶
Passive attackers: Defend against traffic snooping and unauthorized data observation
Active attackers: Blocked from attempting unauthorized connections to StarlingX cluster hosts
Secure Pod-to-Pod Communication¶
StarlingX now supports encryption of Calico-based inter-host networking using IPsec, ensuring secure pod-to-pod traffic across the cluster-host network.
Applies to application pod-to-pod traffic on the cluster-host network
Applications and their pod-to-pod traffic can be selectively protected
Excludes SR-IOV VF interface traffic
Configuring IPsec policies on pod-to-pod traffic may degrade CPU performance. Ensure that adequate resources are available to support sustained and peak inter-node traffic.
See:
Install IPsec Policy Operator System Application¶
The ipsec-policy-operator system application is managed by the system
application framework and will be automatically uploaded once the system is ready.
Subsequently, the application can be installed by applying its manifest.
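The usual system application workflow applies; a minimal sketch, assuming the application is listed by the framework under the name ipsec-policy-operator:
# Confirm the application was auto-uploaded by the system application framework
~(keystone_admin)]$ system application-list | grep ipsec-policy-operator
# Apply (install) the application and monitor its progress
~(keystone_admin)]$ system application-apply ipsec-policy-operator
~(keystone_admin)]$ system application-show ipsec-policy-operator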
See:
Platform Networks Address Reduction for AIO-SX¶
To reduce the number of IP addresses required for Distributed Cloud AIO-SX subcloud deployments, platform networks are updated to allocate only a single IP address per subcloud, removing the need for additional unit-specific addresses.
However, the platform network IP address must be assigned from a shared subnet, allowing multiple subclouds to use the same network address range. This enables more efficient IP management across large-scale deployments. The OAM network serves as a reference model, as it already supports the necessary capabilities and expected behavior for this configuration.
See:
Intermediate CA Support for Kubernetes Root CA¶
StarlingX now supports the use of server certificates signed by an Intermediate Certificate Authority (CA) for the external kube-apiserver endpoint. This enhancement ensures that external access to the Kubernetes API can be validated under the same root of trust as other platform certificates, improving consistency and security across the system.
Intermediate CA Support for External Connections to kube-apiserver¶
External connections to kube-apiserver are now routed through HAProxy, which
listens on port 6443. HAProxy uses the REST API / GUI certificate issued by
system-local-ca, supporting Intermediate CAs, to perform SSL termination
with the external client. It then initiates a new SSL connection to kube-apiserver,
now operating on port 16443 behind the firewall, on behalf of the client.
External clients must recognize and trust the public certificate of
system-local-ca’s Root CA.
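As a quick check from an external client, the served certificate chain can be inspected with openssl (the OAM floating IP below is a placeholder):
# Inspect the certificate chain presented by HAProxy on the external API port
openssl s_client -connect <oam-floating-ip>:6443 -showcerts </dev/null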
See: Kubernetes Certificates.
Unified PTP Notification Overall Sync State¶
The overall sync state notification (sync-state) describes the health of the timing chain on the local system. A locked state is reported when the system has reference to an external time source (GNSS or PTP) and the system clock is synchronized to that time source.
New Default/Static Platform API/CLI/GUI Access-Control Roles for Configurator and Operator¶
In StarlingX, 5 different keystone roles are supported: admin, reader,
configurator, operator and member.
In StarlingX Release 11.0, the following new keystone roles are introduced:
configurator
operator
Multi-Node Upgrades¶
In StarlingX Release 11.0, the restriction on K8s multi-node orchestrated upgrades has been removed. You can now perform upgrades across multiple nodes in a single orchestration strategy.
Example: Upgrading from v1.29.2 to v1.32.2
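A hedged sketch of an orchestrated upgrade directly to the target version; the sw-manager kube-upgrade-strategy options shown are the commonly used ones, and the full set is described in the Kubernetes upgrade guide:
~(keystone_admin)]$ sw-manager kube-upgrade-strategy create --to-version v1.32.2
~(keystone_admin)]$ sw-manager kube-upgrade-strategy apply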
PTP Netlink API Integration¶
The following new interface parameters have been added in StarlingX Release 11.0:
ts2phc.pin_index = 1
ts2phc.channel = 1
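For illustration, these parameters can be added to an existing ts2phc instance using the standard PTP instance commands (the instance name <ts2phc-instance> is a placeholder):
~(keystone_admin)]$ system ptp-instance-parameter-add <ts2phc-instance> ts2phc.pin_index=1 ts2phc.channel=1
~(keystone_admin)]$ system ptp-instance-apply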
Docker Size updates¶
In StarlingX Release 11.0 the default Docker filesystem size is 30 GB. Resize the Docker filesystem on all controllers to a minimum of 50 GB prior to upgrading the system using the following command:
system host-fs-modify <controller-name> docker=<GB>
A new deploy precheck script is added to ensure the docker filesystem size is not less than 50GB.
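For example, to resize the Docker filesystem to 50 GB on controller-0 (repeat for each controller; the value shown is illustrative):
~(keystone_admin)]$ system host-fs-modify controller-0 docker=50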
VIM Rollback Orchestration¶
StarlingX Release 11.0 introduces expanded rollback capabilities to improve system recovery during software deployments:
Manual Rollback is supported across all configurations, including AIO-SX, AIO-DX, Standard, and Standard with dedicated storage.
VIM Orchestrated Rollback is supported on duplex configurations (AIO-DX, AIO-DX+, Standard, and Standard with dedicated storage) for the following scenarios:
Rollback of Major Release software deployments
Rollback of Patch Release software deployments
Rollback of Patched Major Release deployments
Recovery from aborted or failed deployments
These enhancements aim to streamline recovery workflows and reduce downtime across a broader range of deployment scenarios.
See:
Upgrade / Rollback Process Optimization¶
To accelerate recovery from failed operations during software updates and upgrades, a new snapshot-based restore capability is introduced in StarlingX Release 11.0. Unlike traditional backup and restore, this feature leverages OSTree deployment management and LVM volume snapshots to revert the system to a previously saved state without requiring a full reinstall. Snapshots will be created for select LVM volumes, excluding directories such as /opt/backup, /var/log, and /scratch, as outlined in the “Filesystem Summary” below. This capability is currently limited to Simplex systems (AIO-SX).
| LVM Name                | Mount Path                    | DRBD  | Versioned** | Snapshot |
|-------------------------|-------------------------------|-------|-------------|----------|
| root-lv                 | /sysroot                      | N     | N*          |          |
| var-lv                  | /var                          | N     | Y           |          |
| log-lv                  | /var/log                      | N     | N           |          |
| backup-lv               | /var/rootdirs/opt/backups     | N     | N           |          |
| ceph-mon-lv             | /var/lib/ceph/mon             | N     | N           |          |
| docker-lv               | /var/lib/docker               | N     | Y           |          |
| kubelet-lv              | /var/lib/kubelet              | N     | Y           |          |
| pgsql-lv                | /var/lib/postgresql           | drbd0 | Y           | Y        |
| rabbit-lv               | /var/lib/rabbitmq             | drbd1 | Y           | Y        |
| dockerdistribution-lv   | /var/lib/docker-distribution  | drbd8 | N           | N        |
| platform-lv             | /var/rootdirs/opt/platform    | drbd2 | Y           | Y        |
| etcd-lv                 | /var/rootdirs/opt/etcd        | drbd7 | Y           | Y        |
| extension-lv            | /var/rootdirs/opt/extension   | drbd5 | N           | N        |
| dc-vault-lv             | /var/rootdirs/opt/dc-vault    | drbd6 | Y           | N        |
| scratch-lv              | /var/rootdirs/scratch         | N     | N           |          |

* Managed by OSTree
** Versioned subpaths
See:
Platform Real Time Kernel Robustness¶
Stalld can be configured to use the queue_track backend, which is based
on eBPF. Stalld protects lower priority tasks from starvation.
Unlike other backends, queue_track reduces CPU usage and more accurately
identifies which tasks can be executed even if they are currently blocked
waiting for a lock.
See: Configure stall daemon.
Enable CONFIG_GENEVE Kernel Configuration¶
StarlingX Release 11.0 supports geneve.ko kernel module, controlled by the CONFIG_GENEVE kernel config option.
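A quick way to confirm the module is available on a running host, using standard Linux commands:
# Check that the geneve kernel module is built and can be loaded
modinfo geneve
sudo modprobe geneve
lsmod | grep geneve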
Cloud User Management GUI/CLI/RESTAPI Enhancements; Deletion Restriction¶
In StarlingX Release 11.0, existing Local LDAP users in the sudo group do not need to be migrated to the sys_admin group.
Administrators may retain their existing configuration if required. However, to better align with the platform’s security and access control standards, it is recommended to assign restricted sudo privileges through the sys_admin group.
Administrators may optionally update their configurations by transitioning Local LDAP users from the sudo group to the sys_admin group. This can be done using ONLY the following method:
via pam_group & /etc/security/group.conf to map users into additional groups
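A hedged sketch of the pam_group mapping; the user name below is a placeholder, and the line follows the standard /etc/security/group.conf format (services; ttys; users; times; groups):
# /etc/security/group.conf - map an LDAP user's login sessions into sys_admin
# services; ttys; users; times; groups
*;*;ldapuser01;Al0000-2400;sys_admin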
In-tree and Out-of-tree drivers¶
In StarlingX Release 11.0 only the out-of-tree versions of the Intel ice,
i40e, and iavf drivers are supported. Switching between in-tree and
out-of-tree driver versions is not supported.
See:
CaaS Traffic Bandwidth Configuration¶
Previously, the max_tx_rate parameter was used to set the maximum transmission
rate for a VF interface, with the short form -r. With the introduction of the
max_rx_rate parameter that is used to configure the maximum receiving
rate, both max_tx_rate and max_rx_rate can now be applied to define
bandwidth limits for platform interfaces. To align with naming conventions:
The -t short form for the max_tx_rate parameter allows the configuration of the maximum transmission rate for both VF and platform interfaces.
The -r short form for the max_rx_rate parameter is used to set the maximum receiving rate for platform interfaces.
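A hedged sketch of applying both limits to a platform interface with system host-if-modify; the interface name and rate values are placeholders, and the exact option spelling should be confirmed against the interface configuration guide:
~(keystone_admin)]$ system host-if-modify controller-0 <interface-name> -t 100 -r 100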
See:
Rook Ceph Updates and Enhancements¶
Rook Ceph is an orchestrator that provides a containerized solution for Ceph Storage with a specialized Kubernetes Operator to automate the management of the cluster. It is an alternative solution for the bare-metal Ceph storage. See https://rook.io/docs/rook/latest-release/Getting-Started/intro/ for more details.
ECblock pools are renamed: Both data and metadata pools for ECblock on
Rook Ceph changed names to comply with the new standards for upstream Rook Ceph.
Data pool was renamed from ec-data-pool to kube-ecblock.
Metadata pool was renamed from ec-metadata-pool to kube-ecblock-metadata.
Ceph version upgrade
Ceph version is upgraded from 18.2.2 to 18.2.5, with minimal impact on the upgrade.
Rook Ceph OSDs Management
To add, remove or replace OSDs in a Rook Container-based Ceph, see Remove a Host From a Rook Ceph Cluster
Note
Host-based Ceph is deprecated in StarlingX Release 11.0.
For any new StarlingX deployments, Rook Ceph is mandatory in order to prevent service disruptions during migration procedures.
User Management GUI/CLI Enhancements¶
For critical operations performed via the StarlingX CLI or GUI, such as delete actions or operations that may impact services, the system will display a warning indicating that the operation is critical, irreversible, or may affect service availability. The system will also prompt the user to confirm before proceeding with the execution of the operation.
A user confirmation request can optionally be used to safeguard critical operations performed via the CLI. When the user CLI confirmation request is enabled, CLI users are prompted to explicitly confirm a potentially critical or destructive CLI command, before proceeding with the execution of the CLI command.
See:
Optimized Platform Processing and Memory Usage - Ph1¶
StarlingX Release 11.0 requires approximately 1 GB less memory, enabling more efficient deployment in resource-constrained environments.
This feature is designed to optimize platform resource utilization, specifically targeting processing and memory efficiency. This enables greater flexibility for StarlingX deployments in use cases with tighter footprint constraints.
Kubernetes Upgrade Procedure Optimization - Multi-Node¶
This feature enhances Kubernetes version upgrades across all StarlingX configurations, including AIO-DX, AIO-DX with worker nodes, standard configurations with controller storage, and standard configurations with dedicated storage, extending beyond AIO-SX.
The following enhancements are introduced:
Pre-caching of container images for all relevant versions during the upgrade’s preliminary phase.
The upgrade system now supports multi-node, multi-K8s-version upgrades (both manual and orchestrated), i.e.:
it supports multi-node upgrades of multiple Kubernetes versions in a single manual upgrade
it supports multi-node upgrades of multiple Kubernetes versions in a single orchestration
Previously, for multi-node environments, the Kubernetes upgrade process had to be repeated end-to-end for each version in sequence.
Now, the upgrade system checks for kubelet version skew, allowing kubelet components to run up to three minor versions behind the control plane. This enhancement enables multi-version upgrades in a single cycle, eliminating the need to upgrade kubelet through each intermediate version. As a result, the overall number of upgrade steps is significantly reduced.
See:
Hardware Updates¶
See:
Bug status¶
Fixed bugs¶
This release provides fixes for a number of defects. Refer to the StarlingX bug database to review the R11.0 Fixed Bugs.
Known Limitations and Procedural Changes¶
The following are known limitations you may encounter with StarlingX 11.0 and earlier releases. Workarounds are suggested where applicable.
Note
These limitations are considered temporary and will likely be resolved in a future release.
Stx 11.0 Limitations¶
Security Limitations¶
RSA required to be the platform issuer private key¶
The system-local-ca issuer must use an RSA certificate/key. The usage
of other types of private keys is currently not supported during bootstrap
or with the Update system-local-ca or Migrate Platform Certificates to use
Cert Manager procedures. However, if the platform certificates were migrated to
cert-manager in previous versions (StarlingX Releases 9.0 and earlier),
it is possible that a non-RSA private key was used. In this case,
the procedure needs to be rerun providing an RSA certificate/private key
pair before upgrading to StarlingX Release 10.0.
Procedural Changes: N/A.
Multiple trusted CA certificates with same Distinguished Name are not supported¶
The presence of multiple trusted CA (ssl_ca) certificates with the same Distinguished Name (DN) is invalid. An attempt to install a new trusted CA (ssl_ca) certificate with the same DN as another already installed as trusted will not succeed and a warning message will be returned.
Procedural Changes: N/A.
Kubernetes Root CA Certificates¶
Kubernetes does not properly support k8s_root_ca_cert and k8s_root_ca_key being an Intermediate CA.
Procedural Changes: Accept internally generated k8s_root_ca_cert/key or customize only with a Root CA certificate and key.
For external access to kube-apiserver, the proxy (HAproxy) uses the Rest API / GUI certificate, which supports Intermediate CAs. The issuer (system-local-ca) can be customized at bootstrap. See Ansible Bootstrap Configurations for more information.
External Authentication to kube-apiserver Using Client Certificates¶
SSL termination for external connections to kube-apiserver is now handled
by HAProxy, which establishes a new connection to the API server on behalf of
the external client. As a result, client certificate authentication is now
restricted to the admin user (kubernetes-admin). Token-based authentication
remains fully supported and unchanged.
Procedural Changes: N/A.
Password Expiry does not work on LDAP user login¶
On Debian, the warning message is not being displayed for Active Directory users, when a user logs in and the password is nearing expiry. Similarly, on login when a user’s password has already expired, the password change prompt is not being displayed.
Procedural Changes: It is recommended that users rely on Directory administration tools for “Windows Active Directory” servers to handle password updates, reminders and expiration. It is also recommended that passwords should be updated every 3 months.
Note
The expired password can be reset via Active Directory by IT administrators.
Upgrade activation: cert-manager does not start issuing certificates after upversion¶
During upgrade activation and upversioning, cert-manager usually takes less than a minute to be available and start issuing certificates. Occasionally, cert-manager can take more time than expected. This behavior is associated with an open source issue. For more details see https://github.com/cert-manager/cert-manager/issues/7138#issuecomment-2422983418.
Since the cert-manager application is required, the upgrade activation will fail if the app takes too long to be available after the upversion. The following log will be displayed in /var/log/software.log:
Error from server (NotFound): secrets "stx-test-cm" not found
certificate.cert-manager.io "stx-test-cm" deleted
software-controller-daemon: software_controller.py(837): INFO: 15 received
from deploy-activate with deploy-state activate-failed
software-controller-daemon: software_controller.py(870): INFO: Received
deploy state changed to DEPLOY_STATES.ACTIVATE_FAILED, agent deploy-activate
Procedural Changes: Cert-manager should recover by itself after a few minutes. If required, the following certificate used for test purposes in the upgrade activation can be created manually to ensure cert-manager is ready before reattempting the upgrade.
cat <<eof> cm_test_cert.yml
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  creationTimestamp: null
  name: system-local-ca
spec:
  ca:
    secretName: system-local-ca
status: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  creationTimestamp: null
  name: stx-test-cm
  namespace: cert-manager
spec:
  commonName: stx-test-cm
  issuerRef:
    kind: ClusterIssuer
    name: system-local-ca
  secretName: stx-test-cm
status: {}
eof
$ kubectl apply -f cm_test_cert.yml
$ rm cm_test_cert.yml
$ kubectl wait certificate -n cert-manager stx-test-cm --for=condition=Ready --timeout 20m
# Verify that the TLS secret associated with the cert was created, using the following:
$ kubectl get secret -n cert-manager stx-test-cm
cert-manager cm-acme-http-solver pod fails¶
On a multinode setup, when you deploy an acme issuer to issue a certificate,
the cm-acme-http-solver pod might fail and stays in “ImagePullBackOff” state
due to the following defect https://github.com/cert-manager/cert-manager/issues/5959.
Procedural Changes:
If you are using the namespace “test”, create a docker-registry secret “testkey” with local registry credentials in the “test” namespace.
~(keystone_admin)]$ kubectl create secret docker-registry testkey --docker-server=registry.local:9001 --docker-username=admin --docker-password=Password*1234 -n test
Use the secret “testkey” in the issuer spec as follows:
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: stepca-issuer
  namespace: test
spec:
  acme:
    server: https://test.com:8080/acme/acme/directory
    skipTLSVerify: true
    email: test@test.com
    privateKeySecretRef:
      name: stepca-issuer
    solvers:
    - http01:
        ingress:
          class: nginx
          podTemplate:
            spec:
              imagePullSecrets:
              - name: testkey
Vault application is not supported during bootstrap¶
The Vault application cannot be configured during bootstrap.
Procedural Changes:
The application must be configured after the platform nodes are unlocked /
enabled / available, a storage backend is configured, and platform-integ-apps
is applied. If Vault is to be run in HA configuration (3 vault server pods)
then at least three controller / worker nodes must be unlocked / enabled / available.
Vault application support for running on application cores¶
By default the Vault application’s pods will run on platform cores. When changing the core selection from platform cores to application cores the following additional procedure is required for the vault application.
Procedural Changes:
If static kube-cpu-mgr-policy is selected and when overriding the label
app.starlingx.io/component for the Vault namespace or pods, there are two
requirements:
The Vault server pods need to be restarted as directed by Hashicorp Vault documentation. Restart each of the standby server pods in turn, then restart the active server pod.
Ensure that sufficient hosts with worker function are available to run the Vault server pods on application cores.
See: Kubernetes CPU Manager Policies.
Restart the Vault Server pods¶
The Vault server pods do not restart automatically.
Procedural Changes: If the pods are to be re-labelled to switch execution from platform to application cores, or vice-versa, then the pods need to be restarted.
Under kubernetes the pods are restarted using the kubectl delete pod command. See, Hashicorp Vault documentation for the recommended procedure for restarting server pods in HA configuration, https://support.hashicorp.com/hc/en-us/articles/23744227055635-How-to-safely-restart-a-Vault-cluster-running-on-Kubernetes.
Ensure that sufficient hosts are available to run the server pods on application cores¶
The standard cluster with less than 3 worker nodes does not support Vault HA on the application cores. In this configuration (less than three cluster hosts with worker function):
Procedural Changes:
When setting label app.starlingx.io/component=application with the Vault app already applied in HA configuration (3 vault server pods), ensure that there are 3 nodes with worker function to support the HA configuration.
When applying Vault for the first time with app.starlingx.io/component set to "application", ensure that the server replicas value is also set to 1 for a non-HA configuration. The replicas for the Vault server are overridden both for the Vault Helm chart and the Vault manager Helm chart:

cat <<EOF > vault_overrides.yaml
server:
  extraLabels:
    app.starlingx.io/component: application
  ha:
    replicas: 1
injector:
  extraLabels:
    app.starlingx.io/component: application
EOF

cat <<EOF > vault-manager_overrides.yaml
manager:
  extraLabels:
    app.starlingx.io/component: application
server:
  ha:
    replicas: 1
EOF

$ system helm-override-update vault vault vault --values vault_overrides.yaml
$ system helm-override-update vault vault-manager vault --values vault-manager_overrides.yaml
Kubernetes upgrade fails if Portieris is applied¶
Kubernetes upgrade fails if Portieris is applied prior to the upgrade.
Procedural Changes: Remove the Portieris application prior to the Kubernetes upgrade. Perform the Kubernetes upgrade, then re-apply the Portieris application.
Portieris Helm override ‘caCert’ is renamed and moved to Portieris Helm chart¶
The ‘caCert’ Helm override of portieris-certs Helm chart is moved to the Portieris Helm chart as ‘TrustedCACert’.
Procedural Changes: Before upgrading from StarlingX 10.0 to 11.0, if ‘caCert’ Helm override is applied to the portieris-certs Helm chart to trust a custom CA certificate, apply the ‘TrustedCACert’ Helm override to ‘portieris’ Helm chart to trust the certificate. See, Install Portieris for information on TrustedCACert Helm override.
Harbor cannot be deployed during bootstrap¶
The Harbor application cannot be deployed during bootstrap due to the bootstrap deployment dependencies such as early availability of storage class.
Procedural Changes: N/A.
Windows Active Directory¶
Limitation: The Kubernetes API does not support uppercase IPv6 addresses.
Procedural Changes: The issuer_url IPv6 address must be specified as lowercase.
Limitation: The refresh token does not work.
Procedural Changes: If the token expires, manually replace the ID token. For more information, see, Configure Kubernetes Client Access.
Limitation: TLS error logs are reported in the oidc-dex container on subclouds. These logs should not have any system impact.
Procedural Changes: NA
Security Audit Logging for K8s API¶
A custom policy file can only be created at bootstrap in apiserver_extra_volumes.
If a custom policy file was configured at bootstrap, then after bootstrap the
user has the option to configure the parameter audit-policy-file to either
this custom policy file (/etc/kubernetes/my-audit-policy-file.yml) or the
default policy file /etc/kubernetes/default-audit-policy.yaml. If no
custom policy file was configured at bootstrap, then the user can only
configure the parameter audit-policy-file to the default policy file.
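A hedged sketch of selecting the policy file after bootstrap through the Kubernetes service parameters; the kube_apiserver section name follows the usual API server parameter grouping and should be verified against your release's CLI reference:
~(keystone_admin)]$ system service-parameter-add kubernetes kube_apiserver audit-policy-file=/etc/kubernetes/my-audit-policy-file.yml
~(keystone_admin)]$ system service-parameter-apply kubernetes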
Only the parameter audit-policy-file is configurable after bootstrap, so
the other parameters (audit-log-path, audit-log-maxsize,
audit-log-maxage and audit-log-maxbackup) cannot be changed at
runtime.
Procedural Changes: NA
Networking Limitations¶
Controller-0/1 PXEboot Network Communication Failure(200.003) Alarm Raised After Upgrade¶
Alarm triggered: Controller-0/1 PXE boot network communication failure (Error Alarm 200.003) following system upgrade.
Procedural Changes:
Identify the PXEboot file.
grep -l "net:pxeboot" "/etc/network/interfaces.d/"/* 2>/dev/null /etc/network/interfaces.d//ifcfg-enp0s8:9
If the label differs from ‘:2’ (e.g., displays as ifcfg-enp0s8:9), proceed with the following step
Copy the file in the same directory.
cp /etc/network/interfaces.d//ifcfg-enp0s8:9 /etc/network/interfaces.d//ifcfg-enp0s8:2
Restart mtcClient.
systemctl restart mtcClient.service
Wait up to one minute for the alarm to clear. Repeat this process for all nodes.
Add / delete operations on pods results in errors¶
Under some circumstances, add / delete operations on pods results in error getting ClusterInformation: connection is unauthorized: Unauthorized and also results in pods staying in ContainerCreating/Terminating state. This error may also prevent users from locking a host.
Procedural Changes: If this error occurs, run the kubectl describe pod -n <namespace> <pod name> command. The following message is displayed:
error getting ClusterInformation: connection is unauthorized: Unauthorized
Limitation: There is also a known issue with the Calico CNI that may occur in rare occasions if the Calico token required for communication with the kube-apiserver becomes out of sync due to NTP skew or issues refreshing the token.
Procedural Changes: Delete the calico-node pod (causing it to automatically restart) using the following commands:
$ kubectl get pods -n kube-system --show-labels | grep calico
$ kubectl delete pods -n kube-system -l k8s-app=calico-node
Application Pods with SRIOV Interfaces¶
Application Pods with SR-IOV Interfaces require a restart-on-reboot: “true” label in their pod spec template.
Pods with SR-IOV interfaces may fail to start after a platform restore or Simplex upgrade and persist in the Container Creating state due to missing PCI address information in the CNI configuration.
Procedural Changes: Application pods that require SR-IOV should add the label restart-on-reboot: "true" to their pod spec template metadata. All pods with this label will be deleted and recreated after system initialization, therefore all pods must be restartable and managed by a Kubernetes controller (i.e. DaemonSet, Deployment or StatefulSet) for auto recovery.
Pod Spec template example:
template:
  metadata:
    labels:
      tier: node
      app: sriovdp
      restart-on-reboot: "true"
PTP O-RAN Spec Compliant Timing API Notification¶
The v1 API only supports monitoring a single ptp4l + phc2sys instance.
Procedural Changes: Ensure the system is not configured with multiple instances when using the v1 API.
The O-RAN Cloud Notification defines a /././sync API v2 endpoint intended to allow a client to subscribe to all notifications from a node. This endpoint is not supported in StarlingX Release 9.0.
Procedural Changes: A specific subscription for each resource type must be created instead.
v1 / v2
v1: Support for monitoring a single ptp4l instance per host - no other services can be queried/subscribed to.
v2: The API conforms to O-RAN.WG6.O-Cloud Notification API-v02.01 with the following exceptions, that are not supported in StarlingX Release 9.0.
O-RAN SyncE Lock-Status-Extended notifications
O-RAN SyncE Clock Quality Change notifications
O-RAN Custom cluster names
/././sync endpoint
Procedural Changes: See the respective PTP-notification v1 and v2 document subsections for further details.
ptp4l error “timed out while polling for tx timestamp” reported for NICs using the Intel ice driver¶
NICs using the Intel® ice driver may report the following error in the ptp4l
logs, which results in a PTP port switching to FAULTY before
re-initializing.
Note
PTP ports frequently switching to FAULTY may degrade the accuracy of
the PTP timing.
ptp4l[80330.489]: timed out while polling for tx timestamp
ptp4l[80330.489]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
Note
This is due to a limitation with the Intel® ice driver as the driver cannot
guarantee the time interval to return the timestamp to the ptp4l user
space process which results in the occasional timeout error message.
Procedural Changes: The procedural change recommended by Intel is to increase the
tx_timestamp_timeout parameter in the ptp4l config. The increased
timeout value gives more time for the ice driver to provide the timestamp to
the ptp4l user space process. Timeout values of 50 ms and 700 ms have been
validated. However, the user can use a different value if it is more suitable
for their system.
~(keystone_admin)]$ system ptp-instance-parameter-add <instance_name> tx_timestamp_timeout=700
~(keystone_admin)]$ system ptp-instance-apply
Note
The ptp4l timeout error log may also be caused by other underlying
issues, such as NIC port instability. Therefore, it is recommended to
confirm the NIC port is stable before adjusting the timeout values.
PTP is not supported on Broadcom 57504 NIC¶
PTP is not supported on the Broadcom 57504 NIC.
Procedural Changes: None. Do not configure PTP instances on the Broadcom 57504 NIC.
synce4l CLI options are not supported¶
The SyncE configuration using synce4l is not supported in StarlingX
Release 24.09.
The service type synce4l in the ptp-instance-add command
is not supported in StarlingX Release 24.09.
Procedural Changes: N/A.
ptp-notification application is not supported during bootstrap¶
Deployment of ptp-notification during bootstrap time is not supported due to dependencies on the system PTP configuration, which is handled post-bootstrap.
Procedural Changes: N/A.
The helm-chart-attribute-modify command is not supported for ptp-notification because the application consists of a single chart. Disabling the chart would render ptp-notification non-functional.
Procedural Changes: N/A.
The ptp-notification-demo App is Not a System-Managed Application¶
The ptp-notification-demo app is provided for demonstration purposes only. Therefore, it is not supported on typical platform operations such as Upgrades and Backup and Restore.
Procedural Changes: NA
Silicom TimeSync (STS) card limitations¶
Silicom and Intel based Time Sync NICs may not be deployed on the same system due to conflicting time sync services and operations.
PTP configuration for Silicom TimeSync (STS) cards is handled separately from StarlingX host PTP configuration and may result in configuration conflicts if both are used at the same time.
The sts-silicom application provides a dedicated phc2sys instance which synchronizes the local system clock to the Silicom TimeSync (STS) card. Users should ensure that phc2sys is not configured via StarlingX PTP Host Configuration when the sts-silicom application is in use.
Additionally, if StarlingX PTP Host Configuration is being used in parallel for non-STS NICs, users should ensure that all ptp4l instances do not use conflicting domainNumber values.
When the Silicom TimeSync (STS) card is configured in timing mode using the sts-silicom application, the card goes through an initialization process on application apply and server reboots. The ports will bounce up and down several times during the initialization process, causing network traffic disruption. Therefore, configuring the platform networks on the Silicom TimeSync (STS) card is not supported since it will cause platform instability.
Procedural Changes: N/A.
N3000 Image in the containerd cache¶
The StarlingX system without an N3000 image in the containerd cache fails to configure during a reboot cycle, and results in a failed / disabled node.
The N3000 device requires a reset early in the startup sequence. The reset is
done by the n3000-opae image. The image is automatically downloaded on bootstrap
and is expected to be in the cache to allow the reset to succeed. If the image
is not in the cache for any reason, the image cannot be downloaded as
registry.local is not up yet at this point in the startup. This will result
in the impacted host going through multiple reboot cycles and coming up in an
enabled/degraded state. To avoid this issue:
Ensure that the docker filesystem is properly engineered to avoid the image being automatically removed by the system if flagged as unused. For instructions to resize the filesystem, see Increase Controller Filesystem Storage Allotments Using the CLI
Do not manually prune the N3000 image.
Procedural Changes: Use the procedure below.
Procedure
Lock the node.
~(keystone_admin)]$ system host-lock controller-0
Pull the required N3000 image into the containerd cache.
~(keystone_admin)]$ crictl pull registry.local:9001/docker.io/starlingx/n3000-opae:stx.8.0-v1.0.2
Unlock the node.
~(keystone_admin)]$ system host-unlock controller-0
Deploying an App using nginx controller fails with internal error after controller.name override¶
A Helm override of controller.name to the nginx-ingress-controller app may result in errors when creating ingress resources later on.
Example of Helm override:
Procedural Changes: NA
Distributed Cloud Limitations¶
Subcloud Restore to N-1 Release with Additional Patches¶
If a subcloud is required to be restored to N-1 (Stx 10.0) release beyond the N-1 ISO patch (prepatched ISO) level, use the following prestage and deploy steps:
Restore the subcloud with install to the N-1 release using the following command:
$ dcmanager subcloud-backup restore --subcloud <subcloud> --with-install --release <YY.MM>
Note
The subcloud will be reinstalled with the N-1 (pre-patched) ISO
Prestage the subcloud with additional N-1 patches, if applicable, after the restore by running the following command:
$ dcmanager subcloud prestage --for-sw-deploy --release <N-1> <subcloud>
Use the dcmanager prestage-strategy create/apply commands to prestage more than one subcloud.
Apply the N-1 patches on the subcloud using the following commands:
$ dcmanager sw-deploy-strategy create <subcloud> --release <N-1-highest-patch-release-available>
$ dcmanager sw-deploy-strategy apply
Use the --group option to create a strategy for more than one subcloud.
Subcloud install or restore to the previous release¶
If the System Controller is on StarlingX Release 11.0, subclouds can be deployed or restored to either StarlingX Release 10.0 or StarlingX Release 11.0.
The following operations have limited support for subclouds of the previous release:
Subcloud error reporting
The following operations are not supported for subclouds of the previous release:
Orchestrated subcloud kubernetes upgrade
Procedural Changes: N/A.
Subcloud Upgrade with Kubernetes Versions¶
Before upgrading a cluster, ensure that the Kubernetes version is updated to the latest one supported by the current (older) platform version. This step is necessary because the new platform version only supports that specific Kubernetes version. Orchestrated Kubernetes upgrades are not supported for N-1 subclouds. Therefore, before upgrading the System Controller to Stx 11.0, verify that both the System Controller and all existing subclouds are running Kubernetes version v1.29.2, the latest version supported by Stx 10.0.
Procedural Changes: N/A.
Enhanced Parallel Operations for Distributed Cloud¶
No parallel operation should be performed while the System Controller is being patched.
Only one type of parallel operation can be performed at a time. For example, subcloud prestaging or upgrade orchestration should be postponed while batch subcloud deployment is still in progress.
Examples of parallel operation:
any type of dcmanager orchestration (prestage, sw-deploy, kube-upgrade, kube-rootca-update)
concurrent dcmanager subcloud add
dcmanager subcloud-backup / subcloud-backup restore with the --group option
Procedural Changes: N/A.
Container-Infrastructure Limitations¶
Kubernetes Memory Manager Policies¶
The interaction between the kube-memory-mgr-policy=static
and the Topology Manager policy “restricted” can result in pods failing to be
scheduled or started even when there is sufficient memory. This
occurs due to the restrictive design of the NUMA-aware memory manager, which
prevents the same NUMA node from being used for both single and multi-NUMA
allocations.
Procedural Changes: It is important for users to understand the implications of these memory management policies and configure their systems accordingly to avoid unexpected failures.
For detailed configuration options and examples, refer to the Kubernetes documentation at https://kubernetes.io/docs/tasks/administer-cluster/memory-manager/.
Alarm 900.024 Raised When Uploading N-1 Patch Release to the System Controller¶
When uploading an N-1 patch release to the System Controller, alarm 900.024 (Obsolete Patch) will be triggered.
This behavior is specific to the System Controller and occurs only when uploading an N-1 patch.
Procedural Changes: This warning can be safely ignored.
Kubevirt Limitations¶
The following limitations apply to Kubevirt in StarlingX Release 24.09:
Limitation: Kubernetes does not provide CPU Manager detection.
Procedural Changes: Add cpumanager to Kubevirt:

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - LiveMigration
        - Macvtap
        - Snapshot
        - CPUManager

Check the label using the following command:
~(keystone_admin)]$ kubectl describe node | grep cpumanager
The output should show cpumanager=true.
Limitation: Huge pages do not show up under cat /proc/meminfo inside a guest VM, although resources are being consumed on the host. For example, if a VM is using 4GB of huge pages, the host shows the same 4GB of huge pages used. The huge page memory is exposed as normal memory to the VM.
Procedural Changes: You need to configure Huge pages inside the guest OS.
See the Installation Guides at https://docs.starlingx.io/ for more details.
Limitation: Virtual machines using Persistent Volume Claim (PVC) must have a shared ReadWriteMany (RWX) access mode to be live migrated.
Procedural Changes: Ensure PVC is created with RWX.
$ virtctl image-upload --pvc-name=cirros-vm-disk-test-2 --pvc-size=500Mi --storage-class=cephfs --access-mode=ReadWriteMany --image-path=/home/sysadmin/Kubevirt-GA-testing/latest-manifest/kubevirt-GA-testing/cirros-0.5.1-x86_64-disk.img --uploadproxy-url=https://10.111.54.246 --insecure
Note
Live migration is not allowed with a pod network binding of bridge interface type.
Live migration requires ports 49152, 49153 to be available in the virt-launcher pod. If these ports are explicitly specified in the masquerade interface, live migration will not function.
For live migration with SR-IOV interface:
specify networkData: in cloudinit, so when the VM moves to another node it will not lose the IP config
specify nameserver and internal FQDNs to connect to the cluster metadata server, otherwise cloudinit will not work
fix the MAC address, otherwise when the VM moves to another node the MAC address will change and cause a problem establishing the link
Example:
cloudInitNoCloud:
  networkData: |
    ethernets:
      sriov-net1:
        addresses:
        - 128.224.248.152/23
        gateway: 128.224.248.1
        match:
          macAddress: "02:00:00:00:00:01"
        nameservers:
          addresses:
          - 10.96.0.10
          search:
          - default.svc.cluster.local
          - svc.cluster.local
          - cluster.local
        set-name: sriov-link-enabled
    version: 2

Limitation: Snapshot CRDs and controllers are not present by default and need to be installed on StarlingX.
Procedural Changes: To install snapshot CRDs and controllers on Kubernetes, see:
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml
Additionally, create VolumeSnapshotClass for CephFS and RBD:

cat <<EOF > cephfs-storageclass.yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-cephfsplugin-snapclass
driver: cephfs.csi.ceph.com
parameters:
  clusterID: 60ee9439-6204-4b11-9b02-3f2c2f0a4344
  csi.storage.k8s.io/snapshotter-secret-name: ceph-pool-kube-cephfs-data
  csi.storage.k8s.io/snapshotter-secret-namespace: default
deletionPolicy: Delete
EOF

cat <<EOF > rbd-storageclass.yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-rbdplugin-snapclass
driver: rbd.csi.ceph.com
parameters:
  clusterID: 60ee9439-6204-4b11-9b02-3f2c2f0a4344
  csi.storage.k8s.io/snapshotter-secret-name: ceph-pool-kube-rbd
  csi.storage.k8s.io/snapshotter-secret-namespace: default
deletionPolicy: Delete
EOF

Note
Get the cluster ID from: kubectl describe sc cephfs, rbd

Limitation: Live migration is not possible when using a configmap as a filesystem. Currently, virtual machine instances (VMIs) cannot be live migrated as virtiofs does not support live migration.
Procedural Changes: N/A.
Limitation: Live migration is not possible when a VM is using a secret exposed as a filesystem. Currently, virtual machine instances cannot be live migrated since virtiofs does not support live migration.
Procedural Changes: N/A.
Limitation: Live migration will not work when a VM is using a ServiceAccount exposed as a file system. Currently, VMIs cannot be live migrated since virtiofs does not support live migration.
Procedural Changes: N/A.
Docker Network Bridge Not Supported¶
The Docker Network Bridge, previously created by default, is removed and no longer supported in StarlingX Release 9.0 as the default bridge IP address collides with addresses already in use.
As a result, docker can no longer be used for running containers. This impacts building docker images directly on the host.
Procedural Changes: Create a Kubernetes pod that has network access, log in to the container, and build the docker images.
Upper case characters in host names cause issues with kubernetes labelling¶
Upper case characters in host names cause issues with kubernetes labelling.
Procedural Changes: Host names should be lower case.
Kubernetes Taint on Controllers for Standard Systems¶
In Standard systems, a Kubernetes taint is applied to controller nodes in order to prevent application pods from being scheduled on those nodes; since controllers in Standard systems are intended ONLY for platform services. If application pods MUST run on controllers, a Kubernetes toleration of the taint can be specified in the application’s pod specifications.
Procedural Changes: Customer applications that need to run on controllers on Standard systems will need to be enabled/configured for Kubernetes toleration in order to ensure the applications continue working after an upgrade from StarlingX Release 6.0 to StarlingX future Releases. It is suggested to add the Kubernetes toleration to your application prior to upgrading to StarlingX 9.0 Release.
You can specify toleration for a pod through the pod specification (PodSpec). For example:
spec:
  ....
  template:
    ....
    spec:
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      - key: "node-role.kubernetes.io/control-plane"
        operator: "Exists"
        effect: "NoSchedule"
See: Taints and Tolerations.
Application Fails After Host Lock/Unlock¶
In some situations, application may fail to apply after host lock/unlock due to previously evicted pods.
Procedural Changes: Use the kubectl delete command to delete the evicted pods and reapply the application.
Application Apply Failure if Host Reset¶
If an application apply is in progress and a host is reset it will likely fail. A re-apply attempt may be required once the host recovers and the system is stable.
Procedural Changes: Once the host recovers and the system is stable, a re-apply may be required.
Platform CPU Usage Alarms¶
Alarms may occur indicating platform CPU usage is greater than 90% if a large number of pods are configured using liveness probes that run every second.
Procedural Changes: To mitigate, either reduce the frequency of the liveness probes or increase the number of platform cores.
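For example, raising periodSeconds in a pod's liveness probe reduces how often the kubelet runs the probe (illustrative snippet):
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10   # probe every 10s instead of every second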
Pods Using isolcpus¶
The isolcpus feature currently does not support allocation of thread siblings for cpu requests (i.e. physical thread +HT sibling).
Procedural Changes: For optimal results, if hyperthreading is enabled then isolcpus should be allocated in multiples of two in order to ensure that both SMT siblings are allocated to the same container.
Distributed Cloud Limitations¶
Limitation for Day-2 Deployment Manager operations¶
After completing Day-1 operations and initiating a
Day-2 update for the Host resource, a system config update strategy is
generated. Consequently, alarms indicating the presence of this strategy in the
system are triggered.
If a new Day-2 update is executed immediately after another update, and before the previous strategy is created, it may lead to unexpected results.
Before proceeding with Day-2 operations, use the following Procedural Changes:
Procedural Changes: Wait for any alarms related to the system config update
strategy to clear, which indicates the completion of the strategy. Once the
alarms are cleared execute a new Day-2 update using either reconfiguration or
playbook re-application to apply new changes that were not applied in the
previous update.
Software Management Limitations¶
Deploy does not fail after a system reboot¶
Deploy does not fail after a system reboot.
Procedural Changes: Run the sudo software-deploy-set-failed --hostname/-h <hostname> --confirm utility to manually move the deploy and deploy host to a failed state after a failure caused by a failover, lost power, network outage, etc. You can only run this utility with root privileges on the active controller.
The utility displays the current state and warns the user about the next steps to be taken in case the user needs to continue executing the utility. It also displays the new states and the next operation to be executed.
ISO/SIG Upload to Central Cloud Fails when Using sudo¶
To upload a software patch or major release to the System Controller region
using the --os-region-name SystemController option, the upload command must be
authenticated with Keystone.
Procedural Changes: Do not use sudo with the --os-region-name SystemController
option. For example, avoid using sudo software upload <software release>
command.
Note
When using the -local option, you must provide the absolute path to the
release files.
Note
When using software upload commands with --os-region-name SystemController
to upload a software patch or major release to the System Controller
region, Keystone authentication is required.
Important
Do not use sudo in combination with the --os-region-name SystemController
option. For example, avoid using:
$ sudo software --os-region-name SystemController upload <software-release>
Instead, ensure the command is executed with proper authentication and without sudo.
For more information see, Upload Software Releases Using the CLI
RT Throttling Service not running after Lock/Unlock on Upgraded Subclouds¶
During the upgrade process, the USM post-upgrade script modifies systemd
presets to define which services should be automatically enabled or disabled.
As part of this process, any user-enabled custom services may be set to
“disabled” after the upgrade completes.
Since this change occurs post-upgrade, systemd will not automatically
re-enable the affected service during subsequent lock / unlock operations.
By default, USM disables custom services not explicitly listed in the systemd
presets. Since service definitions can vary between releases, USM relies on
these presets to determine enablement status per host during the upgrade.
If a custom service is not included in the presets, it will be marked as
disabled and remain inactive after lock / unlock even following a successful
upgrade.
Log message during the upgrade:
controller-0 usm-initialize[3061]: info Removed
/etc/systemd/system/multi-user.target.wants/sysctl-rt-sched-apply.service
Procedural Changes: Once the upgrade to StarlingX Release 11.0 completes, run the service-enable and service-start commands for all custom / user services before issuing the first lock / unlock (or reboot).
The enable and start commands for this service are required only once prior to the initial lock / unlock operation. After this step is completed, there is no further need to manually start or enable custom services, as the USM post-upgrade script has already run during the upgrade process.
sw-manager sw-deploy-strategy apply fails¶
sw-manager apply fails to apply the patch.
Note
The Procedural Changes is applicable only if the sw-manager sw-deploy-strategy
fails with the following issues.
To show the operation is in an aborted state due to a timeout, run the following command.
~(keystone_admin)]$ sw-manager sw-deploy-strategy show
Strategy Patch Strategy:
  strategy-uuid:             2082ab5e-a387-4b6a-be23-50ac23317725
  controller-apply-type:     serial
  storage-apply-type:        serial
  worker-apply-type:         serial
  default-instance-action:   stop-start
  alarm-restrictions:        strict
  current-phase:             abort
  current-phase-completion:  100%
  state:                     aborted
  apply-result:              timed-out
  apply-reason:
  abort-result:              success
  abort-reason:
If step 1 fails with ‘timed-out’ results, check if the timeout has occurred due to step-name ‘wait-alarms-clear’ using the command below.
To display results ‘wait for alarm’ that has timed out and run the following command.
~(keystone_admin)]$ sw-manager sw-deploy-strategy show --details
  step-name:        wait-alarms-clear
  timeout:          2400 seconds
  start-date-time:  2024-03-27 19:21:15
  end-date-time:    2024-03-27 20:01:16
  result:           timed-out
To list the 750.006 alarm, use the following command.
~(keystone_admin)]$ fm alarm-list
+----------+----------------------------+---------------------+----------+-------------+
| Alarm ID | Reason Text                | Entity ID           | Severity | Time Stamp  |
+----------+----------------------------+---------------------+----------+-------------+
| 750.006  | A configuration change     | platform-integ-apps | warning  | 2024-03-27T |
|          | requires a reapply of the  |                     |          | 19:21:15.   |
|          | platform-k8s_application=  |                     |          | 471422      |
|          | integ-apps application.    |                     |          |             |
+----------+----------------------------+---------------------+----------+-------------+
VIM orchestrated patch strategy failed with the 900.103 alarm being triggered.
~(keystone_admin)]$ fm alarm-list
+----------+----------------------------+--------------------+----------+----------------+
| Alarm ID | Reason Text                | Entity ID          | Severity | Time Stamp     |
+----------+----------------------------+--------------------+----------+----------------+
| 900.103  | Software patch auto-apply  | orchestration=sw-  | critical | 2024-03-26T03T |
|          | failed                     |                    |          |                |
+----------+----------------------------+--------------------+----------+----------------+
Procedural Changes - Option 1
Check the system for existing alarms using the fm alarm-list command. If the existing alarms can be ignored, use the sw-manager sw-deploy-strategy create --alarm-restrictions relaxed command to ignore any alarms during patch orchestration.
If the alarms were not ignored using the command in step 1 and a patch apply failure is encountered, check if alarm '750.006' is present on the system.
Delete the failed strategy using the following command.
~(keystone_admin)]$ sw-manager sw-deploy-strategy delete
Create a new strategy.
~(keystone_admin)]$ sw-manager sw-deploy-strategy create --alarm-restrictions relaxed
Apply the strategy.
~(keystone_admin)]$ sw-manager sw-deploy-strategy apply
Procedural Changes - Option 2
Create a new strategy (alarm-restrictions are not relaxed).
~(keystone_admin)]$ sw-manager sw-deploy-strategy create
Apply the strategy.
~(keystone_admin)]$ sw-manager sw-deploy-strategy apply
When the sw-deploy-strategy is in progress and at the 'wait-alarms-clear' step (this can be found from 'sw-manager sw-deploy-strategy show --details | grep "step-name"'), check if alarm 750.006 is present, then execute the steps below.
Execute the command.
~(keystone_admin)]$ system application-apply platform-integ-apps
This will re-apply the application and clear the alarm ‘750.006’.
If the alarm still persists after step 3, manually delete the alarm using fm alarm-delete <uuid of alarm 750.006> command.
Platform Services Limitations¶
Kubernetes Pod Core Dump Handler may fail due to a missing Kubernetes token¶
In certain cases the Kubernetes Pod Core Dump Handler may fail due to a missing
Kubernetes token resulting in disabling configuration of the coredump on a per
pod basis and limiting namespace access. If application coredumps are not being
generated, verify if the k8s-coredump token is empty on the configuration file:
/etc/k8s-coredump-conf.json using the following command:
~(keystone_admin)]$ sudo cat /etc/k8s-coredump-conf.json
{
"k8s_coredump_token": ""
}
Procedural Changes: If the k8s-coredump token is empty in the configuration file and the kube-apiserver is verified to be responsive, re-execute the create-k8s-account.sh script to generate the appropriate token after a successful connection to kube-apiserver, using the following commands:
~(keystone_admin)]$ sudo chmod +x /etc/k8s-coredump/create-k8s-account.sh
~(keystone_admin)]$ sudo /etc/k8s-coredump/create-k8s-account.sh
Uploaded Applications Show Incorrect Progress During Platform Upgrade¶
The outputs of the system application-list and system application-show
commands may display status messages indicating that dependencies for uploaded
applications are missing even after those dependencies have been applied or
updated.
Note
If the required dependencies are actually met, this does not prevent the applications from being applied.
Procedural Changes: N/A.
Restart Required for containerd to Apply Config Changes for AIO-SX¶
On AIO-SX systems, certain container images were removed from the registry due to the image garbage collector and changes introduced during the Kubernetes upgrade. This may impact workloads that rely on specific image versions.
Procedural Changes: Increasing the Docker filesystem size helps retain images in
the containerd cache. Additionally, on AIO-SX systems only, it is recommended to
restart containerd after the Kubernetes upgrade, as shown in the sketch below.
For more details, see “Docker Size updates”.
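For example, a minimal sketch of the AIO-SX restart step, assuming containerd is managed as a systemd service on the controller:
~(keystone_admin)]$ sudo systemctl restart containerd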
Limitation Using Regular Expressions in some Parameters while Configuring Stalld¶
Stalld supports regular expressions in some parameters such as:
ignore_threads
ignore_processes
For example, Stalld can be instructed to ignore all threads whose names start with
the keyword runner: stalld --ignore_threads="runner.*"
Procedural Changes: In StarlingX Release 24.09.300, the above functionality
is not available when using the system host-label API; therefore, the user
must explicitly specify the threads to ignore. For example:
system host-label-assign controller-0 starlingx.io/stalld.ignore_threads="runnerA"
BMC Password¶
The BMC password cannot be updated.
Procedural Changes: In order to update the BMC password, de-provision the BMC, and then re-provision it again with the new password.
Configure Stalld¶
It is recommended to configure Stalld during initial setup. If the workload is high, runtime Stalld configuration may not take effect until the node is rebooted.
Procedural Changes: Stalld should be configured during initial system setup.
Sub-Numa Cluster Configuration not Supported on Skylake Servers¶
Sub-Numa cluster configuration is not supported on Skylake servers.
Procedural Changes: For servers with Skylake Gold or Platinum CPUs, Sub-NUMA clustering must be disabled in the BIOS.
Debian Bootstrap¶
On CentOS, bootstrap worked even if dns_servers were not present in localhost.yml. This does not work for Debian bootstrap.
Procedural Changes: For Debian bootstrap, you must configure the dns_servers parameter in the localhost.yml file, as long as no FQDNs are used in the bootstrap overrides. For example:
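A minimal localhost.yml fragment that sets the parameter (the server addresses below are placeholders only, not recommendations):
dns_servers:
  - 8.8.8.8
  - 8.8.4.4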
Installing a Debian ISO¶
The disks and disk partitions need to be wiped before the install. Installing a Debian ISO may fail with a message that the system is in emergency mode if the disks and disk partitions are not completely wiped before the install, especially if the server was previously running a CentOS ISO.
Procedural Changes: When installing a lab for any Debian install, the disks must first be completely wiped using the following procedure before starting an install.
Run the following wipedisk and sgdisk commands for each disk (e.g. sda, sdb, etc.) before any Debian install:
sudo wipedisk
# Show the current partition table
sudo sgdisk -p /dev/sda
# Clear the partition table
sudo sgdisk -o /dev/sda
Note
The above commands must be run before any Debian install. The above commands must also be run if the same lab is used for CentOS installs after the lab was previously running a Debian ISO.
Metrics Server Update across Upgrades¶
After a platform upgrade, the Metrics Server will NOT be automatically updated.
Procedural Changes: To update the Metrics Server, see Install Metrics Server.
Backup and Restore Playbook fails due to self-triggered “backup in progress”/”restore in progress” flag¶
The Backup and Restore playbook may fail due to a self-triggered "backup in progress" / "restore in progress" flag.
Procedural Changes: Retry the backup after manually removing the /etc/platform/.backup_in_progress flag if more than 10 minutes have passed, as indicated by the error message:
"backup has already been started less than x minutes ago.
Wait to start a new backup or manually remove the backup flag in
/etc/platform/.backup_in_progress "
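A minimal sketch of clearing a stale backup flag, using the path from the error message above:
~(keystone_admin)]$ sudo rm -f /etc/platform/.backup_in_progress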
For a “restore in progress” flag, reinstall and retry the restore operation.
Optimized-Edge Limitations¶
Data Streaming Accelerator Error During a USM Upgrade¶
During the upgrade from StarlingX Release 10.0 to 11.0, the DSA init container fails and remains in CrashLoopBackOff until DSA is fully upgraded to StarlingX Release 11.0.
The issue occurs because a new parameter, driver_name, is required to
configure the workqueues in the idxd driver in kernel 6.12.40. This behavior
should not impact the platform upgrade, but DSA may not be configured until
intel-device-plugins-operator is successfully upgraded.
Procedural Changes: To overcome this behavior, the new parameters can be added by applying the following Helm overrides before the upgrade.
For example, create the following override file:
$ cat << 'EOF' > dsa-override.yml
overrideConfig:
  dsa.conf: |
    [
      {
        "dev":"dsaX",
        "read_buffer_limit":0,
        "groups":[
          {
            "dev":"groupX.0",
            "read_buffers_reserved":0,
            "use_read_buffer_limit":0,
            "read_buffers_allowed":8,
            "grouped_workqueues":[
              {
                "dev":"wqX.0",
                "mode":"dedicated",
                "size":16,
                "group_id":0,
                "priority":10,
                "block_on_fault":1,
                "type":"user",
                "name":"dpdk_appX0",
                "driver_name":"user",
                "threshold":15
              }
            ],
            "grouped_engines":[
              {
                "dev":"engineX.0",
                "group_id":0
              }
            ]
          },
          {
            "dev":"groupX.1",
            "read_buffers_reserved":0,
            "use_read_buffer_limit":0,
            "read_buffers_allowed":8,
            "grouped_workqueues":[
              {
                "dev":"wqX.1",
                "mode":"dedicated",
                "size":16,
                "group_id":1,
                "priority":10,
                "block_on_fault":1,
                "type":"user",
                "name":"dpdk_appX1",
                "driver_name":"user",
                "threshold":15
              }
            ],
            "grouped_engines":[
              {
                "dev":"engineX.1",
                "group_id":1
              }
            ]
          },
          {
            "dev":"groupX.2",
            "read_buffers_reserved":0,
            "use_read_buffer_limit":0,
            "read_buffers_allowed":8,
            "grouped_workqueues":[
              {
                "dev":"wqX.2",
                "mode":"dedicated",
                "size":16,
                "group_id":2,
                "priority":10,
                "block_on_fault":1,
                "type":"user",
                "name":"dpdk_appX2",
                "driver_name":"user",
                "threshold":15
              }
            ],
            "grouped_engines":[
              {
                "dev":"engineX.2",
                "group_id":2
              }
            ]
          },
          {
            "dev":"groupX.3",
            "read_buffers_reserved":0,
            "use_read_buffer_limit":0,
            "read_buffers_allowed":8,
            "grouped_workqueues":[
              {
                "dev":"wqX.3",
                "mode":"dedicated",
                "size":16,
                "group_id":3,
                "priority":10,
                "block_on_fault":1,
                "type":"user",
                "name":"dpdk_appX3",
                "driver_name":"user",
                "threshold":15
              }
            ],
            "grouped_engines":[
              {
                "dev":"engineX.3",
                "group_id":3
              }
            ]
          }
        ]
      }
    ]
EOF
Then apply the override file:
$ system helm-override-update intel-device-plugins-operator intel-device-plugins-dsa intel-device-plugins-operator --values dsa-override.yml
Apply the intel-device-plugins-operator application.
$ system application-apply intel-device-plugins-operator
Console Session Issues during Installation¶
After bootstrap and before unlocking the controller, if the console session times
out (or the user logs out), systemd does not work properly. fm, sysinv and
mtcAgent do not initialize.
Procedural Changes: If the console times out or the user logs out between bootstrap and unlock of controller-0, then, to recover from this issue, you must re-install the ISO.
Power Metrics Application in Real Time Kernels¶
When executing the Power Metrics application on real-time kernels, the overall scheduling latency may increase due to inter-core interruptions caused by reading the MSRs (Model-Specific Registers).
Under intensive workloads, the kernel may not be able to handle the MSR-read interruptions, which can stall data collection because the collector is not scheduled on the affected core.
Storage Limitations¶
Limitations of the Rook Ceph Application During Upgrade from Version 1.13 on AIO-DX¶
During the upgrade from v1.13 to v1.16 on an AIO-DX platform, mon quorum may be temporarily disrupted for a few minutes. Once the upgrade completes, all monitors are expected to come back online and quorum should re-establish successfully.
Procedural Changes: N/A.
Rook Ceph Application Limitation During Floating Monitor Removal¶
On an AIO-DX system, removing the floating monitor using system controllerfs-modify ceph-float --functions="" may lead to temporary system instability, including the possibility of uncontrolled swacts.
Procedural Changes: To avoid this issue, ensure that all finalizers are removed from the floating monitor Rook Ceph chart after its deletion, using the following command:
$ kubectl patch hr rook-ceph-floating-monitor -p '{"metadata":{"finalizers":[]}}' --type=merge
Host fails to lock during an upgrade¶
After simultaneously adding multiple OSDs to the Ceph cluster, some OSDs may remain in a configuring state even though the cluster is healthy and the OSDs are deployed. This is an intermittent issue that only occurs on systems with a Ceph storage backend configured with more than one OSD per host. This causes the system host-lock command to fail with the following error:
$ system host-lock controller-<id>
controller-<id> : Rejected: Can not lock a controller with storage devices
in 'configuring' state.
Since system host-lock on the controller fails and the OSD is still in
the configuring state, the upgrade is blocked from proceeding.
Procedural Changes: Use the following steps to proceed with the upgrade.
List the OSDID in the ‘configuring’ state using the following command:
$ system host-stor-list <hostname>
Identify the OSD using the following command:
$ ceph osd find osd.<OSDID>
If the OSD is found, manually update the database inventory using the stor uuid:
$ sudo -u postgres psql -U postgres -d sysinv -c "UPDATE i_istor SET state='configured' WHERE uuid='<STOR_UUID>';"
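After updating the inventory, the state change can be confirmed and the lock retried (a sketch using the commands shown above; the hostname and controller ID are placeholders):
$ system host-stor-list <hostname>
$ system host-lock controller-<id>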
Ceph Daemon Crash and Health Warning¶
After a Ceph daemon crash, an alarm is displayed to verify Ceph health.
Run ceph -s to display the following message:
cluster:
id: <id>
health: HEALTH_WARN
1 daemons have recently crashed
One or more Ceph daemons have crashed, and the crash has not yet been archived or acknowledged by the administrator.
Procedural Changes: To archive the crash, clear the health check warning and the alarm.
List the timestamp/uuid crash-ids for all new crash information:
[sysadmin@controller-0 ~(keystone_admin)]$ ceph crash ls-new
Display details of a saved crash.
[sysadmin@controller-0 ~(keystone_admin)]$ ceph crash info <crash-id>
Archive the crash so it no longer appears in the ceph crash ls-new output.
[sysadmin@controller-0 ~(keystone_admin)]$ ceph crash archive <crash-id>
After archiving the crash, make sure the recent crash is not displayed.
[sysadmin@controller-0 ~(keystone_admin)]$ ceph crash ls-new
If more than one crash needs to be archived run the following command.
[sysadmin@controller-0 ~(keystone_admin)]$ ceph crash archive-all
Rook Ceph Application Limitation¶
After applying the Rook Ceph application in an AIO-DX configuration, the
800.001 - Storage Alarm Condition: HEALTH_WARN alarm may be triggered.
Procedural Changes: Restart the pod of the monitor associated with the slow
operations detected by Ceph, as shown in the sketch below, and check ceph -s to confirm the warning clears.
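A sketch of restarting the affected monitor pod, assuming the default rook-ceph namespace and standard Rook pod labels (the pod name is a placeholder):
$ kubectl -n rook-ceph get pods -l app=rook-ceph-mon
$ kubectl -n rook-ceph delete pod <rook-ceph-mon-pod-name>
$ ceph -s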
Avoid host lock/unlock during application apply¶
Host lock and unlock operations may interfere with applications that are in the applying state.
Procedural Changes: Re-applying or removing / installing applications may be required. Application status can be checked using the system application-list command.
Rook-ceph Application Limitations¶
This section documents the following known limitations that you may encounter with the rook-ceph application, and the procedural changes that you can use to resolve them.
Remove all OSDs in a host
The procedure to remove OSDs will not work as expected when removing all
OSDs from a host. The Ceph cluster gets stuck in HEALTH_WARN state.
Note
Use the Procedural change only if the cluster is stuck in HEALTH_WARN
state after removing all OSDs on a host.
Procedural Changes:
Check the cluster health status.
Check the crushmap tree.
Remove the host(s) that appear empty in the output of the previous command.
Check the cluster health status again (see the sketch after this list).
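A minimal sketch of the steps above, assuming the Ceph CLI is available on the controller (the host name is a placeholder):
$ ceph -s                               # check the cluster health status
$ ceph osd tree                         # check the crushmap tree for empty host buckets
$ ceph osd crush rm <empty-host-name>   # remove each empty host from the crushmap
$ ceph -s                               # confirm the cluster health status again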
Use the rook-ceph apply command when a host with OSDs is in the offline state
Applying the rook-ceph application will not allocate the OSDs correctly if the host is offline.
Note
Use either of the procedural changes below only if the OSDs are not allocated in the Ceph cluster.
Procedural Changes 1:
Check if the OSD is not in the crushmap tree.
Restart the rook-ceph operator pod (see the sketch after this list).
Note
Wait for about 5 minutes to let the operator try to recover the OSDs.
Check if the OSDs have been added to the crushmap tree.
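A sketch of restarting the operator pod, assuming the default rook-ceph namespace and the standard operator deployment name:
$ kubectl -n rook-ceph rollout restart deployment rook-ceph-operator
$ ceph osd tree   # after about 5 minutes, confirm the OSDs appear in the crushmap tree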
Procedural Changes 2:
Check if the OSD is not in the crushmap tree, or if it is in the crushmap tree but not allocated to the correct location (within a host).
Lock the host.
Wait for the host to be locked.
Get the list of OSDs from the host inventory.
Remove the OSDs from the inventory.
Reapply the rook-ceph application.
Wait for the OSD prepare pods to be recreated.
Add the OSDs back to the inventory.
Reapply the rook-ceph application.
Wait for the new OSD pods to be created and running.
Critical alarm 800.001 after Backup and Restore on AIO-SX Systems¶
A Critical alarm 800.001 may be triggered after running the Restore Playbook. The alarm details are as follows:
~(keystone_admin)]$ fm alarm-list
+-------+----------------------------------------------------------------------+--------------------------------------+----------+---------------+
| Alarm | Reason Text | Entity ID | Severity | Time Stamp |
| ID | | | | |
+-------+----------------------------------------------------------------------+--------------------------------------+----------+---------------+
| 800. | Storage Alarm Condition: HEALTH_ERR. Please check 'ceph -s' for more | cluster= | critical | 2024-08-29T06 |
| 001 | details. | 96ebcfd4-3ea5-4114-b473-7fd0b4a65616 | | :57:59.701792 |
| | | | | |
+-------+----------------------------------------------------------------------+--------------------------------------+----------+---------------+
Procedural Changes: To clear this alarm run the following commands:
Note
Applies only to AIO-SX systems.
FS_NAME=kube-cephfs
METADATA_POOL_NAME=kube-cephfs-metadata
DATA_POOL_NAME=kube-cephfs-data
# Ensure that the Ceph MDS is stopped
sudo rm -f /etc/pmon.d/ceph-mds.conf
sudo /etc/init.d/ceph stop mds
# Recover MDS state from filesystem
ceph fs new ${FS_NAME} ${METADATA_POOL_NAME} ${DATA_POOL_NAME} --force
# Try to recover from some common errors
sudo ceph fs reset ${FS_NAME} --yes-i-really-mean-it
cephfs-journal-tool --rank=${FS_NAME}:0 event recover_dentries summary
cephfs-journal-tool --rank=${FS_NAME}:0 journal reset
cephfs-table-tool ${FS_NAME}:0 reset session
cephfs-table-tool ${FS_NAME}:0 reset snap
cephfs-table-tool ${FS_NAME}:0 reset inode
sudo /etc/init.d/ceph start mds
Error installing Rook Ceph on AIO-DX with host-fs-add before controllerfs-add¶
When you provision controller-0 manually prior to unlock, the following sequence of commands fails:
~(keystone_admin)]$ system storage-backend-add ceph-rook --confirmed
~(keystone_admin)]$ system host-fs-add controller-0 ceph=20
~(keystone_admin)]$ system controllerfs-add ceph-float=20
The following error occurs when you run the controllerfs-add command:
“Failed to create controller filesystem ceph-float: controllers have pending LVG updates, please retry again later”.
Procedural Changes: To avoid this issue, run the commands in the following sequence:
~(keystone_admin)]$ system storage-backend-add ceph-rook --confirmed
~(keystone_admin)]$ system controllerfs-add ceph-float=20
~(keystone_admin)]$ system host-fs-add controller-0 ceph=20
Intermittent installation of Rook-Ceph on Distributed Cloud¶
If the rook-ceph installation fails, it is due to
ceph-mgr-provision not being provisioned correctly.
Procedural Changes: It is recommended to remove the failed application using the system application-remove rook-ceph --force command and then re-apply it to restart the rook-ceph installation, as shown below.
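For example (a sketch; the application is re-applied after the forced removal):
~(keystone_admin)]$ system application-remove rook-ceph --force
~(keystone_admin)]$ system application-apply rook-ceph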
Storage Nodes are not considered part of the Kubernetes cluster¶
When running the system kube-host-upgrade-list command, the output displays only controller and worker hosts that have control-plane and kubelet components. Storage nodes do not have any of these components and so are not considered part of the Kubernetes cluster.
Procedural Changes: Do not include Storage nodes as part of the Kubernetes upgrade.
Optimization with a Large number of OSDs¶
As Storage nodes are not optimized, you may need to optimize your Ceph configuration for balanced operation across deployments with a high number of OSDs. This results in an alarm being generated even if the installation succeeds.
800.001 - Storage Alarm Condition: HEALTH_WARN. Please check ‘ceph -s’
Procedural Changes: To optimize your storage nodes with a large number of OSDs, it is recommended to use the following commands:
~(keystone_admin)]$ ceph osd pool set kube-rbd pg_num 256
~(keystone_admin)]$ ceph osd pool set kube-rbd pgp_num 256
Storage Nodes Recovery on Power Outage¶
Storage nodes take 10-15 minutes longer to recover in the event of a full power outage.
Procedural Changes: N/A.
Ceph Recovery on an AIO-DX System¶
In certain instances Ceph may not recover on an AIO-DX system and remains in the down state when viewed using the ceph -s command; for example, if an OSD comes up after a controller reboot and a swact occurs, or due to other possible causes such as hardware failure of the disk or the entire host, a power outage, or a switch going down.
Procedural Changes: There is no specific command or procedure that solves the problem for all possible causes. Each case needs to be analyzed individually to find the root cause of the problem and the solution.
Restrictions on the Size of Persistent Volume Claims (PVCs)¶
There is a limitation on the size of Persistent Volume Claims (PVCs) that can be used for all StarlingX Releases.
Procedural Changes: It is recommended that all PVCs should be a minimum size of 1GB. For more information, see, https://bugs.launchpad.net/starlingx/+bug/1814595.
platform-integ-apps application update aborted after removing StarlingX 9.0¶
When StarlingX 9.0 is removed, the platform-integ-apps application is
downgraded, and a message will be displayed:
ceph-csi failure:release rbd-provisioner: Failed during apply :Helm upgrade
failed: cannot patch "rbd.csi.ceph.com" with kind CSIDriver: CSIDriver.storage.k8s.io
"rbd.csi.ceph.com" is invalid: spec.fsGroupPolicy: Invalid value:
"ReadWriteOnceWithFSType": field is immutable.
Procedural Changes: To resolve this problem do the following:
Remove the Container Storage Interface (CSI) drivers using the following commands:
~(keystone_admin)]$ kubectl delete csidriver cephfs.csi.ceph.com
~(keystone_admin)]$ kubectl delete csidriver rbd.csi.ceph.com
Update the application so that the correct version is installed.
~(keystone_admin)]$ system application-update /usr/local/share/applications/helm/platform-integ-apps-22.12-72.tgz
NetApp Permission Error¶
When installing/upgrading to Trident 20.07.1 and later, and Kubernetes version 1.17 or higher, new volumes created will not be writable if:
The storageClass does not specify parameter.fsType
The pod using the requested PVC has an fsGroup enforced as part of a security constraint
Procedural Changes: Specify parameter.fsType in the localhost.yml file under
netapp_k8s_storageclasses parameters as below.
The following example shows a minimal configuration in localhost.yml:
ansible_become_pass: xx43U~a96DN*m.?
trident_setup_dir: /tmp/trident
netapp_k8s_storageclasses:
  - metadata:
      name: netapp-nas-backend
    provisioner: netapp.io/trident
    parameters:
      backendType: "ontap-nas"
      fsType: "nfs"
netapp_k8s_snapshotstorageclasses:
  - metadata:
      name: csi-snapclass
See: Configure an External NetApp Deployment as the Storage Backend
Failure to clean up platform-integ-apps files/Helm release¶
If the System Controller does not have Ceph configured,
platform-integ-apps is not installed and the images are not
automatically downloaded to registry.central when upgrading the platform.
The missing images on the subclouds are:
registry.central:9001/docker.io/openstackhelm/ceph-config-helper:ubuntu_focal_18.2.0-1-20231013
registry.central:9001/quay.io/cephcsi/cephcsi:v3.10.1
registry.central:9001/registry.k8s.io/sig-storage/csi-attacher:v4.4.2
registry.central:9001/registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.1
registry.central:9001/registry.k8s.io/sig-storage/csi-provisioner:v3.6.2
registry.central:9001/registry.k8s.io/sig-storage/csi-resizer:v1.9.2
registry.central:9001/registry.k8s.io/sig-storage/csi-snapshotter:v6.3.2
If the System Controller does not have Ceph configured and the subclouds have Ceph configured, then the images need to be manually uploaded to the registry.central before starting the upgrade of the subclouds.
To push the images to the registry.central, run the following commands on the System Controller:
# Change the variables according to the setup
REGISTRY_PREFIX="server:port/path"
REGISTRY_USERNAME="admin"
REGISTRY_PASSWORD="password"
sudo docker login registry.local:9001 --username ${REGISTRY_USERNAME} --password ${REGISTRY_PASSWORD}
for image in \
docker.io/openstackhelm/ceph-config-helper:ubuntu_focal_18.2.0-1-20231013 \
registry.k8s.io/sig-storage/csi-attacher:v4.4.2 \
registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.1 \
registry.k8s.io/sig-storage/csi-provisioner:v3.6.2 \
registry.k8s.io/sig-storage/csi-resizer:v1.9.2 \
registry.k8s.io/sig-storage/csi-snapshotter:v6.3.2 \
quay.io/cephcsi/cephcsi:v3.10.1
do
sudo docker pull ${REGISTRY_PREFIX}/${image}
sudo docker tag ${REGISTRY_PREFIX}/${image} registry.local:9001/${image}
sudo docker push registry.local:9001/${image}
done
Procedural Changes: If the subcloud upgrade finishes without the correct images pushed to registry.central, it is still possible to recover the system by following the steps below.
After pushing the images to the registry.central, each subcloud must be recovered with the following steps (these commands should be run on the Subcloud):
source /etc/platform/openrc
# Remove old app manually
sudo rm -rf /opt/platform/helm/22.12/platform-integ-apps;
sudo rm -rf /opt/platform/fluxcd/22.12/platform-integ-apps;
sudo -u postgres psql postgres -d sysinv -c "DELETE from kube_app WHERE name = 'platform-integ-apps';";
sudo sm-restart service sysinv-inv && sudo sm-restart service sysinv-conductor;
sleep 15; # Wait for services to restart
system application-upload /usr/local/share/applications/helm/platform-integ-apps-22.12-72.tgz;
sleep 15; # Wait for the upload to fail (it is expected to fail here)
system application-delete platform-integ-apps;
system application-upload /usr/local/share/applications/helm/platform-integ-apps-22.12-72.tgz;
sleep 10; # Wait for the upload to succeed
system application-apply platform-integ-apps;
Note
The images need to be pushed to the registry.central registry before upgrading the subclouds.
Operating System Limitations¶
BPF is disabled¶
BPF cannot be used in the PREEMPT_RT/low-latency kernel due to the inherent incompatibility between PREEMPT_RT and BPF; see https://lwn.net/Articles/802884/.
Some packages might be affected when PREEMPT_RT and BPF are used together. This includes, but is not limited to, the following packages:
libpcap
libnet
dnsmasq
qemu
nmap-ncat
libv4l
elfutils
iptables
tcpdump
iproute
gdb
valgrind
kubernetes
cni
strace
mariadb
libvirt
dpdk
libteam
libseccomp
binutils
libbpf
dhcp
lldpd
containernetworking-plugins
golang
i40e
ice
Procedural Changes: It is recommended not to use BPF with the real-time kernel. If required, it can still be used, for example, for debugging only.
Control Group parameter¶
The control group (cgroup) parameter kmem.limit_in_bytes has been deprecated, and results in the following message in the kernel’s log buffer (dmesg) during boot-up and/or during the Ansible bootstrap procedure: “kmem.limit_in_bytes is deprecated and will be removed. Please report your use case to linux-mm@kvack.org if you depend on this functionality.” This parameter is used by a number of software packages in StarlingX, including, but not limited to, systemd, docker, containerd, libvirt etc.
Procedural Changes: N/A. This is only a warning message about the future deprecation of an interface.
Subcloud Reconfig may fail due to missing inventory file¶
The dcmanager subcloud reconfig command may fail due to a missing file /var/opt/dc/ansible/<subcloud_name>_inventory.yml.
Procedural Changes: Provide the floating OAM IP address of the subcloud using the "--bootstrap-address" argument. For example:
~(keystone_admin)]$ dcmanager subcloud reconfig --sysadmin-password <password> --deploy-config deployment-config.yaml --bootstrap-address <floating_OAM_IP_address> <subcloud_name>
Horizon GUI Limitations¶
Unable to create Kubernetes Upgrade Strategy for Subclouds using Horizon GUI¶
When creating a Kubernetes Upgrade Strategy for a subcloud using the Horizon GUI, it fails and displays the following error:
kube upgrade pre-check: Invalid kube version(s), left: (v1.24.4), right:
(1.24.4)
Procedural Changes: Use the following steps to create the strategy:
Procedure
Create a strategy for subcloud Kubernetes upgrade using the dcmanager kube-upgrade-strategy create --to-version <version> command.
Apply the strategy using the Horizon GUI or the CLI using the command dcmanager kube-upgrade-strategy apply.
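For example (a sketch; the target version is a placeholder and must be one of the Kubernetes versions supported by the subcloud):
~(keystone_admin)]$ dcmanager kube-upgrade-strategy create --to-version v1.32.2
~(keystone_admin)]$ dcmanager kube-upgrade-strategy apply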
Apply a Kubernetes Upgrade Strategy using Horizon
Procedural Changes: N/A.
k8s-coredump only supports lowercase annotation¶
Creating a K8s pod core dump fails when the
starlingx.io/core_pattern parameter is set using uppercase characters in the
pod manifest. This results in the pod being unable to find the target directory
and failing to create the coredump file.
Procedural Changes: The starlingx.io/core_pattern parameter only accepts
lower case characters for the path and file name where the core dump is saved.
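For example, a hypothetical pod manifest fragment with a lowercase path and file name (the annotation value shown is an illustration only, not a mandated format):
metadata:
  annotations:
    starlingx.io/core_pattern: "/var/crash/core.%e.%p"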
Huge Page Limitation on Postgres¶
Debian postgres version supports huge pages, and by default uses 1 huge page if it is available on the system, decreasing by 1 the number of huge pages available.
Procedural Changes: The huge page setting must be disabled by setting
/etc/postgresql/postgresql.conf: "huge_pages = off". The postgres service
needs to be restarted using the Service Manager sudo sm-restart service postgres
command.
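A sketch of the change, assuming the configuration file path given above:
sudo sed -i 's/^#\?huge_pages.*/huge_pages = off/' /etc/postgresql/postgresql.conf
sudo sm-restart service postgres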
Warning
The Procedural Changes is not persistent, therefore, if the host is rebooted it will need to be applied again. This will be fixed in a future release.
Quartzville Tools¶
The celo64e and nvmupdate64e commands are not supported in StarlingX Release 9.0 due to a known issue in the Quartzville tools that crashes the host.
Procedural Changes: Reboot the host using the boot screen menu.
Deprecated Notices in Stx 11.0¶
In-tree and Out-of-tree drivers¶
In StarlingX Release 11.0, only the out-of-tree versions of the Intel ice,
i40e, and iavf drivers are supported. Switching between in-tree and
out-of-tree driver versions is not supported.
The out_of_tree_drivers service parameter and the out-of-tree-drivers boot
parameter are deprecated and should not be modified to switch to in-tree driver
versions. The values will be ignored, and the system will always use the
out-of-tree versions of the Intel ice, i40e, and iavf drivers.
Kubernetes Root CA bootstrap overrides¶
The overrides k8s_root_ca_cert, k8s_root_ca_key, and apiserver_cert_sans
will be deprecated in a future release. External connections to kube-apiserver
are now routed through a proxy that identifies itself using the REST API/GUI
certificate issued by the platform issuer (system-local-ca).
kubernetes-power-manager¶
Intel has stopped support for the kubernetes-power-manager application. This
is still being supported by StarlingX and will be removed in a future release.
The cpu_busy_cycles metric is deprecated and must be replaced with
cpu_c0_state_residency_percent for continued usage
(if the metrics are customized via Helm overrides).
For more information, see Configurable Power Manager.
Bare metal Ceph¶
Host-based Ceph is deprecated in StarlingX Release 11.0. Adoption of Rook-Ceph is recommended for new deployments to avoid the service disruption introduced by a Bare Metal Ceph to Rook migration.
Static Configuration for Hardware Accelerator Cards¶
Static configuration for hardware accelerator cards is deprecated in StarlingX Release 24.09.00 and will be discontinued in future releases. Use the SR-IOV FEC Operator instead.
See Switch between Static Method Hardware Accelerator and SR-IOV FEC Operator
N3000 FPGA Firmware Update Orchestration¶
The N3000 FPGA Firmware Update Orchestration has been deprecated in StarlingX Release 24.09.00. For more information, see N3000 FPGA Overview.
show-certs.sh Script¶
The show-certs.sh script that is available when you ssh to a controller is
deprecated in StarlingX Release 11.0.
The new response format of the 'system certificate-list' REST API / CLI now
provides the same information as show-certs.sh.
Kubernetes APIs¶
Kubernetes APIs that will be removed in K8s 1.27 are listed at the following link:
See: https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-27
ptp-notification v1 API¶
The ptp-notification v1 API can still be used in StarlingX Release 11.0. The v1 API will be removed in a future release and only the O-RAN Compliant Notification API (ptp-notification v2 API) will be supported.
Note
It is recommended that all new deployments use the O-RAN Compliant Notification API (ptp-notification v2 API).
Removed in Stx 11.0¶
MacVTap Interfaces¶
MacVTap interfaces for KubeVirt VMs are not supported in StarlingX Release 11.0 and future releases.
Release Information for other versions¶
You can find details about a release on the specific release page at: https://wiki.openstack.org/wiki/StarlingX/Release_Plan#List_of_Releases.
Version          | Release Date | Notes                                                        | Status
StarlingX R11.0  | 2025-11      | https://docs.starlingx.io/r/stx.11.0/releasenotes/index.html | Maintained
StarlingX R10.0  | 2025-02      | https://docs.starlingx.io/r/stx.10.0/releasenotes/index.html | Maintained
StarlingX R9.0   | 2024-03      |                                                              | EOL
StarlingX R8.0   | 2023-02      |                                                              | EOL
StarlingX R7.0   | 2022-07      |                                                              | EOL
StarlingX R6.0   | 2021-12      |                                                              | EOL
StarlingX R5.0.1 | 2021-09      |                                                              | EOL
StarlingX R5.0   | 2021-05      |                                                              | EOL
StarlingX R4.0   | 2020-08      |                                                              | EOL
StarlingX R3.0   | 2019-12      |                                                              | EOL
StarlingX R2.0.1 | 2019-10      |                                                              | EOL
StarlingX R2.0   | 2019-09      |                                                              | EOL
StarlingX R1.0   | 2018-10      |                                                              | EOL
StarlingX follows the release maintenance timelines in the StarlingX Release Plan.
The Status column uses OpenStack maintenance phase definitions.