Troubleshooting NetApp Storage Issues¶
This section describes common issues that occur when using NetApp storage with StarlingX OpenStack.
NetApp TLS Certificate Not Found¶
Cinder volume or backup pods fail to start.
This issue occurs when the controller node does not have the NetApp TLS certificate file, or the application deployment did not create the corresponding Kubernetes secret.
Verify that the certificate file exists on the controller and that the Kubernetes secret is present.
$ ls -la /var/opt/openstack/certs/netapp.pem
$ kubectl -n openstack get secret netapp-ca-cert
If the secret does not exist, copy the certificate file to the expected path and reapply the StarlingX OpenStack application.
$ system application-apply StX-openstack
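If the certificate file is missing, copying it into place before the re-apply might look like the following sketch. The source path /home/sysadmin/netapp.pem is an assumption for illustration; use the actual location of your NetApp CA certificate:

```shell
# Hypothetical source location; substitute the path to your NetApp CA certificate
$ sudo mkdir -p /var/opt/openstack/certs
$ sudo cp /home/sysadmin/netapp.pem /var/opt/openstack/certs/netapp.pem
$ sudo chmod 600 /var/opt/openstack/certs/netapp.pem
```

The application apply then recreates the Kubernetes secret from the certificate file.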
Trident Backends Not Discovered¶
Cinder does not display any NetApp storage backends.
This condition typically occurs when Trident backends are not installed correctly, are reporting an unhealthy state, or when the required NetApp StorageClasses have not been created.
Verify the health of Trident backends and backend configurations:
$ kubectl -n trident get tridentbackends
$ kubectl -n trident get tridentbackendconfigs
Verify that NetApp StorageClasses exist:
$ kubectl get sc | grep netapp
If the backends or StorageClasses are missing, reinstall or correct the Trident configuration before reapplying the StarlingX OpenStack application.
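As a reference point when correcting the configuration, a minimal ontap-nas backend definition typically looks like the following. All values are placeholders for your environment:

```yaml
# Illustrative TridentBackendConfig; substitute the LIF addresses, SVM name,
# and credentials secret used in your environment.
apiVersion: trident.netapp.io/v1
kind: TridentBackendConfig
metadata:
  name: backend-ontap-nas
  namespace: trident
spec:
  version: 1
  storageDriverName: ontap-nas
  managementLIF: <MANAGEMENT_LIF_IP>
  dataLIF: <DATA_LIF_IP>
  svm: <SVM_NAME>
  credentials:
    name: backend-ontap-nas-secret
```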
Cinder Volume or Backup Pod Errors¶
Cinder volume or backup pods start but encounter runtime errors or fail during volume or backup operations.
These errors typically result from backend connectivity issues, authentication or credential problems, or invalid Cinder configuration.
Inspect the logs of the affected pods to determine the underlying cause:
$ kubectl -n openstack logs -l application=cinder,component=volume --tail=100
$ kubectl -n openstack logs -l application=cinder,component=backup --tail=100
Correct any reported configuration or backend errors and reapply the application if required.
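To confirm whether the configured backends are up from Cinder's point of view, the volume service list can help. The host name in the sample output is illustrative; actual names depend on your backend configuration:

```shell
$ openstack volume service list
# A healthy backend is listed with Status "enabled" and State "up", e.g.:
# | cinder-volume | <volume-worker-host>@<backend-name> | nova | enabled | up | ... |
```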
NFS Mount Failures¶
Cinder or Nova pods fail with NFS mount errors, and volumes or instances fail to start.
This issue typically occurs when network connectivity, routing, or NetApp NFS export policy configuration is incorrect.
Verify the following:
Compute nodes can reach the NetApp Data LIF.
The NetApp export policy explicitly allows read/write and superuser access for the compute node source subnet.
The specified NFS export path exists on the NetApp SVM and is correctly configured.
Update the export policy or network configuration as required to resolve the issue.
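On the ONTAP side, the export policy rules applied to the share can be inspected as follows. The SVM and policy names are placeholders; verify that rwrule and superuser include the compute node subnet:

```shell
$ netapp-cluster::> vserver export-policy rule show -vserver <SVM_NAME> \
    -policyname <POLICY_NAME> -fields clientmatch,rorule,rwrule,superuser
```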
iSCSI or Fibre Channel Session Issues¶
SAN volumes fail to attach, or Nova instances cannot access attached volumes.
This issue typically occurs when iSCSI sessions are not established, Fibre Channel paths are unavailable, or multipath devices are not configured correctly on the compute nodes.
Verify that active iSCSI sessions exist:
$ sudo iscsiadm -m session
Verify that multipath devices are present and healthy:
$ sudo multipath -ll
Resolve any SAN connectivity, zoning, or multipath configuration issues, then retry the volume operation.
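If no iSCSI sessions are present, discovering and logging in to the target manually can confirm basic connectivity. The portal address is a placeholder for your iSCSI Data LIF:

```shell
# Discover targets exposed by the NetApp iSCSI Data LIF
$ sudo iscsiadm -m discovery -t sendtargets -p <ISCSI_DATA_LIF_IP>:3260

# Log in to the discovered target
$ sudo iscsiadm -m node -p <ISCSI_DATA_LIF_IP>:3260 --login
```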
PVC Stuck in Pending State¶
PVCs remain in the Pending state and do not bind to a PersistentVolume.
This issue typically occurs when the storage provisioner cannot create a volume due to backend errors, missing or misconfigured StorageClasses, or insufficient permissions.
Describe the affected PVC to review detailed provisioning information:
$ kubectl -n openstack describe pvc <PVC_NAME>
Inspect Kubernetes events for provisioning failures:
$ kubectl get events -n openstack \
--field-selector reason=ProvisioningFailed
Resolve the reported issue, such as correcting a missing StorageClass or fixing a backend error, and then retry the operation.
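If the events point at the provisioner itself, the Trident controller logs usually contain the underlying backend error. The label selector shown assumes a standard Trident CSI deployment:

```shell
$ kubectl -n trident logs -l app=controller.csi.trident.netapp.io --tail=100
```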
Glance PVC Resize Failure During Application Apply¶
The system application-apply StX-openstack command fails after you increase
the Glance volume.size value in the Helm overrides.
This failure occurs when the StorageClass backing the Glance PersistentVolumeClaim does not support volume expansion.
A typical error appears as:
error expanding pvc: StorageClass "<STORAGE_CLASS_NAME>" does not allow volume expansion
Identify the StorageClass used by the Glance PVC:
$ kubectl -n openstack get pvc -l application=glance \
-o jsonpath='{.items[0].spec.storageClassName}'
Verify that the StorageClass allows volume expansion:
$ kubectl get sc <STORAGE_CLASS_NAME> \
-o jsonpath='{.allowVolumeExpansion}'
If the output is not true, update the Trident StorageClass configuration to
set allowVolumeExpansion: true. Reinstall or update the NetApp backend
configuration as required, then retry the Glance override update and re-run
system application-apply.
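Because allowVolumeExpansion is one of the few StorageClass fields that can be changed in place, patching the existing object is often sufficient. The StorageClass name is a placeholder:

```shell
$ kubectl patch sc <STORAGE_CLASS_NAME> \
    -p '{"allowVolumeExpansion": true}'
```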
Limitations and Known Issues¶
Cinder Snapshots versus NetApp Snapshots (iSCSI and FC)¶
When you create a Cinder volume snapshot on a NetApp iSCSI or FC backend, NetApp creates a LUN clone (FlexClone) rather than a traditional ONTAP FlexVol Snapshot or a new FlexVol volume. This behavior is expected.
Why This Occurs
With ONTAP iSCSI and FC backends, Cinder volumes are backed by ONTAP LUNs within a FlexVol. ONTAP Snapshots operate at the FlexVol level and not at the individual LUN level, so they cannot be mapped directly to a Cinder snapshot (which targets a single volume). Instead, the Cinder NetApp driver uses ONTAP’s FlexClone technology to create a space-efficient LUN clone that represents the Cinder snapshot.
This behavior is documented in the upstream Cinder documentation under Cinder Snapshots versus NetApp Snapshots.
What to Expect on the NetApp Side
Given the following OpenStack operations:
# Create a volume
$ openstack volume create --image $IMAGE --size 2 cirros-iscsi-vol
# Create a snapshot
$ openstack volume snapshot create \
--volume cirros-iscsi-vol cirros-iscsi-vol-snap1
# Create a new volume from the snapshot
$ openstack volume create \
--snapshot cirros-iscsi-vol-snap1 --size 2 cirros-iscsi-vol-from-snap1
NetApp creates three LUN objects in the same FlexVol:
| LUN Path | is-clone | Description |
|---|---|---|
| /vol/<flexvol>/<volume-uuid> | false | Original Cinder volume |
| /vol/<flexvol>/snapshot-<snapshot-id> | true | Cinder snapshot (LUN clone of parent) |
| /vol/<flexvol>/<new-volume-uuid> | true | Volume created from snapshot (LUN clone) |
Verify this behavior on the NetApp CLI:
$ netapp-cluster::> lun show -vserver <SVM_NAME> -fields is-clone
FlexClone License Requirement
This behavior requires the FlexClone license to be enabled on the NetApp cluster. Without the license, snapshot and clone operations fail.
Verify the license status:
$ netapp-cluster::> license show -package flexclone
Note
For NFS backends, Cinder snapshots use file-level FlexClone (cloning the NFS file that represents the volume). For FlexGroup volumes, snapshot operations fall back to the generic NFS implementation due to current FlexClone limitations.
Volume Attachment Desynchronization¶
In some scenarios, a volume may appear as available in Cinder while still showing as attached in Nova. This can occur when a volume attachment operation is interrupted midway (for example, due to network issues with the storage backend, API pod restarts, or parallel attachment timeouts). This section covers how to identify the problem and how to clean up the orphaned database entry to restore normal operation.
Symptoms
Cinder reports the volume as available:
$ openstack volume show <volume-uuid> -> status: available
The same volume appears under volumes_attached in Nova:
$ openstack server show <server-uuid> -> volumes_attached: id='<volume-uuid>'
Attempting to reattach the volume fails with "already attached" errors.
Attempting to detach fails because Cinder has no record of the attachment.
Horizon may display "Something Went Wrong" if the volume is deleted while in this state.
Cause
Nova writes a BDM entry before completing the Cinder attachment. If the operation is interrupted, Nova retains the BDM while Cinder does not record the attachment. This is a known upstream issue (Bug #2116931).
Common triggers include:
Storage backend connectivity issues (for example, iSCSI “No route to host” errors) causing attachment operations to take longer than expected
Parallel volume attachments to the same VM causing lock contention and RPC timeouts
Nova API pod restarts during in-flight attachment operations
Workaround
Identify and remove the orphaned BDM entry from Nova’s database.
Procedure
Retrieve the MariaDB password:
$ kubectl get secret -n openstack mariadb-dbadmin-password \
-o jsonpath='{.data.MYSQL_DBADMIN_PASSWORD}' | base64 -d; echo
Identify a running MariaDB pod:
$ MARIADB_POD=$(kubectl get pods -n openstack \
-l component=server,application=mariadb \
--field-selector=status.phase=Running \
-o jsonpath='{.items[0].metadata.name}')
Check for the orphaned BDM in Nova:
$ kubectl exec -n openstack $MARIADB_POD -- \
mysql -u root -p"<DB_PASSWORD>" nova -e \
"select * from block_device_mapping \
where volume_id='<VOLUME_ID>' and deleted=0;"
Confirm that Cinder has no matching attachment:
$ kubectl exec -n openstack $MARIADB_POD -- \
mysql -u root -p"<DB_PASSWORD>" cinder -e \
"select * from volume_attachment \
where volume_id='<VOLUME_ID>' and deleted=0;"
If no rows are returned, the desync is confirmed.
Remove the orphaned BDM entry from Nova’s database:
$ kubectl exec -n openstack $MARIADB_POD -- \
mysql -u root -p"<DB_PASSWORD>" nova -e \
"update block_device_mapping \
set deleted=1 where volume_id='<VOLUME_ID>' and deleted = 0;"
After running this command, the volume will no longer appear in the server’s volumes_attached list and can be attached again normally.
Recommended Practices
Attach volumes to a VM one at a time (sequentially) rather than in parallel
Ensure stable network connectivity to the storage backend before performing volume operations
Note
This workaround resolves the issue only temporarily. A permanent fix is tracked in upstream OpenStack Bug 2116931.
Glance Storage Backend Migration Not Supported¶
StarlingX OpenStack does not support live migration of Glance images between
storage backends, for example, switching from PVC-backed FC storage to the
Cinder store, or between any two backend types. Changing the
storage_conf.volume_storage_class_priority value in the Glance overrides and
re-applying the application reconfigures the Glance service to use the new
backend, but existing images stored on the previous backend are not migrated
automatically and become inaccessible.
When This Applies
PVC → Cinder store
Cinder store → PVC
PVC using one StorageClass → PVC using another StorageClass
Workaround
To change the Glance storage backend, you must manually save images, remove the application, reconfigure storage, and recreate the images on the new backend.
Procedure
Identify and save all Glance images.
# List existing images
$ openstack image list --status active

# Save each image to a backup location
$ openstack image save --file /home/sysadmin/glance-backup/<IMAGE_NAME>.raw <IMAGE_ID>

# Remove the application
$ source /etc/platform/openrc
$ system application-remove StX-openstack

# Update the Glance overrides and re-apply
$ system helm-override-update --reuse-values --values glance.yaml StX-openstack glance openstack
$ system application-apply StX-openstack

# Recreate the images on the new backend
$ openstack image create --disk-format <DISK_FORMAT> \
    --container-format <CONTAINER_FORMAT> \
    --file /path/<IMAGE_NAME>.raw <IMAGE_NAME>
Impact¶
OpenStack services are unavailable during application removal and re-apply
Running VMs continue to operate
Image UUIDs change and must be updated in referencing artifacts
IPv6 Inline NFS Volume Mounts Not Supported¶
Kubernetes inline NFS volumes do not support IPv6 NFS server addresses. This affects Nova ephemeral storage using the NFS Shares backend with IPv6 Data LIFs.
Background¶
When using inline NFS, the Pod spec includes:
volumes:
- name: nova-instances
nfs:
server: "[NetApp NFS Data LIF address]"
path: /openstack_instances
Inline NFS volumes cannot pass mount options such as proto=tcp6.
Why IPv6 Fails¶
IPv6 requires proto=tcp6 and nfsvers=4. Inline mounts inherit defaults
from /etc/nfsmount.conf, which typically use IPv4 settings.
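For comparison, a working IPv6 NFS mount performed manually must pass these options explicitly, which an inline NFS volume cannot express. The address and mount point below are placeholders:

```shell
$ sudo mount -t nfs -o proto=tcp6,nfsvers=4 \
    '[<IPV6_DATA_LIF>]:/openstack_instances' /mnt/test
```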
Recommended Alternative¶
Use PVC-backed ephemeral storage for IPv6 environments. Trident configures IPv6 mount options at the CSI layer.
storage_conf:
volume_storage_class_priority:
- pvc
pvc:
volume:
size: 100Gi
storage_class_priority:
- netapp-nfs