Troubleshooting NetApp Storage Issues

This section describes common issues that occur when using NetApp storage with StarlingX OpenStack.

NetApp TLS Certificate Not Found

Cinder volume or backup pods fail to start.

This issue occurs when the controller node does not have the NetApp TLS certificate file, or the application deployment did not create the corresponding Kubernetes secret.

Verify that the certificate file exists on the controller and that the Kubernetes secret is present.

$ ls -la /var/opt/openstack/certs/netapp.pem
$ kubectl -n openstack get secret netapp-ca-cert

If the secret does not exist, copy the certificate file to the expected path and reapply the StarlingX OpenStack application.
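For example, copying the certificate into place might look like the following (the source filename is illustrative; use your actual certificate file):

```shell
$ sudo mkdir -p /var/opt/openstack/certs
$ sudo cp /home/sysadmin/netapp-ca.pem /var/opt/openstack/certs/netapp.pem
```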

$ system application-apply stx-openstack

Trident Backends Not Discovered

Cinder does not display any NetApp storage backends.

This condition typically occurs when Trident backends are not installed correctly, are reporting an unhealthy state, or when the required NetApp StorageClasses have not been created.

Verify the health of Trident backends and backend configurations:

$ kubectl -n trident get tridentbackends
$ kubectl -n trident get tridentbackendconfigs

Verify that NetApp StorageClasses exist:

$ kubectl get sc | grep netapp

If the backends or StorageClasses are missing, reinstall or correct the Trident configuration before reapplying the StarlingX OpenStack application.

Cinder Volume or Backup Pod Errors

Cinder volume or backup pods start but encounter runtime errors or fail during volume or backup operations.

These errors typically result from backend connectivity issues, authentication or credential problems, or invalid Cinder configuration.

Inspect the logs of the affected pods to determine the underlying cause:

$ kubectl -n openstack logs -l application=cinder,component=volume --tail=100
$ kubectl -n openstack logs -l application=cinder,component=backup --tail=100

Correct any reported configuration or backend errors and reapply the application if required.

NFS Mount Failures

Cinder or Nova pods fail with NFS mount errors, and volumes or instances fail to start.

This issue typically occurs when network connectivity, routing, or NetApp NFS export policy configuration is incorrect.

Verify the following:

  • Compute nodes can reach the NetApp Data LIF.

  • The NetApp export policy explicitly allows read/write and superuser access for the compute node source subnet.

  • The specified NFS export path exists on the NetApp SVM and is correctly configured.

Update the export policy or network configuration as required to resolve the issue.
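The checks above can be sketched as follows (the Data LIF address and SVM name are placeholders):

```shell
# From a compute node: confirm the Data LIF is reachable
$ ping -c 3 <DATA_LIF_IP>

# List the exports published by the NetApp NFS server
$ showmount -e <DATA_LIF_IP>

# On the NetApp CLI: review the export policy rules for the SVM
netapp-cluster::> vserver export-policy rule show -vserver <SVM_NAME>
```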

iSCSI or Fibre Channel Session Issues

SAN volumes fail to attach, or Nova instances cannot access attached volumes.

This issue typically occurs when iSCSI sessions are not established, Fibre Channel paths are unavailable, or multipath devices are not configured correctly on the compute nodes.

Verify that active iSCSI sessions exist:

$ sudo iscsiadm -m session

Verify that multipath devices are present and healthy:

$ sudo multipath -ll

Resolve any SAN connectivity, zoning, or multipath configuration issues, then retry the volume operation.
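If no iSCSI sessions are present, a typical recovery sequence looks like the following (the target portal and IQN are placeholders):

```shell
# Discover targets advertised by the NetApp iSCSI Data LIF
$ sudo iscsiadm -m discovery -t sendtargets -p <DATA_LIF_IP>:3260

# Log in to the discovered target
$ sudo iscsiadm -m node -T <TARGET_IQN> -p <DATA_LIF_IP>:3260 --login

# Reload multipath maps after the session is established
$ sudo multipath -r
```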

PVC Stuck in Pending State

PVCs remain in the Pending state and do not bind to a PersistentVolume.

This issue typically occurs when the storage provisioner cannot create a volume due to backend errors, missing or misconfigured StorageClasses, or insufficient permissions.

Describe the affected PVC to review detailed provisioning information:

$ kubectl -n openstack describe pvc <PVC_NAME>

Inspect Kubernetes events for provisioning failures:

$ kubectl get events -n openstack \
  --field-selector reason=ProvisioningFailed

Resolve the reported issue, such as correcting a missing StorageClass or fixing a backend error, and then retry the operation.
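If the events point at the provisioner, it can also help to check the Trident pods themselves (the deployment name may differ between Trident versions; trident-controller is assumed here):

```shell
# Confirm the Trident controller and node pods are running
$ kubectl -n trident get pods

# Review recent controller logs for provisioning errors
$ kubectl -n trident logs deploy/trident-controller --tail=50
```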

Glance PVC Resize Failure During Application Apply

The system application-apply stx-openstack command fails after you increase the Glance volume.size value in the Helm overrides.

This failure occurs when the StorageClass backing the Glance PersistentVolumeClaim does not support volume expansion.

A typical error appears as:

error expanding pvc: StorageClass "<STORAGE_CLASS_NAME>" does not allow volume expansion

Identify the StorageClass used by the Glance PVC:

$ kubectl -n openstack get pvc -l application=glance \
  -o jsonpath='{.items[0].spec.storageClassName}'

Verify that the StorageClass allows volume expansion:

$ kubectl get sc <STORAGE_CLASS_NAME> \
  -o jsonpath='{.allowVolumeExpansion}'

If the output is not true, update the Trident StorageClass configuration to set allowVolumeExpansion: true. Reinstall or update the NetApp backend configuration as required, then retry the Glance override update and re-run system application-apply.
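As a minimal sketch, the flag can also be enabled directly on the StorageClass; note that if Trident manages the StorageClass from its backend configuration, a manual patch may be overwritten the next time the configuration is reconciled:

```shell
$ kubectl patch sc <STORAGE_CLASS_NAME> \
  -p '{"allowVolumeExpansion": true}'
```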

Limitations and Known Issues

Cinder Snapshots versus NetApp Snapshots (iSCSI and FC)

When you create a Cinder volume snapshot on a NetApp iSCSI or FC backend, NetApp creates a LUN clone (FlexClone) rather than a traditional ONTAP FlexVol Snapshot or a new FlexVol volume. This behavior is expected.

Why This Occurs

With ONTAP iSCSI and FC backends, Cinder volumes are backed by ONTAP LUNs within a FlexVol. ONTAP Snapshots operate at the FlexVol level and not at the individual LUN level, so they cannot be mapped directly to a Cinder snapshot (which targets a single volume). Instead, the Cinder NetApp driver uses ONTAP’s FlexClone technology to create a space-efficient LUN clone that represents the Cinder snapshot.

This behavior is documented upstream in the NetApp Cinder driver documentation, under Cinder Snapshots versus NetApp Snapshots.

What to Expect on the NetApp Side

Given the following OpenStack operations:

# Create a volume
$ openstack volume create --image $IMAGE --size 2 cirros-iscsi-vol

# Create a snapshot
$ openstack volume snapshot create \
  --volume cirros-iscsi-vol cirros-iscsi-vol-snap1

# Create a new volume from the snapshot
$ openstack volume create \
  --snapshot cirros-iscsi-vol-snap1 --size 2 cirros-iscsi-vol-from-snap1

NetApp creates three LUN objects in the same FlexVol:

LUN Path                                  is-clone   Description
/vol/<flexvol>/<volume-uuid>              false      Original Cinder volume
/vol/<flexvol>/snapshot-<snapshot-id>     true       Cinder snapshot (LUN clone of parent)
/vol/<flexvol>/<new-volume-uuid>          true       Volume created from snapshot (LUN clone)

Verify this behavior on the NetApp CLI:

netapp-cluster::> lun show -vserver <SVM_NAME> -fields is-clone

FlexClone License Requirement

This behavior requires the FlexClone license to be enabled on the NetApp cluster. Without the license, snapshot and clone operations fail.

Verify the license status:

netapp-cluster::> license show -package flexclone
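If the package is not licensed, it can typically be installed with a license key (the key itself is a placeholder; obtain it from NetApp):

```
netapp-cluster::> system license add -license-code <LICENSE_KEY>
```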

Note

For NFS backends, Cinder snapshots use file-level FlexClone (cloning the NFS file that represents the volume). For FlexGroup volumes, snapshot operations fall back to the generic NFS implementation due to current FlexClone limitations.

Volume Attachment Desynchronization

In some scenarios, a volume may appear as available in Cinder while still showing as attached in Nova. This can occur when a volume attachment operation is interrupted midway (for example, due to network issues with the storage backend, API pod restarts, or parallel attachment timeouts). This section covers how to identify the problem and how to clean up the orphaned database entry to restore normal operation.

Symptoms

  • Cinder reports the volume as available:

    $ openstack volume show <volume-uuid> -> status: available
    
  • The same volume appears under volumes_attached in Nova:

    $ openstack server show <server-uuid>  -> volumes_attached: id='<volume-uuid>'
    
  • Attempting to reattach the volume fails with “already attached” errors

  • Attempting to detach fails because Cinder has no record of the attachment

  • Horizon may display Something Went Wrong if the volume is deleted while in this state

Cause

Nova writes a BDM entry before completing the Cinder attachment. If the operation is interrupted, Nova retains the BDM while Cinder does not record the attachment. This is a known upstream issue (Bug #2116931).

Common triggers include:

  • Storage backend connectivity issues (for example, iSCSI “No route to host” errors) causing attachment operations to take longer than expected

  • Parallel volume attachments to the same VM causing lock contention and RPC timeouts

  • Nova API pod restarts during in-flight attachment operations

Workaround

Identify and remove the orphaned BDM entry from Nova’s database.

Procedure

  1. Retrieve the MariaDB password:

$ kubectl get secret -n openstack mariadb-dbadmin-password \
  -o jsonpath='{.data.MYSQL_DBADMIN_PASSWORD}' | base64 -d; echo

  2. Identify a running MariaDB pod:

$ MARIADB_POD=$(kubectl get pods -n openstack \
  -l component=server,application=mariadb \
  --field-selector=status.phase=Running \
  -o jsonpath='{.items[0].metadata.name}')

  3. Check for the orphaned BDM in Nova:

$ kubectl exec -n openstack $MARIADB_POD -- \
  mysql -u root -p"<DB_PASSWORD>" nova -e \
  "select * from block_device_mapping \
    where volume_id='<VOLUME_ID>' and deleted=0;"

  4. Confirm that Cinder has no matching attachment:

$ kubectl exec -n openstack $MARIADB_POD -- \
  mysql -u root -p"<DB_PASSWORD>" cinder -e \
  "select * from volume_attachment \
    where volume_id='<VOLUME_ID>' and deleted=0;"

If no rows are returned, the desync is confirmed.

  5. Remove the orphaned BDM entry from Nova’s database:

$ kubectl exec -n openstack $MARIADB_POD -- \
  mysql -u root -p"<DB_PASSWORD>" nova -e \
  "update block_device_mapping \
    set deleted=1 where volume_id='<VOLUME_ID>' and deleted = 0;"

After running this command, the volume will no longer appear in the server’s volumes_attached list and can be attached again normally.

Recommended Practices

  • Attach volumes to a VM one at a time (sequentially) rather than in parallel

  • Ensure stable network connectivity to the storage backend before performing volume operations
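For example, attaching several volumes sequentially from the CLI can be sketched as follows (the server and volume names are illustrative):

```shell
# Attach each volume and wait until it reports "in-use" before
# attaching the next one.
for vol in data-vol-1 data-vol-2; do
  openstack server add volume my-server "$vol"
  until [ "$(openstack volume show "$vol" -f value -c status)" = "in-use" ]; do
    sleep 5
  done
done
```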

Note

This workaround resolves the issue only temporarily. A permanent fix is tracked in upstream OpenStack Bug #2116931.

Glance Storage Backend Migration Not Supported

StarlingX OpenStack does not support live migration of Glance images between different storage backends, for example, switching from PVC-backed FC storage to Cinder store, or between any two backend types. Changing the storage_conf.volume_storage_class_priority in the Glance overrides and re-applying the application will reconfigure the Glance service to use the new backend, but existing images stored on the previous backend will not be migrated automatically and will become inaccessible.

When This Applies

  • PVC → Cinder store

  • Cinder store → PVC

  • PVC using one StorageClass → PVC using another StorageClass

Workaround

To change the Glance storage backend, you must manually save images, remove the application, reconfigure storage, and recreate the images on the new backend.

Procedure

  1. Identify and save all Glance images:

$ openstack image list --status active
$ openstack image save --file /home/sysadmin/glance-backup/<IMAGE_NAME>.raw <IMAGE_ID>

  2. Remove the application:

$ source /etc/platform/openrc
$ system application-remove stx-openstack

  3. Update the Glance overrides and re-apply the application:

$ system helm-override-update --reuse-values \
  --values glance.yaml stx-openstack glance openstack

$ system application-apply stx-openstack

  4. Recreate the images on the new backend:

$ openstack image create \
  --disk-format <DISK_FORMAT> \
  --container-format <CONTAINER_FORMAT> \
  --file /home/sysadmin/glance-backup/<IMAGE_NAME>.raw \
  <IMAGE_NAME>

Impact

  • OpenStack services are unavailable during application removal and re-apply

  • Running VMs continue to operate

  • Image UUIDs change and must be updated in referencing artifacts

IPv6 Inline NFS Volume Mounts Not Supported

Kubernetes inline NFS volumes do not support IPv6 NFS server addresses. This affects Nova ephemeral storage using the NFS Shares backend with IPv6 Data LIFs.

Background

When using inline NFS, the Pod spec includes:

volumes:
  - name: nova-instances
    nfs:
      server: "[NetApp NFS Data LIF address]"
      path: /openstack_instances

Inline NFS volumes cannot pass mount options such as proto=tcp6.

Why IPv6 Fails

IPv6 requires proto=tcp6 and nfsvers=4. Inline mounts inherit their defaults from /etc/nfsmount.conf, which typically defaults to IPv4 settings.
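As an illustration, an IPv6 mount would need global defaults along the following lines in /etc/nfsmount.conf; the keys shown are an assumption about how the file is typically structured, and setting them still does not allow per-volume mount options for inline NFS:

```
[ NFSMount_Global_Options ]
# Assumed example: global defaults an IPv6 NFS mount would require
Defaultvers=4
Defaultproto=tcp6
```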