Replace OSDs on an AIO-SX Single Disk System with Backup
When replacing an OSD on an AIO-SX system with replication factor 1, the data on the OSD can be preserved by temporarily replicating it to a second OSD, as described in this procedure.
Prerequisites
Verify that there is an available disk on which a new OSD can be created to back up the data from the existing OSD. The disk must be at least as large as the disk being replaced.
~(keystone_admin)$ system host-disk-list controller-0
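If desired, the UUID of the free disk can be captured into a shell variable for use in the next step. This is a minimal sketch, assuming the spare disk is /dev/sdc (not part of the procedure; adjust for your deployment). In the CLI table output the leading '|' is awk field $1, so the UUID is field $2 and the device node is field $4.

# Sketch only: /dev/sdc is an assumed device node
~(keystone_admin)$ DISK_UUID=$(system host-disk-list controller-0 | awk '$4 == "/dev/sdc" {print $2}')
~(keystone_admin)$ echo ${DISK_UUID}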
Procedure
Add the new OSD, using the disk UUID of the available disk identified in the prerequisites.
~(keystone_admin)$ system host-stor-add controller-0 <disk uuid>
Wait for the new OSD to be configured. Run ceph -s and verify that the output shows two OSDs and that the cluster has finished recovery. Make sure the Ceph cluster is healthy (HEALTH_OK) before proceeding.
Change the replication factor of the pools to 2.
~(keystone_admin)$ ceph osd lspools                                # list all Ceph pools
~(keystone_admin)$ ceph osd pool set <pool-name> size 2
~(keystone_admin)$ ceph osd pool set <pool-name> nosizechange true
This will make the cluster enter a recovery state:
[sysadmin@controller-0 ~(keystone_admin)]$ ceph -s
  cluster:
    id:     38563514-4726-4664-9155-5efd5701de86
    health: HEALTH_WARN
            Degraded data redundancy: 3/57 objects degraded (5.263%), 3 pgs degraded

  services:
    mon: 1 daemons, quorum controller-0 (age 28m)
    mgr: controller-0(active, since 27m)
    mds: kube-cephfs:1 {0=controller-0=up:active}
    osd: 2 osds: 2 up (since 6m), 2 in (since 6m)

  data:
    pools:   3 pools, 192 pgs
    objects: 32 objects, 1000 MiB
    usage:   1.2 GiB used, 16 GiB / 18 GiB avail
    pgs:     2.604% pgs not active
             3/57 objects degraded (5.263%)
             184 active+clean
             5   activating
             2   active+recovery_wait+degraded
             1   active+recovering+degraded

  io:
    recovery: 323 B/s, 1 keys/s, 3 objects/s
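Because the size change must be applied to every pool, a short loop can set both values in one pass. This is a convenience sketch rather than platform tooling; it assumes ceph osd pool ls prints one pool name per line.

# Sketch: raise replication to 2 on all pools, then lock the size
for pool in $(ceph osd pool ls); do
    ceph osd pool set "${pool}" size 2
    ceph osd pool set "${pool}" nosizechange true
done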
Wait for recovery to end and the Ceph cluster to become healthy.
~(keystone_admin)$ ceph -s
  cluster:
    id:     38563514-4726-4664-9155-5efd5701de86
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum controller-0 (age 28m)
    mgr: controller-0(active, since 28m)
    mds: kube-cephfs:1 {0=controller-0=up:active}
    osd: 2 osds: 2 up (since 7m), 2 in (since 7m)

  data:
    pools:   3 pools, 192 pgs
    objects: 32 objects, 1000 MiB
    usage:   2.2 GiB used, 15 GiB / 18 GiB avail
    pgs:     192 active+clean
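Rather than re-running ceph -s by hand, a small loop can block until the cluster reports healthy. A minimal sketch, assuming ceph health prints a line containing HEALTH_OK once recovery completes; the same loop can be reused at the later wait steps in this procedure.

# Sketch: poll every 10 seconds until the cluster reports HEALTH_OK
until ceph health | grep -q HEALTH_OK; do
    sleep 10
done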
Lock the host.
~(keystone_admin)$ system host-lock controller-0
Mark the OSD out.
~(keystone_admin)$ ceph osd out osd.<id>
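If it is not obvious which id belongs to the OSD being replaced, it can be confirmed first. A short sketch, assuming the old OSD is osd.0 (the id on your system may differ):

~(keystone_admin)$ ceph osd tree           # list all OSD ids under the host
~(keystone_admin)$ ceph osd metadata 0     # show the backing device for the assumed id 0
~(keystone_admin)$ ceph osd out osd.0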
Wait for the rebalance to finish.
[sysadmin@controller-0 ~(keystone_admin)]$ ceph -s
  cluster:
    id:     38563514-4726-4664-9155-5efd5701de86
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum controller-0 (age 37m)
    mgr: controller-0(active, since 36m)
    mds: kube-cephfs:1 {0=controller-0=up:active}
    osd: 2 osds: 2 up (since 15m), 1 in (since 2s)

  data:
    pools:   3 pools, 192 pgs
    objects: 32 objects, 1000 MiB
    usage:   808 MiB used, 8.0 GiB / 8.8 GiB avail
    pgs:     192 active+clean

  progress:
    Rebalancing after osd.0 marked out
      [..............................]
Stop the OSD, first moving the pmon configuration aside so that the process monitor does not restart it.
~(keystone_admin)$ sudo mv /etc/pmon.d/ceph.conf ~/       # keep pmon from restarting Ceph processes
~(keystone_admin)$ sudo /etc/init.d/ceph stop osd.<id>
Obtain the stor UUID and delete it from the platform.
~(keystone_admin)$ system host-stor-list controller-0     # list all stors
~(keystone_admin)$ system host-stor-delete <stor uuid>    # delete the stor
Purge the disk from the Ceph cluster.
~(keystone_admin)$ ceph osd purge osd.<id> --yes-i-really-mean-it
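As a quick check, the purged OSD should no longer appear in the tree (the full verification is repeated after the unlock at the end of this procedure):

~(keystone_admin)$ ceph osd tree   # the purged OSD should no longer be listed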
Remove the entry for the old OSD from /etc/ceph/ceph.conf; an example of such an entry is shown below.
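The exact keys vary by release; the entry to delete is a section of roughly the following form, shown here for an assumed osd.0 with an illustrative device path:

# Illustrative fragment of /etc/ceph/ceph.conf to remove (keys and path are assumptions)
[osd.0]
host = controller-0
devs = /dev/disk/by-path/pci-0000:00:0d.0-ata-2.0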
Unmount and remove any remaining directories for the old OSD.
~(keystone_admin)$ sudo umount /var/lib/ceph/osd/ceph-<id>
~(keystone_admin)$ sudo rm -rf /var/lib/ceph/osd/ceph-<id>/
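An optional sanity check confirms that nothing is still mounted for the removed OSD:

~(keystone_admin)$ mount | grep "ceph-<id>"   # no output means the old OSD mount is gone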
Set the pools to allow size changes again.
~(keystone_admin)$ ceph osd pool set <pool-name> nosizechange false
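As with raising the replication factor, a loop avoids repeating the command per pool (same assumptions as the earlier sketch):

# Sketch: re-allow size changes on every pool
for pool in $(ceph osd pool ls); do
    ceph osd pool set "${pool}" nosizechange false
done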
Unlock the host.
~(keystone_admin)$ system host-unlock controller-0
Verify that the Ceph cluster is healthy.
~(keystone_admin)$ ceph -s
If you see a HEALTH_ERR message like the following:

controller-0:~$ ceph -s
  cluster:
    id:     38563514-4726-4664-9155-5efd5701de86
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem has a failed mds daemon
            1 filesystem is offline
            no active mgr

  services:
    mon: 1 daemons, quorum controller-0 (age 38s)
    mgr: no daemons active (since 3s)
    mds: kube-cephfs:0/1, 1 failed
    osd: 1 osds: 1 up (since 14m), 1 in (since 15m)

  data:
    pools:   3 pools, 192 pgs
    objects: 32 objects, 1000 MiB
    usage:   1.1 GiB used, 7.7 GiB / 8.8 GiB avail
    pgs:     192 active+clean
This is typically transient while Ceph services restart after the unlock. Wait a few minutes until the Ceph cluster shows HEALTH_OK:

controller-0:~$ ceph -s
  cluster:
    id:     38563514-4726-4664-9155-5efd5701de86
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum controller-0 (age 2m)
    mgr: controller-0(active, since 96s)
    mds: kube-cephfs:1 {0=controller-0=up:active}
    osd: 1 osds: 1 up (since 46s), 1 in (since 17m)

  task status:

  data:
    pools:   3 pools, 192 pgs
    objects: 32 objects, 1000 MiB
    usage:   1.1 GiB used, 7.7 GiB / 8.8 GiB avail
    pgs:     192 active+clean
The OSD tree should display the new OSD and not the previous one.
controller-0:~$ ceph osd tree
ID CLASS WEIGHT  TYPE NAME                STATUS REWEIGHT PRI-AFF
-1       0.00850 root storage-tier
-2       0.00850     chassis group-0
-3       0.00850         host controller-0
 1   hdd 0.00850             osd.1            up  1.00000 1.00000