First, mark the failing OSD out of the cluster and stop its service:

# ceph osd out osd.4
# systemctl stop ceph-osd@4.service

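Detach bcache (disable the front-end cache). Assume the failing OSD sits on bcache1 and the cache device's cset-uuid is 805bc5f1-36e9-4685-a86a-3ea8c03f1172. My OSD was only throwing occasional errors and was about to fail, rather than having a disk that could no longer be detected, so I detached it the normal way, which flushes the dirty data to the backing disk first: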

# echo 805bc5f1-36e9-4685-a86a-3ea8c03f1172  > /sys/block/bcache1/bcache/detach

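Check the amount of dirty data, and move on to the next step only once it has dropped to 0, meaning everything has been flushed to the backing disk: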

# cat /sys/block/bcache1/bcache/dirty_data
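
Polling by hand gets tedious when there is a lot of dirty data. A minimal sketch of a wait loop follows. The helper names `dirty_is_zero` and `wait_until_flushed` are made up for illustration, and the set of strings treated as "zero" is an assumption, since the exact formatting of `dirty_data` (for example `0` versus `0.0k`) can differ between kernel versions.

```shell
# Hypothetical helpers, not part of bcache-tools.

# Succeeds when the given dirty_data value means "nothing left to flush".
# Assumption: an empty cache reads as 0, 0.0, or 0.0 followed by a unit letter.
dirty_is_zero() {
    case "$1" in
        0|0.0|0.0[kMGT]) return 0 ;;
        *) return 1 ;;
    esac
}

# Blocks until the dirty_data file at $1 reports zero,
# polling every $2 seconds (default 5).
wait_until_flushed() {
    while ! dirty_is_zero "$(cat "$1")"; do
        sleep "${2:-5}"
    done
}
```

For the device in this example it could be called as `wait_until_flushed /sys/block/bcache1/bcache/dirty_data 10`.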

After the disk has been physically replaced on site, find the number of the missing Virtual Drive:

# /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -LALL -aAll | grep Virtual
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Virtual Drive: 1 (Target Id: 1)
Virtual Drive: 2 (Target Id: 2)
Virtual Drive: 4 (Target Id: 4)
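
With many drives the gap in the listing is easy to overlook. The sketch below prints the Target Ids missing from that output. The function name `missing_target_ids` is hypothetical, and it assumes Target Ids are normally contiguous starting at 0, which holds for the listing above but may not hold on every controller.

```shell
# Hypothetical helper: reads `MegaCli64 -LDInfo -LALL -aAll` output on stdin
# and prints every Target Id missing between 0 and the highest Id seen.
missing_target_ids() {
    awk '
        /Virtual Drive:/ {
            # "Target Id: " is 11 characters long; the number follows it.
            if (match($0, /Target Id: [0-9]+/)) {
                id = substr($0, RSTART + 11, RLENGTH - 11) + 0
                seen[id] = 1
                found = 1
                if (id > max) max = id
            }
        }
        END {
            if (!found) exit
            for (i = 0; i <= max; i++)
                if (!(i in seen)) print i
        }
    '
}
```

It would be fed directly from the command above, for example `/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -LALL -aAll | missing_target_ids`.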

# /opt/MegaRAID/MegaCli/MegaCli64 -GetPreservedCacheList -a0
                                     
Adapter #0

Virtual Drive(Target ID 03): Missing.

This confirms that the missing Virtual Drive is Target ID 03. Discard its preserved cache:

# /opt/MegaRAID/MegaCli/MegaCli64 -DiscardPreservedCache -L3 -a0
                                     
Adapter #0

Virtual Drive(Target ID 03): Preserved Cache Data Cleared.

Exit Code: 0x00

Recreate the RAID 0 virtual drive; the Slot Number is 4:

# /opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r0 [32:4] WB Direct -a0
                                 

Bind bcache to the new data disk. Assume the rebuilt drive shows up as /dev/sdd:

# wipefs -a /dev/sdd
# make-bcache -B /dev/sdd -C /dev/nvme0n1
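
The command above formats the backing device and the cache device in one go. If /dev/nvme0n1 still serves as the cache set for other OSDs on the node (an assumption about this setup, suggested by the existing cset-uuid in the detach step), one alternative is to format only the backing device and attach it to the existing cache set through sysfs. The helper name `attach_to_cache_set` and the `SYSFS_ROOT` override are hypothetical, and `bcache3` stands in for whatever device name appears after formatting.

```shell
# Hypothetical helper: attach an already registered bcache device to an
# existing cache set by writing the set's UUID to its sysfs attach file.
# SYSFS_ROOT is an illustrative override so the function can be dry-run
# against a scratch directory; it defaults to the real /sys.
attach_to_cache_set() {
    dev=$1      # e.g. bcache3
    cset=$2     # e.g. 805bc5f1-36e9-4685-a86a-3ea8c03f1172
    attach="${SYSFS_ROOT:-/sys}/block/$dev/bcache/attach"
    [ -e "$attach" ] || { echo "no such bcache device: $dev" >&2; return 1; }
    echo "$cset" > "$attach"
}

# Intended use, after formatting only the backing device:
#   make-bcache -B /dev/sdd
#   attach_to_cache_set bcache3 805bc5f1-36e9-4685-a86a-3ea8c03f1172
```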

Unmount the OSD's mount directory:

# umount /var/lib/ceph/osd/ceph-4

Pause data rebalancing in the Ceph cluster:

# for i in noout nobackfill norecover noscrub nodeep-scrub;do ceph osd set $i;done

Remove the OSD from the CRUSH map:

# ceph osd crush remove osd.4
removed item id 4 name 'osd.4' from crush map

Delete the failed OSD's authentication key:

# ceph auth del osd.4
updated

Remove the failed OSD from the cluster:

# ceph osd rm 4
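
The three removal commands above always run in the same order, so they can be wrapped in one function that stops at the first failure. This is only a sketch: `remove_osd` and the `CEPH` override are hypothetical names introduced here. On Luminous and later releases, `ceph osd purge <id> --yes-i-really-mean-it` is documented as combining the same steps, which may be preferable if your version supports it.

```shell
# Hypothetical wrapper around the three removal commands.
# CEPH defaults to the real ceph binary; set CEPH=echo for a dry run
# that only prints the arguments each call would receive.
remove_osd() {
    id=$1
    ${CEPH:-ceph} osd crush remove "osd.$id" &&
    ${CEPH:-ceph} auth del "osd.$id" &&
    ${CEPH:-ceph} osd rm "$id"
}
```

A dry run with `CEPH=echo` followed by `remove_osd 4` lets you review the calls before running them for real.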

Next, add the new OSD. Here it is created on bcache3 of node ceph-node-3:

# ceph-deploy osd create ceph-node-3 --data /dev/bcache3

Once that is done, re-enable data rebalancing in the Ceph cluster:

# for i in noout nobackfill norecover noscrub nodeep-scrub;do ceph osd unset $i;done
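
The set and unset loops must use exactly the same flag list, otherwise a flag can be left behind after maintenance. A small sketch that keeps the list in one place follows. The names `OSD_FLAGS`, `osd_flags`, and the `CEPH` override are hypothetical and introduced only for illustration.

```shell
# Flags used in this procedure, kept in a single place.
OSD_FLAGS="noout nobackfill norecover noscrub nodeep-scrub"

# Hypothetical helper: osd_flags set   -> pause rebalancing and scrubbing
#                      osd_flags unset -> resume them
# CEPH defaults to the real ceph binary; set CEPH=echo for a dry run.
osd_flags() {
    action=$1
    case "$action" in
        set|unset) ;;
        *) echo "usage: osd_flags set|unset" >&2; return 2 ;;
    esac
    for flag in $OSD_FLAGS; do
        ${CEPH:-ceph} osd "$action" "$flag" || return 1
    done
}
```

Calling `osd_flags set` before the maintenance and `osd_flags unset` afterwards is meant to mirror the two loops above.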

Partly excerpted and learned from:

https://blog.csdn.net/ct1150/article/details/87367518
https://blog.csdn.net/signmem/article/details/110927220
https://www.cnblogs.com/ajunyu/p/11165950.html

Tags: Ceph, Bcache
