A record of handling a Ceph pg unfound
source link: http://os.51cto.com/art/202102/643948.htm
While checking the Ceph cluster today I found an unfound PG, hence this post~~~
1. Check the cluster status
[root@k8snode001 ~]# ceph health detail
HEALTH_ERR 1/973013 objects unfound (0.000%); 17 scrub errors; Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair; Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded
OBJECT_UNFOUND 1/973013 objects unfound (0.000%)
    pg 2.2b has 1 unfound objects
OSD_SCRUB_ERRORS 17 scrub errors
PG_DAMAGED Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair
    pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
    pg 2.44 is active+clean+inconsistent, acting [14,8,21]
    pg 2.73 is active+clean+inconsistent, acting [25,14,8]
    pg 2.80 is active+clean+scrubbing+deep+inconsistent+repair, acting [4,8,14]
    pg 2.83 is active+clean+inconsistent, acting [14,13,6]
    pg 2.ae is active+clean+inconsistent, acting [14,3,2]
    pg 2.c4 is active+clean+inconsistent, acting [8,21,14]
    pg 2.da is active+clean+inconsistent, acting [23,14,15]
    pg 2.fa is active+clean+inconsistent, acting [14,23,25]
PG_DEGRADED Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded
    pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
The key line in the output is: pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
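When many PGs are flagged at once, it can help to pull the unfound PG ids out of the health output programmatically. A minimal sketch (the function name and regex are mine, not a Ceph API), fed with the sample lines from the output above:

```python
import re

def parse_unfound_pgs(health_detail: str) -> list:
    """Extract PG ids reported with unfound objects from `ceph health detail` text."""
    return re.findall(r"pg (\S+) has \d+ unfound object", health_detail)

sample = (
    "OBJECT_UNFOUND 1/973013 objects unfound (0.000%)\n"
    "    pg 2.2b has 1 unfound objects\n"
)
print(parse_unfound_pgs(sample))  # ['2.2b']
```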
Now let's look at pg 2.2b in detail.
[root@k8snode001 ~]# ceph pg dump_json pools |grep 2.2b
dumped all
2.2b  2487  1  1  0  1  9533198403  3048  3048  active+recovery_unfound+degraded  2020-07-23 08:56:07.669903  10373'5448370  10373:7312614  [14,22,4]  14  [14,22,4]  14  10371'5437258  2020-07-23 08:56:06.637012  10371'5437258  2020-07-23 08:56:06.637012  0
From the dump we can see that one of its objects is unfound.
2. Check the pg map
[root@k8snode001 ~]# ceph pg map 2.2b
osdmap e10373 pg 2.2b (2.2b) -> up [14,22,4] acting [14,22,4]
The pg map shows that pg 2.2b is placed on OSDs [14,22,4].
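The up and acting sets on that line can also be picked apart in a script; a small helper (hypothetical name, not part of any Ceph library) that mirrors what we just read off by eye:

```python
import re

def parse_pg_map(line: str):
    """Split a `ceph pg map` output line into its up and acting OSD id lists."""
    m = re.search(r"up \[([\d,]*)\] acting \[([\d,]*)\]", line)
    to_ids = lambda s: [int(x) for x in s.split(",")] if s else []
    return to_ids(m.group(1)), to_ids(m.group(2))

line = "osdmap e10373 pg 2.2b (2.2b) -> up [14,22,4] acting [14,22,4]"
up, acting = parse_pg_map(line)
print(up, acting)  # [14, 22, 4] [14, 22, 4]
```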
3. Check the pool status
[root@k8snode001 ~]# ceph osd pool stats k8s-1
pool k8s-1 id 2
  1/1955664 objects degraded (0.000%)
  1/651888 objects unfound (0.000%)
  client io 271 KiB/s wr, 0 op/s rd, 52 op/s wr

[root@k8snode001 ~]# ceph osd pool ls detail|grep k8s-1
pool 2 'k8s-1' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 88 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
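Note the size 3 min_size 1 in the detail line: the pool keeps three replicas but keeps accepting I/O with only one alive, which is exactly the setup that can leave objects unfound after overlapping OSD failures. A quick sketch (regex and function name are mine) to pull those numbers out of the line above:

```python
import re

def pool_replication(detail_line: str):
    """Return (size, min_size) parsed from a `ceph osd pool ls detail` line."""
    size = int(re.search(r"\breplicated size (\d+)", detail_line).group(1))
    min_size = int(re.search(r"min_size (\d+)", detail_line).group(1))
    return size, min_size

detail = ("pool 2 'k8s-1' replicated size 3 min_size 1 crush_rule 0 "
          "object_hash rjenkins pg_num 256 pgp_num 256")
print(pool_replication(detail))  # (3, 1)
```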
4. Try to recover the lost object in pg 2.2b
[root@k8snode001 ~]# ceph pg repair 2.2b
If the repair keeps failing, query the stuck PG for details and focus on the recovery_state section:
[root@k8snode001 ~]# ceph pg 2.2b query
{
  ......
  "recovery_state": [
    {
      "name": "Started/Primary/Active",
      "enter_time": "2020-07-21 14:17:05.855923",
      "might_have_unfound": [],
      "recovery_progress": {
        "backfill_targets": [],
        "waiting_on_backfill": [],
        "last_backfill_started": "MIN",
        "backfill_info": {
          "begin": "MIN",
          "end": "MIN",
          "objects": []
        },
        "peer_backfill_info": [],
        "backfills_in_flight": [],
        "recovering": [],
        "pg_backend": {
          "pull_from_peer": [],
          "pushing": []
        }
      },
      "scrub": {
        "scrubber.epoch_start": "10370",
        "scrubber.active": false,
        "scrubber.state": "INACTIVE",
        "scrubber.start": "MIN",
        "scrubber.end": "MIN",
        "scrubber.max_end": "MIN",
        "scrubber.subset_last_update": "0'0",
        "scrubber.deep": false,
        "scrubber.waiting_on_whom": []
      }
    },
    {
      "name": "Started",
      "enter_time": "2020-07-21 14:17:04.814061"
    }
  ],
  "agent_state": {}
}
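In the output above, might_have_unfound is empty: the primary has no remaining peers that might still hold the missing object, so recovery cannot make progress on its own. A hedged sketch of that check on the parsed query JSON (the helper name is mine; the status strings are the ones Ceph reports for probed peers, e.g. "already probed", "osd is down"):

```python
def recovery_is_stuck(query: dict) -> bool:
    """True when no peer in might_have_unfound can still yield the missing object."""
    for state in query.get("recovery_state", []):
        if state.get("name") == "Started/Primary/Active":
            peers = state.get("might_have_unfound", [])
            # An empty list, or every candidate already probed or down, means stuck.
            return all(p.get("status") in ("already probed", "osd is down")
                       for p in peers)
    return False

query = {"recovery_state": [
    {"name": "Started/Primary/Active", "might_have_unfound": []},
    {"name": "Started"},
]}
print(recovery_is_stuck(query))  # True
```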
If repair cannot fix it, there are two options: revert to an older version of the objects, or delete them outright.
5. Solutions
Revert to an older version:
[root@k8snode001 ~]# ceph pg 2.2b mark_unfound_lost revert
Delete outright:
[root@k8snode001 ~]# ceph pg 2.2b mark_unfound_lost delete
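The two modes differ in outcome: revert rolls each unfound object back to the last version the surviving replicas agree on, while delete forgets the objects entirely, so the layer above (here RBD) will see the data as gone. A trivial, purely hypothetical wrapper just to make that choice explicit when scripting:

```python
def mark_unfound_lost_cmd(pgid: str, mode: str) -> str:
    """Build the `ceph pg <pgid> mark_unfound_lost <mode>` command line.

    mode 'revert' rolls objects back to a prior version; 'delete'
    discards them permanently (acknowledged data loss).
    """
    if mode not in ("revert", "delete"):
        raise ValueError("mode must be 'revert' or 'delete'")
    return f"ceph pg {pgid} mark_unfound_lost {mode}"

print(mark_unfound_lost_cmd("2.2b", "delete"))
# ceph pg 2.2b mark_unfound_lost delete
```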
6. Verification
I chose delete here; the cluster then rebuilt the PG, and after a short wait the PG state returned to active+clean.
[root@k8snode001 ~]# ceph pg 2.2b query
{
  "state": "active+clean",
  "snap_trimq": "[]",
  "snap_trimq_len": 0,
  "epoch": 11069,
  "up": [
    12,
    22,
    4
  ],
Check the cluster status again:
[root@k8snode001 ~]# ceph health detail
HEALTH_OK
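Rather than re-running ceph health detail by hand while the PG rebuilds, the wait can be scripted. A sketch (function names are mine; it assumes the ceph CLI is on PATH), with the status check factored out so it can be exercised without a live cluster:

```python
import shlex
import subprocess
import time

def is_health_ok(health_output: str) -> bool:
    """True when `ceph health` output starts with HEALTH_OK."""
    return health_output.strip().startswith("HEALTH_OK")

def wait_for_health_ok(timeout_s: float = 600, interval_s: float = 10) -> bool:
    """Poll the live cluster until healthy or until the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        out = subprocess.run(shlex.split("ceph health"),
                             capture_output=True, text=True).stdout
        if is_health_ok(out):
            return True
        time.sleep(interval_s)
    return False

print(is_health_ok("HEALTH_OK"))  # True
```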