A Record of Handling a Ceph PG unfound Object

source link: http://os.51cto.com/art/202102/643948.htm

While checking the Ceph cluster today, I found that a PG had an unfound object, which is how this post came about.

1. Check the cluster status

[root@k8snode001 ~]# ceph health detail 
HEALTH_ERR 1/973013 objects unfound (0.000%); 17 scrub errors; Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair; Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded 
OBJECT_UNFOUND 1/973013 objects unfound (0.000%) 
    pg 2.2b has 1 unfound objects 
OSD_SCRUB_ERRORS 17 scrub errors 
PG_DAMAGED Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair 
    pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound 
    pg 2.44 is active+clean+inconsistent, acting [14,8,21] 
    pg 2.73 is active+clean+inconsistent, acting [25,14,8] 
    pg 2.80 is active+clean+scrubbing+deep+inconsistent+repair, acting [4,8,14] 
    pg 2.83 is active+clean+inconsistent, acting [14,13,6] 
    pg 2.ae is active+clean+inconsistent, acting [14,3,2] 
    pg 2.c4 is active+clean+inconsistent, acting [8,21,14] 
    pg 2.da is active+clean+inconsistent, acting [23,14,15] 
    pg 2.fa is active+clean+inconsistent, acting [14,23,25] 
PG_DEGRADED Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded 
    pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound 

The output shows: pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound

Now let's look at pg 2.2b and examine its detailed information.

[root@k8snode001 ~]# ceph pg dump_json pools    |grep 2.2b 
dumped all 
2.2b       2487                  1        1         0       1  9533198403 3048     3048                active+recovery_unfound+degraded 2020-07-23 08:56:07.669903  10373'5448370  10373:7312614  [14,22,4]         14  [14,22,4]             14  10371'5437258 2020-07-23 08:56:06.637012   10371'5437258 2020-07-23 08:56:06.637012             0 

From the dump we can see that this PG currently has one degraded, unfound object, i.e. a complete set of replicas is no longer available for that object.
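To pin down exactly which object is unfound, the PG can also be asked to list its missing objects directly. Depending on the Ceph release the subcommand is called list_missing or list_unfound; its output names the unfound object(s) and the OSDs that might still hold a copy:

[root@k8snode001 ~]# ceph pg 2.2b list_missing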

2. Check the PG map

[root@k8snode001 ~]# ceph pg map 2.2b 
osdmap e10373 pg 2.2b (2.2b) -> up [14,22,4] acting [14,22,4] 

From the PG map we can see that pg 2.2b is placed on OSDs [14,22,4].
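An unfound object usually means that an OSD which held the most recent copy is down or out, so before doing anything destructive it is worth confirming that the OSDs in the acting set are actually up. ceph osd find prints the host and CRUSH location of a given OSD id:

[root@k8snode001 ~]# ceph osd tree 
[root@k8snode001 ~]# ceph osd find 14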

3. Check the pool status

[root@k8snode001 ~]# ceph osd pool stats k8s-1 
pool k8s-1 id 2 
  1/1955664 objects degraded (0.000%) 
  1/651888 objects unfound (0.000%) 
  client io 271 KiB/s wr, 0 op/s rd, 52 op/s wr 
 
[root@k8snode001 ~]# ceph osd pool ls detail|grep k8s-1 
pool 2 'k8s-1' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 88 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd 
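Apart from the unfound object, the health output in step 1 also reported 17 scrub errors across several inconsistent PGs in this pool. rados can show which objects those errors refer to; pg 2.44 is used here only as an example, and list-inconsistent-obj needs recent deep-scrub results in order to report anything:

[root@k8snode001 ~]# rados list-inconsistent-pg k8s-1 
[root@k8snode001 ~]# rados list-inconsistent-obj 2.44 --format=json-pretty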

4. Try to recover the lost object in pg 2.2b

[root@k8snode001 ~]# ceph pg repair 2.2b 
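The other PGs that health detail flagged as inconsistent can be repaired the same way. A rough sketch that simply re-issues the repair for every such PG, assuming the "pg <id> is ...inconsistent..." line format shown in step 1:

for pg in $(ceph health detail | awk '$1 == "pg" && /inconsistent/ {print $2}'); do 
    ceph pg repair "$pg" 
done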

If the repair never succeeds, you can inspect the details of the stuck PG, paying particular attention to recovery_state, with the following command:

[root@k8snode001 ~]# ceph pg 2.2b  query 
{ 
    "...... 
    "recovery_state": [ 
        { 
            "name": "Started/Primary/Active", 
            "enter_time": "2020-07-21 14:17:05.855923", 
            "might_have_unfound": [], 
            "recovery_progress": { 
                "backfill_targets": [], 
                "waiting_on_backfill": [], 
                "last_backfill_started": "MIN", 
                "backfill_info": { 
                    "begin": "MIN", 
                    "end": "MIN", 
                    "objects": [] 
                }, 
                "peer_backfill_info": [], 
                "backfills_in_flight": [], 
                "recovering": [], 
                "pg_backend": { 
                    "pull_from_peer": [], 
                    "pushing": [] 
                } 
            }, 
            "scrub": { 
                "scrubber.epoch_start": "10370", 
                "scrubber.active": false, 
                "scrubber.state": "INACTIVE", 
                "scrubber.start": "MIN", 
                "scrubber.end": "MIN", 
                "scrubber.max_end": "MIN", 
                "scrubber.subset_last_update": "0'0", 
                "scrubber.deep": false, 
                "scrubber.waiting_on_whom": [] 
            } 
        }, 
        { 
            "name": "Started", 
            "enter_time": "2020-07-21 14:17:04.814061" 
        } 
    ], 
    "agent_state": {} 
} 
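The full query output is long; if jq is installed (an assumption, it is not shipped with Ceph), the part that matters can be pulled out directly:

[root@k8snode001 ~]# ceph pg 2.2b query | jq '.recovery_state[] | {name, might_have_unfound}'

An empty might_have_unfound list, as here, suggests there are no further OSDs left for the primary to probe for the missing object, which is why the repair cannot make progress.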

If repair cannot fix it, there are two options: revert the unfound object to an older version, or delete it outright. revert rolls the object back to the previous version recorded in the PG log (or forgets it entirely if it was a newly created object), while delete forgets the unfound object altogether; either way you are accepting some data loss.

5. Solution

Revert to the previous version 
[root@k8snode001 ~]# ceph pg 2.2b mark_unfound_lost revert 
Delete outright 
[root@k8snode001 ~]# ceph pg 2.2b mark_unfound_lost delete 

6. Verify

I went with delete here. The Ceph cluster then rebuilt the PG; after waiting a while and checking again, the PG state had changed to active+clean.

[root@k8snode001 ~]#  ceph pg  2.2b query 
{ 
    "state": "active+clean", 
    "snap_trimq": "[]", 
    "snap_trimq_len": 0, 
    "epoch": 11069, 
    "up": [ 
        12, 
        22, 
        4 
    ], 
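Rather than polling by hand while the PG recovers, the state can be watched; watch is a standard Linux utility, not part of Ceph:

[root@k8snode001 ~]# watch -n 5 'ceph health; ceph pg map 2.2b'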

Check the cluster status again:

[root@k8snode001 ~]# ceph health detail 
HEALTH_OK 
