Today I'll explain how to solve this warning reported by ceph status.
Maybe this case does not exactly match your error, but I think the commands I used give you a path to follow to diagnose and solve it.
The whole process is written on the wiki.
First, check the general status:
  # ceph -s
  ...
    health: HEALTH_WARN
            Degraded data redundancy: 2 pgs degraded, 8 pgs undersized
  ...
  # ceph health detail
  HEALTH_WARN Degraded data redundancy: 2 pgs degraded, 8 pgs undersized
  PG_DEGRADED Degraded data redundancy: 2 pgs degraded, 8 pgs undersized
      pg 14.0 is stuck undersized for 510298.054479, current state active+undersized, last acting [5,12]
      pg 14.1 is stuck undersized for 510298.091712, current state active+undersized, last acting [18,7]
      pg 14.2 is stuck undersized for 510298.007891, current state active+undersized+degraded, last acting [7,18]
      pg 14.3 is stuck undersized for 510298.086409, current state active+undersized, last acting [8,5]
      pg 14.4 is stuck undersized for 510298.054479, current state active+undersized+degraded, last acting [5,18]
      pg 14.5 is stuck undersized for 510298.033776, current state active+undersized, last acting [16,1]
      pg 14.6 is stuck undersized for 510298.086409, current state active+undersized, last acting [8,3]
      pg 14.7 is stuck undersized for 510298.091649, current state active+undersized, last acting [18,3]
Why is pool 14 the **only one** that is failing?
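As a reminder, the number before the dot in a PG ID is the pool ID, so 14.0 through 14.7 all belong to pool 14. A minimal sketch to resolve that ID to a pool name, using nothing more than the standard CLI:

  # The PG ID prefix (14 in 14.0) is the pool ID; look up which pool that is
  ceph osd pool ls detail | grep "^pool 14 "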
Getting details of one PG:
  # ceph tell 14.0 query | jq .
  {
    "state": "active+undersized",
    ...
      "stat_sum": {
        "num_bytes": 0,
        "num_objects": 0,
        "num_object_clones": 0,
        "num_object_copies": 0,
        "num_objects_missing_on_primary": 0,
        "num_objects_missing": 0,
        "num_objects_degraded": 0,
        "num_objects_misplaced": 0,
        "num_objects_unfound": 0,
        "num_objects_dirty": 0,
        "num_whiteouts": 0,
        "num_read": 0,
        "num_read_kb": 0,
        "num_write": 0,
        "num_write_kb": 0,
        "num_scrub_errors": 0,
        "num_shallow_scrub_errors": 0,
        "num_deep_scrub_errors": 0,
        "num_objects_recovered": 0,
        "num_bytes_recovered": 0,
        "num_keys_recovered": 0,
        "num_objects_omap": 0,
        "num_objects_hit_set_archive": 0,
        "num_bytes_hit_set_archive": 0,
        "num_flush": 0,
        "num_flush_kb": 0,
        "num_evict": 0,
        "num_evict_kb": 0,
        "num_promote": 0,
        "num_flush_mode_high": 0,
        "num_flush_mode_low": 0,
        "num_evict_mode_some": 0,
        "num_evict_mode_full": 0,
        "num_objects_pinned": 0,
        "num_legacy_snapsets": 0,
        "num_large_omap_objects": 0,
        "num_objects_manifest": 0,
        "num_omap_bytes": 0,
        "num_omap_keys": 0,
        "num_objects_repaired": 0
    ...
    ],
    "recovery_state": [
      {
        "name": "Started/Primary/Active",
        "enter_time": "2020-08-11 11:50:38.233290",
        "might_have_unfound": [],
        "recovery_progress": {
          "backfill_targets": [],
          "waiting_on_backfill": [],
          "last_backfill_started": "MIN",
          "backfill_info": {
            "begin": "MIN",
            "end": "MIN",
            "objects": []
          },
          "peer_backfill_info": [],
          "backfills_in_flight": [],
          "recovering": [],
          "pg_backend": {
            "pull_from_peer": [],
            "pushing": []
          }
        },
        "scrub": {
          "scrubber.epoch_start": "0",
          "scrubber.active": false,
          "scrubber.state": "INACTIVE",
          "scrubber.start": "MIN",
          "scrubber.end": "MIN",
          "scrubber.max_end": "MIN",
          "scrubber.subset_last_update": "0'0",
          "scrubber.deep": false,
          "scrubber.waiting_on_whom": []
        }
      },
      {
        "name": "Started",
        "enter_time": "2020-08-11 11:50:37.502984"
      }
    ],
    "agent_state": {}
  }
The PGs of pool 14 have no data and no activity!
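If you only want a few fields instead of the whole query dump, jq can filter them; a small sketch (the exact JSON layout may differ slightly between Ceph releases):

  # Show just the PG state plus its up and acting OSD sets
  ceph tell 14.0 query | jq '{state: .state, up: .up, acting: .acting}'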
Let's look in depth at the rest of the PGs:
  # ceph pg dump
  version 2063560
  stamp 2020-08-17 09:46:06.557196
  last_osdmap_epoch 0
  last_pg_scan 0
  PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
  ...
  5.7  3 0 0 0 0 897 0 0 3 3 active+clean      2020-08-17 00:25:41.809330 3605'5 3629:3491 [5,15,14,18] 5 [5,15,14,18] 5 3605'5 2020-08-17 00:25:41.809266 3605'5 2020-08-13 05:37:52.184231 0
  14.6 0 0 0 0 0 0   0 0 0 0 active+undersized 2020-08-11 11:50:38.208103 0'0    3628:8    [8,3]        8 [8,3]        8 0'0    2020-08-11 11:50:37.135596 0'0    2020-08-11 11:50:37.135596 0
  ...
  14  0       0 0 0 0 0            0          0       14     14
  13  4652    0 0 0 0 259597692    2169033596 4701468 24546  24546
  12  1182753 0 0 0 0 80144984316  0          0       98035  98035
  11  15      0 0 0 0 0            75562576   256644  21660  21660
  10  69520   0 0 0 0 29298471706  0          0       24446  24446
  5   5       0 0 0 0 2050         0          0       5      5
  6   8       0 0 0 0 0            0          0       2747   2747
  7   76      0 0 0 0 14374        11048      60      12097  12097
  8   207     0 0 0 0 0            0          0       24532  24532
  sum 1257236 0 0 0 0 109703070138 2244607220 4958172 208082 208082
  OSD_STAT USED    AVAIL   USED_RAW TOTAL   HB_PEERS                                           PG_SUM PRIMARY_PG_SUM
  19       32 GiB  2.0 TiB 33 GiB   2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]  17     4
  18       28 GiB  2.0 TiB 29 GiB   2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,19]  22     6
  17       48 GiB  2.0 TiB 49 GiB   2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19]  21     9
  16       20 GiB  2.0 TiB 21 GiB   2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17,18,19]  17     4
  15       32 GiB  2.0 TiB 33 GiB   2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,16,17,18,19]  19     4
  14       36 GiB  2.0 TiB 37 GiB   2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19]  20     2
  13       27 GiB  2.0 TiB 28 GiB   2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19]  17     3
  12       47 GiB  2.0 TiB 48 GiB   2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19]  23     8
  11       12 GiB  2.0 TiB 13 GiB   2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19]  11     5
  10       17 GiB  2.0 TiB 18 GiB   2.0 TiB [0,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19]  14     2
  3        24 GiB  2.0 TiB 25 GiB   2.0 TiB [0,1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] 14     5
  2        20 GiB  2.0 TiB 21 GiB   2.0 TiB [0,1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] 12     4
  1        36 GiB  2.0 TiB 37 GiB   2.0 TiB [0,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] 18     4
  0        41 GiB  2.0 TiB 42 GiB   2.0 TiB [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] 21     2
  4        24 GiB  2.0 TiB 25 GiB   2.0 TiB [0,1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] 18     4
  5        40 GiB  2.0 TiB 42 GiB   2.0 TiB [0,1,2,3,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19] 23     10
  6        55 GiB  1.9 TiB 56 GiB   2.0 TiB [0,1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19] 21     4
  7        35 GiB  2.0 TiB 36 GiB   2.0 TiB [0,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18,19] 23     5
  8        32 GiB  2.0 TiB 33 GiB   2.0 TiB [0,1,2,3,4,5,6,7,9,10,11,12,13,14,15,16,17,18,19] 16     6
  9        31 GiB  2.0 TiB 33 GiB   2.0 TiB [0,1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19] 21     5
  sum      636 GiB 39 TiB  659 GiB  40 TiB

  * NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilisation. See http://docs.ceph.com/docs/master/dev/placement-group/#omap-statistics for further details.
Pool 14's PGs are the only ones that **don't have** high availability (3 replicas or more)… why?
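To look at pool 14's PGs without wading through the full dump, a compact filter like this works (just a convenience; pgs_brief only prints the state and the up/acting sets):

  # Only the PGs whose ID starts with "14." (i.e. pool 14)
  ceph pg dump pgs_brief 2>/dev/null | grep '^14\.'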
  # ceph pg dump_stuck inactive
  ok
  # ceph pg dump_stuck stale
  ok
  # ceph pg dump_stuck undersized
  ok
  PG_STAT STATE                      UP     UP_PRIMARY ACTING ACTING_PRIMARY
  14.1    active+undersized          [18,7] 18         [18,7] 18
  14.0    active+undersized          [5,12] 5          [5,12] 5
  14.3    active+undersized          [8,5]  8          [8,5]  8
  14.2    active+undersized+degraded [7,18] 7          [7,18] 7
  14.6    active+undersized          [8,3]  8          [8,3]  8
  14.7    active+undersized          [18,3] 18         [18,3] 18
  14.4    active+undersized+degraded [5,18] 5          [5,18] 5
  14.5    active+undersized          [16,1] 16         [16,1] 16
  # ceph pg force-recovery 14.0
  pg 14.0 doesn't require recovery;
  # ceph pg force-backfill 14.0
  pg 14.0 doesn't require backfilling;
  # ceph pg force-recovery 14.4
  instructing pg(s) [14.4] on osd.5 to force-recovery;
  # ceph pg force-backfill 14.4
  instructing pg(s) [14.4] on osd.5 to force-backfill;
  # ceph pg force-recovery 14.2
  instructing pg(s) [14.2] on osd.7 to force-recovery;
  # ceph pg force-backfill 14.2
  instructing pg(s) [14.2] on osd.7 to force-backfill;
  # ceph pg ls
  PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES    OMAP_BYTES* OMAP_KEYS* LOG  STATE                      SINCE VERSION     REPORTED    UP             ACTING         SCRUB_STAMP                DEEP_SCRUB_STAMP
  5.0  1       0        0         0       348      0           0          1    active+clean               18h   3605'2 ...
  13.7 559     0        0         0       25166390 290987928   629844     3072 active+clean               33h   3629'373240 3629:428639 [16,2,17,1]p16 [16,2,17,1]p16 2020-08-16 00:16:42.372384 2020-08-13 15:45:28.525122
  14.0 0       0        0         0       0        0           0          0    active+undersized          5d    0'0         3628:8      [5,12]p5       [5,12]p5       2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596
  14.1 0       0        0         0       0        0           0          0    active+undersized          5d    0'0         3628:8      [18,7]p18      [18,7]p18      2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596
  14.2 0       0        0         0       0        0           0          7    active+undersized+degraded 5d    3629'7      3629:21     [7,18]p7       [7,18]p7       2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596
  14.3 0       0        0         0       0        0           0          0    active+undersized          5d    0'0         3628:8      [8,5]p8        [8,5]p8        2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596
  14.4 0       0        0         0       0        0           0          7    active+undersized+degraded 5d    3629'7      3629:21     [5,18]p5       [5,18]p5       2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596
  14.5 0       0        0         0       0        0           0          0    active+undersized          5d    0'0         3628:8      [16,1]p16      [16,1]p16      2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596
  14.6 0       0        0         0       0        0           0          0    active+undersized          5d    0'0         3628:8      [8,3]p8        [8,3]p8        2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596
  14.7 0       0        0         0       0        0           0          0    active+undersized          5d    0'0         3628:8      [18,3]p18      [18,3]p18      2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596

  * NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilisation. See http://docs.ceph.com/docs/master/dev/placement-group/#omap-statistics for further details.
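Note why force-recovery and force-backfill can't help here: for every one of these PGs the up set equals the acting set and contains only two OSDs, so CRUSH never mapped a third OSD to copy the data to. A quick way to confirm the mapping for a single PG:

  # Print the osdmap epoch and the up/acting sets CRUSH computes for this PG
  ceph pg map 14.0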
Thoughts: \\
* Pool 14's PGs are **empty**!
* Pool 14's PGs are not being scrubbed!
* Pool 14's PGs are not backfilling!
\\
**WHY?**
  # ceph osd lspools
  5 .rgw.root
  6 default.rgw.control
  7 default.rgw.meta
  8 default.rgw.log
  10 default.rgw.buckets.data
  11 default.rgw.buckets.index
  12 cephfs_data-ftp
  13 cephfs_metadata-ftp
  14 default.rgw.buckets.non-ec
Maybe pool 14's PGs are **EMPTY** because the pool (default.rgw.buckets.non-ec) is unused!\\
\\
The pool is indeed empty, but there are other empty pools and the warning only comes from this one.
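A quick sanity check of which pools actually hold data (the pool in question should report 0 objects):

  # Per-pool usage: objects and bytes stored
  ceph df

Since being empty is clearly not the whole story, the next thing to look at is the PG autoscaler: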
  # ceph osd pool autoscale-status
  POOL                       SIZE TARGET SIZE RATE RAW CAPACITY RATIO  TARGET RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE
  default.rgw.buckets.non-ec 0                3.0  40940G       0.0000              1.0  8                 warn
  ...
  default.rgw.log            0                4.0  40940G       0.0000              1.0  8                 on
This pool does not have the PG autoscaler enabled (it is only set to warn)!\\
Turning it on:
  # ceph osd pool set default.rgw.buckets.non-ec pg_autoscale_mode on
  set pool 14 pg_autoscale_mode to on
  # ceph osd pool autoscale-status
  POOL                       SIZE TARGET SIZE RATE RAW CAPACITY RATIO  TARGET RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE
  default.rgw.buckets.non-ec 0                3.0  40940G       0.0000              1.0  8                 on
  ...
  default.rgw.log            0                4.0  40940G       0.0000              1.0  8                 on
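As a side note, if you want pools created in the future to get the autoscaler automatically, the cluster-wide default can be changed too; a sketch, assuming Nautilus or later (optional, not part of the fix itself):

  # Default pg_autoscale_mode for newly created pools
  ceph config set global osd_pool_default_pg_autoscale_mode on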
Check:
  # ceph -s
  ...
    health: HEALTH_WARN
            Degraded data redundancy: 8 pgs undersized
  ...
  # ceph health detail
  HEALTH_WARN Degraded data redundancy: 8 pgs undersized
  PG_DEGRADED Degraded data redundancy: 8 pgs undersized
      pg 14.0 is stuck undersized for 513135.338453, current state active+undersized, last acting [5,12]
      pg 14.1 is stuck undersized for 513135.375686, current state active+undersized, last acting [18,7]
      pg 14.2 is stuck undersized for 513135.291865, current state active+undersized, last acting [7,18]
      pg 14.3 is stuck undersized for 513135.370383, current state active+undersized, last acting [8,5]
      pg 14.4 is stuck undersized for 513135.338453, current state active+undersized, last acting [5,18]
      pg 14.5 is stuck undersized for 513135.317750, current state active+undersized, last acting [16,1]
      pg 14.6 is stuck undersized for 513135.370383, current state active+undersized, last acting [8,3]
      pg 14.7 is stuck undersized for 513135.375623, current state active+undersized, last acting [18,3]
Ceph is moving: the degraded PGs are gone, but the cluster still shows the warning. Let's look at the pool's placement rule:
  # ceph osd pool ls detail | grep "non-ec"
  pool 14 'default.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 3630 flags hashpspool stripe_width 0 application rgw
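crush_rule 0 is the cluster's default rule. To compare it against the custom one, the rules can be listed and dumped (the name replicated_rule below is just the usual default, yours may differ):

  # List all CRUSH rules, then inspect the one the pool is using
  ceph osd crush rule ls
  ceph osd crush rule dump replicated_rule   # "replicated_rule" is assumed; use a name from the ls output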
Pool 14 is using the default placement rule (crush_rule 0); switching it to CiberterminalRule:
  # ceph osd pool set default.rgw.buckets.non-ec crush_rule CiberterminalRule
  set pool 14 crush_rule to CiberterminalRule
  # ceph osd pool ls detail | grep "non-ec"
  pool 14 'default.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 3631 flags hashpspool stripe_width 0 application rgw
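After the rule change the PGs should re-peer with three OSDs in their acting sets; a quick way to watch them (verification only):

  # Every PG of the pool should now list three OSDs and end up active+clean
  ceph pg ls-by-pool default.rgw.buckets.non-ec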
Check:
  # ceph health detail
  HEALTH_OK
  # ceph -s
    cluster:
      id:     a3a799ce-f1d3-4230-a915-06e988fee767
      health: HEALTH_OK
  ...
**OUUUUU YEAHHHHHHHHH**