Troubleshooting Ceph: Degraded data redundancy

Today I’ll walk through how I solved this warning from ceph status. Your case may not match this error exactly, but I think the commands I used give you a path to follow to diagnose and solve it.

The whole process is written on the wiki


See general status:

# ceph -s
...
    health: HEALTH_WARN
            Degraded data redundancy: 2 pgs degraded, 8 pgs undersized
...
# ceph health detail
HEALTH_WARN Degraded data redundancy: 2 pgs degraded, 8 pgs undersized
PG_DEGRADED Degraded data redundancy: 2 pgs degraded, 8 pgs undersized
    pg 14.0 is stuck undersized for 510298.054479, current state active+undersized, last acting [5,12]
    pg 14.1 is stuck undersized for 510298.091712, current state active+undersized, last acting [18,7]
    pg 14.2 is stuck undersized for 510298.007891, current state active+undersized+degraded, last acting [7,18]
    pg 14.3 is stuck undersized for 510298.086409, current state active+undersized, last acting [8,5]
    pg 14.4 is stuck undersized for 510298.054479, current state active+undersized+degraded, last acting [5,18]
    pg 14.5 is stuck undersized for 510298.033776, current state active+undersized, last acting [16,1]
    pg 14.6 is stuck undersized for 510298.086409, current state active+undersized, last acting [8,3]
    pg 14.7 is stuck undersized for 510298.091649, current state active+undersized, last acting [18,3]
    

Why is pool 14 the **only one** failing??? The first number of a PG id is the pool id (14.0 is PG 0 of pool 14), so every stuck PG above belongs to pool 14.
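A quick way to confirm that straight from the health output (a small awk sketch; the field positions match the "pg 14.x is stuck" lines above):

# Count stuck PGs per pool id, parsed from the "pg <pool>.<seq> is stuck ..." lines
ceph health detail | awk '/is stuck/ {split($2, id, "."); pools[id[1]]++} END {for (p in pools) print "pool " p ": " pools[p] " stuck PGs"}'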
Getting details of one PG:

# ceph tell 14.0 query | jq .
{
  "state": "active+undersized",
...
      "stat_sum": {
        "num_bytes": 0,
        "num_objects": 0,
        "num_object_clones": 0,
        "num_object_copies": 0,
        "num_objects_missing_on_primary": 0,
        "num_objects_missing": 0,
        "num_objects_degraded": 0,
        "num_objects_misplaced": 0,
        "num_objects_unfound": 0,
        "num_objects_dirty": 0,
        "num_whiteouts": 0,
        "num_read": 0,
        "num_read_kb": 0,
        "num_write": 0,
        "num_write_kb": 0,
        "num_scrub_errors": 0,
        "num_shallow_scrub_errors": 0,
        "num_deep_scrub_errors": 0,
        "num_objects_recovered": 0,
        "num_bytes_recovered": 0,
        "num_keys_recovered": 0,
        "num_objects_omap": 0,
        "num_objects_hit_set_archive": 0,
        "num_bytes_hit_set_archive": 0,
        "num_flush": 0,
        "num_flush_kb": 0,
        "num_evict": 0,
        "num_evict_kb": 0,
        "num_promote": 0,
        "num_flush_mode_high": 0,
        "num_flush_mode_low": 0,
        "num_evict_mode_some": 0,
        "num_evict_mode_full": 0,
        "num_objects_pinned": 0,
        "num_legacy_snapsets": 0,
        "num_large_omap_objects": 0,
        "num_objects_manifest": 0,
        "num_omap_bytes": 0,
        "num_omap_keys": 0,
        "num_objects_repaired": 0
...
  ],
  "recovery_state": [
    {
      "name": "Started/Primary/Active",
      "enter_time": "2020-08-11 11:50:38.233290",
      "might_have_unfound": [],
      "recovery_progress": {
        "backfill_targets": [],
        "waiting_on_backfill": [],
        "last_backfill_started": "MIN",
        "backfill_info": {
          "begin": "MIN",
          "end": "MIN",
          "objects": []
        },
        "peer_backfill_info": [],
        "backfills_in_flight": [],
        "recovering": [],
        "pg_backend": {
          "pull_from_peer": [],
          "pushing": []
        }
      },
      "scrub": {
        "scrubber.epoch_start": "0",
        "scrubber.active": false,
        "scrubber.state": "INACTIVE",
        "scrubber.start": "MIN",
        "scrubber.end": "MIN",
        "scrubber.max_end": "MIN",
        "scrubber.subset_last_update": "0'0",
        "scrubber.deep": false,
        "scrubber.waiting_on_whom": []
      }
    },
    {
      "name": "Started",
      "enter_time": "2020-08-11 11:50:37.502984"
    }
  ],
  "agent_state": {}
}

Pool 14's PGs have no data and no activity!!!!
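To check that the other seven PGs of the pool look the same without reading eight full JSON dumps, a small loop over just the state field works (a sketch; .state is the top-level field of the query output above):

# Print only the state of each PG in pool 14
for pg in 14.0 14.1 14.2 14.3 14.4 14.5 14.6 14.7; do
  echo -n "$pg: "; ceph tell "$pg" query | jq -r .state
done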
Let’s take an in-depth look at the rest of the PGs:

# ceph pg dump
version 2063560
stamp 2020-08-17 09:46:06.557196
last_osdmap_epoch 0
last_pg_scan 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES      OMAP_BYTES* OMAP_KEYS* LOG  DISK_LOG STATE                      STATE_STAMP                VERSION      REPORTED       UP            UP_PRIMARY ACTING        ACTING_PRIMARY LAST_SCRUB   SCRUB_STAMP                LAST_DEEP_SCRUB DEEP_SCRUB_STAMP           SNAPTRIMQ_LEN 
...
5.7           3                  0        0         0       0        897           0          0    3        3               active+clean 2020-08-17 00:25:41.809330       3605'5      3629:3491  [5,15,14,18]          5  [5,15,14,18]              5       3605'5 2020-08-17 00:25:41.809266          3605'5 2020-08-13 05:37:52.184231             0 
14.6          0                  0        0         0       0          0           0          0    0        0          active+undersized 2020-08-11 11:50:38.208103          0'0         3628:8         [8,3]          8         [8,3]              8          0'0 2020-08-11 11:50:37.135596             0'0 2020-08-11 11:50:37.135596             0 
...
14       0 0 0 0 0           0          0       0    14    14 
13    4652 0 0 0 0   259597692 2169033596 4701468 24546 24546 
12 1182753 0 0 0 0 80144984316          0       0 98035 98035 
11      15 0 0 0 0           0   75562576  256644 21660 21660 
10   69520 0 0 0 0 29298471706          0       0 24446 24446 
5        5 0 0 0 0        2050          0       0     5     5 
6        8 0 0 0 0           0          0       0  2747  2747 
7       76 0 0 0 0       14374      11048      60 12097 12097 
8      207 0 0 0 0           0          0       0 24532 24532 
                                                                  
sum 1257236 0 0 0 0 109703070138 2244607220 4958172 208082 208082 
OSD_STAT USED    AVAIL   USED_RAW TOTAL   HB_PEERS                                          PG_SUM PRIMARY_PG_SUM 
19        32 GiB 2.0 TiB   33 GiB 2.0 TiB  [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]     17              4 
18        28 GiB 2.0 TiB   29 GiB 2.0 TiB  [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,19]     22              6 
17        48 GiB 2.0 TiB   49 GiB 2.0 TiB  [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19]     21              9 
16        20 GiB 2.0 TiB   21 GiB 2.0 TiB  [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17,18,19]     17              4 
15        32 GiB 2.0 TiB   33 GiB 2.0 TiB  [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,16,17,18,19]     19              4 
14        36 GiB 2.0 TiB   37 GiB 2.0 TiB  [0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19]     20              2 
13        27 GiB 2.0 TiB   28 GiB 2.0 TiB  [0,1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19]     17              3 
12        47 GiB 2.0 TiB   48 GiB 2.0 TiB  [0,1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19]     23              8 
11        12 GiB 2.0 TiB   13 GiB 2.0 TiB  [0,1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19]     11              5 
10        17 GiB 2.0 TiB   18 GiB 2.0 TiB  [0,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19]     14              2 
3         24 GiB 2.0 TiB   25 GiB 2.0 TiB [0,1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]     14              5 
2         20 GiB 2.0 TiB   21 GiB 2.0 TiB [0,1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]     12              4 
1         36 GiB 2.0 TiB   37 GiB 2.0 TiB [0,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]     18              4 
0         41 GiB 2.0 TiB   42 GiB 2.0 TiB [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]     21              2 
4         24 GiB 2.0 TiB   25 GiB 2.0 TiB [0,1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]     18              4 
5         40 GiB 2.0 TiB   42 GiB 2.0 TiB [0,1,2,3,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19]     23             10 
6         55 GiB 1.9 TiB   56 GiB 2.0 TiB [0,1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19]     21              4 
7         35 GiB 2.0 TiB   36 GiB 2.0 TiB [0,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18,19]     23              5 
8         32 GiB 2.0 TiB   33 GiB 2.0 TiB [0,1,2,3,4,5,6,7,9,10,11,12,13,14,15,16,17,18,19]     16              6 
9         31 GiB 2.0 TiB   33 GiB 2.0 TiB [0,1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19]     21              5 
sum      636 GiB  39 TiB  659 GiB  40 TiB                                                                         

* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilisation. See http://docs.ceph.com/docs/master/dev/placement-group/#omap-statistics for further details.
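The full dump is huge. To look only at pool 14, the brief form can be filtered on the pool id prefix (a sketch; pgs_brief is a standard dump target, and its first line is the column header):

# Keep the header plus only the PG rows of pool 14
ceph pg dump pgs_brief 2>/dev/null | awk 'NR == 1 || $1 ~ /^14\./'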

Pool 14's PGs are the only ones that **don’t have** high availability (3 replicas or more): their acting sets list just two OSDs, while every other pool's PGs have three or four… why?
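"Undersized" means the acting set is smaller than the pool's configured replica count, so it's worth checking what pool 14 is configured for (same output format as further down):

# Configured replica count for pool 14: it wants "size 3" but only 2 OSDs are acting
ceph osd pool ls detail | grep "^pool 14 "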

# ceph pg dump_stuck inactive
ok
# ceph pg dump_stuck stale
ok
# ceph pg dump_stuck undersized
ok
PG_STAT STATE                      UP     UP_PRIMARY ACTING ACTING_PRIMARY 
14.1             active+undersized [18,7]         18 [18,7]             18 
14.0             active+undersized [5,12]          5 [5,12]              5 
14.3             active+undersized  [8,5]          8  [8,5]              8 
14.2    active+undersized+degraded [7,18]          7 [7,18]              7 
14.6             active+undersized  [8,3]          8  [8,3]              8 
14.7             active+undersized [18,3]         18 [18,3]             18 
14.4    active+undersized+degraded [5,18]          5 [5,18]              5 
14.5             active+undersized [16,1]         16 [16,1]             16 
# ceph pg force-recovery 14.0
pg 14.0 doesn't require recovery; 

# ceph pg force-backfill 14.0
pg 14.0 doesn't require backfilling; 

# ceph pg force-recovery 14.4
instructing pg(s) [14.4] on osd.5 to force-recovery; 

# ceph pg force-backfill 14.4
instructing pg(s) [14.4] on osd.5 to force-backfill; 

# ceph pg force-recovery 14.2
instructing pg(s) [14.2] on osd.7 to force-recovery; 

# ceph pg force-backfill 14.2
instructing pg(s) [14.2] on osd.7 to force-backfill; 
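For the record, force-recovery and force-backfill accept several PG ids per call, so the two degraded PGs could have been handled together:

# Same as the four commands above, batched
ceph pg force-recovery 14.2 14.4
ceph pg force-backfill 14.2 14.4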

# ceph pg ls
PG    OBJECTS DEGRADED MISPLACED UNFOUND BYTES      OMAP_BYTES* OMAP_KEYS* LOG  STATE                      SINCE VERSION      REPORTED       UP               ACTING           SCRUB_STAMP                DEEP_SCRUB_STAMP           
5.0         1        0         0       0        348           0          0    1               active+clean   18h       3605'2      ...
13.7      559        0         0       0   25166390   290987928     629844 3072               active+clean   33h  3629'373240    3629:428639   [16,2,17,1]p16   [16,2,17,1]p16 2020-08-16 00:16:42.372384 2020-08-13 15:45:28.525122 
14.0        0        0         0       0          0           0          0    0          active+undersized    5d          0'0         3628:8         [5,12]p5         [5,12]p5 2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596 
14.1        0        0         0       0          0           0          0    0          active+undersized    5d          0'0         3628:8        [18,7]p18        [18,7]p18 2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596 
14.2        0        0         0       0          0           0          0    7 active+undersized+degraded    5d       3629'7        3629:21         [7,18]p7         [7,18]p7 2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596 
14.3        0        0         0       0          0           0          0    0          active+undersized    5d          0'0         3628:8          [8,5]p8          [8,5]p8 2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596 
14.4        0        0         0       0          0           0          0    7 active+undersized+degraded    5d       3629'7        3629:21         [5,18]p5         [5,18]p5 2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596 
14.5        0        0         0       0          0           0          0    0          active+undersized    5d          0'0         3628:8        [16,1]p16        [16,1]p16 2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596 
14.6        0        0         0       0          0           0          0    0          active+undersized    5d          0'0         3628:8          [8,3]p8          [8,3]p8 2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596 
14.7        0        0         0       0          0           0          0    0          active+undersized    5d          0'0         3628:8        [18,3]p18        [18,3]p18 2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596 

* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilisation. See http://docs.ceph.com/docs/master/dev/placement-group/#omap-statistics for further details.

Thoughts:
* Pool 14's PGs are **empty**!!!
* They are not scrubbing!!!
* They are not backfilling!!!!!

**WHY????????**

# ceph osd lspools
5 .rgw.root
6 default.rgw.control
7 default.rgw.meta
8 default.rgw.log
10 default.rgw.buckets.data
11 default.rgw.buckets.index
12 cephfs_data-ftp
13 cephfs_metadata-ftp
14 default.rgw.buckets.non-ec

Maybe pool 14's PGs are **EMPTY** because the pool is unused!!!!!

The pool is indeed empty, but there are other empty pools and the warning only comes from this one.
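That is easy to double-check: rados df lists objects and bytes per pool, and shows other pools sitting at zero too:

# Per-pool usage: pool 14 is not the only empty one
rados df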

# ceph osd pool autoscale-status
 POOL                          SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE 
 default.rgw.buckets.non-ec      0                 3.0        40940G  0.0000                 1.0       8              warn      
...
 default.rgw.log                 0                 4.0        40940G  0.0000                 1.0       8              on      

This pool does not have the autoscaler enabled!!!!
Turning it on:

# ceph osd pool set default.rgw.buckets.non-ec pg_autoscale_mode on
set pool 14 pg_autoscale_mode to on
# ceph osd pool autoscale-status
 POOL                          SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE 
 default.rgw.buckets.non-ec      0                 3.0        40940G  0.0000                 1.0       8              on        
...
 default.rgw.log                 0                 4.0        40940G  0.0000                 1.0       8              on       
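If you would rather not leave any pool in warn mode, a small loop enables the autoscaler everywhere (a sketch; pool names come from ceph osd lspools, whose output is shown above):

# Enable the PG autoscaler on every pool
for pool in $(ceph osd lspools | awk '{print $2}'); do
  ceph osd pool set "$pool" pg_autoscale_mode on
done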

Check:

# ceph  -s
...
    health: HEALTH_WARN
            Degraded data redundancy: 8 pgs undersized
...
 
# ceph health detail 
HEALTH_WARN Degraded data redundancy: 8 pgs undersized
PG_DEGRADED Degraded data redundancy: 8 pgs undersized
    pg 14.0 is stuck undersized for 513135.338453, current state active+undersized, last acting [5,12]
    pg 14.1 is stuck undersized for 513135.375686, current state active+undersized, last acting [18,7]
    pg 14.2 is stuck undersized for 513135.291865, current state active+undersized, last acting [7,18]
    pg 14.3 is stuck undersized for 513135.370383, current state active+undersized, last acting [8,5]
    pg 14.4 is stuck undersized for 513135.338453, current state active+undersized, last acting [5,18]
    pg 14.5 is stuck undersized for 513135.317750, current state active+undersized, last acting [16,1]
    pg 14.6 is stuck undersized for 513135.370383, current state active+undersized, last acting [8,3]
    pg 14.7 is stuck undersized for 513135.375623, current state active+undersized, last acting [18,3]

Ceph is moving!!! The degraded PGs recovered, but the WARNING is still there. Let's look at the placement rule:

# ceph osd pool ls detail | grep "non-ec"
pool 14 'default.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 3630 flags hashpspool stripe_width 0 application rgw
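Before switching, you can list the available CRUSH rules and inspect the custom one (CiberterminalRule is this cluster's own rule):

# Compare the default rule with the custom one
ceph osd crush rule ls
ceph osd crush rule dump CiberterminalRule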

Pool 14 has the default placement rule (crush_rule 0); switching it to CiberterminalRule:

# ceph osd pool set default.rgw.buckets.non-ec crush_rule CiberterminalRule
set pool 14 crush_rule to CiberterminalRule

# ceph osd pool ls detail | grep "non-ec"
pool 14 'default.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 3631 flags hashpspool stripe_width 0 application rgw
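With the new rule applied, each PG should pick up a third OSD. The fresh up/acting sets can be watched per PG (once recovery finishes, three OSDs should be listed for each):

# Show the new mapping of every PG in pool 14
for pg in 14.0 14.1 14.2 14.3 14.4 14.5 14.6 14.7; do
  ceph pg map "$pg"
done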

Check:

# ceph health detail
HEALTH_OK
# ceph -s
  cluster:
    id:     a3a799ce-f1d3-4230-a915-06e988fee767
    health: HEALTH_OK
 ...

**OUUUUU YEAHHHHHHHHH**
