For the last few days I’ve been busy trying to find an “error” in my CRUSH map.
I found that some of my OSDs were underused or not used at all… I didn’t know why, because I had built the CRUSH map from scratch with the common architecture based on datacenter, rack & cluster, and it was correct from the Ceph point of view (it was running on the cluster).
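For reference, a decompiled CRUSH map with that kind of hierarchy looks roughly like the sketch below; the bucket names and weights here are made up for illustration, they are not my real map:

```
# hypothetical hierarchy sketch: host -> rack -> datacenter -> root
host node01 {
    id -2
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 2.000
    item osd.2 weight 2.000
}
rack rack01 {
    id -5
    alg straw2
    hash 0
    item node01 weight 4.000
}
datacenter dc01 {
    id -8
    alg straw2
    hash 0
    item rack01 weight 4.000
}
root default {
    id -1
    alg straw2
    hash 0
    item dc01 weight 4.000
}
```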
I decided to strip the map down to a much simpler one.
Something like this:
Output from `ceph osd crush tree`:
```
proceph-osm-001 ~/crushmaps/crush_20190725 # ceph osd crush tree
ID CLASS WEIGHT  TYPE NAME
-3       2.00000 root default
-2       1.00000     host ciberterminal2_cluster
 1   hdd 1.99899         osd.1
 3   hdd 1.99899         osd.3
 5   hdd 1.99899         osd.5
 7   hdd 1.99899         osd.7
 9   hdd 1.99899         osd.9
11   hdd 1.99899         osd.11
13   hdd 1.99899         osd.13
15   hdd 1.99899         osd.15
17   hdd 1.99899         osd.17
19   hdd 1.99899         osd.19
-1       1.00000     host ciberterminal_cluster
 0   hdd 1.99899         osd.0
 2   hdd 1.99899         osd.2
 4   hdd 1.99899         osd.4
 6   hdd 1.99899         osd.6
 8   hdd 1.99899         osd.8
10   hdd 1.99899         osd.10
12   hdd 1.99899         osd.12
14   hdd 1.99899         osd.14
16   hdd 1.99899         osd.16
18   hdd 1.99899         osd.18
```
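In case it helps, the usual workflow to edit a CRUSH map offline and load it back looks roughly like this (file names are just examples):

```bash
# dump the current CRUSH map from the cluster (binary format)
ceph osd getcrushmap -o crushmap.bin

# decompile it to an editable text file
crushtool -d crushmap.bin -o crushmap.txt

# ... edit crushmap.txt (buckets, rules) with your favourite editor ...

# recompile the edited map
crushtool -c crushmap.txt -o crushmap_new.bin

# load the new map into the cluster
ceph osd setcrushmap -i crushmap_new.bin
```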
But it still didn’t allocate PGs on the even-numbered OSDs!
I kept reading about CRUSH parameters, and then I found a good explanation of `step chooseleaf firstn` here. What it says is that this value is the DEPTH of your CRUSH tree (there is a sketch of the stock rule that uses it right after this list):
- 0 for a single-node cluster
- 1 for a multi-node cluster in a single rack
- 2 for a multi-node, multi-chassis cluster with multiple hosts in a chassis
- 3 for a multi-node cluster with hosts across racks, etc.
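For context, this is the shape of the stock replicated rule that Ceph creates by default, which is where that chooseleaf line normally appears; shown here as a generic sketch, not taken from my cluster:

```
# default replicated rule shipped by Ceph (generic sketch)
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    # start from the "default" root bucket
    step take default
    # pick one OSD from each of N distinct hosts (N = pool size, because firstn is 0)
    step chooseleaf firstn 0 type host
    step emit
}
```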
But that “depth” definition was not accurate… You can try to make sense of the official documentation and read about:
step choose firstn {num} type {bucket-type}
What I understand is that {num} is the “osd pool size” at this stage of the CRUSH mapping algorithm, i.e. how many buckets of that type get picked.
And in the next step:
step chooseleaf firstn {num} type {bucket-type}
{num} is the “osd pool size” at the “leaf” level: how many OSDs get picked inside each bucket chosen in the previous step.
So in my case I’ll have 4 copies at the “datacenter” layer and 2 at the “osd” layer, so CRUSH will place 2 of those copies in each datacenter.
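A quick way to see where each placement would actually land is to ask crushtool for the mappings themselves; a minimal sketch, assuming the compiled map is in other_crushmap_new.bin as in the test further down:

```bash
# decompile the map to double-check the rule text
crushtool -d other_crushmap_new.bin -o other_crushmap_new.txt

# print the OSDs that rule 1 would select for each input x
crushtool --test -i other_crushmap_new.bin --rule 1 --num-rep 4 --show-mappings
```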
What I’ve seen is that you want to have a fixed number for “size” and “min_size” on your pool. The best option is to set up a placement rule that enforces those options.
For example:
```
rule CiberterminalRule {
    id 1
    type replicated
    min_size 2
    max_size 10
    # begin iterating in the "root" of the crush tree
    step take default
    step choose firstn 4 type datacenter
    step chooseleaf firstn 2 type osd
    step emit
}
```
The checks give me a “good” result:
```
# crushtool --test -i other_crushmap_new.bin --show-utilization-all --rule 1 --num-rep=4
devices weights (hex): [10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000]
rule 1 (CiberterminalRule), x = 0..1023, numrep = 4..4
rule 1 (CiberterminalRule) num_rep 4 result size == 4:  1024/1024
  device 0:   stored : 212   expected : 102.4
  device 1:   stored : 204   expected : 102.4
  device 2:   stored : 177   expected : 102.4
  device 3:   stored : 211   expected : 102.4
  device 4:   stored : 194   expected : 102.4
  device 5:   stored : 216   expected : 102.4
  device 6:   stored : 225   expected : 102.4
  device 7:   stored : 209   expected : 102.4
  device 8:   stored : 220   expected : 102.4
  device 9:   stored : 210   expected : 102.4
  device 10:  stored : 213   expected : 102.4
  device 11:  stored : 193   expected : 102.4
  device 12:  stored : 194   expected : 102.4
  device 13:  stored : 190   expected : 102.4
  device 14:  stored : 209   expected : 102.4
  device 15:  stored : 223   expected : 102.4
  device 16:  stored : 197   expected : 102.4
  device 17:  stored : 189   expected : 102.4
  device 18:  stored : 207   expected : 102.4
  device 19:  stored : 203   expected : 102.4
```
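Once the test looks good, what remains is loading the map into the cluster and pointing the pool at the new rule; roughly like this (the pool name is just a placeholder, the replica counts follow the rule and the test above):

```bash
# load the tested map into the cluster
ceph osd setcrushmap -i other_crushmap_new.bin

# point an existing pool at the new rule and fix its replica counts
# ("mypool" is a placeholder pool name)
ceph osd pool set mypool crush_rule CiberterminalRule
ceph osd pool set mypool size 4
ceph osd pool set mypool min_size 2
```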