For the last few days I’ve been busy trying to find an “error” in my CRUSH map.
I found that some of my OSDs were underused or not used at all… I didn’t know why, because I had built the CRUSH map from scratch with the common architecture based on datacenter, rack & cluster, and it was correct from the Ceph point of view (it was running on the cluster).
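For reference, a decompiled CRUSH map with that kind of hierarchy looks roughly like the sketch below; the bucket names and weights here are made up for illustration, they are not my real map:

```
# hypothetical hierarchy sketch: host -> rack -> datacenter -> root
host node01 {
    id -2
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 2.000
    item osd.2 weight 2.000
}
rack rack01 {
    id -5
    alg straw2
    hash 0
    item node01 weight 4.000
}
datacenter dc01 {
    id -8
    alg straw2
    hash 0
    item rack01 weight 4.000
}
root default {
    id -1
    alg straw2
    hash 0
    item dc01 weight 4.000
}
```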
I decided to strip the map down to a much simpler one.
Something like this:
Output from `ceph osd crush tree`:
```
proceph-osm-001 ~/crushmaps/crush_20190725 # ceph osd crush tree
ID CLASS WEIGHT  TYPE NAME
-3       2.00000 root default
-2       1.00000     host ciberterminal2_cluster
 1   hdd 1.99899         osd.1
 3   hdd 1.99899         osd.3
 5   hdd 1.99899         osd.5
 7   hdd 1.99899         osd.7
 9   hdd 1.99899         osd.9
11   hdd 1.99899         osd.11
13   hdd 1.99899         osd.13
15   hdd 1.99899         osd.15
17   hdd 1.99899         osd.17
19   hdd 1.99899         osd.19
-1       1.00000     host ciberterminal_cluster
 0   hdd 1.99899         osd.0
 2   hdd 1.99899         osd.2
 4   hdd 1.99899         osd.4
 6   hdd 1.99899         osd.6
 8   hdd 1.99899         osd.8
10   hdd 1.99899         osd.10
12   hdd 1.99899         osd.12
14   hdd 1.99899         osd.14
16   hdd 1.99899         osd.16
18   hdd 1.99899         osd.18
```
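In case it helps, the usual workflow to edit a CRUSH map offline and load it back looks roughly like this (file names are just examples):

```bash
# dump the current CRUSH map from the cluster (binary format)
ceph osd getcrushmap -o crushmap.bin

# decompile it to an editable text file
crushtool -d crushmap.bin -o crushmap.txt

# ... edit crushmap.txt (buckets, rules) with your favourite editor ...

# recompile the edited map
crushtool -c crushmap.txt -o crushmap_new.bin

# load the new map into the cluster
ceph osd setcrushmap -i crushmap_new.bin
```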
But it still didn’t allocate PGs on the even-numbered OSDs!
I kept reading about CRUSH parameters, and then I found a good explanation of `step chooseleaf firstn` here. What it says is that this value is the DEPTH of your CRUSH tree (there is a sketch of the stock rule that uses it right after this list):
- 0 for a single-node cluster
- 1 for a multi-node cluster in a single rack
- 2 for a multi-node, multi-chassis cluster with multiple hosts in a chassis
- 3 for a multi-node cluster with hosts across racks, etc.
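For context, this is the shape of the stock replicated rule that Ceph creates by default, which is where that chooseleaf line normally appears; shown here as a generic sketch, not taken from my cluster:

```
# default replicated rule shipped by Ceph (generic sketch)
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    # start from the "default" root bucket
    step take default
    # pick one OSD from each of N distinct hosts (N = pool size, because firstn is 0)
    step chooseleaf firstn 0 type host
    step emit
}
```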
But that “depth” definition was not accurate… You can try to make sense of the official documentation and read about:
step choose firstn {num} type {bucket-type}
What I understand is that {num} is the “osd pool size” at this stage of the CRUSH mapping algorithm, i.e. how many buckets of that type get picked.
And in the next step:
step chooseleaf firstn {num} type {bucket-type}
{num} is the “osd pool size” at the “leaf” level: how many OSDs get picked inside each bucket chosen in the previous step.
So in my case I’ll have 4 copies at the “datacenter” layer and 2 at the “osd” layer, so CRUSH will place 2 of those copies in each datacenter.
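A quick way to see where each placement would actually land is to ask crushtool for the mappings themselves; a minimal sketch, assuming the compiled map is in other_crushmap_new.bin as in the test further down:

```bash
# decompile the map to double-check the rule text
crushtool -d other_crushmap_new.bin -o other_crushmap_new.txt

# print the OSDs that rule 1 would select for each input x
crushtool --test -i other_crushmap_new.bin --rule 1 --num-rep 4 --show-mappings
```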
What I’ve seen is that you want to have a fixed number for “size” and “min_size” on your pool. The best option is to set up a placement rule that enforces those options.
For example:
```
rule CiberterminalRule {
    id 1
    type replicated
    min_size 2
    max_size 10
    # begin iterating in the "root" of the crush tree
    step take default
    step choose firstn 4 type datacenter
    step chooseleaf firstn 2 type osd
    step emit
}
```
The checks give me a “good” result:
```
# crushtool --test -i other_crushmap_new.bin --show-utilization-all --rule 1 --num-rep=4
devices weights (hex): [10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000]
rule 1 (CiberterminalRule), x = 0..1023, numrep = 4..4
rule 1 (CiberterminalRule) num_rep 4 result size == 4:  1024/1024
  device 0:   stored : 212   expected : 102.4
  device 1:   stored : 204   expected : 102.4
  device 2:   stored : 177   expected : 102.4
  device 3:   stored : 211   expected : 102.4
  device 4:   stored : 194   expected : 102.4
  device 5:   stored : 216   expected : 102.4
  device 6:   stored : 225   expected : 102.4
  device 7:   stored : 209   expected : 102.4
  device 8:   stored : 220   expected : 102.4
  device 9:   stored : 210   expected : 102.4
  device 10:  stored : 213   expected : 102.4
  device 11:  stored : 193   expected : 102.4
  device 12:  stored : 194   expected : 102.4
  device 13:  stored : 190   expected : 102.4
  device 14:  stored : 209   expected : 102.4
  device 15:  stored : 223   expected : 102.4
  device 16:  stored : 197   expected : 102.4
  device 17:  stored : 189   expected : 102.4
  device 18:  stored : 207   expected : 102.4
  device 19:  stored : 203   expected : 102.4
```
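Once the test looks good, what remains is loading the map into the cluster and pointing the pool at the new rule; roughly like this (the pool name is just a placeholder, the replica counts follow the rule and the test above):

```bash
# load the tested map into the cluster
ceph osd setcrushmap -i other_crushmap_new.bin

# point an existing pool at the new rule and fix its replica counts
# ("mypool" is a placeholder pool name)
ceph osd pool set mypool crush_rule CiberterminalRule
ceph osd pool set mypool size 4
ceph osd pool set mypool min_size 2
```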