For the last few days I’ve been busy trying to find an “error” in my CRUSH map.
I found that some of my OSDs were underused or not used at all… I didn’t know why, because I had built the CRUSH map from scratch with the common architecture based on datacenter, rack & cluster, and it was correct from the Ceph point of view (it was running on the cluster).
So I decided to strip the map down to a much simpler one.
Something like this (the output from ceph osd crush tree):
proceph-osm-001 ~/crushmaps/crush_20190725 # ceph osd crush tree
ID CLASS WEIGHT  TYPE NAME
-3       2.00000 root default
-2       1.00000     host ciberterminal2_cluster
 1   hdd 1.99899         osd.1
 3   hdd 1.99899         osd.3
 5   hdd 1.99899         osd.5
 7   hdd 1.99899         osd.7
 9   hdd 1.99899         osd.9
11   hdd 1.99899         osd.11
13   hdd 1.99899         osd.13
15   hdd 1.99899         osd.15
17   hdd 1.99899         osd.17
19   hdd 1.99899         osd.19
-1       1.00000     host ciberterminal_cluster
 0   hdd 1.99899         osd.0
 2   hdd 1.99899         osd.2
 4   hdd 1.99899         osd.4
 6   hdd 1.99899         osd.6
 8   hdd 1.99899         osd.8
10   hdd 1.99899         osd.10
12   hdd 1.99899         osd.12
14   hdd 1.99899         osd.14
16   hdd 1.99899         osd.16
18   hdd 1.99899         osd.18
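In case it’s useful, this is roughly the workflow I follow to edit the map by hand (the file names here are just examples):

ceph osd getcrushmap -o crushmap.bin            # dump the compiled CRUSH map from the cluster
crushtool -d crushmap.bin -o crushmap.txt       # decompile it to editable text
vi crushmap.txt                                 # edit buckets and rules
crushtool -c crushmap.txt -o crushmap_new.bin   # compile it back to binary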
But it still didn’t allocate PGs on the even-numbered OSDs!
I kept reading about CRUSH parameters, and then I found a good explanation of chooseleaf here:
step chooseleaf firstn
What I’ve found is that this value is the DEPTH of your CRUSH tree:
- 0 for a single-node cluster
- 1 for a multi-node cluster in a single rack
- 2 for a multi-node, multi-chassis cluster with multiple hosts in a chassis
- 3 for a multi-node cluster with hosts across racks, etc.
But this definition was not accurate… If you try to make sense of the official documentation, you’ll read about:
step choose firstn {num} type {bucket-type}
What I understand is that {num} is the “osd pool size” at this stage of the CRUSH mapping algorithm: how many buckets of that type will be picked.
And on the next step:
step chooseleaf firstn {num} type {bucket-type}
That is, the “osd pool size” on the “leaf” nodes: how many OSDs get picked inside each bucket chosen in the previous step.
So in my case I’ll have 4 copies at the “datacenter” layer and 2 at the “osd” layer, so CRUSH will put 2 of the leaf copies in each datacenter.
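For comparison, the default replicated_rule that ships with a stock Ceph install looks more or less like this, with firstn 0 (i.e. “as many as the pool size”) taken directly at the host level:

rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}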
What I’ve seen is that you want a fixed number for “size” and “min_size” on your pool. The best option is to set up a placement rule that enforces those options.
For example:
rule CiberterminalRule {
    id 1
    type replicated
    min_size 2
    max_size 10
    # begin iterating in the "root" of the crush tree
    step take default
    step choose firstn 4 type datacenter
    step chooseleaf firstn 2 type osd
    step emit
}
The checks give me a “good” result:
# crushtool --test -i other_crushmap_new.bin --show-utilization-all --rule 1 --num-rep=4
devices weights (hex): [10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000,10000]
rule 1 (CiberterminalRule), x = 0..1023, numrep = 4..4
rule 1 (CiberterminalRule) num_rep 4 result size == 4: 1024/1024
  device 0:   stored : 212   expected : 102.4
  device 1:   stored : 204   expected : 102.4
  device 2:   stored : 177   expected : 102.4
  device 3:   stored : 211   expected : 102.4
  device 4:   stored : 194   expected : 102.4
  device 5:   stored : 216   expected : 102.4
  device 6:   stored : 225   expected : 102.4
  device 7:   stored : 209   expected : 102.4
  device 8:   stored : 220   expected : 102.4
  device 9:   stored : 210   expected : 102.4
  device 10:  stored : 213   expected : 102.4
  device 11:  stored : 193   expected : 102.4
  device 12:  stored : 194   expected : 102.4
  device 13:  stored : 190   expected : 102.4
  device 14:  stored : 209   expected : 102.4
  device 15:  stored : 223   expected : 102.4
  device 16:  stored : 197   expected : 102.4
  device 17:  stored : 189   expected : 102.4
  device 18:  stored : 207   expected : 102.4
  device 19:  stored : 203   expected : 102.4
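Once the test looks sane, the new map can be injected into the cluster and the pool pointed at the rule. Something like this should do it (“mypool” is just a placeholder for the real pool name):

ceph osd setcrushmap -i other_crushmap_new.bin          # inject the compiled map into the cluster
ceph osd pool set mypool crush_rule CiberterminalRule   # make the pool use the new rule
ceph osd pool set mypool size 4
ceph osd pool set mypool min_size 2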