Make [u8]::cmp implementation branchless
The current implementation generates rather ugly assembly code, branching when the common parts are equal. By performing the comparison of the lengths upfront using a subtraction, the assembly gets much prettier: https://godbolt.org/z/4e5fnEKGd.
This will probably not impact speed too much, as the expensive part is in most cases the `memcmp`, but it sure looks better (I'm porting a sorting algorithm currently, and that branch just bothered me).