optimize utf8_is_cont_byte() to speed up str.chars().count()
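Benchmarks show consistent improvements on the longer inputs across several x86_64 feature levels; the short-input case stays at 4-5 ns/iter, which is at the limit of the benchmark's resolution.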
```
old, -O2, x86-64
test str::str_char_count_emoji ... bench: 1,924 ns/iter (+/- 26)
test str::str_char_count_lorem ... bench: 879 ns/iter (+/- 12)
test str::str_char_count_lorem_short ... bench: 5 ns/iter (+/- 0)
new, -O2, x86-64
test str::str_char_count_emoji ... bench: 1,878 ns/iter (+/- 21)
test str::str_char_count_lorem ... bench: 851 ns/iter (+/- 11)
test str::str_char_count_lorem_short ... bench: 4 ns/iter (+/- 0)
old, -O2, x86-64-v2
test str::str_char_count_emoji ... bench: 1,477 ns/iter (+/- 46)
test str::str_char_count_lorem ... bench: 675 ns/iter (+/- 15)
test str::str_char_count_lorem_short ... bench: 5 ns/iter (+/- 0)
new, -O2, x86-64-v2
test str::str_char_count_emoji ... bench: 1,323 ns/iter (+/- 39)
test str::str_char_count_lorem ... bench: 593 ns/iter (+/- 18)
test str::str_char_count_lorem_short ... bench: 4 ns/iter (+/- 0)
old, -O2, x86-64-v3
test str::str_char_count_emoji ... bench: 748 ns/iter (+/- 7)
test str::str_char_count_lorem ... bench: 348 ns/iter (+/- 2)
test str::str_char_count_lorem_short ... bench: 5 ns/iter (+/- 0)
new, -O2, x86-64-v3
test str::str_char_count_emoji ... bench: 650 ns/iter (+/- 4)
test str::str_char_count_lorem ... bench: 301 ns/iter (+/- 1)
test str::str_char_count_lorem_short ... bench: 5 ns/iter (+/- 0)
```
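
For readers unfamiliar with the helper: a UTF-8 continuation byte has the bit pattern `10xxxxxx`, and the number of chars in a valid string equals the number of bytes that are not continuation bytes. The sketch below is illustrative only and is not taken from this change. It shows two equivalent ways to write the check, a mask-and-compare and a signed comparison. The function names `is_cont_byte_mask`, `is_cont_byte_signed`, and `count_chars` are hypothetical, and which form the commit actually uses is an assumption not confirmed by the text above.

```rust
/// Mask-and-compare form: keep the top two bits and test for 0b10.
/// (Hypothetical name, for illustration.)
fn is_cont_byte_mask(byte: u8) -> bool {
    (byte & 0b1100_0000) == 0b1000_0000
}

/// Signed-compare form: reinterpreted as i8, the range 0x80..=0xBF
/// becomes -128..=-65, so a single comparison against -64 suffices.
/// (Hypothetical name, for illustration.)
fn is_cont_byte_signed(byte: u8) -> bool {
    (byte as i8) < -64
}

/// Counts chars by counting the bytes that start a code point,
/// i.e. every byte that is not a continuation byte. This relies on
/// the input being well-formed UTF-8, which `&str` guarantees.
fn count_chars(s: &str) -> usize {
    s.bytes().filter(|&b| !is_cont_byte_signed(b)).count()
}
```

Both predicates agree on all 256 byte values, and `count_chars(s)` matches `s.chars().count()` for any `&str`. A single comparison per byte may be easier for the compiler to vectorize than a mask followed by a compare, which would be consistent with the larger gains at the x86-64-v2 and x86-64-v3 levels, though that explanation is a guess rather than something the numbers above prove.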