Auto merge of #88834 - the8472:char-count, r=joshtriplett
optimize str::from_utf8() validation when slice contains multibyte chars and str.chars().count() in all cases
The change shows small but consistent improvements across several x86 target feature levels. I also tried to optimize counting with `slice.as_chunks`, but that yielded less consistent results: bigger improvements at some optimization levels, smaller ones at others.
```
old, -O2, x86-64
test str::str_char_count_emoji ... bench: 1,924 ns/iter (+/- 26)
test str::str_char_count_lorem ... bench: 879 ns/iter (+/- 12)
test str::str_char_count_lorem_short ... bench: 5 ns/iter (+/- 0)
new, -O2, x86-64
test str::str_char_count_emoji ... bench: 1,878 ns/iter (+/- 21)
test str::str_char_count_lorem ... bench: 851 ns/iter (+/- 11)
test str::str_char_count_lorem_short ... bench: 4 ns/iter (+/- 0)
old, -O2, x86-64-v2
test str::str_char_count_emoji ... bench: 1,477 ns/iter (+/- 46)
test str::str_char_count_lorem ... bench: 675 ns/iter (+/- 15)
test str::str_char_count_lorem_short ... bench: 5 ns/iter (+/- 0)
new, -O2, x86-64-v2
test str::str_char_count_emoji ... bench: 1,323 ns/iter (+/- 39)
test str::str_char_count_lorem ... bench: 593 ns/iter (+/- 18)
test str::str_char_count_lorem_short ... bench: 4 ns/iter (+/- 0)
old, -O2, x86-64-v3
test str::str_char_count_emoji ... bench: 748 ns/iter (+/- 7)
test str::str_char_count_lorem ... bench: 348 ns/iter (+/- 2)
test str::str_char_count_lorem_short ... bench: 5 ns/iter (+/- 0)
new, -O2, x86-64-v3
test str::str_char_count_emoji ... bench: 650 ns/iter (+/- 4)
test str::str_char_count_lorem ... bench: 301 ns/iter (+/- 1)
test str::str_char_count_lorem_short ... bench: 5 ns/iter (+/- 0)
```
and for the multibyte-char string validation:
```
old, -O2, x86-64
test str::str_validate_emoji ... bench: 4,606 ns/iter (+/- 64)
new, -O2, x86-64
test str::str_validate_emoji ... bench: 3,837 ns/iter (+/- 60)
```
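
For context on why `str.chars().count()` can be computed from raw bytes: in valid UTF-8 every char contributes exactly one non-continuation byte, so counting bytes that do not match the `10xxxxxx` pattern gives the char count. The following is a minimal sketch of that idea, not the code in this change; `is_continuation_byte` and `count_chars` are hypothetical names used only for illustration.

```rust
/// Returns true if `byte` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
/// Hypothetical helper for illustration; not the name used in the standard library.
fn is_continuation_byte(byte: u8) -> bool {
    (byte & 0b1100_0000) == 0b1000_0000
}

/// Counts the chars in `s` by counting the bytes that are NOT continuation bytes.
/// This relies on `s` being valid UTF-8, which `&str` guarantees.
fn count_chars(s: &str) -> usize {
    s.as_bytes()
        .iter()
        .filter(|&&b| !is_continuation_byte(b))
        .count()
}
```

A simple per-byte loop like this is the kind of code a compiler can auto-vectorize, which would be one plausible reason the numbers above differ so much between x86-64 feature levels. For any `s: &str`, `count_chars(s)` should agree with `s.chars().count()`.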