A partial archive of discourse.wicg.io as of Saturday February 24, 2024.

String .surrogateLength to count surrogate pairs of unicode characters


Continuing the discussion from Input character class whitelisting and blacklisting:

Counting the length of strings with emoji input can be tricky. Some (not all) emoji are encoded as surrogate pairs, meaning JavaScript counts them as two UTF-16 code units, not one character.

While this is technically correct, it’s sometimes useful (for validation, hints, etc.) to present a string length back to a user in a way that’s meaningful to them (one emoji = one character).

For example…

"Hello 😃😃".length -> 10
"Hello 😃😃".surrogateLength -> 8

Unless I’m misremembering, JS defines the default String iterator to iterate by code point, not code unit, so this will work:

[..."Hello 😃😃"].length -> 8
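
For anyone who wants this as a reusable helper, here is a minimal sketch. The name codePointLength is made up for illustration (it is not a built-in); it relies only on the string iterator behaviour described above, and it avoids building the intermediate array that the spread form creates.

```javascript
// Hypothetical helper, not a built-in: counts Unicode code points
// by walking the string with its default iterator.
function codePointLength(str) {
  let count = 0;
  for (const _ of str) {
    count += 1; // one iteration per code point; a surrogate pair counts once
  }
  return count;
}

const sample = "Hello 😃😃";
console.log(sample.length);           // 10 (UTF-16 code units)
console.log(codePointLength(sample)); // 8  (code points)
```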

(Note that there’s still a distinction between code points and grapheme clusters. If you used a skin-tone modifier on an emoji, that’s two code points, though only one grapheme cluster. Accented letters might be one or two code points, depending on whether they were entered as precomposed or as letter + combining character.)
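
To make the code point vs. grapheme cluster distinction concrete, here is a rough sketch using Intl.Segmenter. It assumes the runtime ships Intl.Segmenter (recent browsers and Node releases do); graphemeLength is a hypothetical helper name, and the "en" default locale is an arbitrary choice.

```javascript
// Hypothetical helper: counts user-perceived characters (grapheme clusters).
// Assumes Intl.Segmenter is available in the runtime.
function graphemeLength(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _ of segmenter.segment(str)) {
    count += 1; // one segment per grapheme cluster
  }
  return count;
}

const thumbsUp = "\u{1F44D}\u{1F3FD}"; // thumbs up + medium skin-tone modifier
const accented = "e\u0301";            // "e" + combining acute accent

console.log([...thumbsUp].length, graphemeLength(thumbsUp)); // 2 1
console.log([...accented].length, graphemeLength(accented)); // 2 1
```

Both samples are two code points but come back as a single grapheme cluster, which is closer to what a user would call "one character".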


And, of course, while this can be “solved” by some forms of Unicode normalization, one should note that the notion of the “length” of a Unicode string (rather than the length of the “raw” JS string) is a very tricky subject, and even counting code points post-normalization may not be adequate to address what could be considered its “true” length (due to things like ligatures and Zero-Width Joiners).
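
A small illustration of that point, as a sketch only: String.prototype.normalize("NFC") collapses a letter + combining accent into its precomposed form, but it does nothing for a Zero-Width Joiner sequence, which still counts as several code points even though it is usually drawn as one glyph.

```javascript
// "e" followed by U+0301 COMBINING ACUTE ACCENT
const decomposed = "e\u0301";
// NFC composes the pair into the single precomposed code point U+00E9
const composed = decomposed.normalize("NFC");

console.log([...decomposed].length); // 2
console.log([...composed].length);   // 1

// man + ZWJ + woman + ZWJ + girl: usually rendered as one family emoji
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
const normalizedFamily = family.normalize("NFC");

console.log(family.length);                // 8 (UTF-16 code units)
console.log([...normalizedFamily].length); // 5 (code points, unchanged by NFC)
```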

I think more than length or surrogateLength, there should be a utf8ByteLength, which reports the amount of space the string will take up in its most common modern representation (without having to actually reserve the space to do the calculation locally). This may be somewhere in Intl with all the other functions around Unicode string normalization - I don’t actually touch this space in depth that often (mostly owing to its aforementioned hairiness).
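
As a sketch of what such a helper could compute: the function below (the name utf8ByteLength comes from the suggestion above; it is not an existing built-in) adds up the UTF-8 size of each code point without allocating an encoded copy. As far as I know, the closest thing that exists today is TextEncoder rather than anything in Intl, and it does allocate the encoded bytes; it is used here only as a cross-check.

```javascript
// Hypothetical helper, not a built-in: UTF-8 size of a string in bytes,
// computed per code point without building the encoded buffer.
function utf8ByteLength(str) {
  let bytes = 0;
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (cp <= 0x7f) bytes += 1;        // ASCII
    else if (cp <= 0x7ff) bytes += 2;  // e.g. Latin accents, Greek, Cyrillic
    else if (cp <= 0xffff) bytes += 3; // rest of the Basic Multilingual Plane
    else bytes += 4;                   // astral code points, e.g. most emoji
  }
  return bytes;
}

const text = "Hello 😃😃";
console.log(utf8ByteLength(text));                  // 14
console.log(new TextEncoder().encode(text).length); // 14 (allocates a Uint8Array)
```

A lone surrogate falls into the 3-byte branch, which matches the 3-byte U+FFFD replacement that TextEncoder emits for it.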