String .surrogateLength to count surrogate pairs of unicode characters

MerlinMason · 2015-08-07

Continuing the discussion from Input character class whitelisting and blacklisting:

Continuing the discussion from Invoke emoji input:

Why wouldn’t you want this to just be part of the browser functionality for every text input? Most/all text inputs are potential places you could put emoji.

That brings up an interesting point, though: is there a way in the current set of input attributes to really say “don’t present input methods for these classes of character”? Like, we have type hints for overall forms of input to present, like number pads, and we have complex validation expressions to mark certain overall strings as invalid, but there’s no really good hinting mechanism for “don’t even suggest the user be allowed to type this kind of character because I’m just going to reject it” (unless that disallowed set is “everything but digits” or “non-URL/email characters”).

Sure, a UA could try breaking down pattern to see if it can figure out what characters are contextually allowed by certain rules (although that’s going to fail for a ton of complex pattern values), but I think it’d be better if we had full-on “whitelist” and “blacklist” attributes to explicitly opt-in or opt-out to the inputs to certain well-defined character classes/sets (which could also be expanded to having the UA / polyfill actively filter out those characters, which is buggy code I know I’ve certainly written for more than a few scenarios).

Granted, restricting the character set is frequently not the best UX (see the way GitHub handles invalid characters when creating repo names, where it just tells you what name it’s going to convert it to) - but it also frequently is the best UX for many other inputs and contenteditable-based controls (namely, ones involving numbers).

Also also, blacklisting and whitelisting by named character classes is a more friendly alternative to validation than pattern in terms of locale-dependent patterns (even though most locale-dependent patterns already have their own type - which can’t be applied to contenteditable).

Counting the length of strings with emoji input can be tricky. Some (not all) emoji are encoded as surrogate pairs, meaning JavaScript counts them as two characters (UTF-16 code units), not one.

While this is technically correct, it’s sometimes useful (validation, hints, etc.) to present a string length back to a user in a way that’s meaningful to them (one emoji = one character).

For example…

"Hello 😃😃".length -> 10
"Hello 😃😃".surrogateLength -> 8

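To make the proposal concrete, here is a minimal sketch of what such a property might compute, written as a standalone function. The name surrogateLength comes from this post and is not an existing String API; the exact behaviour for unpaired surrogates is an assumption (each one is counted as a single character here).

```javascript
// Hypothetical helper: `surrogateLength` is the name proposed in this post,
// not an existing String property.
function surrogateLength(str) {
  let count = 0;
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate forms one pair:
    // skip the second half so the pair is counted once.
    if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) i++;
    }
    count++;
  }
  return count;
}

console.log("Hello 😃😃".length);           // 10
console.log(surrogateLength("Hello 😃😃")); // 8
```
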
tabatkins · 2015-08-07

Unless I’m misremembering, JS defines the default String iterator to iterate by code point, not code unit, so this will work:

[..."Hello 😃😃"].length -> 8

(Note that there’s still a distinction between code points and grapheme clusters. If you used a skin-tone modifier to modify an emoji, that’s two code points, though only one grapheme cluster. Accented letters might be one or two code points, depending on whether they were entered as precomposed or letter + combining character.)
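A rough sketch of that distinction, assuming a runtime that ships Intl.Segmenter (an API that postdates this thread). The helper name countGraphemes is made up for illustration.

```javascript
// Hypothetical helper; assumes Intl.Segmenter is available in the runtime.
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(str)) count++;
  return count;
}

const thumbsUp = "\u{1F44D}\u{1F3FD}"; // thumbs up + medium skin-tone modifier
console.log(thumbsUp.length);          // 4 code units
console.log([...thumbsUp].length);     // 2 code points
console.log(countGraphemes(thumbsUp)); // 1 grapheme cluster

const precomposed = "\u00e9"; // "é" as a single code point
const combining = "e\u0301";  // "e" + combining acute accent
console.log([...precomposed].length, countGraphemes(precomposed)); // 1 1
console.log([...combining].length, countGraphemes(combining));     // 2 1
```
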

stuartpb · 2015-08-07

And, of course, while this can be “solved” by some forms of Unicode normalization, one should note that the notion of the “length” of a Unicode string (rather than the length of the “raw” JS string) is a very tricky subject, and even counting code points post-normalization may not be adequate to address what could be considered its “true” length (due to things like ligatures and Zero-Width Joiners).
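As a small illustration of why normalization only goes so far: String.prototype.normalize("NFC") folds a letter plus combining accent into one code point, but it leaves a Zero-Width Joiner sequence untouched. The variable names below are just for the example.

```javascript
// NFC composes "e" + combining acute accent into the precomposed "é".
const decomposed = "e\u0301";
const composed = decomposed.normalize("NFC");
console.log([...decomposed].length); // 2
console.log([...composed].length);   // 1
console.log(composed === "\u00e9"); // true

// Man + ZWJ + woman + ZWJ + girl: typically rendered as one family emoji,
// but normalization does not reduce the code point count.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
const normalizedFamily = family.normalize("NFC");
console.log(family.length);                // 8 code units
console.log([...normalizedFamily].length); // 5 code points
```
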

I think more than length or surrogateLength, there should be a utf8ByteLength, which reports the amount of space the string will take up in its most common modern representation (without having to actually reserve the space to do the calculation locally). This may be somewhere in Intl with all the other functions around Unicode string normalization - I don’t actually touch this space in depth that often (mostly owing to its aforementioned hairiness).
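For what it’s worth, a byte count like this can be sketched in userland without building the encoded buffer, by walking code points and summing the UTF-8 width of each. The function name utf8ByteLength is the one suggested above and is not an existing API; counting an unpaired surrogate as 3 bytes is an assumption that mirrors encoders which substitute U+FFFD.

```javascript
// Hypothetical helper: `utf8ByteLength` is the name suggested in this post.
function utf8ByteLength(str) {
  let bytes = 0;
  // for...of iterates by code point, so a surrogate pair arrives as one item.
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (cp < 0x80) bytes += 1;         // ASCII
    else if (cp < 0x800) bytes += 2;   // e.g. Latin letters with accents
    else if (cp < 0x10000) bytes += 3; // rest of the BMP (and lone surrogates)
    else bytes += 4;                   // astral code points, e.g. most emoji
  }
  return bytes;
}

console.log(utf8ByteLength("Hello 😃😃")); // 14
```

Where allocating the buffer is acceptable, new TextEncoder().encode(str).length should report the same number.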