Whitespace is a difficult topic for many reasons, not least of which is that it’s hard to see what others are talking about because, well, there’s nothing there to see.
One pain point I consistently run into is the inconsistency between what HTML calls a “space character” and what JavaScript matches with its regex \s.
In HTML, classes are separated by “space characters”, which are explicitly:
- tab (U+0009)
- line feed (U+000A)
- form feed (U+000C)
- carriage return (U+000D)
- space (U+0020)
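As a minimal sketch of what that definition looks like in code, the five characters above can be written as an explicit character class. The names HTML_SPACE and splitClasses are made up for illustration and don't come from any library.

```javascript
// HTML "space characters": tab, line feed, form feed, carriage return, space.
// Written out explicitly rather than using \s.
const HTML_SPACE = /[\t\n\f\r ]+/;

// Hypothetical helper: split a class attribute value into class names
// using only HTML's five space characters as separators.
function splitClasses(value) {
  // filter(Boolean) drops the empty strings produced by leading or
  // trailing separators.
  return value.split(HTML_SPACE).filter(Boolean);
}

console.log(splitClasses("a\tb\n c"));
```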
In JavaScript, \s matches*:
- character tab (U+0009)
- line feed (U+000A)
- line tab (U+000B)
- form feed (U+000C)
- carriage return (U+000D)
- space (U+0020)
- no-break space (U+00A0)
- ogham space mark (U+1680)
- mongolian vowel separator (U+180E)
- en quad (U+2000)
- em quad (U+2001)
- en space (U+2002)
- em space (U+2003)
- three-per-em space (U+2004)
- four-per-em space (U+2005)
- six-per-em space (U+2006)
- figure space (U+2007)
- punctuation space (U+2008)
- thin space (U+2009)
- hair space (U+200A)
- line separator (U+2028)
- paragraph separator (U+2029)
- narrow no-break space (U+202F)
- medium mathematical space (U+205F)
- ideographic space (U+3000)
- zero width no-break space (U+FEFF)
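* I’ve probably made a mistake with this list; it’s hard to find a definitive source that’s clear. One known wrinkle: U+180E was reclassified out of the space separator category in Unicode 6.3, so newer engines may not match it.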
And so, because HTML’s definition of “space characters” doesn’t match what JavaScript’s \s treats as whitespace, there are lots and lots of bugs in code that handles anything that should be separated by space characters.
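To see the mismatch for yourself rather than trusting my list, here is a rough sketch that asks the engine directly which characters \s matches that HTML does not count as space characters. The function name extraWhitespace is hypothetical, and the exact result depends on the Unicode version your engine ships with.

```javascript
// The five characters HTML treats as "space characters".
const HTML_SPACE_CHARS = "\t\n\f\r ";

// Scan the Basic Multilingual Plane for characters that \s matches
// but that are not HTML space characters.
function extraWhitespace() {
  const extras = [];
  for (let cp = 0; cp <= 0xffff; cp++) {
    const ch = String.fromCharCode(cp);
    if (/\s/.test(ch) && !HTML_SPACE_CHARS.includes(ch)) {
      // Format as U+XXXX to match the lists above.
      extras.push("U+" + cp.toString(16).toUpperCase().padStart(4, "0"));
    }
  }
  return extras;
}

console.log(extraWhitespace().join(", "));
```

Every entry this prints is a character that a \s-based splitter treats as a separator even though HTML would treat it as part of a class name.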
I could try to play whack-a-mole, as I did when I reported the bug in jQuery’s hasClass implementation, but these sorts of issues crop up all over the place, such as in MDN’s suggested shim for the Element.classList API (oops).
Playing whack-a-mole can be fun for a while, but it quickly becomes repetitive and tiresome. In all honesty, I don’t think anyone will ever naturally run into the bugs with the classList shim’s implementation because I don’t think any developer in their right mind would attempt to separate classes with characters other than:
- space
- tab
- line feed
- carriage return
which is also why jQuery could be so popular for so long while having such a simple bug in what is arguably one of the most-used features of the library (it affected addClass and removeClass in addition to hasClass).
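To make this class of bug concrete, here is an illustrative sketch of a class check that leans on \s, next to one that sticks to HTML’s five space characters. Neither function is jQuery’s or MDN’s actual code; naiveHasClass and htmlHasClass are made-up names for the comparison.

```javascript
// Naive version: assumes \s is the same thing as HTML's space characters.
function naiveHasClass(className, name) {
  return className.split(/\s+/).filter(Boolean).includes(name);
}

// Version that only splits on HTML's space characters.
function htmlHasClass(className, name) {
  return className.split(/[\t\n\f\r ]+/).filter(Boolean).includes(name);
}

// Per HTML this attribute value is ONE class name, because U+00A0
// (no-break space) is not an HTML space character.
const attr = "foo\u00A0bar";

console.log(naiveHasClass(attr, "foo"));          // true  (disagrees with HTML)
console.log(htmlHasClass(attr, "foo"));           // false
console.log(htmlHasClass(attr, "foo\u00A0bar"));  // true
```

With ordinary spaces, tabs, and newlines the two versions agree, which is why this kind of bug can go unnoticed for so long.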
So, for the sake of simplifying things, can we redefine HTML’s space characters to be equivalent to whatever \s matches in JavaScript? It would simplify things for developers to be able to use \s, and as far as compatibility is concerned it would implicitly “fix” what are currently buggy implementations that naively assumed \s was an acceptable match for HTML’s “space characters”.
As for the biggest concern, that the change would break backwards compatibility for existing sites, I don’t have the data to support the change either way. If I had to guess, there simply aren’t sites that use Unicode whitespace characters as part of class names because it’d be silly, and any that do are probably doing so as a sort of tech demo.
tl;dr:
“space characters” in HTML should be redefined to be whatever JavaScript uses for its \s regex pattern to simplify implementation.