Whitespace is hard and buggy. Can we normalize it?

zzzzbov · 2016-04-11

Whitespace is a difficult topic for many reasons, not least of which is that it’s hard to see what others are talking about because, well, there’s nothing there to see.

One pain point I consistently run into is the inconsistency between what HTML calls a “space character” and what JavaScript matches with its regex \s.

In HTML, classes are separated by “space characters”, which are explicitly:

tab (U+0009)
line feed (U+000A)
form feed (U+000C)
carriage return (U+000D)
space (U+0020)

In JavaScript \s matches*:

character tab (U+0009)
line feed (U+000A)
line tab (U+000B)
form feed (U+000C)
carriage return (U+000D)
space (U+0020)
no-break space (U+00A0)
ogham space mark (U+1680)
mongolian vowel separator (U+180E)
en quad (U+2000)
em quad (U+2001)
en space (U+2002)
em space (U+2003)
three-per-em space (U+2004)
four-per-em space (U+2005)
six-per-em space (U+2006)
figure space (U+2007)
punctuation space (U+2008)
thin space (U+2009)
hair space (U+200A)
line separator (U+2028)
paragraph separator (U+2029)
narrow no-break space (U+202F)
medium mathematical space (U+205F)
ideographic space (U+3000)
zero width no-break space (U+FEFF)

* I’ve probably made a mistake with this list; it’s hard to find a definitive source that’s clear.

And so, because HTML’s definition of “space characters” doesn’t match JavaScript’s implementation of what are essentially “space characters”, there are lots and lots of bugs implementing anything that should be separated by space characters.
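To make the mismatch concrete, here is a minimal sketch that asks the running engine which characters \s matches beyond HTML's five space characters. The names HTML_SPACE and extraWhitespace are made up for this illustration, and the exact list depends on the Unicode version the engine ships with.

```javascript
// HTML's five "space characters", as listed above.
const HTML_SPACE = /[\t\n\f\r ]/;

// Collect every BMP code point that \s matches but HTML does not treat
// as a space character. (Hypothetical helper, not from any library.)
function extraWhitespace() {
  const extras = [];
  for (let cp = 0; cp <= 0xffff; cp++) {
    const ch = String.fromCharCode(cp);
    if (/\s/.test(ch) && !HTML_SPACE.test(ch)) {
      extras.push("U+" + cp.toString(16).toUpperCase().padStart(4, "0"));
    }
  }
  return extras;
}

const extras = extraWhitespace();
console.log(extras.join(" "));

// A class attribute value containing a no-break space:
// HTML sees one class token, a split on \s sees two.
const value = "foo\u00A0bar";
const naiveTokens = value.split(/\s+/);
const htmlTokens = value.split(/[\t\n\f\r ]+/);
console.log(naiveTokens.length, htmlTokens.length);
```

Anything that tokenizes a class attribute with \s will disagree with the browser on every character this prints.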

I could try to play whack-a-mole, such as when I reported the bug in jQuery’s hasClass implementation, but these sorts of issues crop up all over the place, such as on MDN’s suggested shim for the Element.classList api (oops).

Playing whack-a-mole can be fun for a while, but it quickly becomes repetitive and tiresome. In all honesty, I don’t think anyone will ever naturally run into the bugs with the classList shim’s implementation because I don’t think any developer in their right mind would attempt to separate classes with characters other than:

space
tab
line feed
carriage return

which is also why jQuery could be so popular for so long while having such a simple bug in what is arguably one of the most-used features of the library (it affected addClass and removeClass in addition to hasClass).
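As an illustration of this class of bug (this is not jQuery's actual source, just a sketch of the pattern), compare a class check that normalizes with \s against one that only splits on HTML's space characters. Both function names are hypothetical.

```javascript
// Naive check: collapses every \s match to a plain space before searching,
// so characters such as U+00A0 are wrongly treated as class separators.
function hasClassNaive(className, name) {
  const padded = " " + className.replace(/\s+/g, " ") + " ";
  return padded.indexOf(" " + name + " ") !== -1;
}

// HTML-style check: only tab, LF, FF, CR and space separate class names.
function hasClassHtml(className, name) {
  return className
    .split(/[\t\n\f\r ]+/)
    .filter(Boolean) // drop empty strings from leading/trailing separators
    .includes(name);
}

// "foo\u00A0bar" is a single class name according to HTML.
const attr = "foo\u00A0bar";
console.log(hasClassNaive(attr, "foo")); // true  (disagrees with HTML)
console.log(hasClassHtml(attr, "foo"));  // false (matches HTML)
```

For the four separators people actually use, the two functions agree, which is why the difference rarely surfaces in practice.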

So, for the sake of simplifying things, can we redefine HTML’s space characters to be equivalent to whatever \s matches in JavaScript? It would simplify things for developers to be able to use \s, and as far as compatibility is concerned it would implicitly “fix” what are currently buggy implementations that naively assumed \s was an acceptable match for HTML’s “space characters”.

As for the biggest concern, that a change would break backwards compatibility for existing sites, I don’t have the data to argue either way. If I had to guess, there simply aren’t sites using Unicode whitespace characters as part of class names, because it’d be silly, and any that do are probably doing so as a sort of tech demo.

tl;dr:
“space characters” in HTML should be redefined to be whatever JavaScript uses for its \s regex pattern to simplify implementation.

tabatkins · 2016-04-12

HTML’s “whitespace” is simply the 5 ASCII whitespace characters. CSS matches it exactly, too.

JS instead tracks the Unicode definition of whitespace, and theoretically updates its definition as Unicode releases new versions. It made a different choice, but it’s not necessarily a better one - some of the whitespace chars in Unicode are very non-obvious, particularly when everything else is in ASCII. There’s a discussion going on right now in JS about whether to lock down the definition of whitespace to a specific set - if that goes thru, then JS’s definition of whitespace won’t match the \s regex class either.

zcorpan · 2016-04-14

I don’t think HTML can reasonably change what its space characters are. They are used all over the place, not just in class="". How about adding a regexp flag or something to JS that redefines \s to be HTML’s space characters?

Crissov · 2016-04-15

JavaScript’s \s seems similar to \p{White_Space} whereas PCRE’s \s is equal to POSIX [:space:], i.e. [\t\v\f \n\r] – except for vertical tab \v U+000B, this corresponds to the definition used in HTML and CSS. The actual definition from ECMA-262:2015, however, reads:

The production CharacterClassEscape :: s evaluates by returning the set of characters containing the characters that are on the right-hand side of the WhiteSpace (11.2) or LineTerminator (11.3) productions.

That’s TAB, VT, FF, SP, NBSP, ZWNBSP, USP and LF, CR, LS, PS, which result in [:space:\xA0\xFEFF\p{GC=Zs}\x2028\x2029], because WhiteSpace actually excludes all other characters that have the White_Space property but are not classified in category Zs “Separator, space”.
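The reading above can be checked against a running engine. This is a sketch that assumes Unicode property escapes (\p{…} with the u flag, ES2018 or later) are available. SPEC_WHITESPACE and disagreements are names invented for the example.

```javascript
// WhiteSpace (TAB, VT, FF, ZWNBSP, any Zs) plus LineTerminator (LF, CR, LS, PS),
// written out explicitly. SP and NBSP are covered by \p{Zs}.
const SPEC_WHITESPACE = /[\t\v\f\uFEFF\p{Zs}\n\r\u2028\u2029]/u;

// List every code point where \s and the explicit class disagree.
function disagreements() {
  const out = [];
  for (let cp = 0; cp <= 0x10ffff; cp++) {
    const ch = String.fromCodePoint(cp);
    if (/\s/u.test(ch) !== SPEC_WHITESPACE.test(ch)) out.push(cp);
  }
  return out;
}

const mismatches = disagreements();
console.log(mismatches.length);

// \s is not the same as the White_Space property:
// U+0085 (next line) has White_Space but is not in Zs,
// while U+FEFF is matched by \s without having White_Space.
const nel = "\u0085";
const bom = "\uFEFF";
console.log(/\s/.test(nel), /\p{White_Space}/u.test(nel));
console.log(/\s/.test(bom), /\p{White_Space}/u.test(bom));
```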

I believe it would be better and more helpful if JS provided a shorthand character class that either exactly matched the HTML/CSS definition of whitespace or the PCRE one. I’m not sure which letters would still be available for that purpose, if any. I think it has been proposed before (and dismissed, for some reason) to include VT in W3C’s whitespace definition.

zzzzbov · 2016-04-15

As a note:

I’m not against redefining \s to match HTML better, or redefining \s a bit and redefining HTML’s “space characters” a bit (such as to include \v), but the problem I see is that we’ve got this situation of 14 competing standards, and I’m pretty certain some of them are good enough to rally around. I’d like to avoid creating an entirely new standard, but even just discussing what standards exist and what benefits they each offer would be advantageous.

My actual expectations are that nothing will change with the standard excuse of “we might break something so we can’t risk fixing it!”.

chaoaretasty · 2016-04-15

A new character class makes sense here; having \s work differently in different browsers would be a nightmare. Rather than coming up with something completely our own, I’d suggest using the vim class \_s from the link you gave.

While looking at the possibility of adding to the Regex classes I’d be up for including \h and \v for horizontal and vertical whitespace too.

The main downside of changing Regex like this is compatibility and testing, authors need a way of knowing if they can use these functions. The easiest way, and allowing for future Regex changes if needed, would be to add Regex.supports which could be passed a single token and return true or false based on if it’s a token with any given meaning.
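Nothing like Regex.supports exists today, but a rough approximation is possible by trying to compile a pattern and catching the SyntaxError. The helper name supportsRegExp below is hypothetical, and the sketch has a real limitation: without the u flag, an unknown escape such as \h is silently accepted as a literal "h", so only flags and u-mode syntax can be detected this way.

```javascript
// Hypothetical helper: returns true if the engine can compile the pattern
// with the given flags, false if it throws a SyntaxError.
function supportsRegExp(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected, so do not hide it
  }
}

// Unknown flags are rejected, so a future flag could be detected this way.
console.log(supportsRegExp("\\s", "a"));

// With the u flag, unknown escapes are syntax errors and can be detected.
console.log(supportsRegExp("\\h", "u"));

// Without the u flag, \h compiles as a literal "h": a false positive
// as far as feature detection is concerned.
console.log(supportsRegExp("\\h"));
```

That false positive is one reason a dedicated detection API, or a new escape that is only valid in u mode, would be easier for authors to rely on.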

r12a · 2016-04-18

It seems to me also that the needed definition of whitespace depends on the context. For example, it may make sense to treat Mongolian vowel separator as \s for searching or other regex functions (i’m not sure), but i doubt it would make sense to do so for class name separation, since this character is not a word separator (see http://r12a.github.io/scripts/mongolian/block#char180E).

zzzzbov · 2016-04-18

I’ve wondered about no-break space (U+00A0) which is a space so it makes sense to be matched as a space character (and is matched by \s) but is explicitly not a separator, so it would make sense to not use it for separating values in class attributes.

That said, at some point we need to move from the theory discussion into a practical discussion. I don’t have the appropriate data to make any significant argument, but I’d like to know how frequently space characters other than [ \t\r\n] are used in class attributes, or anywhere else that HTML’s “space characters” apply. If they’re never used, then changing the spec becomes a “simple” matter of implementation.

I imagine the number of webpages that rely on such behavior are limited, but I defer to the data.
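One way such data could be collected, sketched under assumptions: given the raw class attribute values from a page or a crawl, count the values that contain a character \s matches but HTML does not treat as a space. The function name auditClassValues and the shape of its result are made up for this example.

```javascript
// Matches characters that are whitespace to \s but are not one of HTML's
// five space characters: "not (non-whitespace or HTML space)".
const JS_ONLY_SPACE = /[^\S\t\n\f\r ]/g;

// Hypothetical helper: summarize how often such characters occur.
function auditClassValues(values) {
  let affected = 0;
  const byCodePoint = new Map();
  for (const value of values) {
    const hits = value.match(JS_ONLY_SPACE);
    if (!hits) continue;
    affected++;
    for (const ch of hits) {
      const key =
        "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
      byCodePoint.set(key, (byCodePoint.get(key) || 0) + 1);
    }
  }
  return { total: values.length, affected, byCodePoint };
}

// In a browser the input could come from something like:
//   Array.from(document.querySelectorAll("[class]"),
//              (el) => el.getAttribute("class"))
const report = auditClassValues(["a b", "a\u00A0b", "x\u000By\u00A0z"]);
console.log(report.total, report.affected, [...report.byCodePoint]);
```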

zzzzbov · 2016-04-22

I’ve just reported yet another jQuery bug regarding whitespace handling, and it’s only reaffirming my belief that the current definition of “space characters” doesn’t match practical usage.

I firmly believe changing the definition so that the theoretical definition (/[\t\n\f\r ]/) matches the practical definition (/\s/) of “space characters” will have negligible negative impact, and will significantly improve the web by reducing bytes in popular libraries which are downloaded millions of times a day.

tabatkins · 2016-04-22

The other way around is much more likely to be palatable - as others have said, HTML and CSS both agree on whitespace characters, and there are literally trillions of HTML documents in existence; changing something as fundamental as whitespace is likely to have consequences.

zzzzbov · 2016-04-22

@tabatkins So how can we gather data to determine which way is less impactful? More importantly, where do I go from here? I can identify the issue, but I don’t know what information would be necessary to convince people that any change is worthwhile.

tabatkins · 2016-04-22

If you’re wanting to go the “change regex to match what html/css do” route, you’re gonna want to email es-discuss@mozilla.org. Cite the bugs that people keep having with whitespace, see if it gets you anywhere.

If you want to go the “get HTML/CSS to change to match regex” route, best is to open a bug on HTML and start the conversation there. Again, cite the bugs people are having with whitespace. CSS should be able to follow HTML pretty easily.

I’m not confident you’ll be able to get either change thru, but I think the first has higher chances than the second. An alternative approach might be to suggest the addition of a new regex escape that only captures ASCII whitespace, or a flag that reinterprets \s and \S to only care about ASCII whitespace.
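Until something like that exists, the reinterpretation can be approximated in userland by rewriting the pattern source before compiling it. This is only a sketch: htmlSpaceRegExp is a hypothetical helper, it does not handle \S inside a character class, and it assumes classic (non-v-flag) character class syntax.

```javascript
// HTML's space characters, written as regex source text.
const HTML_SPACE_SOURCE = "\\t\\n\\f\\r ";

// Hypothetical helper: compile `source` with every \s and \S rewritten
// to mean "HTML space character" and "not an HTML space character".
function htmlSpaceRegExp(source, flags = "") {
  let out = "";
  let inClass = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\" && i + 1 < source.length) {
      const next = source[++i];
      if (next === "s") {
        out += inClass ? HTML_SPACE_SOURCE : "[" + HTML_SPACE_SOURCE + "]";
      } else if (next === "S") {
        if (inClass) {
          throw new SyntaxError("\\S inside a character class is not supported by this sketch");
        }
        out += "[^" + HTML_SPACE_SOURCE + "]";
      } else {
        out += ch + next; // every other escape is copied through untouched
      }
    } else {
      if (ch === "[" && !inClass) inClass = true;
      else if (ch === "]" && inClass) inClass = false;
      out += ch;
    }
  }
  return new RegExp(out, flags);
}

// Tokens are now split the way HTML would split a class attribute.
const tokens = "a\u00A0b c".match(htmlSpaceRegExp("\\S+", "g"));
console.log(tokens.length);
```

A built-in escape or flag would avoid this kind of source rewriting, which is easy to get subtly wrong.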

MT · 2016-04-22

+1.

zzzzbov · 2016-04-23

It’s a good thing I’m stubborn (or is it?).

That sounds pretty reasonable. Probably would want to suggest two flags given that \u00A0 is in the extended ASCII character set. I’d be fine with something like /\s/a and /\s/x for ASCII and extended ASCII matching respectively. Of course, then there’s the question of what happens when you have something like /\s/axu…

tabatkins · 2016-04-24

> It’s a good thing I’m stubborn (or is it?).

Yeah, nothing wrong with being stubborn, it’s needed to get difficult things thru. Just setting up expectations appropriately, that I believe this is a low-success endeavor. Doesn’t mean it’s not worth trying if you’ve got the time.

> Probably would want to suggest two flags given that \u00A0 is in the extended ASCII character set.

I’d stay away from anything about U+A0. It is not a space character in HTML or CSS, so your primary argument (convergence between the web-platform languages) doesn’t apply at all. Arguing for a third category solely because you think it might be useful to recognize an additional character as a space will distract from and weaken your overall position.

> /\s/x

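Note that x is already a regex flag in several other regex flavors (Perl and PCRE, for example), where it makes literal whitespace be ignored in the pattern so you can space things out for readability. JS doesn’t have it, but reusing the letter for something else would be confusing.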

> /\s/axu

The u flag just turns on unicode-aware ranges and escapes. It wouldn’t have any interaction with your “a” flag. I recommend, again, not even proposing a second flag.

chaoaretasty · 2016-04-27

The other problem I’m seeing with the flag approach versus a new character class is that it wouldn’t be possible to match against both HTML whitespace and Unicode whitespace in the same pattern (not entirely sure what situations you’d want this in, maybe attempting to invoke Zalgo by parsing HTML with it, but still…)