Unicode Named Character Entities

stuartpb · 2015-02-25

Continuing the discussion from Add new character entity references like &power; - to match new Unicode specs:

[A] mechanism to enter arbitrary codepoints by their name might be a better idea. Like:
&"OHM SIGN";
&"ZERO WIDTH JOINER";
You get the idea. Final notation is of course debatable.

Rather than adding yet more named entities (with arbitrary names) to HTML, I think this would be a great forward-facing way to allow for semantic referencing of Unicode characters using plain ASCII source - just drop all the Unicode character names into the routine for looking up a named entity (consulted only after checking the “legacy” entity names, so existing references keep their meaning). As an alternative to quotes in the entity name, it might be permissible to allow underscores in place of spaces in the identifier, per section 4.8 of the Unicode standard, e.g. &ohm_sign;.
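As a rough sketch of that lookup order (legacy names first, Unicode character names second), here is how it might look in Python. The function name `resolve_entity` and the rule of treating underscores as spaces are illustrative assumptions, not part of any spec; `html.entities.html5` and `unicodedata.lookup` are standard-library facilities.

```python
import unicodedata
from html.entities import html5  # stdlib table: "amp;" -> "&", etc.
from typing import Optional


def resolve_entity(name: str) -> Optional[str]:
    """Resolve the text between '&' and ';' to a character.

    Hypothetical lookup order for the proposal:
    1. the existing ("legacy") HTML named entities,
    2. Unicode character names, with underscores read as spaces.
    Returns None when neither table knows the name.
    """
    # Step 1: legacy names win, so existing documents keep their meaning.
    legacy = html5.get(name + ";")
    if legacy is not None:
        return legacy

    # Step 2 (assumed convention): ohm_sign -> "OHM SIGN".
    unicode_name = name.replace("_", " ").upper()
    try:
        return unicodedata.lookup(unicode_name)
    except KeyError:
        return None


print(repr(resolve_entity("amp")))
print(repr(resolve_entity("ohm_sign")))
print(repr(resolve_entity("zero_width_joiner")))
```

Note that the underscore form cannot tell a space from an underscore-free hyphenated name, and a Unicode name that collides with a legacy entity name would always resolve to the legacy character under this ordering.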

simevidas · 2015-03-03

Just to clarify, this type of semantic referencing only benefits the author of the source text, and the users of the web page do not encounter these references, correct?

stuartpb · 2015-03-03

Correct, users only encounter these references if they’re viewing the source.

simevidas · 2015-03-03

In that case, it maybe makes more sense to create tools for this, e.g.

an editor autocomplete which maps Unicode names to the corresponding characters, e.g. when I start typing “ZERO WIDTH…”, a drop-down menu suggests that character,
a build tool (preprocessor) which converts the &"NAME"; syntax into the corresponding character.

Even if, say, Chrome and Firefox implemented support for your proposed syntax, authors would still have to provide fallbacks for other browsers, and in this case, a preprocessor would be a good option (a better option than a JS library which dynamically replaces this syntax on a live page). So why not just leave it at a preprocessor?
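For illustration, a preprocessor of the kind described above could be as small as the following Python sketch. The name `expand_named_refs` and the exact pattern are assumptions, since the final notation is still up for debate; unknown names are left untouched rather than treated as errors.

```python
import re
import unicodedata

# Matches the proposed &"NAME"; form. Unicode character names use only
# letters, digits, spaces and hyphens, so the pattern is limited to those.
NAMED_REF = re.compile(r'&"([A-Za-z0-9 \-]+)";')


def expand_named_refs(source: str) -> str:
    """Replace each &"UNICODE NAME"; reference with the character itself."""

    def replace(match: re.Match) -> str:
        try:
            return unicodedata.lookup(match.group(1))
        except KeyError:
            # Not a known character name: keep the original text as-is.
            return match.group(0)

    return NAMED_REF.sub(replace, source)


html_source = '<p>R = 5 &"OHM SIGN"; and &"NO SUCH NAME";</p>'
print(expand_named_refs(html_source))
```

Run as a build step, this produces plain HTML containing the literal characters, so no browser support is needed.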

stuartpb · 2015-07-31

Because maybe it’s not a good option. Maybe preprocessing is prohibitive. Maybe the authoring environment is a low-end/IoT device. (Maybe people in the developing world without access to MacBook Pros would like to make apps too!) Maybe they’re publishing their content with a direct PUT to a CouchDB server. Maybe they don’t otherwise need a preprocessor because they just don’t have a problem with users not being able to see the character if their UA doesn’t support it (i.e. they’re bundling the content with something like nw.js or Crosswalk).

I don’t know about this specific feature - I hear what you’re saying about it introducing a lot of backward incompatibility for something that’s not much work to do without (and, realistically, if you’re looking up the character’s Unicode name, you’re probably going to see its codepoint too) - but as I’ve been saying for the last couple of days, we can’t just throw everything onto power-tool stacks we’ve built ourselves outside the browser. That’s how we end up building slow, rickety, inconsistent, newbie-hostile dystopias that gradually collapse under the weight of their own self-imposed constraints.