Named Emoji Entities (or Short Names)

Crissov · 2016-08-03

There has been some discussion previously about allowing generic access to Unicode character names from entity references.

Emojis are popular, and most of them seem to have a single unambiguous canonical “short name”, which is often different from their Unicode name. The syntax convention, also supported here in Discourse, is different, though: instead of a mandatory ampersand prefix and a sometimes optional semicolon suffix as in SGML / XML / HTML &foo;, short names are written with colons on both sides, :foo:.

I wonder whether browsers should support the colon syntax out of the box, in which case it would have to be added to HTML, or whether it would be better to use the “short names” with the established …ML syntax. So I merged two JSON files, expecting a lot of clashes:

There were only a few.
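
The merge itself is not shown in the post. A minimal sketch of such a clash check, assuming both files boil down to simple name-to-character maps; the names `NameTable`, `findClashes` and the tiny sample tables are made up for illustration and are not taken from the actual JSON files:

```typescript
// Both tables map a name to the character(s) it stands for.
type NameTable = Record<string, string>;

interface Clash {
  name: string;
  entity: string; // expansion of &name;
  emoji: string;  // expansion of :name:
}

// Ignore U+FE0F (VS-16) so that "same character, emoji presentation"
// is not reported as a real clash.
function stripVs16(s: string): string {
  return s.replace(/\uFE0F/g, "");
}

function findClashes(entities: NameTable, shortNames: NameTable): Clash[] {
  const clashes: Clash[] = [];
  for (const name of Object.keys(shortNames)) {
    if (!Object.prototype.hasOwnProperty.call(entities, name)) continue;
    const entity = entities[name];
    const emoji = shortNames[name];
    if (stripVs16(entity) !== stripVs16(emoji)) {
      clashes.push({ name, entity, emoji });
    }
  }
  return clashes;
}

// Hand-picked excerpts for illustration only.
const sampleEntities: NameTable = { hearts: "\u2665", smile: "\u2323", amp: "&" };
const sampleShortNames: NameTable = {
  hearts: "\u2665\uFE0F",
  smile: "\uD83D\uDE04", // U+1F604
  tada: "\uD83C\uDF89",  // U+1F389
};

console.log(findClashes(sampleEntities, sampleShortNames).map((c) => c.name));
```

With these sample tables only `smile` is reported: `hearts` differs from the entity by VS-16 alone, and `tada` has no entity of the same name.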

Clashes

Card suits &clubs; (U+2663), &hearts; (U+2665) and &spades; (U+2660) are actually equivalent except for an explicit VS-16 (U+FE0F). There are also variants &clubsuit;, &heartsuit; and &spadesuit;, and, in case anyone wonders, the fourth card suit is currently available as &diams; and &diamondsuit; or as :diamonds:.
The same applies to the telephone symbol &phone; (U+260E), except that it also has a preferred alias short name, and thus &telephone; can be used unambiguously for the emoji character.
Bank note emojis, on the other hand, have separate code points from the currency symbols. Since the fallback is good, the emoji short codes could be altered, e.g. to use an uppercase initial letter or a verbose suffix like emoji or symbol.
- &yen; U+00A5 ¥ vs. emoji U+1F4B4
- &dollar; U+0024 $ vs. emoji U+1F4B5
- &euro; U+20AC € vs. emoji U+1F4B6
- &pound; U+00A3 £ vs. emoji U+1F4B7
A single two-letter short name for a flag or regional indicator sequence clashes with an existing named entity: &it; for Italy (U+1F1EE+1F1F9) is already occupied by MathML’s Invisible Times (U+2062). However, there already is the systematic alias &flag_it;, and one could decide not to support at all the two-letter flag names for those flags that had existed in Japanese telcos’ implementations prior to Unicode standardization (as Discourse does as well, for instance), e.g. &us; and &de;.

There are four clashes left, three of which have borderline acceptable fallback from emoji to HTML, but one is completely off. They would probably need a new designator that should also be back-ported to short-names.

&ring; U+02DA ˚ vs. U+1F48D
&smile; U+2323 ⌣ vs. U+1F604
&top; U+22A4 ⊤ vs. U+1F51D
&dash; U+2010 ‐ vs. U+1F4A8

PS: A couple of “short names” result in entity references that would perhaps be malformed according to SGML and XML rules (not sure about the underscore _), but that probably doesn’t concern WHATWG HTML:

&-1;: U+1F44E
&+1;: U+1F44D
&100;: U+1F4AF
&1234;: U+1F522
&8ball;: U+1F3B1
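
Whether a short name would be well-formed as an XML name can be checked mechanically. A rough sketch, assuming an ASCII-only approximation of the XML 1.0 Name production (the real production also admits many non-ASCII characters, and `isAsciiXmlName` is a made-up helper). Under that rule the underscore is allowed, while names that start with a digit or a hyphen, or that contain a plus sign, are not:

```typescript
// ASCII subset of XML 1.0 Name: first char is a letter, "_" or ":";
// later chars may also be digits, "-" or ".".
const asciiXmlNamePattern = /^[A-Za-z_:][A-Za-z0-9_:.\-]*$/;

function isAsciiXmlName(name: string): boolean {
  return asciiXmlNamePattern.test(name);
}

const candidateNames = ["-1", "+1", "100", "1234", "8ball", "flag_it", "smile"];
const malformedNames = candidateNames.filter((n) => !isAsciiXmlName(n));

console.log(malformedNames); // ["-1", "+1", "100", "1234", "8ball"]
```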

PPS: I also wonder whether variation selectors (at least U+FE0x) and Fitzpatrick scale emoji skin tone modifiers should be available as named entities, e.g. &emoji; = U+FE0F, so that &phone;&emoji; would give the same result as :telephone:.
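
To make the idea concrete, here is a small sketch of how such an entity would compose. Note that `emoji` is only the name proposed in this post, not an existing HTML entity, and `expandNamedEntities` is a made-up helper rather than anything a browser exposes:

```typescript
const proposedEntities: Record<string, string> = {
  phone: "\u260E", // existing HTML entity &phone;
  emoji: "\uFE0F", // hypothetical entity for VS-16, as proposed above
};

// Replace &name; references that appear in the table; leave others alone.
function expandNamedEntities(text: string, table: Record<string, string>): string {
  return text.replace(/&([A-Za-z0-9_]+);/g, (match: string, name: string) =>
    Object.prototype.hasOwnProperty.call(table, name) ? table[name] : match
  );
}

const telephoneShortName = "\u260E\uFE0F"; // what :telephone: stands for
const viaEntities = expandNamedEntities("&phone;&emoji;", proposedEntities);

console.log(viaEntities === telephoneShortName); // true
```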

chaoaretasty · 2016-08-05

I looked back at the past threads. They make good points on why this feature isn’t needed, and I don’t see anything as having changed since.

There already exist two perfectly good methods to insert any given Unicode character (directly as a character or using the escaped character code). This will add a whole browser generation of incompatibility, and the only gain will be saving a lookup or a copy and paste for the handful of developers who hand-code HTML that needs these characters and who have also memorized the specific emoji names. The cost/benefit ratio is completely skewed for this feature.

Crissov · 2016-08-12

I don’t really disagree, but this seems just like an argument against specifying &smile;, not against :smile:.

I also think that invisible characters like variation selectors are a different thing, so &emoji; for VS-16 could still make sense.

chaoaretasty · 2016-08-12

As an argument against specifying :smile:, there’s the fact that you’d be specifying an escape sequence that never previously existed but, as you yourself have said, has been used before, so you’re going to break a lot of pages by parsing text that was meant to just be text. As a simple example, pages that just explain the allowed emoji on a forum will go from:

Use :smile: to show a 😄

To

Use 😄 to show a 😄

I think this is something that should be handled on a site-by-site basis.

I can, however, see the benefit of providing some improvements for developers, but I think directly in HTML is the wrong place. Perhaps instead there could be a built-in JavaScript method that allows Unicode lookup on short names, like unicode.lookup('smile'), which can be used to easily build custom parser functions. Such parser functions could be built in too, allowing a developer to call unicode.parse(inputtext), which would use the :smile: syntax. There could also be an overload to parse or look up only certain ranges rather than arbitrary characters, like unicode.parse(inputtext, 'emoji').
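
Neither function exists today, so the following is only a userland sketch of what such an API might do. The table `shortNameTable` and the functions `lookupShortName` and `parseShortNames` are hypothetical stand-ins for the suggested unicode.lookup and unicode.parse, and the range overload is left out:

```typescript
// Tiny illustrative table; a real one would come from a short-name registry.
const shortNameTable: Record<string, string> = {
  smile: "😄", // U+1F604
  "+1": "👍",  // U+1F44D
};

// Stand-in for the suggested unicode.lookup(name).
function lookupShortName(name: string): string | undefined {
  return Object.prototype.hasOwnProperty.call(shortNameTable, name)
    ? shortNameTable[name]
    : undefined;
}

// Stand-in for the suggested unicode.parse(text): replaces known :name:
// sequences and leaves unknown ones untouched.
function parseShortNames(input: string): string {
  return input.replace(/:([A-Za-z0-9_+\-]+):/g, (match: string, name: string) => {
    const ch = lookupShortName(name);
    return ch === undefined ? match : ch;
  });
}

console.log(parseShortNames("Use :smile: to show a 😄")); // "Use 😄 to show a 😄"
console.log(parseShortNames("Keep :unknown: as is"));     // unchanged
```

Because the call is explicit, a site decides which strings get parsed, and a help page that explains the syntax can simply skip the call.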

Crissov · 2016-08-12

No “colon entity” substitution in code, kbd and pre then.

chaoaretasty · 2016-08-12

That doesn’t solve the problem though. You are adding a whole class of character escaping that never previously existed. Content has been written without the need to consider this substitution, and a change like this will retroactively alter content across the web. If I want the character string :smile: just in my text, I would have to wrap it in tags for non-semantic reasons, as well as give these tags a class, as I will need to undo any styles I may have given them.

At the very least if something like this was going to be brought in it would have to be an opt-in feature. But again I really don’t see any argument in its favour that significantly impacts the cost/benefit.

Crissov · 2016-08-16

Does anyone actually put :smile: in their texts expecting anything other than 😄 (except in the “type :smile: to get 😄” case)? If deemed necessary, the permissible character classes next to the colons could be restricted.

chaoaretasty · 2016-08-16

I don’t know. I don’t have a representative data sample I can search through for :\w+: (and yes, you would need the general case, not just a comparison against a list of emoji short names, as the number of emoji is increasing), but that’s part of the problem with adding an entirely new escape syntax. Off the top of my head it could include any of the following:

- Any chat logs from before emoji were everywhere and parsed into icons by default (in which case the authors may well have meant a very similar thing but did not mean to actually have a picture).
- Typos
- All sorts of generated strings where colons are used as delimiters
- As above, but especially server-generated errors, where I really don’t want my exceptions to be full of emoji
- IPv6 addresses use colons, and both 1234 and abcd are valid current emoji short names and valid address segments, so abcd:1234:abcd:1234 could easily be typoed into an emoji (e.g. on Discourse, starting with a colon or doubling a colon in the middle will do it): :abcd:1234:abcd::1234: is rendered as 🔡1234:abcd:🔢
- The fact that any such parsing needs an escape mechanism, and that something better than doing this at the element level is needed
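
A quick way to see how easily ordinary strings match is to run the general :\w+: pattern over a few samples. This is only a sketch of the naive pattern, not of how Discourse or any other parser actually decides, and `findColonCandidates` is a made-up helper:

```typescript
// Collect every name that the naive pattern :\w+: would treat as a candidate.
function findColonCandidates(text: string): string[] {
  const pattern = /:(\w+):/g;
  const found: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    found.push(m[1]);
  }
  return found;
}

console.log(findColonCandidates(":abcd:1234:abcd::1234:")); // ["abcd", "abcd", "1234"]
console.log(findColonCandidates("abcd:1234:abcd:1234"));    // ["1234"]
console.log(findColonCandidates("2016-08-16 12:30:45"));    // ["30"]
```

Even the address without the typo and a plain timestamp produce candidates, which is why a list of known short names, or stricter boundary rules, would be needed on top of the pattern.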

Yes, some are more niche than others, but the point is that without too much effort it’s possible to list several situations where it could be an issue. You could start going through all the edge cases and adding rules, but then you’re going to have much more fragile parsing with a much more complicated set of rules. Arguing in favour of extending the &entity; syntax would be a far easier position.

You also still haven’t argued for the feature on a cost/benefit basis. The only benefits I can see are:

- Minor simplification for developers hand-coding HTML
- Might remove the need for developers to hook in an emoji parsing library server or client side (key word is might, see below)

As well as the issues above you also need to include the fact that:

- There will be a period of browser incompatibility on a fundamental feature of how HTML handles strings (so polyfills and parsing libraries will need to be in place for quite a while)
- Developers may well want to override handling of emoji (to give their own versions of the glyphs via fonts or images), so they are going to need a parsing library anyway
- It adds complexity to the spec
- Time and effort on the part of the vendors will be needed to far more fully spec out the specific handling of when and where to parse (see above)
- Speccing also needs to consider feature detection and whether it is opt-in, opt-out or always on (and if opt-in/out, then how to do this without JS)
- Time and effort to implement compared to any other browser priorities (the idea that “all new features start at -100”)
- Anyone who writes an HTML parser is going to need to add all this complexity too

So currently we are at very little benefit for a huge amount of cost, and you need to give good reasons why parsing libraries and preprocessors aren’t the appropriate place (especially as said libraries are improving and many comment systems need a parser pipeline for other things anyway).

If this is something that you feel should be added then you need to present strong compelling arguments for it to overcome those costs.

chaoaretasty · 2016-08-16

Doing this as a separate reply to split the issue. As mentioned previously, I do think improving Unicode lookup functions in JavaScript to include looking up via short names, and to limit the lookup to Unicode ranges, would be a good idea. They would complement String.fromCodePoint in a useful manner and, as low-level functions, could improve the base of any number of emoji libraries. This approach would also be consistent with the Extensible Web Manifesto approach of opening up low-level functions and letting developers choose how to move forward with them.

Crissov · 2016-08-16

These are indeed niches, and they are neither very likely nor unsolvable. IPv6 addresses, for instance, would be mostly covered if a colon-entity had to have no character, a whitespace character or another colon-entity on both sides. I’m not at all sure whether &emoji;, :emoji: or even something like &:emoji; would be a good idea to include in HTML. I asked here to inform an opinion, so thanks for helping with that. I believe parsing colon-sequences in browsers could improve accessibility and the user experience in some cases. Anyhow, I thought about it again:
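
The boundary rule mentioned above could look roughly like this. It is a sketch under two assumptions that the post does not spell out: “another colon-entity” means another known short name, and a run of colon-entities is converted only if every name in it is known. The table and `replaceBounded` are made up for illustration:

```typescript
const knownShortNames: Record<string, string> = {
  smile: "😄",  // U+1F604
  abcd: "🔡",   // U+1F521
  "1234": "🔢", // U+1F522
};

function isKnownShortName(name: string): boolean {
  return Object.prototype.hasOwnProperty.call(knownShortNames, name);
}

// Convert a whitespace-delimited token only if it consists entirely of
// one or more :name: sequences, all of them known.
function replaceBounded(text: string): string {
  return text
    .split(/(\s+)/) // keep the whitespace runs so the text can be rejoined
    .map((token) => {
      if (!/^(?::[A-Za-z0-9_+\-]+:)+$/.test(token)) return token;
      const names = token.slice(1, -1).split("::");
      if (!names.every(isKnownShortName)) return token;
      return names.map((n) => knownShortNames[n]).join("");
    })
    .join("");
}

console.log(replaceBounded("hi :smile: there"));       // "hi 😄 there"
console.log(replaceBounded(":smile::abcd:"));          // "😄🔡"
console.log(replaceBounded(":abcd:1234:abcd::1234:")); // unchanged
console.log(replaceBounded("12:30:45"));               // unchanged
```

Under this rule the typoed address is left alone because 1234 sits directly next to a colon-entity without being one itself.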

If HTML ever wanted to introduce named entity references for emoji characters (&foo;), they should be based upon existing short codes, and I’ve shown that this would pose fewer problems than might be expected, although I probably should have used Emoji One’s EAC, GitHub’s Gemoji or Muan.co’s Emojilib; there’s a lack of authority for the standardization of short names.
Short codes are primarily used in user-generated content. They’ll either be input directly (maybe with the help of auto-completion) or via an emoji picker GUI (using images or a native font). If they’re used in the canonical backend encoding, they need to be transformed to either embedded images (and their alt text) or Unicode characters for the frontend. If the browser knew whether the backend expected :smile:, 1f600.png, \u1F600 or 😀, it could do the conversion, but in almost every case that’ll already involve scripting, so I tend to agree that ECMAScript would be the right place to specify conversion functions and maps.