There has been some discussion previously about allowing generic access to Unicode character names from entity references.
Emojis are popular and most of them seem to have a single unambiguous canonical “short name”, which is often different from their Unicode name. The syntax convention, also supported here in Discourse, is different, though: instead of a mandatory ampersand prefix and a sometimes optional semicolon suffix as in SGML / XML / HTML (&foo;), they are used with colons on both sides (:foo:).
I wonder whether browsers should support the colon syntax out of the box, so that the short names would have to be added to HTML, or whether it would be better to use the “short name” with the established …ML syntax. So I merged two JSON files, expecting a lot of clashes:
- https://html.spec.whatwg.org/entities.json
- https://github.com/iamcal/emoji-data/blob/master/emoji.json (edited version)
- → https://github.com/Crissov/unicode-proposals/blob/master/entities.json
There were only a few.
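The comparison can be sketched roughly like this. It is a minimal sketch and not the script actually used for the merge: it assumes that entities.json maps names such as &clubs; to an object with a codepoints list, and that each emoji.json record carries unified and short_names fields. The find_clashes helper and the inline sample data are made up for illustration.

```python
VS16 = 0xFE0F  # variation selector that requests emoji presentation

def find_clashes(entities, emoji):
    """Return {name: (entity_codepoints, emoji_codepoints, same_base)} for shared names."""
    # Strip the leading "&" and trailing ";" from entity keys (some legacy keys lack the ";")
    by_name = {key.strip("&;"): value["codepoints"] for key, value in entities.items()}
    clashes = {}
    for record in emoji:
        emoji_cps = [int(part, 16) for part in record["unified"].split("-")]
        for name in record["short_names"]:
            if name in by_name:
                entity_cps = by_name[name]
                # Equal apart from VS-16 means the entity is a usable fallback
                same_base = [cp for cp in emoji_cps if cp != VS16] == entity_cps
                clashes[name] = (entity_cps, emoji_cps, same_base)
    return clashes

# Tiny hand-made sample in the assumed shape of the two files
entities = {
    "&clubs;": {"codepoints": [0x2663], "characters": "\u2663"},
    "&yen;": {"codepoints": [0x00A5], "characters": "\u00A5"},
    "&it;": {"codepoints": [0x2062], "characters": "\u2062"},
    "&amp;": {"codepoints": [0x26], "characters": "&"},
}
emoji = [
    {"unified": "2663-FE0F", "short_names": ["clubs"]},
    {"unified": "1F4B4", "short_names": ["yen"]},
    {"unified": "1F1EE-1F1F9", "short_names": ["it", "flag_it"]},
    {"unified": "1F44D", "short_names": ["+1", "thumbsup"]},
]

for name, (entity_cps, emoji_cps, same_base) in find_clashes(entities, emoji).items():
    print(name,
          [f"U+{cp:04X}" for cp in entity_cps],
          [f"U+{cp:04X}" for cp in emoji_cps],
          same_base)
```

The same_base flag separates harmless clashes (same character, only VS-16 differs) from real ones where the entity and the emoji are different characters.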
Clashes
- Card suits &clubs; (U+2663), &hearts; (U+2665) and &spades; (U+2660) are actually equivalent except for an explicit VS-16 (U+FE0F). There are also the variants &clubsuit;, &heartsuit; and &spadesuit;, and, in case anyone wonders, the fourth card suit is currently available as &diams; and &diamondsuit; or as :diamonds:.
- The same applies to the telephone symbol &phone; (U+260E), except that it also has a preferred alias short name, and thus &telephone; can be used unambiguously for the emoji character.
- Bank note emojis, on the other hand, have separate code points from the currency symbols. Since the fallback is good, the emoji short codes could be altered, e.g. to use an uppercase initial letter or a verbose suffix like emoji or symbol.
  - &yen; U+00A5 ¥ vs. emoji U+1F4B4
  - &dollar; U+0024 $ vs. emoji U+1F4B5
  - &euro; U+20AC € vs. emoji U+1F4B6
  - &pound; U+00A3 £ vs. emoji U+1F4B7
- A single two-letter short name for a flag or regional indicator sequence clashes with an existing named entity: &it; for Italy (U+1F1EE+1F1F9) is already occupied by MathML’s Invisible Times (U+2062). However, there already is the systematic alias &flag_it;, and one could decide not to support, at all, the two-letter flag names for those that had existed in Japanese telco implementations prior to Unicode standardization (as Discourse does as well, for instance), e.g. &us; and &de;.
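As a small illustration of why the systematic flag aliases are cheap to support: a flag is just the two letters of the region code shifted into the regional indicator block. This is a minimal sketch; the flag_sequence helper is a made-up name, not part of any existing library.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A

def flag_sequence(code):
    """Map a two-letter region code such as 'it' to its regional indicator pair."""
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError(f"expected two ASCII letters, got {code!r}")
    # Shift each letter by its offset from "A" into the regional indicator block
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code.upper())

print([f"U+{ord(c):04X}" for c in flag_sequence("it")])  # ['U+1F1EE', 'U+1F1F9']
```

With such a rule, a name like &flag_it; would not need an individual table entry, whereas the bare two-letter names would still have to be listed (or rejected) one by one.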
There are four clashes left, three of which have a borderline acceptable fallback from emoji to HTML, but one is completely off. They would probably need a new designator that should also be back-ported to the short names.
- &ring; U+02DA ˚ vs. U+1F48D
- &smile; U+2323 ⌣ vs. U+1F604
- &top; U+22A4 ⊤ vs. U+1F51D
- &dash; U+2010 ‐ vs. U+1F4A8
PS: A couple of “short names” result in entity references that would perhaps be malformed according to SGML and XML rules (not sure about the underscore _), but that probably doesn’t concern WHATWG HTML:
- &-1;: U+1F44E
- &+1;: U+1F44D
- &100;: U+1F4AF
- &1234;: U+1F522
- &8ball;: U+1F3B1
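For the XML side, a quick check is possible. The sketch below uses an ASCII-only simplification of the XML 1.0 Name production (the real production also admits many non-ASCII characters) and does not cover SGML. The is_xml_name helper is a made-up name. As far as XML is concerned, the underscore is fine, even as the first character; a leading digit, hyphen or plus sign is not.

```python
import re

# ASCII-only simplification of the XML 1.0 Name production:
# the first character is a letter, "_" or ":";
# later characters may also be digits, "-" or "."
XML_NAME_ASCII = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")

def is_xml_name(name):
    """True if the whole string matches the simplified Name pattern."""
    return XML_NAME_ASCII.fullmatch(name) is not None

for name in ["-1", "+1", "100", "1234", "8ball", "flag_it"]:
    print(f"&{name};", is_xml_name(name))
```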
PPS: I also wonder whether variation selectors (at least U+FE0x) and Fitzpatrick scale emoji skin tone modifiers should be available as named entities, e.g. &emoji; = U+FE0F, so that &phone;&emoji; would give the same result as :telephone:.
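To make the idea concrete, here is a minimal sketch of how such a combination would expand. The ENTITIES table and the expand helper are hypothetical: &emoji; for U+FE0F is only the name floated in this post, while &phone; is assumed to be the existing entity for U+260E.

```python
import re

# Hypothetical entity table: "emoji" for VS-16 is only the naming suggested in this post,
# "phone" is assumed to be the existing HTML entity for U+260E
ENTITIES = {"phone": "\u260E", "emoji": "\uFE0F"}

def expand(text):
    """Replace &name; references found in ENTITIES; leave unknown ones untouched."""
    return re.sub(r"&([A-Za-z0-9_+\-]+);",
                  lambda m: ENTITIES.get(m.group(1), m.group(0)),
                  text)

result = expand("&phone;&emoji;")
print([f"U+{ord(c):04X}" for c in result])  # ['U+260E', 'U+FE0F']
```

The resulting sequence U+260E U+FE0F is the telephone character with an explicit request for emoji presentation.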