Unify metadata in a more understandable & performant fashion


#1

Page/site metadata is a horrible, inescapable, frustrating, time-sucking mess. There’s no way to achieve compatibility with one user-agent without having other user-agents suffer, if only because of extra HTML to download.

Icons & Decoration

Icons and other bookmarking data are easily the worst of the lot.

If you believe the standards about browsers deciding what icons to download based on the sizes attribute, then you would merely have to write this titanic pile of sludge in order for every browser/crawler to download only what it needs:

<link rel="icon" type="image/png" href="/icons/favicon-16.png" sizes="16x16">
<link rel="icon" type="image/png" href="/icons/favicon-32.png" sizes="32x32">
<link rel="icon" type="image/png" href="/icons/favicon-96.png" sizes="96x96">
<link rel="icon" type="image/png" href="/icons/android-chrome-36.png" sizes="36x36">
<link rel="icon" type="image/png" href="/icons/android-chrome-48.png" sizes="48x48">
<link rel="icon" type="image/png" href="/icons/android-chrome-72.png" sizes="72x72">
<link rel="icon" type="image/png" href="/icons/android-chrome-96.png" sizes="96x96">
<link rel="icon" type="image/png" href="/icons/android-chrome-144.png" sizes="144x144">
<link rel="icon" type="image/png" href="/icons/android-chrome-192.png" sizes="192x192">
<link rel="icon" type="image/png" href="/icons/opera-coast.png" sizes="228x228">

<link rel="shortcut icon" href="/icons/favicon.ico" sizes="16x16 32x32 64x64">
<link rel="icon" type="image/svg+xml" sizes="any" href="/icons/vector-icon.svg">
<link rel="mask-icon" color="#ffffff" sizes="any" href="/icons/safari-pinned-icon.svg">

<link rel="apple-touch-icon" sizes="57x57" href="/icons/apple-touch-icon-57x57.png">
<link rel="apple-touch-icon" sizes="60x60" href="/icons/apple-touch-icon-60x60.png">
<link rel="apple-touch-icon" sizes="72x72" href="/icons/apple-touch-icon-72x72.png">
<link rel="apple-touch-icon" sizes="76x76" href="/icons/apple-touch-icon-76x76.png">
<link rel="apple-touch-icon" sizes="114x114" href="/icons/apple-touch-icon-114x114.png">
<link rel="apple-touch-icon" sizes="120x120" href="/icons/apple-touch-icon-120x120.png">
<link rel="apple-touch-icon" sizes="144x144" href="/icons/apple-touch-icon-144x144.png">
<link rel="apple-touch-icon" sizes="152x152" href="/icons/apple-touch-icon-152x152.png">
<link rel="apple-touch-icon" sizes="180x180" href="/icons/apple-touch-icon-180x180.png">

<link rel="manifest" href="/icons/manifest.json">

<meta name="apple-mobile-web-app-title" content="My App">
<meta name="application-name" content="My App">
<meta name="msapplication-TileColor" content="#da532c">
<meta name="msapplication-TileImage" content="/icons/mstile-144.png">
<meta name="msapplication-config" content="/icons/browserconfig.xml">
<meta name="msapplication-starturl" content="http://example.com/start.html">
<meta name="msapplication-navbutton-color" content="#f00">
<meta name="theme-color" content="#ffffff">

(I left out things like the Yandex Tableau API and the other tags Apple and Microsoft look for when Adding to Home Screen/Pinning, so this isn’t even a worst-case scenario.)
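To make that concrete, here is a rough Python sketch of the selection the sizes attribute is supposed to enable: given the declared icons and a target width, fetch exactly one file. The names parse_sizes and pick_icon are my own. So are the tie-breaking rules (smallest bitmap that is big enough, then a scalable "any" icon, then the largest bitmap); they are assumptions, not any browser's actual algorithm.

```python
from typing import Dict, List, Optional, Tuple


def parse_sizes(sizes: str) -> List[int]:
    """Return the widths named in a sizes attribute; 0 stands for 'any' (scalable)."""
    widths = []
    for token in sizes.lower().split():
        if token == "any":
            widths.append(0)
            continue
        width, _, height = token.partition("x")
        if width.isdigit() and height.isdigit():
            widths.append(int(width))
    return widths


def pick_icon(icons: List[Dict[str, str]], target: int) -> Optional[str]:
    """Choose one href for a target width (assumed policy, see lead-in)."""
    candidates: List[Tuple[int, str]] = []
    for icon in icons:
        for width in parse_sizes(icon.get("sizes", "")):
            candidates.append((width, icon["href"]))
    if not candidates:
        return None
    # 1. Smallest bitmap that is at least as wide as the target.
    big_enough = [c for c in candidates if c[0] >= target]
    if big_enough:
        return min(big_enough, key=lambda c: c[0])[1]
    # 2. Otherwise a scalable icon, if one was declared.
    scalable = [c for c in candidates if c[0] == 0]
    if scalable:
        return scalable[0][1]
    # 3. Otherwise the largest bitmap on offer.
    return max(candidates, key=lambda c: c[0])[1]
```

If user-agents behaved like this, a target of 32 would select only the 32x32 PNG and nothing else would need to be downloaded.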

But of course, it’s not that easy. Some browsers only download the last referenced icon. Others, the first. Most of them ignore the PNG icons and go for the ICO if you leave a reference to it. Firefox downloads all of them during the critical path, which is horrible, and you should star this Bugzilla bug to put a stop to it.

Philippe Bernard at RealFaviconGenerator has done a frightening amount of research and compatibility work, and even he is stumped on the best way of going about this issue, since performance and developer complexity are directly butting heads here. His FAQ is a document of pain, and the compatibility charts are worse.

Lack of standardization has allowed “magic URLs” to blossom, which results in servers handing out 404s for spurious requests. /favicon.ico is the perennial problem child, but this happens with all the different size naming conventions for apple-touch-icons up there, and /browserconfig.xml.

Yeah, if you don’t define all of those apple-touch-icons in the HTML with the proper sizes, and your host won’t let you put them in the root (think Blogger or Tumblr), here come the 404s.

These unnecessary 404s are bad for the server, since a 404 page is often more expensive to produce than a tiny static file, and bad for the client, which ties up a request uselessly and receives a far heavier 404 HTML page than a 16x16 ICO would have been.

As for theme-color, it’s already starting to fragment with Safari’s new Pinned Tab feature debacle. Should it be used as a foreground color? A background color? Better add more meta tags!

I for one would love for this to be rolled into CSS instead, similar to @viewport. That way, we could have something like this:

@interface {
  color: #f00;
  background-color: #00f;
  fill: #0f0;
  icon: image-set("/icons/16.png" 16w, "/icons/32.png" 32w, "/icons/64.png" 64w);
}

But so far this sort of thing seems a better fit inside the WebApp Manifest. Fine by me. As long as it’s not gunking up every single page load. If we end up with a bloated metadata file, fine, at least it’s cacheable. But inviting problems on every single HTML page like this is a distraction from creating the actual content.
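As a minimal sketch of what that cacheable file might carry, here is some Python that emits manifest JSON with the same values as the @interface example. The members name, theme_color, background_color and icons (with src, sizes and type) are taken from the Web App Manifest draft as I understand it. The build_manifest helper and the /icons/ paths are hypothetical.

```python
import json
from typing import Iterable


def build_manifest(name: str, icon_sizes: Iterable[int],
                   theme_color: str, background_color: str) -> str:
    """Serialize a small manifest; icon paths follow an assumed /icons/<size>.png scheme."""
    manifest = {
        "name": name,
        "theme_color": theme_color,
        "background_color": background_color,
        "icons": [
            {
                "src": f"/icons/{size}.png",   # hypothetical path layout
                "sizes": f"{size}x{size}",
                "type": "image/png",
            }
            for size in icon_sizes
        ],
    }
    return json.dumps(manifest, indent=2)


if __name__ == "__main__":
    print(build_manifest("My App", [16, 32, 64], "#f00", "#00f"))
```

The output would be served once from the URL named in <link rel="manifest"> and cached, instead of being repeated in every page's <head>.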

With careful testing and compromises, you can get 80% of the way there for icons & bookmarks with only a fraction of the end-user pain. But who wants to test their tiny icon frippery over the landscape of browser+OS combinations?

Link Previews

These are extraordinarily popular across services, showing a clear use-case with user benefits.

  • Facebook’s link previews
  • Twitter’s Cards
  • Pinterest’s Rich Pins
  • Google+'s link previews
  • Tumblr’s Link Posts
  • LinkedIn’s link previews
  • Skype conversations’ link previews
  • Search engines’ “rich snippets”, and arguably also their regular search results
  • Any of the services that employ embed.ly, meaning Reddit, bit.ly, Disqus, Medium, and many, many others

They all have the same baseline of required information:

  1. Page Name, because <title> often has the site name and/or taglines in it
  2. Canonical URL, because even Google can’t perfectly predict which query strings are important and which aren’t
  3. Preview Image, likely for aesthetic (but important) reasons
  4. A short tagline or description of what lies beyond

Many also display author information and the site name/URL. Many of them do fall back onto native HTML semantics, like <link rel="canonical">, so there may be value in providing a standard way of specifying site-wide information, featuring an on-page image, etc.
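To show how much duplicated lookup this pushes onto every consumer, here is a rough Python sketch of a preview extractor that tries OpenGraph, then Twitter Cards, then plain HTML. The og:* and twitter:* keys are the real ones. The PreviewParser class, the extract_preview function and the fallback order are my own assumptions, not any service's actual behaviour.

```python
from html.parser import HTMLParser
from typing import Dict, Optional


class PreviewParser(HTMLParser):
    """Collect <meta> pairs, the canonical link and the <title> text."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: Dict[str, str] = {}
        self.canonical: Optional[str] = None
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            key = attr.get("property") or attr.get("name")
            if key and attr.get("content") is not None:
                self.meta.setdefault(key, attr["content"])  # first one wins
        elif tag == "link" and "canonical" in (attr.get("rel") or "").lower().split():
            self.canonical = attr.get("href")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def first(meta: Dict[str, str], *keys: str) -> Optional[str]:
    """Return the first non-empty value among the given keys."""
    for key in keys:
        if meta.get(key):
            return meta[key]
    return None


def extract_preview(html: str, page_url: str) -> Dict[str, Optional[str]]:
    parser = PreviewParser()
    parser.feed(html)
    parser.close()
    meta = parser.meta
    return {
        "title": first(meta, "og:title", "twitter:title") or parser.title.strip() or None,
        "url": first(meta, "og:url") or parser.canonical or page_url,
        "image": first(meta, "og:image", "twitter:image"),
        "description": first(meta, "og:description", "twitter:description", "description"),
    }
```

Every service ends up writing some variation of this, each with its own fallback order.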

Service Autodiscovery

These aren’t a big problem, but they rarely change across a website, so having a unified place to store them and future things like them would be lovely.

<link rel="search" type="application/opensearchdescription+xml" href="/etc/osd.xml">
<link rel="alternate" type="application/rss+xml" href="/etc/rss.xml">
<link rel="alternate" type="application/atom+xml" href="/etc/atom.xml">
<link rel="pingback" href="/whatever">

These are the old guard, so for compatibility reasons we can’t do much about them, but for new experiments like Webmention I’d really rather not add yet another <link> for each.

Existing Standards

There are many metadata standards, solving the same problems over and over again:

  • Facebook’s (obsolete) Share API
  • XML Namespaces (also probably obsolete, and good riddance)
  • OpenGraph
  • Twitter Cards
  • Schema.org
    • …using RDFa
    • …using microdata
    • …using JSON-LD
  • microformats

Bespoke APIs for analytics and search services that specify the same damn information again also run rampant, like Swift and Parse.ly.

There are two existing methods for unifying site metadata: the in-development Web Applications Manifest, and Microsoft’s browserconfig.xml.

We shouldn’t scrap the in-page methods, because specifying unique info or dynamically changing it on a page can be useful. Vimeo switches its favicon to show Playing/Paused status, for example. Other information should remain embedded in the HTML for performance reasons, like prefetch <link>s.

Instead, I want to make either browserconfig.xml or the WebApp Manifest extensible and provide a single, cacheable location for information that is highly unlikely to change across a site. XML is uncool and never did work all that well, so the Manifest is more appealing.
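The precedence I have in mind could be sketched like this in Python: the site-wide file supplies defaults, and anything a page declares for itself wins. The resolve_metadata function and the keys (site_name, search, icon, title) are purely hypothetical; no existing manifest format defines them this way.

```python
from typing import Any, Dict


def resolve_metadata(site_manifest: Dict[str, Any],
                     page_metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Merge cached site-wide defaults with per-page values; the page wins."""
    resolved = dict(site_manifest)  # copy so the cached manifest is not mutated
    for key, value in page_metadata.items():
        if value is not None:       # None means "page did not specify this"
            resolved[key] = value
    return resolved
```

A page that swaps its favicon to show playback state would only override icon, while search and feed locations keep coming from the cached file.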

Existing HTML methods

I consider these shortcomings to be a failure of HTML, since the whole point was to let machines read it. I don’t even mean in the high-level, woo woo “Semantic Web” fashion, but “ah yes this is an element” and other bits of breaking up text.

Authorship

This one is especially bad. Between <address>, <link rel="author">, and <meta name="author">, we have three ways of marking up authorship, all of which have problems and which almost no consumers use.

<address> suffers from an incredibly confusing name, baroque restrictions on use, and a vague content model, meaning even valid use is difficult to parse.

The <meta> and <link> methods of naming and linking authors fail miserably on pages with multiple articles by multiple people, like home pages and other indexes.

Site & Page Titles

<title> was designed only for documents and didn’t anticipate the needs of sites. However, adding context is useful, and has been known to be useful since 1992:

<TITLE>Introduction -- AFS user's Guide</TITLE>

The proliferation of methods to parse out the document name from the site name suggests there is a real need for improvement here.

Document name:

  • <meta name="title"> (Yes, really)
  • <meta property="og:title">
  • <meta name="twitter:title">
  • http://schema.org/name
  • .h-entry >> .p-name

Site name:

  • <meta property="og:site_name">
  • <meta name="twitter:site">
  • <head itemscope itemtype="http://schema.org/WebSite"> <title itemprop="name">
  • .h-feed >> .p-name

It would be backwards compatible to put tags inside <title> for clarification, since it ignores any attempted tags inside when displayed to the user. Something like this:

<title><document>My First Post</document> - <site>My Weblog</site></title>

It could also be <page> and <collection> or whatever you like, the actual names aren’t important.
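For a consumer, pulling the two names apart could be as small as the Python sketch below. It assumes the consumer has the raw text of <title> in hand and that the hypothetical <document> and <site> names from above are the ones in use. The split_title function is my own invention.

```python
import re
from typing import Optional, Tuple

# Matches the hypothetical <document>…</document> … <site>…</site> markup.
_TAGGED_TITLE = re.compile(
    r"<document>(.*?)</document>.*?<site>(.*?)</site>",
    re.IGNORECASE | re.DOTALL,
)


def split_title(raw_title: str) -> Tuple[str, Optional[str]]:
    """Return (document name, site name); site name is None for untagged titles."""
    match = _TAGGED_TITLE.search(raw_title)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return raw_title.strip(), None
```

Untagged titles fall through unchanged, so existing pages would keep working with such a consumer.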

And So On

This is already far too long, but the rabbit hole grows ever deeper. Peeking inside the <head> of large news sites is like staring into a cement truck hauling sardines. All that would be needed is some agreed-upon, extensible metadata file that can be cached, thereby providing an alternative to new standards and features that all want their own slot in the <head>.


#2

Regarding icons: Making icons less painful and murky