<irrelevant> don't index this information


#1

The Problem

  • I use AltaVista to search for a Tweet I sent last month.
  • The search engine returns a selection of pages it thinks contain the text.
  • Looking at the pages shows the text isn’t present - but, say, an embedded Twitter stream is.

It’s quite common for a search engine to index a page without realising that some of the information on it is highly dynamic (say an RSS feed of headlines, a “most read” list, a “top comment from elsewhere” box, etc.).

It would be useful to be able to tell crawlers (and, potentially, screen readers, find-in-page boxes, etc.) “ignore this bit of the page.”

Saying <irrelevant> is a bit tongue-in-cheek. I suppose that <aside> is the correct tag, but perhaps with an added attribute, such as <aside ignore="ignore">?
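A rough sketch of how a crawler might honour such an attribute, written in Python with the standard library's html.parser. The `ignore` attribute comes from the suggestion above; the class name `IndexableTextExtractor`, the helper `indexable_text`, and the sample page are made up for illustration, and no real search engine is known to work this way.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IndexableTextExtractor(HTMLParser):
    """Collects page text, skipping any element that carries an `ignore` attribute."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.ignore_depth = 0  # greater than 0 while inside an ignored element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.ignore_depth > 0:
            # Nested element inside an ignored region.
            self.ignore_depth += 1
        elif any(name == "ignore" for name, _ in attrs):
            # Hypothetical attribute from the proposal: start ignoring here.
            self.ignore_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.ignore_depth > 0:
            self.ignore_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.ignore_depth == 0 and text:
            self.chunks.append(text)


def indexable_text(html):
    """Return the text a crawler would index under the proposed rule."""
    parser = IndexableTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = """
<article><p>My actual blog post.</p></article>
<aside ignore="ignore"><ul><li>Embedded tweet about lunch</li></ul></aside>
<footer>Contact me</footer>
"""
print(indexable_text(page))  # My actual blog post. Contact me
```

The sketch assumes reasonably well-formed markup; a real crawler would need a more forgiving parser for unclosed tags.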


#2

I think you are onto a good point here. Search engines have shied away from letting content authors write mark-up that could trick their engines, but with embedded widgets you do want a way to declare when something will expire.

There have been numerous times when I searched on Google only to find the content gone from the page; looking at the cached version, it had perhaps been in a sidebar of latest articles on Stack Overflow, linking to another article.

What I would however suggest is something more like:

content-expiry="2014-07-04T05:04:43+00:00"

This could be specified as a negative time to achieve complete irrelevance, and any ISO 8601 date format would be supported: http://en.wikipedia.org/wiki/ISO_8601

Obviously search engines would need to protect against abuse, but I think that is no more complex than the other things they deal with.
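To make the content-expiry idea a little more concrete, here is a minimal Python sketch of the check a crawler could run on the attribute value. The function name `is_expired` is hypothetical. The post does not say how a "negative time" would be written, so treating a leading "-" as "always expired" is only one guess, as is treating timestamps without an offset as UTC. Note that `datetime.fromisoformat` accepts only a subset of ISO 8601 on older Python versions.

```python
from datetime import datetime, timezone


def is_expired(content_expiry, now=None):
    """Return True if a content-expiry value says the content should no longer be indexed."""
    value = content_expiry.strip()
    # Assumption: a leading "-" stands for the "negative time" case, i.e. never relevant.
    if value.startswith("-"):
        return True
    expiry = datetime.fromisoformat(value)
    if expiry.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        expiry = expiry.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expiry


crawl_time = datetime(2014, 7, 5, tzinfo=timezone.utc)
print(is_expired("2014-07-04T05:04:43+00:00", now=crawl_time))  # True
print(is_expired("2014-08-01T00:00:00+00:00", now=crawl_time))  # False
```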


#3

The attribute version is better, and it would not be too hard to use today: the only thing that needs changing is the search engines, not the browser, which will simply ignore the attribute.

You should try sending this proposal to the search engines.


#4

I’ve encountered OP’s Twitter (and Facebook) example with increasing regularity. My thought has been that it’s just a matter of time until the search engines recognize this and incorporate it into their algorithms.

My concern with something like <irrelevant> is that it may put an additional burden on developers to include those tags, lest they get penalized for not including them AND that the “problem” would continue regardless.

Also, and I’ll admit I’m absolutely no expert in this area, it seems like WAI-ARIA, if used correctly on social widgets, could indicate to search engines that the related info is fleeting and should not be indexed.


#5

I’ll be honest: the blanket noindex approach has been tried many times before, in many ways: http://en.wikipedia.org/wiki/Noindex

This is why I think an expiry date would get more traction with the likes of Google.

We had a Google Search Appliance at work and the noindex comments were very useful; clearly there must be a reason for the lack of support for them in the main search engine.
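For anyone who has not seen them, the Search Appliance comments were, as far as I recall, paired markers of the form `<!--googleoff: index-->` and `<!--googleon: index-->`. The Python sketch below only approximates the effect by deleting everything between such a pair before indexing. The regex and the function name `strip_noindex_spans` are my own illustration and not how the appliance itself was implemented.

```python
import re

# Matches one googleoff/googleon pair and everything between them (non-greedy).
NOINDEX_SPAN = re.compile(
    r"<!--\s*googleoff:\s*index\s*-->.*?<!--\s*googleon:\s*index\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex_spans(html):
    """Remove the regions an author has marked as not to be indexed."""
    return NOINDEX_SPAN.sub("", html)


snippet = (
    "<p>Article text</p>"
    "<!--googleoff: index--><div>Latest tweets widget</div><!--googleon: index-->"
    "<p>More article text</p>"
)
print(strip_noindex_spans(snippet))  # <p>Article text</p><p>More article text</p>
```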

@madcampos my reasoning for suggesting only time-based formats is the same: keywords like "today", "tomorrow", etc. are easy to use but are also less specific. Search engines would likely be annoyed if everyone started marking content as expired as a way to add duplicated content. Whereas the date-based format could actually require the implementer to significantly change the content within that time frame, and search engines could penalise editors who repeatedly use it to mark spam or duplicated content as expired.

That would likely squash the issue raised by @metaprinter: content authors would be penalised for misusing the attribute, not for failing to use it at all. The likes of Google just want to penalise sites that are behaving badly. Content-based widgets would likely fix themselves quickly with the new attribute too.