[Proposal] Content Aggregation Technology

robin · 2020-10-12

Dear all,

I am delighted to announce that we are releasing the first public draft of Content Aggregation Technology (CAT), about which we invite feedback from the Web community at large.

This document is intended to be the first step on a journey. Over the past years, many proprietary content aggregation technologies have been developed and deployed (AMP, Apple News Format, Facebook Instant Articles, MIP, Web Bundles). They have, however, largely been produced without the involvement of the parties whose content they aggregate. The primary intent of this document is to rectify this situation.

As such, CAT focuses primarily on taking two steps towards defining aggregation technology through cooperative methods rather than unilaterally, by leveraging market power:

it offers a framework with which to evaluate various approaches to content aggregation; and
it outlines a general architecture and a set of high-level requirements meant to improve upon the current situation.

In line with its cooperative intent, CAT proposes no solutions. These will need to be defined as outcomes of this discussion. Assuming the high-level requirements are roughly correct, however, improvements to the situation can be made incrementally and with orthogonal changes.

CAT is brought to your attention jointly by:

Advance and Advance Local;
The Globe and Mail;
The News Media Alliance (representing over 2,000 news organisations in the U.S.);
StarTribune;
Trib Total Media;
USA TODAY / Gannett;
The Washington Post; and
The New York Times.

The Web already has most of the ingredients needed to deliver amazing experiences that benefit all parties. We look forward to working with all of you to move away from unilateral and proprietary aggregators and to jointly develop the few additions that can improve the Web not just for powerful platforms, but for everyone.

frivoal · 2020-10-13

I feel like a good answer to the problem space you’re exploring would very likely also be a good answer to the question of what a good web-centric format for ebooks looks like.

EPUB, for instance, is made of web parts, but the end result isn’t very web-like, which I’d attribute in large part to the fact that it isn’t part of the hyperlinked web. It also has an impedance mismatch with the web due to being a single document made of multiple HTML files, without having a terribly clear model for how the stitching of those parts happens.

I’d suggest expanding the range of technologies that are being considered to also cover EPUB (and things like it), and possibly to expand the criteria on which all technologies are being evaluated if it feels like the current ones are missing some important aspect exposed by the addition of ebook focused technologies. Maybe something about archivability, or content integrity.

cartr · 2020-10-13

From a user’s perspective, my main concern with this draft is that it asserts that “optimal” user experience and privacy can be attained by giving news publishers unfettered access to the full power of the Web platform. In fact, a primary motivator behind both open solutions like RSS readers and proprietary solutions like AMP is that this is not the case.

The problems with news websites are extensively documented online, such as in https://idlewords.com/talks/website_obesity.htm. They can also be confirmed by visiting any major news website in a browser without an ad-blocker enabled. (I visited usatoday.com and left the homepage open for a few minutes, and it sent over two thousand requests to an astonishing variety of origins.) News websites are notorious for user-hostile patterns like autoplaying video popups and advertisements that cause UI elements to shift around just as you’re about to click on them.

It’s definitely possible to solve that problem through open standards. Here’s one idea: we could standardize on using something like Microformats to mark up article content so it can be extracted and displayed by user-agents without executing arbitrary publisher-provided JavaScript. This could work similarly to how RSS clients and platforms like Mastodon do today, by discarding all tags and attributes except for a small subset of acceptable semantic HTML. It could also be integrated with Transmissible Entitlement Tokens, so publishers can confirm the user has paid for the content before delivering the machine-readable version.
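A minimal sketch of the tag-and-attribute filtering described above, using only the Python standard library. The allowlist, the attribute rules, and the `sanitize` helper are assumptions made for illustration; neither the draft nor any existing RSS client is claimed to use exactly this list, and a real user-agent would need a hardened sanitizer rather than this toy.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist: which semantic tags survive is an assumption.
ALLOWED_TAGS = {"p", "a", "em", "strong", "blockquote", "ul", "ol", "li", "h1", "h2", "h3"}
# Per-tag attribute allowlist; everything else (onclick, class, style...) is dropped.
ALLOWED_ATTRS = {"a": {"href"}}
# Elements whose text content is discarded along with the tag itself.
DROP_CONTENT = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag, keep its text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()) or value is None:
                continue
            # Only plain web links are kept; javascript: and similar are dropped.
            if name == "href" and not value.lower().startswith(("http://", "https://")):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    """Return html_text reduced to the allowlisted tags and attributes."""
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

For example, `sanitize('<p onclick="x()">Hi <script>alert(1)</script><a href="https://example.com" class="c">link</a></p>')` keeps the paragraph and the link but drops the event handler, the class, and the script body.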

Unfortunately, I get the impression that work like this will not happen as part of this project. The draft describes displaying multiple articles in a carousel – a standard feature present in nearly all popular RSS readers – as “excessively hostile” and counter to the “decisively mutualistic and consensual philosophy of this document”. If even basic UI affordances like that provoke such a reaction, I can’t imagine that taking all decisions about content formatting out of publishers’ hands (and putting them into users’) would ever be considered acceptable.

robin · 2020-10-19

You bring up good points, but I think that the answer is more complex than that.

A lot of the time the companies pushing for greater bloat in the ad system turn out to be the same companies also then offering a “solution” in the form of AMP or similar projects. Ads are also increasingly invasive because those same companies are using their market power to keep extracting more money from the ad system, leaving less to publishers.

This isn’t long-term sustainable. The ad system needs fixing and the aggregation system needs fixing, but the latter shouldn’t be a solution to the former.

The document doesn’t intend to rule out RSS as hostile. There is plenty that could work with RSS, although if (as you describe) it only carries static content, it will necessarily be limited. If you look at things that readers flock to massively, like Covid maps or election results, neither of those works well (or at all) with purely static content. RSS is also different: it’s a direct relationship between readers and publishers.

robin · 2020-10-19

I think we were trying to solve specifically the problem of aggregation and not extend it too much, for instance to describe properties of any data format on the Web.

That said, Dave has also brought up EPUB on Twitter and there may indeed be an interesting path forward for better ebooks and CAT to merge.

kbsspl · 2020-10-20

https://github.com/nytimes/std-cat/ currently returns a 404.

cartr · 2020-10-22

I disagree.

Some of the organizations creating proprietary news-article formats also contribute to bloat and privacy invasion in the Web ad ecosystem. Some of them do not. For the ones that don’t, the primary reason to develop a proprietary restricted format for news articles is to provide better privacy and user-experience than linking to news websites directly. Any standard that does not address this will not be adopted by platforms or users.

This isn’t unique to ads, either. First-party content is entirely capable of causing problems on its own – many people use features like Reader Mode even when they also have an ad-blocker installed.

It’s worth noting that this content is a small fraction of all the news published online, and it’s often closer to a web application than a traditional article. I don’t think it’d be unreasonable to create a solution that, say, allows publishers to include hyperlinks to those web applications in their static content.

Not necessarily. RSS is an open standard, and there are platforms like Feedly and Flipboard that use it to provide a more curated/aggregated experience.

excitedbox · 2021-03-09

Content aggregation should be done using a format such as JSON through some sort of API. As was mentioned above, you cannot give another party access in ANY way.

Aggregation is a synonym for collection and not for connection. That means accessing a content provider and collecting resources.

Any spec that will see any kind of mass adoption will look more like an API with JSON or RSS.

It would be very easy for each side to have an API endpoint, with the request to the content provider including a URL to your own endpoint, the list of functions that are supported, and a security token. This way servers can have two-way communication, requesting and receiving data from each other while keeping an impenetrable barrier for both parties.
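A rough sketch of what such an exchange could look like, assuming a JSON message with three hypothetical fields (`callback_url`, `supported_functions`, `token`). None of these names come from the CAT draft; the function names and the shared-token check are placeholders for whatever an actual spec would define.

```python
import hmac
import json


def build_request(callback_url, supported_functions, token):
    """Aggregator side: announce our own endpoint, what we support, and a token."""
    return json.dumps({
        "callback_url": callback_url,                      # hypothetical field name
        "supported_functions": sorted(supported_functions),
        "token": token,
    })


def handle_request(raw, expected_token, provider_functions):
    """Content provider side: check the token, then agree on shared functions."""
    msg = json.loads(raw)
    supplied = str(msg.get("token", "")).encode()
    # Constant-time comparison so the check does not leak token contents.
    if not hmac.compare_digest(supplied, expected_token.encode()):
        return {"status": "rejected", "reason": "bad token"}
    shared = sorted(set(msg.get("supported_functions", [])) & set(provider_functions))
    # The provider would send its data to reply_to, never receiving page access.
    return {"status": "ok", "reply_to": msg.get("callback_url"), "functions": shared}
```

In this sketch each side only ever sees the JSON the other chooses to send, which is the barrier the paragraph above is asking for; transport, token issuance, and replay protection are left out.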

Right now we are seeing such a crazy invasion of privacy that even KFC accesses the list of programs running on your PC while you browse their website, and this HAS TO STOP IMMEDIATELY or the customers will start to revolt.

This data collection and sharing between “partners” has gotten so out of hand that there is no way to know what data has been leaked when a security breach happens. A company such as Target might have troves of data on people who have never shopped there, so those people would never know that they are at risk of having their data breached.

Hackers are even using FB quizzes to build up profiles on family members, so when you list your mother’s eye color on some “What Dragon are You” quiz, they use that to break Forgotten Password security questions on bank accounts.

I find that this oversight in the spec shows how little focus there is on the user’s rights and privacy, and that the only concern is how to help businesses. This is 100% the wrong approach.