Building websites that cannot read, exfiltrate, or store the data they operate on

pauwels · 2022-01-05

Summary

Redact is a project that aims to leverage only existing and well-known browser components to build “zero-knowledge” websites: sites that allow for user interaction with user data, but cannot themselves see that data. The project’s aim is to provably guarantee a site’s inability to abuse or sell user data, while maintaining the site’s ability to provide rich interactions between users and their data. This is achieved in a way that is fully backwards-compatible with “web 1.0”: HTML elements only, and absolutely no JS or client-side crypto. Furthermore, unlike existing data privacy initiatives that are based around trust and promising to abide by a user’s preferences, this technology provides technological guarantees that require no trust of the website on the user’s part or assumption of liability of having to securely store data on the website’s part.

The basic premise of the technology works as such:

1. A user installs the Redact client on their local device as a native app (this could be integrated into the browser in the future).
2. The client opens a port on the local device and begins listening for requests.
3. When the user visits a Redact-enabled website, the website serves a standard HTML page, but wherever user data would typically be placed, each piece of data is represented as an iframe element pointing to the Redact client.

   a. An example of this would be an iframe pointing to localhost:8080/.firstName.
4. The client responds with a series of pages that block CSRF attacks, and finally serves the data (as a string, integer, boolean, or multimedia type) inside the secured iframe element.
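
As a rough illustration of step 3, here is a minimal sketch of how a website might render a page whose user-data slot is an iframe pointing at the local client. The http scheme, the port, and the function names `redacted_field` and `render_profile_page` are assumptions made for this example; only the localhost:8080/.firstName address comes from the description above.

```python
from html import escape

# Assumption: the Redact client listens on localhost:8080 over plain HTTP,
# matching the example address in the steps above.
CLIENT_ORIGIN = "http://localhost:8080"


def redacted_field(path: str) -> str:
    """Return an iframe element asking the local client to render `path`.

    The website only knows the path (for example ".firstName"),
    never the value stored behind it.
    """
    src = f"{CLIENT_ORIGIN}/{path}"
    return f'<iframe src="{escape(src, quote=True)}"></iframe>'


def render_profile_page() -> str:
    """Build the HTML the website serves; no user data appears in it."""
    return (
        "<!DOCTYPE html>\n"
        "<html><body>\n"
        f"<p>First name: {redacted_field('.firstName')}</p>\n"
        "</body></html>\n"
    )


if __name__ == "__main__":
    print(render_profile_page())
```

The page is ordinary static HTML, so the same markup could equally be written by hand or produced by any templating system.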

Thanks to CORS, CSP, and the iframe sandboxing features long-stabilized and available in any modern browser, the data displayed in this iframe is completely unavailable for reading or exfiltration by the parent website. Additionally, thanks to the endpoint’s CSRF protections, the data endpoint cannot be manually hit to exfiltrate data via the XMLHttpRequest API. However, this only covers displaying user data.

Redact also supports full CRUD operations on “redacted” data. When the edit=true query parameter is added to the iframe source, the Redact client responds to the request with an input field appropriate to the requested data’s type. The user can then input a different value for the requested field, from within the secure iframe. We are currently exploring ways to provide visual feedback that a field has been “redacted” and can be safely interacted with, similar to the green lock icon browsers display for TLS.

Redact also includes features for smooth integration of redacted fields with the styling and UI of the parent website. A css=... query parameter allows for including arbitrary styling which will be applied to the returned iframe contents.
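
To make the two query parameters concrete, here is a small sketch of a helper that builds the iframe source for a field. The helper name `field_src`, the http scheme, and the port are assumptions, as is passing raw CSS text as the css value; the exact format the client expects is not specified here.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: same local client address as the localhost:8080 example.
CLIENT_ORIGIN = "http://localhost:8080"


def field_src(path: str, edit: bool = False, css: Optional[str] = None) -> str:
    """Build the iframe src for a redacted field.

    edit=True adds edit=true, asking the client for an input field.
    css, if given, is passed through the css query parameter
    (assumed here to be plain CSS text, URL-encoded).
    """
    params = {}
    if edit:
        params["edit"] = "true"
    if css is not None:
        params["css"] = css
    query = urlencode(params)
    base = f"{CLIENT_ORIGIN}/{path}"
    return f"{base}?{query}" if query else base


if __name__ == "__main__":
    print(field_src(".firstName"))
    print(field_src(".firstName", edit=True, css="color: red;"))
```

Using `urlencode` keeps characters such as `:` and `;` in the CSS from corrupting the query string.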

Redact includes a federated storage component for non-technical users to be able to store their user data for use and re-use across multiple websites in a secure way. The local client does not store a user’s data locally. When data is submitted to the client via an editable iframe, it is encrypted using a well-known symmetric encryption algorithm as provided by libsodium. The encrypted blob is sent, along with the unencrypted key value for retrieval, to a third-party storage provider who need not be trusted. Similarly, when data is requested by a website via the iframe API, it is fetched from storage, decrypted locally, and then served via secure iframe.
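
The storage flow can be sketched as follows. This is not the redact-store API: the class names `UntrustedStore` and `LocalClient` and their methods are hypothetical, and the cipher is injected as a pair of callables so the sketch does not commit to a particular libsodium binding. The point it illustrates is that the provider only ever sees the retrieval key and an opaque blob.

```python
from typing import Callable, Dict, Optional

# A cipher is any function from bytes to bytes. In practice this would be
# a libsodium symmetric primitive; here it is injected (assumption).
Cipher = Callable[[bytes], bytes]


class UntrustedStore:
    """Stands in for the third-party provider: sees paths and opaque blobs."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def put(self, path: str, blob: bytes) -> None:
        self.blobs[path] = blob

    def get(self, path: str) -> Optional[bytes]:
        return self.blobs.get(path)


class LocalClient:
    """Hypothetical client: encrypts before upload, decrypts after download."""

    def __init__(self, store: UntrustedStore, encrypt: Cipher, decrypt: Cipher) -> None:
        self._store = store
        self._encrypt = encrypt
        self._decrypt = decrypt

    def submit(self, path: str, value: str) -> None:
        # Only ciphertext leaves the device; the path stays in the clear
        # so the blob can be looked up later.
        self._store.put(path, self._encrypt(value.encode("utf-8")))

    def fetch(self, path: str) -> Optional[str]:
        blob = self._store.get(path)
        if blob is None:
            return None
        return self._decrypt(blob).decode("utf-8")
```

With PyNaCl (a Python binding to libsodium) installed, the `encrypt` and `decrypt` methods of a `nacl.secret.SecretBox` instance would fit these two callables.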

Redact’s final component allows for a website to provide a customized, “logged in” view to each of their users, with redacted data. Websites can optionally provide a (same-origin) relay_url=... query parameter to the editable iframe source. When the user submits data to the field, the relay_url endpoint is notified that the user has done so, providing the key of the data submitted and some highly restricted metadata such as submission time. This allows a website to keep a database of references/paths to user data associated with user sessions, and then populate the page with those data fields when the same user comes to visit the page again. This also eliminates the need to ever have another login screen.
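
On the website side, the bookkeeping this enables might look like the sketch below. The `RelayNotice` shape and the `SessionIndex` class are hypothetical; the actual payload sent to the relay_url endpoint is only described above as the data's key plus restricted metadata such as submission time.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List


@dataclass(frozen=True)
class RelayNotice:
    """Assumed shape of what the relay_url endpoint receives."""

    path: str          # key of the submitted data, e.g. ".firstName"
    submitted_at: str  # restricted metadata: submission time


class SessionIndex:
    """Maps a user session to the data paths it has submitted.

    The website stores references only; the values stay with the user.
    """

    def __init__(self) -> None:
        self._paths: DefaultDict[str, List[str]] = defaultdict(list)

    def record(self, session_id: str, notice: RelayNotice) -> None:
        # Keep each path once, in first-seen order.
        if notice.path not in self._paths[session_id]:
            self._paths[session_id].append(notice.path)

    def paths_for(self, session_id: str) -> List[str]:
        """Paths to render as iframes when this session returns."""
        return list(self._paths.get(session_id, []))
```

On a return visit, the site would look up `paths_for(session_id)` and emit one iframe per path, which is how a "logged in" view can be rebuilt without the site holding any of the data.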

Redact essentially allows a website to orchestrate and templatize a user’s interactions with their data, without ever seeing that underlying data. Imagine a social media site capable of knowing that a million users shared a million posts amongst each other, without ever knowing the contents of those posts, or in effect any concrete data about its users. Redact combines this with a federated way for users to store their own data, shifting the web’s paradigm of data ownership from platform-based to user-based. As a side-effect, it also eliminates one of the web’s trickiest subjects: login screens.

Motivation and Use Cases

The motivation for this technology arose out of data privacy concerns and the opaque abuse of user data for the purposes of behavior profiling and prediction. When data privacy and security were thrust into the public eye in 2018 with the Cambridge Analytica scandal, the primary question the developers of Redact asked themselves was, “How do we allow for the rich user experience of the web, without the providers of those experiences being capable of reading the contents of user interactions?”

Use cases:

A private, user-focused social media website. In such a website, users who have added each other as friends would be capable of reading each other’s posts, and the website would be capable of hosting and organizing those interactions, but the contents of the posts would be hidden from the website owner. A VERY rough/bare-bones implementation of this is available at redact-feed-ui.dev.pauwelslabs.com. This is the first publicly available Redacted website.
A secure and private EHR or telehealth portal. One of the primary issues in migrating health data to the web is the health portal provider’s ability (or inability) to properly secure that data. With the availability of a technology like Redact, the developer of a health portal could focus on building an innovative UI connecting patients to health providers without the overhead of having to build secure storage for user data. Additionally, by storing their own data, users/patients can transport that data across different portal providers.
Small, self-contained redacted modules. Rather than immediately switching to entire redacted websites, things like end-to-end encrypted chat modules could be packaged and provided to a website as a chunk of HTML and JS, where only that module is redacted.

Compatibility Risk

A redacted website depends on a locally installed client that can respond to secure iframe requests. If such a client were integrated into a web browser, it would lower the barrier to entry for accessing Redact websites, but it would also mean that if such a client were removed, it would completely break previously working websites.

Redact as a technology and protocol could co-exist with existing websites, but it would be incompatible with how data is currently handled on the web.

Links to implementations and demos

Redact has been fully implemented in an end-to-end fashion, complete with client and storage implementations, and a working redacted website showcasing the majority of its basic features. All code is provided under GPLv2 licenses.

Client: GitHub - pauwels-labs/redact-client: Receives incoming requests from the browser and serves up decrypted contents in a secured iframe in response.

Storage: GitHub - pauwels-labs/redact-store: Provides a universal encrypted data storage interface for Redact.

Crypto abstractions on top of libsodium: GitHub - pauwels-labs/redact-crypto: Contains all cryptographic abstractions used across redact codebases.

Some very basic docs and installation instructions are available here: https://docs.redact.ws

A redact-enabled website is available here: https://redact-feed-ui.dev.pauwelslabs.com

Concerns and Mitigations

Ultimate privacy is not always a good thing. Moderation helps ensure communities abide not only by a country’s laws, but also by a wider moral and ethical standard as defined by the owners of a website. Often cited is avoiding the proliferation of CSAM, or child sexual-abuse material, on platforms which use end-to-end encryption to guarantee the privacy of a user’s data. Detecting and fact-checking misinformation has also become an important aspect of being a responsible platform on the internet.

There are a couple of potential ways to mitigate these issues. Similar to Apple/WhatsApp’s proposals to package CSAM-detecting algorithms in client-side code for end-to-end encrypted messaging systems, such algorithms could be implemented, reviewed, and approved by the community and included within the Redact client. If the client is packaged as part of a browser, websites could additionally specify that certain algorithms must be available to scan user data, and report offending behavior to the website owner, either with or without the original material.

We are open to other alternatives and technologies that could be securely implemented at the client level to detect and report illegal material.

Minigugus · 2022-05-27

I had a similar idea around a year ago, where instead of an external server, the browser itself took on this role: Privacy-Safe Storage API · Issue #28 · privacycg/proposals · GitHub

I agree the privacy and security considerations are insane, especially with the offline capabilities web apps already have (Service Workers, Cache API, IndexedDB API, and so on). I guess the only limitation nowadays is the popularity of this technology. Can’t wait for W3C feedback.

I just created this post [Proposal] Kind of Trusted Execution Environment for Browsers when I saw yours, what do you think? You’re the first person I discovered that’s promoting this technology

pauwels · 2022-05-28

Hi @Minigugus ! Thanks for the encouragement! I like your post. I’ll add some comments on it and on the GitHub post as well to see if we can get some more traction.

I’ve been working on this idea for about 4 years now, and I’m fairly confident that it achieves its goals and security guarantees. However, having now also tried to publish it and present it to many more people from VCs to tech enthusiasts, there’s a common thread: the people we’re building this for don’t want this. Tech people have all been enthusiastic about it (great thread we had on the Rust subreddit about the idea: redact: tool for building decentralized, end-to-end encrypted websites : rust), but businesses run as soon as you tell them they won’t be able to access user data, and users can “get” what you’re saying if you explain it well enough, but using it is (currently) completely out of their grasp.

Furthermore, there are a lot of tricky problems that need to be solved to make this a useful system, some of which we’ve tried to tackle in Redact, some we haven’t. User and key management is a big one, as is sharing one user’s data across that user’s devices, or accessing that data on someone else’s device. In Redact we have the storage API to allow third parties to securely receive and store encrypted data and keys, and we exclusively use TLS certificates to authenticate/authorize anyone talking to anything. None of that deals with securely storing and transporting keys, or managing root certificate authorities, however.

Another issue is the sharing of data between parties which trust each other. If I’m a medical portal website and I orchestrate interactions between patients and doctors, both patient and doctor need to be able to interact with a shared set of encrypted data.

I’d be happy to call or videochat to talk more, or maybe help you set up Redact, which is essentially a working implementation of all this without requiring deep code integration with the browser.

raphaellouis · 2022-06-02

Hi all!

1. Important considerations

I think the idea proposed here is good
I myself created “an open proposal” that is similar to this: An interesting open, free, libre proposal for a scalable, sustainable, secure, private, accessible business model compared to alternatives such as Topics, Cookies, Flocs and other user tracking technologies
There is this proposal here as well: Measuring user interaction time with website - #12 by raphaellouis and/or Extending HTML As a Hypermedia , [Proposal] Document Element Metadata

2. Concept

I think if we put together all the proposals we would have what I call Open, Free, Libre Statistics or OFLS - The idea is that the html itself generates metadata about users - you don’t need any other user information
This data can be public, anonymous, or private. By default, the data of each website a person visits is private: only the company that operates the site handles it, not Google or any outside company, unless that company chooses to share it or does not use a security policy that prevents data sharing. Even if the company shares this data with another company, it is not the user’s data; it is the company’s own generated data.
I think that with this, companies would be safer, as they would think about whether they could share data or not - given that this data is not user data, but consumption data generated by users who accessed certain pages
the data generated in the pages is anonymous, private - it does not represent user data itself. That’s why I said it’s anonymous and private. Private, because who is responsible for this data is the company itself. And anonymous, because it is not done by the user’s information, but by the interaction that the user makes on the page.
The company can make this data public if it wants or leave this data private, the company decides
What’s interesting about this idea is that part of this consumption data goes back to the user, who could make it public or private too. If the user makes this page interaction data public, it could help scientific or market research, and the user could earn a percentage of its value. Browsers like Brave have a rewards program that allows the user to earn money, which happens when the user sees an ad on the internet.

3. Notes

The advantage of this is that other companies can use this open data to improve their products or services - which generates better security policies for users and for the company itself and for the technology market itself, etc.
If the user makes page interaction data public, everyone can see which companies are using it. It is a discreet way to check how these companies keep this data; eventually, each person can demand security for this data.
This page generated data does not produce a user profile - but only a generic interaction profile.
This could be done if the html itself has autogenerated metadata based on consumption of the page itself
My idea is different from the Brave browser proposal. I would like discreet and non-invasive advertising, that is, the company manages its own advertising and marketing, without annoying popup ads for the user.

4. Doubts

Is Redact data temporary?
What do you all think of the concept: Open, Free, Libre Statistics or OFLS?

pauwels · 2022-06-03

Is Redact data temporary?

No, Redacted data is persisted (encrypted) in the federated storage layer, and can be retrieved from any device with the Redact client installed.

What do I think of your concept?

I have to admit I’m a little confused. Although I get what you’re trying to push, your description and proposal are very scattered and cover a lot of topics at a very high level without going into specifics, which makes it hard to understand what you’re trying to accomplish exactly.

If I could give you a recommendation, it would be to focus on one very specific, very precise use case and work through it end-to-end, identifying the problematic parts and how you would fix them in the context of an implementation of your idea. This will help others relate to and participate in what you’re trying to do.

raphaellouis · 2022-06-03

@pauwels Please, could you analyze my algorithm? reference: An interesting open, free, libre proposal for a scalable, sustainable, secure, private, accessible business model compared to alternatives such as Topics, Cookies, Flocs and other user tracking technologies