A ZIP API in the browser?

domenic · 2014-05-19

I have for a long time heard from certain segments of the community that they would appreciate being able to do various “zip” related things in the browser. The exact ask is fuzzy, often coming in the form:

- I’d like to have some version of the capabilities provided by Node’s zlib API. To summarize, that includes:
  - GZIP compression/decompression
  - DEFLATE compression/decompression
  - “Raw” DEFLATE compression/decompression (I don’t know what this is)
  - Both streaming versions of these, and binary data chunk versions (i.e. one large ArrayBuffer -> another smaller/larger ArrayBuffer).
- I’d like to be able to create “.zip” files in my browser, so that I can prompt users to download them.
  - An example use case would be clicking “export these statistics to a zip of csv files”
  - This also allows the creation of .zip-based files, for example Microsoft Office files, or .cbz comic book files, or most Ebook formats

You can accomplish these today using user-space libraries, e.g. zip.js or zlib.js (“new users can only put two links in a post”). There are probably some Emscripten and/or asm.js versions floating around that are even very fast.

However, I think it would be valuable to provide these natively for the web platform. To me, it’s especially compelling to think of all that juicy C(++) code being shipped with every browser already, at least for the zlib case, which with just a bit of extra effort could be exposed for web platform authors. That way, you wouldn’t need to include a large script to download, or have to deal with web workers, or anything of the sort.

I often cite this kind of API as my favorite example of things that would be perfect to add to the web platform, but that nobody has made the time to gather implementer interest in or spec yet.

As such I was hoping to get the ball rolling on this conversation. What do people think of the above asks and use cases? Are any implementers interested in this? Are there objections stemming from the fact that you can do this in JS today, even if it’s not as fast or seamless as you might like? Has this secretly been something you’ve always dreamed about and never dared to ask for?

creationix · 2014-05-19

Deflate and inflate are the hard parts to implement. I’m currently using the pako library from nodeca on GitHub. It’s fast, smallish, and written in easy-to-consume CommonJS format.

If there was such an API in the browser, I would want just the basics. I would make it sync, but make it available to web workers so that it could be offloaded along with other cpu intensive work that usually surrounds this kind of work.

For JS-Git I need non-buffered deflate and inflate. I don’t need raw-deflate or gzip style deflate, but I’m sure others would like these extra formats. I also need a form of inflate where I can feed it bytes a chunk at a time and the parser tells me when the end of the deflate stream has been reached and gives me back the extra bytes. I don’t need the output streaming, but others might. We should probably have streaming inflate and deflate for completeness and symmetry.

If it is helpful I could draft up a concrete API for discussion.

johnallsopp · 2014-05-19

A use case I’ve explored is for localStorage. I’ve found that using existing JS zip libs for images as dataURLs is unrealistically slow. So it’s a +1 from me.

mathias · 2014-05-20

+1

Why should this API only be available in the browser, though? I understand browsers currently have things like deflate compression built-in while stand-alone ECMAScript engines don’t, but if feasible, I’d prefer to add these features to ECMAScript so that they’re available in Node.js etc. as well.

robin · 2014-05-20

In the past the feedback on this has been “it can be done in script, even if optimisations are needed we should see what comes up” and “this would need streams”.

I think we’re at the point where we ought to move ahead with what we have. +1 to get the ball rolling.

domenic · 2014-05-20

@mathias: I don’t think this is in any way a language feature. It might be an API that multiple environments implement, like setTimeout, but it wouldn’t be part of the language or VM.

@creationix given the positive feedback here, I think at least a first-pass draft of a concrete API would be very helpful. Your experience as someone with a concrete use case would be invaluable :). And your tendency to start with just the basics sounds good too.

mathias · 2014-05-20

Why not? Can you elaborate on this? setTimeout is an interesting example; ideally that should just be part of ECMAScript too.

Every feature that gets standardized is an opportunity to increase interoperability between ECMAScript engines in browsers and non-browser JS engines. It would be a shame to dismiss that opportunity right from the start.

robin · 2014-05-20

I don’t see why it needs to be either. It can be specified on its own, then if ES wants to make it a language requirement it can just reference it. There’s no reason that specifying it on its own makes it browser-only.

sindresorhus · 2014-05-20

It might make sense as a part of the JS standard library:
http://wiki.ecmascript.org/doku.php?id=harmony:modules_standard

For example in the Python standard library: https://docs.python.org/3/library/archiving.html

creationix · 2014-05-20

Quick question: What do you want to use for binary data? “Raw” encoded strings where each character is a char code between 0 and 255, a Uint8Array, or something else?

domenic · 2014-05-20

ArrayBuffer is the general plan.

creationix · 2014-05-20

// Normal deflate as a simple sync function
var deflated = zlib.deflate(data);
var inflated = zlib.inflate(deflated);

// Variants for the other two common encodings
zlib.deflateRaw(data);
zlib.inflateRaw(data);
zlib.gzip(data);
zlib.gunzip(data);

// Streaming Interface
var deflater = zlib.deflateStream();
var out = deflater.write(chunk);
out = deflater.write(chunk);
out = deflater.flush();

// When you know how many bytes to send
var inflater = zlib.inflateStream();
var out = inflater.write(chunk);
out = inflater.write(chunk);
out = inflater.flush();

// When you don't know where the deflate stream ends
var inflater = zlib.inflateStream(onEnd);
var out = inflater.write(chunk);
out = inflater.write(chunk);
function onEnd(extra) {
  // extra is the extra bytes from the last chunk that don't belong.
  // It could be zero-byte if you fed in the exact amount
}
// And then variants for the other two formats
// Or there could be an options hash in all these interfaces if that was
// preferred over multiple named functions.

creationix · 2014-05-20

The exact stream interface isn’t important. What matters to me is the onEnd in the last example. I need this in js-git, where I have a stream of unknown length and an unknown number of bytes in that stream are deflate data. I don’t know how many bytes to hand off to inflate. The only way to know is to perform the actual inflate, which knows internally when its state machine reaches the end state.

For performance, it might make more sense if the stream didn’t constantly flush output data. Instead something like:

var inflater = zlib.inflateStream();
inflater.write(chunk);
inflater.write(chunk);
var out = inflater.flush();
inflater.write(chunk);
out = inflater.end();

Here flush gives you the data that’s ready to be emitted. End is a flush plus a check: it tells the parser there will be no more data. It will throw if more data is expected, so end acts like a validator.
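To make the "where does the deflate stream end?" requirement concrete, here is a minimal sketch that uses Node's existing zlib module as a stand-in for the API being discussed. It relies on the `info: true` option of the sync functions, which returns the engine alongside the output, and on `engine.bytesWritten`, which counts the input bytes consumed. The helper name `inflateWithRest` is made up for this illustration and is not part of any proposal in this thread.

```javascript
const zlib = require('node:zlib');

// Hypothetical helper: inflate one zlib-wrapped stream from the front of
// `bytes` and hand back whatever input was left over after the stream ended.
function inflateWithRest(bytes) {
  // With `info: true` the sync call returns { buffer, engine }.
  const { buffer, engine } = zlib.inflateSync(bytes, { info: true });
  // bytesWritten is the number of input bytes the inflater actually consumed.
  const consumed = engine.bytesWritten;
  return { data: buffer, extra: bytes.subarray(consumed) };
}

// A deflate stream followed by bytes that belong to something else,
// similar to objects packed back to back in a git packfile.
const payload = Buffer.from('hello js-git');
const trailing = Buffer.from('NEXT-OBJECT');
const input = Buffer.concat([zlib.deflateSync(payload), trailing]);

const { data, extra } = inflateWithRest(input);
console.log(data.toString());
console.log(extra.toString());
```

The `extra` value plays the role of the argument passed to `onEnd` in the sketch above: the caller never has to know in advance how many bytes belong to the deflate stream.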

frkay · 2014-05-21

For your information, Deflate is the name of the main compression scheme used within ZIP archives. Raw Deflate is the bare minimum compressed data stream (no header of any kind, no checksum); usually it is used with Zlib wrappers (a small header and a checksum). GZIP defines the file format associated with .gz files (it adds a GZIP header, a file name, a time stamp and the original file size); the sole compression method used by GZIP is Deflate. These are defined by RFC 1951 (Deflate), RFC 1950 (Zlib) and RFC 1952 (GZIP). But Deflate is an aging compression method, and over time new ones (producing smaller compressed data, not really faster) have been added to ZIP archives: Deflate64, BZIP2, PPMD and even LZMA. http://en.wikipedia.org/wiki/Zip_(file_format)#Compression_methods Supporting only Deflate would thus cover only a subset of ZIP, and the implementation would appear to be stuck in the 90’s.
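As a rough illustration of how the three formats relate, the sketch below compresses the same input with Node's zlib in all three modes, then strips the Zlib and GZIP wrappers by hand and inflates what remains as raw Deflate. The offsets assume the simplest case: a 2-byte Zlib header with no preset dictionary, and a 10-byte GZIP header with no optional fields such as a file name, which is what Node emits by default as far as I know.

```javascript
const zlib = require('node:zlib');

const data = Buffer.from('statistics,2014\n'.repeat(50));

const raw = zlib.deflateRawSync(data); // RFC 1951: bare Deflate stream
const zl = zlib.deflateSync(data);     // RFC 1950: 2-byte header + Deflate + 4-byte Adler-32
const gz = zlib.gzipSync(data);        // RFC 1952: 10-byte header + Deflate + CRC-32 + size

// Strip each wrapper manually and inflate the remainder as raw Deflate.
// Assumes no preset dictionary (Zlib) and no optional header fields (GZIP).
const fromZlib = zlib.inflateRawSync(zl.subarray(2, zl.length - 4));
const fromGzip = zlib.inflateRawSync(gz.subarray(10, gz.length - 8));

console.log('sizes (raw, zlib, gzip):', raw.length, zl.length, gz.length);
console.log('round trips:', fromZlib.equals(data), fromGzip.equals(data));

// The last four bytes of a GZIP file hold the original size, little-endian.
console.log('gzip size field matches:', gz.readUInt32LE(gz.length - 4) === data.length);
```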

madcampos · 2014-05-22

While I really like the idea of this, I think that with newer methods coming into play we run into the same problems that video, audio, WebRTC, picture and all this stuff will have a few years from now. As an API, I think it should provide the basic functions that all these methods use, like low-level actions: Huffman coding, dictionary search and the like. I think that with the ongoing optimizations we can do much, but we also need ways to control things like memory allocation or types so that this kind of algorithm runs as fast as possible.

domenic · 2014-05-22

That is a great point. Hmm.

This is largely solved already by asm.js and asm.js-like techniques, i.e. using a fixed-size arraybuffer and manipulating the memory there manually instead of using JS’s native memory semantics, and using “type annotations” like x | 0 for integers.
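For readers who have not seen the style, here is a small sketch of what "asm.js-like" code looks like, using the Adler-32 checksum from the Zlib wrapper as the workload. It is only asm.js-like: there is no "use asm" directive and it is not validated as asm.js. The heap size and the names `HEAPU8` and `adler32` are choices made for this example.

```javascript
// One fixed-size buffer allocated up front; all data lives inside it and is
// addressed by integer offsets instead of being passed around as JS objects.
const heap = new ArrayBuffer(0x10000); // 64 KiB
const HEAPU8 = new Uint8Array(heap);

// Adler-32 over the heap bytes [ptr, ptr + len).
// The `| 0` coercions act as integer "type annotations".
function adler32(ptr, len) {
  ptr = ptr | 0;
  len = len | 0;
  let a = 1;
  let b = 0;
  let i = 0;
  for (i = 0; (i | 0) < (len | 0); i = (i + 1) | 0) {
    a = (((a + HEAPU8[(ptr + i) | 0]) | 0) % 65521) | 0;
    b = (((b + a) | 0) % 65521) | 0;
  }
  // Combine the two 16-bit halves; `>>> 0` makes the result unsigned.
  return ((b << 16) | a) >>> 0;
}

// Copy the input into the heap, then checksum it by offset and length.
const bytes = new TextEncoder().encode('Wikipedia');
HEAPU8.set(bytes, 0);
console.log(adler32(0, bytes.length).toString(16)); // 11e60398
```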

brianleroux · 2014-05-22

Ah! was driving me nuts I couldn’t login. Anyhow, that’s fixed. There is a Cordova plugin for this: http://plugins.cordova.io/#/package/org.chromium.zip

I’ve used it. It’s helpful for grabbing assets progressively. (Say: level 2 in a game.)

tomByrer · 2014-05-23

How about this API? http://stuk.github.io/jszip/documentation/api_jszip.html

mehdishojaei · 2014-05-25

I think zip functionality must be an ECMAScript feature, as @mathias mentioned.

mounir · 2014-05-26

I think it would be interesting to have some metrics about the time it takes to do some basic unzip operations on different systems, in JS and natively. It would also be great to have a reference implementation and see how it runs on top of FTL JIT or asm.js.