Expose function tables outside of asm.js module

chadaustin · 2014-11-14

If asm.js passes a function pointer (as a numeric index into a function table) to JavaScript, JavaScript currently has no way to look up the actual underlying function object. This is a severe penalty for Embind, which relies on the ability to map numeric function pointers to JavaScript function objects.

Benchmark results with this functionality (oldcomp, non-asm.js) and without it are at the end of this post. They show that the old Emscripten compiler without asm.js generates substantially faster calls through Embind.

Without function object lookup, Embind instead must go through an Emscripten-generated dynCall function, which adds some costly indirection: https://github.com/kripken/emscripten/blob/master/src/embind/embind.js#L907

I now believe it’s possible to eliminate part of the overhead (Function.prototype.bind), but it still wouldn’t be as fast as having direct access to the function.

Some proposals:

Allow function tables to be exported, but frozen, e.g. return { table_viii: Object.freeze(table_viii) }
Allow export of a new type of function which takes an index and returns the function object, e.g. return { get_function_viii: function(idx) { return table_viii[idx]; } }
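A minimal sketch of how the two proposals might look from the JavaScript side. It is plain JavaScript, not validated asm.js ("use asm" is left out because neither export form is accepted by asm.js today), and the functions store and storeNeg are hypothetical stand-ins for compiled code with the viii signature (void return, three int arguments).

```javascript
// Plain-JS stand-in for an asm.js module. "use asm" is omitted on purpose:
// neither export form below is valid asm.js at the time of this thread.
function makeModule() {
  var log = [];

  // Hypothetical compiled functions with signature viii.
  function store(a, b, c) { log.push(a + b + c); }
  function storeNeg(a, b, c) { log.push(-(a + b + c)); }

  var table_viii = [store, storeNeg];

  return {
    log: log,
    // Proposal 1: export the table itself, frozen so that outside code
    // cannot replace its entries.
    table_viii: Object.freeze(table_viii),
    // Proposal 2: export a lookup function from index to function object.
    get_function_viii: function (idx) { return table_viii[idx]; }
  };
}

const mod = makeModule();

// Either way, JavaScript ends up holding the function object itself
// and can call it without going through a dispatcher.
const viaTable = mod.table_viii[1];
const viaGetter = mod.get_function_viii(1);
viaTable(1, 2, 3);
```

In both cases the lookup happens once, when the binding is created, and every later call goes straight to the function object.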

Alon proposed duplicating the function table into the export list, in a format that includes the type signature and the index, e.g. “fp_viii_1: some_internal_function”, but that would have a rather large code size impact.

Benchmark results: JS → C++ via embind

Can’t figure out how to make a table in Specifiction markdown, so here’s a blockquote:

Firefox 33, Oldcomp / not asm: 56636000
Firefox 33, Fastcomp / “almost asm”: 52250000
Chrome 38, Oldcomp / not asm: 27212000
Firefox 33, Fastcomp / asm.js: 10681868
Firefox 33, Oldcomp / asm.js: 10657152
Chrome 38, Fastcomp / “almost asm”: 3247677
Chrome 38, Fastcomp / asm.js: 3234561
Chrome 38, Oldcomp / asm.js: 3210571

luke · 2014-11-14

The function tables are segregated by type so couldn’t you have a per-table function:

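function asmModule() {
    "use asm";
    function a1() { return 1; }
    function a2() { return 2; }
    // Per-table dispatcher: mask the index to the table size, then call through the table.
    function call_a(i) { i = i|0; return a_tbl[i&1]()|0; }
    // Function tables come after the function declarations and need a power-of-two length.
    var a_tbl = [a1, a2];
    return call_a;
}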

which you could use wherever you would have called an element of the proposed exported function table?

chadaustin · 2014-11-14

That’s exactly what Emscripten already does, exposed as dynCall_, and that’s what embind currently uses in asm.js.

The additional indirection hurts performance, which led to this issue.

luke · 2014-11-14

Ahh, when you said “dynCall”, I was thinking Runtime.dynCall, which takes (sig, ptr, args) and does dynamic lookup. But looking at the embind code you posted, it looks like requireFunction is basically returning dynCall_sig.bind(undefined, i) which skips all that on each call.

A few questions about the benchmark results:

Is this a microbenchmark (a loop calling a function with a trivial body) or a macro benchmark?
Instead of using .bind(), could you use an equivalent lambda that closes over rawFunction? I’m curious if that would help v8 inline.
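To make the second question concrete, here is a rough sketch of the two wrappers being compared. The module is a plain-JavaScript stand-in ("use asm" omitted so it runs anywhere), and add1, add2, table_ii and the pointer value are made up for illustration; only the dynCall_ naming and the .bind(undefined, i) shape come from the thread.

```javascript
// Stand-in for the compiled module: a per-table dispatcher in the style of
// Emscripten's dynCall_ functions. All names here are illustrative.
function makeModule() {
  function add1(x) { x = x | 0; return (x + 1) | 0; }
  function add2(x) { x = x | 0; return (x + 2) | 0; }
  var table_ii = [add1, add2];

  function dynCall_ii(ptr, x) {
    ptr = ptr | 0;
    x = x | 0;
    return table_ii[ptr & 1](x) | 0;
  }

  return { dynCall_ii: dynCall_ii };
}

const asm = makeModule();
const rawFunction = 1; // hypothetical function pointer handed to JavaScript

// What the thread describes as the current approach: bind the pointer once.
const viaBind = asm.dynCall_ii.bind(undefined, rawFunction);

// The suggested alternative: an equivalent lambda closing over rawFunction.
const dynCall = asm.dynCall_ii;
const viaClosure = function (x) { return dynCall(rawFunction, x); };
```

Both wrappers still pay for the table lookup inside dynCall_ii on every call; the closure only removes Function.prototype.bind from the path.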

Regardless of whether we add this feature, FF won’t be able to inline across the JS/asm.js boundary and will go through a somewhat costly trampoline (you can see the self time in the trampoline in the FF profiler if you check “Show Gecko Platform Data” in the devtools config panel). If you have a real (non-microbenchmark) case where you are calling through Embind a ton, could you instead put more of that code inside asm.js so that there are fewer JS->asm.js calls?

chadaustin · 2014-11-14

This is a microbenchmark. Sadly, I cannot do an actual apples-to-apples comparison in our application until fastcomp is upgraded to LLVM 3.4 or 3.5. LLVM 3.3 doesn’t have a functional -mergefunc, which we rely on for code size and performance. When fastcomp is upgraded, I can run an apples-to-apples comparison.

I would like to eliminate the .bind() now that I know how, and if you think it would help make the case, I can try to get to that this weekend.

I agree that the asm.js trampoline will hurt, but we care a lot about performance in Chrome too, and V8 benefits from having a direct JavaScript -> embind invoker -> “asm.js” function path.

Reducing calls between JS and asm.js is also important, but sometimes tricky. We’re at the optimization stage of our application where we’ve spent the last four months looking for 1% gains here and there, and this falls in that bucket.

azakai · 2014-11-15

I pushed an EXPORT_FUNCTION_TABLES option to emscripten incoming now, which reflects the function tables out of the asm module. This lets you avoid those dynCalls, which should avoid all that call indirection overhead, at the cost of a slight code size increase for the reflected tables. Would be interesting to see embind perf using that.

chadaustin · 2014-11-15

https://dl.dropboxusercontent.com/u/1602057/embind%20call%20benchmarks%202.pdf

I ran the microbenchmarks again, once with Function.prototype.bind replaced with a generated trampoline per function binding ( https://github.com/kripken/emscripten/commit/a563da7adcebe324aa00da2d9be9e3e71cd4f634#diff-1e25583510869c425881c98eac3b295bR905 ), and the second with Alon’s EXPORT_FUNCTION_TABLES ( https://github.com/kripken/emscripten/commit/5a3ae46c88dad44fdf89bed25e85a5f8085735fe ).

As you can see, replacing the .bind with a generated trampoline is dramatically faster in Chrome, and noticeably faster in Firefox.

Having access to the function tables is another noticeable win. Alon’s change duplicates the function tables, which is unfortunate for code size and because it requires exporting functions that may only be called internally or by function pointer.

If function tables could be exported, we would see the best of both worlds: high performance and small code size.

There is +/- a few percent noise in those measurements. The data was gathered in an Ubuntu VM on my Haswell 4771. Each number is the average of three runs.
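The thread doesn't show the benchmark harness, so the following is only a guess at its general shape: call a trivial function in a loop, convert the count to calls per second, and average a few runs. The function names, batch size and durations are all assumptions.

```javascript
// Count how many calls to fn complete per second, running for at least
// minMillis milliseconds. Calls are issued in batches so the clock is
// not read on every iteration.
function callsPerSecond(fn, minMillis) {
  const batch = 10000;
  const start = Date.now();
  let calls = 0;
  let elapsed = 0;
  do {
    for (let i = 0; i < batch; i++) {
      fn(i);
    }
    calls += batch;
    elapsed = Date.now() - start;
  } while (elapsed < minMillis);
  return calls / (elapsed / 1000);
}

// Average several runs to smooth out run-to-run noise.
function averageOfRuns(fn, runs, minMillis) {
  let total = 0;
  for (let r = 0; r < runs; r++) {
    total += callsPerSecond(fn, minMillis);
  }
  return total / runs;
}

// Example: three runs of a trivial function, 20 ms each.
const rate = averageOfRuns(function (i) { return i + 1; }, 3, 20);
```

With a body this small, the result mostly measures call overhead, which is why a harness like this is sensitive to the indirection discussed above.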

azakai · 2014-11-15

Thanks for the detailed numbers!

It looks like the table mirroring achieves very good performance. Firefox without asm validation and with the table mirrors reaches the highest value, 75076000.

As to why Firefox with asm validation is slower, it’s likely that this is a microbenchmark that stresses JS calling into the asm.js module. That has higher overhead when asm.js opts are on. In a large app where each such call does enough work it should be worth it, but on a test like this, this is the expected result.

Less clear is why Chrome does so poorly - its best result is 27566000, almost 3x slower than Firefox. Might be worth filing a bug for them to take a look.

It’s slightly curious that fastcomp (with mirrored tables) is a little slower on Chrome than oldcomp. However, given the almost 3x slowness noted in the previous paragraph, it could be noise compared to whatever bigger issue exists there.

Anyhow, overall, it looks like moving embind to mirrored function tables on fastcomp is a good thing, aside from the code size change. With that, fastcomp is almost as good as oldcomp on Chrome, and better on Firefox.

That leaves the issue of exposing function tables in asm.js itself as mostly a code size issue, not a perf issue. I agree that would be the optimal result, however, speccing exposing the function tables is not trivial (they shouldn’t be modifiable from outside, for example).

How big was the code size change with mirrored tables?

chadaustin · 2014-11-17

Until fastcomp is on LLVM 3.4 or greater, I can’t really do an apples-to-apples code size change in our codebase. We rely on -mergefunc pretty heavily at this point, and it’s broken in LLVM 3.3.

In my microbenchmarking, there aren’t enough function pointers to really show what the total code size impact would be. Mirroring function tables has two code size costs:

There’s the obvious code size impact of having each table in the generated JS twice.
In addition, mirroring the tables requires exporting functions that otherwise would be internal. Almost all function pointers used by embind are for internal invoker functions that should not otherwise be exported. The same goes for many vtable entries.
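Both costs can be seen in a small sketch. This is not the code EXPORT_FUNCTION_TABLES emits; it is a plain-JavaScript illustration with hypothetical names (invoker0, invoker1, mirror_ii) of what mirroring a table outside the module implies.

```javascript
// Stand-in for the compiled module ("use asm" omitted so it runs anywhere).
function makeModule() {
  // Internal invokers that are only ever reached through a function pointer.
  function invoker0(x) { x = x | 0; return (x * 2) | 0; }
  function invoker1(x) { x = x | 0; return (x * 3) | 0; }

  // First copy of the table, inside the module.
  var table_ii = [invoker0, invoker1];

  function dynCall_ii(ptr, x) {
    ptr = ptr | 0;
    x = x | 0;
    return table_ii[ptr & 1](x) | 0;
  }

  // Cost 2: to build a mirror outside, every table entry must also be
  // exported by name, even though nothing else needs it.
  return { dynCall_ii: dynCall_ii, invoker0: invoker0, invoker1: invoker1 };
}

const asm = makeModule();

// Cost 1: the table contents appear a second time, outside the module.
const mirror_ii = [asm.invoker0, asm.invoker1];

// The payoff: a direct call through the mirror, with no dispatcher in between.
const direct = mirror_ii[1](5);
const indirect = asm.dynCall_ii(1, 5);
```

An exported (frozen) table would give the same direct call without the second copy or the extra named exports.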

That said, I don’t know the percentage of our code that makes up function tables. I can check once fastcomp is upgraded to a newer LLVM.

azakai · 2014-11-17

We hope to have a WIP 3.5 branch fairly soon: https://github.com/kripken/emscripten-fastcomp/issues/51#issuecomment-63221450