Request For Comments: SIMD.js in asm.js

luke · 2014-10-21

Here is a description of the proposed changes to asm.js validation to incorporate ES7 SIMD.js. These changes have landed in Firefox Nightly but (along with SIMD.js) will not be released until SIMD.js has stabilized in TC39. Thus, we are definitely open to feedback, comments and suggestions concerning the validation rules. (For any discussion of SIMD.js itself, please go to the GitHub repo or es-discuss.)

The extension to asm.js validation comprises:

New value types for SIMD expressions:
int32x4, float32x4
These vector types are not super/sub-types of anything:
- Since not a subtype of extern, vector types cannot be passed/returned from FFI (see topic)
Other 16-byte SIMD vector types are being proposed (e.g. float64x2, int8x16). The rest of the proposal will stick to just int32x4, float32x4, though.
New stdlib imports and global types for SIMD constructors and operations:
SIMD constructor imports stdlib.SIMD.(int32x4|float32x4) are given types {int32x4ctor, float32x4ctor}, resp.
Ability to import SIMD operations off of SIMD constructors.
- Same link-time validation rules apply as with stdlib Math imports.
- For example, var i4 = stdlib.SIMD.int32x4; var i4add = i4.add; has signature (int32x4,int32x4)->int32x4.
- Still working on the full list (will post later), but basically: everything in SIMD.js.
New numeric literal form: simdCtor(x,y,z,w)
where simdCtor is a global of type {int32x4ctor, float32x4ctor}.
For int32x4ctor, x,y,z,w must each be int numeric literals
For float32x4ctor, x,y,z,w must each be numeric literals (no fround coercion necessary/allowed)
Can be used in return, variable, and global variable type annotations.
New type annotation form: check(…).
where check is a stdlib import of (int32x4|float32x4).check, an operation that throws if the operand is not already a value of the associated SIMD vector type, otherwise returning the operand unmodified.
Can be used for parameter, return, and global variable import type annotations.
Can be used to provide the actual return type of a SIMD-returning function in ValidateCall.
A new value type, “doublelit”, which is a subtype of double and the type given to numeric literals containing a decimal character.
The reason for splitting double is to allow certain float32x4 SIMD ops to be passed double literals without requiring fround.
For example, float32x4.splat has signature: float -> float32x4 ∧ doublelit -> float32x4.
The SIMD constructors can be called as stdlib functions:
int32x4ctor has signature: (intish⁴)->int32x4
float32x4ctor has signature: ((floatish + doublelit)⁴)->float32x4
New dot-access expression forms:
expr.(x|y|z|w), where expr has type {int32x4, float32x4} and the result type is {signed, float}, resp.
expr.signMask, where expr has type {int32x4, float32x4} and the result type is signed

Here is an example asm.js module that uses these features:

function asmModule(stdlib, imports) {
    "use asm";
    var i4 = stdlib.SIMD.int32x4;  // simd constructor
    var i4c = i4.check;            // used for type annotations
    var i4add = i4.add;            // import simd op
    var g1 = i4c(imports.g1);       // global var import
    var g2 = i4(0,1,2,3);          // global var initialized
    function f(i,j) {
        i = i|0;
        j = i4c(j);                 // simd parameter
        var k = i4(0,0,0,0);       // simd local var
        k = i4(i, i+1, i+2, i+3);  // simd constructor call
        k = i4add(j, k);           // operation call
        return i4c(k);              // simd return
    }
    function g(i) {
        i = i|0;
        g2 = i4c(f(i, g1));         // simd-returning callsite
        i = g2.w;                  // simd property access
        return i|0;
    }
    return g;
}
print(asmModule(this, {g1:SIMD.int32x4(9,9,9,9)})(5)); // 17
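Since SIMD.js is not available in most engines, one way to try the module above is to hand it a stand-in stdlib. The following is only a minimal sketch under stated assumptions: makeFakeSIMD is a hypothetical helper (not part of SIMD.js or asm.js), it covers just the int32x4 operations the example uses, and the "use asm" directive is dropped so the code runs as plain JavaScript without validation.

```javascript
// Hypothetical stand-in for stdlib.SIMD; covers only what the example module uses.
function makeFakeSIMD() {
  function Int32x4Value(x, y, z, w) {
    // Apply ToInt32 to every lane, as the int32x4 constructor does.
    this.x = x | 0; this.y = y | 0; this.z = z | 0; this.w = w | 0;
    Object.freeze(this);
  }
  function int32x4(x, y, z, w) {
    return new Int32x4Value(x, y, z, w);
  }
  // check: throw unless the operand already is an int32x4, else return it unchanged.
  int32x4.check = function (v) {
    if (!(v instanceof Int32x4Value)) throw new TypeError("expected an int32x4");
    return v;
  };
  // Lane-wise addition with int32 wraparound (via the constructor's ToInt32).
  int32x4.add = function (a, b) {
    return int32x4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
  };
  return { int32x4: int32x4 };
}

// Same module as in the post, minus the "use asm" directive.
function asmModule(stdlib, imports) {
  var i4 = stdlib.SIMD.int32x4;
  var i4c = i4.check;
  var i4add = i4.add;
  var g1 = i4c(imports.g1);
  var g2 = i4(0, 1, 2, 3);
  function f(i, j) {
    i = i | 0;
    j = i4c(j);
    var k = i4(0, 0, 0, 0);
    k = i4(i, i + 1, i + 2, i + 3);
    k = i4add(j, k);
    return i4c(k);
  }
  function g(i) {
    i = i | 0;
    g2 = i4c(f(i, g1));
    i = g2.w;
    return i | 0;
  }
  return g;
}

const fakeStdlib = { SIMD: makeFakeSIMD() };
const exportedG = asmModule(fakeStdlib, { g1: fakeStdlib.SIMD.int32x4(9, 9, 9, 9) });
console.log(exportedG(5)); // 17
```

With i = 5 the constructor call builds (5, 6, 7, 8), adding g1 gives (14, 15, 16, 17), and the w lane is 17, matching the comment in the original example.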

Again, comments welcome. Sorry for the liberties taken with mixed regex/set syntax. I will update the proposal in-place in response to discussion.

abchatra · 2014-11-05

Looks good to me in general. Questions:

[quote=“luke, post:1, topic:676”] For int32x4ctor, x,y,z,w must each be fixnum or unsigned numeric literals [/quote] Why can’t this be signed?

[quote=“luke, post:1, topic:676”] Can be used in return, variable, and global variable type annotations. [/quote] How will a return to the JavaScript world handle a SIMD type? I understand you disallow it for foreign functions; this question is about the return from an external call.

[quote=“luke, post:1, topic:676”] Can be used for parameter, return, and global variable import type annotations. [/quote] How is parameter coercion done? Is it something like float32x4(a.x, a.y, a.z, a.w);? If yes what happens if a is undefined or null? Also if you have an example for global import type annotation that would be great.

There is no mention of Float32x4Array & Int32x4Array typed array access here. Is that reserved for future?

sunfish · 2014-11-05

It’s consistent with the existing asm.js 6.8.2 “NumericLiteral”. I don’t know the specific reason, but I’d guess it’s because it simplifies validation, since a leading ‘-’ is a unary operator in the grammar rather than being part of the literal.

At a return out of asm.js, the return value may be boxed, similar to how other scalar values may be boxed.

In addition to the 4-argument form, the float32 constructor also has a single-argument form which is the “type annotation” form mentioned above. float32x4(a) returns a unmodified, provided that the type check passes.

I think for now, having just load and store is attractive for its simplicity, since unaligned accesses are an important use case for SIMD, and load and store can support both aligned and unaligned accesses in a consistent way. And, they don’t need the shift trick used elsewhere in asm.js (x[i>>4]), which Float32x4Array etc. would need.

I’ve also started contemplating proposing ‘alignedLoad’ and ‘alignedStore’ functions to accompany them, which could be semantically identical to ‘load’ and ‘store’, but would allow us to define different performance characteristics. We could make ‘alignedLoad’ and ‘alignedStore’ faster on aligned accesses on some platforms at the expense of making them drastically slower on unaligned addresses (as in, a hardware misalignment trap may be generated which the JIT handles so that it can transparently fix everything up). That way, we could get what speed benefits there are to be had from known alignments, with consistency between the aligned and unaligned syntax and semantics, and all without the shift trick in asm.js. What would you think of this idea?
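To make the contrast between a byte-offset load and the shift trick concrete, here is a rough plain-JavaScript sketch. loadInt32x4 is a hypothetical stand-in for the proposed load operation (it returns an array of four lanes rather than a real SIMD value), and the heap names are illustrative only.

```javascript
// Hypothetical stand-in for a SIMD load: read four int32 lanes at any byte offset.
function loadInt32x4(u8, byteOffset) {
  if (byteOffset < 0 || byteOffset + 16 > u8.length) {
    throw new RangeError("load out of bounds");
  }
  // slice() copies the 16 bytes into a fresh buffer, so the source offset
  // does not need to be aligned.
  const copy = u8.slice(byteOffset, byteOffset + 16);
  return Array.from(new Int32Array(copy.buffer));
}

const heapBuffer = new ArrayBuffer(32);
const HEAP8 = new Uint8Array(heapBuffer);
const HEAP32 = new Int32Array(heapBuffer);
for (let i = 0; i < 8; i++) HEAP32[i] = i + 1; // heap holds 1..8 as int32s

// Scalar access needs the shift trick: byte pointer 16 becomes index 16 >> 2.
console.log(HEAP32[16 >> 2]);        // 5

// The load takes the byte offset directly, aligned to 16 bytes or not.
console.log(loadInt32x4(HEAP8, 16)); // [5, 6, 7, 8]
console.log(loadInt32x4(HEAP8, 4));  // [2, 3, 4, 5]
```

Byte offset 4 is not a multiple of 16, so a view indexed with x[i>>4] could not address it at all, which is the consistency argument made above.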

abchatra · 2014-11-05

Thanks Dan for the answers.

[quote=“sunfish, post:3, topic:676”] It’s consistent with the existing asm.js 6.8.2 “NumericLiteral”. I don’t [/quote] Negative literals (-NumericLiteral) are allowed in variable declarations. See asm.js 5.4, variable type annotations.

If it’s legal to return the boxed SIMD value to JavaScript, why disallow it in foreign functions?

Ah. Thanks. Though where is this API defined? I don’t see this as part of constructor API in SIMD.js spec

Interesting idea. Though there are kernels which will trap the hardware exception for you and make the operation seamless but slow (I remember the Windows ARM version did that long ago). It might be hard to gauge whether you are getting predictable performance or not. That said, I am no expert here. What platform are you looking to optimize for?

Do you mean a direct load from the typed array heap buffer instead of Float32x4Array? That might work. The TC39 committee may not like it, though.

ben · 2014-11-05

Interesting discussions!

I think this should indeed accept signed. Moreover, it seems misleading that the numeric literal form accepts unsigned. Indeed, ToInt32 is applied to the input anyhow, so the unsigned would be cast to an int32. So I propose that the int32x4 NumericLiteral accept fixnum and signed.

This is there, for instance. If the constructor only receives 1 argument which is the right SIMD type, the original value is returned.

asm.js being a strict subset of JS, it sounds acceptable to me to have features in SIMD.js which aren’t in asm.js (and not the other way around, as a strict subset). In this particular case, load and store provide the same solution to the problem of loading and storing values, and avoid introducing new array views and thus new -ish types (out of bounds accesses would need float32x4ish and int32x4ish).

luke · 2014-11-05

Sorry for mixup with int32x4 literals; that was just my bug in transcribing the type rules in Odin and Ben is right. Fixed in the OP.

You’re right it is rather irregular. The essential difference that motivated the current design is that FFI functions are meant to be somewhat fast (they have a special IC-like calling path in Odin), and boxing a SIMD value into a GC object is going to be really slow operation. Since symmetry is appealing, perhaps instead we should add the general asm.js rule that the parameter and return type of exported functions should be <: extern? Any thoughts on this?

In theory, you’d get the same alignment-trap-faulting behavior when executing as native, so, in that respect, asm.js is still predictably close to native :).

Here’s the link to the load/store proposal. We’ve also tentatively run it by a few TC39 members and heard no major objections once the rationale is given. Because the load/store operations are phrased in terms of indexing the given view, they’re just an optimized version of what you could write in JS.

luke · 2014-11-05

Well, I was wrong twice on this int32x4 numeric literal business. Current Odin accepts any int numeric literal (so anything in the range [INT32_MIN, UINT32_MAX]). Of course that’s not a reason in and of itself. While it may seem more explicit to only accept int32s (given that int32x4 performs ToInt32 on its arguments), the two reasons for accepting any int are:

it is more symmetric with scalar integer variable initializers (which have the same numeric literal range)
even though you can express any big unsigned [INT32_MAX+1, UINT32_MAX] literal as a negative literal, it can be more convenient to use a big unsigned literal when, e.g., specifying a bit pattern in hex (and we actually expect people to read/write SIMD.js in asm.js by hand).
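As a small illustration of the second point, the following sketch shows how ToInt32 maps a big unsigned hex literal onto the same lane value as the equivalent negative literal. isAsmIntLiteral is a hypothetical helper that only mirrors the [INT32_MIN, UINT32_MAX] range described above; it is not taken from Odin.

```javascript
// ToInt32, written the usual asm.js way.
function toInt32(n) {
  return n | 0;
}

// Hypothetical range check mirroring [INT32_MIN, UINT32_MAX] from the post.
function isAsmIntLiteral(n) {
  return Number.isInteger(n) && n >= -2147483648 && n <= 4294967295;
}

const signBit = 0x80000000; // big unsigned literal, convenient as a bit pattern

console.log(isAsmIntLiteral(signBit));          // true
console.log(toInt32(signBit) === -2147483648);  // true
console.log(toInt32(0xFFFFFFFF) === -1);        // true
console.log(isAsmIntLiteral(0x100000000));      // false
```

So a lane initialized with 0x80000000 and one initialized with -2147483648 hold the same int32 value; the hex form is just easier to read as a mask.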

Sorry for the churn. Updated in OP, but certainly open to other opinions.

abchatra · 2014-11-06

Thanks Folks for the detailed answers.

I prefer symmetry. It’s so much easier to understand the rules when you have symmetry. Though I will leave this to your preference.

Do you have an example for global variable import annotation for simdCtor?

luke · 2014-11-06

Sounds good to me. Sound good to you Benjamin?

I really should have put up an example module that used all these features. Adding that to the OP now. Let me know if this leaves anything still ambiguous.

ben · 2014-11-12

Although I really like the argument of symmetry between FFI and exported functions, I think the workaround to not being able to pass SIMD arguments to an exported function would be to pass 4 arguments instead and create the SIMD vector in the function body. This seems pretty artificial and would increase register or stack pressure.

/* i.e. instead of */
function f(vec) {
  vec = SIMD_int32x4(vec);
  // do something with vec
}
/* we'd have */
function f(x, y, z, w) {
  var vec = SIMD_int32x4(x, y, z, w);
  // do something with vec
}

For returning SIMD values, it’s even worse: to return a SIMD variable, one would have to extract the four lanes, store them in global variables (4 SIMD lane extractions and 4 global var stores) and implement accessors to these global vars.
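A rough plain-JavaScript sketch of what that return-side workaround could look like is given below. Everything here is hypothetical (workaroundModule, addVec and the getters are made-up names), and ordinary numbers stand in for SIMD lanes so that the sketch runs anywhere.

```javascript
// Sketch of the workaround: no SIMD value ever crosses the module boundary.
function workaroundModule() {
  // One module-level variable per lane of the last result.
  var rx = 0, ry = 0, rz = 0, rw = 0;

  // Takes two vectors as 8 scalar arguments instead of 2 SIMD arguments.
  function addVec(ax, ay, az, aw, bx, by, bz, bw) {
    ax = ax | 0; ay = ay | 0; az = az | 0; aw = aw | 0;
    bx = bx | 0; by = by | 0; bz = bz | 0; bw = bw | 0;
    // In real asm.js this would be one SIMD add followed by 4 lane extractions.
    rx = (ax + bx) | 0;
    ry = (ay + by) | 0;
    rz = (az + bz) | 0;
    rw = (aw + bw) | 0;
  }

  // Accessors the caller needs in order to read the result back.
  function getX() { return rx | 0; }
  function getY() { return ry | 0; }
  function getZ() { return rz | 0; }
  function getW() { return rw | 0; }

  return { addVec: addVec, getX: getX, getY: getY, getZ: getZ, getW: getW };
}

const m = workaroundModule();
m.addVec(1, 2, 3, 4, 10, 20, 30, 40);
console.log([m.getX(), m.getY(), m.getZ(), m.getW()]); // [11, 22, 33, 44]
```

The caller pays one call for the operation plus four more to read the lanes, which is the overhead being objected to.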

This argument needs to be mitigated with the usage of external SIMD values. If SIMD values could be passed to other platform APIs directly (say, to WebGL for instance), this could be interesting to keep. Moreover, the asm.js SIMD API might be a subset of the full SIMD.js API, so this would be nice to keep for interacting between asm.js code and SIMD-not-in-asmjs code. In a nutshell, this seems to depend upon the use cases. I don’t have a strong opinion here, except for the cases of readability and simplicity.

luke · 2014-11-12

Yes, but passing SIMD vectors by value through exported functions is also going to be slow (likely slower, b/c GC interaction). Internal asm.js->asm.js calls needn’t be penalized since they shouldn’t be calling exported functions. For that matter, from what I hear, real SIMD kernels don’t even need the ability to pass SIMD vectors as arguments at all, so asm.js supporting SIMD argument types is already of questionable value (and will cause horrible performance on engines that aren’t able to either inline the callee or pass SIMD vectors as unboxed arguments).

Nagy_Mostafa · 2014-12-16

Does float32x4ctor support mixed args of floatish and doublelit? The way it is specified now, it doesn’t. So either all floatish or all doublelit. This causes asymmetry with int32x4ctor which supports all combinations (because int literal types are subtype of intish). Do we need to specify all combinations of floatish/doublelit args as overloads?

luke · 2014-12-16

Oops, you’re exactly right, you shouldn’t have to specify all-floatish or all-doublelit; this is indeed what is in Firefox atm. Instead of a big overload set, I think we can formalize it with a sum type: (floatish + doublelit)⁴->float32x4. In general a sum type would require some sort of dynamic tagging, but the intended meaning here is that the immediate argument must be either floatish or doublelit, so you can statically know what you have. Make sense? Updated above.
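One way to picture the "statically known" sum type is a per-argument check in the validator. This is only a toy sketch: the type names are strings, isSubtype models a tiny fragment of the asm.js lattice, and validateFloat32x4Ctor is a hypothetical name, not Odin’s.

```javascript
// Tiny fragment of the asm.js subtype lattice, enough for this sketch.
function isSubtype(type, expected) {
  if (type === expected) return true;
  return (type === "float" && expected === "floatish") ||
         (type === "doublelit" && expected === "double");
}

// (floatish + doublelit)^4: each argument independently matches one of the
// two alternatives, and which one is known from its static type.
function validateFloat32x4Ctor(argTypes) {
  if (argTypes.length !== 4) return false;
  return argTypes.every(function (t) {
    return isSubtype(t, "floatish") || isSubtype(t, "doublelit");
  });
}

// Mixed float expressions and double literals are fine.
console.log(validateFloat32x4Ctor(["float", "doublelit", "floatish", "doublelit"])); // true
// A general double expression is not a doublelit, so it is rejected.
console.log(validateFloat32x4Ctor(["double", "float", "float", "float"]));           // false
```

No overload set is enumerated; the sixteen floatish/doublelit combinations all fall out of checking each argument on its own.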

Weiliang_Lin · 2015-01-07

Any purpose for this design? Instead of

  function g(i) {
    i = i|0;
    var g3 = i4(f(i, g1)); 
    i = g3.w; 
    return i|0;
  }

We have to write as below

  function g(i) {
    i = i|0;
    var g3 = i4(0, 0, 0, 0);
    g3 = i4(f(i, g1)); 
    i = g3.w;
    return i|0;
  }

James_Vickers · 2015-03-11

While the vectors look useful, that’s not how I want to program. Conceptually, there is the step of loading the values into vectors. I don’t know if the compiler will optimize this away, of course it would help if it did this when possible. In terms of the JavaScript code, it’s still requiring multiple instructions.

Also, this proposal is limited to 4 SIMD lanes (for floats). I’d like something able to make use of the capabilities of processors such as Haswell that can do 8 at once. I suggest that the proposal gets extended to include 256 bit wide SIMD instructions, with the SIMD being emulated if it’s not available (eg CPUs capable of 4 at a time would do that twice, but the JS code is the same).

I’d like there to be the means to call a SIMD instruction with one instruction, which would be very much like intrinsics in C++. I understand that there could be some problems with allowing the full range of instructions because of safety (like gather?). Allowing / encouraging SIMD with one statement would really be what SIMD is about - it does start with ‘Single Instruction’. This would require aligned data to be used, such as Typed Arrays, or smalloc in node/iojs.

I want to be able to call _mm256_div_ps and the like from JavaScript.

I’d like the SIMD functions to be callable on aligned data. Instead of a pointer, they can be given an object reference and position within that object (if that can be made to compile efficiently and safely).

luke · 2015-03-11

These are interesting points to discuss, but they concern the SIMD.js proposal itself (currently being discussed on a GitHub repo). What is being described in this topic is just the embedding of SIMD.js into asm.js by assigning types that guarantee full optimization.

sunfish · 2015-03-11

When you talk about “single instructions” and “multiple instructions”, I don’t quite understand what you’re saying.

Concerning the second paragraph, it’s an open design question. We’re doing 128-bit today, because it’s useful and widely implementable, and it’s a good way to work out the set of operations needed and the semantics. It will certainly be possible to extend the API to 256-bit or even 512-bit or more in the future. However, wider SIMD types either mean that application writers will need to write multiple versions of their code to get portability, or JITs will have to split wider types up on some machines, and while that’s doable, it can greatly increase register pressure, so it isn’t entirely ideal. Perhaps what we really want, beyond 128-bit, is N-bit, where N is determined by the JIT. Or perhaps we want something else entirely. But it’s an open question right now.
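To illustrate the "JITs will have to split wider types up" option, here is a sketch of an 8-lane add lowered to two 4-lane adds. add4 and add8 are hypothetical names, and Float32Array values stand in for SIMD registers; a real JIT would do this on machine registers, not arrays.

```javascript
// Stand-in for a native 128-bit (4 x float32) add.
function add4(a, b) {
  return Float32Array.of(a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]);
}

// Hypothetical 256-bit (8 x float32) add on a machine that only has add4:
// the operation is split into a low half and a high half.
function add8(a, b) {
  const out = new Float32Array(8);
  out.set(add4(a.subarray(0, 4), b.subarray(0, 4)), 0); // lanes 0..3
  out.set(add4(a.subarray(4, 8), b.subarray(4, 8)), 4); // lanes 4..7
  return out;
}

const a8 = Float32Array.of(1, 2, 3, 4, 5, 6, 7, 8);
const b8 = Float32Array.of(10, 20, 30, 40, 50, 60, 70, 80);
console.log(Array.from(add8(a8, b8))); // [11, 22, 33, 44, 55, 66, 77, 88]
```

Each 8-lane value now occupies two 4-lane slots, which is where the extra register pressure mentioned above comes from.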

Concerning alignment, this is also something that is still an open design question. There are two aspects to it: alignment of the data, and alignment assumptions of the accesses. Alignment of the data certainly matters, but it’ll need to be addressed within ArrayBuffer and other places where memory is actually allocated. Alignment assumptions of the accesses are less clearly valuable; if the data is actually aligned, unknown-alignment accesses are just as fast as known-alignment accesses on many processors. And, known-alignment accesses will add complexity in JS engines, as they’d have to handle alignment traps when the data isn’t actually aligned. So, it’s still being considered.

Memory references via load and store within objects will likely be something we can add once Typed Objects are standardized.

Of course, one can also have SIMD values as properties of normal JS objects or as global or local variables, which you can reference directly without using load/store.

If you have further questions or thoughts on the SIMD.js API itself, you’re welcome to file an issue on the GitHub repo issue tracker. Hopefully soon we’ll be switching over to a more proper forum, but at the moment the work is largely focused around writing the polyfill code as a reference implementation, so GitHub remains somewhat convenient.

ben · 2015-03-27

In addition to the original post, here’s a list of the SIMD operations for each type, as implemented as of today in SpiderMonkey (Firefox’s JS VM). This list may change in the future as the implementation evolves, adding or removing types and operations.

Here’s an example of a module that imports an operation and uses it:

function f(glob) {
  "use asm";
  var int32x4 = glob.SIMD.int32x4;
  var add = int32x4.add;
  var check = int32x4.check;
  function g() {
    var v = int32x4(1, 2, 3, 4);
    var w = int32x4(4, 5, 6, 7);
    return check(add(v, w));
  }
  return g;
}

The semantics of these operations are described by the polyfill implementation in the GitHub repo linked in the first message.

SIMD.float32x4.abs: float32x4 -> float32x4
SIMD.float32x4.add: float32x4,float32x4 -> float32x4
SIMD.float32x4.and: float32x4,float32x4 -> float32x4
SIMD.float32x4.bitselect: int32x4,float32x4,float32x4 -> float32x4
SIMD.float32x4.check: float32x4 -> float32x4
SIMD.float32x4.div: float32x4,float32x4 -> float32x4
SIMD.float32x4.equal: float32x4,float32x4 -> int32x4
SIMD.float32x4.fromInt32x4Bits: int32x4 -> float32x4
SIMD.float32x4.fromInt32x4: int32x4 -> float32x4
SIMD.float32x4.greaterThan: float32x4,float32x4 -> int32x4
SIMD.float32x4.greaterThanOrEqual: float32x4,float32x4 -> int32x4
SIMD.float32x4.lessThan: float32x4,float32x4 -> int32x4
SIMD.float32x4.lessThanOrEqual: float32x4,float32x4 -> int32x4
SIMD.float32x4.load: Uint8ArrayView,intish -> float32x4
SIMD.float32x4.loadX: Uint8ArrayView,intish -> float32x4
SIMD.float32x4.loadXY: Uint8Array,intish -> float32x4
SIMD.float32x4.loadXYZ: Uint8Array,intish -> float32x4
SIMD.float32x4.max: float32x4,float32x4 -> float32x4
SIMD.float32x4.maxNum: float32x4,float32x4 -> float32x4
SIMD.float32x4.min: float32x4,float32x4 -> float32x4
SIMD.float32x4.minNum: float32x4,float32x4 -> float32x4
SIMD.float32x4.mul: float32x4,float32x4 -> float32x4
SIMD.float32x4.neg: float32x4 -> float32x4
SIMD.float32x4.notEqual: float32x4,float32x4 -> int32x4
SIMD.float32x4.not: float32x4 -> float32x4
SIMD.float32x4.or: float32x4,float32x4 -> float32x4
SIMD.float32x4.reciprocalApproximation: float32x4 -> float32x4
SIMD.float32x4.reciprocalSqrtApproximation: float32x4 -> float32x4
SIMD.float32x4.select: int32x4,float32x4,float32x4 -> float32x4
SIMD.float32x4.shuffle: float32x4,float32x4,(int literal between 0 and 7 inclusive)**4 -> float32x4
SIMD.float32x4.splat: (floatish|doublelit) -> float32x4
SIMD.float32x4.sqrt: float32x4 -> float32x4
SIMD.float32x4.store: Uint8Array,intish,float32x4 -> void
SIMD.float32x4.storeX: Uint8Array,intish,float32x4 -> void
SIMD.float32x4.storeXY: Uint8Array,intish,float32x4 -> void
SIMD.float32x4.storeXYZ: Uint8Array,intish,float32x4 -> void
SIMD.float32x4.sub: float32x4,float32x4 -> float32x4
SIMD.float32x4.swizzle: float32x4,(int literal between 0 and 3 inclusive)**4 -> float32x4
SIMD.float32x4.withW: float32x4,(floatish|doublelit) -> float32x4
SIMD.float32x4.withX: float32x4,(floatish|doublelit) -> float32x4
SIMD.float32x4.withY: float32x4,(floatish|doublelit) -> float32x4
SIMD.float32x4.withZ: float32x4,(floatish|doublelit) -> float32x4
SIMD.float32x4.xor: float32x4,float32x4 -> float32x4
SIMD.int32x4.add: int32x4,int32x4 -> int32x4
SIMD.int32x4.and: int32x4,int32x4 -> int32x4
SIMD.int32x4.bitselect: int32x4,int32x4,int32x4 -> int32x4
SIMD.int32x4.check: int32x4 -> int32x4
SIMD.int32x4.equal: int32x4,int32x4 -> int32x4
SIMD.int32x4.fromFloat32x4Bits: float32x4 -> int32x4
SIMD.int32x4.fromFloat32x4: float32x4 -> int32x4
SIMD.int32x4.greaterThan: int32x4,int32x4 -> int32x4
SIMD.int32x4.greaterThanOrEqual: int32x4,int32x4 -> int32x4
SIMD.int32x4.lessThan: int32x4,int32x4 -> int32x4
SIMD.int32x4.lessThanOrEqual: int32x4,int32x4 -> int32x4
SIMD.int32x4.load: Uint8ArrayView,intish -> int32x4
SIMD.int32x4.loadX: Uint8ArrayView,intish -> int32x4
SIMD.int32x4.loadXY: Uint8ArrayView,intish -> int32x4
SIMD.int32x4.loadXYZ: Uint8ArrayView,intish -> int32x4
SIMD.int32x4.mul: int32x4,int32x4 -> int32x4
SIMD.int32x4.neg: int32x4 -> int32x4
SIMD.int32x4.notEqual: int32x4,int32x4 -> int32x4
SIMD.int32x4.not: int32x4 -> int32x4
SIMD.int32x4.or: int32x4,int32x4 -> int32x4
SIMD.int32x4.select: int32x4,int32x4,int32x4 -> int32x4
SIMD.int32x4.shiftLeftByScalar: int32x4,intish -> int32x4
SIMD.int32x4.shiftRightArithmeticByScalar: int32x4,intish -> int32x4
SIMD.int32x4.shiftRightLogicalByScalar: int32x4,intish -> int32x4
SIMD.int32x4.shuffle: int32x4,int32x4,(int literal between 0 and 7 inclusive)**4 -> int32x4
SIMD.int32x4.splat: integer -> int32x4
SIMD.int32x4.store: Uint8Array,intish,int32x4 -> void
SIMD.int32x4.storeX: Uint8Array,intish,int32x4 -> void
SIMD.int32x4.storeXY: Uint8Array,intish,int32x4 -> void
SIMD.int32x4.storeXYZ: Uint8Array,intish,int32x4 -> void
SIMD.int32x4.sub: int32x4,int32x4 -> int32x4
SIMD.int32x4.swizzle: int32x4,(int literal between 0 and 3 inclusive)**4 -> int32x4
SIMD.int32x4.withW: int32x4,intish -> int32x4
SIMD.int32x4.withX: int32x4,intish -> int32x4
SIMD.int32x4.withY: int32x4,intish -> int32x4
SIMD.int32x4.withZ: int32x4,intish -> int32x4
SIMD.int32x4.xor: int32x4,int32x4 -> int32x4
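For readers unfamiliar with the lane-index arguments in the swizzle and shuffle signatures above, here is a rough sketch of how they are usually read. It assumes the polyfill’s behavior (swizzle picks lanes from one vector; shuffle picks from the eight lanes of two vectors, with indices 0 to 3 naming the first and 4 to 7 the second) and uses plain arrays rather than real SIMD values; the function names are stand-ins.

```javascript
// swizzle: reorder the lanes of a single vector; each index is in 0..3.
function swizzle(v, a, b, c, d) {
  return [v[a], v[b], v[c], v[d]];
}

// shuffle: pick lanes from two vectors; indices 0..3 select from v1 and
// 4..7 select from v2 (assumed from the polyfill, see the lead-in).
function shuffle(v1, v2, a, b, c, d) {
  const both = v1.concat(v2);
  return [both[a], both[b], both[c], both[d]];
}

console.log(swizzle([10, 20, 30, 40], 3, 2, 1, 0));             // [40, 30, 20, 10]
console.log(shuffle([1, 2, 3, 4], [5, 6, 7, 8], 0, 4, 1, 5));   // [1, 5, 2, 6]
```

Requiring the indices to be int literals, as the signatures do, means the lane selection is known at validation time.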