API Set for Machine Learning on the Web


#1

With the recent breakthroughs in deep learning and related technologies, Machine Learning (ML) algorithms have improved drastically in accuracy, range of applications, and performance. While ML is typically thought of as applicable only on servers, the inference step of a machine learning model can run on the device as well. Development of a machine learning application usually involves two stages:

  • The developer first trains the model by creating a skeleton framework and then iterating on the model with a large dataset
  • The developer then ports the model to a production environment so that it can infer insights from user input

Training typically takes place in the cloud because it requires a significant amount of data and computing power, but inference can take place either in the cloud or on the device. Running inference on the device has a number of appealing properties, such as better performance from computing at the edge, resilience to a poor or missing network connection, and better security and privacy protection.

Although platforms for native applications have all shipped APIs to support machine learning inference on device, similar functionality has been missing on the web platform. To fill the gap, we could provide an API set including:

  1. WebAssembly with GPU and multi-thread support
  2. A WebML (Web Machine Learning) API with a pre-defined set of mathematical functions that the platform can optimize for
  3. A WebNN (Web Neural Network) API that provides a high-level abstraction to run neural networks efficiently.

Please take a look at the explainer for more details, such as use cases, problem statement, proposal, and related research. Feedback is welcome! I would love to hear what you think :grinning:!


#2

GPGPU computing on WebAssembly would suggest that the native side has been standardized, which it has not. I believe the “de facto” standard is CUDA, but since that is far from an open standard, I’m not sure how that would work. AMD has been working on ROCm and HPC, which somewhat define an open implementation of CUDA, but it is very specific to their hardware, and binding nvcc into the browser doesn’t seem like a viable way forward for implementations.

On the other hand, there is a community group around this (https://www.w3.org/community/gpu/). Safari ships the only implementation, and it is highly experimental. I remember seeing someone use it to implement a neural network library.

I would very much like to see some work in this general direction. (2) (linear algebra subroutines) would be nice to have as a first-class citizen, possibly with a slightly more usable matrix/vector type. (3) is a bit iffy, as most common libraries used nowadays depend on yet another closed-source library (cuDNN, although it is possible to live without it at the cost of slower performance).


#3

WebDNN was implemented with WebGPU, WebGL, and WebAssembly. The explainer had a link to a list of JS libraries written for this: https://github.com/AngeloKai/js-ml-libraries


#4

IMHO, both (2) and (3) look good to me. Several neural network frameworks like Core ML, WebDNN, etc. are offering their own model data converter for well-known libraries such as Keras, Caffe, TensorFlow, etc. So we could assume neural network model compatibility to some extent. Of course, (2) looks better for extensibility.

FYI: Neural network acceleration is no longer limited to GPGPU. Here are some examples of native APIs:

  • The Android 8.1 NDK provides the Neural Network API, used by TensorFlow Lite, Caffe2, etc., which could be integrated with a dedicated neural network processor in the device.
  • iOS 11 provides the Core ML framework, which would make use of the neural engine in the A11 chip if available.

#5

Thanks for initiating the discussion! I am excited about the idea of bringing a hardware-accelerated machine learning API to the web platform and would like to contribute.

I’d like to echo the problem statement from a hardware optimization angle:

  1. Today’s web platform is disconnected from the most efficient neural network implementations for CPU and GPU.
  2. And it is disconnected from the emerging neural network accelerators.

For point 1, taking WebAssembly optimization as an example, it is possible to optimize neural network inference with WebAssembly’s 128-bit SIMD (four 32-bit float lanes). However, a native implementation such as MKL-DNN can leverage wider SIMD instructions, e.g. AVX-512 (sixteen 32-bit float lanes), if they are available on the device’s CPU. Optimization with WebGL/WebGPU is in a similar situation compared to GPU optimization in native DNN libraries.

For point 2, the hardware industry is moving fast to innovate in AI accelerators. These accelerators, from DSPs and FPGAs to dedicated ASICs, improve performance as well as reduce power consumption. In particular, they make efficient neural network inference on edge devices possible.

I agree that a dedicated accelerated machine learning API will fill the gap and enable innovative AI-based web applications on edge devices.

Regarding the API scope, Angelo proposed three aspects: 1) WebAssembly with GPU and multi-thread support, 2) WebML, and 3) WebNN. It is a complete set. Among them, I happen to have looked into a Web API for neural network inference. I think it corresponds to 2) or 3)? I’d like to share my thoughts and welcome feedback.

The Web API for accelerated neural network inference should:

  1. allow building a neural network from common building blocks, for example convolution, pooling, softmax, normalization and activation.
  2. allow compiling the neural network to a native optimized format for hardware execution.
  3. allow setting up input from various sources on the web (e.g. a media stream), scheduling asynchronous hardware execution, and retrieving the output when hardware execution completes.
  4. allow extension with new building blocks if they are widely supported natively.

With such a Web API, web apps and libraries:

  1. can enable various use cases by connecting text, image, video, audio, and sensor data to a neural network as inputs.
  2. can get the best power and performance for neural network inference by offloading to native implementations and exploiting the hardware capabilities.
  3. have the flexibility to integrate different neural network architectures (e.g. MobileNet, SqueezeNet, and DenseNet, to name a few) and various model formats (e.g. ONNX, Caffe, TensorFlow).
  4. can still innovate with new building blocks in WebAssembly and WebGL/WebGPU, acting as a polyfill, and get acceleration once the API extension is available.
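
To make the four API requirements listed in this post more concrete, here is a minimal sketch of a build / compile / execute flow. Everything in it is hypothetical: `ModelBuilder`, `CompiledModel`, `relu` and `softmax` are names invented for illustration, and the "compile" step only freezes the list of operations in plain JavaScript rather than producing a native format.

```typescript
// Hypothetical sketch: none of these names come from the proposal or a shipped API.
type Tensor = Float32Array;
type Op = (input: Tensor) => Tensor;

// Requirement 1: common building blocks (here only two element-wise examples).
const relu: Op = (x) => x.map((v) => Math.max(0, v));

// Numerically stable softmax: subtract the maximum before exponentiating.
const softmax: Op = (x) => {
  const max = x.reduce((a, b) => Math.max(a, b), -Infinity);
  const exps = x.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((v) => v / sum);
};

class CompiledModel {
  private readonly ops: readonly Op[];

  constructor(ops: readonly Op[]) {
    this.ops = ops;
  }

  // Synchronous core, applying each operation in order.
  runSync(input: Tensor): Tensor {
    return this.ops.reduce<Tensor>((tensor, op) => op(tensor), input);
  }

  // Requirement 3: asynchronous execution. A real implementation would
  // resolve when the hardware finishes; this stand-in resolves immediately.
  async run(input: Tensor): Promise<Tensor> {
    return this.runSync(input);
  }
}

class ModelBuilder {
  private readonly ops: Op[] = [];

  // Requirement 4: any function of type Op can be added, including
  // caller-defined blocks that act as a polyfill.
  add(op: Op): this {
    this.ops.push(op);
    return this;
  }

  // Requirement 2: "compile" the network. Here it only snapshots the op list.
  compile(): CompiledModel {
    return new CompiledModel([...this.ops]);
  }
}
```

A caller would write something like `new ModelBuilder().add(relu).add(softmax).compile()` and then `await model.run(input)` with a `Float32Array` input.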

#6

LGTM, @Ningxin_Hu.

Regarding input interfaces, I’d like to clarify our current situation:

  • We can obtain decoded image data as ImageData via CanvasRenderingContext2D.getImageData().
  • We can process audio data via AudioWorklet or ScriptProcessorNode (to be deprecated) in Web Audio API.
  • We can obtain a frame from a <video> element by drawing the frame on a <canvas> element.
  • We can obtain a frame from a MediaStream by attaching the stream to srcObject in a <video> element. Note that there is no equivalent spec to Web Audio API for video streams yet.

Generally, a likely minimum requirement is that an ArrayBuffer, TypedArray, or string can be used as an input. It might also be desirable for real-time sources such as MediaStream or MediaStreamTrack to be usable as an input.
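
As an illustration of the TypedArray input path, here is a minimal sketch that turns RGBA pixel data (the layout that `getImageData()` returns) into a `Float32Array`. The `RgbaImage` interface and the `imageToTensor` function are hypothetical names, and the interleaved RGB layout and the scaling to [0, 1] are assumptions, since the expected layout depends on the model.

```typescript
// Structural stand-in for the ImageData fields used here, so the sketch
// also runs outside a browser. Hypothetical name.
interface RgbaImage {
  data: Uint8ClampedArray; // RGBA bytes, row-major, 4 bytes per pixel
  width: number;
  height: number;
}

// Converts RGBA bytes to interleaved RGB floats in [0, 1], dropping alpha.
// Layout and scaling are assumptions for this sketch; models differ.
function imageToTensor(image: RgbaImage): Float32Array {
  const pixelCount = image.width * image.height;
  const tensor = new Float32Array(pixelCount * 3);
  for (let p = 0; p < pixelCount; p++) {
    for (let c = 0; c < 3; c++) {
      tensor[p * 3 + c] = image.data[p * 4 + c] / 255;
    }
  }
  return tensor;
}
```

In a browser, the object returned by `CanvasRenderingContext2D.getImageData()` has these three fields, so it could be passed in directly.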


#7

Thanks for the clarification, @tomoyukilabs.

IMO, TypedArray input is an MVP feature.


#8

One concern is that loading and parsing model data that is hundreds of MBs in size might be too heavy for the JavaScript runtime. Can we consider another approach that loads and parses model data without storing it in a JavaScript variable?


#9

This is a valid concern. In my mind, we may start with some network architectures optimized for size, say MobileNet (~16MB) and SqueezeNet (~5MB). We may also consider an API enhancement that allows loading data from a URL.


#10

As a proof of concept (POC), we put together a Web ML API polyfill and a browser prototype.

You may be interested in checking out the examples:

https://huningxin.github.io/webml-examples/

The examples started with MobileNet. The model is currently in TensorFlow-Lite file format (FlatBuffers). Testing other architectures (e.g. SqueezeNet) and formats (e.g. ONNX and coreml) is in the plan.

The JavaScript API of the POC focuses on neural network inference. The current API is modeled on NN API. It just serves as a starting point for iterations with other APIs, e.g. MPS/BNNS and DirectML.

The polyfill implements two backends: WebAssembly (WASM) for CPU and WebGL2 for GPU.

The browser prototype is based on Chromium M65 and supports Android and macOS. On Android, it implements the Web ML API with NN API. On macOS, it implements it with the MPS API.

The examples let you choose and compare different inference backends, including WASM, WebGL2 and WebML (only available in the browser prototype). In our evaluation, the WebML prototype gets a 6-7X speedup compared to existing Web APIs (the WASM/WebGL2 polyfill) and delivers close-to-native performance.

For example, the screenshot shows the Chromium Web ML POC doing image classification in real time on a MacBook Pro 13":
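
As a rough sketch of what choosing and comparing backends could look like in application code: the backend names below are the ones mentioned in this post, but `pickBackend`, `speedup` and the preference order are assumptions made for illustration and are not taken from the polyfill.

```typescript
// Backend names follow the post; everything else here is hypothetical.
type Backend = "WebML" | "WebGL2" | "WASM";

// Assumed preference order: native-backed first, then GPU, then CPU.
const DEFAULT_PREFERENCE: readonly Backend[] = ["WebML", "WebGL2", "WASM"];

// Returns the first preferred backend that is reported as available,
// or undefined when none of them is.
function pickBackend(
  available: ReadonlySet<Backend>,
  preference: readonly Backend[] = DEFAULT_PREFERENCE,
): Backend | undefined {
  return preference.find((backend) => available.has(backend));
}

// Speedup of a candidate over a baseline, given per-inference latencies
// measured by the caller in milliseconds.
function speedup(baselineMs: number, candidateMs: number): number {
  return baselineMs / candidateMs;
}
```

With this shape, a page running in a browser without the prototype would fall back to WebGL2 or WASM, while the prototype browser would select WebML.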

You may check out the slides for more info about the POC:

Feedback and comments are welcome. :slight_smile:


#11

FYI: as you may already know, TensorFlow.js has been released.


#12

Sounds great! I’m so excited to see that.

"Testing other architectures (e.g. SqueezeNet) and formats (e.g. ONNX and coreml) is in the plan."

I’m looking forward to these features. Thanks!


#13

The SqueezeNet/ONNX examples were added into Web ML API Examples: https://huningxin.github.io/webml-examples/

Feel free to check it out.


#14

Great to see this work happening. One point of feedback: it would make a lot of sense to design the API so that it can be moved off the main thread, since these are computationally heavy operations. (Also, it says WebML, but it is more or less a neural network API. That might be worth addressing when officially proposing it.)

In terms of API surface (especially the operations), I can see why certain limitations were put in place (e.g. missing common parts while having some extremely specific parts), since it inherits those limitations from the Android NNAPI. However, this might not be the best subset when actually trying to move this forward.


#15

Great feedback!

Agreed. Besides offloading computation from the main thread, I can also see interesting usage if the API can be used in a Service Worker, e.g. to serve an image classification request on the device when the network is not available.

You are right. WebML is a broader topic. Our current focus is a neural network API, so WebNN is more accurate.

Inheriting from NNAPI has historical reasons dating from when we started the investigation. As we mentioned, it is just a starting point to prove the concept with real data. Moving forward, I believe the proposal will be a cross-platform NN API backed by multiple native APIs, e.g. Windows/DirectML, iOS/macOS/BNNS/MPS and Android/NNAPI.


#16

I strongly agree with multi-threaded WebNN, i.e. exposing the WebNN API to Workers. However, I have a slight concern about exposing it to Service Workers, because this might allow Service Workers to cause unexpected power consumption when receiving a push notification or performing background sync, for example. Thus, I feel we need to consider security and privacy concerns carefully.


#17

Great work! Also, I’ve been excited about the SSD MobileNet demo.

A couple of comments about API design:

  • It would be desirable that we could check availability of each operation, e.g. checking whether ELU is implemented in the browser or not. (in this case, should I check 'ELU' in nn?)
  • IMHO, heavy use of constant values (enums) does not seem to fit JavaScript style. However, this would be acceptable if we cannot find a better idea.
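
To illustrate the two comments above, here is a small sketch that contrasts checking for an enum-like constant with a string-based query. The `nn` object, its constants and their values, and the functions `supportsConstant` and `supportsOperation` are all hypothetical stand-ins and do not reflect the prototype's actual API.

```typescript
// Hypothetical stand-in for a neural-network context with enum-like constants.
// The names and numeric values are illustrative only.
const nn: Record<string, number> = { RELU: 1, SOFTMAX: 2, CONV_2D: 3 };

// Style A: availability check on a constant, as in `'ELU' in nn`.
// Note that `in` also sees inherited properties such as 'toString'.
function supportsConstant(opName: string): boolean {
  return opName in nn;
}

// Style B: string operation names with an explicit query function,
// which avoids exposing numeric constants to callers.
type OpName = "relu" | "softmax" | "conv2d" | "elu";

const implementedOps: ReadonlySet<OpName> = new Set<OpName>([
  "relu",
  "softmax",
  "conv2d",
]);

function supportsOperation(name: OpName): boolean {
  return implementedOps.has(name);
}
```

In this sketch ELU is deliberately left out of both tables, so `supportsConstant("ELU")` and `supportsOperation("elu")` both report that it is unavailable, and a library could fall back to its own implementation.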