RFC: Proposal for Face detection API


#21

I like the idea of having custom detectors that can be constructed with different options. However, we might not need the Detector base class if we make it a mixin (one less thing to throw on the Window object):

[NoInterfaceObject, Exposed=(Window,Worker)]
interface Detector {
    Promise<sequence<DetectedObject>> detect(ImageBitmapSource image);
    // readonly attribute boolean isAccelerated;
};

// FaceDetectorOptions to control the features and performance
[Constructor(optional FaceDetectorOptions faceDetectorOptions)]
interface FaceDetector {
    // face detector specific attributes and methods
    attribute FaceDetectorOperationMode mode;
    attribute boolean detectLandmarks;
};

FaceDetector implements Detector;

We might be able to do the same with DetectedObject, as it can’t be constructed. We should put constructors on the different types so that developers can build them themselves from various libraries if they want.
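To make the shape concrete, here’s a hypothetical usage sketch of the mixin-based proposal. The class below is a mock standing in for a native implementation; only its surface (constructor options plus detect()) comes from the IDL above, and the option values are invented:

```javascript
// Mock of the proposed FaceDetector; detect() would come from the Detector
// mixin. A real implementation would be provided by the browser.
class FaceDetector {
  constructor({ mode = 'fast', detectLandmarks = false } = {}) {
    this.mode = mode;                       // FaceDetectorOperationMode
    this.detectLandmarks = detectLandmarks;
  }
  // Resolves with an (empty, in this mock) sequence<DetectedObject>.
  async detect(image) {
    return [];
  }
}

const detector = new FaceDetector({ mode: 'accurate', detectLandmarks: true });
detector.detect(null /* an ImageBitmapSource */).then(faces => {
  console.log(`detected ${faces.length} face(s)`);
});
```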


#22

So would that also count for the Sensor object in Generic Sensor API? If so, that feedback would be welcome.

I also think it would be interesting if we could make this API so that it would also make sense for Node.js use-cases, like an IoT device detecting which user is interacting with it.


#23

So would that also count for the Sensor object in Generic Sensor API? If so, that feedback would be welcome.

I don’t know, tbh. We might ping those folks - but I don’t think we need to go there yet. Personally, this doesn’t feel like it falls into Generic Sensor category.

I also think it would be interesting if we could make this API so that it would also make sense for Node.js use-cases, like an IoT device detecting which user is interacting with it.

That’s a “nice to have”, but remember that WebIDL bindings don’t always translate well to Node.js. The current design might work, though, if it remains unbound to something like “navigator”.


#24

Generic Sensor API is being implemented in Chrome now, so it would be the right time to give such feedback.

If we don’t try to make those APIs translate, they never will and there will be a gap between platforms.


#25

I like @marcosc’s twist on @Ningxin_Hu’s second proposal. It might be interesting to offer the possibility of an OptionsDictionary in detect(), alongside the FaceDetectorOptions in the FaceDetector constructor, to distinguish per-detection-event from per-detector configuration parameters (off the top of my head: the minimum size of the shapes to be detected versus the detector definition XML, respectively):

[NoInterfaceObject, Exposed=(Window,Worker)]
interface Detector {
  Promise<sequence<DetectedObject>> detect(ImageBitmapSource image, optional OptionsDictionary options);
};

[Constructor(optional FaceDetectorOptions faceDetectorOptions)]
interface FaceDetector { ... };

FaceDetector implements Detector;
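A sketch of how the two configuration layers might look from JS. The option names (mode, minSize) and the filtering logic are invented for illustration; the mock just shows a per-detector setting fixed at construction versus a per-call option passed to detect():

```javascript
// Mock illustrating per-detector configuration (constructor) versus
// per-detection-event options (detect()). Names are hypothetical.
class FaceDetector {
  constructor({ mode = 'fast' } = {}) {
    this.mode = mode; // fixed for the lifetime of the detector
  }
  async detect(image, { minSize = 0 } = {}) {
    // A native implementation would run the platform detector here; this
    // mock filters a canned candidate list by the per-call minSize option.
    const candidates = [
      { x: 10, y: 10, width: 16, height: 16 },
      { x: 40, y: 40, width: 64, height: 64 },
    ];
    return candidates.filter(box => box.width >= minSize);
  }
}

const detector = new FaceDetector({ mode: 'accurate' });
detector.detect(null, { minSize: 32 }).then(faces => {
  console.log(faces.length); // only the 64x64 candidate survives
});
```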

I would also consider removing // readonly attribute boolean isAccelerated;: a (Face)Detector can be created if and only if it’s accelerated – the amount of code and the computational complexity would prevent us from having a software fallback implementation.

Should I/we try to write a W3C spec of sorts? (Not pegging it to any Working Group or such, necessarily).

I’m not familiar at all with the Generic Sensors API, who could we ask for opinions, @kenchris ?


#26

I like this idea.

It sounds like an interesting idea. However, in an implementation, setting options in detect() might involve expensive operations, e.g. tearing down, reconfiguring, and setting up the HW accelerators.

For iOS implementation, it might need to recreate the CIDetector instance with new options when changing options in detect(). The discussion section mentions:

“A CIDetector object can potentially create and hold a significant amount of resources. Where possible, reuse the same CIDetector instance.”

For the Android implementation, it would also be necessary to recreate instances of android.media.FaceDetector or com.google.android.gms.vision.face.FaceDetector.

For the RealSense SDK for Windows, the PXCFaceModule instance requires developers to retrieve the configuration, modify it, and push it back, which does not look like a cheap operation.
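Given how costly reconfiguration is on all three platforms, an implementation could cache the underlying native instance and rebuild it only when the per-call options actually change. A minimal sketch, with the platform detector mocked out and the class name invented:

```javascript
// Sketch: rebuild the (mocked) platform detector only when the options
// change, since recreating CIDetector / Android FaceDetector instances
// is expensive.
class CachingFaceDetector {
  constructor() {
    this.cachedKey = null;
    this.rebuilds = 0; // counts simulated teardown/setup cycles
  }
  nativeFor(options) {
    const key = JSON.stringify(options);
    if (key !== this.cachedKey) {
      this.cachedKey = key;
      this.rebuilds += 1; // stands in for recreating the HW-backed instance
    }
    return { detect: image => [] }; // mocked platform detector
  }
  async detect(image, options = {}) {
    return this.nativeFor(options).detect(image);
  }
}

(async () => {
  const d = new CachingFaceDetector();
  await d.detect(null, { minSize: 32 });
  await d.detect(null, { minSize: 32 }); // same options: cache hit
  await d.detect(null, { minSize: 64 }); // options changed: rebuild
  console.log(d.rebuilds); // 2
})();
```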

It sounds good to me.

+1. We should do that. @marcosc knows the best practice. Any suggestions?


#27

Should I/we try to write a W3C spec of sorts? (Not pegging it to any Working Group or such, necessarily).

Yeah, let’s kick off an incubation. Created:

If anyone else would like to participate as a contributor, please let me know.


#28

Regarding writing the spec. There is now an index.html file in the repo. Let’s start by sending PRs to that with the IDL and then we can fill in the spec text from there. Happy to help with that.


#29

Some more feedback from Mozilla. Here is some background research people at Mozilla have been doing in this space: https://wiki.mozilla.org/Project_FoxEye

And there is also a Media Capture Worker: https://w3c.github.io/mediacapture-worker/

In particular, it would be good to look at how to leverage the media-capture worker.


#30

mediacapture-worker is basically what came out of project FoxEye.

With that do you need a special detection API? (which is btw how FoxEye also started)

Yes, the algorithm would be in JS, but:

  1. It doesn’t touch the main thread, which all existing implementations have to do in some way or another,

  2. and most importantly, applications don’t have to depend on browser vendors maintaining the detection algorithm (or introducing bugs).


#31

With that do you need a special detection API?

Only if there are significant benefits from HW-accelerated shape detection – as that would, potentially, save us from burning through the battery by doing it in JS.

  2. Applications don’t have to depend on browser vendors maintaining the detection algorithm (or introducing bugs).

Yes, inconsistent results would potentially be painful (but that’s the exploration we should do). A few people have raised similar concerns based on their experience with Web Audio.

If it turns out that the underlying hardware implementations are all garbage, then an asm.js solution (OpenCV) + mediacapture-worker would be sufficient and desirable.


#32

@marcosc, thanks for the WICG repo; I blended the conversations here and the original README.md into it.

Everyone, but in particular @marcosc @Ningxin_Hu, please have a look – it’s quite drafty, so I’d more than welcome PRs :slight_smile:

@pehrsons, IMHO there would be no benefit in offering a Face Detection API in the browser unless it is an accelerated one: there are JS implementations that provide reasonable performance (e.g. https://github.com/mtschirs/js-objectdetect#performance) and it would be in their best interest to make sure they run on a background thread.

Edit: https://wicg.github.io/shape-detection-api is not working yet, but https://rawgit.com/WICG/shape-detection-api/master/index.html is!


#33

Thanks for putting something up. Will try to review in the next few days! Also discussing internally. In the meantime, please do check out the links I sent for the mediacapture-worker.


#34

A couple of things. Re. that proposal, I think it’s a step in the right direction: offer the pixels to JS so that it can do cool stuff with them – stuff that we shouldn’t/can’t ship in the browser (for a variety of reasons). I’ve recently had discussions about offering similar functionality via the WHATWG Streams API (by making a MediaStreamTrack offer a ReadableStreamReader). Two different approaches to the same result. How’s the implementation of mediacapture-worker going?

On the other hand, FoxEye seems to follow the route of, e.g., GStreamer and DirectShow, namely offering a comprehensive media processing filter graph and letting the user/dev build it – FoxEye also seems to mimic (?) OpenVX, but the whole document is unclear about that. If OpenVX is widely available, perhaps surfacing it à la WebGL would present less friction for devs…

The Face detection API proposal is about surfacing hardware capabilities and about making them play nicely with the Web and other APIs, so it is complementary to and compatible with both the mediacapture-worker spec and FoxEye, but its scope is much more restricted.


#35

Thanks for setting this up. I’ve sent two PRs for minor fixes and logged some topics into issues. Please take a look.


#36

Personally, I don’t think this needs its own dedicated API. This feels like a task for a library that sits on top of the platform, not built into it. Especially since you’re discussing adding more objects to it: the knowledge of these objects then needs to live in the browser rather than in a service or library, where the desired objects can be enhanced, modified, or removed altogether. While there are use cases for this, I don’t think face detection is so ubiquitous that it needs to be shipped in the platform.


#37

You beat me to it. I think this would be a prime use case for WASM, which would allow a library (or several) to be competitive in performance with a browser-native implementation.


#38

@gregwhitworth, @surma: it has been mentioned a few times in the discussion, but perhaps it’s not evident in the first writeup, so let me quote myself:

IMHO there would be no benefit in offering a Face Detection API in the browser unless it is an accelerated one: there are JS implementations that provide reasonable performance (e.g. https://github.com/mtschirs/js-objectdetect#performance) and it would be in their best interest to make sure they run on a background thread.

and @marcosc said :

Only if there are significant benefits from HW-accelerated shape detection – as that would, potentially, save us from burning through the battery by doing it in JS.

This API is all about surfacing existing hardware/accelerated capabilities. Perhaps down the line a polyfill might be envisioned, using WASM or other sophisticated techniques, but that’s beyond the scope of this discussion, I’d say.

Update: Added this to the wicg protospec:

It is not the idea here to compare those algorithms, nor to offer a software fallback implementation of any of them for the Web, but to offer the available hardware capabilities, if any, to the Web Applications.


#39

I did notice that one of the goals of this was perf; it seemed like a minor point on my first read, so I appreciate the follow-up. If Mozilla is willing to fund the R&D and determine the scale of the perf benefit of HW-accelerating this vs. doing it in JS, that will at least give us some verifiable data to help bolster your stance. With any native feature there is always the need to consider whether that perf improvement is worth the cost of the extensibility that current JS libs provide.

and it would be in their best interest to make sure they run on a background thread

I haven’t had a chance to look into too many of these libraries – do you know whether they are utilizing web workers for their face detection algos? If not, that seems like the best place to start, IMO, to see if moving this to HWA is even necessary.

Regarding your addition to the spec, what should an OS do that can’t provide HWA for this computation and can only provide software? Does it just fail?
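On the “does it just fail?” question: one plausible shape is that the constructor is simply absent on platforms without acceleration, so apps feature-detect and fall back to a JS library. A sketch under that assumption (the fallback object and its error message are invented; nothing here is in the proposal):

```javascript
// Feature-detect a (hypothetical) accelerated FaceDetector global; fall
// back to a JS library (e.g. js-objectdetect in a worker) when absent.
function createFaceDetector(options = {}) {
  if (typeof globalThis.FaceDetector === 'function') {
    return new globalThis.FaceDetector(options); // native, accelerated
  }
  // Placeholder fallback: a real app would wire up a JS/WASM detector here.
  return {
    accelerated: false,
    detect: async () => {
      throw new Error('no accelerated detector available');
    },
  };
}

const detector = createFaceDetector({ detectLandmarks: true });
console.log(detector.accelerated === false); // true outside a supporting browser
```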


#40

It seems like the title “Shape Detection API” is not explicit about the fact that it only ‘works’ if there is hardware/OS support, and that confuses (some) developers when reading the material for the first time. Do you think we should add the word “accelerated” somewhere, @marcosc, @Ningxin_Hu, @pehrsons, @gregwhitworth?

On a different note, I found out lately that the Samsung Internet browser has a QR code reader that seems to be offered as an extension, but I haven’t found any substantial information.

@gregwhitworth: apologies, I didn’t register your questions :slight_smile: – could you please file issues on the GitHub repo?