RFC: Proposal for Face detection API

miguelao · 2016-08-05

Strawman proposal for a Face (and other objects) detection API: https://github.com/miguelao/face-detection

Photos and images constitute the largest chunk of the Web, and many include recognisable features, such as human faces. Detecting these features is computationally expensive, but would enable interesting use cases, e.g. face tagging or detection of high-saliency areas. Also, users interacting with webcams or other video capture devices have become accustomed to camera-like features such as the ability to focus directly on human faces on the screen of their devices. This is particularly true of mobile devices, where hardware manufacturers have long supported these features. Unfortunately, Web Apps do not yet have access to these hardware capabilities, which makes the use of computationally demanding libraries necessary.

Use cases

Live video feeds would like to identify faces in a picture/video as highly salient areas to e.g. give hints to image or video encoders.
Social network pages would like to quickly identify the human faces in a picture/video and offer the user e.g. the possibility of tagging which name corresponds to which face.
Face detection is the first step before Face Recognition: detected faces are used for the recognition phase, greatly speeding up the process.
Fun! You can map glasses, funny hats and other overlays on top of the detected faces.

Possible future use cases

Hardware vendors provide detectors for other items in widespread use, notably QR codes and text.

Current Workarounds

Some Web Apps -gasp- run Face Detection in JavaScript. A performance comparison of some such libraries can be found at https://github.com/mtschirs/js-objectdetect#performance.

Potential for misuse

Face Detection is an expensive operation due to the algorithmic complexity. Many requests, or demanding systems like a live stream feed with a certain frame rate, could slow down the whole system or greatly increase power consumption.
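
One way a page could limit that cost on a live feed is to cap how often detection runs. The following is only a minimal sketch: makeThrottledDetector and minIntervalMs are hypothetical names, and the detect function is passed in (standing in for the proposed navigator.detectFaces) so the sketch does not depend on the API existing.

```javascript
// Hypothetical helper: wraps a detect function so that it starts at most
// once per minIntervalMs and never overlaps a detection still in flight.
// `detect` stands in for the proposed navigator.detectFaces (an assumption).
function makeThrottledDetector(detect, minIntervalMs, now = Date.now) {
  let lastStart = -Infinity;
  let inFlight = false;

  return function maybeDetect(frame) {
    const t = now();
    if (inFlight || t - lastStart < minIntervalMs) {
      return null; // skipped: too soon, or the previous detection is still running
    }
    lastStart = t;
    inFlight = true;
    // Promise.resolve().then(...) also captures synchronous throws from detect.
    return Promise.resolve()
      .then(() => detect(frame))
      .finally(() => {
        inFlight = false;
      });
  };
}
```

A caller would invoke the returned function on every frame and simply ignore the null results, so skipped frames cost almost nothing.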

Platform specific implementation notes

Mac OS X / iOS

The CoreImage library includes a CIDetector class that provides not only Face Detection, but also detection of QR codes, text and rectangles.

Android

Android provides a standalone FaceDetector class. It also has built-in support for detecting faces on the fly while capturing video or taking photos, as part of the Camera2 API.

Rough sketch of a proposal

typedef (HTMLImageElement or
         HTMLVideoElement or
         HTMLCanvasElement or
         Blob or
         ImageData or
         ImageBitmap) ImageBitmapSource;

partial interface Navigator {
  Promise<sequence<DOMRect>> detectFaces(ImageBitmapSource image);
};

Usage

Simple example

navigator.detectFaces(image).then(boundingBoxes => {
  for (const face of boundingBoxes) {
    console.log(`Face detected at (${face.x}, ${face.y}) with size ${face.width}x${face.height}`);
  }
}).catch(() => {
  console.error("Face detection failed");
});

Notes

Using a particular Face Detector does not preclude using others; in that case the hardware-provided detector can supply seeds or weights for user-defined ones.
Why does Face Detection have such terrible complexity? The most commonly used algorithm is the so-called Viola-Jones, which runs a cascade of classifiers at different sizes and gives a horrendous O(n^4); this video exemplifies how the detection process works.
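
To get a feel for the amount of work involved, here is a rough sketch that counts the candidate windows a naive multi-scale sliding-window scan would visit. It does not implement Viola-Jones, and countWindows and all of its parameters are illustrative assumptions rather than values taken from any real detector.

```javascript
// Counts the square windows a naive multi-scale sliding-window scan would
// evaluate over a width x height image. Parameters are illustrative only.
function countWindows(width, height, minSize, scaleFactor, step) {
  if (scaleFactor <= 1 || step <= 0 || minSize <= 0) {
    throw new RangeError("need scaleFactor > 1, step > 0 and minSize > 0");
  }
  let total = 0;
  const maxSize = Math.min(width, height);
  // Grow the window by scaleFactor until it no longer fits in the image.
  for (let size = minSize; size <= maxSize; size = Math.ceil(size * scaleFactor)) {
    const across = Math.floor((width - size) / step) + 1;
    const down = Math.floor((height - size) / step) + 1;
    total += across * down;
  }
  return total;
}

// Tiny worked case: a 4x4 image, 2x2 windows doubling in size, step 1.
// Nine 2x2 positions plus one 4x4 position.
console.log(countWindows(4, 4, 2, 2, 1)); // 10
```

Every one of those windows then has to be run through the classifier cascade, which is where the cost on large images comes from.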

Open questions

Template knocked off @marcosc's Media Controls API. Thanks!

mkay581 · 2016-08-06

I.Love.This! And making it a promise is awesome.

Just a few things that I’m thinking:

Can this proposal be extended to incorporate video also… in maybe some other method that accepts a range of frames? For instance, it would be nice to throw it a video object and find out which frames or time ranges have which faces.
~~It would be nice to see this concept abstracted a bit more to detect things other than faces–like objects (ie. glasses, hats, etc as you mentioned). But detecting faces is an awesome start!~~ Probably unlikely given that this assumes some sort of registry of all possible strings that represent objects, which seems wayyy too complicated and assuming for such a low-level implementation.
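
For the video idea above, a page could already derive time ranges by sampling frames and running still-image detection on each. The sketch below only covers the bookkeeping step; faceTimeRanges and the { time, faceCount } input shape are hypothetical and not part of the proposal.

```javascript
// Hypothetical helper: given per-frame results [{ time, faceCount }] sorted
// by time, returns the [start, end] ranges in which at least one face was
// detected. The input shape is an assumption made for illustration.
function faceTimeRanges(frames) {
  const ranges = [];
  let start = null;
  let last = null;
  for (const { time, faceCount } of frames) {
    if (faceCount > 0) {
      if (start === null) start = time; // a new range opens here
      last = time;
    } else if (start !== null) {
      ranges.push([start, last]); // the range closed on the previous hit
      start = null;
    }
  }
  if (start !== null) ranges.push([start, last]);
  return ranges;
}
```

A built-in method taking a range of frames could return something similar directly, without the page having to seek and sample the video itself.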

marcosc · 2016-08-08

I’m supportive of this (so, consider a +1 from Mozilla). I’m not a huge fan of throwing it on Navigator, but that’s just bikeshedding. Generally, the shape looks good. We should maybe build a prototype in iOS or Android using a webview?

kenchris · 2016-08-08

I totally agree, I would want this to work for node.js applications as well. For Crosswalk we already experimented with such APIs, so I will make my co-workers aware of this proposal.

kenchris · 2016-08-08

This might be of interest: https://crosswalk-project.github.io/realsense-extensions-crosswalk/spec/face.html

jonathank · 2016-08-08

Speaking with @marcosc, it appears there is low-level support in the OS and hardware for this, which is great.

However, we should gauge how similarly the APIs detect faces.

Which other OSs support this, for example Windows and Linux?

Could we extend this to take shape meta data to work on any sort of shape detection? I realise that the meta data size for this may end up making this a problem.

Should realtime head tracking be something that requires additional auth from the user? (I’m going to say no purely based on the info being there from WebRTC to a server however it may make it more prevalent)

Use cases

Head tracking for video or VR
Presence checking to stop and start video

Possible future use cases

Websites requiring users to stare at an advert for a certain length of time.

Potential for misuse

More sites to require camera to presence check the user

kenchris · 2016-08-08

At least it should be supported on Windows and Linux with the Intel RealSense camera.

marcosc · 2016-08-09

There has been some great feedback on Twitter. The most actionable feedback is to generalize the API to detect shapes instead. So, whatevs.detect("faces", image) or image.detect("other-detectable-thing") or some such.

Ningxin_Hu · 2016-08-09

+1!

I was working on the Crosswalk experimental Face Tracking and Recognition API. I am happy to see this proposal and would like to contribute.

As I understand it, the current API models Android's FaceDetector, which works great on still images. For face detection and tracking on live video capture, we probably need another API that models the Android Camera2 face detector. It would work with MediaStream and target real-time usage.

Rough sketch of Face Detector

[Constructor(MediaStream stream)]
interface FaceDetector : EventTarget {
    readonly        attribute MediaStream       stream;
                    attribute EventHandler      onerror;
                    attribute EventHandler      onfacedetected;
};

[Constructor(DOMString type, FaceDetectionEventInit eventInitDict)]
interface FaceDetectionEvent : Event {
  readonly attribute sequence<Face> faces;
};

dictionary FaceDetectionEventInit : EventInit {
  sequence<Face> faces;
};

Usage

navigator.mediaDevices.getUserMedia(constraints).then(stream => {
  var fd = new FaceDetector(stream);

  fd.onfacedetected = function(event) {
    for (const face of event.faces) {
      console.log(`Face detected at (${face.x}, ${face.y}) with size ${face.width}x${face.height}`);
    }
  }
});

Thoughts?

miguelao · 2016-08-09

From the POV of the algorithms behind the scenes (Haar cascade classifier), all object detection, e.g. faces, letters, QR codes, is a similar task; I even wrote the Possible future use cases section mentioning QR codes (which AFAIK are only available in the iOS, not Mac OS X, AVFoundation library). So, generalizing the API to detect any object type(s) would not be daunting.

In that case, and mimicking IIRC the OpenCV classifier, we might need to extend the result so that detected objects can refer to others as “parents”, i.e. detected eyes or mouths have a face as parent, etc. For completeness, some APIs also have a last parameter, confidence, indicating how certain the detection result is (0.0: none, 1.0: über sure). But I’ve never seen OS/hardware implementations that sophisticated, hence I left it out of the proposal.
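
As a rough illustration of what the parent and confidence ideas could look like to a consumer, here is a sketch that filters flat results by confidence and nests them under their parents. The result shape (id, type, parentId, confidence) and the name groupByParent are assumptions made up for this example; nothing in the proposal defines them.

```javascript
// Hypothetical result shape: { id, type, parentId?, confidence? } where
// confidence is in [0.0, 1.0]. Results without a confidence are treated as
// fully certain. Returns the top-level objects, each with a children array.
function groupByParent(detections, minConfidence = 0) {
  const kept = detections.filter(d => (d.confidence ?? 1) >= minConfidence);
  const byId = new Map(kept.map(d => [d.id, { ...d, children: [] }]));
  const roots = [];
  for (const node of byId.values()) {
    const parent = node.parentId != null ? byId.get(node.parentId) : undefined;
    // An object whose parent is missing or was filtered out becomes a root.
    if (parent) parent.children.push(node);
    else roots.push(node);
  }
  return roots;
}
```

With a flat list like this, the API itself would not need to return a tree; the parent reference alone is enough for a page to rebuild one.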

marcosc · 2016-08-10

Again, we probably want to generalize this to shapes, one of which is “face(s)”. The shapes would be represented by DOMRects, as per the original proposal.

marcosc · 2016-08-10

So, generalizing the API to detect any object type(s) would not be daunting.

What about having some way of checking the supported types by the sub-system?

Detector.canDetect(type)

Or just a Detector.acceleratedTypes or some such? Then the developer can make a choice in their code path about falling back to a library if need be.
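
A sketch of what that code path choice might look like, assuming the hypothetical Detector.canDetect(type) shape floated above. Both the native detector and the library fallback are passed in as plain objects and functions, so none of these names refer to a real API.

```javascript
// `nativeDetector` mimics the hypothetical canDetect(type) / detect(type, image)
// shape from this thread; `libraryDetect` is any JavaScript fallback.
// Both are injected, so nothing here is a real platform API.
function chooseDetect(type, nativeDetector, libraryDetect) {
  const accelerated =
    nativeDetector != null &&
    typeof nativeDetector.canDetect === "function" &&
    nativeDetector.canDetect(type);
  return accelerated
    ? image => nativeDetector.detect(type, image)
    : image => libraryDetect(type, image);
}
```

The decision is made once, up front, so a page could also avoid downloading the fallback library at all when the type is accelerated.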

Ningxin_Hu · 2016-08-10

DOMRect is good to describe the bounding box. And we probably also want other attributes of a Face, e.g. the orientation, the landmark positions, the emotion, the face ID, the recognition ID, etc.

marcosc · 2016-08-10

DOMRect is good to describe the bounding box. And we probably also want other attributes of a Face, e.g. the orientation, the landmark positions, the emotion, the face ID, the recognition ID, etc.

But, we need to check how interoperable all that is. My understanding is that these APIs are pretty dumb at the HW/OS platform level? Can we get that info across multiple OSs?

We should only provide primitives, on which developers can build more sophisticated stuff.

Ningxin_Hu · 2016-08-10

I think the HW/OS platform capability is catching up. Here is a quick check of the face detection features:

Mac OS X / iOS

CIFaceFeature: face bounds, angle, landmarks (left eye, right eye, mouth), expression (smile, eye close), tracking ID (for video)

Android

FaceDetector.Face: face pose, left eye and right eye position (with eyes distance and middle point)
com.google.android.gms.vision.face.Face: face bounds (with position, width and height), rotation, landmarks (mouth, cheek, ear, eye, nose), expression (eye close, smile), face ID

RealSense SDK for Windows

Face Tracking and Recognition: bounds, pose, landmarks (77 points of eye, eyebrow, nose, mouth and cheek), expression (smile, eye close and eye/eyebrow/mouth movements), face ID

marcosc · 2016-08-10

Awesome, Ningxin_Hu! thanks for those links. Ok, we might indeed be able to do something a bit more sophisticated.

miguelao · 2016-08-10

I’m totally +1 for making this a Shape Detector API, will update the GitHub, or I’m happy to review PRs
I wanted to keep this API centered on Face Detection on still images as opposed to the (harder?) problem of Face Tracking on live streams, or the tangential problem of Face Recognition; I thought that would ease its implementation – we could just bundle the two in a single spec with capabilities for both still and live images, SG?

miguelao · 2016-08-10

My previous experiences with OpenCV and the “extra” results (e.g. face orientation) were somewhat disappointing, but that should not prevent us/the API from offering them if they’re available, so it SGTM.

Condensing the previous comments (@Ningxin_Hu, @marcosc) and for the still image case, we could have an API along the lines of:

navigator.detectShapes('face', image).then(detectedShapes => {

}).catch ...

enum ShapeType {
    "face",
    "qr",
    "letter",
    // etc ...
};

interface DetectedShape {
    readonly attribute DOMRect location;
    readonly attribute ShapeType type;
    readonly attribute ExtrasDictionary extras; // e.g. {'confidence' : 0.9} etc
};

@marcosc instead of having an enumerator function detailing what is supported (i.e. “face” etc), I thought it better to reject the detectShapes() Promise for unsupported types; would that be enough…?
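
For comparison, this is roughly what the rejection-based approach would look like from the page's side. It is only a sketch: nativeDetect stands in for the proposed navigator.detectShapes, which is assumed to reject for unsupported types, and detectWithFallback and libraryDetect are hypothetical names.

```javascript
// `nativeDetect` stands in for the proposed navigator.detectShapes, assumed
// to reject its Promise for unsupported shape types; `libraryDetect` is a
// hypothetical JavaScript fallback. Both are injected for illustration.
async function detectWithFallback(type, image, nativeDetect, libraryDetect) {
  try {
    return await nativeDetect(type, image);
  } catch (err) {
    // A bare rejection cannot tell "type unsupported" apart from
    // "detection failed", so any error triggers the fallback here.
    return libraryDetect(type, image);
  }
}
```

The trade-off is that the page only learns a type is unsupported after attempting a detection, whereas an up-front check lets it decide before doing any work.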

Ningxin_Hu · 2016-08-11

SGTM. Let’s do it step by step.

Ningxin_Hu · 2016-08-11

To detect any object type(s), how about introducing the base Detector and DetectedObject interfaces and inheriting them for concrete types? e.g.:

interface Detector {
    Promise<sequence<DetectedObject>> detect(ImageBitmapSource image);
    // readonly attribute boolean isAccelerated;
};

interface DetectedObject {
    readonly attribute DOMRect boundingBox;
};

// FaceDetectorOptions to control the features and performance
[Constructor(optional FaceDetectorOptions faceDetectorOptions)]
interface FaceDetector : Detector {
    // face detector specific attributes and methods
    attribute FaceDetectorOperationMode mode;
    attribute boolean detectLandmarks;
};

// BarcodeDetector, TextDetector ...

interface DetectedFace : DetectedObject {
    readonly attribute long id;
    readonly attribute sequence<Landmark>? landmarks;
    // other possible attributes
};

// DetectedBarcode, DetectedText ...

The feature detection would be:

if (typeof FaceDetector === 'function') {
    // detect faces ...
}
        
if ('BarcodeDetector' in window) {
    // detect barcode ...
}
        
if (window.TextDetector) {
    // detect text ...
}

Sample of face detection would be:

if (typeof FaceDetector === 'function') {
    let faceDetector = new FaceDetector({mode: 'fast', detectLandmarks: false});
    faceDetector.detect(image).then(detectedFaces => {
        for (const face of detectedFaces) {
            console.log(`Face ${face.id} detected at (${face.boundingBox.x}, ${face.boundingBox.y}),` +
                        ` size ${face.boundingBox.width}x${face.boundingBox.height}`);
        }
    }).catch(() => {
        console.error('Face detection failed');
    });
}

Thoughts?