Photos and images constitute the largest chunk of the Web, and many include recognisable features, such as human faces. Detecting these features is computationally expensive, but would enable interesting use cases, e.g. face tagging or detection of high-saliency areas. Also, users interacting with webcams or other video capture devices have become accustomed to camera-like features such as the ability to focus directly on human faces on the screen of their devices. This is particularly true for mobile devices, where hardware manufacturers have long supported these features. Unfortunately, Web Apps do not yet have access to these hardware capabilities, which makes the use of computationally demanding libraries necessary.
Use cases
Live video feeds would like to identify faces in a picture/video as highly salient areas, to e.g. give hints to image or video encoders.
Social network pages would like to quickly identify the human faces in a picture/video and offer the user e.g. the possibility of tagging which name corresponds to which face.
Face detection is the first step before face recognition: detected faces are used for the recognition phase, greatly speeding up the process.
Fun! You can map glasses, funny hats and other overlays on top of the detected faces.
Possible future use cases
Hardware vendors provide detectors for other items in widespread use, notably QR codes and text.
Face detection is an expensive operation due to its algorithmic complexity. Many requests, or demanding systems such as a live video feed with a certain frame rate, could slow down the whole system or greatly increase power consumption.
Platform specific implementation notes
Mac OS X / iOS
The CoreImage library includes a CIDetector class that provides not only face detection, but also QR code, text and rectangle detection.
Android provides a standalone FaceDetector class. It also has built-in support for detecting faces on the fly while capturing video or taking photos, as part of the Camera2 API.
typedef (HTMLImageElement or
HTMLVideoElement or
HTMLCanvasElement or
Blob or
ImageData or
ImageBitmap) ImageBitmapSource;
partial interface Navigator {
  Promise<sequence<DOMRect>> detectFaces(ImageBitmapSource image);
};
Usage
Simple example
navigator.detectFaces(image).then(boundingBoxes => {
  for (const face of boundingBoxes) {
    console.log(`Face detected at (${face.x}, ${face.y}) with size ${face.width}x${face.height}`);
  }
}).catch(err => {
  console.error("Face detection failed:", err);
});
Notes
Using a particular face detector does not preclude using others; the hardware-provided one can, for instance, provide seeds or weights for user-defined detectors.
Why does face detection have such terrible complexity? The typical (and best-known) algorithm is the so-called Viola-Jones, which slides a cascade of classifiers of different sizes across the image and gives a horrendous O(n^4) - this video exemplifies how the detection process works.
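The cascade idea is that each stage is a cheap test, and a candidate window is rejected as soon as any stage fails, so the expensive later stages run only on the few windows that survive. A toy sketch of that early-rejection structure (not Viola-Jones itself; the stage predicates below are invented placeholders, not real Haar features):

```javascript
// Toy sketch of a classifier cascade: each window is tested by a sequence
// of increasingly strict stages, and rejected as soon as any stage fails.
// The stage predicates are invented placeholders, not real Haar features.
function cascadeDetect(window, stages) {
  for (const stage of stages) {
    if (!stage(window)) return false; // early rejection keeps most windows cheap
  }
  return true; // survived every stage: candidate detection
}

// Three hypothetical stages, cheapest first.
const stages = [
  w => w.meanBrightness > 0.2,
  w => w.edgeDensity > 0.1,
  w => w.symmetryScore > 0.5,
];

console.log(cascadeDetect({ meanBrightness: 0.5, edgeDensity: 0.3, symmetryScore: 0.7 }, stages)); // true
console.log(cascadeDetect({ meanBrightness: 0.1, edgeDensity: 0.3, symmetryScore: 0.7 }, stages)); // false
```

The cost comes from running such a cascade over every window position at every scale, which is what drives the complexity mentioned above.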
Can this proposal be extended to also incorporate video, maybe via some other method that accepts a range of frames? For instance, it would be nice to throw it a video object and find out which frames or time ranges have which faces.
It would be nice to see this concept abstracted a bit more to detect things other than faces – like objects (i.e. glasses, hats, etc. as you mentioned). But detecting faces is an awesome start! Probably unlikely, given that this assumes some sort of registry of all possible strings that represent objects, which seems way too complicated and presumptuous for such a low-level implementation.
I’m supportive of this (so, consider it a +1 from Mozilla). I’m not a huge fan of throwing it on Navigator, but that’s just bikeshedding. Generally, the shape looks good. Should we maybe build a prototype on iOS or Android using a WebView?
I totally agree; I would want this to work for Node.js applications as well. For Crosswalk we already experimented with such APIs, so I will make my co-workers aware of this proposal.
Speaking with @marcosc, it appears there is low-level support in the OS and hardware for this, which is great.
However, we should gauge how similarly the APIs detect faces.
Which other OSes have support for this? For example, do Windows and Linux support it?
Could we extend this to take shape metadata, to work on any sort of shape detection? I realise that the metadata size may end up making this a problem.
Should realtime head tracking be something that requires additional auth from the user? (I’m going to say no, purely because the info is already available via WebRTC to a server; however, it may make it more prevalent.)
Use cases
Head tracking for video or VR
Presence checking to stop and start video
Possible future use cases
Websites requiring users to stare at an advert for a certain length of time.
Potential for misuse
More sites requiring camera access to presence-check the user
There has been some great feedback on Twitter. The most actionable feedback is to generalize the API to detect shapes instead: so, whatevs.detect("faces", image) or image.detect("other-detectable-thing") or some such.
I was working on the Crosswalk experimental Face Tracking and Recognition API. I am happy to see this proposal and would like to contribute.
As I understand it, the current API models Android's FaceDetector, which works great on still images. For face detection and tracking on live video capture, we probably need another API that models the Android Camera2 face detector. It works with MediaStream and targets real-time usage.
navigator.mediaDevices.getUserMedia(constraints).then(stream => {
  const fd = new FaceDetector(stream);
  fd.onfacedetected = event => {
    for (const face of event.faces) {
      console.log(`Face detected at (${face.x}, ${face.y}) with size ${face.width}x${face.height}`);
    }
  };
});
From the POV of the algorithms behind the scenes (Haar cascade classifiers), all object detection, e.g. faces, letters, QR codes, is a similar task; I even wrote the Possible future use cases section mentioning QR codes (which, AFAIK, are only available in iOS's AVFoundation library, not Mac OS X's). So, generalizing the API to detect any object type(s) would not be daunting.
In that case, and mimicking (IIRC) the OpenCV classifier, we might need to extend the result so that detected objects can refer to others as “parents”, i.e. detected eyes or mouths have a face as their parent, etc. For completeness, some APIs also have a last parameter, confidence, indicating how certain the detection result is (0.0 = not at all, 1.0 = totally sure). But I’ve never seen OS/hardware implementations that sophisticated, hence I left it out of the proposal.
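A purely illustrative sketch of what such an extended result could look like, with plain objects standing in for DOMRect-based entries (the parent and confidence fields are assumptions from this comment, not any existing API):

```javascript
// Hypothetical extended detection results: each entry carries a bounding box,
// a confidence in [0.0, 1.0], and an optional reference to a parent detection.
const face = { x: 10, y: 20, width: 100, height: 100, confidence: 0.92, parent: null };
const leftEye = { x: 30, y: 45, width: 15, height: 8, confidence: 0.81, parent: face };
const mouth = { x: 40, y: 90, width: 30, height: 12, confidence: 0.77, parent: face };

// A consumer could walk up the hierarchy to find the top-level detection.
function topLevel(detection) {
  return detection.parent ? topLevel(detection.parent) : detection;
}

console.log(topLevel(leftEye) === face); // true
```

This keeps the flat sequence-of-rects result shape from the proposal while letting sub-detections point at their enclosing detection.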
Again, we probably want to generalize this to shapes, one of which is “face(s)”. The shapes would be represented by DOMRects, as per the original proposal.
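Sketching that generalization (the detectShapes name and the "face" type token are assumptions from this thread, not a settled API; a stub stands in for the platform detector so the snippet is self-contained):

```javascript
// Stub standing in for the proposed navigator.detectShapes(type, image):
// resolves to an array of DOMRect-like bounding boxes for the requested type.
const shapeDetector = {
  detectShapes(type, image) {
    if (type !== "face") {
      return Promise.reject(new Error(`Unsupported shape type: ${type}`));
    }
    // Fake result; a real implementation would ask the platform detector.
    return Promise.resolve([{ x: 0, y: 0, width: 64, height: 64 }]);
  },
};

shapeDetector.detectShapes("face", null).then(boxes => {
  for (const box of boxes) {
    console.log(`face at (${box.x}, ${box.y}), size ${box.width}x${box.height}`);
  }
});
```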
DOMRect is good for describing the bounding box. And we probably also want other attributes of a face, e.g. the orientation, the landmark positions, the emotion, the face ID, the recognition ID, etc.
But, we need to check how interoperable all that is. My understanding is that these APIs are pretty dumb at the HW/OS platform level? Can we get that info across multiple OSs?
We should only provide primitives, on which developers can build more sophisticated stuff.
I think the HW/OS platform capability is catching up. Here is a quick survey of the face detection features:
Mac OS X / iOS
CIFaceFeature: face bounds, angle, landmarks (left eye, right eye, mouth), expression (smile, eye close), tracking ID (for video)
Android
FaceDetector.Face: face pose, left eye and right eye position (with eyes distance and middle point)
com.google.android.gms.vision.face.Face: face bounds (with position, width and height), rotation, landmarks (mouth, cheek, ear, eye, nose), expression (eye close, smile), face ID
RealSense SDK for Windows
Face Tracking and Recognition: bounds, pose, landmarks (77 points of eye, eyebrow, nose, mouth and cheek), expression (smile, eye close and eye/eyebrow/mouth movements), face ID
I’m totally +1 for making this a Shape Detector API; I will update the GitHub repo, or I’m happy to review PRs.
I wanted to keep this API centered on face detection on still images, as opposed to the (harder?) problem of face tracking on live streams, or the tangential problem of face recognition; I thought that would ease its implementation. We could just bundle the two in a single spec with capabilities for both still and live images, SG?
My previous experiences with OpenCV and the “extra” results (e.g. face orientation) were somewhat disappointing, but that should not prevent us/the API from offering them if they’re available, so it SGTM.
Condensing the previous comments (@Ningxin_Hu, @marcosc), and for the still image case, we could have an API working along the lines of:
@marcosc: instead of having an enumerator function detailing what is supported (i.e. “face”, etc.), I thought it better to reject the detectShapes() Promise; would that be enough…?
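Rejection-based capability probing could then look like the sketch below (detectShapes is again a stub standing in for the proposed method, and the supported-type list is invented):

```javascript
// Stub for the proposed detectShapes(); only "face" is supported here.
function detectShapes(type, image) {
  const supported = ["face"]; // hypothetical platform capability list
  return supported.includes(type)
    ? Promise.resolve([]) // empty result: supported, nothing detected
    : Promise.reject(new Error(`NotSupportedError: ${type}`));
}

// Probe support by catching the rejection, instead of an enumerator function.
async function supportsShapeType(type) {
  try {
    await detectShapes(type, null); // null image: we only care about support
    return true;
  } catch (e) {
    return false;
  }
}

supportsShapeType("face").then(ok => console.log(ok)); // true
supportsShapeType("qr-code").then(ok => console.log(ok)); // false
```

One drawback of this style is that callers cannot distinguish "unsupported type" from other failures without inspecting the rejection reason, which may argue for a dedicated error name.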