September 2019
tl;dr: One of the most widely used methods for face detection and face landmark regression.
Overall impression
The paper seems rather primitive compared to general object detection frameworks like Faster R-CNN. MTCNN is closer to the original R-CNN method, in that later stages operate on patches cropped from the image.
However, it is also enlightening that a very shallow CNN (O-Net) applied to cropped image patches can regress landmarks accurately. Landmark regression given an object bbox may not require a very large receptive field anyway.
The paper is largely inspired by Hua Gang’s paper cascnn: A Convolutional Neural Network Cascade for Face Detection.
Key ideas
- Three stages
  - P-Net: proposal network on 12x12 input size
  - R-Net: false positive (FP) reduction on 24x24 input size
  - O-Net: landmark regression on 48x48 input size
- P-Net is trained on 12x12 patches but deployed fully convolutionally for detection (equivalently, in a sliding window fashion).
- R-Net input is obtained from the output of P-Net
- O-Net input is obtained from the output of R-Net
- Multiple datasets are used, each supervising different tasks
- Losses are weighted and masked differently in different stages
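
To make the "trained on patches, deployed convolutionally" point concrete, here is a minimal sketch of how one P-Net output cell could be mapped back to a 12x12 window in the original image. The stride of 2, the minimum face size of 20 px, and the pyramid factor of 0.709 are assumptions (values commonly seen in public implementations, not stated in these notes), and `pyramid_scales` and `cell_to_box` are hypothetical helper names.

```python
from typing import List, Tuple

CELL = 12    # P-Net input patch size (from the notes above)
STRIDE = 2   # ASSUMED effective stride of P-Net; not stated in these notes


def pyramid_scales(height: int, width: int, min_face: int = 20,
                   factor: float = 0.709) -> List[float]:
    """Scales at which the image is resized before running P-Net.

    The first scale shrinks a `min_face`-pixel face to CELL pixels; each
    further scale shrinks the image by `factor` until its shorter side
    would be smaller than CELL. `min_face` and `factor` are assumed defaults.
    """
    scale = CELL / min_face
    min_side = min(height, width) * scale
    scales = []
    while min_side >= CELL:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales


def cell_to_box(row: int, col: int, scale: float) -> Tuple[int, int, int, int]:
    """Map a P-Net output cell (row, col) at a given pyramid scale back to
    an (x1, y1, x2, y2) window in original-image pixels."""
    x1 = int(round(col * STRIDE / scale))
    y1 = int(round(row * STRIDE / scale))
    x2 = int(round((col * STRIDE + CELL) / scale))
    y2 = int(round((row * STRIDE + CELL) / scale))
    return x1, y1, x2, y2
```

At scale 1.0 the cell at (0, 0) covers exactly the top-left 12x12 patch; at smaller scales the same cell covers a proportionally larger window, which is how a fixed 12x12 network can propose faces of different sizes.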
Technical details
- Not a single model, but training can be done jointly.
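
The per-stage weighting and masking of the loss mentioned under Key ideas can be sketched roughly as below. The weight values and the sample-type-to-task mapping are assumptions recalled from the paper and should be checked against it; `TASK_WEIGHTS`, `TASK_MASK`, and `sample_loss` are hypothetical names used only for illustration.

```python
from typing import Dict

# ASSUMED task weights per stage (recalled from the paper; verify before use).
TASK_WEIGHTS: Dict[str, Dict[str, float]] = {
    "pnet": {"det": 1.0, "box": 0.5, "landmark": 0.5},
    "rnet": {"det": 1.0, "box": 0.5, "landmark": 0.5},
    "onet": {"det": 1.0, "box": 0.5, "landmark": 1.0},
}

# ASSUMED mask: which tasks each kind of training sample contributes to.
TASK_MASK = {
    "negative": {"det"},          # background patches: classification only
    "positive": {"det", "box"},   # faces: classification + bbox regression
    "part": {"box"},              # partial faces: bbox regression only
    "landmark": {"landmark"},     # landmark-annotated faces: landmarks only
}


def sample_loss(stage: str, sample_type: str,
                task_losses: Dict[str, float]) -> float:
    """Weighted sum of the per-task losses that are active for this sample.

    Tasks outside the sample's mask are skipped entirely, so `task_losses`
    only needs entries for the active tasks.
    """
    weights = TASK_WEIGHTS[stage]
    active = TASK_MASK[sample_type]
    return sum(weights[t] * task_losses[t] for t in weights if t in active)
```

For example, under these assumed weights a landmark sample with landmark loss 3.0 contributes 1.5 in P-Net but 3.0 in O-Net, reflecting the heavier emphasis on landmarks in the last stage.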
Notes
- new implementation in TF and original version
- Cropping an image patch and resizing it is essentially RoIAlign applied to the original image. Maybe we can save some computation by cropping the feature map from the first stage instead. The features from the fourth stage are most likely already too spatially blurred to contain any localization info.
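
As a rough illustration of the crop-and-resize observation above, the sketch below crops a box from a grayscale image and resizes it with one bilinear sample per output pixel, in the spirit of RoIAlign (which usually averages several samples per bin). It is pure Python for clarity rather than speed; `bilinear` and `crop_and_resize` are hypothetical names, and the box convention (continuous coordinates, pixel `i` covering `[i, i+1)`) is an assumption.

```python
from typing import List, Sequence, Tuple

Image = Sequence[Sequence[float]]  # grayscale image as rows of pixel values


def bilinear(img: Image, y: float, x: float) -> float:
    """Sample `img` at fractional index (y, x), clamping to the border."""
    h, w = len(img), len(img[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
    bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def crop_and_resize(img: Image, box: Tuple[float, float, float, float],
                    out_size: int) -> List[List[float]]:
    """Crop `box` = (x1, y1, x2, y2) and resize it to out_size x out_size.

    Each output pixel takes one bilinear sample at the center of its bin;
    the -0.5 converts from continuous coordinates to pixel-center indices.
    """
    x1, y1, x2, y2 = box
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    return [
        [bilinear(img, y1 + (i + 0.5) * bin_h - 0.5,
                  x1 + (j + 0.5) * bin_w - 0.5)
         for j in range(out_size)]
        for i in range(out_size)
    ]
```

The same function would apply unchanged to a single-channel feature map instead of the image, which is the computation-saving variant speculated about above; the box would only need to be divided by the feature map's stride first.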