Learning-AI

DeFCN: End-to-End Object Detection with Fully Convolutional Network

December 2020

tl;dr: End to end object detection with one-to-one label assignment.

Overall impression

The study builds on FCOS. DeFCN points out that one-to-many label assignment is what makes NMS necessary, so a good one-to-one assignment policy is the key. Inspired by MultiBox and DETR, the paper uses bipartite matching between predictions and GT, with a prediction-aware matching cost, so that the network can learn a better assignment policy.

DeFCN and OneNet:

DeFCN shows that a hand-crafted one-to-one label assignment already yields OK-ish performance (10% relative drop in KPI). OneNet likewise reports that a predefined location cost plus classification yields an OK baseline.
Both DeFCN and OneNet adopt a bbox formulation consisting of a point inside the GT bbox plus 4 distances to the edges. This handles eccentric objects and objects whose center is occluded.
OneNet seems to perform worse than DeFCN.
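
The point + 4 distances formulation mentioned above can be illustrated with a minimal sketch. The function names `encode_box` and `decode_box` and the (left, top, right, bottom) ordering are assumptions made for this example, not taken from either paper's code.

```python
def decode_box(point, distances):
    """Turn a point (x, y) and distances (left, top, right, bottom)
    into a box (x1, y1, x2, y2)."""
    x, y = point
    left, top, right, bottom = distances
    return (x - left, y - top, x + right, y + bottom)


def encode_box(point, box):
    """Inverse of decode_box: distances from a point inside `box`
    to its four edges."""
    x, y = point
    x1, y1, x2, y2 = box
    return (x - x1, y - y1, x2 - x, y2 - y)


# An off-center point (e.g. when the center is occluded) still
# describes exactly the same box.
box = (10.0, 20.0, 50.0, 100.0)
off_center = (15.0, 90.0)
assert decode_box(off_center, encode_box(off_center, box)) == box
```

Because any point inside the GT box can regress the full box, the assignment is free to pick a location other than the geometric center.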

Key ideas

One-to-one label assignment is key.
- One-to-one based on center or anchor is already OK.
- Using the foreground loss as the matching cost (as in DETR) improves KPI.
- The modified POTO (prediction-aware one-to-one) cost for matching is even better: the foreground loss (cls+reg) carries loss weights, which may not be optimal for bipartite matching.
- The matching cost does not need to be differentiable, so in theory mAP could be used as the cost (see the review on Zhihu).
POTO matching cost
- Spatial priors help (the center of a prediction matched to a GT cannot be outside of the GT box)
- Balances IoU and classification (by multiplication, which works better than summation)
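
The POTO cost described above can be sketched as follows. This is a minimal illustration, not the authors' code: the exponent form `score^(1 - alpha) * IoU^alpha`, the default `alpha=0.8`, and all names (`quality`, `match_one_to_one`, the dict layout of predictions and GT) are assumptions made for the example. The brute-force search stands in for the Hungarian algorithm that a real implementation would use, and it assumes there are at least as many predictions as GT boxes.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def inside(point, box):
    """Spatial prior: the prediction's location must lie inside the GT box."""
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]


def quality(pred, gt, alpha=0.8):
    """Multiplicative quality (assumed form): prior * score^(1-alpha) * IoU^alpha."""
    if not inside(pred["point"], gt["box"]):
        return 0.0
    score = pred["scores"][gt["label"]]
    return score ** (1.0 - alpha) * iou(pred["box"], gt["box"]) ** alpha


def match_one_to_one(preds, gts, alpha=0.8):
    """Brute-force bipartite matching that maximizes total quality.

    Returns a list with the matched prediction index for each GT.
    Only practical for tiny inputs; shown for clarity.
    """
    best, best_total = None, -1.0
    for perm in permutations(range(len(preds)), len(gts)):
        total = sum(quality(preds[i], gts[j], alpha) for j, i in enumerate(perm))
        if total > best_total:
            best, best_total = perm, total
    return list(best) if best is not None else []


# Two near-duplicate predictions on the first object, one on the second.
preds = [
    {"point": (20, 20), "box": (10, 10, 30, 30), "scores": [0.9]},
    {"point": (22, 20), "box": (11, 10, 31, 30), "scores": [0.8]},
    {"point": (70, 70), "box": (60, 60, 80, 80), "scores": [0.6]},
]
gts = [
    {"box": (10, 10, 30, 30), "label": 0},
    {"box": (60, 60, 80, 80), "label": 0},
]
matches = match_one_to_one(preds, gts)
```

In this toy case each GT gets exactly one positive sample and the near-duplicate prediction is left as background, which is the behavior that removes the need for NMS.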
3D Max Filtering (3DMF)
- CenterNet uses 2D max filtering to replace NMS
- Duplicate predictions mostly come from the spatial neighborhood of the most confident prediction and from neighboring scales, since objects whose size is on the border of a stage may be assigned to a neighboring stage of the FPN.
- 3DMF is a module to perform 3D max pooling to provide sharper response. It is used as a differentiable post-processing step inside the network.
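
The max-filtering idea behind 3DMF can be sketched in plain Python. This is only an illustration of the peak-keeping principle, extended from CenterNet-style 2D filtering to a (scale, y, x) window. It assumes the FPN levels have already been resampled to a common H x W, and it is not the learned, differentiable module from the paper. The names `max_filter_3d` and `keep_peaks` and the kernel sizes are hypothetical.

```python
def max_filter_3d(volume, k_scale=3, k_space=3):
    """Max over a (scale, y, x) window around every cell.

    `volume` is a nested list indexed as volume[scale][y][x].
    Windows are clipped at the borders.
    """
    S, H, W = len(volume), len(volume[0]), len(volume[0][0])
    rs, rp = k_scale // 2, k_space // 2
    out = [[[0.0] * W for _ in range(H)] for _ in range(S)]
    for s in range(S):
        for y in range(H):
            for x in range(W):
                out[s][y][x] = max(
                    volume[ss][yy][xx]
                    for ss in range(max(0, s - rs), min(S, s + rs + 1))
                    for yy in range(max(0, y - rp), min(H, y + rp + 1))
                    for xx in range(max(0, x - rp), min(W, x + rp + 1))
                )
    return out


def keep_peaks(volume, **kwargs):
    """Zero every score that is not the maximum of its 3D neighborhood."""
    filtered = max_filter_3d(volume, **kwargs)
    return [
        [[v if v == m else 0.0 for v, m in zip(row_v, row_m)]
         for row_v, row_m in zip(plane_v, plane_m)]
        for plane_v, plane_m in zip(volume, filtered)
    ]


# Two scales of 3x3 scores: a strong peak, a spatial duplicate next to it,
# and a duplicate at the same location on the neighboring scale.
scores = [
    [[0.0, 0.0, 0.0], [0.0, 0.9, 0.7], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.8, 0.0], [0.0, 0.0, 0.0]],
]
peaks = keep_peaks(scores)  # only the 0.9 response survives
```

A 2D filter applied per scale would keep the 0.8 response on the second scale; filtering across scales is what suppresses it.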
Auxiliary loss to speed up convergence.

Technical details

With POTO and 3DMF, the scores of duplicate samples are significantly suppressed.
On CrowdHuman, the recall is even higher than the theoretical upper limit of NMS-based methods (obtained by applying NMS to the GT boxes).
MultiBox is the first paper to propose bipartite matching between pred and GT, way earlier than DETR.
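
The "NMS on GT" upper limit mentioned for CrowdHuman can be made concrete with a small sketch: treat every GT box as a perfect detection, run NMS, and count how many survive. The IoU threshold of 0.6 and the function names are assumptions for illustration, not values taken from the paper.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.6):
    """Greedy NMS: returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


def recall_upper_bound(gt_boxes, iou_thresh=0.6):
    """Fraction of GT boxes that survive NMS when every GT is a
    perfect detection with equal score."""
    if not gt_boxes:
        return 1.0
    keep = nms(gt_boxes, [1.0] * len(gt_boxes), iou_thresh)
    return len(keep) / len(gt_boxes)


# Two heavily overlapping people plus one isolated person:
# NMS removes one of the overlapping pair, so recall is capped at 2/3.
crowd = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
bound = recall_upper_bound(crowd)
```

An NMS-free detector is not subject to this cap, which is how its recall in crowded scenes can exceed it.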

Notes

Review on Zhihu by 1st author
- About spatial prior in matching cost
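With a reasonable α, the spatial prior is not strictly required, but it helps exclude bad regions during matching and improves absolute performance. The authors use center sampling with radius=1.5 in the COCO experiments and inside gt box in the CrowdHuman experiments. The reason is simple: occlusion in CrowdHuman is so severe that the center region is often completely occluded.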