December 2020
tl;dr: End-to-end object detection with one-to-one label assignment.
Overall impression
The study builds on FCOS. DeFCN points out that one-to-many label assignment is what makes NMS necessary, so a good one-to-one assignment policy is the key. The paper is also inspired by MultiBox and DETR, and uses bipartite matching between predictions and GT so that the network can learn a better assignment policy.
DeFCN and OneNet:
- DeFCN shows that a hand-crafted one-to-one label assignment already yields OK-ish performance (10% relative drop in KPI). OneNet also mentions that a predefined location cost + classification yields an OK baseline.
- Both DeFCN and OneNet adopt a bbox formulation consisting of a point inside the GT bbox + 4 distances to the edges. This handles eccentric objects and objects whose center is occluded.
- OneNet seems to perform worse than DeFCN.
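The point + 4 distances formulation can be sketched in a few lines. This is a minimal illustration, not code from either paper; the function names `encode_ltrb` and `decode_ltrb` are hypothetical, and boxes are assumed to be `(x1, y1, x2, y2)` in pixels.

```python
def encode_ltrb(point, box):
    """Distances from a point inside the box to its left, top, right, bottom edges."""
    x, y = point
    x1, y1, x2, y2 = box
    return (x - x1, y - y1, x2 - x, y2 - y)


def decode_ltrb(point, distances):
    """Recover (x1, y1, x2, y2) from a point and its four edge distances."""
    x, y = point
    left, top, right, bottom = distances
    return (x - left, y - top, x + right, y + bottom)


# The point does not have to be the box center: any point inside the box
# can regress the same box, which is what helps with occluded centers.
box = (0, 0, 100, 50)
off_center_point = (10, 40)
distances = encode_ltrb(off_center_point, box)
recovered = decode_ltrb(off_center_point, distances)
```

Because the encoding round-trips for any interior point, the assignment policy is free to pick whichever location inside the GT box predicts best.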
Key ideas
- One-to-one label assignment is key.
- One-to-one based on center or anchor is already OK.
- Using the foreground loss as the matching cost (as in DETR) improves KPI.
- The POTO (prediction-aware one-to-one) matching cost is better still: the foreground loss (cls + reg) is a weighted sum, and those weights may not be optimal for bipartite matching.
- The matching cost does not need to be differentiable, so in theory mAP itself could be used as the cost (see the review on Zhihu).
- POTO matching cost
  - Spatial priors help (the center of a prediction matched to a GT cannot be outside of the GT box).
  - IoU and classification score are balanced by multiplication, which works better than summation.
- 3D Max Filtering (3DMF)
  - CenterNet uses 2D max filtering to replace NMS.
  - Duplicate predictions mostly come from spatial regions near the most confident prediction and from neighboring scales, since objects whose size falls on the border of an FPN stage may be assigned to a neighboring stage.
  - 3DMF is a module that performs 3D max pooling to provide a sharper response. It is used as a differentiable post-processing step inside the network.
- Auxiliary loss to speed up convergence.
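The POTO matching idea above can be sketched as follows: a quality score that multiplies a spatial prior, the classification score, and the IoU, followed by a one-to-one assignment that maximizes total quality. This is a rough sketch under stated assumptions. The exponent form `score ** (1 - alpha) * iou ** alpha` and the value `alpha=0.8` are my reading of the paper rather than something stated in these notes, the spatial prior is the simple "point inside GT box" variant, and a brute-force search over assignments stands in for the Hungarian algorithm so the example stays dependency-free. All function names are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def poto_quality(score, pred_box, gt_box, point, alpha=0.8):
    """Spatial prior * score^(1 - alpha) * IoU^alpha (alpha is an assumed value)."""
    x, y = point
    inside = gt_box[0] <= x <= gt_box[2] and gt_box[1] <= y <= gt_box[3]
    if not inside:
        return 0.0  # the spatial prior rules this location out
    return (score ** (1.0 - alpha)) * (iou(pred_box, gt_box) ** alpha)


def assign_one_to_one(quality):
    """quality[g][p] -> prediction index per GT, maximizing total quality.

    Brute force, so only for tiny examples; needs at least as many
    predictions as GTs.
    """
    n_gt, n_pred = len(quality), len(quality[0])
    best, best_total = None, -1.0
    for perm in permutations(range(n_pred), n_gt):
        total = sum(quality[g][p] for g, p in enumerate(perm))
        if total > best_total:
            best, best_total = perm, total
    return list(best)


gts = [(0, 0, 10, 10), (20, 0, 30, 10)]
# Each prediction: (location, predicted box, classification score).
preds = [
    ((5, 5), (0, 0, 10, 10), 0.90),    # tight box on GT 0
    ((6, 5), (1, 0, 10, 10), 0.95),    # near-duplicate of the first one
    ((25, 5), (20, 0, 30, 10), 0.60),  # only candidate for GT 1
]
quality = [[poto_quality(s, b, gt, pt) for pt, b, s in preds] for gt in gts]
matches = assign_one_to_one(quality)
```

Here GT 0 is matched to prediction 0 and GT 1 to prediction 2. Prediction 1 has a higher classification score than prediction 0 but a worse box, so it stays unmatched and would be trained as background, which is what pushes duplicate scores down.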
Technical details
- By using POTO and 3DMF, the scores of duplicate samples are significantly suppressed.
- On CrowdHuman, the recall even exceeds the theoretical upper bound for NMS-based methods, which is obtained by applying NMS to the GT boxes.
- MultiBox is the first paper to propose bipartite matching between pred and GT, way earlier than DETR.
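To make the 3D max filtering idea from the key ideas concrete, here is a simplified sketch of max pooling across both space and scale. It assumes the per-scale score maps have already been resized to a common resolution, and it leaves out the learnable layers that the actual module wraps around the pooling. The peak-keeping step is shown only to illustrate the suppression effect, in the spirit of CenterNet's 2D max filtering. Function names and the radius parameters are hypothetical.

```python
def max_filter_3d(maps, spatial_radius=1, scale_radius=1):
    """Max over a (scale, y, x) window for every position.

    maps: list of S score maps, each an H x W list of lists, all the same size.
    """
    n_s, n_h, n_w = len(maps), len(maps[0]), len(maps[0][0])
    out = [[[0.0] * n_w for _ in range(n_h)] for _ in range(n_s)]
    for s in range(n_s):
        for y in range(n_h):
            for x in range(n_w):
                best = maps[s][y][x]
                for ss in range(max(0, s - scale_radius), min(n_s, s + scale_radius + 1)):
                    for yy in range(max(0, y - spatial_radius), min(n_h, y + spatial_radius + 1)):
                        for xx in range(max(0, x - spatial_radius), min(n_w, x + spatial_radius + 1)):
                            best = max(best, maps[ss][yy][xx])
                out[s][y][x] = best
    return out


def keep_peaks(maps, filtered):
    """Zero every score that is not the maximum of its 3D neighborhood."""
    return [
        [[v if v == f else 0.0 for v, f in zip(row, frow)] for row, frow in zip(m, fm)]
        for m, fm in zip(maps, filtered)
    ]


# Two neighboring scales responding to the same object at the same location.
scale0 = [[0.1, 0.2, 0.1], [0.2, 0.9, 0.3], [0.1, 0.2, 0.1]]
scale1 = [[0.1, 0.1, 0.1], [0.1, 0.7, 0.1], [0.1, 0.1, 0.1]]
filtered = max_filter_3d([scale0, scale1])
peaks = keep_peaks([scale0, scale1], filtered)
```

A 2D max filter applied per scale would keep both the 0.9 and the 0.7 response, because each is a peak within its own map. Pooling across the scale axis as well leaves only the 0.9 response.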
Notes
- Review on Zhihu by 1st author
- About spatial prior in matching cost
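When α is set reasonably, the spatial prior is not strictly required, but it helps exclude bad regions during matching and improves absolute performance. The authors use center sampling with radius=1.5 in the COCO experiments and "inside GT box" in the CrowdHuman experiments. The reason is simple: occlusion in CrowdHuman is so severe that the center region is often completely occluded.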
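The two spatial priors mentioned here can be compared with a small sketch. It assumes the FCOS-style convention in which the center region is a square of half-width `radius * stride` around the GT center, clipped to the GT box; the stride value and the function names are illustrative assumptions, not taken from the review.

```python
def inside_gt_box(point, box):
    """Prior used for CrowdHuman: any location inside the GT box qualifies."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def inside_center_region(point, box, stride, radius=1.5):
    """Prior used for COCO: location must fall near the GT center.

    The region is a square of half-width radius * stride around the box
    center, clipped so it never extends past the GT box (FCOS convention,
    assumed here).
    """
    x, y = point
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    half = radius * stride
    region = (
        max(cx - half, box[0]),
        max(cy - half, box[1]),
        min(cx + half, box[2]),
        min(cy + half, box[3]),
    )
    return region[0] <= x <= region[2] and region[1] <= y <= region[3]


gt = (0, 0, 100, 100)
near_center = (60, 50)
near_edge = (70, 50)
flags = [
    (inside_gt_box(p, gt), inside_center_region(p, gt, stride=8))
    for p in (near_center, near_edge)
]
```

With a stride of 8 the center region spans 12 pixels on each side of the center, so the location at (70, 50) passes the inside-box prior but fails center sampling. When the center of a person is hidden behind another person, the looser inside-box prior still leaves visible locations available for matching.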