@YuShen1116
May 2019
tl;dr
This paper proposes region decomposition and assembly (R-DAD) for more accurate object detection.
The motivation for decomposing the object is to tackle occlusion: if the target object is
partially blocked by something, decomposing it helps the network learn features from the
unblocked (valid) parts.
Overall impression
This paper also works on the feature extraction part of the network to improve detection accuracy.
It still uses a CNN as the backbone, and applies a multi-scale region proposal network and RoI
pooling to create a base feature map. After that, it decomposes the proposed object features into four parts (top, bottom, left, right)
and uses region assembly blocks (RAB) to extract and fuse features from those sub-regions, then connects the assembled features to the detection layers.
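Since the region assembly block is only described at a high level here, the following is a rough sketch of one possible fusion scheme, assuming each part feature passes through a small conv + ReLU and the parts are combined element-wise; the channel width, the shared weights, the max-based combination, and the module name `PartFusion` are all my assumptions for illustration, not the paper's exact architecture.

```python
# One possible region-assembly-style fusion (illustrative only; the conv sizes,
# shared weights, and element-wise max are assumptions, not the paper's design).
import torch
import torch.nn as nn

class PartFusion(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # A single shared conv per stage keeps the sketch short; the real RAB
        # may well use separate weights per part.
        self.part_conv = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                       nn.ReLU(inplace=True))
        self.merge_conv = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                        nn.ReLU(inplace=True))

    def forward(self, left, right, top, bottom, whole):
        # All inputs are assumed to share the same (C, H, W) shape after
        # RoI pooling / up-sampling.
        lr = torch.max(self.part_conv(left), self.part_conv(right))    # horizontal pair
        tb = torch.max(self.part_conv(top), self.part_conv(bottom))    # vertical pair
        fused = torch.max(self.merge_conv(lr), self.merge_conv(tb))    # all four parts
        return torch.max(fused, whole)      # combine with the whole-object feature
```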
Key ideas
- Re-scale the proposals from the RPN to increase the diversity of region proposals, and filter out proposals with low confidence
and low overlap ratios. (This step is more like an engineering trick; a sketch follows after this list.)
- Split the proposed object region into four parts and up-sample each part with bilinear interpolation.
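To make the first idea concrete, here is a minimal sketch of proposal re-scaling and filtering, assuming axis-aligned boxes in (x1, y1, x2, y2) format; the scale factors, the thresholds, and the helper names `rescale_boxes` / `filter_proposals` are my own assumptions for illustration, not values from the paper.

```python
# Sketch of multi-scale proposal re-scaling and filtering (illustrative only;
# the scale factors and thresholds below are assumptions, not the paper's values).
import torch
from torchvision.ops import box_iou

def rescale_boxes(boxes, scales=(0.7, 1.0, 1.3)):
    """Re-scale (x1, y1, x2, y2) boxes around their centers with several factors."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    rescaled = []
    for s in scales:
        rescaled.append(torch.stack([cx - s * w / 2, cy - s * h / 2,
                                     cx + s * w / 2, cy + s * h / 2], dim=1))
    return torch.cat(rescaled, dim=0)

def filter_proposals(boxes, scores, gt_boxes, score_thr=0.05, iou_thr=0.3):
    """Drop proposals with low confidence or low overlap with any ground-truth box."""
    best_iou = box_iou(boxes, gt_boxes).max(dim=1).values
    keep = (scores >= score_thr) & (best_iou >= iou_thr)
    return boxes[keep], scores[keep]

# Toy usage: one RPN proposal turned into three multi-scale proposals.
proposals = torch.tensor([[10., 10., 50., 60.]])
scores = torch.tensor([0.9])
multi = rescale_boxes(proposals)
multi_scores = scores.repeat(len(multi) // len(proposals))
```

The overlap-based filter only makes sense at training time, since it compares proposals against ground-truth boxes.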
Technical details
- Multi-scale region proposal network
- The input to this module is the output of the backbone.
- First use an RPN to generate proposal boxes, then re-scale the boxes to increase the diversity of proposal regions (filter out regions with low confidence and low overlap to speed things up).
- Region decomposition and assembly network
- Decompose the proposed region into four parts (top, bottom, left, right).
- Apply RoI pooling to the whole object region.
- Up-sample each part by a factor of 2 with bilinear interpolation before feature extraction (a sketch of these steps follows below).
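A minimal sketch of the decomposition, pooling, and up-sampling steps, assuming boxes in image coordinates and a stride-8 feature map; the use of torchvision's `roi_align`, the 7x7 output size, and the stride are my assumptions, not necessarily the paper's settings.

```python
# Sketch of region decomposition + RoI pooling + 2x bilinear up-sampling.
# The 7x7 output size, stride-8 feature map, and use of roi_align are assumptions.
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def decompose_box(box):
    """Split one (x1, y1, x2, y2) box into top/bottom/left/right halves."""
    x1, y1, x2, y2 = box
    xm, ym = (x1 + x2) / 2, (y1 + y2) / 2
    return {"top":    torch.stack([x1, y1, x2, ym]),
            "bottom": torch.stack([x1, ym, x2, y2]),
            "left":   torch.stack([x1, y1, xm, y2]),
            "right":  torch.stack([xm, y1, x2, y2])}

feat = torch.randn(1, 256, 50, 50)            # dummy backbone feature map (stride 8)
box = torch.tensor([40., 40., 360., 280.])    # one proposal in image coordinates

def pool(b):
    roi = torch.cat([torch.zeros(1), b]).unsqueeze(0)   # (batch_idx, x1, y1, x2, y2)
    return roi_align(feat, roi, output_size=(7, 7), spatial_scale=1 / 8)

whole_feat = pool(box)                         # features of the whole object region
part_feats = {name: F.interpolate(pool(b), scale_factor=2,
                                  mode="bilinear", align_corners=False)
              for name, b in decompose_box(box).items()}   # 4 parts, up-sampled x2
```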
Notes
- Questions and notes on how to improve/revise the current work
- A CSDN note: https://blog.csdn.net/qq_30708445/article/details/88182603
- This paper uses a lot of engineering tricks to achieve its performance, so it is hard to say whether the R-DAD network itself is really
what helps. However, I think decomposing the feature map for object detection is a very interesting approach. In the paper,
the author only splits the proposed region into four parts without considering their relationship. Maybe we can decompose
the object with different methods? But this may require studying the relationships between object parts. Professor Tu has a journal paper about image parsing
(http://pages.ucsd.edu/~ztu/publication/IJCV_parsing.pdf); maybe we can figure something out from it.