April 2020
tl;dr: Aligned channel2spatial improves instance segmentation performance over direct channel2spatial.
The paper proposes a relatively rigorous formulation of 4D tensors that unifies DeepMask and InstanceFCN in one framework. The paper seems overly complicated for conveying two simple ideas: we need to align channel2spatial, and we need large masks for large objects.
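To make "channel2spatial" concrete, here is a minimal NumPy sketch of the direct (unaligned) variant as I understand it: the $C = V \times U$ channels at each pixel are simply reshaped into a $V \times U$ mask window. The function name `direct_channel2spatial` and the toy shapes are my own assumptions for illustration, not code from the paper, and the aligned variant is not shown.

```python
import numpy as np

def direct_channel2spatial(features, v, u):
    """Reshape a (C, H, W) feature map with C == v * u into (H, W, v, u).

    Each spatial location (y, x) ends up with its own v x u mask window,
    read straight out of the channel axis (no alignment step).
    """
    c, h, w = features.shape
    if c != v * u:
        raise ValueError(f"expected {v * u} channels, got {c}")
    # Move channels last, then split the channel axis into a v x u window.
    return features.transpose(1, 2, 0).reshape(h, w, v, u)

# Toy example (hypothetical sizes): 4 channels -> a 2x2 mask per location.
feats = np.arange(4 * 2 * 3).reshape(4, 2, 3)
masks = direct_channel2spatial(feats, v=2, u=2)
print(masks.shape)  # (2, 3, 2, 2)
```

Mask entry `(i, j)` at location `(y, x)` comes from channel `i * u + j` at that same location, which is why the window is not tied to a fixed position in the image.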
The key question for dense instance segmentation: why can't we naively adopt the CenterNet architecture for instance segmentation?
The answer is that training a neural network with $480^2$ output channels is intractable. Thus a tradeoff has to be made among the dimensions of the $H \times W \times C$ output. Either predict a coarse mask at each location and rely on bilinear upsampling and feature alignment to get better masks, as in TensorMask, or predict full-resolution masks on a coarse grid of locations, as in SOLO.
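A quick back-of-the-envelope calculation shows the scale of the problem. The sketch below assumes a $480 \times 480$ input (consistent with the $480^2$ figure above); the helper name, the $15 \times 15$ coarse mask size, and the stride-12 grid are illustrative assumptions of mine, not settings taken from TensorMask or SOLO.

```python
def dense_mask_output_size(h, w, mask_h, mask_w, stride=1):
    """Return (channels, total_values) when every location on an
    (h // stride) x (w // stride) grid predicts a mask_h x mask_w mask."""
    grid_h, grid_w = h // stride, w // stride
    channels = mask_h * mask_w
    return channels, grid_h * grid_w * channels

# Naive CenterNet-style head: a full-resolution mask at every pixel.
naive_channels, naive_total = dense_mask_output_size(480, 480, 480, 480)

# Option 1: coarse mask at every pixel (hypothetical 15x15 mask).
coarse_channels, coarse_total = dense_mask_output_size(480, 480, 15, 15)

# Option 2: full-resolution mask on a coarse grid (hypothetical stride 12).
grid_channels, grid_total = dense_mask_output_size(480, 480, 480, 480, stride=12)

print(naive_channels, naive_total)
print(coarse_channels, coarse_total)
print(grid_channels, grid_total)
```

Under these assumptions the naive head has 230,400 channels and about 53 billion output values per image, which is roughly 212 GB in float32. Shrinking either the mask size or the location grid brings the output down by two to three orders of magnitude.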