January 2020
tl;dr: CenterNet for video object detection.
This extends CenterNet as Recurrent SSD extends SSD.
However, it still uses a box-based method to generate bounding boxes and then links them into action tubelets. This makes it more of a bottom-up approach than Recurrent SSD.
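To make the "detect boxes, then link them" idea concrete, here is a minimal sketch of greedy IoU-based linking. It is not the paper's linking algorithm: the per-frame box lists, the `link_boxes` helper, and the `iou_thresh` value are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Tube = List[Tuple[int, Box]]             # list of (frame index, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_boxes(frames: List[List[Box]], iou_thresh: float = 0.5) -> List[Tube]:
    """Greedily link per-frame boxes into tubes (hypothetical helper).

    A tube is extended only if it ended on the previous frame and its
    best-overlapping unmatched box reaches iou_thresh (an assumed value).
    """
    tubes: List[Tube] = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for tube in tubes:
            last_t, last_box = tube[-1]
            if last_t != t - 1 or not unmatched:
                continue
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= iou_thresh:
                tube.append((t, best))
                unmatched.remove(best)
        # Boxes that matched no existing tube start new tubes.
        tubes.extend([(t, b)] for b in unmatched)
    return tubes
```

Calling `link_boxes` on a list of per-frame box lists returns one tube per tracked object; a tube simply stops growing once no box in the next frame overlaps it enough.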
Drawbacks and limitations: The main drawback is that it takes in K frames (K=7) at the same time, so it is not suitable for fast online inference. Like CenterNet, it supports detecting multiple objects at the same time.
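The K-frame input requirement can be illustrated with a small buffering sketch. The `clip_windows` helper and the stride-1 sliding window are assumptions for illustration, not something these notes or the paper specify; the point is only that no clip is available to the detector until K frames have arrived.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")


def clip_windows(stream: Iterable[Frame], k: int = 7) -> Iterator[List[Frame]]:
    """Yield sliding windows of k consecutive frames (hypothetical helper).

    Nothing is yielded for the first k - 1 frames, which is the startup
    latency a K-frame detector incurs on a live stream.
    """
    buffer: Deque[Frame] = deque(maxlen=k)  # oldest frame drops out automatically
    for frame in stream:
        buffer.append(frame)
        if len(buffer) == k:
            yield list(buffer)
```

With K=7 and a stride of 1, a 10-frame stream produces 4 clips, and a stream shorter than 7 frames produces none.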