Jan 2019
tl;dr: Build a long-term feature bank (a list of clip-level features computed over the entire video) and use it as an attention signal to augment short-term features for action classification.
Overall impression
The paper builds on existing video understanding work (non-local networks) and addresses how to exploit long-term signal given limited GPU memory. In prior work, the same features have to represent both the present and the long-term context. This work decouples the two, allowing the long-term feature bank to store flexible features that complement the short-term features.
Previous works
- Previous works extract features from individual frames with a pretrained network and then feed them into an RNN or a pooling network.
- Some use aggressive subsampling to fit the entire video into GPU memory.
- A 3D CNN takes in a short clip of 2-5 seconds and computes a feature map, which is then RoIAligned with region proposals (pre-computed from a single frame?) to obtain RoI features for each actor. This captures only short-term information.
Key ideas
The feature bank operator (FBO) is implemented as a stack of modified non-local blocks, in which the short-term features \(S_t\) attend over a window \(\tilde{L}_t\) of the long-term feature bank:

\(S_t^{(1)} = NL'_{\theta_1} (S_t, \tilde{L}_t) \\
S_t^{(2)} = NL'_{\theta_2} (S_t^{(1)}, \tilde{L}_t) \\
\vdots\)
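A minimal PyTorch-style sketch of this stacked attention, where the short-term RoI features act as queries and the windowed feature bank supplies keys and values. The class names, feature dimension, and the exact placement of layer norm and dropout are my assumptions, not the authors' released implementation.

```python
# Sketch of the stacked non-local FBO; names and block internals are assumptions.
import torch
import torch.nn as nn

class NLPrime(nn.Module):
    """One modified non-local block NL'_theta: short-term features S attend
    over the windowed long-term feature bank L_tilde (keys/values)."""
    def __init__(self, dim=512, dropout=0.2):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)
        self.ln = nn.LayerNorm(dim)      # LN and dropout are the paper's NL modifications
        self.drop = nn.Dropout(dropout)

    def forward(self, s, l):
        # s: (N, dim) short-term RoI features; l: (M, dim) long-term bank window
        q, k, v = self.q(s), self.k(l), self.v(l)
        attn = torch.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)   # (N, M) attention weights
        return s + self.drop(self.out(torch.relu(self.ln(attn @ v))))  # residual update

class FBONonLocal(nn.Module):
    """Stack of blocks: S^(i) = NL'_{theta_i}(S^(i-1), L_tilde)."""
    def __init__(self, dim=512, n_blocks=2):
        super().__init__()
        self.blocks = nn.ModuleList(NLPrime(dim) for _ in range(n_blocks))

    def forward(self, s, l):
        for block in self.blocks:
            s = block(s, l)
        return s
```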
- The output of the non-local blocks is 512-dimensional and is then concatenated with the 2048-dimensional short-term features.
- Another implementation of the FBO is avg/max pooling over the time dimension, yielding a 2048-dimensional long-term feature that is concatenated with the 2048-dimensional short-term features.
- Backbone: a pretrained 2D ResNet inflated into 3D (I3D). The same architecture is used to compute long- and short-term features, but the parameters are not shared (separately trained models).
- RoIPooling: the video backbone feature map (16x14x14) is first average pooled over time, then RoIAligned into 7x7 regional features, then spatially max pooled into a 1x1x1x2048-dimensional feature. This should be the d the paper refers to. The region proposals are computed with a pretrained detector on the center frame of the center clip (to be verified). A sketch of this path follows the list.
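A hedged sketch of the RoI feature path using torchvision's `roi_align`, assuming the shapes from the bullet above; the `spatial_scale` (which presumes 224x224 input crops) and the proposal format are placeholders, only the pooling order follows the notes.

```python
# Sketch of the short-term RoI feature path: temporal avg pool -> RoIAlign -> spatial max pool.
import torch
from torchvision.ops import roi_align

def short_term_roi_features(feat_3d, boxes, spatial_scale=14.0 / 224.0):
    # feat_3d: (1, 2048, 16, 14, 14) clip features from the video backbone
    # boxes:   (K, 4) person proposals (x1, y1, x2, y2) in input-image coordinates
    feat_2d = feat_3d.mean(dim=2)                        # temporal average pool -> (1, 2048, 14, 14)
    rois = torch.cat([torch.zeros(len(boxes), 1, dtype=boxes.dtype), boxes], dim=1)  # prepend batch index
    pooled = roi_align(feat_2d, rois, output_size=(7, 7),
                       spatial_scale=spatial_scale, aligned=True)  # (K, 2048, 7, 7)
    return pooled.amax(dim=(2, 3))                       # spatial max pool -> (K, 2048)
```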
Technical details
- Both short- and long-term features are extracted from 32 input frames with a temporal stride of 2 (~2 seconds of a 30 FPS video). Long-term features are computed at 1 clip per second, so for a video of S seconds the long-term feature bank is a list of S long-term features. Note that adjacent clips overlap by half of the input window. A sketch of the bank construction follows this list.
- Ablation results
- Increasing the temporal support hurts the performance of the plain 3D CNN and of the 3D CNN with self-attention; temporal convolution might not be suitable for long-term patterns.
- Even with a fixed, pretrained 3D CNN, the final performance is quite high (24.3 vs optimal 25.5 vs baseline 22.3 mAP). I think the benefit of feature decoupling is overstated by the authors.
- Using 2x2 grid features also achieves very good results compared to RoIAligned features (25.1 vs 25.5 mAP).
- The non-local design of the FBO is much better than feature pooling (25.5 vs 23.2 mAP).
- Test-time augmentation (TTA) significantly improves performance (28.0 vs 25.5 mAP).
- Improving the LFB and improving the backbone are complementary, and they help different types of actions (interactive actions such as "talk to", "throw", "hit" vs standalone ones such as "climb", "take a photo").
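A sketch of assembling the long-term feature bank per the details above (1 clip per second, 32 frames at stride 2). Here `long_term_model`, `roi_features`, and `detections` are hypothetical placeholders for the separately trained backbone, the RoI pooling step, and per-second person boxes; the window size `w` is a free parameter (the paper ablates the window length).

```python
# Sketch of LFB construction and windowing; helper names are hypothetical.
import torch

def build_lfb(frames, detections, long_term_model, roi_features, fps=30,
              clip_len=32, stride=2):
    """frames: (T, C, H, W) decoded video. Returns [L_0, ..., L_{S-1}],
    one (K_t, 2048) tensor of actor features per second."""
    lfb = []
    for t in range(frames.shape[0] // fps):
        center = t * fps + fps // 2
        idx = torch.arange(clip_len) * stride + center - clip_len * stride // 2
        idx = idx.clamp(0, frames.shape[0] - 1)                  # pad at video boundaries
        clip = frames[idx].unsqueeze(0).permute(0, 2, 1, 3, 4)   # (1, C, 32, H, W)
        with torch.no_grad():
            feat_3d = long_term_model(clip)                      # e.g. (1, 2048, 16, 14, 14)
            lfb.append(roi_features(feat_3d, detections[t]))     # (K_t, 2048)
    return lfb

def lfb_window(lfb, t, w=2):
    """L_tilde_t: stack bank entries within +/- w seconds of time t."""
    lo, hi = max(0, t - w), min(len(lfb), t + w + 1)
    return torch.cat(lfb[lo:hi], dim=0)
```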
Notes
- This work on long-term feature banks seems to have been developed in parallel with the SlowFast network, judging by the arXiv publication dates. Both address the issue of how to exploit long-term signal.
- The work may be applicable to CT data analysis by building a long-term feature bank across the entire body length. This should be helpful in understanding the anatomy.
Things to follow up
- Layer normalization: why is it useful in video understanding?
- I3D/C3D