February 2020
tl;dr: An efficient training technique that varies the spatial and temporal dimensions of video clips (and the minibatch size accordingly) during training.
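The core trade-off can be sketched in a few lines: if a clip has k times fewer voxels (T x H x W) than the base shape, the minibatch can be roughly k times larger at constant compute per iteration. The base shape, base batch size, and the coarse-to-fine progression below are illustrative numbers, not the paper's exact schedule.

```python
# Hypothetical base clip shape and batch size (illustrative, not from the paper).
BASE_T, BASE_H, BASE_W = 16, 224, 224
BASE_BATCH = 8

def scaled_batch(t, h, w):
    """Keep compute per iteration roughly constant: if the clip has
    k x fewer voxels than the base shape, use a k x larger minibatch."""
    k = (BASE_T * BASE_H * BASE_W) / (t * h * w)
    return int(BASE_BATCH * k)

# A long-cycle-style progression from coarse to fine grids
# (the paper halves the temporal and spatial dimensions in stages).
for (t, h, w) in [(8, 112, 112), (16, 112, 112), (16, 224, 224)]:
    print((t, h, w), scaled_batch(t, h, w))
```

Coarse grids early in training process many more clips per iteration, which is where the speedup comes from.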
Overall impression
The paper is from FAIR and well written, as usual. Lots of experiments, and lots of GPUs (128)! They also validated the method on a single GPU, with a ~3x speedup.
Recent video training SOTA: I3D, SlowFast, Non-Local.
It draws inspiration from FixRes in that it requires a finetuning stage at the end to resolve the train/test discrepancy.
Key ideas
Technical details
- Linear scaling rule
- Cosine learning rate schedule. This seems to yield similar performance to a stagewise training schedule.
- Temporal subsampling: non-uniform stride
- Training may become I/O bound at coarse grids (large batches of small clips).
- Training beyond 1 to 2 epochs hurts performance.
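The first two bullets can be combined into one scalar schedule: scale the base learning rate linearly with the (multigrid-varying) minibatch size, then modulate it with a half-cosine decay over training. The constants below (base LR, base batch, iteration count) are hypothetical placeholders, not values from the paper.

```python
import math

BASE_LR = 0.1      # hypothetical reference LR at the base batch size
BASE_BATCH = 8
TOTAL_ITERS = 1000

def lr_at(iter_idx, batch_size):
    """Linear scaling rule: LR grows proportionally with the minibatch,
    modulated by a half-cosine decay over training (no restarts)."""
    scaled = BASE_LR * batch_size / BASE_BATCH
    cosine = 0.5 * (1.0 + math.cos(math.pi * iter_idx / TOTAL_ITERS))
    return scaled * cosine

print(lr_at(0, 64))           # coarse grid, 8x batch -> 8x LR at the start
print(lr_at(TOTAL_ITERS, 8))  # LR decays to 0 at the end of training
```

Because multigrid changes the batch size mid-training, the linear scaling rule makes the LR jump together with the grid changes, while the cosine term provides the overall decay.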
Notes
- Can we apply this to images?
- Temporal subsampling in the long cycle seems to hurt performance. Can we just downsample the spatial resolution? The short cycle does not downsample time and leads to better performance. Maybe subsampling/augmenting the time dimension alters the meaning of the video.