
SimMIM: A Simple Framework for Masked Image Modeling

November 2021

tl;dr: Large-scale pretraining based on masked image modeling (MIM). Similar to MAE.

Overall impression

This paper was published a week after MAE and appears to have been rushed out in response to it. The ideas are very similar, but the execution (hyperparameter tuning, paper writing) is considerably inferior to MAE's.

Differences between MAE and SimMIM:

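MAE uses an asymmetric encoder-decoder design in which the encoder never sees the masked patches. SimMIM uses a symmetric design: its encoder processes every patch position, with masked patches replaced by a learnable mask token.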
SimMIM stresses the difference between prediction (of only the masked patches) and reconstruction (of all patches), and reports that the former yields better performance. MAE observes the same trend (in a footnote). However, MAE also demonstrates a middle ground: the decoder outputs predictions for all patches, but the loss is computed only on the masked ones, with no loss on visible patches.
SimMIM was not validated on more fine-grained downstream tasks such as object detection and segmentation.
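
The first two differences can be made concrete with a small NumPy sketch. It is illustrative only: the sizes, the 0.75 mask ratio, the zero-valued mask token, the L1 loss, and the noisy stand-in for the model output are all assumptions of this example, and the masking is applied to raw patches rather than patch embeddings for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 16 patches, each flattened to 4*4*3 = 48 pixel values.
num_patches, patch_dim = 16, 48
mask_ratio = 0.75  # illustrative value

patches = rng.normal(size=(num_patches, patch_dim))  # ground-truth pixels

# Random mask: True marks a masked patch.
mask = np.zeros(num_patches, dtype=bool)
masked_idx = rng.choice(num_patches, size=int(mask_ratio * num_patches), replace=False)
mask[masked_idx] = True

# MAE-style encoder input: only the visible patches are kept.
mae_encoder_input = patches[~mask]

# SimMIM-style encoder input: every position is kept, and masked positions
# are replaced by a mask token (learnable in practice, zeros here).
mask_token = np.zeros(patch_dim)
simmim_encoder_input = np.where(mask[:, None], mask_token, patches)

# Stand-in for the model output: one predicted pixel vector per patch.
pred = patches + 0.1 * rng.normal(size=patches.shape)

# Per-patch L1 error (the choice of L1 is an assumption of this sketch).
per_patch_l1 = np.abs(pred - patches).mean(axis=1)

# "Prediction": the loss is averaged over masked patches only. This also
# covers the middle ground, where the model outputs all patches but only
# the masked ones contribute to the loss.
prediction_loss = per_patch_l1[mask].mean()

# "Reconstruction": the loss is averaged over all patches.
reconstruction_loss = per_patch_l1.mean()

print(mae_encoder_input.shape, simmim_encoder_input.shape)
print(prediction_loss, reconstruction_loss)
```

The only difference between the two losses is which patches are indexed before averaging; the encoder-input difference is whether masked positions are dropped or filled with a token.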

Similarities between MAE and SimMIM:

directly regress the pixels
light decoder design
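
As a rough illustration of pixel regression with a light head, the sketch below maps per-patch features to raw pixel values with a single linear layer, which is about the lightest decoder possible (MAE's decoder is a small Transformer rather than a single layer). The feature dimension, patch size, and random weights are hypothetical, and the features stand in for a real encoder's output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration.
num_patches, embed_dim = 16, 32
patch_size, channels = 4, 3
pixels_per_patch = patch_size * patch_size * channels

# Stand-in for encoder output: one feature vector per patch position.
features = rng.normal(size=(num_patches, embed_dim))

# Light head: one linear layer from feature space to flattened patch pixels.
W = rng.normal(scale=0.02, size=(embed_dim, pixels_per_patch))
b = np.zeros(pixels_per_patch)

pred_pixels = features @ W + b  # shape: (num_patches, pixels_per_patch)

# Reshape each row back into a patch of raw pixel values, ready to be
# compared against the ground-truth patch with a regression loss.
pred_patches = pred_pixels.reshape(num_patches, patch_size, patch_size, channels)

print(pred_pixels.shape, pred_patches.shape)
```

Keeping the head this small leaves most of the model capacity, and most of the learning, in the encoder, which is the part reused downstream.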
