May 2020
tl;dr: Channel pruning by training with an L1 sparsity constraint on the batch norm scaling factors.
Overall impression
This paper proposes the simple idea of gamma (channel scaling factor) decay: it adds an L1 sparsity constraint on the BN scale parameter gamma. At inference time, channels whose gamma falls below a global threshold are pruned away (set to zero).
The work is concurrent with batchnorm pruning, which has a very similar idea.
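As a rough sketch of the training-time term (my own PyTorch pseudocode, not the authors' code; `lam` is a placeholder sparsity weight, and I assume every channel scaling factor lives in an `nn.BatchNorm2d`):

```python
import torch.nn as nn

def bn_gamma_l1(model: nn.Module, lam: float = 1e-4):
    """L1 penalty on all BN scaling factors (gamma); added to the task loss."""
    penalty = sum(m.weight.abs().sum()
                  for m in model.modules()
                  if isinstance(m, nn.BatchNorm2d))
    return lam * penalty

# Sketch of a training step:
#   loss = criterion(model(x), y) + bn_gamma_l1(model)
#   loss.backward()
#   optimizer.step()
```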
Key ideas
- Advantages of network slimming:
  - No need to change the architecture; it recycles the existing BN parameters.
  - No special library is needed for inference.
  - Pruning can be applied in multiple passes (iteratively prune and finetune); see the sketch after this list.
- Sparsity constraints can actually help performance, even before pruning.
- During training, the channel scaling factors can go up as well as down.
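Below is a sketch (again my own, with a made-up `prune_ratio` knob) of what one pruning pass could look like: gather all BN gammas, pick a global threshold, and build per-layer keep masks.

```python
import torch
import torch.nn as nn

def global_bn_threshold(model: nn.Module, prune_ratio: float = 0.5) -> float:
    """Collect |gamma| across all BN layers and return the value below which
    roughly `prune_ratio` of the channels fall (the global pruning threshold)."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d)])
    k = min(int(gammas.numel() * prune_ratio), gammas.numel() - 1)
    return torch.sort(gammas).values[k].item()

def bn_channel_masks(model: nn.Module, threshold: float):
    """Per-layer boolean keep-masks; channels with |gamma| below the threshold are pruned."""
    return {name: m.weight.detach().abs() >= threshold
            for name, m in model.named_modules()
            if isinstance(m, nn.BatchNorm2d)}
```

Each pass would then remove the masked-out channels (together with the matching conv filters), finetune, and optionally repeat.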
Technical details
- Unstructured pruning can only save model size by storing the weights in a sparse format.
- In Batch Norm, there are 4*C parameters per layer: 2 trainable (gamma and beta) and 2 non-trainable (running mean and variance) per channel; see the snippet after this list.
- The additional L1 regularization term rarely hurts model performance.
- ResNets are harder to prune: only about 10% of channels can be pruned away without hurting performance when a trained model is pruned without further finetuning (per the Pruning Filters for Efficient ConvNets paper). The paper also reports a FLOPs saving ratio of around 50% for ResNet-164.
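A quick way to see that parameter count in PyTorch (the running statistics are buffers, not trainable parameters; `num_batches_tracked` is a single counter, not per-channel):

```python
import torch.nn as nn

bn = nn.BatchNorm2d(64)                       # C = 64 channels
print([n for n, _ in bn.named_parameters()])  # ['weight', 'bias'] -> gamma and beta, 2*C trainable
print([n for n, _ in bn.named_buffers()])     # ['running_mean', 'running_var', 'num_batches_tracked']
```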
Notes
- Code is available on GitHub.
- How is the channel selection layer implemented for architectures with cross-layer connections such as ResNets?
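My guess below (not checked against the released code): the selection could be a fixed 0/1 mask applied to the feature map, so the tensor keeps its original width and residual additions still line up.

```python
import torch
import torch.nn as nn

class ChannelSelection(nn.Module):
    """Zero out pruned channels while keeping the tensor's original width,
    so skip connections across the pruned block still add up."""
    def __init__(self, num_channels: int):
        super().__init__()
        # 1 = keep, 0 = pruned; filled in after thresholding the BN gammas.
        self.register_buffer("mask", torch.ones(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.mask.view(1, -1, 1, 1)
```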