August 2019
tl;dr: Predicting coordinate transformations (predicting x and y directly from an image, and vice versa) is hard for ConvNets. Adding a mesh grid to the input image helps significantly on this task.
The paper's results are very convincing, and the technique is super efficient. Essentially, it only concatenates a two-channel meshgrid to the original input.
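As a rough illustration of the concatenation step, here is a minimal NumPy sketch. The function name `add_coord_channels`, the (C, H, W) layout, and the scaling of coordinates to [-1, 1] are assumptions made for this example, not details taken from these notes.

```python
import numpy as np


def add_coord_channels(image: np.ndarray) -> np.ndarray:
    """Append x and y coordinate channels to a float array of shape (C, H, W)."""
    _, height, width = image.shape
    # Assumption of this sketch: coordinates are scaled to the range [-1, 1].
    ys = np.linspace(-1.0, 1.0, height, dtype=image.dtype)
    xs = np.linspace(-1.0, 1.0, width, dtype=image.dtype)
    # With indexing="ij", yy[i, j] == ys[i] and xx[i, j] == xs[j]; both are (H, W).
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    # Stack the two coordinate maps as extra channels after the original ones.
    return np.concatenate([image, xx[None], yy[None]], axis=0)


image = np.zeros((3, 4, 5), dtype=np.float32)
augmented = add_coord_channels(image)
print(augmented.shape)  # (5, 4, 5)
```

The output of this step would then be fed to an ordinary convolution layer, so the only extra cost is two additional input channels.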
RoI10D cited this paper. This work also inspired cam conv.