April 2020
tl;dr: Multimodal behavioral prediction from Uber ATG with a 6-second horizon.
Very similar to the idea of Waymo's MultiPath. Uber's approach uses a multiple trajectory prediction (MTP) loss, while Waymo's approach uses a fixed number of anchor trajectories. The two approaches are largely equivalent: predict the mode first, then mask out the loss for all other modes.
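The winner-take-all idea behind the MTP loss can be sketched in a few lines of numpy. This is a simplified single-agent version with hypothetical shapes and loss weights, not Uber's exact formulation: pick the mode closest to the ground truth, regress only that mode, and classify which mode was picked.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over a 1-D array."""
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def mtp_loss(pred_trajs, mode_logits, gt_traj, alpha=1.0):
    """Winner-take-all MTP loss sketch.

    pred_trajs:  (K, T, 2) predicted trajectories, one per mode
    mode_logits: (K,)      unnormalized mode scores
    gt_traj:     (T, 2)    ground-truth future trajectory
    """
    # Pick the mode closest to the ground truth (average displacement error).
    dists = np.linalg.norm(pred_trajs - gt_traj, axis=-1).mean(axis=-1)  # (K,)
    best = int(np.argmin(dists))
    # Regression loss only on the best mode; all other modes are masked out.
    reg = np.mean((pred_trajs[best] - gt_traj) ** 2)
    # Classification loss pushes probability mass onto the best mode.
    cls = -log_softmax(mode_logits)[best]
    return cls + alpha * reg, best
```

Because only the closest mode receives a regression gradient, the other modes are free to specialize on different maneuvers instead of collapsing to the mean.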
It uses a raster image to encode map information (BEV semantic map), very close to MultiPath and previous work such as RoR, ChauffeurNet, and IntentNet.
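The BEV rasterization idea can be sketched as follows: each semantic element type (lanes, crosswalks, ...) is drawn into its own channel of an H×W grid centered on the ego vehicle. The resolution, grid size, and channel layout here are hypothetical, not the paper's actual values.

```python
import numpy as np

H, W, C = 64, 64, 2
raster = np.zeros((C, H, W), dtype=np.float32)

def world_to_pixel(xy, resolution=0.5, origin=(32, 32)):
    # 0.5 m per pixel, ego vehicle at the grid center (assumed convention)
    col = int(xy[0] / resolution + origin[0])
    row = int(xy[1] / resolution + origin[1])
    return row, col

# Paint a lane centerline (points in meters) into channel 0.
lane_pts = [(-5.0, 0.0), (0.0, 0.0), (5.0, 0.0)]
for p in lane_pts:
    r, c = world_to_pixel(p)
    raster[0, r, c] = 1.0  # channel 0: lane geometry
```

The resulting (C, H, W) tensor can then be fed to an ordinary CNN, which is what makes the raster encoding so convenient.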
It is quite interesting to see that a unimodal model will just predict the average of the two modes. In general, if it is hard for humans to label deterministically, the underlying distribution is multimodal.
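This averaging behavior is easy to demonstrate: under an L2 loss, the optimal single prediction for a set of targets is their mean, so a bimodal target distribution yields a prediction that no actual sample is close to. A toy illustration:

```python
import numpy as np

# Bimodal targets: half the drivers turn left (-1), half turn right (+1).
targets = np.array([-1.0] * 50 + [1.0] * 50)

# The L2-optimal single prediction is the mean -- in between both modes,
# a maneuver no driver actually performs.
pred = targets.mean()  # -> 0.0
```

In a driving context this mean trajectory can be the worst possible output (e.g. driving straight into a median between a left and a right turn), which is the motivation for multimodal formulations like MTP and MultiPath.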
IRL is used for behavioral prediction. However, it is usually not fast enough for real-time inference.
One approach would be to hand-craft a reward function that captures the desired behavior of a driver, such as stopping at red lights and avoiding pedestrians. However, this would require an exhaustive list of every behavior we want to consider, along with weights describing how important each one is.
With IRL, the task is instead to take a set of human-generated driving data and extract an approximation of the human's reward function. Even as an approximation, this recovered reward function captures much of the information needed to solve the problem. Once we have the right reward function, the problem reduces to finding the right policy, which can be solved with standard reinforcement learning methods.
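The core mechanic of many IRL methods (including MaxEnt IRL) is feature matching: parameterize the reward as a linear function of state features and adjust the weights until the learned policy's expected features match the expert's. A deliberately tiny bandit-style sketch of this update, with made-up features and no transition dynamics:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Three "states" with hand-picked 2-D features; reward(s) = w . phi(s).
phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w_true = np.array([2.0, -1.0])  # unknown to the learner

# Expert visits states in proportion to exp(reward) (MaxEnt-style expert).
p_expert = softmax(phi @ w_true)
mu_expert = p_expert @ phi  # expert feature expectations

# IRL: gradient ascent on w until learned visitation matches the expert's.
w = np.zeros(2)
for _ in range(5000):
    p = softmax(phi @ w)     # policy induced by current reward estimate
    mu = p @ phi             # its feature expectations
    w += 0.1 * (mu_expert - mu)  # feature-matching gradient
```

After training, `softmax(phi @ w)` reproduces the expert's state distribution even though `w_true` was never observed directly. Real IRL for driving must additionally roll out dynamics to compute the policy's feature expectations, which is exactly the expensive step that makes it hard to run in real time.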
MDN (Waymo's MultiPath uses this formulation)
Mixture Density Networks (MDNs) are conventional neural networks which solve multimodal regression tasks by learning parameters of a Gaussian mixture model. However, MDNs are often difficult to train in practice due to numerical instabilities when operating in high-dimensional spaces.
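The standard remedy for those instabilities is to do everything in log space: have the network emit log-variances rather than variances, and evaluate the mixture likelihood with log-sum-exp instead of summing raw Gaussian densities. A 1-D sketch (hypothetical parameterization, scalar target for simplicity):

```python
import numpy as np

def logsumexp(x):
    """Stable log(sum(exp(x))) for a 1-D array."""
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def mdn_nll(pi_logits, mu, log_sigma, y):
    """Negative log-likelihood of scalar y under a 1-D Gaussian mixture.

    Working with log_sigma and log-sum-exp avoids the under/overflow
    that makes naive MDN training unstable.
    """
    log_pi = pi_logits - logsumexp(pi_logits)  # log mixture weights
    sigma = np.exp(log_sigma)
    # log N(y | mu_k, sigma_k) for every component k
    log_comp = (-0.5 * np.log(2 * np.pi) - log_sigma
                - 0.5 * ((y - mu) / sigma) ** 2)
    return -logsumexp(log_pi + log_comp)
```

Even with this trick, MDNs in high dimensions still suffer from mode collapse and degenerate components, which is part of why MultiPath anchors its mixture on a fixed trajectory set instead of learning fully free-form components.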