Learning-AI

😼 Gato: A Generalist Agent

June 2023

tl;dr: A massive multitask multimodal LLM that can perform 604 multi-embodiment tasks as language modeling.

Overall impression

Gato is a generalist model with a single neural network with the same set of weights to perform a massive number of tasks. It can output text, joint torques, button presses and other tokens. –> This is maybe what is used by Tesla’s TeslaBot which maps images to joint angles.

Gato significantly scales up behavior cloning (BC) with data, compute and model parameters. BC can also be framed as a sequence modeling problem with natural language as a common grounding across otherwise incompatible embodiments.. It is NOT trained with offline or online RL. Gato reduces the need for handcrafting policy models with appropriate inductive biases for each domain.

LLM training is always open-loop, and inference is always close-loop.

The tricks in tokenization and details in Appendix B and Appendix C are super useful and is perhaps the greatest contribution of this paper.

Key ideas

Technical details

Notes

“Single-brain”-style models have interesting connections to neuroscience. Mountcastle (1978) famously stated that “the processing function of neocortical modules is qualitatively similar in all neocortical regions. Put shortly, there is nothing intrinsically motor about the motor cortex, nor sensory about the sensory cortex”. Mountcastle found that columns of neurons in the cortex behave similarly whether associated with vision, hearing or motor control. This has motivated arguments that we may only need one algorithm or model to build intelligence (Hawkins & Blakeslee, 2004).

Sensory substitution provides another argument for a single model (Bach-y Rita & Kercel, 2003). For example, it is possible to build tactile visual aids for blind people as follows. The signal captured by a camera can be sent via an electrode array on the tongue to the brain. The visual cortex learns to process and interpret these tactile signals, endowing the person with some form of “vision”. Suggesting that, no matter the type of input signal, the same network can process it to useful effect.