Vision Language Models as Reward Models

Vision language models (VLMs) trained on internet-scale image-text datasets have been used to formulate reward functions from language [cite:@kwonRewardDesignLanguage2022;@rocamondeVisionLanguageModelsAre2023;@mahmoudiehZeroShotRewardSpecification2022]. For example, [cite/t:@mahmoudiehZeroShotRewardSpecification2022];[cite/t:@rocamondeVisionLanguageModelsAre2023] use CLIP embeddings [cite:@radfordLearningTransferableVisual2021] to calculate the similarity between an image observation embedding and a goal text embedding. Given a language goal $l$, for example, $l=\text{"Put the fork on the kitchen table"}$, and an image observation $o$, we can formulate a reward function as,
\begin{align}
R_{\text{VLM}}(o,l) = \frac{\text{CLIP}_{\text{v}}(o) \cdot \text{CLIP}_{\text{l}}(l)}{\| \text{CLIP}_{\text{v}}(o) \| \cdot \| \text{CLIP}_{\text{l}}(l) \|},
\end{align}
where $\text{CLIP}_{\text{l}}(\cdot)$ is the language encoder and $\text{CLIP}_{\text{v}}(\cdot)$ is the vision encoder. Intuitively, a higher cosine similarity corresponds to a higher reward, i.e. the observation is closer to the text goal.
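
As a concrete illustration, here is a minimal sketch of this cosine-similarity reward using OpenAI's open-source `clip` package and PyTorch. The model choice (`ViT-B/32`) and the `vlm_reward` helper are illustrative assumptions, not the exact setup used in the cited papers.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a pretrained CLIP model (ViT-B/32 chosen here purely for illustration).
model, preprocess = clip.load("ViT-B/32", device=device)


def vlm_reward(observation: Image.Image, goal: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings, used as R_VLM(o, l)."""
    image_input = preprocess(observation).unsqueeze(0).to(device)
    text_input = clip.tokenize([goal]).to(device)

    with torch.no_grad():
        image_emb = model.encode_image(image_input)
        text_emb = model.encode_text(text_input)

    # Normalise both embeddings so the dot product is the cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).item()


# Example: score an observation against the language goal.
reward = vlm_reward(Image.open("observation.png"), "Put the fork on the kitchen table")
```

In practice, this reward can be queried at every environment step and plugged into a standard RL algorithm in place of a hand-designed reward.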

It is worth noting that there are other ways to obtain reward functions from VLMs. For example, [cite/t:@kwonRewardDesignLanguage2022] take an alternative approach and use LLMs to obtain reward signals.
