Vision Language Models as Reward Models

Vision language models (VLMs) trained on internet-scale image-text datasets have been used to formulate reward functions from language (Kwon et al. 2022; Rocamonde et al. 2023; Mahmoudieh, Pathak, and Darrell 2022). For example, Mahmoudieh, Pathak, and Darrell (2022) and Rocamonde et al. (2023) use CLIP embeddings (Radford et al. 2021) to compute the similarity between an image observation embedding and the goal text embedding. Given a language goal \(l\), for example, \(l=\text{“Put the fork on the kitchen table”}\), and an image observation \(o\), we can formulate a reward function as,

\begin{align} R_{\text{VLM}}(o,l) = \frac{\text{CLIP}_{\text{v}}(o) \cdot \text{CLIP}_{\text{l}}(l)}{\| \text{CLIP}_{\text{v}}(o) \| \cdot \| \text{CLIP}_{\text{l}}(l) \|}, \end{align}

where \(\text{CLIP}_{\text{l}}(\cdot)\) is the language encoder and \(\text{CLIP}_{\text{v}}(\cdot)\) is the vision encoder. Intuitively, a higher similarity corresponds to a higher reward, i.e., the observation is closer to the text goal.
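As a concrete illustration, here is a minimal sketch of \(R_{\text{VLM}}\) using the Hugging Face `transformers` CLIP interface; the checkpoint name and the `vlm_reward` helper are illustrative choices, not something prescribed by the cited works.

```python
# Sketch: CLIP-based reward as cosine similarity between an image observation
# and a language goal. Checkpoint and function name are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def vlm_reward(observation: Image.Image, goal: str) -> float:
    """R_VLM(o, l): cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[goal], images=observation,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    # Normalise both embeddings so their dot product equals the cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb * text_emb).sum(dim=-1).item()

# Example usage (hypothetical observation file):
# reward = vlm_reward(Image.open("observation.png"),
#                     "Put the fork on the kitchen table")
```

In an RL loop, this scalar would be queried at every step (or every few steps) on the rendered observation, replacing or supplementing a hand-designed reward.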

It is worth noting that there are other ways to obtain reward functions from VLMs and language models more broadly. For example, Kwon et al. (2022) take an alternative approach and use LLMs, rather than image-text similarity, to produce reward signals.

References

Kwon, Minae, Sang Michael Xie, Kalesha Bullard, and Dorsa Sadigh. 2022. “Reward Design with Language Models.” In The Eleventh International Conference on Learning Representations.
Mahmoudieh, Parsa, Deepak Pathak, and Trevor Darrell. 2022. “Zero-Shot Reward Specification via Grounded Natural Language.” In Proceedings of the 39th International Conference on Machine Learning, 14743–52. PMLR.
Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. 2021. “Learning Transferable Visual Models From Natural Language Supervision.” arXiv. https://doi.org/10.48550/arXiv.2103.00020.
Rocamonde, Juan, Victoriano Montesinos, Elvis Nava, Ethan Perez, and David Lindner. 2023. “Vision-Language Models Are Zero-Shot Reward Models for Reinforcement Learning.” In The Twelfth International Conference on Learning Representations.