Vision Language Models as Reward Models
Vision language models (VLMs) trained on internet-scale text-image datasets have been used to formulate reward functions from language (Kwon et al. 2022; Rocamonde et al. 2023; Mahmoudieh, Pathak, and Darrell 2022). For example, Mahmoudieh, Pathak, and Darrell (2022) and Rocamonde et al. (2023) use CLIP embeddings (Radford et al. 2021) to calculate the similarity between the embedding of an image observation and the embedding of the goal text. Given a language goal \(l\), for example \(l=\text{“Put the fork on the kitchen table”}\), and an image observation \(o\), we can formulate a reward function as
\begin{align} R_{\text{VLM}}(o,l) = \frac{\text{CLIP}_{\text{v}}(o) \cdot \text{CLIP}_{\text{l}}(l)}{\| \text{CLIP}_{\text{v}}(o) \| \cdot \| \text{CLIP}_{\text{l}}(l) \|}, \end{align}
where \(\text{CLIP}_{\text{l}}(\cdot)\) is the language encoder and \(\text{CLIP}_{\text{v}}(\cdot)\) is the vision encoder. Intuitively, a higher similarity corresponds to a higher reward, i.e., the observation is closer to the text goal.
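As a minimal sketch of the reward above, the following computes the cosine similarity between two embedding vectors with NumPy. The function name `vlm_reward`, the `eps` guard against zero-norm inputs, and the three-dimensional toy vectors are assumptions made for illustration; in practice the inputs would be the outputs of the CLIP vision and language encoders, which are not loaded here.

```python
import numpy as np

def vlm_reward(image_embedding, text_embedding, eps=1e-8):
    """Cosine similarity between an image embedding and a goal-text embedding.

    Stands in for R_VLM(o, l), where the two arguments play the roles of
    CLIP_v(o) and CLIP_l(l). `eps` is an illustrative guard against
    division by zero and is not part of the original formula.
    """
    v = np.asarray(image_embedding, dtype=np.float64)
    t = np.asarray(text_embedding, dtype=np.float64)
    denom = np.linalg.norm(v) * np.linalg.norm(t)
    return float(np.dot(v, t) / max(denom, eps))

# Toy stand-ins for encoder outputs (hypothetical values, not real CLIP embeddings).
goal = np.array([1.0, 0.0, 1.0])       # plays the role of CLIP_l(l)
obs_near = np.array([2.0, 0.1, 1.9])   # observation roughly aligned with the goal
obs_far = np.array([0.0, 3.0, 0.0])    # observation orthogonal to the goal

r_near = vlm_reward(obs_near, goal)
r_far = vlm_reward(obs_far, goal)
```

Because the similarity is normalized by both vector norms, the reward depends only on the angle between the embeddings and lies in \([-1, 1]\); rescaling either embedding leaves it unchanged.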
It is worth noting that there are other ways to get reward functions from VLMs. For example, Kwon et al. (2022) take an alternative approach and use LLMs to get reward signals.