Vision-language models (VLMs) trained on internet-scale image-text datasets have been used to formulate reward functions from language [cite:@kwonRewardDesignLanguage2022;@rocamondeVisionLanguageModelsAre2023;@mahmoudiehZeroShotRewardSpecification2022]. For example, [cite/t:@mahmoudiehZeroShotRewardSpecification2022] and [cite/t:@rocamondeVisionLanguageModelsAre2023] use CLIP embeddings [cite:@radfordLearningTransferableVisual2021] to compute the similarity between the embedding of an image observation and the embedding of a goal described in text, and use that similarity as the reward signal.
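A minimal sketch of this kind of reward is shown below, assuming a Hugging Face CLIP checkpoint and a hypothetical goal description; it is not the exact implementation used in the cited works, which differ in prompt design, similarity scaling, and how the score is shaped into a reward.

#+begin_src python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the cited papers use various CLIP variants.
MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)


@torch.no_grad()
def clip_reward(observation: Image.Image, goal_text: str) -> float:
    """Cosine similarity between an image observation and a goal description.

    Higher values indicate the observation better matches the goal text.
    """
    image_inputs = processor(images=observation, return_tensors="pt")
    text_inputs = processor(text=[goal_text], return_tensors="pt", padding=True)

    image_emb = model.get_image_features(**image_inputs)
    text_emb = model.get_text_features(**text_inputs)

    # Normalize so the dot product is the cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).item()


# Hypothetical usage: score a rendered frame against a language goal.
# frame = Image.open("frame.png")
# r = clip_reward(frame, "a robot arm stacking the red block on the blue block")
#+end_src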