Can a Unimodal Language Agent Provide Preferences to Tune a Multimodal Vision-Language Model?