Multimodal Shannon Game with Images
Vilém Zouhar, Sunit Bhattacharya, Ondřej Bojar

TL;DR
This paper extends the Shannon game with images as an optional extra modality and shows that image information improves both accuracy and self-reported confidence for humans and a language model (GPT-2), with priming effects becoming more apparent as the context size grows.
Contribution
It introduces a multimodal extension to the Shannon game, demonstrating that image information improves language understanding and modeling in both humans and GPT-2.
Findings
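Image information improves both accuracy and self-reported confidence, for humans and the LM alike.
Certain word classes, such as nouns and determiners, benefit more from images.
Priming effects become more apparent as the context size (image information plus sentence context) increases.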
Abstract
The Shannon game has long been used as a thought experiment in linguistics and NLP, asking participants to guess the next letter in a sentence based on its preceding context. We extend the game by introducing an optional extra modality in the form of image information. To investigate the impact of multimodal information in this game, we use human participants and a language model (LM, GPT-2). We show that the addition of image information improves both self-reported confidence and accuracy for both humans and the LM. Certain word classes, such as nouns and determiners, benefit more from the additional modality information. The priming effect in both humans and the LM becomes more apparent as the context size (extra modality information + sentence context) increases. These findings highlight the potential of multimodal information in improving language understanding and modeling.
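To make the basic game concrete, here is a minimal text-only sketch of the classic letter-guessing setup described in the abstract. It is not the authors' experimental setup: it swaps in a toy character-frequency guesser instead of human participants or GPT-2, and it does not model the image modality. The function names (`train_char_model`, `guesses_needed`, `play_shannon_game`) and the `order` parameter are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_char_model(text, order=2):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1
    return counts


def guesses_needed(counts, alphabet, context, target):
    """Number of guesses a frequency-ranked guesser needs to hit `target`.

    Assumes `target` is a member of `alphabet`.
    """
    seen = counts.get(context, Counter())
    # Rank candidates by how often they followed this context in training;
    # ties (including unseen characters) are broken alphabetically.
    ranking = sorted(alphabet, key=lambda ch: (-seen[ch], ch))
    return ranking.index(target) + 1


def play_shannon_game(counts, alphabet, sentence, order=2):
    """Return the guess count for every predictable position in `sentence`."""
    return [
        guesses_needed(counts, alphabet, sentence[i - order:i], sentence[i])
        for i in range(order, len(sentence))
    ]


if __name__ == "__main__":
    corpus = "the cat sat on the mat. the cat ate the rat."
    sentence = "the cat sat"
    alphabet = sorted(set(corpus + sentence))
    model = train_char_model(corpus, order=2)
    print(play_shannon_game(model, alphabet, sentence, order=2))
```

Lower guess counts mean the context was more informative. In this framing, the multimodal extension would amount to giving the guesser additional context beyond the preceding text, which this toy model does not attempt.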
Taxonomy
Topics: Language, Metaphor, and Cognition · Speech and dialogue systems · Natural Language Processing Techniques
