Metaphors are a Source of Cross-Domain Misalignment of Large Reasoning Models
Zhibo Hu, Chen Wang, Yanfeng Shu, Hye-young Paik, Liming Zhu

TL;DR
This paper investigates how metaphors in training data influence large language models' reasoning, revealing a causal link to cross-domain misalignment and proposing a detection method based on latent feature monitoring.
Contribution
The paper uncovers the causal impact of metaphors on reasoning misalignment in LLMs and introduces interventions and a detector to mitigate it.
Findings
Metaphors causally increase cross-domain misalignment in LLMs.
Metaphor interventions in the pre-training, fine-tuning, and re-alignment phases significantly change cross-domain misalignment levels.
A high-accuracy detector for misaligned content based on latent features was developed.
Abstract
Earlier research has shown that metaphors influence human decision-making, which raises the question of whether metaphors also influence the reasoning pathways of large language models (LLMs), given that their training data contain a large number of metaphors. In this work, we investigate this question within the scope of the emergent misalignment problem, in which LLMs can generalize patterns learned from misaligned content in one domain to another domain. We discover a strong causal relationship between metaphors in training data and the degree of misalignment in LLMs' reasoning content. With interventions using metaphors in the pre-training, fine-tuning, and re-alignment phases, models' cross-domain misalignment degrees change significantly. As we delve deeper into the causes behind this phenomenon, we observe that there is a connection between metaphors and the activation of global and local latent…
Taxonomy
Topics: Language, Metaphor, and Cognition · Topic Modeling · Multimodal Machine Learning Applications
