TL;DR
This paper introduces an embedding approach that captures preferences rather than semantic similarity, improving preference prediction in collective decision-making settings.
Contribution
It formalizes the invariance problem in text embeddings and proposes synthetic training data to enhance preference-focused similarity measurement.
Findings
Synthetic training data shifts the optimal scorer away from nuisance signals.
The proposed method improves preference prediction across 11 datasets.
Standard embeddings measure semantic similarity, so they capture preferences only while semantic and preferential similarity are correlated, and fail when that correlation breaks.
Abstract
Modern AI is opening the door to collective decision-making in which participants express their views as free-form text rather than voting on a fixed set of candidates. A natural idea is to embed these opinions in a vector space so that the substantial literature on facility location problems and fair clustering can be brought to bear. But standard text embeddings measure semantic similarity, whereas distances in facility location problems and fair clustering require what we call preferential similarity: a participant's agreement with a piece of text should be inversely related to their distance from it. Off-the-shelf embeddings inherit a coarse preference signal through a correlation between semantic and preferential similarity, but fail to capture preferences when the correlation breaks. We formalize this as an invariance problem: text embedding models encode both a…
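As a rough illustration of the preferential-similarity requirement stated in the abstract (agreement should be inversely related to distance), the sketch below scores how well a set of embeddings respects a participant's stated agreement. It is not the paper's method or evaluation protocol: the function names, the choice of cosine distance, the Spearman-style rank correlation, and the toy vectors and ratings are all assumptions made for illustration.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance between two vectors (assumed metric, not from the paper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


def ordinal_ranks(values):
    """Ordinal ranks starting at 0; ties are not averaged (enough for a sketch)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def preference_alignment(participant_vec, statement_vecs, agreement):
    """Rank correlation between closeness (negative distance) and agreement.

    A value near +1 means closer statements receive higher agreement, which is
    what preferential similarity asks for. A value near -1 means the embedding
    geometry runs against the participant's preferences.
    """
    dists = np.array([cosine_distance(participant_vec, s) for s in statement_vecs])
    closeness_ranks = ordinal_ranks(-dists)
    agreement_ranks = ordinal_ranks(agreement)
    return float(np.corrcoef(closeness_ranks, agreement_ranks)[0, 1])


# Hypothetical toy data: one participant, three statements at growing distance.
participant = [1.0, 0.0]
statements = [[1.0, 0.1], [0.5, 0.5], [0.0, 1.0]]

# Agreement falls with distance: the embedding behaves preferentially.
aligned = preference_alignment(participant, statements, [5, 3, 1])

# Agreement rises with distance: the embedding fails to capture preferences,
# for instance when nearby texts share a topic but take the opposite stance.
misaligned = preference_alignment(participant, statements, [1, 3, 5])
```

In this toy setup `aligned` is close to 1 and `misaligned` is close to -1. With real data, the ratings would come from participants and the vectors from whichever embedding model is being assessed.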
