Towards Improved Sentence Representations using Token Graphs
Krishna Sri Ipsit Mantri, Carola-Bibiane Schönlieb, Zorah Lähner, Moshe Eliasof

TL;DR
GLOT introduces a structure-aware pooling method that constructs token similarity graphs and refines token representations with graph neural networks, significantly improving sentence embeddings' robustness and efficiency.
Contribution
The paper presents GLOT, a novel graph-based pooling technique that enhances sentence representations by leveraging token relations, outperforming standard methods with fewer parameters.
Findings
GLOT maintains over 97% accuracy on a stress test where 90% of tokens are random distractors, outperforming baselines.
GLOT is competitive on the GLUE and MTEB benchmarks, with 20x fewer trainable parameters.
GLOT speeds up training by over 100x compared to fine-tuning methods.
Abstract
Obtaining a single-vector representation from a Large Language Model's (LLM) token-level outputs is a critical step for nearly all sentence-level tasks. However, standard pooling methods like mean or max aggregation treat tokens as an independent set, discarding the rich relational structure captured by the model's self-attention layers and making them susceptible to signal dilution. To address this, we introduce GLOT, a lightweight, structure-aware pooling module that reframes pooling as relational learning followed by aggregation. Operating on the outputs of a frozen LLM, GLOT first constructs a latent token-similarity graph, then refines token representations with a graph neural network, and finally aggregates them using a readout layer. Experimentally, our approach is remarkably robust and efficient: on a diagnostic stress test where 90% of tokens are random distractors, GLOT…
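The three stages named in the abstract (graph construction, GNN refinement, readout) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the single mean-aggregation layer standing in for the GNN, the softmax-weighted readout, and the random weights are all assumptions made for this example. The only detail taken from the page is that edges come from thresholding pairwise cosine similarities between token embeddings.

```python
import numpy as np

def cosine_similarity_matrix(H):
    # H: (n_tokens, dim) token embeddings, e.g. outputs of a frozen encoder.
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    unit = H / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def build_token_graph(H, tau=0.5):
    # Connect two tokens when their cosine similarity exceeds tau.
    # Self-loops are removed (an assumption of this sketch).
    adj = (cosine_similarity_matrix(H) > tau).astype(float)
    np.fill_diagonal(adj, 0.0)
    return adj

def refine_tokens(H, adj, W_self, W_neigh):
    # One mean-aggregation message-passing layer with ReLU,
    # used here as a hypothetical stand-in for the paper's GNN.
    deg = adj.sum(axis=1, keepdims=True)
    neigh_mean = (adj @ H) / np.clip(deg, 1.0, None)
    return np.maximum(H @ W_self + neigh_mean @ W_neigh, 0.0)

def weighted_readout(H, q):
    # Softmax-weighted sum over tokens; q is a hypothetical scoring vector.
    scores = H @ q
    scores = scores - scores.max()  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ H

def glot_like_pool(H, params, tau=0.5):
    adj = build_token_graph(H, tau)
    H_refined = refine_tokens(H, adj, params["W_self"], params["W_neigh"])
    return weighted_readout(H_refined, params["q"])

# Toy usage with random "token embeddings" and random (untrained) weights.
rng = np.random.default_rng(0)
H = rng.normal(size=(12, 16))
params = {
    "W_self": rng.normal(scale=0.1, size=(16, 16)),
    "W_neigh": rng.normal(scale=0.1, size=(16, 16)),
    "q": rng.normal(size=16),
}
sentence_vector = glot_like_pool(H, params, tau=0.3)
print(sentence_vector.shape)
```

In a real setting only the small pooling weights would be trained, while the encoder producing `H` stays frozen, which is how the abstract describes the module.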
Peer Reviews
Decision: ICLR 2026 Poster
1. Reframing sentence pooling as token-relation modeling. The paper introduces GLOT, a new framework that replaces traditional pooling (e.g., mean or [CLS]) with a graph-based aggregation process. GLOT unifies prior pooling schemes as special cases (when the graph is empty or uniform) and explicitly addresses the long-standing signal dilution problem in sentence embeddings. This is demonstrated by strong improvements in the signal dilution test (≈97% accuracy under 90% distractors; Table 7, Sec. 5.4).
1. While GLOT introduces a graph-based pooling paradigm, the graph construction step is heuristic; edges are formed by thresholding pairwise cosine similarities between token embeddings. The paper does not explore learnable or adaptive graph formation, nor analyze how different thresholds quantitatively affect representation quality beyond a small ablation (τ = 0.4–0.6 works best). As a result, the approach lacks a deeper theoretical justification on how graph topology influences performance. (S
1) The paper has a clear motivation. 2) The method is technically sound. 3) GLOT has been evaluated through extensive experiments.
1) Some technical details of the method need to be presented. 2) The paper lacks essential theoretical explanations to ensure the effectiveness of the method. 3) The experimental analysis needs more powerful explanations. I have significant doubts about the rationale behind the design of the proposed GLOT method and about why using a GNN-based approach alone can substantially improve the problem. These issues should be clearly explained in the method description, theoretical analysis, and exper
1. Sentence embedding is a core topic in representation learning, and pooling is a crucial step in converting token-level embeddings into sequence-level embeddings. Therefore, the chosen track is highly relevant to the conference's scope and holds significant practical value. 2. The paper is well-written. The methodology is presented clearly and intuitively. The authors claim that GLOT is the first work to learn sentence representations via GNNs on top of a frozen Large Language Model (LLM). 3.
1. The MTEB is a massive benchmark for embedding evaluation. The authors show that their method surpasses baselines on only a selected subset of tasks. This seems to be insufficient to support the claim of "state-of-the-art performance" in the abstract. 2. The chosen baselines are rather conventional. I am curious about the comparison between this proposed method and the prompting-based methods that have emerged in the past two years. Prompting methods can have an even smaller memory footprint t
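Two points raised in the reviews above can be checked on toy data: how the similarity threshold τ changes the graph, and why an empty graph reduces to ordinary pooling. The sketch below is an illustration under stated assumptions, not the paper's code: `threshold_graph`, `propagate_then_mean`, and `edge_density` are hypothetical helpers, the propagation step is parameter-free residual neighbor averaging, and the embeddings are random.

```python
import numpy as np

def threshold_graph(H, tau):
    # Edge between tokens i and j when cosine similarity exceeds tau.
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    unit = H / np.clip(norms, 1e-12, None)
    adj = (unit @ unit.T > tau).astype(float)
    np.fill_diagonal(adj, 0.0)
    return adj

def propagate_then_mean(H, adj):
    # Residual neighbor averaging followed by a mean readout.
    # No learned weights: this is a simplification for illustration.
    deg = adj.sum(axis=1, keepdims=True)
    neigh_mean = (adj @ H) / np.clip(deg, 1.0, None)
    return (H + neigh_mean).mean(axis=0)

def edge_density(adj):
    # Fraction of possible directed edges (excluding self-loops) present.
    n = adj.shape[0]
    return adj.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
H = rng.normal(size=(20, 8))

# A higher threshold can only remove edges, so density never increases.
densities = {tau: edge_density(threshold_graph(H, tau)) for tau in (0.4, 0.5, 0.6)}

# With tau above the maximum possible cosine similarity the graph is empty,
# and this stand-in collapses to plain mean pooling.
empty = threshold_graph(H, tau=1.1)
matches_mean_pooling = np.allclose(propagate_then_mean(H, empty), H.mean(axis=0))
```

In this simplified setting the threshold acts purely as a sparsity control, which is why the choice of τ (and whether it could be learned) matters for what the refinement step can see.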
Taxonomy
Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Advanced Graph Neural Networks
