FlowNIB: An Information Bottleneck Analysis of Bidirectional vs. Unidirectional Language Models

Md Kowsher; Nusrat Jahan Prottasha; Shiyun Xu; Shetu Mohanto; Ozlem Garibay; Niloofar Yousefi; Chen Chen

arXiv:2506.00859·cs.CL·October 10, 2025

Open Access · 3 Reviews

TL;DR

This paper uses the Information Bottleneck principle and a new method, FlowNIB, to explain why bidirectional language models outperform unidirectional ones in understanding tasks, showing they retain more task-relevant information.

Contribution

The paper introduces FlowNIB, a scalable method for estimating mutual information during training, and provides a theoretical framework demonstrating bidirectional models retain more information and are more expressive.
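As background on what "estimating mutual information" can look like in practice, here is a minimal sketch of the Donsker-Varadhan lower bound that MINE-style estimators optimize (the reviews below indicate the paper uses MINE critics). This is not FlowNIB itself: the function names, the toy bivariate Gaussian, and the use of a closed-form critic in place of a trained network are all assumptions made for illustration.

```python
import math
import random


def log_density_ratio(x, z, rho):
    """log p(x, z) / (p(x) p(z)) for a standard bivariate Gaussian
    with correlation rho. Stands in for a trained critic T(x, z)."""
    one_minus = 1.0 - rho * rho
    quad = (rho * rho * x * x - 2.0 * rho * x * z + rho * rho * z * z) / (2.0 * one_minus)
    return -0.5 * math.log(one_minus) - quad


def dv_lower_bound(rho, n=20000, seed=0):
    """Monte Carlo estimate of E_joint[T] - log E_product[exp(T)]."""
    rng = random.Random(seed)
    joint_sum = 0.0
    product_sum = 0.0
    for _ in range(n):
        # Sample (x, z) from the joint: z is correlated with x.
        x = rng.gauss(0.0, 1.0)
        z = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        joint_sum += log_density_ratio(x, z, rho)
        # Sample an independent pair from the product of marginals.
        x_ind = rng.gauss(0.0, 1.0)
        z_ind = rng.gauss(0.0, 1.0)
        product_sum += math.exp(log_density_ratio(x_ind, z_ind, rho))
    return joint_sum / n - math.log(product_sum / n)


def true_mi(rho):
    """Closed-form mutual information (in nats) of the same Gaussian pair."""
    return -0.5 * math.log(1.0 - rho * rho)
```

With the optimal critic the bound is tight, so `dv_lower_bound(0.6)` should land close to `true_mi(0.6)`. A MINE-style estimator would instead learn the critic by gradient ascent on this same quantity.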

Findings

01 Bidirectional models retain more mutual information than unidirectional models.

02 Bidirectional models have higher effective dimensionality than unidirectional models.

03 FlowNIB effectively analyzes information flow during training.

Abstract

Bidirectional language models have better context understanding and perform better than unidirectional models on natural language understanding tasks, yet the theoretical reasons behind this advantage remain unclear. In this work, we investigate this disparity through the lens of the Information Bottleneck (IB) principle, which formalizes a trade-off between compressing input information and preserving task-relevant content. We propose FlowNIB, a dynamic and scalable method for estimating mutual information during training that addresses key limitations of classical IB approaches, including computational intractability and fixed trade-off schedules. Theoretically, we show that bidirectional models retain more mutual information and exhibit higher effective dimensionality than unidirectional models. To support this, we present a generalized framework for measuring representational…
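For reference, the trade-off the abstract describes is usually written as the classical Information Bottleneck Lagrangian. The form below is the standard textbook one, not the FlowNIB objective; the symbols $X$ (input), $Y$ (target), $Z$ (representation), and $\beta$ (trade-off weight) are the conventional ones and are assumed here.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y), \qquad \beta > 0
```

Minimizing $I(X; Z)$ compresses the input, while the $-\beta\, I(Z; Y)$ term rewards keeping task-relevant content. In the classical setting $\beta$ is a fixed constant, which appears to be the "fixed trade-off schedule" limitation the abstract refers to.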

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper adds a formal lens to a known empirical observation that bidirectional attention yields stronger representations than causal attention. The work includes broad comparisons across many datasets and model families.

Weaknesses

The authors claim that higher OIC is correlated with better model accuracy, but no quantitative correlations or graphical comparisons are provided. There are also small issues with the presentation of the paper. Namely, the tables are hard to read: they pack a lot of content into a tiny font size. I would suggest moving the full tables to the appendix and leaving only the most important aggregated or selected values in the main text, or presenting them graphically as a plot.

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- I found the paper clearly written and overall easy to follow (with a few but critical exceptions mentioned below).
- Given the recent dominance of unidirectional language models, it can be interesting to revisit bidirectional language models.

Weaknesses

- I didn't understand a key point: line 099: "in finding both information *simultaneously*". What does this mean? Why not just use MINE to estimate I(X; Z_l) and I(Z_l; Y)? Why use a schedule emphasizing first one of these and then the other? I see that this yields a nice 2D plane, but what is the theoretical interpretation? This seems also key to understanding why the authors need to introduce FlowNIB, which otherwise seems unclear.
- Figure 1: which ones are the lower vs upper layers? It seems

Reviewer 03 · Rating 2 · Confidence 3

Strengths

The paper uses and emphasizes the importance of mutual information, which is a metric well-supported by theory and could be very useful for analyzing neural models, but is somewhat overlooked by the research community.

Weaknesses

- There’s an important question that is not clear to me: is the input variable X a sequence of tokens or just one token? From the architecture of the MINE critics (2-layer MLP) mentioned in the paper, it seems similar to the original use case of MINE, where inputs are just pairs of vectors. But that means X is a single token, and Z is the representation corresponding to this token. Then this I(X, Z) is measuring how much information about the current token is encoded by its corresponding representation

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Text Readability and Simplification