TL;DR
This paper identifies rank bottlenecks in knowledge graph embedding models caused by simple output layers, and proposes KGE-MoS, a mixture-based layer, to improve ranking accuracy on large datasets.
Contribution
The paper introduces KGE-MoS, a novel mixture-based output layer that alleviates rank bottlenecks in knowledge graph embeddings, enhancing performance on large-scale datasets.
Findings
KGE-MoS improves ranking performance on large datasets.
Rank bottlenecks limit model expressivity and accuracy.
KGE-MoS achieves these improvements with low parameter cost.
Abstract
Many knowledge graph embedding (KGE) models for link prediction use powerful encoders. However, they often rely on a simple hidden vector-matrix multiplication to score subject-relation queries against candidate object entities. When the number of entities is larger than the model's embedding dimension, which is often the case in practice by several orders of magnitude, we have a linear output layer with a rank bottleneck. Such bottlenecked layers limit model expressivity. We investigate both theoretically and empirically how rank bottlenecks affect KGEs. We find that, by limiting the set of feasible predictions, rank bottlenecks hurt the ranking accuracy and distribution fidelity of scores. Inspired by the language modelling literature, we propose KGE-MoS, a mixture-based output layer to break rank bottlenecks in many KGEs. Our experiments show that KGE-MoS improves ranking performance…
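The rank argument in the abstract can be checked numerically. The sketch below is not the paper's implementation: the sizes, the random embeddings, the `tanh` projections `W`, and the gating matrix `G` are all illustrative assumptions, chosen only to follow the general mixture-of-softmaxes idea from language modelling. It compares the rank of the score matrix produced by a single vector-matrix multiplication with the rank of the log-probability matrix produced by a small mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: far more entities than embedding dimensions.
num_queries, num_entities, dim, num_components = 30, 50, 4, 3

H = rng.normal(size=(num_queries, dim))    # encoded (subject, relation) queries
E = rng.normal(size=(num_entities, dim))   # candidate object entity embeddings


def log_softmax(z):
    """Row-wise log-softmax, shifted by the row maximum for stability."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


# Linear output layer: every row of scores lies in the column space of E,
# so the rank is at most dim, however many entities there are.
scores = H @ E.T
rank_linear = np.linalg.matrix_rank(scores)

# A softmax on top only subtracts a per-row constant in log space,
# which raises the rank bound to dim + 1 at most.
rank_logsoftmax = np.linalg.matrix_rank(log_softmax(scores))

# Mixture of softmaxes (assumed form): each component k has its own
# projected query tanh(H @ W[k]); a softmax gate mixes the components.
W = rng.normal(size=(num_components, dim, dim))   # assumed projections
G = rng.normal(size=(dim, num_components))        # assumed gating weights

mixture_weights = np.exp(log_softmax(H @ G))      # (queries, components)
component_probs = np.stack(
    [np.exp(log_softmax(np.tanh(H @ W[k]) @ E.T)) for k in range(num_components)],
    axis=1,
)                                                 # (queries, components, entities)
mos_probs = (mixture_weights[:, :, None] * component_probs).sum(axis=1)

# The log of a sum of softmaxes is no longer linear in the entity
# embeddings, so the dim + 1 bound does not apply.
rank_mos = np.linalg.matrix_rank(np.log(mos_probs))

print("linear scores rank:    ", rank_linear)
print("log-softmax rank:      ", rank_logsoftmax)
print("mixture log-prob rank: ", rank_mos)
```

In this toy setting the mixture adds only the `W` and `G` parameters, which are small next to the entity embedding table `E`; the rank of the log-probability matrix is then limited by the number of queries and entities rather than by the embedding dimension.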
Peer Reviews
Decision·Submitted to ICLR 2026
I think the paper’s main strength is its theoretical analysis of the bottleneck problem. The authors do a solid job of identifying and characterizing this limitation and even relate it to graph connectivity, which is interesting. The experiments are also convincing overall: KGE-MoS seems to improve results in the right settings (i.e., large graphs) without blowing up the parameter count. It’s a clear and well-motivated piece of work.
I do have a few concerns about the experimental part.
W1. The baselines are reasonable (DistMult, ConvE, etc.), but they’re a bit outdated. There are more recent KGE architectures, some of them Transformer- or GNN-based, that also use similar scoring layers. It would strengthen the argument a lot if the authors could show that their method helps even those stronger baselines, not just the classic ones.
W2. The ablation on the number of mixtures, $K$, is only done on DistMult and on a single dataset
- Problem novelty: the rank bottleneck problem has not yet been studied in the KGE literature, to the best of my knowledge.
- The paper is well written and includes comprehensive material.
- The contribution is an original adoption of methods from the language modelling literature.
- Evaluation: good mixture of benchmark datasets.
- The rank bottleneck problem could use a more in-depth introduction, to reach a broader audience.
- The contribution is limited to adding a MoS layer to existing KGE architectures.
- KGE-MoS does not support translation-based KGE methods (e.g. RotatE).
- Evaluation: the *-MoS variants have limited impact on predictive power, with results on par with the baselines.
- The experimental results presented in the paper do not justify adopting KGE-MoS in practice, given the computational overhead (e.g. 2.75 times slower to train).
Hard to say, as key related work is not represented.
W1. Not novel and key related work missing. The rank bottleneck has been studied in more detail, using tighter bounds, and applied to more models in [A]. [A] also proposes an ensemble approach, yet it is neither cited nor discussed.
W2. Does not use the right problems. The paper asks whether a KGE model can express every ranking. What matters, however, is not whether a KGE model can express every ranking, but only whether it can rank positives higher than negatives: the relative ranking of, say,
