Interpretable Company Similarity with Sparse Autoencoders

Marco Molinari; Victor Shao; Luca Imeneo; Mateusz Mikolajczak; Vladimir Tregubiak; Abhimanyu Pandey; Sebastian Kuznetsov Ryder Torres Pereira

arXiv:2412.02605·cs.CL·May 27, 2025

TL;DR

This paper introduces a method using Sparse Autoencoders to create interpretable, meaningful clusters of companies based on descriptions, outperforming traditional sector codes and embeddings in capturing fundamental similarities and improving trading strategies.

Contribution

The paper demonstrates that Sparse Autoencoders can produce interpretable company clusters that better reflect fundamental characteristics than existing classification methods.

Findings

01 SAE features outperform SIC and GICS codes in correlation with returns

02 SAE-based clusters yield higher Sharpe ratios in trading strategies

03 Clusters are simple and interpretable, aiding high-stakes decision-making

Abstract

Determining company similarity is a vital task in finance, underpinning risk management, hedging, and portfolio diversification. Practitioners often rely on sector and industry classifications such as SIC and GICS codes to gauge similarity, the former being used by the U.S. Securities and Exchange Commission (SEC), and the latter widely used by the investment community. Since these classifications lack granularity and need regular updating, using clusters of embeddings of company descriptions has been proposed as a potential alternative, but the lack of interpretability in token embeddings poses a significant barrier to adoption in high-stakes contexts. Sparse Autoencoders (SAEs) have shown promise in enhancing the interpretability of Large Language Models (LLMs) by decomposing LLM activations into interpretable features. Moreover, SAEs capture an LLM's internal…
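
To make the idea in the abstract concrete, here is a minimal sketch of encoding activation vectors with a ReLU sparse autoencoder and comparing companies by the cosine similarity of their feature vectors. It is an illustration under stated assumptions, not the paper's implementation: the weights are random rather than trained, the activations are random stand-ins for LLM activations on company descriptions, the dimensions are arbitrary, and the names `sae_encode`, `sae_decode`, and `cosine_similarity_matrix` are hypothetical.

```python
import numpy as np

def sae_encode(activations, w_enc, b_enc):
    """ReLU encoder: (n, d_model) activations -> non-negative (n, d_sae) features."""
    return np.maximum(activations @ w_enc + b_enc, 0.0)

def sae_decode(features, w_dec, b_dec):
    """Linear decoder: (n, d_sae) features -> reconstructed (n, d_model) activations."""
    return features @ w_dec + b_dec

def cosine_similarity_matrix(features, eps=1e-12):
    """Pairwise cosine similarity between the rows of `features`."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, eps)  # guard against all-zero rows
    return unit @ unit.T

# Assumption: random stand-ins for one LLM activation vector per company.
rng = np.random.default_rng(0)
n_companies, d_model, d_sae = 4, 16, 64
activations = rng.normal(size=(n_companies, d_model))

# Assumption: untrained random weights; a real SAE is trained to reconstruct
# activations under a sparsity penalty.
w_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.full(d_sae, -1.0)  # negative bias zeroes out many features
w_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_dec = np.zeros(d_model)

features = sae_encode(activations, w_enc, b_enc)
reconstruction = sae_decode(features, w_dec, b_dec)
similarity = cosine_similarity_matrix(features)
```

Each row of `similarity` can then be read as one company's similarity to every other company. Because every feature is non-negative, a high score can be traced back to the specific features that both companies activate.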

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Stock Market Forecasting Methods