# FinTextSim: a domain-specific sentence-transformer for extracting predictive latent topics from financial disclosures

**Authors:** Simon Jehnen, Javier Villalba-Díez, Joaquín Ordieres-Meré

PMC · DOI: 10.3389/frai.2026.1752103 · Frontiers in Artificial Intelligence · 2026-03-02

## TL;DR

FinTextSim is a sentence-transformer finetuned for financial text. Paired with BERTopic, it extracts clearer and more coherent topics from 10-K filings, and adding those topics as features improves corporate performance prediction over a purely financial baseline.

## Contribution

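FinTextSim introduces a domain-specific sentence-transformer, finetuned for financial text, and benchmarks it against classical topic models and standard embedding models on Item 7 and Item 7A of 10-K filings from S&P 500 companies (2016–2023). Combined with BERTopic, it produces clearer, more coherent and financially relevant topic clusters than all alternatives tested.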

## Key findings

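- Compared to the most widely used standard embedding models and financial baselines, FinTextSim improves intratopic similarity by up to 71% and reduces intertopic similarity by more than 108%.
- Adding FinTextSim-derived topic features to a logistic regression for corporate performance prediction yields a statistically significant two-percentage-point increase in both ROC-AUC and F1-score over a purely financial baseline.
- Off-the-shelf sentence-transformers and classical topic models introduce noise that degrades predictive performance in the linear setting. With non-linear classifiers, several textual representations yield modest gains, but FinTextSim remains the most stable and consistently strong performer across both settings.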
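
To make the similarity figures more concrete, the following is a minimal sketch of how intratopic and intertopic similarity could be computed. It assumes cosine similarity, a mean over pairs within each topic, and a mean over pairs of topic centroids; this summary does not state the paper's exact definitions, so those choices, the function names and the toy 2-D vectors are all illustrative assumptions. It also shows why a reduction of more than 100% is possible: cosine similarity can become negative, so the metric can drop past zero.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def intratopic_similarity(topics):
    """Mean pairwise cosine similarity between embeddings sharing a topic."""
    sims = [cosine(u, v)
            for vecs in topics.values()
            for u, v in combinations(vecs, 2)]
    return sum(sims) / len(sims)


def intertopic_similarity(topics):
    """Mean pairwise cosine similarity between topic centroids."""
    cents = [centroid(vecs) for vecs in topics.values()]
    sims = [cosine(u, v) for u, v in combinations(cents, 2)]
    return sum(sims) / len(sims)


def relative_change(old, new):
    """Signed change of `new` relative to `old` (e.g. -1.08 means -108%)."""
    return (new - old) / abs(old)


# Hypothetical toy embeddings, NOT data from the paper.
generic = {"liquidity": [[1.0, 0.2], [0.8, 0.6]],
           "risk": [[0.6, 0.8], [0.2, 1.0]]}
adapted = {"liquidity": [[1.0, 0.1], [1.0, -0.1]],
           "risk": [[-0.9, 0.3], [-1.0, 0.1]]}

intra_gain = relative_change(intratopic_similarity(generic),
                             intratopic_similarity(adapted))
inter_drop = relative_change(intertopic_similarity(generic),
                             intertopic_similarity(adapted))
print(f"intratopic change: {intra_gain:+.0%}")
print(f"intertopic change: {inter_drop:+.0%}")
```

In this toy setup the adapted embeddings pull same-topic sentences together and push the two topic centroids into opposing directions, so intertopic similarity turns negative and the relative reduction exceeds 100%.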

## Abstract

Recent advancements in information availability and computational capabilities have transformed the analysis of annual reports, integrating traditional financial metrics with insights from textual data. To extract actionable insights from this wealth of textual data, automated review processes, such as topic modeling, are essential. This study benchmarks classical approaches against contemporary neural techniques and introduces FinTextSim, a sentence-transformer finetuned for financial text. Using Item 7 and Item 7A of 10-K filings from S&P 500 companies (2016–2023), we systematically evaluate these models qualitatively and quantitatively. BERTopic in combination with FinTextSim consistently outperforms all alternatives, producing notably clearer, more coherent and financially relevant topic clusters. Compared to the most widely used standard embedding models and financial baselines, FinTextSim improves intratopic similarity by up to 71% and reduces intertopic similarity by more than 108%, highlighting the importance of domain-specific embeddings. Crucially, these qualitative gains translate into quantitative predictive benefits: incorporating FinTextSim-derived topic features into a logistic regression framework for corporate performance prediction leads to a statistically significant two-percentage-point increase in both ROC-AUC and F1-score over a purely financial baseline. In contrast, off-the-shelf sentence-transformers and classical topic models introduce noise that degrades predictive performance. For non-linear classifiers, several textual representations yield modest gains, reflecting their greater capacity to absorb noisier features. However, FinTextSim remains the most stable and consistently strong performer across both linear and non-linear settings. 
Overall, FinTextSim acts as a domain-adapted information filter, translating unstructured financial text into structured, semantically rich representations that human analysts and generic models often overlook. By bridging interpretability and predictive utility, it enables the extraction of economically relevant information from corporate narratives and supports more effective decision-making, resource allocation, and corporate performance forecasting.
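
The predictive comparison in the abstract is reported as a percentage-point gain in ROC-AUC and F1-score. As a hedged illustration of what that measurement involves, the sketch below computes both metrics from scratch for two sets of classifier scores and reports the difference in percentage points. The labels and scores are invented toy values, the 0.5 decision threshold is an assumption, and the function names are hypothetical; nothing here reproduces the paper's data, models or results.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(y_true, scores, threshold=0.5):
    """F1 of the positive class after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def percentage_point_gain(baseline, augmented):
    """Absolute difference between two rates, in percentage points."""
    return 100.0 * (augmented - baseline)


# Toy labels and scores (hypothetical): 1 = positive class, 0 = negative class.
y = [1, 1, 1, 1, 0, 0, 0, 0]
financial_only = [0.9, 0.7, 0.4, 0.6, 0.5, 0.3, 0.2, 0.65]
with_topics = [0.9, 0.7, 0.55, 0.6, 0.45, 0.3, 0.2, 0.65]

auc_gain = percentage_point_gain(roc_auc(y, financial_only),
                                 roc_auc(y, with_topics))
f1_gain = percentage_point_gain(f1_score(y, financial_only),
                                f1_score(y, with_topics))
print(f"ROC-AUC gain: {auc_gain:.2f} pp, F1 gain: {f1_gain:.2f} pp")
```

A percentage-point gain is an absolute difference between two rates, so a move from 0.70 to 0.72 ROC-AUC is two percentage points even though the relative improvement is closer to 2.9%.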

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12989565/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12989565/full.md

## References

133 references — full list in the complete paper: https://tomesphere.com/paper/PMC12989565/full.md

---
Source: https://tomesphere.com/paper/PMC12989565