Evaluating Financial Intelligence in Large Language Models: Benchmarking SuperInvesting AI with LLM Engines

Akshay Gulati; Kanha Singhania; Tushar Banga; Parth Arora; Anshul Verma; Vaibhav Kumar Singh; Agyapal Digra; Jayant Singh Bisht; Danish Sharma; Varun Singla; Shubh Garg

arXiv:2603.08704·cs.AI·March 10, 2026

Open Access

TL;DR

This paper introduces the AI Financial Intelligence Benchmark (AFIB), a five-dimension benchmark for evaluating financial reasoning in large language models. It reports substantial performance differences across the five systems tested and argues that financial intelligence is multi-dimensional.

Contribution

It presents the AFIB framework for systematically assessing LLMs' financial reasoning, benchmarks five AI systems, and highlights where retrieval-based and analytically oriented systems each perform well.

Findings

01 SuperInvesting achieves the highest overall performance.

02 Retrieval systems excel in data recency but lag in reasoning.

03 Financial intelligence in LLMs is multi-dimensional.

Abstract

Large language models are increasingly used for financial analysis and investment research, yet systematic evaluation of their financial reasoning capabilities remains limited. In this work, we introduce the AI Financial Intelligence Benchmark (AFIB), a multi-dimensional evaluation framework designed to assess financial analysis capabilities across five dimensions: factual accuracy, analytical completeness, data recency, model consistency, and failure patterns. We evaluate five AI systems: GPT, Gemini, Perplexity, Claude, and SuperInvesting, using a dataset of 95+ structured financial analysis questions derived from real-world equity research tasks. The results reveal substantial differences in performance across models. Within this benchmark setting, SuperInvesting achieves the highest aggregate performance, with an average factual accuracy score of 8.96/10 and the highest completeness…
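
To make the scoring idea concrete, here is a minimal sketch of how per-question scores on a 0-10 scale might be averaged per dimension and then combined into one aggregate per system. Everything in it is an assumption for illustration: the system names, the two dimensions shown, the scores, and the unweighted mean are hypothetical and are not taken from the paper, which may aggregate differently.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-question scores (0-10 scale). These are illustrative
# placeholders, NOT results or data from the paper.
SCORES = [
    {"system": "system_a", "dimension": "factual_accuracy", "score": 9.0},
    {"system": "system_a", "dimension": "factual_accuracy", "score": 8.0},
    {"system": "system_a", "dimension": "analytical_completeness", "score": 7.0},
    {"system": "system_a", "dimension": "analytical_completeness", "score": 8.0},
    {"system": "system_b", "dimension": "factual_accuracy", "score": 6.0},
    {"system": "system_b", "dimension": "factual_accuracy", "score": 7.0},
    {"system": "system_b", "dimension": "analytical_completeness", "score": 9.0},
    {"system": "system_b", "dimension": "analytical_completeness", "score": 9.0},
]


def average_by_dimension(rows):
    """Return {system: {dimension: mean score over that system's questions}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for row in rows:
        buckets[row["system"]][row["dimension"]].append(row["score"])
    return {
        system: {dim: mean(values) for dim, values in dims.items()}
        for system, dims in buckets.items()
    }


def aggregate(per_dimension):
    """Combine dimensions with an unweighted mean (an assumed choice)."""
    return {system: mean(dims.values()) for system, dims in per_dimension.items()}


per_dimension = average_by_dimension(SCORES)
overall = aggregate(per_dimension)
```

Keeping the per-dimension averages separate from the aggregate matters here: in this toy data, one system leads on factual accuracy while the other leads on completeness, a distinction a single combined number would hide.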

Taxonomy

Topics: Stock Market Forecasting Methods · Explainable Artificial Intelligence (XAI) · Big Data and Digital Economy