# Evaluating multimodal commercial and open-source large language models for dynamical astronomy: a benchmark study of resonant behavior classification

**Authors:** Evgeny Smirnov, Valerio Carruba

PMC · DOI: 10.1038/s41598-026-45926-y · Scientific Reports · 2026-03-28

## TL;DR

This paper evaluates how well multimodal large language models can classify mean-motion and secular resonances from images of resonant arguments, comparing commercial, large open-source, and small locally runnable models.

## Contribution

The study introduces four benchmark datasets (RB-TEST, RB-PILOT, RB-SMALL, RB-FULL) and shows that open-source models can classify resonances with practically useful accuracy without training or fine-tuning.

## Key findings

- Commercial LLMs reach an F1 score of 100% on simple cases and up to 94% on the three-class RB-SMALL dataset.
- The best open-source models reach 100% on unambiguous cases and 76% on complex ones; on the full binary benchmark they approach commercial performance (F1 of roughly 90–96%).
- Most errors occur in transient and resonance-sticking regimes.
- LLMs can perform resonance classification at levels comparable to traditional methods without training or fine-tuning.

## Abstract

We present a systematic evaluation of modern multimodal large language models (LLMs) for the classification of mean-motion and secular resonances from images of resonant arguments. Four benchmark datasets (RB-TEST, RB-PILOT, RB-SMALL, RB-FULL) were constructed to cover clear, ambiguous, and transient cases, with both binary and three-class outputs. Using standardized prompts (a full prompt for large models and a simplified variant for small models that cannot process complex instructions), we tested flagship commercial models, large open-source models, and small locally runnable models. Commercial LLMs reach $F_1 = 100\%$ on simple cases and up to $94\%$ on the three-class RB-SMALL dataset, while the best open-source models also reach $100\%$ on unambiguous cases and $76\%$ on the complex ones. On the full binary benchmark, open-source models approach commercial performance ($F_1 \approx 90$–$96\%$). Most errors occur in transient and resonance-sticking regimes. The results show that LLMs can perform resonance classification at levels comparable to those of classical or machine-learning methods without training or fine-tuning, and that even small open-source models achieve practically useful accuracy. The released benchmarks establish a reproducible standard for evaluating LLMs on dynamical astronomy tasks.
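Since the abstract reports results as $F_1$ scores on binary and three-class outputs, a minimal sketch of how such a score could be computed from model predictions may help. This is an illustration only: the summary does not say how the paper averages $F_1$ across classes, so macro-averaging is an assumption here, and the class labels and function names are hypothetical.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Hypothetical three-class labels; the benchmark's actual label names may differ.
LABELS = ["resonant", "transient", "non-resonant"]

if __name__ == "__main__":
    truth = ["resonant", "resonant", "transient", "non-resonant"]
    preds = ["resonant", "transient", "transient", "non-resonant"]
    print(f"macro F1 = {macro_f1(truth, preds, LABELS):.3f}")
```

For the binary benchmarks the same helper applies with two labels, or `f1_for_label` can be called directly on the positive class if that is the convention in use.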

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13039361/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13039361/full.md

## References

7 references — full list in the complete paper: https://tomesphere.com/paper/PMC13039361/full.md

---
Source: https://tomesphere.com/paper/PMC13039361