# BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning

**Authors:** João Guilherme Alves Santos, Giovana Kerche Bonás, Thales Sales Almeida

arXiv: 2508.21294 · 2025-09-01

## TL;DR

This paper updates the BLUEX benchmark with 2024-2025 exams and automatically generated image captions, more than doubling the number of questions usable by text-only LLMs (to 1,422) and supporting evaluation of how models use visual context supplied through captions.

## Contribution

It introduces an enhanced BLUEX dataset with 2024-2025 exams and automated captions, expanding benchmark coverage and accessibility for LLM evaluation.

## Key findings

- Captioning strategies increase the benchmark's accessibility to text-only models by more than 40%
- The updated dataset yields 1,422 usable questions, more than double the number in the original BLUEX
- Commercial and open-source LLMs are evaluated on their ability to leverage visual context through captions

## Abstract

With the growing capabilities of Large Language Models (LLMs), there is an increasing need for robust evaluation methods, especially in multilingual and non-English contexts. We present an updated version of the BLUEX dataset, now including 2024-2025 exams and automatically generated image captions using state-of-the-art models, enhancing its relevance for data contamination studies in LLM pretraining. Captioning strategies increase accessibility to text-only models by more than 40%, producing 1,422 usable questions, more than doubling the number in the original BLUEX. We evaluated commercial and open-source LLMs and their ability to leverage visual context through captions.
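As a rough illustration of the idea in the abstract, the sketch below shows how a question with images might be turned into a text-only prompt by substituting captions. It is a minimal sketch under stated assumptions, not the authors' pipeline: the `Question` record, its field names, the `is_usable_text_only` rule, and the prompt layout are all hypothetical and are not taken from the BLUEX dataset schema.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    """Hypothetical record layout; the real BLUEX schema may differ."""
    text: str
    alternatives: list[str]
    captions: list[str] = field(default_factory=list)  # one caption per image
    n_images: int = 0


def is_usable_text_only(q: Question) -> bool:
    # Assumed rule: a question is usable by a text-only model if it has
    # no images, or if every image has an accompanying caption.
    return q.n_images == 0 or len(q.captions) == q.n_images


def build_prompt(q: Question) -> str:
    # Insert each caption as a bracketed description after the question text,
    # then list the alternatives with letter labels.
    parts = [q.text]
    for i, caption in enumerate(q.captions, start=1):
        parts.append(f"[Image {i} description: {caption}]")
    for letter, alternative in zip("ABCDE", q.alternatives):
        parts.append(f"{letter}) {alternative}")
    return "\n".join(parts)
```

Under these assumptions, a question whose images all have captions passes the usability check and can be sent to a text-only model, while a question with an uncaptioned image is filtered out.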

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21294/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21294/full.md

## References

39 references — full list in the complete paper: https://tomesphere.com/paper/2508.21294/full.md

---
Source: https://tomesphere.com/paper/2508.21294