Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models

Aadi Palnitkar; Mingyang Mao; Nicholas Waytowich; Vinicius G. Goecks; Xiaomin Lin

arXiv:2601.21826 · cs.CL · February 6, 2026

TL;DR

MilSCORE is a new benchmark dataset designed to evaluate large language models' ability to perform complex, long-horizon geospatial reasoning and planning in military scenarios, highlighting current system limitations.

Contribution

This paper introduces MilSCORE, to the authors' knowledge the first scenario-level dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario, for evaluating LLMs' reasoning over long, multi-modal contexts.

Findings

01. Current models show significant room for improvement on MilSCORE.

02. MilSCORE effectively challenges LLMs in long-context, multi-source geospatial reasoning.

03. Baseline results reveal difficulties in scenario-level military planning tasks.

Abstract

As large language models (LLMs) are applied to increasingly long and complex tasks, there is a growing need for realistic long-context benchmarks that require selective reading and integration of heterogeneous, multi-modal information sources. This need is especially acute for geospatial planning problems, such as those found in planning for large-scale military operations, which demand fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. To address this gap, we present MilSCORE (Military Scenario Contextual Reasoning), to our knowledge the first scenario-level dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario used for training. MilSCORE is designed to evaluate high-stakes decision-making and planning, probing LLMs' ability to combine tactical and spatial reasoning across…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Constraint Satisfaction and Optimization