Yanyun-3: Enabling Cross-Platform Strategy Game Operation with Vision-Language Models

Guoyan Wang; Yanyan Huang; Chunlin Chen; Lifeng Wang; Yuxiang Sun

arXiv:2511.12937 · cs.AI · November 26, 2025

TL;DR

Yanyun-3 is a VLM-based agent that automates strategy games across platforms by pairing visual reasoning (Qwen2.5-VL) with interface execution (UI-TARS). Its best data-organization strategy achieves a 12.98x BLEU-4 improvement and a 63% reduction in inference time compared to full fusion, and the agent generalizes without platform-specific tuning.

Contribution

We introduce Yanyun-3, a framework that combines a structured multimodal data organization principle (combination granularity) with QLoRA fine-tuning to improve VLM performance in cross-platform game automation.

Findings

01 12.98x improvement in BLEU-4 score over full fusion (optimal strategy M*V+S)

02 63% reduction in inference time compared to full fusion

03 Effective cross-platform task execution

Abstract

Cross-platform strategy game automation remains a challenge due to diverse user interfaces and dynamic battlefield environments. Existing Vision-Language Models (VLMs) struggle with generalization across heterogeneous platforms and lack precision in interface understanding and action execution. We introduce Yanyun-3, a VLM-based agent that integrates Qwen2.5-VL for visual reasoning and UI-TARS for interface execution. We propose a novel data organization principle, combination granularity, to distinguish intra-sample fusion and inter-sample mixing of multimodal data (static images, multi-image sequences, and videos). The model is fine-tuned using QLoRA on a curated dataset across three strategy game platforms. The optimal strategy (M*V+S) achieves a 12.98x improvement in BLEU-4 score and a 63% reduction in inference time compared to full fusion. Yanyun-3 successfully executes core…
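The distinction between intra-sample fusion and inter-sample mixing can be illustrated with a toy sketch. This is not the authors' code: the `Sample` class, the `fuse` and `mix` helpers, and the example labels are all hypothetical, and the reading of M*V+S as "multi-image fused with video inside each sample, static images mixed in as separate samples" is an assumption based only on the abstract's wording.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Sample:
    """One training sample: the modalities it carries and its target text."""
    modalities: Tuple[str, ...]
    target: str


def fuse(a: Sample, b: Sample) -> Sample:
    """Intra-sample fusion: pack the inputs of two items into ONE sample.

    Assumption: both items describe the same game step, so a's target is kept.
    """
    return Sample(a.modalities + b.modalities, a.target)


def mix(*groups: List[Sample]) -> List[Sample]:
    """Inter-sample mixing: keep samples separate, pool them into one dataset."""
    return [sample for group in groups for sample in group]


# Toy data (hypothetical labels, not taken from the paper's dataset).
multi_image = [Sample(("multi_image",), "deploy unit"), Sample(("multi_image",), "end turn")]
video = [Sample(("video",), "deploy unit"), Sample(("video",), "end turn")]
static = [Sample(("static_image",), "open menu"), Sample(("static_image",), "select map")]

# Assumed reading of M*V+S: fuse M with V per sample, then mix in S separately.
m_v_plus_s = mix([fuse(m, v) for m, v in zip(multi_image, video)], static)

# Assumed reading of full fusion: all three modalities inside every sample.
full_fusion = [fuse(fuse(m, v), s) for m, v, s in zip(multi_image, video, static)]

for sample in m_v_plus_s:
    print(sample.modalities, "->", sample.target)
```

Under this reading, M*V+S yields more but lighter samples than full fusion, since no single sample carries all three modalities at once. Whether that is the mechanism behind the reported inference-time reduction is not stated in the abstract.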

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques