TL;DR
Xuanwu VL-2B is an industrial-grade multimodal foundation model optimized for content moderation, balancing visual perception, language alignment, and deployment costs within a 2B-parameter budget.
Contribution
The paper introduces a new multimodal model with a specialized training pipeline and data curation mechanism for industrial content ecosystems.
Findings
The paper reports that Xuanwu VL-2B outperforms existing models on multimodal benchmarks.
It achieves high recall on business moderation tasks.
It balances general capabilities with deployment efficiency.
Abstract
In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content ecosystems. The model adopts a compact InternViT-300M + MLP + Qwen3 1.7B architecture, balancing fine-grained visual perception, language-semantic alignment, and deployment cost within an approximately 2B-parameter budget. To balance business specialization with the retention of general capabilities, we developed a data iteration and curation mechanism and trained the model through a progressive…
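To make the "approximately 2B-parameter budget" concrete, the following is a rough back-of-the-envelope sketch, not the authors' accounting. It takes the nominal sizes from the component names (about 300M for InternViT-300M and about 1.7B for Qwen3 1.7B). The projector widths (1024, 2048, 2048) and the two-layer MLP shape are assumptions made purely for illustration, since the abstract does not state them, and the names Component, mlp_projector_params, and total_params are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    params: int  # parameter count


def mlp_projector_params(in_dim: int, hidden_dim: int, out_dim: int) -> int:
    """Weights plus biases of a two-layer MLP (in -> hidden -> out)."""
    first_layer = in_dim * hidden_dim + hidden_dim
    second_layer = hidden_dim * out_dim + out_dim
    return first_layer + second_layer


def total_params(components) -> int:
    """Sum the parameter counts of all components."""
    return sum(c.params for c in components)


# Nominal sizes read off the component names; exact counts are not in the abstract.
VISION_PARAMS = 300_000_000
LLM_PARAMS = 1_700_000_000

# ASSUMPTION: illustrative projector widths, not taken from the paper.
PROJ_IN, PROJ_HIDDEN, PROJ_OUT = 1024, 2048, 2048

components = [
    Component("vision encoder (InternViT-300M)", VISION_PARAMS),
    Component(
        "MLP projector (assumed widths)",
        mlp_projector_params(PROJ_IN, PROJ_HIDDEN, PROJ_OUT),
    ),
    Component("language model (Qwen3 1.7B)", LLM_PARAMS),
]

total = total_params(components)
for c in components:
    print(f"{c.name}: {c.params / 1e9:.3f}B")
print(f"total: {total / 1e9:.3f}B")
```

Under these assumptions the projector adds only a few million parameters, so the total is dominated by the vision encoder and the language model, which is consistent with the roughly 2B figure in the abstract.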
