DART: A Vision-Language Foundation Model for Comprehensive Rope Condition Monitoring
Anju Rani, Daniel Ortiz-Arroyo, Petar Durdevic

TL;DR
DART is a comprehensive vision-language model for rope condition monitoring that provides damage detection, severity assessment, and automated reporting from a single image using a unified multi-task architecture.
Contribution
The paper introduces DART, a novel multi-task foundation model that integrates vision and language for full-spectrum rope condition inspection without task-specific fine-tuning.
Findings
DART achieves 93.22% accuracy in damage classification.
It attains a Spearman rank correlation (ρ) of 0.94 in severity regression.
In few-shot damage recognition, DART reaches 89.2% macro-F1 at 20 shots.
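For readers unfamiliar with the two less common metrics above, the following is a minimal sketch of how macro-F1 and Spearman's ρ are conventionally computed. It is not the authors' evaluation code; the function names (`macro_f1`, `average_ranks`, `spearman_rho`) and the example labels are hypothetical, and the paper's own evaluation may differ in details such as tie handling.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (every class counts equally)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined here as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (sx * sy)


# Illustrative data only; these values are not taken from the paper.
labels = ["cut", "cut", "ok", "ok"]
preds = ["cut", "ok", "ok", "ok"]
print(macro_f1(labels, preds))

true_severity = [0.1, 0.4, 0.5, 0.9]
pred_severity = [0.2, 0.3, 0.7, 0.8]
print(spearman_rho(true_severity, pred_severity))
```

Macro-F1 averages per-class scores without weighting by class frequency, so rare damage types influence the result as much as common ones. Spearman's ρ depends only on the ordering of predictions, so it rewards ranking ropes correctly by severity even when the predicted values are offset from the true ones.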
Abstract
The condition monitoring (CM) of synthetic fibre ropes (SFRs) used in offshore, maritime, and industrial settings demands more than a classifier: inspectors need continuous severity estimates, maintenance recommendations, anomaly flags, deterioration timelines, and automated reports, all from a single inspection image. We present DART (Damage Assessment via Rope Transformer), a vision-language foundation model that addresses the full rope inspection workflow through a unified multi-task architecture. DART extends the Joint-Embedding Predictive Architecture (JEPA) to the cross-modal domain by coupling a Vision Transformer (ViT-H/14) with Llama-3.2-3B-Instruct via a Severity-Conditioned Cross-Modal Fusion (SC-CMF) module. Three architectural innovations drive the model's versatility: (1) HD-MASK, a saliency-guided masking strategy that focuses self-supervised reconstruction on…
