X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics

Yehao Li; Yingwei Pan; Jingwen Chen; Ting Yao; Tao Mei

arXiv:2108.08217 · cs.CV · August 19, 2021 · 1 cites

TL;DR

X-modaler is an open-source, modular, high-performance codebase designed to unify and accelerate research and development in cross-modal analytics tasks like image captioning, video captioning, and vision-language pre-training.

Contribution

It introduces a versatile, modular framework that encapsulates state-of-the-art cross-modal algorithms, enabling flexible implementation and extension across various vision-language tasks.

Findings

01 Supports multiple cross-modal tasks with a unified framework

02 Enables seamless switching between modules for different algorithms

03 Facilitates rapid development and deployment of cross-modal models

Abstract

With the rise and development of deep learning over the past decade, there has been a steady momentum of innovation and breakthroughs that convincingly push the state of the art of cross-modal analytics between vision and language in the multimedia field. Nevertheless, there has not been an open-source codebase that supports training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion. In this work, we propose X-modaler -- a versatile and high-performance codebase that encapsulates state-of-the-art cross-modal analytics into several general-purpose stages (e.g., pre-processing, encoder, cross-modal interaction, decoder, and decode strategy). Each stage is equipped with functionality that covers a series of modules widely adopted in state-of-the-art methods and allows seamless switching between them. This way naturally enables a flexible…
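The staged design described in the abstract can be illustrated with a minimal sketch. It is not the X-modaler API: the registry, the `register` decorator, `build_pipeline`, the toy modules, and the three-word vocabulary are all hypothetical names invented here. The only thing taken from the abstract is the list of five stages and the idea that modules within a stage can be swapped by name.

```python
from typing import Callable, Dict, List

# Stage names follow the abstract's list; everything else is hypothetical.
STAGES = ("preprocess", "encoder", "interaction", "decoder", "decode_strategy")

# Hypothetical registry (not the X-modaler API): stage -> module name -> callable.
REGISTRY: Dict[str, Dict[str, Callable]] = {stage: {} for stage in STAGES}

VOCAB = ["a", "dog", "runs"]  # toy vocabulary, invented for this example


def register(stage: str, name: str):
    """Decorator that files a callable under a stage and a module name."""
    def wrap(fn: Callable) -> Callable:
        REGISTRY[stage][name] = fn
        return fn
    return wrap


@register("preprocess", "max_scale")
def max_scale(feats: List[float]) -> List[float]:
    top = max(feats) or 1.0  # fall back to 1.0 so we never divide by zero
    return [f / top for f in feats]


@register("encoder", "identity")
def identity_encoder(feats: List[float]) -> List[float]:
    return feats


@register("interaction", "square")
def square(feats: List[float]) -> List[float]:
    # Stand-in for a cross-modal interaction module such as attention.
    return [f * f for f in feats]


@register("decoder", "vocab_scores")
def vocab_scores(feats: List[float]) -> Dict[str, float]:
    # Pair each toy feature with one vocabulary word.
    return dict(zip(VOCAB, feats))


@register("decode_strategy", "greedy")
def greedy(scores: Dict[str, float]) -> str:
    return max(scores, key=scores.get)


@register("decode_strategy", "top2")
def top2(scores: Dict[str, float]) -> List[str]:
    return sorted(scores, key=scores.get, reverse=True)[:2]


def build_pipeline(config: Dict[str, str]) -> Callable:
    """Look up one module per stage and chain them in stage order."""
    modules = [REGISTRY[stage][config[stage]] for stage in STAGES]

    def run(x):
        for module in modules:
            x = module(x)
        return x

    return run


config = {
    "preprocess": "max_scale",
    "encoder": "identity",
    "interaction": "square",
    "decoder": "vocab_scores",
    "decode_strategy": "greedy",
}
greedy_out = build_pipeline(config)([2.0, 4.0, 1.0])
# Switching one stage is a one-entry config change; the other stages are untouched.
top2_out = build_pipeline({**config, "decode_strategy": "top2"})([2.0, 4.0, 1.0])
```

In this sketch, changing the decode strategy only changes one entry in `config`, which is the kind of module switching the abstract refers to. A real codebase would register neural network modules rather than list functions.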

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization