COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

Sihan Chen; Xingjian He; Handong Li; Xiaojie Jin; Jiashi Feng; Jing Liu

arXiv:2306.09085 · cs.CV · June 16, 2023 · 2 cites

TL;DR

COSA introduces a novel pretraining approach for vision-language models by concatenating image-text pairs to simulate long-form video data, enhancing temporal understanding and improving performance across multiple downstream tasks.

Contribution

It presents a new method of pretraining that leverages concatenated image-text pairs to model temporal cues without requiring actual video data.
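
The concatenation step can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ImageTextPair`, `PseudoVideoSample`, `concatenate_samples`, and `num_concat` are hypothetical, images are represented by placeholder strings rather than tensors, and joining captions with a single space is an assumption. It only shows how several randomly drawn image-text pairs might be grouped into one pseudo video-paragraph sample whose frame order matches its sentence order.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ImageTextPair:
    image: str    # placeholder for an image tensor (assumption for illustration)
    caption: str


@dataclass
class PseudoVideoSample:
    frames: List[str]   # images treated as consecutive "video frames"
    paragraph: str      # captions joined in the same order as the frames


def concatenate_samples(
    pairs: Sequence[ImageTextPair],
    num_concat: int,
    rng: random.Random,
) -> PseudoVideoSample:
    """Draw num_concat distinct pairs at random and concatenate them."""
    if num_concat > len(pairs):
        raise ValueError("num_concat cannot exceed the number of available pairs")
    chosen = rng.sample(list(pairs), num_concat)
    return PseudoVideoSample(
        frames=[p.image for p in chosen],
        paragraph=" ".join(p.caption for p in chosen),
    )
```

Because each caption keeps the position of its image, the resulting paragraph describes the frames in order, which is the event-description correspondence the abstract refers to.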

Findings

1. Improves performance on video-text and image-text tasks.
2. Achieves state-of-the-art results on several benchmarks.
3. Enhances temporal modeling in vision-language understanding.

Abstract

Due to the limited scale and quality of video-text training corpus, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visually semantic representations while disregarding temporal semantic representations and correlations. To address this issue, we propose COSA, a COncatenated SAmple pretrained vision-language foundation model. COSA jointly models visual contents and event-level temporal cues using only image-text corpora. We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining. This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus, enabling richer scene transformations and explicit event-description correspondence. Extensive experiments demonstrate that COSA consistently improves performance across a broad range of…

Peer Reviews

Decision: ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. The idea is simple and easy to reproduce. Meanwhile, the performance gain is impressive.
2. The experiments are conducted on many benchmarks across image-text and video-text tasks, as well as different data scales. Also, the ablation is comprehensive and covers most aspects of this method.

Weaknesses

1. It makes sense that pseudo video-paragraph data in pre-training can mitigate the gap between pre-training and fine-tuning in image-text pretraining. However, intuitively, the semantic discontinuity in pseudo video-paragraph data should hurt compared with relevant video-paragraph data, because in downstream videos the image and text are indeed relevant. But in Tab. 9, random sampling appears to be better than relevant sampling, which is counter-intuitive. Can the authors explain more about

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 5

Strengths

1. The paper proposes an effective method for video-text and image-text tasks.
2. The experiments are thorough. The model consistently improves performance across a broad range of semantic vision-language downstream tasks.

Weaknesses

1. The reasons for the improvement brought by concatenation lack detailed analysis. Why is there also an improvement on image-text tasks? Why is it necessary to include the video dataset (web2vid)? Why wasn't the 1.2B model included in the video dataset?
2. The data shown in Table 1 is confusing. The data for COSA-L is 417M, while the data volume for COSA is 415M.
3. The results in Table 7 and Table 7 are also confusing. Is the best performance based on 6 pretraining tasks? Which pre-training tasks

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

- The paper is well written and easy to follow. In addition, the proposed method was supported by comprehensive experiments together with ablation studies, which made the paper a complete work.
- The method COSA itself was simple yet effective to improve the learned representations for downstream tasks, and at the same time, it did not introduce extra computational costs.

Weaknesses

- The method was more like a data augmentation trick than a significant technical contribution, as it simply concatenated images and their corresponding captions, and it was not very surprising to observe performance improvements.
- As mentioned in the paper, apart from the modified objectives, COSA also included the original objectives for pre-training on image-text pairs. It was a complicated design to have so many training objectives and it was unclear how they were weighted (

Code & Models

Repositories

txh-mercury/cosa (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Focus