COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
Sihan Chen, Xingjian He, Handong Li, Xiaojie Jin, Jiashi Feng, Jing Liu

TL;DR
COSA introduces a novel pretraining approach for vision-language models by concatenating image-text pairs to simulate long-form video data, enhancing temporal understanding and improving performance across multiple downstream tasks.
Contribution
It presents a new method of pretraining that leverages concatenated image-text pairs to model temporal cues without requiring actual video data.
Findings
Improves performance on video-text and image-text tasks.
Achieves state-of-the-art results on several benchmarks.
Enhances temporal modeling in vision-language understanding.
Abstract
Due to the limited scale and quality of video-text training corpus, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visually semantic representations while disregarding temporal semantic representations and correlations. To address this issue, we propose COSA, a COncatenated SAmple pretrained vision-language foundation model. COSA jointly models visual contents and event-level temporal cues using only image-text corpora. We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining. This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus, enabling richer scene transformations and explicit event-description correspondence. Extensive experiments demonstrate that COSA consistently improves performance across a broad range of…
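The concatenation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `PseudoVideoSample`, `concatenate_samples`, `group_size`, and `seed` are hypothetical, images are represented by plain identifiers rather than tensors, and random grouping is assumed as the sampling strategy.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class PseudoVideoSample:
    """One concatenated sample (hypothetical container, not from the paper)."""
    frames: List[str]   # image identifiers standing in for video frames
    paragraph: str      # captions joined in the same order as the frames


def concatenate_samples(
    pairs: Sequence[Tuple[str, str]],
    group_size: int = 4,   # assumed number of pairs per pseudo video
    seed: int = 0,         # assumed, only to make the shuffle repeatable
) -> List[PseudoVideoSample]:
    """Group (image, caption) pairs into pseudo video-paragraph samples.

    Pairs are shuffled, then split into consecutive groups of `group_size`.
    Leftover pairs that do not fill a whole group are dropped.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")

    rng = random.Random(seed)
    order = list(range(len(pairs)))
    rng.shuffle(order)

    samples: List[PseudoVideoSample] = []
    for start in range(0, len(order) - group_size + 1, group_size):
        group = [pairs[i] for i in order[start:start + group_size]]
        frames = [image for image, _ in group]
        # Caption order matches frame order, so each sentence in the
        # paragraph corresponds to exactly one frame.
        paragraph = " ".join(caption for _, caption in group)
        samples.append(PseudoVideoSample(frames=frames, paragraph=paragraph))
    return samples


if __name__ == "__main__":
    toy_pairs = [(f"img_{i}.jpg", f"Caption {i}.") for i in range(10)]
    for sample in concatenate_samples(toy_pairs, group_size=4):
        print(sample.frames, "->", sample.paragraph)
```

In a real pipeline the frame list would be a stack of image tensors passed to the vision encoder and the paragraph would be tokenized as a single text sequence. Keeping the caption order aligned with the frame order is what gives each sample the event-description correspondence the abstract refers to.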
Peer Reviews
Decision·ICLR 2024 poster
1. The idea is simple and easy to reproduce. Meanwhile, the performance gain is impressive. 2. The experiments are conducted on many benchmarks across image-text and video-text tasks, as well as different data scales. Also the ablation is comprehensive and covers most of the aspects of this method.
1. It makes sense that pseudo video-paragraph data in pre-training can mitigate the gap between pre-training and fine-tuning in image-text pretraining. However, intuitively, the discontinuity of semantics in pseudo video-paragraph data should hurt compared with relevant video-paragraph data because in downstream videos, image and text are indeed relevant. But in Table 9, it seems random sampling is better than relevant sampling, which is kind of counter-intuitive. Can the authors explain more about
1. The paper proposes an effective method for video-text and image-text tasks. 2. The experiments are thorough. The model consistently improves performance across a broad range of semantic vision-language downstream tasks.
1. The reasons for the improvement brought by concatenation lack detailed analysis. Why is there also an improvement for image-text tasks? Why is it necessary to include the video dataset (web2vid)? Why wasn't the 1.2B model included in the video dataset? 2. The data shown in Table 1 is confusing. The data for COSA-L is 417M, while the data volume for COSA is 415M. 3. The results in Table 7 and Table 7 are also confusing. Is the best performance based on 6 pretraining tasks? Which pre-training tasks
- The paper is well written and easy to follow. In addition, the proposed method was supported by comprehensive experiments together with ablation studies, which made the paper a complete work. - The method COSA itself was simple yet effective to improve the learned representations for downstream tasks, and at the same time, it did not introduce extra computational costs.
- The method was more like a trick of data augmentation instead of a significant technical contribution, as it just simply concatenated images and their corresponding captions and it was not very surprising to observe performance improvements. - As it was mentioned in the paper that apart from modified objectives, COSA also included original objectives for pre-training on image-text pairs. It was a complicated design to have so many training objectives and it was unclear how they were weighted (
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Focus
