TL;DR
This paper introduces a novel semantic segmentation method using language modeling and run length encoding to produce segmentation masks as token sequences, applicable to images and videos.
Contribution
It adapts the Pix2Seq framework with new tokenization strategies for efficient mask representation and extends the approach to panoptic segmentation.
Findings
Models achieve competitive results on domain-specific datasets.
The approach demonstrates potential for extension to videos and panoptic segmentation.
Code and models are publicly available for future research.
Abstract
This paper presents a new unified approach to semantic segmentation in both images and videos by using language modeling to output the masks as sequences of discrete tokens. We use run length encoding (RLE) to discretize the segmentation masks, and adapt the Pix2Seq framework to learn autoregressive models to output these tokens. We propose novel tokenization strategies to compress the lengths of the token sequences to make it practicable to extend this approach to videos. We also show how instance information can be incorporated into the tokenization process to perform panoptic segmentation. We evaluate our models on two domain-specific datasets to demonstrate their competitiveness with the state of the art in certain scenarios, in spite of being severely bottlenecked by our limited computational resources. We supplement these analyses by proposing several promising approaches to…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
