TL;DR
This paper presents an efficient method for joint CRF inference across video frames that combines semantic co-labeling with more expressive models, improving accuracy over per-frame segmentation with no additional time overhead.
Contribution
It introduces an inference formulation that makes video semantic segmentation fast and accurate by integrating co-labeling with more expressive models, and it outperforms segmenting each frame individually.
Findings
Achieves up to 8% improvement in accuracy over individual semantic image segmentation on the CamVid dataset, using TextonBoost unaries
Performs inference over 10,000 images within seconds
No additional time overhead for improved accuracy
Abstract
We explore the efficiency of CRF inference beyond image-level semantic segmentation and perform joint inference over video frames. The key idea is to combine the best of two worlds: semantic co-labeling and more expressive models. Our formulation enables us to perform inference over ten thousand images within seconds, which makes the system well suited to video semantic segmentation. On the CamVid dataset, with TextonBoost unaries, our proposed method achieves up to 8% improvement in accuracy over individual semantic image segmentation without additional time overhead. The source code is available at https://github.com/subtri/video_inference
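The abstract does not spell out the model, so the following is only a toy sketch of why joint inference across frames can help, not the paper's formulation. It assumes a generic fully connected CRF with a Potts penalty weighted by a Gaussian kernel over a time coordinate, solved with naive parallel mean-field updates. The function names, the unary costs, and the `weight` and `bandwidth` parameters are all hypothetical. The naive O(n^2) message passing used here would not scale to ten thousand images; it only illustrates the effect of coupling frames.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mean_field(unary, feats, weight=1.0, bandwidth=1.0, iters=5):
    """Naive mean-field inference on a fully connected CRF (illustrative only).

    unary[i][l] is the cost of assigning label l to node i.
    feats[i] is a feature vector for node i (here just a frame index).
    Returns the approximate label marginals for every node.
    """
    n, n_labels = len(unary), len(unary[0])
    # Gaussian affinity between every pair of distinct nodes.
    k = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(feats[i], feats[j]))
                k[i][j] = math.exp(-d2 / (2 * bandwidth ** 2))
    # Initialise the marginals from the unary costs alone.
    q = [softmax([-u for u in row]) for row in unary]
    for _ in range(iters):
        new_q = []
        for i in range(n):
            # Affinity-weighted belief that the neighbours take each label.
            msg = [sum(k[i][j] * q[j][l] for j in range(n))
                   for l in range(n_labels)]
            total = sum(msg)
            # Potts model: label l is penalised by the mass on all other labels.
            scores = [-unary[i][l] - weight * (total - msg[l])
                      for l in range(n_labels)]
            new_q.append(softmax(scores))
        q = new_q
    return q

def argmax_labels(q):
    """Most probable label per node."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]

# One pixel tracked over three frames (hypothetical costs, lower is better).
# Frames 0 and 2 clearly prefer label 0; frame 1 weakly prefers label 1.
unary = [[0.2, 2.0], [1.0, 0.8], [0.2, 2.0]]
frames = [(0.0,), (1.0,), (2.0,)]

independent = argmax_labels(mean_field(unary, frames, weight=0.0))
joint = argmax_labels(mean_field(unary, frames, weight=2.0))
print(independent)  # [0, 1, 0]
print(joint)        # [0, 0, 0]
```

With `weight=0.0` each frame is labeled from its own unary costs, so the ambiguous middle frame keeps its weak preference. With the temporal coupling switched on, the two confident neighbouring frames pull it to the consistent label, which is the kind of gain joint video inference aims for.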
Taxonomy
Methods: Conditional Random Field
