TL;DR
Poet is a novel product-oriented video captioning framework that uses graph representations and knowledge-enhanced inference to generate detailed, product-focused descriptions for e-commerce videos, improving caption quality and relevance.
Contribution
The paper introduces Poet, a new framework that models videos as product-oriented graphs and employs knowledge-enhanced inference, advancing product-specific video captioning techniques.
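To make the graph idea concrete, here is a minimal sketch of one way a video could be laid out as a spatial-temporal graph. It is not Poet's actual construction: the function name `build_spatial_temporal_graph`, the fixed number of regions per frame, and the edge rules (fully connected within a frame, same-index links across adjacent frames) are all assumptions made for illustration.

```python
from itertools import combinations


def build_spatial_temporal_graph(num_frames, regions_per_frame):
    """Build a toy spatial-temporal graph (hypothetical layout, not Poet's).

    Each node is a (frame_index, region_index) pair. Spatial edges connect
    every pair of regions inside one frame; temporal edges connect the same
    region index in consecutive frames.
    """
    nodes = [(t, r) for t in range(num_frames) for r in range(regions_per_frame)]
    edges = set()
    for t in range(num_frames):
        frame_nodes = [(t, r) for r in range(regions_per_frame)]
        # Spatial edges: assumed fully connected within a frame.
        for a, b in combinations(frame_nodes, 2):
            edges.add((a, b))
        # Temporal edges: assumed link between matching regions in adjacent frames.
        if t + 1 < num_frames:
            for r in range(regions_per_frame):
                edges.add(((t, r), (t + 1, r)))
    return nodes, edges


nodes, edges = build_spatial_temporal_graph(num_frames=3, regions_per_frame=2)
print(len(nodes), len(edges))
```

In a real system the nodes would carry visual features of product parts and the edges would be learned or weighted; this sketch only shows the node and edge bookkeeping.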
Findings
Poet outperforms previous methods in caption quality and in capturing product aspects.
The framework improves lexical diversity in generated captions.
Experiments on BFVD and FFVD datasets validate the effectiveness of Poet.
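This summary does not say how lexical diversity is measured. One widely used proxy is distinct-n, the ratio of unique n-grams to total n-grams across a set of generated captions. The sketch below assumes whitespace tokenization, and the helper name `distinct_n` is chosen here for illustration; it is not taken from the paper.

```python
def distinct_n(captions, n):
    """Ratio of unique n-grams to total n-grams over all captions.

    Assumes simple whitespace tokenization; returns 0.0 when no n-grams exist.
    """
    total = 0
    unique = set()
    for caption in captions:
        tokens = caption.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


captions = ["soft cotton shirt", "soft cotton dress"]
print(distinct_n(captions, 1))
print(distinct_n(captions, 2))
```

Higher values mean fewer repeated n-grams, so a model that produces the same stock phrases for every video scores lower.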
Abstract
In e-commerce, a growing number of user-generated videos are used for product promotion. Generating video descriptions that narrate the user-preferred product characteristics depicted in the video is vital for successful promotion. Traditional video captioning methods, which focus on routinely describing what exists and happens in a video, are not amenable to product-oriented video captioning. To address this problem, we propose a product-oriented video captioner framework, abbreviated as Poet. Poet first represents the videos as product-oriented spatial-temporal graphs. Then, based on the aspects of the video-associated product, we perform knowledge-enhanced spatial-temporal inference on those graphs to capture the dynamic change of fine-grained product-part characteristics. The knowledge leveraging module in Poet differs from the traditional design by performing knowledge…
