HouseTour: A Virtual Real Estate A(I)gent
Ata Çelen, Marc Pollefeys, Daniel Barath, Iro Armeni

TL;DR
HouseTour is a novel system that generates smooth, 3D-grounded real estate videos with natural language summaries from images, leveraging geometric reasoning and diffusion-based trajectory synthesis.
Contribution
We introduce HouseTour, a method combining 3D camera trajectory generation with vision-language models, and provide a new dataset for real estate video synthesis.
Findings
Incorporating 3D camera trajectories into the text generation process improves description quality.
Our approach outperforms methods that handle trajectory generation and text generation independently.
The system enables professional-quality real estate videos.
Abstract
We introduce HouseTour, a method for spatially-aware 3D camera trajectory and natural language summary generation from a collection of images depicting an existing 3D space. Unlike existing vision-language models (VLMs), which struggle with geometric reasoning, our approach generates smooth video trajectories via a diffusion process constrained by known camera poses and integrates this information into the VLM for 3D-grounded descriptions. We synthesize the final video using 3D Gaussian splatting to render novel views along the trajectory. To support this task, we present the HouseTour dataset, which includes over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions. Experiments demonstrate that incorporating 3D camera trajectories into the text generation process improves performance over methods handling each task independently. We evaluate both…
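The idea of generating a trajectory under the constraint of known camera poses can be illustrated with a toy sketch. This is not the authors' model: the learned diffusion denoiser is replaced here by a simple neighbor-averaging step, only camera positions (not orientations) are handled, and the names `toy_denoise`, `sample_constrained_path`, and `known_positions` are hypothetical. It shows only the general pattern of starting from noise, refining iteratively, and re-imposing the known poses after every step so that the final path passes through them.

```python
import random

def toy_denoise(points):
    """Stand-in for a learned denoiser (assumption): average each point with its neighbors."""
    n = len(points)
    out = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        window = points[lo:hi + 1]
        out.append(tuple(sum(p[d] for p in window) / len(window) for d in range(3)))
    return out

def sample_constrained_path(known, n_points, n_steps=200, seed=0):
    """Refine a random path while clamping it to known camera positions.

    known: dict mapping a path index to a fixed (x, y, z) position.
    """
    rng = random.Random(seed)
    # Start from pure Gaussian noise, as a diffusion sampler would.
    path = [tuple(rng.gauss(0.0, 1.0) for _ in range(3)) for _ in range(n_points)]
    for _ in range(n_steps):
        path = toy_denoise(path)
        # Re-impose the constraint: overwrite entries whose pose is known.
        for idx, pos in known.items():
            path[idx] = pos
    return path

# Hypothetical camera positions (metres) at three of ten path indices.
known_positions = {0: (0.0, 0.0, 1.5), 5: (2.0, 0.0, 1.5), 9: (2.0, 3.0, 1.5)}
path = sample_constrained_path(known_positions, n_points=10)
```

Because the clamping step runs last in every iteration, the returned path passes exactly through the known positions, while the intermediate points are smoothed between them. A real system would replace `toy_denoise` with a trained network and a proper noise schedule.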
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
