# Scene-level movie data from Amazon X-Ray in the US market combined with IMDb

**Authors:** Safal Shrestha, Yeonie Heo, Alexander T. J. Barron, Minsu Park

PMC · DOI: 10.1038/s41597-026-06602-y · Scientific Data · 2026-01-20

## TL;DR

This paper introduces a large, structured dataset of movie scenes from Amazon X-Ray, linked to IMDb, to improve the accuracy and scale of film analysis.

## Contribution

The paper provides a novel, standardized scene-level dataset of 3,265 movies from the US Amazon Prime Video market, with character metadata for every title and subtitles for 3,110 of them.

## Key findings

- The dataset includes scene breakdowns with character appearances and IMDb ID links for 3,265 movies.
- Subtitles are provided for 3,110 movies, enabling dialogue-level analysis.
- The dataset supports large-scale analyses of on-screen representation and narrative structure.

## Abstract

This paper presents a structured, scene-level dataset of movie content that addresses the limitations of previous research relying on small or non-standardized screenplay collections. Such collections often lack consistent scene representations and actor metadata and use draft versions that differ from their final cinematic products, limiting both the scale and accuracy of content-level analysis. To overcome these limitations, we compile scene breakdowns for 3,265 movies from Amazon X-Ray in the US Amazon Prime Video market, detailing the characters appearing in each scene and linking them to their corresponding IMDb IDs. Subtitles are included for a subset of 3,110 movies, providing complementary dialogue-level data, and each title is linked to its corresponding IMDb ID to enable augmentation with additional metadata for extended analyses. Integration of these resources can allow accurate, large-scale analyses of on-screen representation, character interactions, and narrative structure that were not feasible with earlier screenplay-based datasets. This dataset enhances the consistency and accessibility of movie data, providing a reliable stepping stone for quantitative research on films as cultural artifacts.
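
As a rough illustration of the kind of character-interaction analysis the abstract describes, the sketch below counts how often pairs of cast members share a scene. The record layout, the field names (`title_id`, `scene`, `cast_ids`), and the IDs are hypothetical placeholders chosen for this example; they are not the dataset's actual schema, which is documented in the full paper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical scene records: field names and IDs are placeholders,
# not the schema or contents of the published dataset.
scenes = [
    {"title_id": "tt0000001", "scene": 1, "cast_ids": ["nm0000001", "nm0000002"]},
    {"title_id": "tt0000001", "scene": 2, "cast_ids": ["nm0000001", "nm0000002", "nm0000003"]},
    {"title_id": "tt0000001", "scene": 3, "cast_ids": ["nm0000003"]},
]


def scene_counts(records):
    """Count the number of scenes each cast ID appears in."""
    counts = Counter()
    for record in records:
        # set() guards against an ID listed twice within one scene
        counts.update(set(record["cast_ids"]))
    return counts


def co_occurrence(records):
    """Count the number of scenes each unordered pair of cast IDs shares."""
    counts = Counter()
    for record in records:
        # Sorting gives each pair a canonical (a, b) order with a < b
        ids = sorted(set(record["cast_ids"]))
        counts.update(combinations(ids, 2))
    return counts


appearances = scene_counts(scenes)
pairs = co_occurrence(scenes)
```

The pair counts can serve as edge weights in a character network, and the per-person scene counts give a simple proxy for screen presence. Both would need to be adapted to the real file layout before use.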

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12920774/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12920774/full.md

## References

21 references — full list in the complete paper: https://tomesphere.com/paper/PMC12920774/full.md

---
Source: https://tomesphere.com/paper/PMC12920774