Can AI predict how audiences will rate a TV episode — just from watching it?
A vision model generates scene-by-scene screenplays from raw video. A frontier reasoning model scores each episode across dozens of viewer-experience dimensions. No audience signals, no ratings — just the content itself. How close can this get to what millions of viewers collectively decide?
Every episode across 4 shows — toggle source and model type
Actual IMDb rating vs model prediction for every episode — select a show and config
Rating trajectory across episodes
Predicted vs actual, colored by season
Cross-section comparisons: SP vs Subs, Opus vs Sonnet, Sequential vs Independent
Sequential expanding-window cross-validation
| Show | Episodes | Baseline ρ | Best R² | Best Recipe | ExcL R² |
|---|---|---|---|---|---|
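The table's headline metrics can be computed without any libraries. A minimal sketch, assuming ρ is the Spearman rank correlation between predicted and actual ratings and R² is the ordinary coefficient of determination (the exact baseline definition isn't given here):

```python
def _ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def _pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def spearman_rho(a, b):
    """Spearman's ρ = Pearson correlation of the rank vectors."""
    return _pearson(_ranks(a), _ranks(b))

def r_squared(actual, predicted):
    """1 - SS_res / SS_tot; 0 means no better than predicting the mean."""
    m = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - m) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

In practice `scipy.stats.spearmanr` and `sklearn.metrics.r2_score` give the same values; the pure-Python versions just make the definitions explicit.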
Same method, same modifier — only the input source differs
Where does prediction work, and where does it break down?
Sequential Expanding Window: Train on episodes 1..t-1, predict episode t. This respects temporal ordering — the model never sees future episodes. For single-season shows (Fargo), Leave-One-Out CV is used instead.
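The expanding-window loop above can be sketched directly (the `fit`/`predict` callables and the mean-predictor example are illustrative stand-ins, not the actual models used):

```python
def expanding_window_predictions(ratings, features, fit, predict, min_train=2):
    """Train on episodes 1..t-1, predict episode t.

    The model at step t only ever sees episodes strictly before t,
    so temporal ordering is respected by construction.
    """
    preds = []
    for t in range(min_train, len(ratings)):
        model = fit(features[:t], ratings[:t])   # past episodes only
        preds.append(predict(model, features[t]))  # held-out episode t
    return preds

# Trivial example model: predict the running mean of past ratings.
def fit_mean(X, y):
    return sum(y) / len(y)

def predict_mean(model, x):
    return model

ratings = [8.1, 8.4, 7.9, 8.6]
preds = expanding_window_predictions(ratings, [None] * 4, fit_mean, predict_mean)
# preds[0] is the prediction for episode 3, trained on episodes 1-2
```

For the single-season Leave-One-Out case, each episode is instead predicted from all the others, which sacrifices the no-future guarantee in exchange for more training data per fold.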
Four curated episodes where the SP/Subs divergence tells a story about what visual analysis contributes
Side-by-side reasoning: SP (video-derived screenplay) vs Subs (dialogue only)
How Gemini 2.5 Flash converts frame grids into structured screenplays, vs raw subtitles
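The frame-grid step amounts to sampling frames at a fixed interval and tiling them into a single image for the vision model. A minimal sketch of the sampling geometry only — the interval and grid width here are hypothetical placeholders, not the pipeline's actual parameters:

```python
def sample_grid_layout(duration_s, interval_s=2.0, cols=4):
    """Map a video of `duration_s` seconds to grid cells.

    Returns (timestamp, row, col) for each sampled frame: one frame
    every `interval_s` seconds, filled left-to-right, top-to-bottom.
    Both `interval_s` and `cols` are assumed values for illustration.
    """
    n = int(duration_s // interval_s) + 1
    timestamps = [i * interval_s for i in range(n)]
    return [(ts, i // cols, i % cols) for i, ts in enumerate(timestamps)]

layout = sample_grid_layout(10, interval_s=2.0, cols=4)
# 6 frames at t = 0, 2, 4, 6, 8, 10 s, in a 4-column grid
```

Each tiled grid image is then sent to the vision model with a prompt asking for a structured screenplay; the subtitles-only condition skips this step entirely and passes raw dialogue.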