The Critic in the (Anthropic) Machine

Can AI predict how audiences will rate a TV episode — just from watching it?

A vision model generates scene-by-scene screenplays from raw video. A frontier reasoning model scores each episode across dozens of viewer-experience dimensions. No audience signals, no ratings — just the content itself. How close can this get to what millions of viewers collectively decide?

Predicted vs Actual

Every episode across 4 shows — toggle source and model type

Episode Explorer

Actual IMDb rating vs model prediction for every episode — select a show and config

Time Series

Rating trajectory across episodes

Scatter by Season

Predicted vs actual, colored by season

Model Comparison

Cross-section comparisons: SP (video-derived screenplay) vs Subs (subtitles only), Opus vs Sonnet, Sequential vs Independent

Best R² Per Show

Sequential expanding-window cross-validation

Columns: Show, Episodes, Baseline ρ, Best R², Best Recipe, Excl. R²

SP vs Subs: Matched Pairs

Same method, same modifier — only the input source differs
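
As a concrete illustration of the pairing, the comparison can be expressed as a pivot over a results table. The sketch below is hypothetical: the column names and R² values are invented, and the project's actual bookkeeping is not shown on this page.

```python
import pandas as pd

# Hypothetical results table: one row per (method, modifier, source) cell.
# Because each (method, modifier) pair appears once per source, the
# difference in R^2 isolates the contribution of the input source.
results = pd.DataFrame([
    {"method": "ridge", "modifier": "base",    "source": "SP",   "r2": 0.41},
    {"method": "ridge", "modifier": "base",    "source": "Subs", "r2": 0.33},
    {"method": "pcr1",  "modifier": "derived", "source": "SP",   "r2": 0.28},
    {"method": "pcr1",  "modifier": "derived", "source": "Subs", "r2": 0.30},
])  # values are invented for illustration

pairs = results.pivot_table(index=["method", "modifier"],
                            columns="source", values="r2")
pairs["sp_minus_subs"] = pairs["SP"] - pairs["Subs"]
print(pairs)
```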

Per-Season Prediction Quality

Where does prediction work, and where does it break down?

Methodology
Step 1: Video + Subtitles
Step 2: Gemini 2.5 Flash Lite (frame → screenplay)
Step 3: Claude Opus 4.6 (score 32–51 dimensions)
Step 4: 8 trained recipes (Ridge / PCR-1)
Step 5: Sequential CV (predict IMDb rating)
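
For concreteness, Step 4's two model families map onto standard scikit-learn pipelines. This is a sketch under one assumption: "PCR-1" is read here as principal-component regression with a single component, which follows from the name but is not spelled out on this page.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge, LinearRegression

# Ridge: regularized linear regression over all scored dimensions.
ridge_recipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# PCR-1 (assumed reading): project the features onto their first principal
# component, then fit ordinary least squares on that single score.
pcr1_recipe = make_pipeline(StandardScaler(), PCA(n_components=1),
                            LinearRegression())
```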

Base Features (All Sources)

  • 17 craft dimensions (dialogue, pacing, coherence, character logic, emotional resonance)
  • 9 engagement dimensions (binge pull, phone-check risk, stakes, WTF moments)
  • 3 viewer archetypes (analyst, emotional, spectacle)
  • 3 anticipation dimensions (cliffhanger, resolution, surprise)

Visual + Derived (SP Only / Zero-Cost)

  • 6 validated visual dimensions (R² > 0.3): beat spacing, rhythm, believability, foreshadowing, body language, amplification
  • 3 derived metrics (structured surprise, Goodhart signal, earned surprise)
  • 12 deterministic metrics from screenplay structure (n_scenes, location_entropy, etc.)
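
However the scores are produced, each episode ultimately has to become one row of a design matrix. A minimal sketch, assuming the dimensions arrive as a flat name-to-score mapping; the function and the example dimension names are illustrative, not the project's actual schema:

```python
import numpy as np

def feature_vector(scores: dict[str, float], order: list[str]) -> np.ndarray:
    """Concatenate scores in a fixed column order so every episode's
    vector lines up with the same regression columns."""
    return np.array([scores[name] for name in order], dtype=float)

# Subs episodes use only the base dimensions; SP episodes append the
# visual, derived, and deterministic groups to the same base order.
base_order = ["dialogue", "pacing", "binge_pull"]      # illustrative names
sp_order = base_order + ["beat_spacing", "n_scenes"]   # illustrative names
```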

Recipe Sweep Configuration

Sequential Expanding Window: Train on episodes 1..t-1, predict episode t. This respects temporal ordering — the model never sees future episodes. For single-season shows (Fargo), Leave-One-Out CV is used instead.
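
A minimal sketch of both schemes, using Ridge as the recipe. The `min_train` warm-up size is a hypothetical parameter; the page does not say how many episodes seed the first window.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut

def expanding_window_predictions(X, y, min_train=5):
    """Train on episodes 1..t-1, predict episode t, for each t >= min_train.
    X and y must already be in airing order; the model never sees the future."""
    preds, actuals = [], []
    for t in range(min_train, len(y)):
        model = Ridge(alpha=1.0).fit(X[:t], y[:t])
        preds.append(model.predict(X[t:t + 1])[0])
        actuals.append(y[t])
    return np.array(preds), np.array(actuals)

def loo_predictions(X, y):
    """Leave-One-Out fallback for single-season shows: hold out one episode,
    train on the rest, repeat for every episode."""
    preds = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
        preds.append(model.predict(X[test_idx])[0])
    return np.array(preds), y

# Usage: preds, actuals = expanding_window_predictions(X, y)
#        print(r2_score(actuals, preds))
```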

SP vs Subs: What Vision Adds

Four curated episodes where the SP/Subs divergence tells a story about what visual analysis contributes

What the Model "Sees"

Side-by-side reasoning: SP (video-derived screenplay) vs Subs (dialogue only)

Video → Screenplay: Frame Grid Examples

How Gemini 2.5 Flash Lite converts frame grids into structured screenplays, vs raw subtitles
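
The page does not state the sampling rate or grid geometry, so the sketch below shows one plausible construction: resize a handful of sampled frames and tile them into a single image that a vision model can read alongside the subtitles. The column count and tile size are assumptions.

```python
from PIL import Image

def make_frame_grid(frames: list[Image.Image], cols: int = 4,
                    tile: tuple[int, int] = (320, 180)) -> Image.Image:
    """Tile sampled frames left-to-right, top-to-bottom into one grid image."""
    rows = -(-len(frames) // cols)  # ceiling division
    grid = Image.new("RGB", (cols * tile[0], rows * tile[1]))
    for i, frame in enumerate(frames):
        grid.paste(frame.resize(tile),
                   ((i % cols) * tile[0], (i // cols) * tile[1]))
    return grid
```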