Modeling the Oscar for Best Picture (and Some Insights About XGBoost)

The Academy Awards are a week away, and I’m sharing my machine-learning-based predictions for Best Picture as well as some insights I took away from the process (particularly XGBoost’s sparsity-aware split finding). Oppenheimer is a heavy favorite at 97% likely to win—but major surprises are not uncommon, as we’ll see.

I pulled data from three sources. First, industry awards. Most unions and guilds for filmmakers (producers, directors, actors, cinematographers, editors, production designers) have their own awards. Second, critical awards. I collected as widely as possible, from the Golden Globes to the Georgia Film Critics Association. More or less: If an organization had a Wikipedia page showing a historical list of nominees and/or winners, I scraped it. Third, miscellaneous information like Metacritic score and keywords taken from synopses to learn whether a film was adapted from a book, what genre it is, the topics it covers, and so on. Combining all of these was a pain, especially for films that have bonkers names like BİRDMAN or (The Unexpected Virtue of Ignorance).

The source data generally aligns with what FiveThirtyEight used to do, except I cast a far wider net in collecting awards. Another difference is the method: FiveThirtyEight chose a closed-form solution for weighting the importance of awards and then rated films by the “points” they accrued (out of the potential pool of points) throughout the season. I chose to build a machine learning model, which was tricky.

To make the merging of data feasible (e.g., different tables had different spellings of the film or different years associated with the film), I only looked at the movies that received a nomination for Best Picture, making for a tiny dataset of 591 rows for the first 95 ceremonies. The wildly small N presents a challenge for building a machine learning model, as do sparsity and missing data.

Sparsity and Missing Data

There are a ton of zeroes in the data, creating sparsity. Every variable (save for the Metacritic score) is binary. Nomination variables (i.e., was the film nominated for the award?) may have multiple films for a given year with a 1, but winning variables (i.e., did the film win the award?) only have a single 1 each year.

There is also the challenge of missing data. Not every award in the model goes back to the late 1920s, meaning that a film has an NA for any award that did not yet exist in the year it was released. For example, I only included Metacritic scores for contemporaneous releases, and the site launched in 2001, while the Screen Actors Guild started their awards in 1995.
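To make the NA pattern concrete, here is a minimal sketch of how one award variable might be encoded. The function name and the example rows are hypothetical, and treating the cutoff as "film year before the award's first year" is an assumption; only the 1995 start year comes from the text above.

```python
import math

def encode_award(film_year, award_start_year, nominated):
    """Return 1.0 or 0.0 when the award existed for this film, NaN otherwise."""
    if film_year < award_start_year:
        return math.nan  # the award did not exist yet: unknown, not "lost"
    return 1.0 if nominated else 0.0

# Hypothetical rows: (film year, was the film nominated for the award?)
rows = [(1993, False), (1995, True), (1996, False)]
sag_nominated = [encode_award(year, 1995, nom) for year, nom in rows]
print(sag_nominated)  # [nan, 1.0, 0.0]
```

The point of the distinction is that a zero says "this film was eligible and missed out," while NaN says "this award carries no information for this film."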

My first thought was an ensemble model. Segment each group of awards, based on their start date, into different models. Get predicted probabilities from these, and combine them, weighting each model by the inverse of its out-of-sample error. After experimenting a bit, I came to the conclusion so many of us do when building models: Use XGBoost. With so little data to use for tuning, I simply stuck with model defaults for hyper-parameters.
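For reference, inverse-error weighting of the kind described in that first idea can be sketched in a few lines. The numbers are made up for illustration, and the error metric (for example, log loss on held-out years) is left unspecified because the text does not name one.

```python
def inverse_error_weights(errors):
    """Weights proportional to 1/error, normalized to sum to one."""
    inverse = [1.0 / e for e in errors]
    total = sum(inverse)
    return [w / total for w in inverse]

def combine(probabilities, errors):
    """Weighted average of sub-model probabilities for a single film."""
    weights = inverse_error_weights(errors)
    return sum(w * p for w, p in zip(weights, probabilities))

# Hypothetical: three sub-models' probabilities for one film,
# and each sub-model's out-of-sample error
probs = [0.90, 0.60, 0.30]
errors = [0.10, 0.20, 0.40]
weights = inverse_error_weights(errors)
combined = combine(probs, errors)
print([round(w, 3) for w in weights], round(combined, 3))
```

The sub-model with the smallest error gets the largest say, so the combined probability sits closer to 0.90 than a plain average would.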

Beyond its reputation for being accurate out of the box, XGBoost handles missing data natively. The docs simply state: “XGBoost supports missing values by default. In tree algorithms, branch directions for missing values are learned during training.” This is discussed in deeper detail in the “sparsity-aware split finding” section of the paper introducing XGBoost. The full algorithm is shown in that paper, but the general idea is that an optimal default direction at each split in a tree is learned from the data, and missing values follow that default.
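The default-direction idea can be illustrated with a simplified sketch. This is not XGBoost's implementation: it looks at one feature and one fixed threshold, whereas the algorithm in the paper scans all candidate thresholds over the non-missing entries. The gain below is the usual gradient/hessian structure score with the constant factor and complexity penalty dropped, and the function names and toy data are hypothetical.

```python
import math

def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Gain of a split from summed gradients (g) and hessians (h)."""
    def score(g, h):
        return g * g / (h + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return score(g_left, h_left) + score(g_right, h_right) - parent

def best_default_direction(values, grads, hessians, threshold, lam=1.0):
    """Try sending missing rows left, then right; keep the better direction."""
    sums = {"left": [0.0, 0.0], "right": [0.0, 0.0], "missing": [0.0, 0.0]}
    for x, g, h in zip(values, grads, hessians):
        if math.isnan(x):
            key = "missing"
        elif x < threshold:
            key = "left"
        else:
            key = "right"
        sums[key][0] += g
        sums[key][1] += h
    (gl, hl), (gr, hr), (gm, hm) = sums["left"], sums["right"], sums["missing"]
    gains = {
        "left": split_gain(gl + gm, hl + hm, gr, hr, lam),
        "right": split_gain(gl, hl, gr + gm, hr + hm, lam),
    }
    return max(gains, key=gains.get), gains

# Toy data: a binary "won award X" feature with two missing rows.
nan = math.nan
values = [0.0, 0.0, 0.0, 1.0, 1.0, nan, nan]
labels = [0, 0, 0, 1, 1, 1, 1]
# Logistic-loss gradients and hessians at an initial prediction of 0.5
grads = [0.5 - y for y in labels]
hessians = [0.25] * len(labels)

direction, gains = best_default_direction(values, grads, hessians, threshold=0.5)
print(direction)  # right
```

In this toy case the rows with a missing value look like winners, so the split gains more by routing them to the same side as the films that won the award.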

Backtesting

To assess performance, I backtested on the last thirty years of Academy Awards. I believe scikit-learn would call this leave-one-group-out cross-validation (group k-fold with one fold per year): I removed a given year from the dataset, fit the model, and then made predictions on the held-out year. The last hiccup is that the model does not know that if Movie A from Year X wins Best Picture, Movies B through E from Year X cannot. It also does not know that one of the films from Year X must win. My workaround is to re-scale the predicted probabilities within each year so that they sum to one.
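The two mechanical pieces of that backtest, holding out one year at a time and re-scaling within the held-out year, can be sketched as follows. The function names and the raw probabilities are hypothetical, and the model-fitting step is left out entirely.

```python
def leave_one_year_out(years):
    """Yield (training years, held-out year) pairs, one per distinct year."""
    distinct = sorted(set(years))
    for held_out in distinct:
        yield [y for y in distinct if y != held_out], held_out

def normalize_within_year(raw_probs):
    """Re-scale one year's raw win probabilities so they sum to one."""
    total = sum(raw_probs.values())
    return {film: p / total for film, p in raw_probs.items()}

# Each split trains on every year except the held-out one
splits = list(leave_one_year_out([2020, 2020, 2021, 2022]))

# Hypothetical raw model outputs for the nominees of one held-out year
raw = {"Movie A": 0.80, "Movie B": 0.30, "Movie C": 0.10}
scaled = normalize_within_year(raw)
print({film: round(p, 3) for film, p in scaled.items()})
```

Re-scaling keeps the ranking of films within a year unchanged; it only forces the probabilities to respect the fact that exactly one nominee wins.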

The predictions for the last thirty years:

Year | Predicted Winner | Modeled Win Probability | Won Best Picture? | Actual Winner
1993 | schindler’s list | 0.996 | 1 | schindler’s list
1994 | forrest gump | 0.990 | 1 | forrest gump
1995 | apollo 13 | 0.987 | 0 | braveheart
1996 | the english patient | 0.923 | 1 | the english patient
1997 | titanic | 0.980 | 1 | titanic
1998 | saving private ryan | 0.938 | 0 | shakespeare in love
1999 | american beauty | 0.995 | 1 | american beauty
2000 | gladiator | 0.586 | 1 | gladiator
2001 | a beautiful mind | 0.554 | 1 | a beautiful mind
2002 | chicago | 0.963 | 1 | chicago
2003 | the lord of the rings: the return of the king | 0.986 | 1 | the lord of the rings: the return of the king
2004 | the aviator | 0.713 | 0 | million dollar baby
2005 | brokeback mountain | 0.681 | 0 | crash
2006 | the departed | 0.680 | 1 | the departed
2007 | no country for old men | 0.997 | 1 | no country for old men
2008 | slumdog millionaire | 0.886 | 1 | slumdog millionaire
2009 | the hurt locker | 0.988 | 1 | the hurt locker
2010 | the king’s speech | 0.730 | 1 | the king’s speech
2011 | the artist | 0.909 | 1 | the artist
2012 | argo | 0.984 | 1 | argo
2013 | 12 years a slave | 0.551 | 1 | 12 years a slave
2014 | birdman | 0.929 | 1 | birdman
2015 | spotlight | 0.502 | 1 | spotlight
2016 | la la land | 0.984 | 0 | moonlight
2017 | the shape of water | 0.783 | 1 | the shape of water
2018 | roma | 0.928 | 0 | green book
2019 | parasite | 0.576 | 1 | parasite
2020 | nomadland | 0.878 | 1 | nomadland
2021 | the power of the dog | 0.981 | 0 | coda
2022 | everything everywhere all at once | 0.959 | 1 | everything everywhere all at once

Of the last 30 years, 23 predicted winners actually won, while 7 lost, for an accuracy of about 77%. Not terrible. (And, paradoxically, many of the misses are predictable ones to those familiar with Best Picture history.) However, the mean predicted probability of winning across these 30 cases is about 85%, which means the model is roughly 8 points over-confident.

We do see recent years being more prone to upsets. Is that due to a larger pool of nominees? Or something else, like a change in the Academy’s makeup or voting procedures? At any rate, some ideas I am going to play with before next year are weighting more recent years higher (as rules, voting body, voting trends, etc., change over time), finding additional awards, and pulling in other metadata on films. It might just be, though, that the Academy likes to swerve away from everyone else sometimes in a way that is not readily predictable from outside data sources. (Hence the fun of watching and speculating and modeling in the first place.)
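The accuracy and over-confidence figures quoted above can be reproduced directly from the backtest table. The pairs below are the modeled win probability of each year's predicted winner and whether it won, copied from the table in order from 1993 to 2022; the variable names are just for illustration.

```python
# (modeled win probability of the predicted winner, did it win?), 1993-2022
backtest = [
    (0.996, 1), (0.990, 1), (0.987, 0), (0.923, 1), (0.980, 1),
    (0.938, 0), (0.995, 1), (0.586, 1), (0.554, 1), (0.963, 1),
    (0.986, 1), (0.713, 0), (0.681, 0), (0.680, 1), (0.997, 1),
    (0.886, 1), (0.988, 1), (0.730, 1), (0.909, 1), (0.984, 1),
    (0.551, 1), (0.929, 1), (0.502, 1), (0.984, 0), (0.783, 1),
    (0.928, 0), (0.576, 1), (0.878, 1), (0.981, 0), (0.959, 1),
]

# Share of years in which the predicted winner actually won
accuracy = sum(won for _, won in backtest) / len(backtest)
# Average probability the model assigned to its own pick
mean_confidence = sum(p for p, _ in backtest) / len(backtest)
# Positive gap means the model is more confident than it is accurate
gap = mean_confidence - accuracy

print(f"{accuracy:.3f} {mean_confidence:.3f} {gap:.3f}")  # 0.767 0.851 0.085
```

A well-calibrated model would have a gap near zero: picks made at 85% confidence should win about 85% of the time.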

This Year

I wanted to include a chart showing probabilities over time, but the story has largely remained the same. The major inflection point was the Directors Guild of America (DGA) Awards.

Using only the data we had on the day the nominees were announced (January 23rd), the predictions were:

Film Predicted Probability
Killers of the Flower Moon 0.549
The Zone of Interest 0.160
Oppenheimer 0.147
American Fiction 0.061
Barbie 0.039
Poor Things 0.023
The Holdovers 0.012
Past Lives 0.005
Anatomy of a Fall 0.005
Maestro 0.001

I was shocked to see Oppenheimer lagging in third and to see The Zone of Interest so high. The reason is that, while backtesting, I saw that the variable importance for winning the DGA award for Outstanding Directing - Feature Film was the highest by about a factor of ten, and on January 23rd that winner was not yet known. Since XGBoost handles missing values nicely, we can rely on sparsity-aware split finding to get a little more information from these data. If we know the nominees of an award but not yet the winner, we can still encode what we do know: Anyone who was nominated is left NA, while anyone who was not nominated is set to zero. That allows us to partially use this DGA variable (and the other awards where we knew the nominees on January 23rd, but not the winners). When we do that, the predicted probabilities as of the announcement of the Best Picture nominees were:

Film Predicted Probability
Killers of the Flower Moon 0.380
Poor Things 0.313
Oppenheimer 0.160
The Zone of Interest 0.116
American Fiction 0.012
Barbie 0.007
Past Lives 0.007
Maestro 0.003
Anatomy of a Fall 0.002
The Holdovers 0.001
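The partial encoding behind this second set of predictions (nominees left NA, non-nominees set to zero) might look like the sketch below. The function name is hypothetical, and the nominee set is deliberately partial: it contains only the two films this post names as DGA nominees, not the full list.

```python
import math

def encode_win_variable(film, nominees, winner=None):
    """Encode 'won the award' for one film, given what is known so far."""
    if film not in nominees:
        return 0.0  # not nominated, so it cannot win
    if winner is None:
        return math.nan  # nominated, but the winner is not announced yet
    return 1.0 if film == winner else 0.0

# Partial nominee set: only films this post names as DGA nominees
known_nominees = {"Oppenheimer", "Poor Things"}
films = ["Oppenheimer", "Poor Things", "The Zone of Interest"]

# Before the DGA ceremony: nominees are NA, everyone else is zero
before = {f: encode_win_variable(f, known_nominees) for f in films}
# After the ceremony: the variable is fully known
after = {f: encode_win_variable(f, known_nominees, winner="Oppenheimer") for f in films}

print(before)
print(after)
```

The zero is what pushes a non-nominee down: the model learns from past years that films with a zero on this variable rarely win Best Picture, while an NA simply follows the learned default direction at each split.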

The Zone of Interest falls in favor of Poor Things, since the former was not nominated for the DGA award while the latter was. I was still puzzled, but I knew that the model wouldn’t start being certain until we knew the DGA winner. Those top three films were nominated for many of the same awards. Then Christopher Nolan won the DGA award for Oppenheimer, and the film hasn’t been below a 95% chance of winning Best Picture since.

Final Predictions

The probabilities as they stand today, a week before the ceremony, have Oppenheimer as the presumptive winner at a 97% chance of winning.

Film Predicted Probability
Oppenheimer 0.973
Poor Things 0.010
Killers of the Flower Moon 0.005
The Zone of Interest 0.004
Anatomy of a Fall 0.003
American Fiction 0.002
Past Lives 0.001
Barbie 0.001
The Holdovers 0.001
Maestro 0.000

There are a few awards being announced tonight (the Satellite Awards, plus the awards for the cinematographers guild and the editors guild), but they should not impact the model much. So, we are in for a year of a predictable winner, or another shocking year where a CODA or a Moonlight takes home film’s biggest award. (If you’ve read this far and enjoyed Cillian Murphy in Oppenheimer… go check out his leading performance in Sunshine, directed by Danny Boyle and written by Alex Garland.)