Netflix is well known for its data-driven approach to content creation, as shown by the success of the $100m production “House of Cards” and the Oscar-nominated documentary “The Square”. This video-on-demand service provider, which has disrupted the market in recent years, can rely on the data its 104 million subscribers generate. But who says you need 104 million subscribers to predict the success of a movie?
We worked with MSc Business Analytics students at Imperial College Business School in London to try to predict the commercial performance and critical reception of a movie, using only publicly available open data and machine learning.
How did we do it?
And, more interestingly, could content creators easily use open data to adopt a Netflix-like data-driven approach?
Step 1: We collected, structured and processed masses of data in a time-efficient manner
First of all, let’s define what we mean by “open data.” Open data is data that anybody can access freely and that generally does not contain personally identifiable information (PII). The open data sources we used for this project are IMDb, The Numbers, Box Office Mojo, and FXTOP for currency conversion. From these sources, we collected data points on 11,000 movies and classified them based on more than 300 criteria, such as:
- popularity of actors and directors, based on number of movies and awards, number of likes and retweets, or career momentum
- movie genre and size of the target audience; thrillers and dramas have wider mass appeal than documentaries and film noir
- actor face recognition on promotional posters
- production studios’ past successes
- number of trailers
- country of origin
- age restriction
- film duration
- keywords extracted from text mining the movie description
- contextual trends linked to release date and concurrent exchange rates
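To make the idea of “classifying movies on criteria” more concrete, here is a minimal sketch of how a handful of the criteria above could be turned into numbers a model can consume. It is an illustration under assumptions, not the pipeline used in the project: the `Movie` fields, the `GENRES` list, the keyword matching and the example values are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical genre vocabulary; the real project used 300+ criteria.
GENRES = ["thriller", "drama", "documentary", "film-noir"]


@dataclass
class Movie:
    # All field names are illustrative, not the project's actual schema.
    title: str
    genre: str
    duration_min: int
    trailer_count: int
    studio_past_hits: int
    description: str


def encode(movie, keywords):
    """Turn one movie record into a flat numeric feature vector."""
    # One-hot encode the genre: 1.0 in the matching slot, 0.0 elsewhere.
    genre_flags = [1.0 if movie.genre == g else 0.0 for g in GENRES]
    # Flag each keyword that appears in the description (very crude text mining).
    words = set(movie.description.lower().split())
    keyword_flags = [1.0 if k in words else 0.0 for k in keywords]
    # Plain numeric criteria are passed through as floats.
    numeric = [
        float(movie.duration_min),
        float(movie.trailer_count),
        float(movie.studio_past_hits),
    ]
    return genre_flags + keyword_flags + numeric


# Invented example record, for illustration only.
example = Movie(
    "Example Film", "thriller", 95, 3, 4,
    "A gunslinger hunts a sorcerer across worlds",
)
features = encode(example, keywords=["gunslinger", "wedding"])
```

Each movie becomes one row of numbers like `features`; stacking the rows for all movies gives the table a learning algorithm works on.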
Step 2: We used machine learning to build and iterate on our predictive model
Where human expertise and manpower would have been limited, machine learning processed these millions of data points in just a few seconds, and the algorithm picked up correlations between parameters that would have been impossible to spot manually.
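To give a feel for what “learning from data” means, here is a deliberately tiny sketch: an ordinary least-squares fit of US gross against a single feature. This is a stand-in chosen for brevity, not the algorithm behind the model described in this post, and the numbers and function names are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance terms (unnormalised; the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented toy data, in $ millions: production budget vs US gross.
budgets = [10.0, 20.0, 30.0, 40.0]
grosses = [25.0, 45.0, 65.0, 85.0]
slope, intercept = fit_line(budgets, grosses)


def predict(budget):
    """Predicted gross for a given budget, using the fitted line."""
    return slope * budget + intercept
```

A realistic model does the same thing in spirit, but over many variables at once and with far messier data.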
Step 3: We applied the model to the US adaptation of Stephen King’s novel “The Dark Tower”
A few weeks later, we came up with a model based on a hundred variables that was supposedly twice as good at predicting box office success as a simple rule-based model (e.g. one based solely on the past success of the movie’s actors, director and genre). In a fraction of a second, the model could predict the US box office performance of any given movie.
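For comparison, a simple rule-based model of the kind mentioned above could look something like the sketch below. The exact rules of the baseline are not spelled out in this post, so this is one assumed reading: average the historical mean gross of each actor, the director and the genre. All names and figures in `PAST_GROSS` are invented.

```python
from statistics import mean

# Invented past US grosses in $ millions, keyed by hypothetical names.
PAST_GROSS = {
    "actor": {"Actor A": [80.0, 120.0], "Actor B": [40.0]},
    "director": {"Director X": [60.0, 30.0]},
    "genre": {"fantasy": [90.0, 50.0, 70.0]},
}


def rule_based_prediction(actors, director, genre):
    """Average the historical mean gross of each known actor, director and genre."""
    signals = []
    for actor in actors:
        history = PAST_GROSS["actor"].get(actor)
        if history:
            signals.append(mean(history))
    for kind, name in (("director", director), ("genre", genre)):
        history = PAST_GROSS[kind].get(name)
        if history:
            signals.append(mean(history))
    if not signals:
        raise ValueError("no history available for this movie")
    return mean(signals)


baseline = rule_based_prediction(["Actor A", "Actor B"], "Director X", "fantasy")
```

A baseline like this is cheap to build and easy to explain, which is exactly why it makes a useful yardstick for a more complex model.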
We decided to give it a go on a movie that had not been released yet: “The Dark Tower” by Nikolaj Arcel, starring Idris Elba and Matthew McConaughey, and with a budget of $60 million.
We predicted a total gross revenue of $70 million in the US. However, three months later the film had earned only $50 million.
We’re willing to admit that our model might not be as sophisticated as Netflix’s, which can dig into huge amounts of behavioural data. But does this mean that our model is faulty? Not necessarily: a single movie tells us very little, and to be sure we would have to test the model on hundreds of movies to assess its actual performance. Perhaps we should have taken more data history into account, or even added more data sources.
In short: building a predictive model is not easy, and there’s no magic recipe! It is a continuous process of iteration, testing, and learning, and it takes time. But every little helps, so if you’re reading this from the States, would you please go and buy your ticket for “The Dark Tower”? 🙂
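As a rough sketch of what that assessment could look like, the snippet below computes the error on our one real data point and shows the function a larger backtest would use. The choice of percentage error as the metric and the function names are assumptions made for illustration; only the $70 million and $50 million figures come from this post.

```python
def percentage_error(predicted, actual):
    """Signed error as a percentage of the actual value (positive = overestimate)."""
    return (predicted - actual) / actual * 100.0


def mean_absolute_percentage_error(pairs):
    """Average absolute percentage error over (predicted, actual) pairs."""
    return sum(abs(percentage_error(p, a)) for p, a in pairs) / len(pairs)


# The one data point reported in this post, in $ millions.
dark_tower = (70.0, 50.0)

# (70 - 50) / 50 = 0.4, i.e. the forecast was about 40% above the actual gross.
error = percentage_error(*dark_tower)

# A proper backtest would pass hundreds of such pairs; with a single pair,
# the average is simply that one error.
backtest = mean_absolute_percentage_error([dark_tower])
```

With only one pair in the list, the result says more about this particular film than about the model, which is why a much larger test set is needed.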