MovieLists Dataset

Abstract: User content curation is becoming an important source of preference data, as well as providing information regarding the items being curated. One popular approach involves the creation of lists. On Twitter, these lists might contain user accounts relevant to a particular topic, whereas on a community site such as the Internet Movie Database (IMDb), this might take the form of lists of movies sharing common characteristics. While list curation implicitly involves substantial combined effort on the part of users, researchers have rarely looked at mining the outputs of this kind of crowdsourcing activity. Here we study a large collection of movie lists from IMDb. We apply network analysis methods to a graph that reflects the degree to which pairs of movies are "co-listed", that is, assigned to the same lists. This allows us to uncover a more nuanced categorisation of movies that goes beyond simple metadata, such as genre or era.

In collection Insight Centre for Data Analytics

Label: MovieLists Dataset : movielists_20130821.zip
Extent: 1 zip-compressed file archive, 14415652 bytes
Date of publication
Date issued:
Type of Resource
software, multimedia
Physical description
1 zip-compressed file archive
14415652 bytes — Digital origin: born digital (application/zip)
MD5 checksum: Be1617e429d05e7760e94790f93d2b49
Zip compressed file archive.
Movielists_20130821.zip
Downloaded from the Insight Centre web site at http://mlg.ucd.ie/movielists/index.html, 2014-10-06
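The MD5 checksum recorded above can be used to confirm that a downloaded copy of the archive is intact. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `matches_checksum` are illustrative and are not part of the dataset or this record.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's MD5 digest with a published value, ignoring letter case."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the archive has been saved locally as movielists_20130821.zip, calling `matches_checksum("movielists_20130821.zip", "Be1617e429d05e7760e94790f93d2b49")` should return True for an unmodified copy. The comparison is case-insensitive because hexadecimal digests do not depend on letter case.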
Data
To examine the information provided by user-curated movie lists, we constructed a new dataset from IMDb during July 2013. Collection was restricted to lists covering items such as feature films, documentaries, and TV shows/episodes. From the initial set of 121k lists and 249k movies, we constructed a co-listed graph (i.e. a graph of movies co-assigned to the same lists). We subsequently normalised and thresholded this graph to produce a normalised co-listed graph. Details of the normalisation process are described in our paper (arXiv:1308.5125).
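As an illustration of the co-listing idea described above, the sketch below counts, for every pair of movies, the number of lists that contain both. It is a toy example under stated assumptions, not the code used to build the dataset: the function name `colisted_edge_weights` and the IMDb IDs in `toy_lists` are hypothetical. The normalisation and thresholding steps are not shown, since they are specified only in the paper.

```python
from collections import Counter
from itertools import combinations


def colisted_edge_weights(lists):
    """Count, for each unordered pair of movies, how many lists contain both."""
    weights = Counter()
    for members in lists:
        # De-duplicate within a list and sort so each pair has one canonical key.
        for pair in combinations(sorted(set(members)), 2):
            weights[pair] += 1
    return weights


# Toy input: three hypothetical lists of IMDb IDs (not taken from the dataset).
toy_lists = [
    ["tt0000001", "tt0000002", "tt0000003"],
    ["tt0000001", "tt0000002"],
    ["tt0000002", "tt0000003"],
]
weights = colisted_edge_weights(toy_lists)
```

Each key of `weights` is an edge of the co-listed graph, and its value is the number of shared lists, matching the edge weight described for the un-normalised graph.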
Download
We make the pre-processed versions of our graph data available here. The data is provided for non-commercial and research purposes only.
Note
In both graphs, movie nodes are identified by their unique IMDb IDs ttXXXXXXX (e.g. tt1375666 = "Inception"). Each node also has a "title" attribute, indicating the movie's title.
Table of Contents
imdb-colisted.graphml: Complete co-listed graph, with no normalisation. Each node corresponds to a movie. An edge exists between two movies if they are assigned to one or more lists together. The weight on each edge indicates the number of lists that they share.
imdb-normalised.graphml: Normalised co-listed graph, thresholded at 0.1. This graph was used in the analysis described in our paper.
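The files listed above can be read with any GraphML-aware tool. The sketch below uses only the Python standard library and assumes the files follow the standard GraphML schema. The node attribute name "title" comes from the note above; the edge attribute name "weight" is an assumption and may need adjusting after inspecting the files. The function name `read_graphml` is illustrative.

```python
import xml.etree.ElementTree as ET

# Standard GraphML XML namespace, in ElementTree's "{uri}" prefix form.
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"


def read_graphml(source):
    """Parse a GraphML file (path or binary file object) into (titles, edges).

    titles: dict mapping node id -> value of its "title" attribute (or None)
    edges:  list of (source_id, target_id, weight); weight is None if absent
    """
    root = ET.parse(source).getroot()
    # <key> elements map short ids (e.g. "d0") to attribute names (e.g. "title").
    key_names = {
        key.get("id"): key.get("attr.name")
        for key in root.iter(GRAPHML_NS + "key")
    }

    def attributes(element):
        return {
            key_names.get(data.get("key")): data.text
            for data in element.findall(GRAPHML_NS + "data")
        }

    titles = {
        node.get("id"): attributes(node).get("title")
        for node in root.iter(GRAPHML_NS + "node")
    }
    edges = []
    for edge in root.iter(GRAPHML_NS + "edge"):
        weight = attributes(edge).get("weight")  # assumed attribute name
        edges.append((
            edge.get("source"),
            edge.get("target"),
            float(weight) if weight is not None else None,
        ))
    return titles, edges
```

After extracting the archive, `titles, edges = read_graphml("imdb-colisted.graphml")` would load the un-normalised graph, and `titles.get("tt1375666")` would look up a movie by its IMDb ID. This approach loads the whole file into memory, which may be slow for the larger graph.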
Genre
Dataset   linked data (dct)
Funding
Science Foundation Ireland, grant no. SFI-12-RC-2289

Referenced by
Greene, Derek; Cunningham, Pádraig, 1962-. Discovering Latent Patterns from the Analysis of User-Curated Movie Lists. arXiv:1308.5125 [physics].
Record source
Prepared by staff of UCD Library, University College Dublin

Rights & Usage Conditions
