Assignment 1: IMDB movie scores

Python is a very convenient language for text and data processing and this assignment will hopefully highlight this.

Preparations

IMDB provides bulk data for a number of different things. In this assignment, you will only need the ratings.list.gz file.

You are free (and encouraged) to use anything useful you find in the standard library. All third-party modules (including numpy, scipy, etc.) are forbidden, i.e. your code should run on a clean install of Python 3.4 or later.

A: Loading movies

The assignment is to build your own local "movie database" from the data in ratings.list.gz. Each entry in this "database" must contain, at a minimum:

Title
Year
IMDB rank/rating

Hints and constraints:

You are allowed to first uncompress the gzipped file. Can you avoid it?
Look at the data file. Where is the data you want? (Hint: It starts around line 296)
TV-series can be considered movies for this assignment
Any entry that is listed more than once because it has multiple episodes must be loaded only once. See e.g. $#*! My Dad Says on lines 305-323.
Your "database" could be something as simple as a list of tuples, although more interesting, and arguably better, options exist. You are free to choose any representation that you feel is convenient.
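As a rough illustration of the hints above, the sketch below reads the gzipped file directly with gzip.open and keeps one entry per (title, year) pair in a dict. The line layout encoded in LINE_RE, the latin-1 encoding, and the names parse_line and load_movies are assumptions made for this example, so check them against your own copy of the file before relying on them.

```python
import gzip
import re

# Assumed layout of a data line (verify against your copy of the file):
#   distribution  votes  rating  title (year) [optional episode/extra info]
# e.g. "      0000000125  1728818   9.2  The Shawshank Redemption (1994)"
LINE_RE = re.compile(
    r"^\s+[0-9.*]{10}\s+(?P<votes>\d+)\s+(?P<rating>\d+\.\d)\s+"
    r"(?P<title>.+?)\s+\((?P<year>\d{4})(?:/[IVXLC]+)?\)"
)


def parse_line(line):
    """Return (title, year, rating, votes), or None for non-data lines."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return (
        match.group("title"),
        int(match.group("year")),
        float(match.group("rating")),
        int(match.group("votes")),
    )


def load_movies(path):
    """Read the gzipped file directly and keep one entry per (title, year)."""
    movies = {}
    # Text-mode gzip.open decompresses on the fly, so no manual unpacking
    # is needed. The latin-1 encoding is an assumption.
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        for line in handle:
            parsed = parse_line(line)
            if parsed is None:
                continue
            title, year, rating, votes = parsed
            # setdefault keeps the first occurrence, so later episode
            # lines for the same series are ignored.
            movies.setdefault((title, year), (rating, votes))
    return movies
```

A dict keyed on (title, year) makes the duplicate check cheap, but a list of tuples or a collections.namedtuple would work just as well. Header and footer lines that do not match the pattern are skipped.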

B: Yearly movie statistics

Given the movie database you created in A you should

find the best movie for a given year.
find the mean and standard deviation of ratings for a given year
for each year in the range 1980-2015, print the best movie and the mean and standard deviation for that year.
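The statistics module, part of the standard library since Python 3.4, covers the mean and standard deviation. The sketch below assumes movies are stored as (title, year, rating) tuples. The sample data and the names best_movie and year_stats are hypothetical. It uses the population standard deviation (pstdev); whether the sample version (stdev) is more appropriate is a choice you should make and state yourself.

```python
import statistics

# Hypothetical sample data: (title, year, rating) tuples.
MOVIES = [
    ("Movie A", 1984, 8.0),
    ("Movie B", 1984, 6.0),
    ("Movie C", 1984, 7.0),
    ("Movie D", 1985, 9.0),
]


def best_movie(movies, year):
    """Return the highest-rated entry for the year, or None if there is none."""
    candidates = [movie for movie in movies if movie[1] == year]
    if not candidates:
        return None
    return max(candidates, key=lambda movie: movie[2])


def year_stats(movies, year):
    """Return (mean, population standard deviation) of the year's ratings."""
    ratings = [rating for _, movie_year, rating in movies if movie_year == year]
    if not ratings:
        return None
    return statistics.mean(ratings), statistics.pstdev(ratings)
```

Both functions return None for a year without entries, which lets the calling loop skip or mark such years instead of crashing.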

Hints and constraints:

The printed list should be easy to read with pretty columns
Your list will probably be dominated by movies no one has heard of. Can you apply some trivial logic to the data to get a more representative set of top movies?
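One possible reading of these two hints is sketched below: drop entries with few votes, and use str.format width specifiers for aligned columns. The vote threshold, the column widths, the (title, year, rating, votes) tuple layout, and the function names are all assumptions chosen for illustration.

```python
MIN_VOTES = 1000  # arbitrary threshold, chosen only for illustration

# Column layout: year, title (truncated to 30 characters), rating, mean, std.
HEADER = "{:<6}{:<30}{:>7}{:>7}{:>7}".format(
    "Year", "Best movie", "Rating", "Mean", "Std"
)


def popular(movies, min_votes=MIN_VOTES):
    """Keep only (title, year, rating, votes) tuples with enough votes."""
    return [movie for movie in movies if movie[3] >= min_votes]


def format_row(year, title, rating, mean, std):
    """Format one fixed-width table row matching HEADER."""
    return "{:<6d}{:<30.30}{:>7.1f}{:>7.2f}{:>7.2f}".format(
        year, title, rating, mean, std
    )
```

Printing HEADER once and then format_row(...) for each year gives rows of equal width, because ".30" truncates long titles while "<30" pads short ones.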

C: Open assignment

The last exercise can be summarized as "do something interesting within the context of the code you just wrote". The list of ideas below should give a rough idea of the intended amount of work required.

For this assignment, you are allowed to use external/third-party modules. You will, however, need to give detailed installation instructions. It must also be possible to run parts A and B without these third-party modules.

Some ideas:

Add more IMDB movie data to your movie database. For example, you could parse the actors.list.gz or directors.list.gz files to find the actors and/or directors of a movie.
Fetch movie data from an online source like the Open Movie Database. You will probably want to take a look at the urllib package.
Turn your code into a command line application which can be called as e.g. $ ./bestmovie.py 1984. Your application must behave like any other command line application and support e.g. the -h and --help flags.
Visualize the data (or parts of it) in some way
Calculate some other interesting statistic
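For the command line idea, the standard library's argparse module generates -h/--help output automatically. The following is a minimal sketch only: the --min-votes option and the names build_parser and main are hypothetical, and the lookup itself is left as a placeholder.

```python
import argparse


def build_parser():
    """Create the argument parser; argparse adds -h/--help by itself."""
    parser = argparse.ArgumentParser(
        description="Print the best movie of a given year."
    )
    parser.add_argument("year", type=int, help="year to look up, e.g. 1984")
    # Hypothetical option, included only to show an optional flag.
    parser.add_argument(
        "--min-votes",
        type=int,
        default=1000,
        help="ignore movies with fewer votes (default: %(default)s)",
    )
    return parser


def main(argv=None):
    """Parse the arguments; argv=None makes argparse read sys.argv."""
    args = build_parser().parse_args(argv)
    # Placeholder: call your own code from parts A and B here.
    print("Looking up the best movie of {}".format(args.year))
    return args

# In a script, finish with:
#     if __name__ == "__main__":
#         main()
```

Letting main accept an optional argv list makes the command line handling easy to exercise from a test or an interactive session, e.g. main(["1984"]).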

Submission

Submissions are accepted as Python modules, scripts, or IPython notebooks.

Single scripts or notebooks can be attached and sent by email
If you have multiple files, please put them in a zip or tar.gz archive that I can easily access (/nobackup/, in your www-tree, ...) and send me a link or instructions.

Page responsible: Hannes Ovrén
Last updated: 2015-10-21

Department of Electrical Engineering (ISY)
