Pure Numbers: Scope creep in a learning exercise.

In the interest of science, mainly to have a model project to design a framework around, I decided to look into a suitable API that would be equivalent to the MovieLister project discussed on this blog post. Unfortunately, no such api supporting that specific query appears to exist. imdbapi.com exports a REST service that yields some metadata about movies, but short of requesting every movie, by ID, there’s no way that service can be used to query the director->movies association.

IMDB does provide a more general solution. They offer a free as in beer ftp service where you can download their entire database.

As flat files.

Parsing the IMDB flat text database.

Lot’s of files, some of them are pretty big. They are not in a standard, recognizeable format like CSV or xml or sqldump, so I’ll have to create a parser, from scratch, to transform the files into something a python application can use.

The basic approach I’m using is building a generator, one for each filetype, to iterate python objects representing the data.

The first hickup is that it’s not in plain ascii and it’s not in unicode. I’m not too good at knowing one encoding from the next, so... googling led me to chardet, which does exactly what it says on the tin, offers a suggested encoding, with a confidence level.
But if only it were that simple.

Most lines in the file are plain ‘ol ascii. it doesn’t take too long before non-ascii shows up. using chardet on the earliest lines, most of which are of the form “‘El Nickname’ Gonzalez, Juan etc”. The only encoding (other than ascii) chardet reports for these first few hundred lines is ISO-8859-2, which is mostly used for Slavic languages, but seems to do a reasonable job at encoding spanish titles. Could the whole file be in this encoding?

Well, No.

Applying chardet to the first megabyte of the file returned windows-1255, which is used to encode hebrew. I’ll have to decode each line independently. In fact, there are a wide range of encodings used throughout the file:

dbor1234@ubuntu-virtbox:~/src/imdb$ time zcat actors.list.gz | python linedecode.py 
'    For further info visit http://www.imdb.com/licensing/contact\n'
573785031
{'Big5': 3074,
'EUC-JP': 49,
'EUC-KR': 33,
'EUC-TW': 17,
'GB2312': 5,
'IBM855': 759,
'IBM866': 2,
'ISO-8859-2': 1297461,
'ISO-8859-5': 12,
'ISO-8859-7': 4585,
'ISO-8859-8': 145,
'MacCyrillic': 1673,
'SHIFT_JIS': 2,
'TIS-620': 5,
'ascii': 9171862,
'utf-8': 2,
'windows-1251': 20867,
'windows-1252': 11833,
'windows-1255': 52164}
real 79m20.586s
user 78m31.530s
sys 0m40.423s

Alas, unicode isn’t very popular...

And so now I’m wacking away at getting a suitable parser to process this data. I’ve settled on using pyparsing, after an aborted attempt at a handbuilt parser. For the most part, the data is pretty well behaved. but there are some corner cases that make parsing frustrating. The first is that the year field, which seems to be required, may be malformed... most look like (1994/I) I don’t know what that means, but it’s added to the parser.

The next regression is dealing with titles that contain parenthesis. This is challenging because the only way to recognize the end of a film title is by the start of the year field, which begins with an open parenthesis. TV titles are easier since they are always double quoted. The solution? Use a regex. I recognize film titles by searching for the first position in the input that could be the year field, in a lookahead assertion.

Of course, the list goes on. To find entries that the parse cannot parse, i’ve written a function that tries to parse every line in the input, and when one raises an exception, it catches the parse exception and returns it with the offending input. My next discovery, year fields that are all questionmarks.

Another thing I’m noticing is that the form used for Film is infact overloaded, by following the year with an additional attrubute, such as (V) or (VG), to indicate (Presumably) video or game titles. There is no obvious legend for them and they are not trivially distinguishable from other attributes, since either are parenthesized and in the same position between year and role.

Pure Numbers

Thursday, January 20, 2011

Scope creep in a learning exercise.

Parsing the IMDB flat text database.

No comments:

Post a Comment