Pure Numbers: 2011

Monday, November 21, 2011

Mosquitoes!

At this very moment it's November, but it's still hot in New York; hot by my Chicago standards, anyway. I am ... well I think I've got a touch of a fever; it's been breweing as a cough for about a week, so I'm sweating, very warm. But I'm also suffering, at the torment of an evolutionary bully.

I'm temporarily staying with a coworker in his run-down Brooklyn apartment. The most important detail of my particular, admittedly unimpressive plight, is the windows in his three room apartment have no screens. Instead, many of these old buildings have these telescoping screen frames that are just propped up inside the window.

They do keep some of the bugs out, but the mosquitoes; practically the whole purpose of screens; they just pour right in; so long as the weather is above about 40 degrees.

I can tell, with just a glance in the mirror, especially in just the last couple of days, that I sleep on the right side of my body. It's illustrated in grotesque detail on my left arm; which is covered in scabs and scars, all of which have been acquired in just the past few weeks. I sleep soundly on my right side, and throughout the night they feed on the one part of me they can access.

A few hours later, I notice a slight itch, and I'll find a huge welt on my arm, or shoulder, or leg, or wherever. Within 12 hours, the large welt will sort of shrink into a tiny blister, maybe an eighth of an inch wide. The welt itched a bit, but blister itches like crazy. I'm complete unable to resist scratching at it.... I don't mean I'm fine one minute, and then completely and convulsively preoccupied with scratching an itch that wont' go away for an hour (though sometimes it feels that way), rather, I'll be minding my business, and instinct just drives me to reach over, and just scratch it, just a little. Half a beat later I realize and stop. Then a few minutes later it'll happen again.

This only has to happen like three times, just a little scratch, just here and there, and the blister breaks wide open, and blood starts oozing out of an even itchier wound, now a quarter inch wide. And they scab over and turn into these ugly scars. The worst part is, even if I do resist; It still scabs over, I still get the scar; it's just a little smaller.

----

Okay, enough complaining. People are dying all over the world from diseases transmitted by mosquito bites and this guy is complaining about the itching? I can almost hear you say. I don't really want to talk about my itching. I'm more interested in the itching; as in, "Why does it even itch?", "What's the point?"

See, mosquitoes are really old, a lot older than humans; somewhere between 75 to 100 million years old, depending on when you want to call this thing a mosquito, but that thing is just a more primitive fly (did you mosquitoes are a type of fly?). They've been around so long you could sort of say that mosquitoes are really good at "being mosquitoes".

With their short life cycles, they get extra chances at getting it right; A typical mosquito completes its life-cycle in warm months in around 40 days. That's like a year of its whole life for every day of a human's life cycle, or maybe a week of a smaller mammal. 75 million years of mammal evolution is like half a billion in mosquito years.

So... For almost all purposes, mosquitoes are close to perfect. If a trait doesn't really do them any good, they'll probably just, ditch it.

At any rate, 75 million years is plenty of time for mammals to deal with an accidental allergic reaction to mosquito saliva.

And so I wonder, "The mosquito drifts in, practically unnoticed, steals its meal, and it's gone from my life forever. Only later does the itching start. How can it possibly serve the mosquito after it leaves?"

I try flipping the idea on its head "Ok, maybe it doesn't serve the mosquito, it serves me. I react to the saliva and notice the mosquito just before it gets a chance to transmit some horrible parasite... but it doesn't itch at all until after the mosquito is long gone"

And the whole time as I'm reasoning this out, I catch occasional glances of a mosquito in the corner of my eye. If I feel a breeze or a twitch, in the back of my mind "oh, is that the mosquito landing on me?" and I reach to smack it ... and ... no, it's an old bite. And then I feel something else on the other side of my body: "Oh is that the mosquito" and I reach to slap that, but it's another old bite.

And now I'm actually thinking "Damn it if a mosquito really does land on me, I'll never notice it over all the itching... "

"Oh..." And the obviousness of it seems kind of ridiculous. It's never one mosquito, it's usually a whole swarm of them, one or two at a time, over the course of the whole summer (or in this case, autumn; I had arrived in October). It doesn't really have to serve the first mosquito, because it will serve the next. Once a host has been bitten a few times, and they itch in several places, the mosquito can slip in between the noise.

And come to think of it, It doesn't even have to be one mosquito after the next. I arrive in new york, exhausted from travel. I collapse on the air mattress and fall almost immediately into a sound slumber. When i wake up, there's not just one or two welts, but a dozen, all clustered around my forearm. And I think this was one mosquito, because I caught it a day or two later, and was undisturbed my mosquitoes for several more days.

Almost like the mosquito was laying a trap. I picture it scanning around my bed, looking for the best place to plant its nefarious seed, and it homes in on my forearm; exposed, easy to get to or away from.

And it lands, and then hops right back into the air. It doesn't even try to bite. Right now its just... testing. To see if i'm really asleep. after circling around once, she lands again, this time in almost the same place, but this time she bites. But she still doesn't stick around, she's back in the air immediately. Just checking to see if I'm sound asleep or just half asleep. She does this a few more times, each time landing in a slightly different part of my forearm, and injecting just a little bit more saliva.

And then, after she's done this a half a dozen times, even more; she still hasn't tasted a single drop of my precious humor; she flies away. Not too far, just... out of site. She waits. Just long enough for the allergy response to start.

And then she strikes. Boom, a completely different part of my body. Even if I do wake up; she's safe. I'll be so preoccupied with the itch on my forearm I won't even notice where she's really working. I'll shift around to reach my forearm, and she can just escape.

----

One more thing though... I mentioned that I'm new to New York. We have mosquitoes in Chicago, too. And they never seem to bother me. I smack them off my arms all the time, and never (or hardly ever) mistakenly slap an existing bite. Like the allergic reaction is just missing.

I think what's going on here is that the different populations of mosquitoes are interacting with my immune system in very different ways. In that environment, I was winning; the bites didn't itch, and so if I felt an little bit of an itch, it was not a bite, but an actual mosquito, and I could catch 'em.

Monday, May 30, 2011

Graph patterns and SQLAlchemy: Part Two

In the first part of this series, I introduced a basic technique for handling graph structures that supports efficient reading for path reach. That solution was sufficient for primarily read operations, but isn't much help when there are frequent writes, or if database persistence is undesirable for some uses of the model. In this part, I'll show a solution to those shortcomings.

So, from this point forward, we're only interested in solutions that can be used *without* resorting to a database, and which leave the models in an always consistent state. There are no sessions here!. SQLAlchemy will give us one leg-up, the mapper provides just a bit of magic, such that when you modify one side of an association, it takes care of the book-keeping on the other side. This works whether you have a database or not.

We can approach this from the point of view of breaking the problem into two parts. When you add an edge, you only have to worry about adding new paths, and when you remove an edge, you only need to worry about breaking existing paths; Adding an edge won't ever break any paths.

In fact, That first one is really easy to deal with. A new edge creates a pretty clear set of paths. the successor, and all of its successors gain the predecessor, and all of it's predecessors as predecessors themselves.


class Node(Base):
    # ...
    def add_successor(self, other):
        if other in self.successors:
            return
        self.successors.append(other)
        self.before.append(other)
        for descendent in other.before:
            if descendent not in self.before:
                self.before.append(descendent)
        for ancestor in self.after:
            if ancestor not in other.after:
                other.after.append(ancestor)

In the case of adding an edge, we are only really looking at the ancestors of the top node, and the descendants of the bottom node. If they aren't already, we connect 'em up.

What about removing an edge. Before we get too involved in what that does to the graph, lets look again at an example. here's a slightly modified Diamond graph


  a
 / \
b   \
|    c
e   /
 \ /
  d

suppose we wanted to remove the edge between a and b, turning it into:


  a
 / \
b   \
     c
e   /
 \ /
  d

The important thing to think about here is that the relation a < b still holds. c was not among the ancestors of b, nor among the descendants of e, but it was among the descendents of the ancestors of b and also among the ancestors of the descendents of e.

If you read that sentence carefully, you might already have a good guess about how to resolve the situation. C is an intermediate node between an ancestral node, a, and a descendent node d. That sure sounds awfully similar to the Floyd-Warshall algorithm. The only difference is that instead of looking at [all nodes × all nodes × all nodes], we are only interested in [ancestor nodes × descendent nodes × (ancestors→descendants ∩ descendents→ancestors)].

Not only does this narrow things down quite a bit, but it even simplifies the issue of expressing the algorithm in terms of the model itself.

There is one, last gotcha. It's entirely too easy to remove edges. Once you've found all of the edges that should remain, and go about trimming out the chaff, you need to make sure that the paths you're culling actually could have been among the paths that were broken.

This amounts to making and breaking connections based on the original set of nodes we're trying to resolve, rather than merging the whole output of the Floyd-Warshall algorithm right into the paths table directly.


class Node(Base):
    # ...
    def del_successor(self, other):
        if not self < other:
            # nodes are not connected, do nothing!
            return
        if not other in self.successors:
            # nodes aren't adjacent, but this *could*
            # be a warning...
            return

        self.successors.remove(other)

        # we build up a set of nodes that will be affected by the removal
        # we just did.  
        ancestors = set(other.after)
        descendents = set(self.before)

        # we also need to build up a list of nodes that will determine
        # where the paths may be.  basically, we're looking for every 
        # node that is both before some node in the descendents and
        # ALSO after the ancestors.  Such nodes might not be comparable
        # to self or other, but may still be part of a path between
        # the nodes in ancestors and the nodes in descendents.
        ancestors_descendents = set()
        for ancestor in ancestors:
            ancestors_descendents.add(ancestor)
            for descendent in ancestor.before:
                ancestors_descendents.add(descendent)

        descendents_ancestors = set()
        for descendent in descendents:
            descendents_ancestors.add(descendent)
            for ancestor in descendent.after:
                descendents_ancestors.add(ancestor)
        search_set = ancestors_descendents & descendents_ancestors

        known_good = set() # This is the 'paths' from the 
                           # original algorithm.  

        # as before, we need to initialize it with the paths we 
        # know are good.  this is just the successor edges in
        # the search set.
        for predecessor in search_set:
            for successor in search_set:
                if successor in predecessor.successors:
                    known_good.add((predecessor, successor))

        # We now can work our way through floyd_warshall to resolve
        # all adjacencies:
        for ancestor in ancestors:
            for descendent in descendents:
                if (ancestor, descendent) in known_good:
                    # already got this one, so we don't need to look for an
                    # intermediate.  
                    continue
                for intermediate in search_set:
                    if (ancestor, intermediate) in known_good \
                            and (intermediate, descendent) in known_good:
                        known_good.add((ancestor, descendent))
                        break # don't need to look any further for an
                              # intermediate, we can move on to the next
                              # descendent.  


        # sift through the bad nodes and update the links
        for ancestor in ancestors:
            for descendent in descendents:
                if descendent in ancestor.before \
                        and (ancestor, descendent) not in known_good:
                    ancestor.before.remove(descendent)

As you can see, that's a great deal more involved than anything we've done so far. It's also not any more efficient in the long term than doing a full rebuild, in fact it's a lot worse if you're adding and removing many nodes. But for changing a single node, it wins easily.

This still has some shortcomings, though. If you make changes via myNode.successors property instead of through the add/del methods, the graph goes out of sync. Since that's really more a matter of isolating a public interface from the private implementation, I'll leave that as an exercise for the reader.

A more interesting topic might be finding ways to make the path construction more efficient. One approach is to keep track of the path length. Another method to pursue might be to group changes together, so that you only have to pay the rebuild cost once for the whole change set, instead of over and over for each edge added or removed.

In some ways, it's probably best to solve all of these issues by moving the logic for keeping the paths list in the database, in the form of triggers or views or both.

If you'd like to see the code presented in this post all together, you can, at gitorious.org / ineedmorepower-examples

Sunday, May 29, 2011

Graph patterns and SQLAlchemy: Part One

In this part, I introduce the topic of generalized graph structures as used in relational databases, some of the basic approaches used, and the shortcomings of the simplest solutions. If you already know most of that stuff, you can skip straight to Part Two.

Relational Databases are near and dear to my heart. If the same is true for you, then you already know that the relational model, or at least the fraction of the model exposed in the least common denominator of SQL, has some key shortcomings. Just recently I found myself having to solve one of those shortcomings, the one about graphs.

This is a well studied problem, and there are lots of respectable solutions to lots of common problems. Alas, I not only am faced with the daunting challenge of supporting MySQL, but further more, I must face an abstraction. We use SQLAlchemy.

The important bit is that a well designed model layer, implemented in SQLAlchemy operates on at least two separate layers of abstraction, and the difference between the two lies in the keystone SQLAlchemy object, the session. Layers closer to the controller generally have access to the session instance, but layers far away aren't able to infer the session, and even if they could, wouldn't be able to know how to use it in a valid way.

To get at the point, consider a mapped model which has a special method that does something more interesting than set columns on itself, IE it operates on other instances.


def controller(Session):
    # create a session
    session = Session()
    instance = session.query(Model).get()
    instance.frob()

class Model():
    def frob(self):
        # can't use the session here!

This matters for graphs represented in relational databases, mapped by SQLAlchemy (and probably most other orm's), because changes to a single edge in the graph can have far reaching consequences on the total graph. Many nodes can become reachable from each other by adding one edge, and conversely, many nodes may become inaccessible for the loss of just one edge.

In my particular case, The graph I need to model is a partial order, which are conveniently represented as a Hass diagram. All of the ascii art should be understood like such: if two nodes, represented as letters, are connected by an edge (/, | or \), then the top node can reach the bottom node, but not the other way around, successors are visually lower than their predecessors. Since this sort of graph is transitive, then a node can reach another node so long as there is an unbroken sequence of edges connecting the first, through a chain of successors to the last.

On to the first example, the "diamond" graph, which is a useful test for all sorts of things:


  a
 / \
b   c
 \ /
  d

In this graph, a < d, by virtue of the edges connecting through b (and c) and ultimately to d. However, ¬(b < c), ¬(b > c), and (b ≠ c), since there is no path of successors from one to the other. b and c are said to be incomparable.

With that introduction out of the way, the idiomatic way of representing directed graphs is via the Associaton Table pattern. a table containing a column for the predecessor and successor is used to store all of the edges of interest in the graph. If the nodes also have values of interest (in my case, they certainly do), then you need a table for them, too.


from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

association_table = Table('edges', Base.metadata,
    Column('predecessor', Integer, 
           ForeignKey('nodes.id'), primary_key=True),
    Column('successor', Integer, 
           ForeignKey('nodes.id'), primary_key=True))
           ForeignKey('nodes.id'), primary_key=True))

class Node(Base):
    __tablename__ = 'nodes'
    id = Column(Integer, primary_key=True)
    # extra columns

    def __repr__(self):
        return '' % (self.id,)
    
    successors = relationship('Node', backref='predecessors',
        secondary=association_table,
        primaryjoin=id == association_table.c.predecessor,
        secondaryjoin=id == association_table.c.successor)

But, when we're using nodes, we don't really care about the immediate successors and predecessors, we want to know about the full chain... basically, we want to know:



class Node(Base):
    # ...
    def __lt__(self, other):
        # what goes here?

The core solution here is to compute (by hook or by crook), the transitive closure of the association. There are probably plenty of good algorithms for accomplishing this, but the algorithm I'll consider here is the one developed by Robert Floyd, Bernard Roy and Stephen Warshall. The birds eye view of this strategy is to consider every possible pair of nodes, with every possible intermediate node as a path between them, and if you can get to the second from the first via the third, keep a note of it, to make use for the remaining checks.

We want another table to keep the output of the Floyd Warshall algorithm in. Including the extra relationship on the Node model, we'll do this


path_table = Table('paths', Base.metadata,
    Column('predecessor', Integer, 
           ForeignKey('nodes.id'), primary_key=True),
    Column('successor', Integer, 
           ForeignKey('nodes.id'), primary_key=True))

class Node(Base):
    # ...
    before = relationship('Node', backref='after',
        secondary=path_table,
        primaryjoin=id == path_table.c.predecessor,
        secondaryjoin=id == path_table.c.successor)

    def __lt__(self, other):
        return other in self.before

The simplest way of populating the before/after relationship is to just execute the described algorithm as above. Since this operates by recomputing the whole path table, we don't have much option, except to work with a session.


def recompute_paths(session):
    # we recompute the whole path table, so we 
    # start by deleting all of the edges in it.
    session.execute(path_table.delete())

    # we keep a list of all of the paths we know about
    # in this implementation, I'm using a set of pairs,
    # since I don't happen to care about the length of the path,
    # if you need to compare path lengths, or if you expect the
    # resulting graph to be fairly dense, you should use a different
    # data structure
    paths = set()

    # we have to initialize the path matrix with
    # all of the paths we already know about, which is
    # the contents of the 'edges' table.
    for edge in session.execute(association_table.select()).fetchall():
        predecessor, successor = edge['predecessor'], edge['successor']
        paths.add((predecessor, successor))

    # finally, the meat of the algorithm
    # we're quering raw ID's because that's what the edges look like.
    nodes = [row.id for row in session.execute(select([Node.id]))]
    for predecessor in nodes:
        for successor in nodes:
            for intermediate in nodes:
                if ((predecessor, successor) in paths or
                         ((predecessor, intermediate) in paths and
                          (intermediate, successor) in paths)):
                    paths.add((predecessor, successor))
                    break; # move on to the next successor

    # last but not least, we have to get the data back into the database.
    session.execute(path_table.insert(), 
                    [dict(predecessor=path[0], successor=path[1])
                     for path in paths])
    session.flush()
    session.expire_all()

This certainly does the job. But there are a bunch of shortcomings with this approach. The most obvious is the fact the cataclysmic nature of implementation. Even if you only need to change one edge, you end up recreating a whole table. In realistic situations, its probably only a small percentage of nodes that would be affected by such a change.

But believe it or not, that's not the part that bothers me, personally the most. The issue in my mind is that this breaks the abstraction. Changes at the model level, something like: myNode.successors.append(yourNode) don't actually work, since the change doesn't get reflected in comparisons until you issue a session.add(myNode) and a recompute_paths(session).

To drive home the point, models are just that, object oriented models of something. They are used for the calculations, and are only secondarily meaningful in terms of data persistence. Requiring them to be in a particular state to perform correctly is what I would consider a code smell

In part two, I demonstrate the solution to both of those problems.

Wednesday, January 26, 2011

Sci-Fi Pet peeves

SciFi Pet peeve #2: "A life form based on _ instead of carbon".

The basic flaw of this sort of discussion is that life on earth is not 'based on carbon' for any explanation I've heard. A common way this trope is used is to move to an adjacent element in the periodic table, most frequently silicon. This makes sense to a junior high level of chemistry knowledge, after all, silicon has the same number of available electron orbitals, and therefore aught to be a reasonable substitute.

Interestingly, there's plenty of silicon on earth, making up more than 1/4 of the earth's crust (by weight). Carbon, by comparison, represents a mere 0.03%, nearly 1000 times more rare, and yet silicon is essentially not a part of life on earth. The reason is actually pretty strait forward, the chemical reactions involving silicon are not useful to metabolism. Just looking at the most obvious compound involving each element, you can see obvious differences, Carbon dioxide is a water soluble gas that can be easily made to react with other chemicals, Silicon dioxide is a hard ionic crystal (quartz) which requires a tremendous amount of energy to decompose.

Other substitutions are even more ridiculous, like 'nitrogen based life'. Life on earth is made predominantly of four elements, hydrogen, oxygen, carbon and nitrogen, with other elements in smaller percentages according to their usefulness in chemical reactions.

(posted here because it was too long to post on facebook. so what was pet peeve #1?)

SciFi Pet peeve #1: "Show, don't tell"

It's interesting how you can encounter something, some uncertain distaste, and then, some time later, discover that this unspecific feeling actually has a particular name. Just the act of naming it changes the character from uneasy distaste to sharp hatred. It's not the object that has changed, but your ability to identify it. Like knowing that weird, licoricy flavor you tolerated in the herbal tea given to you as a holiday gift is actually anise, and whenever you taste it now you immediately spit it out in revulsion.

In my case that spice is known as Expospeak, and it's certainly not exclusive to science fiction, but is seen rather commonly there. It's characterized by long spans of exposition, detailing the history, science or politics of a setting in ways that just simply do not move the plot.

I'm sure the perpetrators of this crime feel that by explaining themselves as the do, they help the reader with suspension of disbelief. I will tell all you writers now that nothing could be further from the truth. What you are doing instead is wasting the readers' time, because either they have already suspended our disbelief, and instead of giving us plot, you're making us skim over unimportant details, or if we doubt the setting you've composed, it's because we know more about it than you do. Imagination is a funny thing, the less information it has to go on, the better it works. A physicist who works on linear accelerators might be reading your work, and I assure you, he can invent reasons he believes to be plausible far better than you can.

Science fiction, or any fiction for that matter, is engaging because the consequences are interesting. If you really want to write about the causes, try popular mechanics.

It would be a *shame* for me to continue in any semblance of order, so here's Sci Fi Pet Peeve #3

Sci Fi Pet Peeve #3! The mischaracterization of researchers.

Very often a Sci-Fi Villian will be some mad scientist type who loses track of right and wrong in pursuit of science. Cackling away in the night, the deranged researcher commits some terrible crime, alone in his tower.

Most researchers are about a million miles away from this. For one thing, a good percentage of research just doesn't have any impact on the day to day lives of the people not in the lab. In the cases that it does, like medical science or high energy physics, there are numerous checks and balances to ensure proper safety, and on top of that, the researchers involved are usually acutely interested in the common good their research may bring.

There have been some notable historical counterexamples for when this breaks down, but there are a number of features that must happen for any kind of harm to follow.

A non research agenda is more important to the researcher than the outcome of the research. This is pretty important here, because objectivity is the root of good science.
Supporters. The normal checks and balances that prevent unjust actions must be willfully disabled, and that means everyone involved has to basically agree with the non-research agenda.

The most common way fictional scientists go wrong is by means of perverting the scientific method, attributing progress with action. "Lets go chop up some dudes and make medical progress". This kind of cavalier research method is sure to make very little progress indeed, and an experienced researcher will know that.

The most horrifying historical experiments were cases where the scientific method was applied with callous integrity and brutal efficiency. There was no one scientist performing unlikely tests to the horror of his peers, rather, it was a collaborative effort of many, with the intent of harming a whole group of people, with the mask of scientific research. Fringe science would have been too unreliable, instead well known toxins or pathogens were employed, and only the details of their already expected success could be extracted as new experimental evidence.

Thursday, January 20, 2011

Scope creep in a learning exercise.

In the interest of science, mainly to have a model project to design a framework around, I decided to look into a suitable API that would be equivalent to the MovieLister project discussed on this blog post. Unfortunately, no such api supporting that specific query appears to exist. imdbapi.com exports a REST service that yields some metadata about movies, but short of requesting every movie, by ID, there’s no way that service can be used to query the director->movies association.

IMDB does provide a more general solution. They offer a free as in beer ftp service where you can download their entire database.

As flat files.

Parsing the IMDB flat text database.

Lot’s of files, some of them are pretty big. They are not in a standard, recognizeable format like CSV or xml or sqldump, so I’ll have to create a parser, from scratch, to transform the files into something a python application can use.

The basic approach I’m using is building a generator, one for each filetype, to iterate python objects representing the data.

The first hickup is that it’s not in plain ascii and it’s not in unicode. I’m not too good at knowing one encoding from the next, so... googling led me to chardet, which does exactly what it says on the tin, offers a suggested encoding, with a confidence level.
But if only it were that simple.

Most lines in the file are plain ‘ol ascii. it doesn’t take too long before non-ascii shows up. using chardet on the earliest lines, most of which are of the form “‘El Nickname’ Gonzalez, Juan etc”. The only encoding (other than ascii) chardet reports for these first few hundred lines is ISO-8859-2, which is mostly used for Slavic languages, but seems to do a reasonable job at encoding spanish titles. Could the whole file be in this encoding?

Well, No.

Applying chardet to the first megabyte of the file returned windows-1255, which is used to encode hebrew. I’ll have to decode each line independently. In fact, there are a wide range of encodings used throughout the file:

dbor1234@ubuntu-virtbox:~/src/imdb$ time zcat actors.list.gz | python linedecode.py 
'    For further info visit http://www.imdb.com/licensing/contact\n'
573785031
{'Big5': 3074,
'EUC-JP': 49,
'EUC-KR': 33,
'EUC-TW': 17,
'GB2312': 5,
'IBM855': 759,
'IBM866': 2,
'ISO-8859-2': 1297461,
'ISO-8859-5': 12,
'ISO-8859-7': 4585,
'ISO-8859-8': 145,
'MacCyrillic': 1673,
'SHIFT_JIS': 2,
'TIS-620': 5,
'ascii': 9171862,
'utf-8': 2,
'windows-1251': 20867,
'windows-1252': 11833,
'windows-1255': 52164}
real 79m20.586s
user 78m31.530s
sys 0m40.423s

Alas, unicode isn’t very popular...

And so now I’m wacking away at getting a suitable parser to process this data. I’ve settled on using pyparsing, after an aborted attempt at a handbuilt parser. For the most part, the data is pretty well behaved. but there are some corner cases that make parsing frustrating. The first is that the year field, which seems to be required, may be malformed... most look like (1994/I) I don’t know what that means, but it’s added to the parser.

The next regression is dealing with titles that contain parenthesis. This is challenging because the only way to recognize the end of a film title is by the start of the year field, which begins with an open parenthesis. TV titles are easier since they are always double quoted. The solution? Use a regex. I recognize film titles by searching for the first position in the input that could be the year field, in a lookahead assertion.

Of course, the list goes on. To find entries that the parse cannot parse, i’ve written a function that tries to parse every line in the input, and when one raises an exception, it catches the parse exception and returns it with the offending input. My next discovery, year fields that are all questionmarks.

Another thing I’m noticing is that the form used for Film is infact overloaded, by following the year with an additional attrubute, such as (V) or (VG), to indicate (Presumably) video or game titles. There is no obvious legend for them and they are not trivially distinguishable from other attributes, since either are parenthesized and in the same position between year and role.