Sunday, May 29, 2011

Graph patterns and SQLAlchemy: Part One

In this part, I introduce the topic of generalized graph structures as used in relational databases, some of the basic approaches used, and the shortcomings of the simplest solutions. If you already know most of that stuff, you can skip straight to Part Two.



Relational Databases are near and dear to my heart. If the same is true for you, then you already know that the relational model, or at least the fraction of the model exposed in the least common denominator of SQL, has some key shortcomings. Just recently I found myself having to solve one of those shortcomings, the one about graphs.

This is a well-studied problem, and there are plenty of respectable solutions to the common cases. Alas, not only am I faced with the daunting challenge of supporting MySQL, but furthermore, I must work through an abstraction. We use SQLAlchemy.

The important bit is that a well-designed model layer implemented in SQLAlchemy operates on at least two separate levels of abstraction, and the difference between them lies in the keystone SQLAlchemy object: the session. Layers closer to the controller generally have access to the session instance, but layers farther away can't infer the session, and even if they could, they wouldn't know how to use it in a valid way.

To get to the point, consider a mapped model with a special method that does something more interesting than setting columns on itself, i.e. it operates on other instances.


def controller(Session, key):
    # create a session and load an instance by primary key
    session = Session()
    instance = session.query(Model).get(key)
    instance.frob()

class Model(object):
    # assume this class is mapped to a table

    def frob(self):
        # can't use the session here!
        pass

This matters for graphs represented in relational databases, mapped by SQLAlchemy (and probably most other ORMs), because changes to a single edge in the graph can have far-reaching consequences on the total graph. Many nodes can become reachable from each other by adding one edge, and conversely, many nodes may become unreachable for the loss of just one edge.

In my particular case, the graph I need to model is a partial order, which is conveniently represented as a Hasse diagram. All of the ASCII art should be read as follows: if two nodes, represented as letters, are connected by an edge (/, | or \), then the top node can reach the bottom node, but not the other way around; successors are drawn below their predecessors. Since this sort of relation is transitive, a node can reach another node so long as there is an unbroken sequence of edges connecting the first, through a chain of successors, to the last.

On to the first example, the "diamond" graph, which is a useful test for all sorts of things:


  a
 / \
b   c
 \ /
  d

In this graph, a < d, by virtue of the edges connecting a through b (or c) and ultimately to d. However, ¬(b < c), ¬(b > c), and (b ≠ c), since there is no path of successors from one to the other. b and c are said to be incomparable.

With that introduction out of the way, the idiomatic way of representing directed graphs is via the Association Table pattern: a table with a column for the predecessor and a column for the successor stores all of the edges of interest in the graph. If the nodes also carry values of interest (in my case, they certainly do), then you need a table for them, too.


from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

association_table = Table('edges', Base.metadata,
    Column('predecessor', Integer, 
           ForeignKey('nodes.id'), primary_key=True),
    Column('successor', Integer, 
           ForeignKey('nodes.id'), primary_key=True))

class Node(Base):
    __tablename__ = 'nodes'
    id = Column(Integer, primary_key=True)
    # extra columns

    def __repr__(self):
        return '<Node %r>' % (self.id,)
    
    successors = relationship('Node', backref='predecessors',
        secondary=association_table,
        primaryjoin=id == association_table.c.predecessor,
        secondaryjoin=id == association_table.c.successor)
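
To make the diamond graph concrete, here is a minimal sketch of how it might be built with this mapping; the in-memory SQLite engine, the sessionmaker setup and the variable names (a, b, c, d) are my own illustration rather than anything the model requires:

# a hypothetical sketch: building the diamond graph with the mapping above
engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()

a, b, c, d = Node(), Node(), Node(), Node()
a.successors = [b, c]
b.successors = [d]
c.successors = [d]
session.add_all([a, b, c, d])
session.commit()

assert b in a.successors      # the direct edge a -> b
assert b in d.predecessors    # and the backref works the other way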

But when we're using nodes, we don't really care about the immediate successors and predecessors; we want to know about the full chain... basically, we want to be able to answer:



class Node(Base):
    # ...
    def __lt__(self, other):
        # what goes here?
        pass


The core solution here is to compute (by hook or by crook) the transitive closure of the association. There are probably plenty of good algorithms for accomplishing this, but the one I'll consider here was developed by Robert Floyd, Bernard Roy and Stephen Warshall. The bird's-eye view of this strategy is to consider each possible intermediate node in turn, and for every possible pair of nodes, check whether you can get from the first to the second via that intermediate; if you can, keep a note of it, to make use of in the remaining checks.
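
Before wiring that into the database, here is a minimal in-memory sketch of the same idea, using plain strings as stand-ins for the diamond graph's nodes; all of the names in it are purely illustrative:

# a minimal, in-memory sketch of the reachability-only variant of the
# algorithm; the nodes and edges below are just the diamond graph from earlier
nodes = ['a', 'b', 'c', 'd']
paths = set([('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd')])

for intermediate in nodes:
    for predecessor in nodes:
        for successor in nodes:
            if ((predecessor, intermediate) in paths and
                    (intermediate, successor) in paths):
                paths.add((predecessor, successor))

assert ('a', 'd') in paths      # a reaches d through b (or c)
assert ('b', 'c') not in paths  # b and c remain incomparable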

We want another table to keep the output of the Floyd-Warshall algorithm in. Including the extra relationship on the Node model, we'll do this:


path_table = Table('paths', Base.metadata,
    Column('predecessor', Integer, 
           ForeignKey('nodes.id'), primary_key=True),
    Column('successor', Integer, 
           ForeignKey('nodes.id'), primary_key=True))

class Node(Base):
    # ...
    before = relationship('Node', backref='after',
        secondary=path_table,
        primaryjoin=id == path_table.c.predecessor,
        secondaryjoin=id == path_table.c.successor)

    def __lt__(self, other):
        return other in self.before

The simplest way of populating the before/after relationship is to just execute the algorithm described above. Since this operates by recomputing the whole path table, we don't have much choice except to work with a session.


def recompute_paths(session):
    # we recompute the whole path table, so we 
    # start by deleting all of the edges in it.
    session.execute(path_table.delete())

    # we keep a record of all of the paths we know about.
    # in this implementation, I'm using a set of pairs, since
    # I don't happen to care about the length of the path;
    # if you need to compare path lengths, or if you expect the
    # resulting graph to be fairly dense, you should use a different
    # data structure.
    paths = set()

    # we have to initialize the path matrix with
    # all of the paths we already know about, which is
    # the contents of the 'edges' table.
    for edge in session.execute(association_table.select()).fetchall():
        predecessor, successor = edge['predecessor'], edge['successor']
        paths.add((predecessor, successor))

    # we work with raw IDs, because that's what the edges look like.
    nodes = [row.id for row in session.execute(select([Node.id]))]

    # finally, the meat of the algorithm: the intermediate node goes in
    # the outermost loop, so that longer paths can be built up from the
    # shorter paths found in earlier iterations.
    for intermediate in nodes:
        for predecessor in nodes:
            for successor in nodes:
                if ((predecessor, intermediate) in paths and
                        (intermediate, successor) in paths):
                    paths.add((predecessor, successor))

    # last but not least, we have to get the data back into the database.
    session.execute(path_table.insert(), 
                    [dict(predecessor=path[0], successor=path[1])
                     for path in paths])
    session.flush()
    session.expire_all()
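
Picking up the hypothetical diamond-graph sketch from earlier, usage would look something like this:

# continuing the illustrative session and nodes from the diamond-graph sketch
recompute_paths(session)

assert a < d          # d is reachable from a, through b (or c)
assert not (b < c)    # b and c are incomparable
assert not (c < b)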

This certainly does the job. But there are a bunch of shortcomings to this approach. The most obvious is the cataclysmic nature of the implementation: even if you only need to change one edge, you end up recreating the whole table. In realistic situations, only a small percentage of nodes would probably be affected by such a change.

But believe it or not, that's not the part that bothers me the most, personally. The issue in my mind is that this breaks the abstraction. Changes at the model level, something like myNode.successors.append(yourNode), don't actually work, since the change doesn't get reflected in comparisons until you issue a session.add(myNode) and a recompute_paths(session).
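
To make that concrete, here is a hypothetical continuation of the earlier sketch; the node e exists purely for illustration:

# the relationship change is visible immediately, but comparisons lag
# behind until the path table is explicitly recomputed
e = Node()
d.successors.append(e)
session.add(e)

assert e in d.successors    # the ORM sees the new edge right away
assert not (d < e)          # but the 'paths' table hasn't been rebuilt yet

session.flush()             # make sure the new edge reaches the 'edges' table
recompute_paths(session)
assert d < e                # only now does the comparison catch up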

To drive home the point: models are just that, object-oriented models of something. They are used for calculations, and are only secondarily meaningful in terms of data persistence. Requiring them to be in a particular state to perform correctly is what I would consider a code smell.



In part two, I demonstrate the solution to both of those problems.
