Monday, May 30, 2011

Graph patterns and SQLAlchemy: Part Two

In the first part of this series, I introduced a basic technique for handling graph structures that supports efficient reachability queries. That solution was sufficient for mostly-read workloads, but it isn't much help when writes are frequent, or when database persistence is undesirable for some uses of the model. In this part, I'll show a solution to those shortcomings.



So, from this point forward, we're only interested in solutions that can be used *without* resorting to a database, and which leave the models in an always-consistent state. There are no sessions here! SQLAlchemy does give us one leg up: the mapper provides just a bit of magic, such that when you modify one side of an association, it takes care of the bookkeeping on the other side. This works whether you have a database or not.

We can break the problem into two parts. When you add an edge, you only have to worry about creating new paths, and when you remove an edge, you only have to worry about breaking existing paths; adding an edge won't ever break a path.

In fact, that first case is easy to deal with. A new edge creates a well-defined set of paths: the successor, and all of its successors, gain the predecessor, and all of its predecessors, as predecessors themselves.


class Node(Base):
    # ...
    def add_successor(self, other):
        if other in self.successors:
            return
        self.successors.append(other)
        # every ancestor of self (self included) gains every
        # descendant of other (other included) as a descendant;
        # iterate over copies, since the backrefs mutate the
        # relationship lists as we append
        for ancestor in [self] + list(self.after):
            for descendent in [other] + list(other.before):
                if descendent not in ancestor.before:
                    ancestor.before.append(descendent)


In the case of adding an edge, we are really only looking at the ancestors of the top node and the descendants of the bottom node. If they aren't already connected, we connect 'em up.
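For intuition, here's the same update sketched without the ORM, using a bare set of (predecessor, successor) pairs in place of the paths table; `add_edge` and the variable names here are illustrative, not part of the model above.

```python
def add_edge(paths, pred, succ):
    # paths is a set of (predecessor, successor) pairs
    ancestors = {a for (a, b) in paths if b == pred} | {pred}
    descendants = {b for (a, b) in paths if a == succ} | {succ}
    # every ancestor of pred (pred included) now reaches every
    # descendant of succ (succ included)
    for a in ancestors:
        for d in descendants:
            paths.add((a, d))
    return paths

paths = set()
add_edge(paths, 'a', 'b')
add_edge(paths, 'b', 'd')
# ('a', 'd') appears automatically, without a full recompute
```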

What about removing an edge? Before we get too involved in what that does to the graph, let's look again at an example. Here's a slightly modified diamond graph:


  a
 / \
b   \
|    c
e   /
 \ /
  d

Suppose we wanted to remove the edge between a and b, turning it into:


  a
   \
b   \
|    c
e   /
 \ /
  d

The important thing to notice here is that the relation a < d still holds. c was not among the ancestors of b, nor among the descendants of e, but it was among the descendants of the ancestors of b, and also among the ancestors of the descendants of e.


If you read that sentence carefully, you might already have a good guess about how to resolve the situation. c is an intermediate node between an ancestral node, a, and a descendant node, d. That sounds awfully similar to the Floyd-Warshall algorithm. The only difference is that instead of looking at [all nodes × all nodes × all nodes], we are only interested in [ancestor nodes × descendant nodes × (descendants of the ancestors ∩ ancestors of the descendants)].


Not only does this narrow things down quite a bit, it also simplifies expressing the algorithm in terms of the model itself.
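The narrowing can be sketched with plain pair-sets (letters stand in for mapped Node instances; the closure here is the state *before* the a-b edge is removed, and the extra x-y edge is there just to show that unrelated nodes stay out of the search set):

```python
paths = {('a', 'b'), ('a', 'c'), ('c', 'd'), ('b', 'd'), ('a', 'd'),
         ('x', 'y')}                    # x-y is an unrelated edge

# removing the edge a -> b: self is a, other is b
ancestors = {p for (p, s) in paths if s == 'b'}     # other.after
descendants = {s for (p, s) in paths if p == 'a'}   # self.before

# descendants of the ancestors, and ancestors of the descendants
anc_desc = ancestors | {s for (p, s) in paths if p in ancestors}
desc_anc = descendants | {p for (p, s) in paths if s in descendants}

search_set = anc_desc & desc_anc
# x and y never enter the picture; only a, b, c, d need to be examined
```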


There is one last gotcha: it's entirely too easy to remove edges. Once you've found all of the edges that should remain and go about trimming out the chaff, you need to make sure that the paths you're culling actually could have been among the paths that were broken.

This amounts to making and breaking connections based only on the original set of nodes we're trying to resolve, rather than merging the whole output of the Floyd-Warshall pass directly into the paths table.


class Node(Base):
    # ...
    def del_successor(self, other):
        if not self < other:
            # nodes are not connected, do nothing!
            return
        if other not in self.successors:
            # nodes aren't adjacent, but this *could*
            # be a warning...
            return

        self.successors.remove(other)

        # we build up a set of nodes that will be affected by the removal
        # we just did.  
        ancestors = set(other.after)
        descendents = set(self.before)

        # we also need to build up a list of nodes that will determine
        # where the paths may be.  basically, we're looking for every 
        # node that is both before some node in the descendents and
        # ALSO after the ancestors.  Such nodes might not be comparable
        # to self or other, but may still be part of a path between
        # the nodes in ancestors and the nodes in descendents.
        ancestors_descendents = set()
        for ancestor in ancestors:
            ancestors_descendents.add(ancestor)
            for descendent in ancestor.before:
                ancestors_descendents.add(descendent)

        descendents_ancestors = set()
        for descendent in descendents:
            descendents_ancestors.add(descendent)
            for ancestor in descendent.after:
                descendents_ancestors.add(ancestor)
        search_set = ancestors_descendents & descendents_ancestors

        known_good = set() # This is the 'paths' from the 
                           # original algorithm.  

        # as before, we need to initialize it with the paths we 
        # know are good.  this is just the successor edges in
        # the search set.
        for predecessor in search_set:
            for successor in search_set:
                if successor in predecessor.successors:
                    known_good.add((predecessor, successor))

        # We now can work our way through a narrowed Floyd-Warshall
        # pass to resolve all of the surviving paths.  Note that the
        # intermediate node must be the *outermost* loop, so that
        # paths running through more than one intermediate are
        # found as well:
        for intermediate in search_set:
            for ancestor in search_set:
                for descendent in search_set:
                    if (ancestor, descendent) in known_good:
                        # already got this one, no intermediate needed
                        continue
                    if (ancestor, intermediate) in known_good \
                            and (intermediate, descendent) in known_good:
                        known_good.add((ancestor, descendent))


        # sift through the bad nodes and update the links
        for ancestor in ancestors:
            for descendent in descendents:
                if descendent in ancestor.before \
                        and (ancestor, descendent) not in known_good:
                    ancestor.before.remove(descendent)

As you can see, that's a great deal more involved than anything we've done so far. It's also no more efficient in the long run than doing a full rebuild; in fact, it's a lot worse if you're adding and removing many edges. But for changing a single edge, it wins easily.
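As a sanity check on what edge removal *should* do, here's the modified diamond worked through with plain pair-sets, using a full closure rebuild as the reference implementation (a set-based stand-in, not the mapped model):

```python
def closure(edges):
    # plain Floyd-Warshall transitive closure over a set of pairs
    paths = set(edges)
    nodes = {n for edge in edges for n in edge}
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if (i, k) in paths and (k, j) in paths:
                    paths.add((i, j))
    return paths

# the modified diamond from above
edges = {('a', 'b'), ('a', 'c'), ('b', 'e'), ('e', 'd'), ('c', 'd')}
paths = closure(edges)
assert ('a', 'd') in paths and ('a', 'b') in paths

# remove the a-b edge and rebuild
edges.remove(('a', 'b'))
paths = closure(edges)
# a < d survives via c, but a < b and a < e are gone
assert ('a', 'd') in paths
assert ('a', 'b') not in paths and ('a', 'e') not in paths
```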

This still has some shortcomings, though. If you make changes via the myNode.successors property instead of through the add/del methods, the graph goes out of sync. Since that's really more a matter of isolating a public interface from a private implementation, I'll leave it as an exercise for the reader.

A more interesting topic might be finding ways to make the path construction more efficient. One approach is to keep track of path lengths. Another is to group changes together, so that you pay the rebuild cost once for the whole change set, instead of over and over for each edge added or removed.
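The path-length idea can be sketched over plain pair-sets too; `closure_with_lengths` and its dict-of-lengths representation are illustrative, not part of the model above:

```python
def closure_with_lengths(edges):
    # map (pred, succ) -> length of the shortest chain of edges
    dist = {(p, s): 1 for (p, s) in edges}
    nodes = {n for edge in edges for n in edge}
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if (i, k) in dist and (k, j) in dist:
                    d = dist[(i, k)] + dist[(k, j)]
                    if (i, j) not in dist or d < dist[(i, j)]:
                        dist[(i, j)] = d
    return dist

closure_with_lengths({('a', 'b'), ('b', 'c'), ('a', 'c')})
# ('a', 'c') keeps length 1: the direct edge beats the 2-step chain
```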

In some ways, it's probably best to solve all of these issues by moving the logic for maintaining the paths table into the database itself, in the form of triggers, views, or both.

If you'd like to see the code presented in this post all together, you can find it at gitorious.org / ineedmorepower-examples

Sunday, May 29, 2011

Graph patterns and SQLAlchemy: Part One

In this part, I introduce the topic of generalized graph structures as used in relational databases, some of the basic approaches used, and the shortcomings of the simplest solutions. If you already know most of that stuff, you can skip straight to Part Two.



Relational Databases are near and dear to my heart. If the same is true for you, then you already know that the relational model, or at least the fraction of the model exposed in the least common denominator of SQL, has some key shortcomings. Just recently I found myself having to solve one of those shortcomings, the one about graphs.

This is a well-studied problem, and there are lots of respectable solutions to lots of common problems. Alas, not only am I faced with the daunting challenge of supporting MySQL, but furthermore, I must face an abstraction: we use SQLAlchemy.

The important bit is that a well-designed model layer implemented in SQLAlchemy operates on at least two separate layers of abstraction, and the difference between the two lies in the keystone SQLAlchemy object: the session. Layers closer to the controller generally have access to the session instance, but layers farther away aren't able to infer the session, and even if they could, wouldn't know how to use it in a valid way.

To get to the point, consider a mapped model which has a special method that does something more interesting than setting columns on itself, i.e., it operates on other instances.


def controller(Session):
    # create a session
    session = Session()
    instance = session.query(Model).get(some_id)
    instance.frob()

class Model(Base):
    def frob(self):
        # can't use the session here!
        ...

This matters for graphs represented in relational databases, mapped by SQLAlchemy (and probably most other orm's), because changes to a single edge in the graph can have far reaching consequences on the total graph. Many nodes can become reachable from each other by adding one edge, and conversely, many nodes may become inaccessible for the loss of just one edge.

In my particular case, the graph I need to model is a partial order, which is conveniently represented as a Hasse diagram. All of the ASCII art should be understood like so: if two nodes, represented as letters, are connected by an edge (/, | or \), then the top node can reach the bottom node, but not the other way around; successors are visually lower than their predecessors. Since this sort of graph is transitive, a node can reach another node so long as there is an unbroken sequence of edges connecting the first, through a chain of successors, to the last.

On to the first example, the "diamond" graph, which is a useful test for all sorts of things:


  a
 / \
b   c
 \ /
  d

In this graph, a < d, by virtue of the edges connecting through b (and c) and ultimately to d. However, ¬(b < c), ¬(b > c), and (b ≠ c), since there is no path of successors from one to the other. b and c are said to be incomparable.
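Spelled out as pairs, the diamond's reachability looks like this (a set-based stand-in for what the paths table will hold; the name `reaches` is illustrative):

```python
# every (predecessor, successor) pair in the diamond's closure
reaches = {('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd'), ('a', 'd')}

# a < d, even though there is no direct edge between them
assert ('a', 'd') in reaches

# b and c are incomparable: neither reaches the other
assert ('b', 'c') not in reaches
assert ('c', 'b') not in reaches
```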

With that introduction out of the way: the idiomatic way of representing directed graphs is the Association Table pattern. A table containing a column for the predecessor and a column for the successor stores all of the edges of interest in the graph. If the nodes also have values of interest (in my case, they certainly do), then you need a table for them, too.


from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

association_table = Table('edges', Base.metadata,
    Column('predecessor', Integer, 
           ForeignKey('nodes.id'), primary_key=True),
    Column('successor', Integer, 
           ForeignKey('nodes.id'), primary_key=True))

class Node(Base):
    __tablename__ = 'nodes'
    id = Column(Integer, primary_key=True)
    # extra columns

    def __repr__(self):
        return '<Node %r>' % (self.id,)
    
    successors = relationship('Node', backref='predecessors',
        secondary=association_table,
        primaryjoin=id == association_table.c.predecessor,
        secondaryjoin=id == association_table.c.successor)

But when we're using nodes, we don't really care about the immediate successors and predecessors; we want to know about the full chain. Basically, we want to be able to write:



class Node(Base):
    # ...
    def __lt__(self, other):
        # what goes here?


The core solution is to compute (by hook or by crook) the transitive closure of the association. There are plenty of good algorithms for accomplishing this, but the one I'll consider here is the one developed by Robert Floyd, Bernard Roy, and Stephen Warshall. The bird's-eye view of this strategy is to consider every possible pair of nodes, with every possible intermediate node as a path between them; if you can get to the second from the first via the third, keep a note of it, and make use of it in the remaining checks.
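Before wiring it into the ORM, the algorithm itself can be sketched over a bare edge set (`transitive_closure` is an illustrative name, not part of the model):

```python
def transitive_closure(edges):
    paths = set(edges)
    nodes = {n for edge in edges for n in edge}
    # the intermediate node is the outermost loop: by the time we're
    # done with an intermediate k, every path whose intermediates are
    # among the nodes processed so far has been recorded
    for intermediate in nodes:
        for pred in nodes:
            for succ in nodes:
                if (pred, intermediate) in paths \
                        and (intermediate, succ) in paths:
                    paths.add((pred, succ))
    return paths

transitive_closure({(1, 2), (2, 3), (3, 4)})
# contains (1, 3), (2, 4) and (1, 4) in addition to the original edges
```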

We want another table to keep the output of the Floyd-Warshall algorithm in. Including the extra relationship on the Node model, we'll do this:


path_table = Table('paths', Base.metadata,
    Column('predecessor', Integer, 
           ForeignKey('nodes.id'), primary_key=True),
    Column('successor', Integer, 
           ForeignKey('nodes.id'), primary_key=True))

class Node(Base):
    # ...
    before = relationship('Node', backref='after',
        secondary=path_table,
        primaryjoin=id == path_table.c.predecessor,
        secondaryjoin=id == path_table.c.successor)

    def __lt__(self, other):
        return other in self.before

The simplest way of populating the before/after relationship is just to execute the algorithm described above. Since this operates by recomputing the whole path table, we don't have much option except to work with a session.


def recompute_paths(session):
    # we recompute the whole path table, so we 
    # start by deleting all of the edges in it.
    session.execute(path_table.delete())

    # we keep a list of all of the paths we know about
    # in this implementation, I'm using a set of pairs,
    # since I don't happen to care about the length of the path,
    # if you need to compare path lengths, or if you expect the
    # resulting graph to be fairly dense, you should use a different
    # data structure
    paths = set()

    # we have to initialize the path matrix with
    # all of the paths we already know about, which is
    # the contents of the 'edges' table.
    for edge in session.execute(association_table.select()).fetchall():
        predecessor, successor = edge['predecessor'], edge['successor']
        paths.add((predecessor, successor))

    # finally, the meat of the algorithm.
    # we're querying raw IDs because that's what the edges look like.
    # note that the intermediate node is the outermost loop; that's
    # what lets paths through several intermediates accumulate.
    nodes = [row.id for row in session.execute(select([Node.id]))]
    for intermediate in nodes:
        for predecessor in nodes:
            for successor in nodes:
                if ((predecessor, intermediate) in paths and
                        (intermediate, successor) in paths):
                    paths.add((predecessor, successor))

    # last but not least, we have to get the data back into the database.
    session.execute(path_table.insert(), 
                    [dict(predecessor=path[0], successor=path[1])
                     for path in paths])
    session.flush()
    session.expire_all()

This certainly does the job, but there are a bunch of shortcomings to this approach. The most obvious is the cataclysmic nature of the implementation: even if you only need to change one edge, you end up recreating the whole table. In realistic situations, it's probably only a small percentage of nodes that would be affected by such a change.

But believe it or not, that's not the part that bothers me the most. The issue in my mind is that this breaks the abstraction. Changes at the model level, something like myNode.successors.append(yourNode), don't actually work, since the change doesn't get reflected in comparisons until you issue a session.add(myNode) and a recompute_paths(session).

To drive the point home: models are just that, object-oriented models of something. They are used for the calculations, and are only secondarily meaningful in terms of data persistence. Requiring them to be in a particular state to perform correctly is what I would consider a code smell.



In part two, I demonstrate the solution to both of those problems.