Monday, May 30, 2011

Graph patterns and SQLAlchemy: Part Two

In the first part of this series, I introduced a basic technique for handling graph structures that supports efficient reading for path reach. That solution was sufficient for primarily read operations, but isn't much help when there are frequent writes, or if database persistence is undesirable for some uses of the model. In this part, I'll show a solution to those shortcomings.



So, from this point forward, we're only interested in solutions that can be used *without* resorting to a database, and which leave the models in an always consistent state. There are no sessions here!. SQLAlchemy will give us one leg-up, the mapper provides just a bit of magic, such that when you modify one side of an association, it takes care of the book-keeping on the other side. This works whether you have a database or not.

We can approach this from the point of view of breaking the problem into two parts. When you add an edge, you only have to worry about adding new paths, and when you remove an edge, you only need to worry about breaking existing paths; Adding an edge won't ever break any paths.

In fact, That first one is really easy to deal with. A new edge creates a pretty clear set of paths. the successor, and all of its successors gain the predecessor, and all of it's predecessors as predecessors themselves.


class Node(Base):
    # ...
    def add_successor(self, other):
        if other in self.successors:
            return
        self.successors.append(other)
        self.before.append(other)
        for descendent in other.before:
            if descendent not in self.before:
                self.before.append(descendent)
        for ancestor in self.after:
            if ancestor not in other.after:
                other.after.append(ancestor)


In the case of adding an edge, we are only really looking at the ancestors of the top node, and the descendants of the bottom node. If they aren't already, we connect 'em up.

What about removing an edge. Before we get too involved in what that does to the graph, lets look again at an example. here's a slightly modified Diamond graph


  a
 / \
b   \
|    c
e   /
 \ /
  d

suppose we wanted to remove the edge between a and b, turning it into:


  a
 / \
b   \
     c
e   /
 \ /
  d

The important thing to think about here is that the relation a < b still holds. c was not among the ancestors of b, nor among the descendants of e, but it was among the descendents of the ancestors of b and also among the ancestors of the descendents of e.


If you read that sentence carefully, you might already have a good guess about how to resolve the situation. C is an intermediate node between an ancestral node, a, and a descendent node d. That sure sounds awfully similar to the Floyd-Warshall algorithm. The only difference is that instead of looking at [all nodes × all nodes × all nodes], we are only interested in [ancestor nodes × descendent nodes × (ancestors→descendants ∩ descendents→ancestors)].


Not only does this narrow things down quite a bit, but it even simplifies the issue of expressing the algorithm in terms of the model itself.


There is one, last gotcha. It's entirely too easy to remove edges. Once you've found all of the edges that should remain, and go about trimming out the chaff, you need to make sure that the paths you're culling actually could have been among the paths that were broken.

This amounts to making and breaking connections based on the original set of nodes we're trying to resolve, rather than merging the whole output of the Floyd-Warshall algorithm right into the paths table directly.


class Node(Base):
    # ...
    def del_successor(self, other):
        if not self < other:
            # nodes are not connected, do nothing!
            return
        if not other in self.successors:
            # nodes aren't adjacent, but this *could*
            # be a warning...
            return

        self.successors.remove(other)

        # we build up a set of nodes that will be affected by the removal
        # we just did.  
        ancestors = set(other.after)
        descendents = set(self.before)

        # we also need to build up a list of nodes that will determine
        # where the paths may be.  basically, we're looking for every 
        # node that is both before some node in the descendents and
        # ALSO after the ancestors.  Such nodes might not be comparable
        # to self or other, but may still be part of a path between
        # the nodes in ancestors and the nodes in descendents.
        ancestors_descendents = set()
        for ancestor in ancestors:
            ancestors_descendents.add(ancestor)
            for descendent in ancestor.before:
                ancestors_descendents.add(descendent)

        descendents_ancestors = set()
        for descendent in descendents:
            descendents_ancestors.add(descendent)
            for ancestor in descendent.after:
                descendents_ancestors.add(ancestor)
        search_set = ancestors_descendents & descendents_ancestors

        known_good = set() # This is the 'paths' from the 
                           # original algorithm.  

        # as before, we need to initialize it with the paths we 
        # know are good.  this is just the successor edges in
        # the search set.
        for predecessor in search_set:
            for successor in search_set:
                if successor in predecessor.successors:
                    known_good.add((predecessor, successor))

        # We now can work our way through floyd_warshall to resolve
        # all adjacencies:
        for ancestor in ancestors:
            for descendent in descendents:
                if (ancestor, descendent) in known_good:
                    # already got this one, so we don't need to look for an
                    # intermediate.  
                    continue
                for intermediate in search_set:
                    if (ancestor, intermediate) in known_good \
                            and (intermediate, descendent) in known_good:
                        known_good.add((ancestor, descendent))
                        break # don't need to look any further for an
                              # intermediate, we can move on to the next
                              # descendent.  


        # sift through the bad nodes and update the links
        for ancestor in ancestors:
            for descendent in descendents:
                if descendent in ancestor.before \
                        and (ancestor, descendent) not in known_good:
                    ancestor.before.remove(descendent)

As you can see, that's a great deal more involved than anything we've done so far. It's also not any more efficient in the long term than doing a full rebuild, in fact it's a lot worse if you're adding and removing many nodes. But for changing a single node, it wins easily.

This still has some shortcomings, though. If you make changes via myNode.successors property instead of through the add/del methods, the graph goes out of sync. Since that's really more a matter of isolating a public interface from the private implementation, I'll leave that as an exercise for the reader.

A more interesting topic might be finding ways to make the path construction more efficient. One approach is to keep track of the path length. Another method to pursue might be to group changes together, so that you only have to pay the rebuild cost once for the whole change set, instead of over and over for each edge added or removed.

In some ways, it's probably best to solve all of these issues by moving the logic for keeping the paths list in the database, in the form of triggers or views or both.

If you'd like to see the code presented in this post all together, you can, at gitorious.org / ineedmorepower-examples

No comments:

Post a Comment