Connected components algorithm

Sean Barrett  —  1 year, 3 months ago [Edited 2 days, 15 hours later]
stb_connected_components.h

stb_connected_components.h (stbcc) finds the connected components on a large 2D grid, and tracks those connected components as the grid is updated. This is trickier than you might think.

stbcc has many possible uses, but it was inspired in particular by a comment by the programmer of Dwarf Fortress that they kept a separate data structure that could efficiently answer the question of whether any given point Q on the map was reachable from some other point P, and they used this to prevent degenerate slowness in pathfinding from P to Q (which, if Q is unreachable, typically will explore the entirety of the map that is reachable from P).

Finding and updating connected components

The traditional algorithm for finding connected components in a graph is to just use depth-first-search; the equivalent on grids is a "flood-fill" algorithm. These techniques work fine, and are O(N) with N nodes (e.g. in a 1024x1024 grid, they take ~1M units of time). However, they offer no good approach for updating.

The classic update-friendly approach for this kind of problem is to use Disjoint Set Forests. The algorithm will work on arbitrary graphs, not just grids. It allows you to build groups of 'connected' objects (sets) efficiently; the initial build is essentially O(N) time (see the Wikipedia page for more details about its speed). This technique also allows you to dynamically add connections to the existing data structure in essentially O(1) (amortized) time. The implementation of disjoint set forests is straightforward, using only a dozen or so lines of code, as in this reference solution I coded at the beginning of the stbcc stream (see previous post):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
point leader[1024][1024];

point dsf_find(int x, int y)
{
   point p,q;
   p = leader[y][x];
   if (p.x == x && p.y == y)
      return p;
   q = dsf_find(p.x, p.y);
   leader[y][x] = q; // path compression
   return q;
}

void dsf_union(int x1, int y1, int x2, int y2)
{
   point p = dsf_find(x1,y1);
   point q = dsf_find(x2,y2);

   if (p.x == q.x && p.y == q.y)
      return;

   leader[p.y][p.x] = q;
}

Unfortunately, there is no simple update-friendly data structure to allow deleting a connection and still track the connected components.

Understanding the Deletion Problem

Think of a connected-components data structure as storing a label for every node on the graph, indicating which component that node belongs to. (To check if point P can reach point Q, we check if they have the same label.)

If you imagine a map split into two large components, when you connect them, you naively might have to update half of the nodes on the map, which will obviously not be O(1) time. However, if the nodes don't directly store a label, but store a pointer to a label, then we might be able to set up a situation where all the nodes in one component point to one label, and all the nodes in the other component point to another label, and when we join the two components, we only have to update one pointed-to label and by doing so we've updated the whole map. (This is essentially what the Disjoint Set Forest algorithm above does, taken to its ultimate limit.)

But when we disconnect a large component on the map into two components, we might make that disconnection in many places; there's no way to guarantee that the existing pointers to labels point to two different locations in a way that exactly matches the disconnection that we have. Thus there is no straightforward solution to an update that creates a disconnection.

AFAIK most people simply rebuild the entire connected component data structure (i.e. periodically recompute the flood fill).

So the idea of stb_connected_components.h is to compromise; we introduce a moderate number of intermediate, indirected labels, so that we can rewrite as few labels as possible and minimize the number of nodes which we need to point to different labels entirely. We do this by relying on the graph being a grid and using the geometry of the grid to guide us.

Two-level Hierarchies

There's a classic trick for making certain kinds of O(N^1.5) algorithms when you have an O(N^2) algorithm that it relies on: partition the input data into sqrt(N) pieces each of size sqrt(N), run the O(N^2) algorihm on the pieces; each one will take O(sqrt(N)^2), or O(N) time, so running on all the pieces will take sqrt(N)*O(N) time, i.e. O(N^1.5) time.

We do something similar for stbcc. We're going to chunk up our M*M grid into smaller pieces. Suppose that size is K. Then we have K*K pieces, each one (M/K)*(M/K) nodes. Suppose we run an O(N) algorithm within one (M/K)^2 piece and another O(N) algorithm on the K^2 coarse pieces; then we have O(K*K) + O((M/K)*(M/K)), or O(K*K+M*M/(K*K). If K is 1, this becomes O(M^2), if K is M, this becomes O(M^2), but if K is sqrt(M), this becomes O(M). It's minimized when K is sqrt(M).

Why a two-level hierarchy? Decades of experience has shown that you almost never want quadtree-esque fully-recursive hierarchies for speeding things up; you almost always want two-level hierarchies. (Maybe, maybe, if it's really big, you want three-level hierarchies. If you wanted to do a 65536x65536 map, maybe that would be better.)

Subgrids in stb_connected_components

So, as described in the previous section, stbcc splits an M*M grid into subgrids that are sqrt(M)*sqrt(M). For each subgrid, we cache the locally-connected components; each within-subgrid connected component has a single slot storing a label and all the nodes in that component point to the same slot; thus when connectivity changes we only have to update that one slot to update all of the nodes.

There are sqrt(M)*sqrt(M) total subgrids; which means we have (on typical maps) a small multiple of that many locally-connected components; in other words we usually have O(M) pieces--but that's from an M*M map, so if N=M*M, we're talking about O(N^0.5) components. If we also keep track of how locally-connected components connect across subgrid boundaries, then we can run a straight Disjoint Set Forest algorithm on those O(N^0.5) components in essentially O(N^0.5) time.

On a single update, we do three things:

  1. Rebuild the locally-connected components for the subgrid containing the updated node
  2. Rebuild the adjacency information from that subgrid to its neighbors, and vice versa
  3. Compute from scratch the globally-connected components from the locally-connected pieces

For an M*M map, step 1 takes essentially O(M), step 2 takes O(sqrt(M)) time [pretending array deletion is O(1)], and step 3 takes O(M) time for game-like maps but has a much worse case discussed below. (All these 'essentiallies' are based on pretending that DSF takes O(N) time for N operations, ignoring that (a) the actual cost is technically higher, and (b) the nearly-linear behavior is only if you implement both path compression and union by rank, and we only implement the former to reduce memory/cache pressure.)

Adjacency lists can be maintained as doubly-linked lists so that deletion time for them is actually O(1), but this requires a lot of extra memory, and in the common case where the arrays are small, searching the list to delete an item isn't a big deal.

The actual implementation is:

  1. Run a distjoint-set-forest on a subgrid
  2. Delete old adjacency information for the old subgrid (and use that to delete adjancey info pointing to this subgrid from its neighbors); then check every node on the subgrid edges to see if it connects across the edge to a neighbor component, and if so add a connection to the adjacency lists for both
  3. Run a disjoint-set-forest over all the locally-collected-components in the entire map using the adjacency info to define the connections (this is the slowest step, and has a bad worst case)

Plausible Improvements

Just writing down the above suggested a few possible improvements:

  • Might a flood-fill for step 1 above be faster?
  • The adjacency list additions are actually computed twice, to add the entries to each side independently. Could maybe be twice as fast; but I don't think this was ever a significant chunk of the time in tests.
  • The old reference disjoint-set-forest added connections in both directions [i.e. called both dsf_union(x,y,x+1,y) and dsf_union(x+1,y,x,y)], and it was found to nearly double the speed to remove this redundancy. The final global one might want to do something similar, although it is not as straightforward to do it geometrically now, as they're arbitrary graph connections instead of simple grid connections. However, they ARE still always across a subgrid edge, and we could determine that; but is that faster than just computing dsf_union()?
  • (This comes up below:) Don't store anything at all for degenerate components, since their index is all we need for reachability tests.

  • Reducing Memory Usage

    The above describes the initial 0.90 release of stb_connected_components. The first update reduced memory usage.

    stbcc allocates the worst-case memory usage needed so it never does any dynamic allocations. Because adjacency lists are variable sized, the initial version allocated a max-sized adjacency list for every locally-connected-component. This required foolish amounts of memory (~80MB for a 1K*1K map).

    The max-sized adjacency list occurs when every connection out of a subgrid come from the same locally-connected component. If this happens, no other locally-connected component in that subgrid can have any adjacency list entries. In other words, the total number of adjacency list entries is bounded by the same maximum that the initial version used for each list. (This is, by the way, half the "circumference" of the subgrid.)

    The problem is that with a fixed-allocation scheme, we need to allocate variably-sized lists, and they might need to be resized. This could require writing a little "memory allocator" out of each subgrid's adjacency "pool", and then after multiple updates, these could become fragmented and need defragmentation. While this is possible, we use a much simpler strategy: when we create the initial lists for a subgrid, we divide up the list, and apportion any leftover space equally between all the components. If we try to add to a component's adjacency list and there is no room, we discard all the information for that subgrid and rebuild. (We're already always rebuilding 1 subgrid on every update; now we update at most 5, and this is rare in practice.)

    This reduces worst-case memory to ~8MB for a 1K*1K map (equivalent to about 8 bytes per node, which I consider tolerable).

    Reducing Global Components

    Although I don't believe the cases considered in this section are interesting for game maps, stb_connected_components doesn't claim to be only for game maps, so when I tested and found some degeneracies I decided it was worth putting the effort into speeding them up. This didn't noticeably speed up the game map case, but it also didn't hurt.

    A worst case for the algorithm as described above is a checkerboard; each white square is a degenerate single-node connected component. In this case, an M*M grid has M*M/2 locally-connected components, and the disjoint-set-forest algorithm run on those components will take O(M*M) time (even though they're all disconnected). This is not theoretical; making such a map did make this step very slow.

    To improve performance, we could simply eliminate degenerate 1-node connected-components from the global step, but we can do better, handling larger components as well. We could classify each locally-connected component by whether it has an empty adjacency list or not. However, that could change when an update happens immediately next to the upgrid, creating a new adjacent connection even if the subgrid itself doesn't change.

    Instead we classify each locally-connected component with whether it touches an edge of the subgrid or not. Those that touch it need adjacency lists and need to participate in the globally-connected components step. Those that do not do not need lists and do not participate. (This also helps the previous section; splitting up the adjacency-list storage, only components that are touch edges are allocated extra adjacency-list storage.) Remember, if the subgrid ever changes, it will be discarded and fully rebuilt, so if a locally-connected component doesn't touch the edge now, it never will for the lifetime of this subgrid.

    By separating the list of locally-connected components in each subgrid into two lists (those that are on-edge and those that aren't), we avoid traversing the latter entirely, bringing performance down from O(N) to O(N^0.75) (i.e. from O(M^2) to O(M^1.5). This is still a bit slow; in practice with M=1024 I saw times from 0.25ms for the O(M)-ish normal-map case to 2.5ms for an O(M^1.5)-ish worst case, all compared to say 50ms for the O(M^2) full-map disjoint set forest.

    The checkerboard case is actually not the worst case for performance, because it actually has no global adjacencies; we had to traverse every locally-connected component, but didn't do any dsf_union() operations. Some practical cases that were more interesting: invert every other subgrid in a checkerboard fashion (so there are maximal numbers of adjacencies across the subgrid borders); stripes on alternating rows (so there are large global connected components), and both of those but with some fraction of the nodes randomized (creating more connected components).

    Implausible Improvements

    To do significantly better we'd have to formalize the problem differently; as it stands, an M*M grid has M subgrids with worst case O(sqrt(M)) adjacencies, and we have to call dsf_union() for every adjacency, so the O(M^1.5) worst case is unavoidable with this way of factoring the problem. Of course, speeding up this degenerate worst case is also not necessarily the priority; however, measurements have shown that the global disjoint-set-forest step is also the slowest step in the update operation, even for 'normal' grids.

    One way to speed up the current global disjoint-set-forest step would be to apply a similar 'incremental' algorithm to it, i.e. introduce a third hierarchical level. The easiest way is to imagine that this is another level of grids above the current level; the globally-connected components from the current grid could be treated as locally-connected components on an even further-out supergrid. We might split a 1024*1024 supergrid into 8x8 grids that are 128x128, and subdivide each grid into 8x8 sets of 16x16 subgrids. Then an update operation would rebuild one 16x16 subgrid, compute the "globally"-connected components from ~200 locally-connected components (worst case 2048), and then compute the truely-global-connected components from a set of ~100 "globally"-connected components (worst case 16384).

    Note that that worst case hasn't improved a lot (the existing worst case for a 1024*1024 grid is 65536), so this might cause an even bigger spread between worst case and average case (which is perhaps surprising if you thought it would help the worst case more).

    But I don't plan to do this, since two-level hierarchies tend to be the sweet spot of complexity & performance.
    Log in to comment