Program Analysis with Datalog: Dominance Tree

Datalog

I will start a series of posts about using Datalog to solve standard program analysis problems. For people that have not heard about Datalog, an excellent entry point is the paper What you Always Wanted to Know About Datalog (And Never Dared to Ask).

Datalog programs contain rules (or caluses); a classic example of Datalog program is:

GrandParent(x, z) :- Parent(x, y), Parent(y, z).

The head of the rule is the conclusion, which holds true if the body of the rule is true. So one interpretation of this rule is: x is the grandparent of z if x is the parent of y and y is the parent of z. Together with inputs (or facts) like Parent("Alice", "Bob"), Parent("Bob", "Claire") will return us GrandParent("Alice", "Claire").

One of the possible execution models for Datalog program is to reduce it to a set of Relational Algebra, just like you reduce imperative language to a set of low-level instructions and functional language to lambda calculus. However, unlike Database language, which is also reduced to Relational Algebra, Datalog has the ability of recursion. Of course, this comes with extra overhead and some unsafe properties.

Datalog has been shown to be particularly useful for static analysis, and many real-world applications ¹ have been built upon it. Specifically, many program analysis problems are modelled with lattice and fixpoint iteration. This coincides perfectly with the execution model of Datalog. Throughout the series, I’ll be using Souffle, one of the modern implantations of Datalog. I won’t detail the Souffle’s syntax or semantics but will explain it as we see more examples.

Souffle implements stratified Datalog, a model that extends basic Datalog with the extra ability to use negated relation. Souffle also has additional features like functors. This makes Souffle truing complete and loses the property of being termination safe.

Dominance Tree

Dominance tree, along with dominator and dominance frontier, are important properties for helping the compiler insert Phi nodes when constructing the SSA form.

We won’t go into the details of SSA, which could be our next post’s topic. Let’s directly dive into the definition of dominator ².

In a control flow graph with entry node $b_0$, node $b_i$ dominates node $b_j$, if and only if $b_i$ lies on every path from $b_0$ to $b_j$. By definition, all nodes dominate themselves.

Informally, if $b_i$ lies on every paths that goes from $b_0$ to $b_j$, $b_i$ is said to dominate $b_j$.

A small example is shown here, where we can easily spot $B_1$ dominate every node but $B_0$.

The dominance set can therefore be computed as: \[ \tag*{$n \neq root$} DOM(n) = \lbrace n \rbrace \cup \left( \bigcap_{m \in pred(n)} {DOM(m)} \right) \]

With base case: \[ DOM(root) = \lbrace root \rbrace \]

The formula is intuitive; for a node $n$, if $p$ dominates all predecessor of $n$, them $p$ must lie on every path from root to $n$, hence dominate $n$.

The formula is a data-flow problem and can be shown to be solved by a chaotic fixpoint algorithm. That is, visit node in the graph in any order and update their dominance set as defined until a fixpoint is reached. We mentioned that this kind of algorithm is straightforward to implement in Datalog, and we will soon show how, but before that, there is a catch.

Compute Intersection in Datalog

The formula shown above involves calculating the intersection of multiple sets. Datalog is built upon a union semantic, you can think of as it does not allow removing elements from relations, which makes intersection not that straightforward.

We can easily compute the intersection of two sets with:

A(1).   // <- this is a input (fact)
A(2).
A(3).
B(1).

intersect(x) :- A(x), B(x).

Where x is in the intersect if it appears both in A and B.

However, writing this rule requires us to know the relations involved beforehand (in this case, A and B). This is not the case for our formula, where the sets to intersect depends on the structure of the graph (the predecessors) — something we don’t know until runtime. This is not entirely impossible though, assuming we have a relation called set(n, e), where n is the set id and e is its element (hence we can have arbitrary number of sets based on input), we can still compute their intersection:

set(1, 1).  // element 1 is in set1
set(1, 2).  // element 2 is in set1
set(1, 3).
set(2, 2).
set(2, 3).  
set(3, 3).

not_intersect(x) :- set(n1, x), set(n2, _), n1 != n2, !set(n2, x).
intersect(x) :- set(_, x), !not_intersect(x).

We translate the problem to its dual problem — we compute elements that do not belong to the final intersection. If an element x appears in set n1 and another set n2 does not contain x, then x cannot be in the resulting set. Finally, we go through all elements and filter out those in the not_intersect.

So this is it, right? Can we use this to compute our dominance relation? Unfortunately, no. Souffle implements stratified negation, which means there cannot be cyclic negations in the relation dependency. In another word, this is not allowed:

A(x) :- !B(x), C(x).
B(x) :- A(x)

In order to compute A we need !B, which in turn require B, and to compute B we need A, which basically falls into a paradox.

Indeed this is not the case with the set example, where the dependency is shown below. We can see that a cycle with negation does not exist.

So what is the problem stopping us from using it in our dominance calculation? The truth is, there is no cycle in our set example because the relation set is given as user input and does not require further computation. This is different from our dominance formula. If we try to migrate our example solution to the dominance problem, you can imagine the set relation in our example would correspond to the DOM relation. However, to calculate DOM relation, we need to know the intersect, creating a cycle with negation in our dependency graph. This is also implied in the formula, which is defined recursively.

In brief, to use a negated relation in stratified Datalog (like Souffle), it is required that the relation is fully computed before its negation can be used. This is what the word ‘stratified’ means. In a loose term, Datalog is built upon fixpoint semantic but with a union favor, which makes things with intersection a bit trickier.

Back to Dominance

“You just said Datalog is easy when implementing program analysis problems!”

Yes… Although I’ve used quite some space to show how the question cannot be done in a certain way, I stand by my claim. Just like when you first start learning the functional language, you need some time to get used to the fact that there are no variables and loops are replaced with operations like fold. This applies to logic programming as well, you just need to think differently.

Remember when doing the set example, instead of directly calculating the intersect, we consider the dual problem and calculate the elements that do not belong to the intersect? The idea is the same here. Instead of calculating the set of elements that dominate $n$, we consider the dual and calculate the set of elements that does not dominance $n$.

So what does that mean? Let $DOM’(n)$ be the set of elements that do not dominate $n$, we can come up with a similar formula:

\[ \tag*{$n \neq root$ } DOM’(n) = \left( \bigcup_{ \begin{subarray}{l} m \in pred(n) \end{subarray}} {DOM’(m)} \right) \setminus \lbrace n \rbrace \] The formula is intuitive ³, for a node $n$, if $p$ does not dominate one of the predecessors ($m$) of $n$, then at least one path from the root to $n$ does not visit $p$. Hence $p$ does not dominate $n$. The exclusion of $n$ is necessary, consider if a node $n$ does not dominate one of its predecessor $m$, we will get $n$ does not dominate itself based on the formula, which is not what we want.

Now we have union! Which is what Souffle likes! The only missing part is to define the base case, which is for $root$. This is straightforward also; all other nodes except $root$ itself do not dominate $root$. \[ DOM’(root) = \lbrace n \vert n \in G \land n \neq root \rbrace \]

Let’s know define the Souffle program to compute the dominance set!

.decl cfg(src: symbol, dest: symbol)    // our input control flow graph
.input cfg(delimiter=",")

.decl root(x: symbol)
root(x) :- cfg(x, _), !cfg(_, x).

.decl node(x: symbol)
node(x) :- cfg(x, _).
node(x) :- cfg(_, x).

// read as: src is not dominated by dom
.decl not_dom(src: symbol, non_dom:symbol)
not_dom(n, m) :- node(m), root(n), n != m.   // base case for root
not_dom(n, m) :- cfg(pred, n), not_dom(pred, m), n != m.

// read as: src is dominated by dom
.decl dom(src:symbol, dom:symbol)
dom(n, m) :- node(n), node(m), !not_dom(n, m).

.output dom(IO=stdout)

This is the first time we encounter Souffle syntax, so I’ll explain them a bit.

.decl cfg(src: symbol, dest: symbol)    // our input control flow graph
.input cfg(delimiter=",")

Souffle is statically typed, so we must declare their arity and the type of each element for each relation. Here, cfg is a relation with two elements, both of symbol type. symbol corresponds to the string type you find in other common languages. The elements’ names (src and dest) have no computational meaning here, and it is only for readability. The following line uses the word input, saying that the relation cfg should be provided with some user input. Souffle read the input from a file (by default, cfg.facts), and we specify the delimiter here. For our example, we provide a cfg.facts with the input corresponding to our cfg example:

B0,B1
B1,B2
B1,B5
B2,B3
B5,B6
B5,B8
B6,B7
B8,B7
B7,B3
B3,B4
B3,B1

(B0,B1) means there is an edge from B0 to B1. This reads tuple cfg(B0, B1) cfg(B1, B2) ... into the relation cfg.

.decl root(x: symbol)
root(x) :- cfg(x, _), !cfg(_, x).

.decl node(x: symbol)
node(x) :- cfg(x, _).
node(x) :- cfg(_, x).

Those two relations are auxiliary. The root rule extracts the root node from the graph: a node x is root if it is a starting node of an edge in cfg and there is no edge ends with x in cfg. The underscore has a similar meaning as in functional language like Haskell, saying that we don’t care about their value. The node relation further extracts each node in the graph. x is a node if it is a starting node of an edge or a end node of an edge in cfg.

// read as: src is not dominated by dom
.decl not_dom(src: symbol, non_dom:symbol)
not_dom(n, m) :- node(m), root(n), n != m.   // base case for root
not_dom(n, m) :- cfg(pred, n), not_dom(pred, m), n != m.

Here we compute the $DOM’$. The first rule corresponds to our base case. It says: n is not dominated by m if: 1) m is a node 2) and n is root 3) and n is not equal to m. The second rule computes everything for other nodes and directly corresponds to our formula. It says: n is not dominated by m if:

There is an edge from pred to n (so pred is a predecessor of n).
m does not dominate pred.
n and m are not the same node (corresponds to the exclusion of $x$).

// read as: src is dominated by dom
.decl dom(src:symbol, dom:symbol)
dom(n, m) :- node(n), node(m), !not_dom(n, m).

.output dom(IO=stdout)

Finally, we calculate the $DOM$ set, which is precisely the complement of $DOM’$. We then output the relation $DOM$ to the standard output.

Dominance Frontier

Let’s now calculate the $DF$ set.

Informally, $DF(n)$ contains the first node reachable from $n$ that $n$ does not strictly dominate.

The algorithm here is quite complicated if you look at the book like Engineer a Compiler. You need some book keeping and also the notion of immediate dominator (which we have not introduced yet). However, in Datalog, you just need to translate the English description to:

.decl df(src:symbol, frontier:symbol)
df(n, m) :- cfg(pred, m), dom(pred, n), !strict_dom(m, n).

and we are done!

Immediate Dominator

Given a node $n$ and its strictly dominate set $SDOM(n)$, the node $m \in SDOM(n)$ that are closest to $n$ is the immediate dominator of $n$.

The definition is simple and looks intuitive to translate to Datalog. But we have a problem; we need to choose the closest node (in turns of the number of edges), but how do we express that in Datalog? Recall that Datalog is union-oriented, which makes it hard to select an optimal solution, because you probably need some notion of deletions or replacement in the language (think about how you would find a largest value in a list in language like C).

In order to do this easilty, Souffle has the extension feature of subsumptive rule. A subsumption allows user to specify a choosing criteria via a binary relation $\sqsubseteq$ between two elements $A$ and $B$ in a relation. If $A \sqsubseteq B$ is true, then the element $A$ will be replaced by element $B$ ⁴.

For example, use subsumption to find a maximum number, we can do:

.decl num(x:number)
// x is replaced by y if x < y.
num(x) <= num(y) :- x < y.

Now back to our immediate dominator, how do we choose the closest one? The critical observation here is that, all the dominators $DOM(n)$ of a node $n$ must lie on every path from root to $n$. Therefore, for two dominators $a, b \in DOM(n)$, either $a$ dominates $b$ or vice visa. If $a$ dominates $b$, then $b$ must be closer to $n$ then $a$ ($b$ is ’lower’ in the path from root to $n$). Hence comes the following subsumptive rule:

[decl](.decl) ID(src:symbol, id:symbol)
ID(n, m) :- strict_dom(n, m).
ID(n, m1) <= ID(n, m2) :- dom(m2, m1).

Let’s put it all together. Initially, every node m is an immediate dominator of n if m strictly dominates n. Then, we remove those that are not the closest. If n has two immediate dominators, m1 and m2, and m2 is dominated by m1, then m2 is closer and replaces m1.

Summary

I (very informally) introduced some basic semantics and syntax of Datalog. I showed how to compute dominance and its related notions using a modern Datalog engine Souffle. I’ll explore more program analysis problems using Datalog in the future posts.

See Souffle applications: https://souffle-lang.github.io/applications ↩︎
Most of the definitions and examples are taken from Engineer a Compiler, 2nd ed. ↩︎
Do you see the symmetric between $DOM’$ and $DOM$? ↩︎
The relation must form a partial order otherwise termination might not be guaranteed. I actually have not figured out why yet (looks like the requirement is stricter than it needs to be), might do an update on this later. ↩︎