Datalog
I will start a series of posts about using Datalog to solve standard program analysis problems. For people that have not heard about Datalog, an excellent entry point is the paper What you Always Wanted to Know About Datalog (And Never Dared to Ask).
Datalog programs contain rules (or caluses); a classic example of Datalog program is:
GrandParent(x, z) :- Parent(x, y), Parent(y, z).
The head of the rule is the conclusion, which holds true if the body of the
rule is true.
So one interpretation of this rule is: x
is the grandparent of z
if x
is the
parent of y
and y
is the parent of z
. Together with inputs (or facts) like
Parent("Alice", "Bob"), Parent("Bob", "Claire")
will return us
GrandParent("Alice", "Claire")
.
One of the possible execution models for Datalog program is to reduce it to a set of Relational Algebra, just like you reduce imperative language to a set of low-level instructions and functional language to lambda calculus. However, unlike Database language, which is also reduced to Relational Algebra, Datalog has the ability of recursion. Of course, this comes with extra overhead and some unsafe properties.
Datalog has been shown to be particularly useful for static analysis, and many real-world applications 1 have been built upon it. Specifically, many program analysis problems are modelled with lattice and fixpoint iteration. This coincides perfectly with the execution model of Datalog. Throughout the series, I’ll be using Souffle, one of the modern implantations of Datalog. I won’t detail the Souffle’s syntax or semantics but will explain it as we see more examples.
Souffle implements stratified Datalog, a model that extends basic Datalog with the extra ability to use negated relation. Souffle also has additional features like functors. This makes Souffle truing complete and loses the property of being termination safe.
Dominance Tree
Dominance tree, along with dominator and dominance frontier, are important properties for helping the compiler insert Phi nodes when constructing the SSA form.
We won’t go into the details of SSA, which could be our next post’s topic. Let’s directly dive into the definition of dominator 2.
In a control flow graph with entry node $b_0$, node $b_i$ dominates node $b_j$, if and only if $b_i$ lies on every path from $b_0$ to $b_j$. By definition, all nodes dominate themselves.
Informally, if $b_i$ lies on every paths that goes from $b_0$ to $b_j$, $b_i$ is said to dominate $b_j$. A small example is shown here, where we can easily spot $B_1$ dominate every node but $B_0$.
The dominance set can therefore be computed as: \[ \tag*{$n \neq root$} DOM(n) = \lbrace n \rbrace \cup \left( \bigcap_{m \in pred(n)} {DOM(m)} \right) \]
With base case: \[ DOM(root) = \lbrace root \rbrace \]
The formula is intuitive; for a node $n$, if $p$ dominates all predecessor of $n$, them $p$ must lie on every path from root to $n$, hence dominate $n$.
The formula is a data-flow problem and can be shown to be solved by a chaotic fixpoint algorithm. That is, visit node in the graph in any order and update their dominance set as defined until a fixpoint is reached. We mentioned that this kind of algorithm is straightforward to implement in Datalog, and we will soon show how, but before that, there is a catch.
Compute Intersection in Datalog
The formula shown above involves calculating the intersection of multiple sets. Datalog is built upon a union semantic, you can think of as it does not allow removing elements from relations, which makes intersection not that straightforward.
We can easily compute the intersection of two sets with:
A(1). // <- this is a input (fact)
A(2).
A(3).
B(1).
intersect(x) :- A(x), B(x).
Where x
is in the intersect
if it appears both in A
and B
.
However, writing this rule requires us to know the relations involved
beforehand (in this case, A
and B
). This is not the case for our formula,
where the sets to intersect depends on the structure of the graph
(the predecessors) — something we don’t know until runtime. This is not entirely
impossible though, assuming we have a relation called set(n, e)
, where n
is
the set id and e
is its element (hence we can have arbitrary number of sets
based on input), we can still compute their intersection:
set(1, 1). // element 1 is in set1
set(1, 2). // element 2 is in set1
set(1, 3).
set(2, 2).
set(2, 3).
set(3, 3).
not_intersect(x) :- set(n1, x), set(n2, _), n1 != n2, !set(n2, x).
intersect(x) :- set(_, x), !not_intersect(x).
We translate the problem to its dual problem — we compute elements that do
not belong to the final intersection. If an element x
appears in set n1
and
another set n2
does not contain x
, then x
cannot be in the resulting set.
Finally, we go through all elements and filter out those in the
not_intersect
.
So this is it, right? Can we use this to compute our dominance relation? Unfortunately, no. Souffle implements stratified negation, which means there cannot be cyclic negations in the relation dependency. In another word, this is not allowed:
A(x) :- !B(x), C(x).
B(x) :- A(x)
In order to compute A
we need !B
, which in turn require B
, and to compute
B
we need A
, which basically falls into a paradox.
Indeed this is not the case with the set
example, where the dependency is
shown below. We can see that a cycle with negation does not exist.
So what is the problem stopping us from using it in our dominance calculation?
The truth is, there is no cycle in our set
example because the relation set
is given as user input and does not require further computation. This is
different from our dominance formula. If we try to migrate our example solution
to the dominance problem, you can imagine the set
relation in our example
would correspond to the DOM
relation. However, to calculate DOM
relation,
we need to know the intersect
, creating a cycle with negation in our
dependency graph. This is also implied in the formula, which is defined
recursively.
In brief, to use a negated relation in stratified Datalog (like Souffle), it is required that the relation is fully computed before its negation can be used. This is what the word ‘stratified’ means. In a loose term, Datalog is built upon fixpoint semantic but with a union favor, which makes things with intersection a bit trickier.
Back to Dominance
“You just said Datalog is easy when implementing program analysis problems!”
Yes… Although I’ve used quite some space to show how the question cannot
be done in a certain way, I stand by my claim. Just like when you first start
learning the functional language, you need some time to get used to the fact
that there are no variables and loops are replaced with operations like fold
.
This applies to logic programming as well, you just need to think
differently.
Remember when doing the set
example, instead of directly calculating the
intersect, we consider the dual problem and calculate the elements that do not
belong to the intersect? The idea is the same here. Instead of calculating the
set of elements that dominate $n$, we consider the dual and calculate the set
of elements that does not dominance $n$.
So what does that mean? Let $DOM’(n)$ be the set of elements that do not dominate $n$, we can come up with a similar formula:
\[ \tag*{$n \neq root$ } DOM’(n) = \left( \bigcup_{ \begin{subarray}{l} m \in pred(n) \end{subarray}} {DOM’(m)} \right) \setminus \lbrace n \rbrace \] The formula is intuitive 3, for a node $n$, if $p$ does not dominate one of the predecessors ($m$) of $n$, then at least one path from the root to $n$ does not visit $p$. Hence $p$ does not dominate $n$. The exclusion of $n$ is necessary, consider if a node $n$ does not dominate one of its predecessor $m$, we will get $n$ does not dominate itself based on the formula, which is not what we want.
Now we have union! Which is what Souffle likes! The only missing part is to define the base case, which is for $root$. This is straightforward also; all other nodes except $root$ itself do not dominate $root$. \[ DOM’(root) = \lbrace n \vert n \in G \land n \neq root \rbrace \]
Let’s know define the Souffle program to compute the dominance set!
.decl cfg(src: symbol, dest: symbol) // our input control flow graph
.input cfg(delimiter=",")
.decl root(x: symbol)
root(x) :- cfg(x, _), !cfg(_, x).
.decl node(x: symbol)
node(x) :- cfg(x, _).
node(x) :- cfg(_, x).
// read as: src is not dominated by dom
.decl not_dom(src: symbol, non_dom:symbol)
not_dom(n, m) :- node(m), root(n), n != m. // base case for root
not_dom(n, m) :- cfg(pred, n), not_dom(pred, m), n != m.
// read as: src is dominated by dom
.decl dom(src:symbol, dom:symbol)
dom(n, m) :- node(n), node(m), !not_dom(n, m).
.output dom(IO=stdout)
This is the first time we encounter Souffle syntax, so I’ll explain them a bit.
.decl cfg(src: symbol, dest: symbol) // our input control flow graph
.input cfg(delimiter=",")
Souffle is statically typed, so we must declare their arity and the type of
each element for each relation. Here, cfg
is a relation with two elements,
both of symbol
type. symbol
corresponds to the string
type you find in
other common languages. The elements’ names (src
and dest
) have no
computational meaning here, and it is only for readability. The following line
uses the word input
, saying that the relation cfg
should be provided with
some user input. Souffle read the input from a file (by default, cfg.facts
),
and we specify the delimiter here. For our example, we provide a cfg.facts
with the input corresponding to our cfg example:
B0,B1
B1,B2
B1,B5
B2,B3
B5,B6
B5,B8
B6,B7
B8,B7
B7,B3
B3,B4
B3,B1
(B0,B1)
means there is an edge from B0
to B1
. This reads tuple cfg(B0, B1) cfg(B1, B2) ...
into the relation cfg
.
.decl root(x: symbol)
root(x) :- cfg(x, _), !cfg(_, x).
.decl node(x: symbol)
node(x) :- cfg(x, _).
node(x) :- cfg(_, x).
Those two relations are auxiliary. The root
rule extracts the root node from the
graph: a node x
is root if it is a starting node of an edge in cfg
and
there is no edge ends with x
in cfg
. The underscore has a similar meaning
as in functional language like Haskell, saying that we don’t care about their value.
The node
relation further extracts each node in the graph. x
is a node if
it is a starting node of an edge or a end node of an edge in cfg
.
// read as: src is not dominated by dom
.decl not_dom(src: symbol, non_dom:symbol)
not_dom(n, m) :- node(m), root(n), n != m. // base case for root
not_dom(n, m) :- cfg(pred, n), not_dom(pred, m), n != m.
Here we compute the $DOM’$. The first rule corresponds to our base case. It
says: n
is not dominated by m
if: 1) m
is a node 2) and n
is root 3)
and n
is not equal to m
. The second rule computes everything for other
nodes and directly corresponds to our formula. It says: n
is not dominated by
m
if:
- There is an edge from
pred
ton
(sopred
is a predecessor ofn
). m
does not dominatepred
.n
andm
are not the same node (corresponds to the exclusion of $x$).
// read as: src is dominated by dom
.decl dom(src:symbol, dom:symbol)
dom(n, m) :- node(n), node(m), !not_dom(n, m).
.output dom(IO=stdout)
Finally, we calculate the $DOM$ set, which is precisely the complement of $DOM’$. We then output the relation $DOM$ to the standard output.
Dominance Frontier
Let’s now calculate the $DF$ set.
Informally, $DF(n)$ contains the first node reachable from $n$ that $n$ does not strictly dominate.
The algorithm here is quite complicated if you look at the book like Engineer a Compiler. You need some book keeping and also the notion of immediate dominator (which we have not introduced yet). However, in Datalog, you just need to translate the English description to:
.decl df(src:symbol, frontier:symbol)
df(n, m) :- cfg(pred, m), dom(pred, n), !strict_dom(m, n).
and we are done!
Immediate Dominator
Given a node $n$ and its strictly dominate set $SDOM(n)$, the node $m \in SDOM(n)$ that are closest to $n$ is the immediate dominator of $n$.
The definition is simple and looks intuitive to translate to Datalog. But we have a problem; we need to choose the closest node (in turns of the number of edges), but how do we express that in Datalog? Recall that Datalog is union-oriented, which makes it hard to select an optimal solution, because you probably need some notion of deletions or replacement in the language (think about how you would find a largest value in a list in language like C).
In order to do this easilty, Souffle has the extension feature of subsumptive rule. A subsumption allows user to specify a choosing criteria via a binary relation $\sqsubseteq$ between two elements $A$ and $B$ in a relation. If $A \sqsubseteq B$ is true, then the element $A$ will be replaced by element $B$ 4.
For example, use subsumption to find a maximum number, we can do:
.decl num(x:number)
// x is replaced by y if x < y.
num(x) <= num(y) :- x < y.
Now back to our immediate dominator, how do we choose the closest one? The critical observation here is that, all the dominators $DOM(n)$ of a node $n$ must lie on every path from root to $n$. Therefore, for two dominators $a, b \in DOM(n)$, either $a$ dominates $b$ or vice visa. If $a$ dominates $b$, then $b$ must be closer to $n$ then $a$ ($b$ is ’lower’ in the path from root to $n$). Hence comes the following subsumptive rule:
[decl](.decl) ID(src:symbol, id:symbol)
ID(n, m) :- strict_dom(n, m).
ID(n, m1) <= ID(n, m2) :- dom(m2, m1).
Let’s put it all together. Initially, every node m
is an immediate dominator
of n
if m
strictly dominates n
. Then, we remove those that are not the
closest. If n
has two immediate dominators, m1
and m2
, and m2
is
dominated by m1
, then m2
is closer and replaces m1
.
Summary
I (very informally) introduced some basic semantics and syntax of Datalog. I showed how to compute dominance and its related notions using a modern Datalog engine Souffle. I’ll explore more program analysis problems using Datalog in the future posts.
-
See Souffle applications: https://souffle-lang.github.io/applications ↩︎
-
Most of the definitions and examples are taken from Engineer a Compiler, 2nd ed. ↩︎
-
Do you see the symmetric between $DOM’$ and $DOM$? ↩︎
-
The relation must form a partial order otherwise termination might not be guaranteed. I actually have not figured out why yet (looks like the requirement is stricter than it needs to be), might do an update on this later. ↩︎