Program Analysis with Datalog: Liveness and Reaching Definition

2022/02/25

Tags: Liveness Analysis Reaching Definition Souffle Datalog Compiler

Continuing our last post, we will cover two more data-flow analyses, liveness and reaching definition. The current plan is to temporarily finish this series in the next post by writing a (nice) SSA transformation with everything we learn.

$\gdef\KILL{\mathrm{Kill}}$ $\gdef\GEN{\mathrm{Gen}}$ $\gdef\LIVEOUT{\mathrm{LiveOut}}$ $\gdef\REACH{\mathrm{Reach}}$

Liveness Analysis

In liveness analysis, given a variable $v$ and another location $p$ in the program, we want to know whether the variable $x$ will be used (and before redefined by other statements) in any program path starting from $p$.

Liveness analysis has many useful applications. In register allocation, it helps us determine whether it is necessary to keep preserving a variable’s value or discard it if it is not live anymore (live range). In SSA construction, it can help us eliminate some of the dead Phi instructions.

Throughout this post, we will use the following program as an example. 1

Example Program

Example Program

Dependency graph with negation in red

Dependency graph with negation in red

Let’s first clarify some semantics. The code is already in three-address form, $\leftarrow$ means assignment, and if the right-hand side is $\dots$, the value is assigned from some variables outside of our intersect (e.g., global variables, input from users).

The liveness analysis focuses on the exit point of each block, and we are trying to answer the following question: upon the exit of block $b$, which variables are alive (may be used in the remaining execution).

For example, upon exit of $B_7$, $b$ is alive because the subsequent $B_3$ reads it before $b$ is redefined. However, upon exit of $B_1$, $c$ is no longer alive because the subsequent $B_2$ and $B_5$ overwrite it immediately.

I’ll use a similar notation from PPA and Engineer a Compiler. However, in PPA each statement (label) is treated as its own block. We will not do it here. Instead, we still treat standard control flow block as our minimal unit. Their method essentially is more fine-grained, and the obtained result is a superset of ours. Nevertheless, getting more detailed information is not very hard once the block-level liveness result is achieved. Also, using blocks is clearer to demonstrate the idea.

Let $\LIVEOUT (n)$ be the set of variables that are alive on the exit of block $n$. Let $\GEN(n)$ be the upward-exposed variables in $n$. That is, variables that have been referenced in $n$ before statements in $n$ redefine it. Essentially, if a variable is referenced in $n$, it must be alive on the exit of the predecessor of $n$. Finally, let $\KILL(n)$ be the variables that are killed by $n$ through redefinition.

Therefore, our goal $\LIVEOUT(n)$ can be defined as a backward analysis: \[ \LIVEOUT(n) = \bigcup_{m \in succ(n)} \left( \GEN(m) \cup (\LIVEOUT(m) \cap \overline{\KILL(m)} )\right) \] The definition can be intuitively understood as, the live set of $n$ is:

  1. Those variables that are referenced by $n$’s successors.
  2. Those variables are alive when exiting $n$’s successors and not killed.

Now we can easily translate it into a Souffle program. But before that, I want to address two issues.

The first is that we notice an intersect operation in the formula. Last time we mentioned that intersection is tricky, but this is not the case: the $\KILL$ set can be fully computed before computing the liveness, whereas the last time $DOM$ is recursively defined. Therefore, the intersection here is straightforward.

The second issue is that, unlike last time, we encoded only a simple graph information. Here we have statements with operations, and we need to translate them into a Souffle input file. This can be done (relatively) easy with the Algebra Data Type (ADT) support in Souffle. I’ll not explain more on this. The syntax and semantics should be obvious to those with ADT experience.

Let’s first define the representation.

.decl cfg(src: symbol, dest : symbol)

.type Value =   NIL {}
            |   Variable{ v: symbol }
            |   Constant {}

.type Stmt = [
    assignee: Value,
    operandA: Value,
    operandB: Value
]

.decl prog(block: symbol, line: number, stmt: Stmt)

cfg is the graph representation of the control flow. Value is an ADT which is either a NIL or Constant or a Variable with a name. Stmt is a record type, contains three Value. prog is a set of stmt, grouped by block number and line number.

Therefore, a statement like $a \leftarrow b + 1$ will be written as [$Variable(a), $Variable(b), $Constant], statement like $a \leftarrow \dots$ will be written as [$Variable(a), NIL, NIL], etc.

The input file for prog looks like this:

B0,0,[$Variable(i),$Constant,$NIL]
B1,0,[$Variable(a), $NIL, $NIL]
B1,1,[$Variable(c), $NIL, $NIL]
B1,2,[$NIL, $Variable(a), $Variable(c)]
B2,0,[$Variable(b), $NIL, $NIL]
B2,1,[$Variable(c), $NIL, $NIL]
B2,2,[$Variable(d), $NIL, $NIL]
B3,0,[$Variable(y), $Variable(a), $Variable(b)]
B3,1,[$Variable(z), $Variable(c), $Variable(d)]
B3,2,[$Variable(i), $Variable(i), $Constant]
B3,3,[$NIL, $Variable(i), $Constant]
B5,0,[$Variable(a), $NIL, $NIL]
B5,1,[$Variable(d), $NIL, $NIL]
B5,2,[$NIL, $Variable(a), $Variable(d)]
B6,0,[$Variable(d), $NIL, $NIL]
B7,0,[$Variable(b), $NIL, $NIL]
B8,0,[$Variable(c), $NIL, $NIL]

The $\KILL$ set is straightforwardly defined:

.decl live_kill(block: symbol, var: Value)
live_kill(b, $Variable(v)) :- prog(b, _, [$Variable(v), _, _]).

A variable is killed by block b if v appears on the left-hand side of any statement in b. 2

$\GEN$ set requires more work. We need to find those variables that are referenced before being redefined in $b$. We think of the dual problem to calculate the inverse.

.decl live_redefined(block: symbol, line:number, var: Value)
live_redefined(b, i, $Variable(v)) :- prog(b, i, [_, $Variable(v), _]), 
            prog(b, j, [$Variable(v), _, _]), j < i.
live_redefined(b, i, $Variable(v)) :- prog(b, i, [_, _, $Variable(v)]), 
            prog(b, j, [$Variable(v), _, _]), j < i.

A variable v is redefined before used: if v is used in line i and there is a definition overwrite it in line j where j is before i.

Then, to calculate the $\GEN$ set, we simply find all used variables and ensure they are not in redefined.

.decl live_gen(block: symbol, var: Value)
live_gen(b, $Variable(v)) :- prog(b, i, [_, $Variable(v), _]),
    !live_redefined(b, i, $Variable(v)).
live_gen(b, $Variable(v)) :- prog(b, i, [_, _, $Variable(v)]), 
    !live_redefined(b, i, $Variable(v)).

Put them all together to calculate the $\LIVEOUT$ set:

.decl live(block: symbol, var: Value)
live(b, $Variable(v)) :- cfg(b, succ), live_gen(succ, $Variable(v)).
live(b, $Variable(v)) :- cfg(b, succ), live(succ, $Variable(v)), 
    !live_kill(succ, $Variable(v)).

Reaching Definition

In reaching definition, we are trying to answer the following question: for each block $b$, we want to know the set of definitions that can reach $b$ without being redefined.

Note, on the first glance, reaching definition is quite similar to liveness analysis, but there are some crucial differences:

  1. Reaching definition targets on a set of definition points while liveness analysis targets on a set of variable names.
  2. A definition of $v$ reaches a block $b$ does not mean $v$ is alive on the exit of $b$ (or even alive when entering $b$).
  3. A variable $v$ is alive when entering $b$ does not mean a definition of $v$ will reach $b$.

Take our example again, the definition of $a$ in $B_1$ never reach $B_6$ because the redefinition of $a$ in $B_5$ overwrites it.

Let $\REACH(n)$ be the set of definitions that reaches block $n$. Let $\GEN(n)$ be the set of definitions generated from $n$. Finally, let $\KILL(n)$ be the set of all definitions in the program and exclude those from $n$.

A definition is uniquely defined by a pair of block and line numbers. The only thing I want to clarify is that the $\KILL$ set is defined in such a prolonged and prolix way is because we want to properly formulate the set operation 3. Intuitively, $\KILL(n)$ set just kill all reaching definitions with the same variable names.

Finally,$\REACH(n)$ can be defined as a forward analysis \[ \REACH(n) = \bigcup_{m \in pred(n)} \left( \GEN(m) \cup (\REACH(m) \cap \overline{\KILL(m)} )\right) \] A definition reaches $n$ if it reaches some predecessors of $n$ and is not killed by that predecessor.

Let’s translate it into Souffle. A $\GEN$ set includes all the definitions that occur in the block.

.decl def_gen(block: symbol, line:number, var:Value)
def_gen(b, line, $Variable(v)) :- prog(b, line, [$Variable(v), _, _]).
def_gen(b, l1, $Variable(v)) <= def_gen(b, l2, $Variable(v)) :-
    l2 >= l1.

However, a variable can be defined more than once in a block, and only the last definition escape the block and reaches others. Therefore, we use subsumptive rule 4 to select the definition with the largest line number: they are the last definitions that escape the block.

Next, we define the $\KILL$ set. Unlike the formal definition we gave previously, it is acceptable to let kill just be a set of redefined variable names in actual implementation.

.decl def_kill(block: symbol, var:Value)
def_kill(b, $Variable(v)) :- prog(b, [$Variable(v), _, _]).

Finally, we put them together to form the $\REACH$ set.

.decl reach(block: symbol, var: Value, pred:symbol, line: number)
reach(m, $Variable(v), pred, line) :- 
    cfg(pred, m), def_gen(pred, line, $Variable(v)).
reach(m, $Variable(v), b, line) :- cfg(pred, m), 
    reach(pred, $Variable(v), b, line), 
    !def_kill(pred, $Variable(v)).

Summary

I’ve shown the definition of liveness analysis and reaching definition and demonstrated their computation in a Datalog engine like Souffle. The process is still relatively easy to compare to other programming paradigms. Liveness will be helpful for us when constructing SSA and allocating registers. Reaching definition will be used during the renaming process of the SSA.

In the next post, I want to construct an SSA form. However, as you will see, things are starting to become tedious 5 in the pure Datalog world, and we might need to use Souffle as a library instead of a standalone tool.


  1. Taken from Engineer a Compiler, 2nd ed. ↩︎

  2. I prefixed all relations with live to differentiate with the reaching definition we are about to do. ↩︎

  3. Every set must have elements of same shapes. ↩︎

  4. To follow the requirement of subsumptive rules, we must declare a partial order. Hence the less than and equal to operator. ↩︎

  5. In fact, renaming variables and constructing SSA is not purely a program analysis, it is more of a transformation. ↩︎