Advanced Reverse-Engineering Techniques

Symbolic Execution
In computer science, symbolic execution is a means of analyzing a program to determine what inputs cause each part of a program to execute.
Source: Wikipedia
In other words, one looks to follow the execution flow of a program by assuming symbolic values (i.e., x
, y
, z
) for the inputs, instead of feeding the program with concrete values (i.e., 0
, 19287
, ‘a’
).
The result of symbolic execution is a set of symbolic (algebraic) expressions depending on values and variables in the program, as well as constraints expressed in terms of these symbols for the possible outcome of each conditional branch (also known as path formula). Then, one can then attempt to solve these constraints and expressions using a constraint solver.
Here is a toy example illustrating how symbolic execution works, which starts with the following C code:
int checkPassword() {
// ...
hash = getHash();
decryptedHash = (3*hash+38) % 257;
if (decryptedHash == 167) {
printf("Password OK";
return 0;
} else {
printf("Password NOK");
return -1;
}
}
During the symbolic interpretation of the program, the variable hash will be assigned the symbolic value ⍺. Then, the variable decryptedHash
will be assigned the expression 3⍺ + 38 (mod 257). After that, two path constraints will be assigned to the if construct: 3⍺ + 38 (mod 257) == 167 and 3⍺ + 38 (mod 257) != 167.
The contraints solver can work on the equation 3⍺ + 38 (mod 257) for the then
branch and get ⍺ = (167-38) * 1/3 = 129*86 = 43 (mod 257).
The limitations of symbolic execution include:
- Number of paths: the number of paths can quickly increase, depending on how complex the software is. This can lead to high computational and memory costs for the symbolic execution engine and the constraint solver.
- System interactions, such as system calls and signals, can also lead to interactions with the program that are not under the control of the symbolic execution engine, unless the program is working at the kernel level.
Concolic Execution
The term concolic execution refers to a process that combines the symbolic execution of a program with its actual execution. The basic idea is to have the concrete program execution drive the symbolic execution.
Here is a small example, starting with the following C code:
1: void f(int x, int y) {
2: int z = 2*y;
3: if (x == 100) {
4: if (x < z) {
5: printf("Success"); /* state we would like to reach */
6: }
7: }
8: }
We start the program with x == 1
and y == 1
, which is an arbitrary choice. In reality, the if statement isn’t taken because 1
is not equal to 100
. In the same run, the symbolic execution engine follows the same path, but treats x
and y
as symbolic variables. It sets z = 2x and remembers that x
is not equal to 100.
In the next run, the constraint solver uses the last path condition that was encountered (x != 100
) and negates it (x == 100
). The constraint solver is then used to find values for x
and y
, given the complete set of symbolic values and path conditions, for example x == 100
and y == 0
.
In the next step, line 4 is reached, but not line 5. At this point, the path conditions are x == 100
and x < z
. Looking at the last one, the constraint solver can look for x
and y
satisfying x == 100
, x < z
and z == 2y
, e.g., x == 100
and y == 51
, and eventually reach line 5.
Depending on the type of reverse engineering that has to be performed, concolic execution can be much faster than a manual analysis. Documented applications include:
- detecting opaque predicates;
- simplifying complicated sequences of instructions;
- reconstructing flattened control-flow;
- deobfuscating VM-based protections;
- etc.
Tools
MIASM
MIASM is a reverse engineering framework that allows you to analyse, modify and generate binary programs. Its functionalities include:
- opening, modifying, generating PE / ELF32 / ELF64 files;
- (dis-)assembling x86, ARM, MIPS, etc.
- representing assembly semantics using an intermediate language
- simplifying expression for automatic deobfuscation;
- etc.
KLEE
KLEE is a symbolic virtual machine built on top of the LLVM compilation framework. It works on LLVM IR bitcode.
angr
angr is a suite of Python libraries that allows you to load a binary and analyze it. Its main capabilities include:
- program instrumentation;
- symbolic execution;
- control-flow analysis;
- data-dependency analysis;
- etc.
Triton
Triton is a dynamic binary analysis (DBA) framework. Its main capabilities include:
- dynamic symbolic execution;
- a taint engine;
- AST representations of the x86 and the x86-64 instructions set semantics;
- symbolic simplification passes;
- etc.
S2E
S2E is a whole platform for writing tools that analyze the properties and behavior of software systems. S2E comes as a modular library that provides program analysis capabilities. Its main feature is that it allows virtual machines symbolic execution (i.e., it works at both kernel and user level).
In the next episode, I’ll cover white-box cryptography and other advanced software protection topics. Stay tuned!
Thanks for reading Crumbs of Cybersecurity! Subscribe for free to receive new posts and support my work.