Reverse Engineering Interpreted Languages

Interpreted Languages
In computer science, an interpreter is a computer program that directly executes instructions written in a programming or scripting language, without requiring them previously to have been compiled into a machine language program.
Source: Wikipedia
The following languages are typically interpreted or run on a virtual machine: Java, Python, Ruby, Perl, PHP, PostScript, etc.
Reverse engineering the bytecode compiled from such languages is much easier than reverse engineering native code, because the interpreter (often also called a virtual machine) needs much of the information from the source code, such as names and types, to be preserved.
Java and Derivatives
Java bytecode forms the instruction set of the Java virtual machine. Each opcode is encoded in a single byte, followed by zero or more bytes of operands.
There are the following categories of opcodes:
- Loading / storing
- Arithmetic / boolean operations
- Type conversion
- Object creation and manipulation
- Operand stack management
- Control transfer
- Method invocation and return
Here is a Wikipedia example illustrating how some Java code is translated into Java bytecode:
outer:
for (int i = 2; i < 1000; i++) {
    for (int j = 2; j < i; j++) {
        if (i % j == 0)
            continue outer;
    }
    System.out.println (i);
}
A Java compiler might produce the following bytecode:
0: iconst_2
1: istore_1
2: iload_1
3: sipush 1000
6: if_icmpge 44
9: iconst_2
10: istore_2
11: iload_2
12: iload_1
13: if_icmpge 31
16: iload_1
17: iload_2
18: irem
19: ifne 25
22: goto 38
25: iinc 2, 1
28: goto 11
31: getstatic #84; // Field java/lang/System.out:Ljava/io/PrintStream;
34: iload_1
35: invokevirtual #85; // Method java/io/PrintStream.println:(I)V
38: iinc 1, 1
41: goto 2
44: return
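To make the encoding concrete, here is a minimal Python sketch of a disassembler covering only the opcodes that appear in the listing above. The byte string is hand-assembled from that listing (an assumption, not the output of an actual compiler run), and the names OPCODES, EXAMPLE and disassemble are hypothetical. Only the opcode values are taken from the JVM specification.

```python
import struct

# Subset of JVM opcodes: value -> (mnemonic, operand kind).
OPCODES = {
    0x05: ("iconst_2", None),
    0x11: ("sipush", "s16"),         # signed 16-bit immediate
    0x1B: ("iload_1", None),
    0x1C: ("iload_2", None),
    0x3C: ("istore_1", None),
    0x3D: ("istore_2", None),
    0x70: ("irem", None),
    0x84: ("iinc", "iinc"),          # local index + signed 8-bit constant
    0x9A: ("ifne", "branch"),        # signed 16-bit offset from this opcode
    0xA2: ("if_icmpge", "branch"),
    0xA7: ("goto", "branch"),
    0xB1: ("return", None),
    0xB2: ("getstatic", "cp"),       # unsigned 16-bit constant pool index
    0xB6: ("invokevirtual", "cp"),
}

# Hand-assembled bytes for the listing above (assumed encoding).
EXAMPLE = bytes.fromhex(
    "053c1b1103e8a20026053d1c1ba200121b1c70"
    "9a0006a70010840201a7ffefb200541bb60055"
    "840101a7ffd9b1"
)

def disassemble(code):
    """Return one 'offset: mnemonic operands' string per instruction."""
    lines = []
    pc = 0
    while pc < len(code):
        name, kind = OPCODES[code[pc]]
        if kind is None:
            text, size = name, 1
        elif kind == "s16":
            (value,) = struct.unpack_from(">h", code, pc + 1)
            text, size = f"{name} {value}", 3
        elif kind == "branch":
            (offset,) = struct.unpack_from(">h", code, pc + 1)
            text, size = f"{name} {pc + offset}", 3
        elif kind == "cp":
            (index,) = struct.unpack_from(">H", code, pc + 1)
            text, size = f"{name} #{index}", 3
        else:  # iinc
            index = code[pc + 1]
            (const,) = struct.unpack_from(">b", code, pc + 2)
            text, size = f"{name} {index}, {const}", 3
        lines.append(f"{pc}: {text}")
        pc += size
    return lines

if __name__ == "__main__":
    print("\n".join(disassemble(EXAMPLE)))
```

Note how branch targets are stored as offsets relative to the branching instruction, which is why the sketch adds pc to the decoded value.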
You can find all the details about Java bytecode in a document called The Java® Virtual Machine Specification.
In the Android ecosystem, the Java bytecode is replaced by Dalvik/ART bytecode. The Dalvik virtual machine uses a register-based architecture and fewer, typically more complex, virtual machine instructions.
A tool called dx (superseded by d8 in recent Android build tools) is used to convert Java .class files into the .dex format: the Java bytecode is converted into an alternative instruction set, formerly executed by the Dalvik VM, but now translated into native code by the Android Runtime (ART) for better performance.
The standard Java bytecode executes 8-bit stack instructions. Local variables must be copied to and from the operand stack using separate instructions. Instead, Dalvik uses its own 16-bit instruction set that works directly on local variables.
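The difference between the two designs can be illustrated with a toy sketch in Python. Both mini-interpreters below are hypothetical and only borrow mnemonics from the JVM (iload, iadd, istore) and from Dalvik (add-int); they compute c = a + b, where the variables live in slots 0, 1 and 2.

```python
def run_stack(program, local_vars):
    """Toy stack machine: operands travel through an operand stack."""
    stack = []
    for op, *args in program:
        if op == "iload":
            stack.append(local_vars[args[0]])
        elif op == "iadd":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "istore":
            local_vars[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown instruction: {op}")
    return local_vars

def run_register(program, registers):
    """Toy register machine: instructions name their operands directly."""
    for op, dst, src1, src2 in program:
        if op == "add-int":
            registers[dst] = registers[src1] + registers[src2]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return registers

# c = a + b on the stack machine: four small instructions.
STACK_PROGRAM = [("iload", 0), ("iload", 1), ("iadd",), ("istore", 2)]

# The same statement on the register machine: one wider instruction.
REGISTER_PROGRAM = [("add-int", 2, 0, 1)]

if __name__ == "__main__":
    print(run_stack(STACK_PROGRAM, [3, 4, 0]))
    print(run_register(REGISTER_PROGRAM, [3, 4, 0]))
```

The register version needs fewer instructions, but each one carries more information, which matches the "fewer, typically more complex" description above.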
Here is a list of tools useful for reverse engineering Java and Dalvik bytecode:
- javap: Java class file disassembler
- Some Java decompilers (see here, too)
- jadx: Dex to Java decompiler
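Before picking a tool, it helps to know which kind of file you are looking at. The following Python sketch checks the first bytes of a file for the Java class magic number (0xCAFEBABE, followed by the minor and major version) or the dex magic ("dex\n", a three-digit version and a NUL byte). The function name identify is hypothetical, and the check is deliberately simplistic: other file formats can start with the same bytes.

```python
import struct

def identify(data):
    """Guess whether the given leading bytes belong to a .class or .dex file."""
    if len(data) >= 8 and data[:4] == b"\xca\xfe\xba\xbe":
        # Class file header: u4 magic, u2 minor_version, u2 major_version.
        minor, major = struct.unpack_from(">HH", data, 4)
        return f"Java class file, version {major}.{minor}"
    if len(data) >= 8 and data[:4] == b"dex\n" and data[7] == 0:
        # Dex header: "dex\n" + three ASCII digits + NUL.
        return f"Dalvik dex file, version {data[4:7].decode('ascii')}"
    return "unknown"

if __name__ == "__main__":
    print(identify(bytes.fromhex("cafebabe00000034")))
    print(identify(b"dex\n035\x00"))
```

Major version 52 corresponds to Java 8.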
Python
CPython is the most widely used implementation of Python. It compiles source code into bytecode, which it then interprets.
Its salient features are the following:
- Automatic memory management, based on reference counting
- Python bytecode interpretation is done by a stack-based virtual machine (VM).
- The virtual machine creates and manages several data structures (maps, lists, tuples).
- CPython supports multiple threads, but the Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time.
- It uses late binding: a method is looked up by name in the class dictionary at run time, when it is actually needed, rather than being resolved at compile time.
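Late binding is easy to observe. In this small sketch (the class Greeter and its method are made up for illustration), replacing a method in the class dictionary changes the behaviour of an instance that was created earlier, because the name is resolved when the call happens.

```python
class Greeter:
    def greet(self):
        return "hello"

g = Greeter()
first = g.greet()  # resolved through Greeter.__dict__["greet"] at call time

# Rebind the name in the class dictionary after the instance was created.
Greeter.greet = lambda self: "patched"
second = g.greet()  # the same call site now finds the new function

if __name__ == "__main__":
    print(first, second)
```

For a reverse engineer, this means that the bytecode of a call site only contains the method name, not a fixed target address.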
Functions are objects, like a list, a tuple or an instance of a class. Since functions are objects, you can talk about them without calling them. For example, you can pass a function as a parameter, or assign a function to another name:
def foo (a, b):
    return a+b
>>> foo
<function foo at 0x10eba0140>
>>> bar = foo
>>> bar
<function foo at 0x10eba0140>
>>> foo (1, 2)
3
>>> bar (1, 2)
3
There are several interesting attributes in a function object, including its code object (the session below uses Python 2; in Python 3, func_code is spelled __code__):
>>> foo.func_code
<code object foo at 0x10eb90d30, file "<stdin>", line 1>
>>> foo.func_code.co_varnames
('a', 'b')
>>> foo.func_code.co_consts
(None,)
>>> foo.func_code.co_argcount
2
>>> foo.func_code.co_code
'|\x00\x00|\x01\x00\x17S'
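For readers on Python 3, the same inspection can be sketched as follows. The attribute is named __code__ there, and co_code is a bytes object whose content depends on the interpreter version, so the sketch only prints it as hex rather than assuming a particular value.

```python
def foo(a, b):
    return a + b

code = foo.__code__          # Python 3 name of the Python 2 func_code attribute

name = code.co_name          # name of the function the code object belongs to
varnames = code.co_varnames  # names of arguments and local variables
argcount = code.co_argcount  # number of positional arguments
raw = code.co_code           # raw bytecode; version dependent

if __name__ == "__main__":
    print(name, varnames, argcount)
    print(raw.hex())
```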
A code object contains Python bytecode, which is a set of instructions for how to run a function. A list of the instructions that the Python compiler currently supports is available here: Python Bytecode Instructions.
Python bytecode can be disassembled using the dis module:
>>> import dis
>>> dis.dis (foo.func_code.co_code)
0 LOAD_FAST 0 (0)
3 LOAD_FAST 1 (1)
6 BINARY_ADD
7 RETURN_VALUE
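In Python 3, the dis module can also return the instructions as objects instead of printing them, which is handy when scripting an analysis. The sketch below uses dis.get_instructions; the exact opcode names differ between Python versions (for example, the addition is BINARY_ADD in older releases and BINARY_OP in newer ones), so no fixed listing is assumed.

```python
import dis

def foo(a, b):
    return a + b

# Each Instruction carries the offset, the opcode name and a readable argument.
instructions = list(dis.get_instructions(foo))
opnames = [ins.opname for ins in instructions]

if __name__ == "__main__":
    for ins in instructions:
        print(ins.offset, ins.opname, ins.argrepr)
```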
Here is another, more complex example:
>>> def foo (a, b):
...     if a > b:
...         while a > b:
...             a -= 1
...     elif a == b:
...         a += 2
...     else:
...         while b < a:
...             b -= 1
...
...     return (a, b)
...
>>> import dis
>>> dis.dis(foo)
  1           RESUME                  0
  2           LOAD_FAST_LOAD_FAST     1 (a, b)
              COMPARE_OP            148 (bool(>))
              POP_JUMP_IF_FALSE      20 (to L3)
  3           LOAD_FAST_LOAD_FAST     1 (a, b)
              COMPARE_OP            148 (bool(>))
              POP_JUMP_IF_FALSE      12 (to L2)
  4   L1:     LOAD_FAST               0 (a)
              LOAD_CONST              1 (1)
              BINARY_OP              23 (-=)
              STORE_FAST              0 (a)
  3           LOAD_FAST_LOAD_FAST     1 (a, b)
              COMPARE_OP            148 (bool(>))
              POP_JUMP_IF_FALSE       2 (to L2)
              JUMP_BACKWARD          12 (to L1)
 11   L2:     LOAD_FAST_LOAD_FAST     1 (a, b)
              BUILD_TUPLE             2
              RETURN_VALUE
  5   L3:     LOAD_FAST_LOAD_FAST     1 (a, b)
              COMPARE_OP             88 (bool(==))
              POP_JUMP_IF_FALSE       8 (to L4)
  6           LOAD_FAST               0 (a)
              LOAD_CONST              2 (2)
              BINARY_OP              13 (+=)
              STORE_FAST              0 (a)
 11           LOAD_FAST_LOAD_FAST     1 (a, b)
              BUILD_TUPLE             2
              RETURN_VALUE
  8   L4:     LOAD_FAST_LOAD_FAST    16 (b, a)
              COMPARE_OP             18 (bool(<))
              POP_JUMP_IF_FALSE      12 (to L6)
[...]
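The labels (L1, L2, ...) in such a listing mark jump targets, which are the natural starting points for rebuilding the control flow. As a rough sketch, the offsets of these targets can be collected with dis.get_instructions; the helper name jump_targets is hypothetical, and the offsets themselves vary between Python versions.

```python
import dis

def foo(a, b):
    if a > b:
        while a > b:
            a -= 1
    elif a == b:
        a += 2
    else:
        while b < a:
            b -= 1

    return (a, b)

def add(a, b):
    return a + b

def jump_targets(func):
    """Return the bytecode offsets that some jump instruction can land on."""
    return sorted(
        ins.offset for ins in dis.get_instructions(func) if ins.is_jump_target
    )

if __name__ == "__main__":
    print("foo:", jump_targets(foo))
    print("add:", jump_targets(add))
```

A straight-line function such as add has no jump targets at all, whereas every branch and loop in foo contributes at least one.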
Python bytecode decompilers include uncompyle2, which is a Python 2.7 decompiler written in Python, and Decompyle++, which is a Python disassembler/decompiler written in C++ that aims to support all versions of Python bytecode.
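These tools usually work on compiled .pyc files. As a minimal sketch of what they have to parse first, the code below compiles a tiny module, checks the 4-byte magic number and unmarshals the code object that follows the header. It assumes Python 3.7 or later, where the header is 16 bytes long, and a file produced by the same interpreter that runs the script; the helper load_pyc and the file names are hypothetical.

```python
import importlib.util
import marshal
import os
import py_compile
import tempfile
import types

def load_pyc(path):
    """Return the module code object stored in a .pyc file (Python 3.7+)."""
    with open(path, "rb") as handle:
        data = handle.read()
    # Header: 4-byte magic, 4-byte flags, 8 bytes of mtime/size or hash.
    if data[:4] != importlib.util.MAGIC_NUMBER:
        raise ValueError("compiled by a different Python version")
    return marshal.loads(data[16:])

with tempfile.TemporaryDirectory() as tmp:
    source_path = os.path.join(tmp, "sample.py")
    with open(source_path, "w") as handle:
        handle.write("def foo(a, b):\n    return a + b\n")
    pyc_path = py_compile.compile(
        source_path, cfile=os.path.join(tmp, "sample.pyc"), doraise=True
    )
    module_code = load_pyc(pyc_path)

# Functions defined in the module appear as nested code objects in co_consts.
function_names = [
    const.co_name
    for const in module_code.co_consts
    if isinstance(const, types.CodeType)
]

if __name__ == "__main__":
    print(function_names)
```

Because marshal only understands the format of the running interpreter, tools that support many Python versions ship their own parsers for this step.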
Microsoft .NET
The .NET Framework is a software framework developed by Microsoft that runs primarily on Microsoft Windows. Programs written for the .NET Framework are compiled to Common Intermediate Language (CIL) bytecode and execute in a software environment known as the Common Language Runtime (CLR), an application virtual machine that provides services such as security, memory management and exception handling.
Decompilers for .NET include JetBrains dotPeek, ILSpy, and Telerik JustDecompile.
In the next episode, I’ll discuss how debuggers work, with a special focus on Linux. Stay tuned!