
Reverse Engineering Interpreted Languages

Photo by Brandon Stoll / Unsplash

Interpreted Languages

In computer science, an interpreter is a computer program that directly executes instructions written in a programming or scripting language, without requiring them previously to have been compiled into a machine language program.

Source: Wikipedia

The following languages are typically run by an interpreter or a virtual machine: Java, Python, Ruby, Perl, PHP, PostScript, and others.

Reverse engineering the bytecode produced from these languages is much easier than reverse engineering native code, because a lot of information about the source code must be kept for the interpreter (which is often called a virtual machine).
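To give a feel for how much survives compilation, here is a minimal sketch using Python, which is covered later in this post. The source snippet and its names (secret_key, hunter2) are made up for the demonstration; compile() is the standard built-in.

```python
# Hypothetical source snippet; the names are invented for this demonstration.
SOURCE = "secret_key = 'hunter2'\nprint(secret_key)\n"

# Compile the source to a code object, as the interpreter would before running it.
code = compile(SOURCE, "<demo>", "exec")

# Identifier names and literal values are stored in the compiled object.
print(code.co_names)
print(code.co_consts)
```

Both the variable name and the string literal can be read straight from the compiled object, without access to the source text.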

Java and Derivatives

Java bytecode forms the instruction set of the Java virtual machine (JVM). Each opcode is encoded on one byte, followed by zero or more bytes carrying its operands.

Opcodes fall into the following categories:

  • Loading / storing
  • Arithmetic / boolean operations
  • Type conversion
  • Object creation and manipulation
  • Operand stack management
  • Control transfer
  • Method invocation and return

Here is a Wikipedia example illustrating how some Java code is translated into Java bytecode:

outer:
for (int i = 2; i < 1000; i++) {
    for (int j = 2; j < i; j++) {
        if (i % j == 0)
            continue outer;
    }
    System.out.println (i);
}

A Java compiler might produce the following bytecode:

0:   iconst_2
1:   istore_1
2:   iload_1
3:   sipush  1000
6:   if_icmpge       44
9:   iconst_2
10:  istore_2
11:  iload_2
12:  iload_1
13:  if_icmpge       31
16:  iload_1
17:  iload_2
18:  irem
19:  ifne    25
22:  goto    38
25:  iinc    2, 1
28:  goto    11
31:  getstatic       #84; // Field java/lang/System.out:Ljava/io/PrintStream;
34:  iload_1
35:  invokevirtual   #85; // Method java/io/PrintStream.println:(I)V
38:  iinc    1, 1
41:  goto    2
44:  return
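To make the encoding concrete, here is a minimal sketch of a decoder for the first five instructions of the listing above. The opcode values come from the JVM specification; the table is deliberately tiny, and the function name disassemble and the SAMPLE bytes are my own choices for illustration, not part of any real tool.

```python
import struct

# A tiny subset of the JVM opcode table: mnemonic and operand kind.
# "" means no operand, "value" a signed 16-bit constant,
# "branch" a signed 16-bit offset relative to the instruction itself.
OPCODES = {
    0x05: ("iconst_2", ""),
    0x3C: ("istore_1", ""),
    0x1B: ("iload_1", ""),
    0x11: ("sipush", "value"),
    0xA2: ("if_icmpge", "branch"),
}


def disassemble(code: bytes) -> list[str]:
    """Decode a byte string into 'offset: mnemonic operand' lines."""
    lines = []
    pc = 0
    while pc < len(code):
        name, kind = OPCODES[code[pc]]
        if kind == "":
            lines.append(f"{pc}: {name}")
            pc += 1
        else:
            # Operands are stored big-endian.
            (operand,) = struct.unpack_from(">h", code, pc + 1)
            if kind == "branch":
                operand += pc  # turn the relative offset into a target
            lines.append(f"{pc}: {name} {operand}")
            pc += 3
    return lines


# Bytes for offsets 0 to 8 of the listing, assembled by hand.
SAMPLE = bytes.fromhex("053c1b1103e8a20026")
for line in disassemble(SAMPLE):
    print(line)
```

The branch target 44 is not stored as such: the bytes hold the offset 38, which is added to the address of the if_icmpge instruction (6).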

You can find all the details about Java bytecode in a document called The Java® Virtual Machine Specification.

In the Android ecosystem, the Java bytecode is replaced by Dalvik/ART bytecode. The Dalvik virtual machine uses a register-based architecture and fewer, typically more complex, virtual machine instructions.

A tool called dx (superseded by d8 in current Android build tools) converts Java .class files into the .dex format. The Java bytecode is converted into an alternative instruction set that was formerly interpreted by the Dalvik VM and is now translated into native code by the Android Runtime (ART) for better performance.

Standard Java bytecode consists of 8-bit stack instructions: local variables must be copied to and from the operand stack using separate instructions. Dalvik instead uses its own 16-bit instruction set that works directly on local variables.
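The difference between the two designs can be sketched with two toy interpreters. This is an illustration only: the instruction names and tuple encoding below are invented and are not real JVM or Dalvik encodings. Both programs compute c = a + b.

```python
def run_stack(program, local_vars):
    """Toy stack machine: operands travel through an operand stack."""
    stack = []
    for op, *args in program:
        if op == "load":
            stack.append(local_vars[args[0]])
        elif op == "store":
            local_vars[args[0]] = stack.pop()
        elif op == "add":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return local_vars


def run_register(program, registers):
    """Toy register machine: instructions name their operands directly."""
    for op, dst, src1, src2 in program:
        if op == "add":
            registers[dst] = registers[src1] + registers[src2]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return registers


# Four small instructions on the stack machine...
STACK_PROGRAM = [("load", 0), ("load", 1), ("add",), ("store", 2)]
# ...versus one wider instruction on the register machine.
REGISTER_PROGRAM = [("add", 2, 0, 1)]

print(run_stack(STACK_PROGRAM, [3, 4, 0]))
print(run_register(REGISTER_PROGRAM, [3, 4, 0]))
```

Both calls leave 7 in the third slot. The register version needs fewer instructions, but each one has to carry more operand information, which matches the "fewer, more complex instructions" trade-off described above.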

Here is a list of tools useful for reverse engineering Java and Dalvik bytecode:

Python

CPython is the most widely used implementation of Python. It compiles source code into bytecode, which it then interprets.

Its salient features are the following:

  • Memory management is automatic and based on reference counting.
  • Python bytecode is interpreted by a stack-based virtual machine (VM).
  • The virtual machine creates and manages several data structures (maps, lists, tuples).
  • CPython supports multiple threads, but the Global Interpreter Lock (GIL) allows only one thread to execute bytecode at any given time.
  • It uses late binding: a method is looked up by name in the class dictionary when it is called, not when the code is compiled.
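Late binding is easy to observe from Python itself. In this small sketch (the Greeter class and its methods are hypothetical names chosen for the example), a method is replaced in the class after an instance already exists, and the next call picks up the new version because the lookup happens by name at call time.

```python
class Greeter:
    def greet(self):
        return "hello"


g = Greeter()
first = g.greet()


def greet_loudly(self):
    return "HELLO"


# Rebind the name "greet" in the class dictionary after g was created.
Greeter.greet = greet_loudly
second = g.greet()

print(first, second)
print("greet" in Greeter.__dict__)
```

For a reverse engineer, this means method names have to survive in the compiled code: the VM needs them to perform the lookup.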

Functions are objects, like a list, a tuple or an instance of a class. Since functions are objects, you can talk about them without calling them. For example, you can pass a function as a parameter, or assign a function to another name:

>>> def foo (a, b):
...     return a+b
...
>>> foo
<function foo at 0x10eba0140>
>>> bar = foo
>>> bar
<function foo at 0x10eba0140>
>>> foo (1, 2)
3
>>> bar (1, 2)
3

There are several interesting attributes in a function object, including its code object (func_code in Python 2, __code__ in Python 3):

>>> foo.func_code
<code object foo at 0x10eb90d30, file "<stdin>", line 1>
>>> foo.func_code.co_varnames
('a', 'b')
>>> foo.func_code.co_consts
(None,)
>>> foo.func_code.co_argcount
2
>>> foo.func_code.co_code
'|\x00\x00|\x01\x00\x17S'
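For readers on Python 3, here is a minimal sketch of the same inspection. The only change is the attribute name: the code object is reached through __code__ instead of func_code. I leave out co_code and co_consts here because their exact contents vary between Python versions.

```python
def foo(a, b):
    return a + b


code = foo.__code__

# Names of the arguments and local variables, in order.
print(code.co_varnames)
# Number of positional arguments.
print(code.co_argcount)
# Name of the function the code object was compiled from.
print(code.co_name)
```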

A code object contains Python bytecode, which is a set of instructions for how to run a function. A list of the instructions that the Python compiler currently supports is available here: Python Bytecode Instructions.

Python bytecode can be disassembled using the dis module:

>>> import dis
>>> dis.dis (foo.func_code.co_code)
          0 LOAD_FAST           0 (0)
          3 LOAD_FAST           1 (1)
          6 BINARY_ADD
          7 RETURN_VALUE
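The dis module can also be used programmatically, which is handy when you want to process instructions rather than read them. The sketch below uses dis.get_instructions, available in Python 3. I do not show the output, since opcode names differ between Python versions.

```python
import dis


def foo(a, b):
    return a + b


# Each item is a dis.Instruction with fields such as opname and argrepr.
instructions = list(dis.get_instructions(foo))
opnames = [ins.opname for ins in instructions]

for ins in instructions:
    print(ins.opname, ins.argrepr)
```

Passing the function itself (rather than the raw co_code string) lets dis resolve variable names, which is why a and b appear instead of bare indexes.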

Here is another, more complex example:

>>> def foo (a, b):
...     if a > b:
...         while a > b:
...             a -= 1
...     elif a == b:
...         a += 2
...     else:
...         while b < a:
...             b -= 1
...
...     return (a, b)
...
>>> import dis
>>> dis.dis(foo)
  1           RESUME                   0

  2           LOAD_FAST_LOAD_FAST      1 (a, b)
              COMPARE_OP             148 (bool(>))
              POP_JUMP_IF_FALSE       20 (to L3)

  3           LOAD_FAST_LOAD_FAST      1 (a, b)
              COMPARE_OP             148 (bool(>))
              POP_JUMP_IF_FALSE       12 (to L2)

  4   L1:     LOAD_FAST                0 (a)
              LOAD_CONST               1 (1)
              BINARY_OP               23 (-=)
              STORE_FAST               0 (a)

  3           LOAD_FAST_LOAD_FAST      1 (a, b)
              COMPARE_OP             148 (bool(>))
              POP_JUMP_IF_FALSE        2 (to L2)
              JUMP_BACKWARD           12 (to L1)

 11   L2:     LOAD_FAST_LOAD_FAST      1 (a, b)
              BUILD_TUPLE              2
              RETURN_VALUE

  5   L3:     LOAD_FAST_LOAD_FAST      1 (a, b)
              COMPARE_OP              88 (bool(==))
              POP_JUMP_IF_FALSE        8 (to L4)

  6           LOAD_FAST                0 (a)
              LOAD_CONST               2 (2)
              BINARY_OP               13 (+=)
              STORE_FAST               0 (a)

 11           LOAD_FAST_LOAD_FAST      1 (a, b)
              BUILD_TUPLE              2
              RETURN_VALUE

  8   L4:     LOAD_FAST_LOAD_FAST     16 (b, a)
              COMPARE_OP              18 (bool(<))
              POP_JUMP_IF_FALSE       12 (to L6)
[...]

Python bytecode decompilers include uncompyle2, a Python 2.7 decompiler written in Python, and Decompyle++, a Python disassembler/decompiler written in C++ that aims to support bytecode from any version of Python.
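These tools usually start from a .pyc file. As a rough sketch of what that involves, the code below compiles a throwaway module, then reads the .pyc back by hand. It assumes CPython 3.7 or later, where the file starts with a 16-byte header followed by a marshalled code object; load_pyc and demo.py are names I made up for the example.

```python
import importlib.util
import marshal
import py_compile
import tempfile
from pathlib import Path


def load_pyc(path):
    """Return the module code object stored in a .pyc file (CPython 3.7+)."""
    data = Path(path).read_bytes()
    # The first 4 bytes identify the bytecode version of the interpreter.
    if data[:4] != importlib.util.MAGIC_NUMBER:
        raise ValueError("compiled by a different Python version")
    # Skip the 16-byte header: magic, flags, then timestamp/size or hash.
    return marshal.loads(data[16:])


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "demo.py"
    source.write_text("def foo(a, b):\n    return a + b\n")
    pyc_path = py_compile.compile(
        str(source), cfile=str(Path(tmp) / "demo.pyc"), doraise=True
    )
    module_code = load_pyc(pyc_path)

# Function bodies are stored as nested code objects among the constants.
function_names = [
    const.co_name for const in module_code.co_consts if hasattr(const, "co_code")
]
print(function_names)
```

marshal only understands the format of the interpreter running it, which is one reason standalone decompilers implement their own readers for each bytecode version.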

Microsoft .NET

The .NET Framework is a software framework developed by Microsoft that runs primarily on Microsoft Windows. Programs written for the .NET Framework execute in a software environment, known as the Common Language Runtime (CLR), an application virtual machine that provides services such as security, memory management and exception handling.

Decompilers for .NET include JetBrains dotPeek, ILSpy, and Telerik JustDecompile.

In the next episode, I’ll discuss how debuggers work, with a special focus on Linux. Stay tuned!

