One of the popular topics on many engineering blogs is program startup and ELF file structure.
Given that, a reader may reasonably ask whether another post about ELF makes sense. Nevertheless, I decided to approach this topic from a different angle.
There are two reasons behind that decision: first, it is a really interesting topic; more importantly, this material will be useful in the future (without spoiling why).
Running environment
The examples below can be run on most UNIX-based systems. However, I will stick with Linux, because most distributions ship recent versions of the tools used to analyze binaries out of the box.
Useful Tools
In this article, which covers the ELF format and linker behavior, we will use the following tools:
objdump
Object dump (objdump) is a simple and clean solution for a quick disassembly of code. It is great for disassembling simple and untampered binaries but will show its limitations quickly when attempting to use it for any real challenging reverse engineering tasks, especially against hostile software. Its primary weakness is that it relies on the ELF section headers and doesn’t perform control flow analysis, which are both limitations that greatly reduce its robustness. This results in not being able to correctly disassemble the code within a binary, or even open the binary at all if there are no section headers. For many conventional tasks, however, it should suffice, such as when disassembling common binaries that are not fortified, stripped, or obfuscated in any way. It can read all common ELF types.
For common examples of how to use objdump, you can check the objdump man pages:
readelf
The readelf command is one of the most useful tools around for dissecting ELF
binaries. It provides every bit of the data specific to ELF necessary for gathering
information about an object before reverse engineering it. This tool will be used
often throughout this article to gather information about symbols, segments, sections,
relocation entries, dynamic linking of data, and more. The readelf command is the
Swiss Army knife of ELF, but it is usually not very helpful for other binary formats
(in contrast to objdump). Below are a few examples of its most common uses:
readelf man pages
nm
nm is another member of the binutils package for working with executable formats.
It is the simplest tool on this list: nm lists the symbols of a file.
A few usage examples are below:
nm man pages
gdb
GNU Debugger (GDB) is not only good for debugging buggy applications. It is also an essential tool for hackers and can be used in various situations, such as learning about a program’s control flow,
changing a program’s control flow, and modifying code, registers, and data structures.
These tasks are common for hackers, reverse engineers, and system engineers.
GDB works on ELF binaries, Linux processes, and even application or kernel core dumps.
gdb man pages
The man pages are quite long because these tools come with a lot of useful options, and in my opinion there is no good reason to try to read all of them. Instead, it is worth coming back to them to search for interesting options or to clarify commands you already use.
Explore binary files
First of all, we will write a simple C program and use some of the commonly known tools to understand its structure after compilation, both as an object file and as a linked executable.
The sample program is made of two C source files and a header:
Now compile the sources. First we can create just a binary object by compiling with the -c flag.
The first tool we can use to browse the symbols inside a binary object is nm.
We see a few symbols: the main function, show_smaller, printf, scanf and __stack_chk_fail.
nm cannot resolve any symbol other than main. As the authors of the source code, we know that we call our own show_smaller function and use printf and scanf from the standard library. As a bonus we also got a GCC security feature, the stack protector, which brings in the additional symbol __stack_chk_fail. We can get rid of it with the -fno-stack-protector compilation flag (gcc -c ./main.c -fno-stack-protector).
nm prints its results in columns: the first column is the address of the symbol, the second shows the symbol type, and the last shows the symbol name.
In this example we have just two symbol types, T and U. The nm man pages describe these letters:
“U” - The symbol is undefined.
“T” - The symbol is in the text (code) section.
This is logical: at this point the only function we defined in the main object is main.
We also used other functions, some of which come from external libraries, plus show_smaller, which we defined in source.c.
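To make the column layout concrete, here is a minimal sketch that splits lines in nm's default output format into address, type, and name. The sample text is illustrative only (it is not the real output of the program above), and the helper name parse_nm_line is my own.

```python
def parse_nm_line(line):
    """Split one line of default nm output into (address, type, name)."""
    parts = line.split()
    if len(parts) == 2:
        # Undefined symbols ("U") have an empty address column.
        sym_type, name = parts
        return None, sym_type, name
    address, sym_type, name = parts
    return int(address, 16), sym_type, name


# Illustrative sample in nm's default layout (assumed, not captured output).
sample = """\
0000000000000000 T main
                 U printf
                 U show_smaller
"""

symbols = [parse_nm_line(line) for line in sample.splitlines() if line.strip()]
defined = [name for _, sym_type, name in symbols if sym_type == "T"]
undefined = [name for _, sym_type, name in symbols if sym_type == "U"]
print("defined:", defined)
print("undefined:", undefined)
```

Every name listed as undefined is something the linker has to find in another object file or library.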
Running nm on the source.o object should not be surprising in any way.
Now we will take a look at how a binary object refers to external functions. To extract more information from a binary file we can use objdump.
As a first step, print code section with assembly instruction:
The listing contains a few callq instructions encoded as e8 00 00 00 00. We know that e8 is the opcode of the callq instruction (we can trust objdump or check it ourselves in an online Intel opcode reference), and the remaining 4 bytes are a placeholder to be filled in later.
A curious reader may ask: why does 64-bit code reserve only 4 bytes for the address?
This is caused by GCC's default behavior: it tries to fit the whole program into a 32-bit addressing model (the small code model, which is the default on x86-64). In other words, it uses relative addressing, which is considered faster because of how relative jumps work internally.
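As a rough sketch of why four bytes are enough, the function below encodes a near call the way a linker would fill the placeholder: the operand is a signed 32-bit displacement measured from the end of the 5-byte instruction, so any target within about 2 GiB of the call site fits. The function name and the addresses are hypothetical.

```python
import struct


def encode_call_rel32(call_address, target):
    """Encode `call target`, located at call_address, as e8 + rel32."""
    next_instruction = call_address + 5  # 1 opcode byte + 4 displacement bytes
    displacement = target - next_instruction
    if not -2**31 <= displacement < 2**31:
        raise ValueError("target is out of range for a 32-bit relative call")
    # "<i" packs a little-endian signed 32-bit integer.
    return b"\xe8" + struct.pack("<i", displacement)


# Hypothetical addresses: one forward call and one backward call.
print(encode_call_rel32(0x1000, 0x2000).hex())
print(encode_call_rel32(0x2000, 0x1000).hex())
```

A target further away than the signed 32-bit range raises an error here; that is the situation the larger code models exist for.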
A reader interested in code models can recompile main.c with -mcmodel=large and -mcmodel=medium and then compare the output of objdump -S.
Information about external symbols is not stored in the code section. The next step is to look at the headers, which we can do with the -x option:
objdump showed a symbol table with relocations and the names of the functions. I have just introduced the term relocations, so before we move forward I need to explain what they are.
Relocations
From the ELF(5) man pages:
Relocation is the process of connecting symbolic references with symbolic
definitions. Relocatable files must have information that describes how to modify
their section contents, thus allowing executable and shared object files to hold the
right information for a process’s program image. Relocation entries are these data.
To better understand what relocations are, we can print them from main.o compiled with the default gcc options and also with the option -mcmodel=large:
What we can see: relocation types are machine-dependent. In the first example, the addresses are 64-bit but the relocations are 32-bit relative offsets. The second example shows 64-bit relocations, where the symbol address is loaded into a register.
Now we will move on from standalone object files and take a look at the linked ELF file. First we need to compile it, and then we will look again at the main function and how it changed.
Here is how relocations based on relative addresses are done in practice.
Going back to the assembly instruction where the show_smaller function is called:
e8 1b 00 00 00 callq 400657 <show_smaller> callq = e8
objdump shows show_smaller at address 0x400657. The callq instruction is located at 0x400637, and 0x1b is the offset filled in by the linker. If we add these two together, we find that we are 5 bytes short:
0x400657 - (0x400637 + 0x1b) = 0x5
0x400637 is the address of callq <show_smaller>, but the instruction itself occupies 5 bytes (e8 1b 00 00 00), and the offset is counted from the address of the next instruction. Adding these 5 bytes gives the answer: 0x400657 <show_smaller>.
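The same arithmetic can be sketched in a few lines of code. The helper name call_target is hypothetical; the bytes and addresses are the ones from the listing above.

```python
import struct


def call_target(call_address, instruction_bytes):
    """Return the absolute target of a 5-byte `e8 rel32` call instruction."""
    # "<Bi": one opcode byte followed by a little-endian signed 32-bit offset.
    opcode, displacement = struct.unpack("<Bi", instruction_bytes)
    if opcode != 0xE8:
        raise ValueError("not a near relative call")
    # The offset is relative to the address of the next instruction.
    return call_address + len(instruction_bytes) + displacement


target = call_target(0x400637, bytes.fromhex("e81b000000"))
print(hex(target))  # 0x400657
```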
Relocations are not limited to functions. As we saw before, the printf argument, which is a string constant, is also resolved through a relocation:
0060d: bf 14 07 40 00 mov $0x400714,%edi
We can take a look at the content at this address in the .rodata section:
This confirms that address 0x400714 holds the letter “E”, the first character of the "Enter a value" string.
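Unlike the call, this instruction carries an absolute address rather than a relative offset. A small sketch (the helper name is my own) reads the little-endian immediate back out of the instruction bytes shown above:

```python
import struct


def mov_edi_immediate(instruction_bytes):
    """Return the 32-bit immediate of a `mov $imm32, %edi` instruction."""
    # Opcode bf is `mov imm32` into edi; the immediate is little-endian.
    opcode, immediate = struct.unpack("<BI", instruction_bytes)
    if opcode != 0xBF:
        raise ValueError("not a mov imm32 into edi")
    return immediate


address = mov_edi_immediate(bytes.fromhex("bf14074000"))
print(hex(address))  # 0x400714
```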
ELF file structure
ELF file types
An ELF file may be marked as one of the following types:
• ET_NONE: This is an unknown type. It indicates that the file type is unknown or has not yet been defined.
• ET_REL: This is a relocatable file. ELF type relocatable means that the file is marked as a relocatable piece of code, sometimes called an object file. Relocatable object files are generally pieces of position-independent code (PIC) that have not yet been linked into an executable. You will often see *.o files in a compiled code base. These are the files that hold code and data suitable for creating an executable file.
• ET_EXEC: This is an executable file. ELF type executable means that the file is marked as an executable file. These files are also called programs and are the entry point of how a process begins running.
• ET_DYN: This is a shared object. ELF type dynamic means that the file is marked as a dynamically linkable object file, also known as a shared library. Shared libraries are loaded and linked into a program’s process image at runtime.
• ET_CORE: This is an ELF type core that marks a core file. A core file is a dump of a full process image taken when a program crashes or when the process receives a SIGSEGV signal (segmentation violation). Using GDB we can read these files and understand why the crash happened.
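The type is stored in the e_type field, a 16-bit value that immediately follows the 16-byte e_ident array at the start of the file. The sketch below reads it directly; the function names are hypothetical, and the numeric values are the standard ones from the ELF specification.

```python
import struct

ELF_TYPES = {0: "ET_NONE", 1: "ET_REL", 2: "ET_EXEC", 3: "ET_DYN", 4: "ET_CORE"}


def elf_type(header):
    """Return the ELF type name from the first 18 bytes of a file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[5] (EI_DATA): 1 means little-endian, 2 means big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return ELF_TYPES.get(e_type, hex(e_type))


def elf_type_of_file(path):
    """Convenience wrapper: read just enough of the file to classify it."""
    with open(path, "rb") as f:
        return elf_type(f.read(18))
```

For example, elf_type_of_file("main.o") should report ET_REL for an object file. Note that on toolchains that build position-independent executables by default, a linked program may report ET_DYN rather than ET_EXEC.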
Headers and identification of architecture
The most basic piece of information needed at the very beginning of program loading is the target architecture. An ELF file contains program headers that describe the segments within a binary and are necessary for program loading. Segments are understood by the OS kernel at load time; they describe the memory layout of an executable on disk and how it should be translated into memory. The program header table can be accessed through the offset stored in the ELF header member e_phoff (program header table offset).
Using the file tool we can read basic information from the ELF header:
Segments
Executable code and program data are stored inside parts of the ELF file called segments. The boundaries of these regions and their sizes are defined in the program header table. Each segment is described by an Elf32_Phdr or Elf64_Phdr structure, and these structures are laid out contiguously. The number of segments is given by the e_phnum field of the ElfN_Ehdr structure.
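Since the program headers are contiguous, their file offsets follow directly from e_phoff, e_phentsize and e_phnum. Here is a minimal sketch for little-endian ELF64 files only; the field names follow Elf64_Ehdr, while the Python names are my own.

```python
import struct
from collections import namedtuple

# Layout of Elf64_Ehdr: 64 bytes, little-endian, no padding.
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")
Elf64Header = namedtuple(
    "Elf64Header",
    "e_ident e_type e_machine e_version e_entry e_phoff e_shoff e_flags "
    "e_ehsize e_phentsize e_phnum e_shentsize e_shnum e_shstrndx",
)


def parse_elf64_header(data):
    """Parse the ELF header from the first 64 bytes of a file."""
    header = Elf64Header._make(ELF64_HEADER.unpack_from(data))
    if header.e_ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[4] == 2 means 64-bit, e_ident[5] == 1 means little-endian.
    if header.e_ident[4] != 2 or header.e_ident[5] != 1:
        raise ValueError("this sketch handles little-endian ELF64 only")
    return header


def program_header_offsets(header):
    """File offset of every entry in the program header table."""
    return [header.e_phoff + i * header.e_phentsize
            for i in range(header.e_phnum)]
```

To try it on a real file, read the first 64 bytes in binary mode and pass them to parse_elf64_header.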
The most important segments
PT_INTERP: Describes the full path to the dynamic linker ld.so. The segment points to the region containing the path through the p_offset field.
PT_LOAD: Describes a region that will be placed into program memory. Data starting at p_offset will be copied to p_vaddr. An executable will always have at least one PT_LOAD segment.
PT_DYNAMIC: Contains the information the loader requires to load the ELF.
Segment PT_LOAD
Executable files usually contain two PT_LOAD segments: the first describes the machine code, and the second describes the data used by that code. That is why we see two segments, one flagged R E and the other RW.
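The R, W and E letters come from the p_flags bit field of each program header. The sketch below decodes one little-endian Elf64_Phdr entry; the constants are the standard ELF values, and the sample values in the usage line are made up for illustration.

```python
import struct

# Layout of Elf64_Phdr: 56 bytes, little-endian.
ELF64_PHDR = struct.Struct("<IIQQQQQQ")
PT_NAMES = {1: "PT_LOAD", 2: "PT_DYNAMIC", 3: "PT_INTERP"}
PF_X, PF_W, PF_R = 1, 2, 4


def flags_to_string(p_flags):
    """Render p_flags as three columns, e.g. 'R E' or 'RW '."""
    return (("R" if p_flags & PF_R else " ")
            + ("W" if p_flags & PF_W else " ")
            + ("E" if p_flags & PF_X else " "))


def describe_phdr(raw):
    """Decode one 56-byte program header into a small dictionary."""
    (p_type, p_flags, p_offset, p_vaddr,
     p_paddr, p_filesz, p_memsz, p_align) = ELF64_PHDR.unpack(raw)
    return {
        "type": PT_NAMES.get(p_type, hex(p_type)),
        "flags": flags_to_string(p_flags),
        "offset": p_offset,
        "vaddr": p_vaddr,
        "filesz": p_filesz,
        "memsz": p_memsz,
    }


# Made-up header: a readable and executable PT_LOAD segment.
example = ELF64_PHDR.pack(1, PF_R | PF_X, 0, 0x400000, 0x400000,
                          0x800, 0x800, 0x200000)
print(describe_phdr(example))
```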
Segment PT_DYNAMIC
Every executable that is not statically linked, that is, one that depends on shared libraries (*.so), contains a section called .dynamic.
We can display the dynamic linking array from the previously compiled test program.
ELF section headers
Now that we have looked at program headers, it is time to take a look at section headers. I really want to point out the distinction between the two, because I often hear people calling sections segments and vice versa.
A section is not a segment. Segments are necessary for program execution, and within each segment there is either code or data divided up into sections. We can recall here the Wikipedia picture with the dual view of an ELF file: the runtime view and the binary file view.
A section header table exists to reference the location and size of these sections and is primarily for linking and debugging purposes. Section headers are not necessary for program execution, and a program will execute just fine without having a section header table. This is because the section header table doesn’t describe the program memory layout. That is the responsibility of the program header table. The section headers are really just complementary to the program headers.
Program in the memory
Linux comes with a virtual filesystem called procfs which usually is mounted at /proc. This filesystem provides a lot of useful options for process tracing and debugging.
Inside the /proc folder we can find a lot of folders named with numbers. Each of these folders corresponds to a running process with a unique pid.
By running our test process we create a new entry.
Our running process has pid = 25824, so now we can see the entries inside /proc/<pid>. We don’t have enough time to cover all of them.
Process Memory
The maps file is a procfs entry that contains information about the memory map of the given process. This information is available in a nicely readable format:
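Each line of the maps file has the form "start-end perms offset dev inode pathname". The following sketch parses that format; the function names and the sample line are illustrative assumptions, not output captured from the test program.

```python
from collections import namedtuple

Mapping = namedtuple("Mapping", "start end perms offset dev inode path")


def parse_maps_line(line):
    """Parse one line of /proc/<pid>/maps into a Mapping tuple."""
    fields = line.split(None, 5)
    address_range, perms, offset, dev, inode = fields[:5]
    # Anonymous mappings have no pathname column.
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(part, 16) for part in address_range.split("-"))
    return Mapping(start, end, perms, int(offset, 16), dev, int(inode), path)


def executable_mappings(pid):
    """Return the mappings of a process that carry the execute permission."""
    with open(f"/proc/{pid}/maps") as maps_file:
        return [m for m in map(parse_maps_line, maps_file) if "x" in m.perms]


# Illustrative line in the documented format (values are made up).
sample = "00400000-00401000 r-xp 00000000 08:01 1234   /home/user/test\n"
print(parse_maps_line(sample))
```

Calling executable_mappings with the pid of the running test program would list the regions where its code lives.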
Code injection
Thanks to procfs, locating the address of a function in memory is easy.
Once we find it, we can try to modify the running code in memory.
We will try to change the subtraction inside the loop to an addition. We can do this in two different ways:
Find the proper opcode in an opcode reference, and then write a script that writes this value to memory at the given offset
Recompile the example and get the instruction bytes from objdump, then save them to a file and inject them using dd
The first approach is a little more subtle, but it requires a little more knowledge, whereas the second looks much more intuitive, so let’s try it.
First of all, we need to run our test program in a separate terminal and leave it there for the moment. The program will create an entry inside the /proc folder, which we can easily find using ps -e | grep test.
The next step is to change the show_smaller function so that it adds instead of subtracts. Compile it and find the instruction that we want to use.
After compilation we can simply dump show_smaller using gdb.
We replaced 836DFC01 sub dword [rbp-0x4],byte +0x1 with 8345FC01 add dword [rbp-0x4],byte +0x1, so we changed only the operation, from - to +.
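The two encodings differ in a single byte. For opcode 0x83, the operation is selected by the three-bit reg field of the ModRM byte that follows it, which is why flipping 0x6d to 0x45 turns sub into add. A small sketch (the helper name is my own) that decodes this field:

```python
# Operations selected by the ModRM reg field for opcode 0x83, in order 0..7.
GROUP1_OPERATIONS = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]


def decode_group1_imm8(instruction_bytes):
    """Return the operation name of an `83 /r ib` instruction."""
    opcode, modrm = instruction_bytes[0], instruction_bytes[1]
    if opcode != 0x83:
        raise ValueError("not an 0x83 group-1 instruction")
    reg = (modrm >> 3) & 0b111  # bits 5..3 of ModRM select the operation
    return GROUP1_OPERATIONS[reg]


original = bytes.fromhex("836dfc01")
patched = bytes.fromhex("8345fc01")
differing = [i for i, (a, b) in enumerate(zip(original, patched)) if a != b]
print(decode_group1_imm8(original), "->", decode_group1_imm8(patched))
print("differing byte offsets:", differing)
```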
Now just type 5 and press Enter. If show_smaller was patched correctly, we should be overwhelmed by numbers in the terminal…