Reverse Engineering by Tracking Function Calls

As McAfee Labs researchers examine malware, we often have to reverse-engineer those binaries when we don’t have the source code. Because reverse engineering depends heavily on the state of the binaries, most of the time it is a tedious manual task. Any tool or technique that speeds our work is a big help.

A lot of great tools have been developed to accelerate reverse engineering, but there is still room for innovation. Most of the time when we analyze malware or normal applications we try to find the key functions, which may reside deep inside the binary. Finding them manually with debuggers is very time consuming.

One effective way to accelerate reverse-engineering tasks is to plot every call from the binary (function to function, function to API) on a directed graph using runtime event captures. The graph displays the relationships between the calls inside the binary; these relationships can show us which code chunks are most important.

As a proof of concept, we have developed a plug-in that uses the IDA database to enumerate the functions inside the binary. We used the Python modules pefile and pydbg to hook the APIs and capture calls. Then we used pydot and the Graphviz libraries to plot the calls on a graph. We generated a relationship graph using the following steps.

Assign node1 to the caller and node2 to the callee (node1 -> node2)
Inside node2, node2 becomes node1 and the new callee becomes node2 (node2 as node1 -> new node2)
Repeating this for every captured call creates a relationship call graph for the entire binary
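
The steps above can be sketched in plain Python. This is only an illustration, not the plug-in's code: it assumes the captured events are already available as (caller, callee) pairs, the helper names build_call_graph and to_dot are hypothetical, and it writes Graphviz DOT text directly instead of going through pydot.

```python
from collections import defaultdict


def build_call_graph(events):
    """Turn (caller, callee) events into an adjacency map: caller -> set of callees."""
    graph = defaultdict(set)
    for caller, callee in events:
        graph[caller].add(callee)  # node1 -> node2
        graph[callee]              # register the callee as a node; it may become a caller later
    return graph


def to_dot(graph, name="calls"):
    """Render the adjacency map as Graphviz DOT text for a directed graph."""
    lines = ["digraph %s {" % name]
    for caller in sorted(graph):
        for callee in sorted(graph[caller]):
            lines.append('    "%s" -> "%s";' % (caller, callee))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical capture: the names are placeholders, not taken from a real trace.
events = [
    ("main", "sub_A"),
    ("sub_A", "sub_B"),
    ("sub_A", "CreateFileA"),
    ("main", "sub_A"),  # repeated calls collapse into a single edge
]
graph = build_call_graph(events)
dot_text = to_dot(graph)
print(dot_text)
```

Storing callees in a set keeps one edge per caller/callee pair, which keeps the plotted graph readable when a function is called in a loop.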

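from idautils import Segments, Functions
from idc import SegEnd, GetFunctionName

class idafunction:
    '''
    class to enumerate the functions from the current IDA database
    '''
    def __init__(self):
        self.func_dict = {}

    def functions(self):
        # Walk every segment and record each function's start address and name.
        for seg_ea in Segments():
            for func_ea in Functions(seg_ea, SegEnd(seg_ea)):
                self.func_dict[func_ea] = GetFunctionName(func_ea)
        return self.func_dict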

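import struct
from pydbg.defines import DBG_CONTINUE

def idafun_generic(self, dummy=None):
    '''
    generic call back function for breakpoint handling.
    '''
    EIP = self.dbg.context.Eip
    try:
        # On function entry, [ESP] holds the return address, which lies inside the caller.
        esp = self.dbg.context.Esp
        paddr = self.dbg.read_process_memory(esp, 4)
        addr = struct.unpack("<L", paddr)[0]  # explicit 4-byte little-endian value
        addr = int(addr)
        CallFrom = GetFunctionName(addr)
        if not CallFrom:
            CallFrom = hex(addr)
        log("FUNCTION ADDRESS: %x\tCALL FROM: %s \tFUNCTION NAME: %s" % (EIP, CallFrom, self.func_dict[EIP]))
        logn("FUNCTION ADDRESS: %x\tCALL FROM: %s \tFUNCTION NAME: %s" % (EIP, CallFrom, self.func_dict[EIP]))
        SetColor(EIP, CIC_FUNC, FUNC_COLOR)
    except Exception:
        # Assumption: the listing is cut off before its except clause, so failures are ignored here.
        pass
    return DBG_CONTINUE  # assumption: returned so that pydbg resumes the debuggee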
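
The caller lookup in the callback can be tried outside IDA with a small stand-alone sketch. It assumes a 32-bit little-endian target and stands in for GetFunctionName with a nearest-start search over a dictionary of function start addresses; read_return_address, resolve_caller, and the sample addresses are hypothetical. Because the sketch knows nothing about where functions end, an address past the last function would still be attributed to it.

```python
import bisect
import struct


def read_return_address(stack_bytes):
    """Decode the 4-byte little-endian return address found at [ESP] on function entry."""
    return struct.unpack("<L", stack_bytes[:4])[0]


def resolve_caller(addr, func_dict):
    """Map an address to the function with the nearest start address at or below it."""
    starts = sorted(func_dict)
    idx = bisect.bisect_right(starts, addr) - 1
    if idx < 0:
        return hex(addr)  # below every known function: fall back to the raw address
    return func_dict[starts[idx]]


# Hypothetical function table and stack contents, for illustration only.
func_dict = {0x401000: "sub_401000", 0x401200: "sub_401200"}
stack_bytes = struct.pack("<L", 0x401234)
caller = resolve_caller(read_return_address(stack_bytes), func_dict)
print(caller)
```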

The relationship graph also contains the API calls made from each function. These help us identify the most sensitive areas inside the binary, and also allow us to track the entire path from the main function.
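
Tracking a path from the main function to a sensitive API amounts to a search over the call graph. The following is a minimal sketch, assuming the graph is held as a dictionary that maps each caller to its callees; find_call_path and the node names are hypothetical.

```python
from collections import deque


def find_call_path(graph, start, target):
    """Breadth-first search for a shortest call chain from start to target."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for callee in sorted(graph.get(node, ())):
            if callee not in seen:
                seen.add(callee)
                queue.append(path + [callee])
    return None  # the target was never reached from start in the captured trace


# Hypothetical graph: placeholder function names and two well-known Windows APIs.
graph = {
    "main": {"sub_A", "sub_B"},
    "sub_A": {"sub_C"},
    "sub_B": {"CreateFileA"},
    "sub_C": {"WriteProcessMemory"},
}
path = find_call_path(graph, "main", "WriteProcessMemory")
print(" -> ".join(path))
```

Breadth-first search returns a shortest chain, which is usually the most convenient one to follow in a debugger.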

If the binary is unpacked, then our technique will greatly reduce analysis time. If the binary is packed, then in most cases our technique can still provide a fair amount of detail about the binary that might be useful in unpacking, or in identifying new allocations, calls from new allocations, and the type of calls from new allocations (to unknown or known addresses, etc.).
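
One way to separate calls to known addresses from calls to unknown ones is to compare each call target against the address ranges we already know about. This sketch is an assumption about how that could be done, not a description of the plug-in; classify_target and every range below are hypothetical.

```python
def classify_target(addr, function_ranges, module_ranges):
    """Label a call target as a known function, a known module address, or unknown."""
    for start, end, name in function_ranges:
        if start <= addr < end:
            return "function:" + name
    for start, end, name in module_ranges:
        if start <= addr < end:
            return "module:" + name
    return "unknown"  # for example, code in a region allocated at run time


# Hypothetical ranges given as (start, end, name); end is exclusive.
function_ranges = [(0x401000, 0x401200, "sub_401000")]
module_ranges = [(0x7C800000, 0x7C8F6000, "kernel32.dll")]

targets = (0x401010, 0x7C801234, 0x00A50000)
labels = [classify_target(t, function_ranges, module_ranges) for t in targets]
for target, label in zip(targets, labels):
    print("%08x %s" % (target, label))
```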

Real-Life Examples

Here is a look at the function-call relationship graph of the Herpsnet bot:

By looking at the graph we can say that the function sub_4070E0 is the most valuable function in the binary because it calls the following important functions:

sub_4044B0
sub_406FC0

Inside the function sub_4044B0 we can see calls to other interesting functions:

sub_404880. Judging by the API calls made from this function, it appears to enumerate running processes. It calls the function sub_404550, which seems to check user privileges.
sub_404350. This function is used to inject code into other processes. It also calls the function sub_4041E0, which adjusts the privileges of the process, probably to “se_debug” because its caller uses injection-related APIs.

Inside the function sub_406FC0 we can also see one interesting call, to the function sub_403034. In our manual analysis we found that this function is used for string decryption.

If we look back into sub_4070E0, we can see that it also calls APIs such as CreateThread and CreateMutexA. So overall the graph gives us some pretty good information about this malware, and saves us a significant amount of analysis time.
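
The reasoning above, that a function matters because of what it reaches, can be expressed as a simple ranking. The sketch below uses only the edges mentioned in the text (the real graph contains more), writes the function names in IDA's uppercase hex form, and ranks functions by how many nodes are reachable from them. The helper reachable and the choice of metric are our assumptions for illustration.

```python
def reachable(graph, start):
    """Return every node that can be reached by following calls from start."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for callee in graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return seen


# Edges transcribed from the description of the Herpsnet graph above.
graph = {
    "sub_4070E0": ["sub_4044B0", "sub_406FC0", "CreateThread", "CreateMutexA"],
    "sub_4044B0": ["sub_404880", "sub_404350"],
    "sub_404880": ["sub_404550"],
    "sub_404350": ["sub_4041E0"],
    "sub_406FC0": ["sub_403034"],
}

# Most far-reaching functions first; ties are broken by name.
ranking = sorted(graph, key=lambda f: (-len(reachable(graph, f)), f))
for func in ranking:
    print(func, len(reachable(graph, func)))
```

With these edges, sub_4070E0 comes first because every other node in the excerpt is reachable from it.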

Now let’s look at the function-call relationship graph of the Alureon bot:

In this graph we can see that sub_402450, sub_4026E0, sub_402665, sub_40507A, and sub_402E52 are the most interesting functions.

Conclusion

Our technique for speeding up reverse engineering and malware analysis has the same problems as any similar tool, but we have found it useful in our analysis. Nonetheless, we would be happy with a more robust program. The tool is currently a proof of concept, and its pydbg hooking is not very stable. Moreover, it is easily detected by debugger-detection techniques.

If you have any other ideas, please share them with us.

Acknowledgements:

I would like to thank my colleagues Neeraj Thakar and Vikas Taneja for their valuable input.

References:

http://www.labri.fr/perso/fleury/courses/PdP/SoftwareVisualization/bohnet_Softvis06.pdf

http://www.futureinternet.fi/publications/seminar_2011/Kostakis_Lundstrom.pdf