Companies in our field have long wished for a better way of discovering and describing malware capabilities than the approaches currently available. Such a system would greatly benefit everyone who has to deal with malware and the damage it can cause. A whole spectrum of techniques is used today to discover the functionality of malware, ranging from the most basic to the more advanced, but most fall short because they do not describe the malware completely.
Many rely on manual decomposition and analysis. Others run samples in physical or virtual machine (VM) environments, then record the changes made to the system and report them as side effects of the malware. Each method has its own benefits and drawbacks. Manual analysis is slow, cumbersome, and prone to human error. Automated side-effect collation is faster and requires little or no human input, but its results are often incomplete and short on useful information. In short, automated malware analysis is a difficult problem to crack, and the reasons why it does not deliver as much as we would like boil down to a number of factors:
- Conditional code: Many malware samples contain code that only activates when certain conditions are met, for example when a user clicks a button or visits a particular website. The malware could also be waiting for commands from its command-and-control server. If these conditions are never met, the malware could lie idle.
- Virtual machine detection: Malware writers are aware that security vendors often use virtual machines to perform runtime analysis of malware. To prevent this, they have implemented checks for the presence of virtual machines; if one is detected, the malware simply terminates itself and gives nothing away about its behavior. One threat that goes against the grain is the W32.Crisis worm (Security Response). Not only does this malware run fine in a virtual machine environment, it actively searches for VM files and attempts to infect them too.
- Time: Some malware needs time to pass before it starts performing its mischief. When running tens of thousands of samples through automated analysis systems daily, time is not on our side, so each sample gets only a small slice of time under automated analysis before the data is collected and the slate is wiped clean for the next sample. This means that any functionality that does not happen right away will be missed by automated analysis systems. For example, if a Trojan’s functionality is to delete all .doc files on all drives, but it has only had time to delete some files on the C drive, then that is all the automated system would report.
- Context: This is often missing from reports generated by automated analysis systems. For example, if a sample starts modifying files on a drive, many automated systems report them simply as modified files and give no indication of what kind of changes were made. The change could have been a viral code infection, the contents of the file could have been wiped or encrypted, or some textual content could have been injected into the file, but most automated systems do not say which. Another example of missing contextual information involves network connections. When malware is reported to visit a particular URL, there is no indication of what is at the end of that URL. Is it there to download more malware, to retrieve instructions, to upload stolen information, or just to report an infection? These things are important to know but, for automated systems, extremely difficult to establish.
- Modularization: Most modern malware no longer exists as a single discrete piece of code. It is often modularized, with separate files containing different bits of functionality. These modules are often not packaged with the installer. Instead, the installer is spread first and, once it is invoked, it tries to download the other bits and pieces. Sometimes the downloading of additional components is directly controlled by a remote actor who takes a look around the compromised computer before deciding what else to download. In and of itself, the installer does not do very much, so little is reported under automated analysis. Similarly, the individual modules are code libraries that need to be invoked by other code, so they cannot be readily executed to find out their functionality.
All these factors combine to limit the capability of automated malware analysis.
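To make the context problem a little more concrete, here is a minimal sketch of how an analysis system might try to say something about the kind of change made to a file, rather than just flagging it as modified. It is an illustration only, not a description of any real product: the function names, the category labels, and the entropy thresholds are all assumptions chosen for the example, and a high-entropy result can just as easily mean compression as encryption.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def describe_change(before: bytes, after: bytes) -> str:
    """Give a rough, heuristic label for how a file's contents changed.

    The labels and thresholds below are hypothetical and only illustrative.
    """
    if before == after:
        return "unchanged"
    # Empty or all-zero contents suggest the file was wiped.
    if len(after) == 0 or set(after) == {0}:
        return "wiped"
    # Original bytes intact with extra bytes on the end.
    if after.startswith(before):
        return "content appended"
    # Near-random bytes replacing structured data hint at encryption
    # (or compression; entropy alone cannot tell the two apart).
    if shannon_entropy(after) > 7.5 and shannon_entropy(before) < 7.0:
        return "possibly encrypted"
    return "modified (unclassified)"


if __name__ == "__main__":
    original = b"quarterly report " * 100
    print(describe_change(original, b""))
    print(describe_change(original, original + b"<script>"))
    print(describe_change(original, bytes(range(256)) * 16))
```

Even a crude classification like this needs a copy of the file from before the sample ran, and it still says nothing about intent, which is part of why context is so hard for automated systems to supply.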
However, researcher Joshua Saxe is to present an open-sourced, crowd-trained machine learning tool at Black Hat today that can be used to identify the capabilities of malware files. It claims to be able to generate lists of malware capabilities, such as the ability to use particular network protocols or to steal data. Interestingly, it will give probability scores for the detected capabilities when appropriate, which could mean that when the tool is uncertain about a capability, the score tells the reader how likely or unlikely that capability is. The project to create this tool is funded by the DARPA Cyber Fast Track initiative, and the algorithms used will be detailed in today’s presentation. It should make for very interesting viewing indeed.
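As a rough idea of how a reader might consume per-capability probability scores, the sketch below turns a set of scores into a short ranked report. To be clear, this is not the tool's actual output format or algorithm, neither of which is described here: the capability names, the scores, the `summarize_capabilities` function, and the 0.8 and 0.4 cut-offs are all invented for illustration.

```python
def summarize_capabilities(scores, likely=0.8, possible=0.4):
    """Turn {capability: probability} into ranked, human-readable lines.

    The thresholds are hypothetical; a real system would need to
    calibrate them against known samples.
    """
    lines = []
    # Highest-probability capabilities first.
    for name, p in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range for {name!r}: {p}")
        if p >= likely:
            label = "likely"
        elif p >= possible:
            label = "possible"
        else:
            label = "unlikely"
        lines.append(f"{name}: {p:.2f} ({label})")
    return lines


if __name__ == "__main__":
    # Made-up scores for a made-up sample.
    example = {"uses HTTP": 0.93, "steals data": 0.55, "logs keystrokes": 0.10}
    for line in summarize_capabilities(example):
        print(line)
```

The appeal of scores over a flat list is that a capability the analysis is unsure about is still reported, just with a lower confidence, instead of being silently dropped.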
Incidentally, Symantec Security Response has a number of automated systems that analyze and collate malware sample information and capabilities. These systems can perform static and runtime analysis of malware samples and record their side effects. This information is combined with other Symantec data and telemetry sources and then supplied to our customers through our exclusive malware reporting services, helping them prevent or recover from malware attacks.