Vulnerability Discovery in Open Source Libraries Part 1: Tools of the Trade

Executive Summary

Open source has become the foundation for modern software development. Vendors use open source software to stay competitive and improve the speed, quality, and cost of the development process. At the same time, it is critical to maintain and audit open source libraries used in products as they can expose a significant volume of risk.

The responsibility of auditing code for potential security risks lies with the organization using it. We have seen one of the highest impact vulnerabilities originate with open source software in the past. The famous Equifax data breach was due to a vulnerability in open source component Apache Struts, widely used in mainstream web frameworks. Furthermore, the 2020 Open Source Security Risk and Analysis report states that out of the applications audited in 2019, 99% of codebases contained open source components and 75% of codebases contained vulnerabilities, with 49% of codebases containing high risk vulnerabilities.

Graphics libraries have a rich history of vulnerabilities and the volume of exploitable issues are especially magnified when the code base is relatively older and has not been recompiled recently. It turns out that graphics libraries on Linux are widely used in many applications but are not sufficiently audited and tested for security issues. This eventually became a driving force for us to test multiple vector graphics and GDI libraries on Linux, one of which was libEMF, a Linux C++ library written for a similar purpose and used in multiple graphics tools that support graphics conversion into other vector formats. We tested this library for several days and found multiple vulnerabilities, ranging from multiple denial-of-service issues, integer overflow, out-of-bounds memory access, use-after-free conditions, and uninitialized memory use.

All the vulnerabilities were locally exploitable. We reported them to the code’s maintainer, leading to two new versions of the library being released in a matter of weeks. This reflects McAfee’s commitment to protecting its customers from upcoming security threats, including defending them against those found in open source software. Through collaboration with McAfee researchers, all issues in this library were fixed in a timely manner.

In this blog we will emphasize why it is critical to audit the third-party code we often use in our products and outline general practices for security researchers to test it for security issues.

Introduction

Fuzzing is an extremely popular technique used by security researchers to discover potential zero-day vulnerabilities in open, as well as closed source software. Accepted as a fundamental process in product testing, it is employed by many organizations to discover vulnerabilities earlier in the product development lifecycle. At the same time, it is substantially overlooked. A well designed fuzzer typically comprises of a set of tools to make the fuzzing process relatively more efficient and fast enough to discover exploitable bugs in a short period, helping developers patch them early.

Several of the fuzzers available today help researchers guide the fuzzing process by measuring code coverage, by using static or dynamic code instrumentation techniques. This eventually results in more efficient and relevant inputs to the target software, exercising more code paths, leading to more vulnerabilities discovered in the target. Modern fuzzing frameworks also come with feedback-driven channels for maximizing the code coverage of the target software, by learning the input format along the way and comparing the code coverage of the input via feedback mechanisms, resulting in more efficient mutated inputs. Some of the state-of-the-art fuzzing frameworks available are American Fuzzy Lop (AFL), LibFuzzer and HongFuzz.

Fuzzers like AFL on Linux come with compiler wrappers (afl-gcc, afl-clang, etc.). With the assembly parsing module afl-as, AFL parses the generated assembly code to add compile-time instrumentation, helping in visualizing the code coverage. Additionally, modern compilers come with sanitizer modules like Address Sanitizers (ASAN), Memory Sanitizers (MSAN), Leak Sanitizers (LSAN), Thread Sanitizers (TSAN), etc., which can further increase the fuzzer’s bug finding abilities. Below highlights the variety of memory corruption bugs that can be discovered by sanitizers when used with fuzzers.

ASAN MSAN UBSAN TSAN LSAN
Use After Free Vulnerabilities Uninitialized Memory Reads Null Pointer Dereferences Race Conditions Run Time Memory Leak Detection
Heap Buffer Overflow   Signed Integer Overflows  
Stack Buffer Overflow Typecast Overflows
Initialization Order Bugs Divide by Zero Errors
Memory Leaks
Out of Bounds Access

 

One of the McAfee Vulnerability Research Team goals is to fuzz multiple open and closed source libraries and report vulnerabilities to the vendors before they are exploited. Over the next few sections of this blog, we aim to highlight the vulnerabilities we discovered and reported while researching one open source library, LibEMF (ECMA-234 Metafile Library).

Using American Fuzzy Lop (AFL)

Much of the technical detail and working of this state-of-the-art feedback-driven fuzzer is available in its documentation. While AFL has many use cases, its most common is to fuzz programs written in C / C++ since they are susceptible to widely exploited memory corruption bugs, and that is where AFL and its mutation strategies are extremely useful. AFL gave rise to several forks like AFLSmart , AFLFast and Python AFL, differing in their mutation strategies and extensions to increase performance. Eventually, AFL was also imported to the Windows platform, WinAFL, using a dynamic instrumentation approach predominantly for closed source binary fuzzing.

The fuzzing process primarily comprises the following tasks:

Fuzzing libEMF (ECMA-234 Metafile Library) with AFL

LibEMF (Enhanced Metafile Library) is an EMF parsing library written in C/C++ and provides a drawing toolkit based on ECMA-234. The purpose of this library is to create vector graphic files. Documentation of this library is available here and is maintained by the developer.

We chose to fuzz this LibEMF with AFL fuzzer because of its compile time instrumentation capabilities and good mutation strategies as mentioned earlier. We have the source code compiled in hardened mode, which will add code hardening options while invoking the downstream compiler, which helps with discovering memory corruption bugs.

Compiling the Source

To use the code instrumentation capabilities of AFL, we must compile the source code with the AFL compiler wrapper afl-gcc/afl-g++ and, with an additional address sanitizer flag enabled, use the following command:

./configure CXX=afl-g++ CFLAGS=”-fsanitize=address -ggdb” CXXFLAGS=”-fsanitize=address -ggdb” LDFLAGS=”-fsanitize=address”

Below is a snapshot of the compilation process showing how the instrumentation is added to the code:

Pwntools python package comes with a good utility script, checksec, that can examine the binary security properties. Executing checksec over the library confirms the code is now ASAN instrumented. This will allow us to discover non-crashing memory access bugs as well:

Test harness is a program that will use the APIs from the library to parse the file given to the program as the command line argument. AFL will use this harness to pass its mutated files as an argument to this program, resulting in several executions per second. While writing the harness, it is extremely important to release the resources before returning to avoid excessive usage which can eventually crash the system. Our harness for parsing EMF files using APIs from the libEMF library is shown here:

AFL will also track the code coverage with every input that it passes to the program and, if the mutations result in new program states, it will add the test case to the queue. We compiled our test harness using the following command:

afl-g++ -o playemffile playemffile.c -g -O2 -D_FORTIFY_SOURCE=0 -fsanitize=address -I /usr/local/include/libEMF/ -L/usr/local/lib/libEMF -lEMF

Collecting Test Cases

While a fuzzer can learn and generate the input format even from an empty seed file, gathering the intial corpus of input files is a significant step in an effective fuzzing process and can save huge amounts of CPU cycles. Depending upon the popularity of the file format, crawling the web and downloading the initial set of input files is one of the most intuitive approaches. In this case, it is not a bad idea to manually construct the input files with a variety of EMF record structures, using vector graphic file generation libraries or Windows GDI APIs. Pyemf is one such available library with Python bindings which can be used to generate EMF files. Below shows example code of generating an EMF file with an EMR_EXTEXTOUTW record using Windows APIs. Constructing these files with the different EMR records will ensure functionally different input files, exercising different record handlers in the code.

Running the Fuzzer

Running the fuzzer is just running the afl-fuzz command with the parameters as shown below. We would need to provide the input corpus of EMF files ( -i EMFs/ ) , output directory ( -o output/ ) and the path to the harness binary with @@, meaning the fuzzer will pass the file as an argument to the binary. We also need to use -m none since the ASAN instrumented binary needs a huge amount of memory.

afl-fuzz -m none -i EMFs/ -o output/ — /home/targets/libemf-1.0.11/tests/playemffile @@

However, we can make multiple tweaks to the running AFL instance to increase the number of executions per second. AFL provides a persistent mode which is in-memory fuzzing. This avoids forking a new process on every run, resulting in increased speed. We can also run multiple AFL instances, one on every core, to increase the speed. Beyond this, AFL also provides a file size minimization tool that can be used to minimize the test case size. We applied some of these optimization tricks and, as we can see below, there is a dramatic increase in the execution speed reaching ~500 executions per second.

After about 3 days of fuzzing this library, we had more than 200 unique crashes, and when we triaged them we noticed 5 unique crashes. We reported these crashes to the developer of the library along with MITRE, and after being acknowledged, CVE-2020-11863, CVE-2020-11864, CVE-2020-11865, CVE-2020-11866 and CVE-2020-13999 were assigned to these vulnerabilities. Below we discuss our findings for some of these vulnerabilities.

CVE-2020-11865 – Out of Bounds Memory Access While Enumerating the EMF Stock Objects

While triaging one of the crashes produced by the fuzzer, we saw SIGSEGV (memory access violation) for one of the EMF files given as an input. When the binary is compiled with the debugging symbols enabled, ASAN uses LLVM Symbolizer to produce the symbolized stack traces. As shown below, ASAN outputs the stack trace which helps in digging into this crash further.

Looking at the crash point in the disassembly clearly indicates the out of bounds memory access in GLOBALOBJECTS::find function.

Further analyzing this crash, it turned out that the vulnerability was in accessing of the global object vector which had pointers to stock objects. Stock objects are primarily logical graphics objects that can be used in graphics operations. Each of the stock objects used to perform graphical operations have their higher order bit set, as shown below from the MS documentation. During the metafile processing, the index of the relevant stock object can be determined by masking the higher order bit and then using that index to access the pointer to the stock object. Metafile processing code tries to retrieve the pointer from the global object vector by attempting to access the index after masking the higher order bit, as seen just above the crash point instruction, but does not check the size of the global object vector before accessing the index, leading to out of bounds vector access while processing a crafted EMF file.

Shown below is the vulnerable and fixed code where the vector size check was added:

CVE-2020-13999 – Signed Integer Overflow While Processing EMR_SCALEVIEWPORTEX Record

Another crash in the code that we triaged turned out to be a signed integer overflow condition while processing an EMR_SCALEVIEWPORTEXT record in the metafile. This record specifies the viewport in the current device context and is calculated by computing ratios. An EMR_SCALEVIEWPORTEXTEX record looks like this, as per the record specification. A new viewport is calculated as shown below:

As part of AFL’s binary mutation strategy, it applies a deterministic approach where certain hardcoded sets of integers replace the existing data. Some of these are MAX_INT, MAX_INT-1, MIN_INT, etc., which increases the likelihood of triggering edge conditions while the application processes binary data. One such mutation done by AFL in the EMF record structure is shown below:

This resulted in the following crash while performing the division operation.

Below we see how this condition, eventually leading to a denial-of-service, was fixed by adding division overflow checks in the code:

CVE-2020-11864 – Memory Leaks While Processing Multiple Metafile Records

Leak Sanitizer (LSAN) is yet another important tool which is integrated with the ASAN and can be used to detect runtime memory leaks. LSAN can also be used in standalone mode without ASAN. While triaging generated crashes, we noticed several memory leaks while processing multiple EMF record structures. One of them is as shown below while processing the EXTTEXTOUTA metafile record, which was later fixed in the code by releasing the memory buffer when there are exceptions reading the corrupted metafiles.

Apparently, memory leaks can lead to excessive resource usage in the system when the memory is not freed after it is no longer needed. This eventually leads to the denial-of-service. We found memory leak issues while libEMF processed several such metafile records. The same nature of fix, releasing the memory buffer, was applied to all the vulnerable processing code:

Additionally, we also reported multiple use-after-free conditions and denial-of-service issues which were eventually fixed in the newer version of the library released here.

Conclusion

Fuzzing is an important process and fundamental to testing the quality of a software product. The process becomes critical, especially when using third-party libraries in a product which may come with exploitable vulnerabilities. Auditing them for security issues is crucial. We believe the vulnerabilities that we reported are just the tip of the iceberg. There are several legacy libraries which likely require a thorough audit. Our research continues with several other similar Windows and Linux libraries and we will continue to report vulnerabilities through our coordinated disclosure process. We believe this also highlights that it is critical to maintain a good level of collaboration between vulnerability researchers and the open source community to have these issues reported and fixed in a timely fashion. Additionally, modern compilers come with multiple code instrumentation tools which can help detect a variety of memory corruption bugs when used early in the development cycle. Using these tools is recommended when auditing code for security vulnerabilities.

The post Vulnerability Discovery in Open Source Libraries Part 1: Tools of the Trade appeared first on McAfee Blogs.