Co-authored with Jesse Chick, OSU Senior and Former McAfee Intern, Primary Researcher.
Special thanks to Dr. Catherine Huang, McAfee Advanced Analytics Team
Special thanks to Kyle Baldes, Former McAfee Intern
“Face” the Facts
There are 7.6 Billion people in the world. That’s a huge number! In fact, if we all stood shoulder to shoulder on the equator, the number of people in the world would wrap around the earth over 86 times! That’s just the number of living people today; even adding in the history of all people for all time, you will never find two identical human faces. In fact, in even some of the most similar faces recorded (not including twins) it is quite easy to discern multiple differences. This seems almost impossible; it’s just a face, right? Two eyes, a nose, a mouth, ears, eyebrows and potentially other facial hair. Surely, we would have run into identical unrelated humans by this point. Turns out, there’s SO much more to the human face than this which can be more subtle than we often consider; forehead size, shape of the jaw, position of the ears, structure of the nose, and thousands more extremely minute details.
You may be questioning the significance of this detail as it relates to McAfee, or vulnerability research. Today, we’ll explore some work undertaken by McAfee Advanced Threat Research (ATR) in the context of data science and security; specifically, we looked at facial recognition systems and whether they were more, or less susceptible to error than we as human beings.
Look carefully at the four images below; can you spot which of these is fake and which are real?
The answer may surprise you; all four images are completely fake – they are 100% computer-generated, and not just parts of different people creativally superimposed. An expert system known as StyleGAN generated each of these, and millions more, with varying degrees of photorealism, from scratch.
This impressive technology is equal parts revolutions in data science and emerging technology that can compute faster and cheaper at a scale we’ve never seen before. It is enabling impressive innovations in data science and image generation or recognition, and can be done in real time or near real time. Some of the most practical applications for this are in the field of facial recognition; simply put, the ability for a computer system to determine whether two images or other media represent the same person or not. The earliest computer facial recognition technology dates back to the 1960s, but until recently, has either been cost-ineffective, false positive or false negative prone, or too slow and inefficient for the intended purpose.
The advancements in technology and breakthroughs in Artificial Intelligence and Machine Learning have enabled several novel applications for facial recognition. First and foremost, it can be used as a highly reliable authentication mechanism; an outstanding example of this is the iPhone. Beginning with the iPhone X in 2017, facial recognition was the new de facto standard for authenticating a user to their mobile device. While Apple uses advanced features such as depth to map the target face, many other mobile devices have implemented more standard methods based on the features of the target face itself; things we as humans see as well, including placement of eyes, width of the nose, and other features that in combination can accurately identify a single user. More simplistic and standard methods such as these may inherently suffer from security limitations relative to more advanced capabilities, such as the 3D camera capture. In a way, this is the whole point; the added complexity of depth information is what makes pixel-manipulation attacks impossible.
Another emerging use case for facial recognition systems is for law enforcement. In 2019, the Metropolitan Police in London announced the rollout of a network of cameras designed to aid police in automating the identification of criminals or missing persons. While widely controversial, the UK is not alone in this initiative; other major cities have piloted or implemented variants of facial recognition with or without the general population’s consent. In China, many of the trains and bus systems leverage facial recognition to identify and authenticate passengers as they board or unboard. Shopping centers and schools across the country are increasingly deploying similar technology.
More recently, in light of racial profiling and racial bias demonstrated repeatedly in facial recognition AI, IBM announced that it would eliminate its facial recognition programs given the way it could be used in law enforcement. Since then, many other major players in the facial recognition business have suspended or eliminated their facial recognition programs. This may be at least partially based on a high profile “false positive” case in which authorities errantly based an arrest of an individual on an incorrect facial recognition match of a black man named Robert Williams. The case is known as the country’s first wrongful arrest directly resulting from facial recognition technology.
Facial recognition has some obvious benefits of course, and this recent article details the use of facial recognition technology in China to track down and reunite a family many years after an abduction. Despite this, it remains a highly polarizing issue with significant privacy concerns, and may require significant further development to reduce some of the inherent flaws.
Live Facial Recognition for Passport Validation
Our next use case for facial recognition may hit closer to home than you realize. Multiple airports, including many in the United States, have deployed facial recognition systems to aid or replace human interaction for passport and identity verification. In fact, I was able to experience one of these myself in the Atlanta airport in 2019. It was far from ready, but travellers can expect to see continued rollouts of this across the country. In fact, based on the global impact COVID-19 has had on travel and sanitization, we are observing an unprecedented rush to implement touchless solutions such as biometrics. This is of course being done from a responsibility standpoint, but also from an airlines and airport profitability perspective. If these two entities can’t convince travelers that their travel experience is low-risk, many voluntary travelers will opt to wait until this assurance is more solid. This article expands on the impact Coronavirus is having on the fledgling market use of passport facial recognition, providing specific insight into Delta and United Airlines’ rapid expansion of the tech into new airports immediately, and further testing and integration in many countries around the world. While this push may result in less physical contact and fewer infections, it may also have the side-effect of exponentially increasing the attack surface of a new target.
The concept of passport control via facial recognition is quite simple. A camera takes a live video and/or photos of your face, and a verification service compares it to an already-existing photo of you, collected earlier. This could be from a passport or a number of other sources such as the Department of Homeland Security database. The “live” photo is most likely processed into a similar format (image size, type of image) as the target photo, and compared. If it matches, the passport holder is authenticated. If not, an alternate source will be checked by a human operator, including boarding passes and forms of ID.
As vulnerability researchers, we need to be able to look at how things work; both the intended method of operation as well as any oversights. As we reflected on this growing technology and the extremely critical decisions it enabled, we considered whether flaws in the underlying system could be leveraged to bypass the target facial recognition systems. More specifically, we wanted to know if we could create “adversarial images” in a passport-style format, that would be incorrectly classified as a targeted individual. (As an aside, we performed related attacks in both digital and physical mediums against image recognition systems, including research we released on the MobilEye camera deployed in certain Tesla vehicles.)
The conceptual attack scenario here is simple. We’ll refer to our attacker as Subject A, and he is on the “no-fly” list – if a live photo or video of him matches a stored passport image, he’ll immediately be refused boarding and flagged, likely for arrest. We’ll assume he’s never submitted a passport photo. Subject A (AKA Jesse), is working together with Subject B (AKA Steve), the accomplice, who is helping him to bypass this system. Jesse is an expert in model hacking and generates a fake image of Steve through a system he builds (much more on this to come). The image has to look like Steve when it’s submitted to the government, but needs to verify Jesse as the same person as the adversarial fake “Steve” in the passport photo. As long as a passport photo system classifies a live photo of Jesse as the target fake image, he’ll be able to bypass the facial recognition.
If this sounds far-fetched to you, it doesn’t to the German government. Recent policy in Germany included verbiage to explicitly disallow morphed or computer-generated combined photos. While the techniques discussed in this link are closely related to this, the approach, techniques and artifacts created in our work vary widely. For example, the concepts of face morphing in general are not novel ideas anymore; yet in our research, we use a more advanced, deep learning-based morphing approach, which is categorically different from the more primitive “weighted averaging” face morphing approach.
Over the course of 6 months, McAfee ATR researcher and intern Jesse Chick studied state-of-the-art machine learning algorithms, read and adopted industry papers, and worked closely with McAfee’s Advanced Analytics team to develop a novel approach to defeating facial recognition systems. To date, the research has progressed through white box and gray box attacks with high levels of success – we hope to inspire or collaborate with other researchers on black box attacks and demonstrate these findings against real world targets such as passport verification systems with the hopes of improving them.
The Method to the Madness
The term GAN is an increasingly-recognized acronym in the data science field. It stands for Generative Adversarial Network and represents a novel concept using one or more “generators” working in tandem with one or more “discriminators.” While this isn’t a data science paper and I won’t go into great detail on GAN, it will be beneficial to understand the concept at a high level. You can think of GAN as a combination of an art critic and an art forger. An art critic must be capable of determining whether a piece of art is real or forged, and of what quality the art is. The forger of course, is simply trying to create fake art that looks as much like the original as possible, to fool the critic. Over time, the forger may outwit the critic, and at other times the opposite may hold true, yet ultimately, over the long run, they will force each other to improve and adapt their methods. In this scenario, the forger is the “generator” and the art critic is the “discriminator.” This concept is analogous to GAN in that the generator and discriminator are both working together and also opposing each other – as the generator creates an image of a face, for example, the discriminator determines whether the image generated actually looks like a face, or if it looks like something else. It rejects the output if it is not satisfied, and the process starts over. This is repeated in the training phase for as long of a time as it takes for the discriminator to be convinced that the generator’s product is high enough quality to “meet the bar.”
One such implementation we saw earlier, StyleGAN, uses these exact properties to generate the photorealistic faces shown above. In fact, the research team tested StyleGAN, but determined it was not aligned with the task we set out to achieve: generating photorealistic faces, but also being able to easily implement an additional step in face verification. More specifically, its sophisticated and niche architecture would have been highly difficult to harness successfully for our purpose of clever face-morphing. For this reason, we opted to go with a relatively new but powerful GAN framework known as CycleGAN.
CycleGAN is a GAN framework that was released in a paper in 2017. It represents a GAN methodology that uses two generators and two discriminators, and in its most basic sense, is responsible for translating one image to another through the use of GAN.
Image of zebras translated to horses via CycleGAN
There are some subtle but powerful details related to the CycleGAN infrastructure. We won’t go into depth on these, but one important concept is that CycleGAN uses higher level features to translate between images. Instead of taking random “noise” or “pixels” in the way StyleGAN translates into images, this model uses more significant features of the image for translation (shape of head, eye placement, body size, etc…). This works very well for human faces, despite the paper not specifically calling out human facial translation as a strength.
Face Net and InceptionResnetV1
While CycleGAN is an novel use of the GAN model, in and of itself it has been used for image to image translation numerous times. Our facial recognition application facilitated the need for an extension of this single model, with an image verification system. This is where FaceNet came into play. The team realized that not only would our model need to accurately create adversarial images that were photorealistic, it would also need to be verified as the original subject. More on this shortly. FaceNet is a face recognition architecture that was developed by Google in 2015, and was and perhaps still is considered state of the art in its ability to accurately classify faces. It uses a concept called facial embeddings to determine mathematical distances between two faces in a dimension. For the programmers or math experts, 512 dimensional space is used, to be precise, and each embedding is a 512 dimensional list or vector. To the lay person, the less similar the high level facial features are, the further apart the facial embeddings are. Conversely, the more similar the facial features, the closer together these faces are plotted. This concept is ideal for our use of facial recognition, given FaceNet operates against high level features of the face versus individual pixels, for example. This is a central concept and a key differentiator between our research and “shallow”adversarial image creation a la more traditionally used FGSM, JSMA, etc. Creating an attack that operates at the level of human-understandable features is where this research breaks new ground.
One of the top reasons for FaceNet’s popularity is that is uses a pre-trained model with a data set trained on hundreds of millions of facial images. This training was performed using a well-known academic/industry-standard dataset, and these results are readily available for comparison. Furthermore, it achieved very high published accuracy (99.63%) when used on a set of 13,000 random face images from a benchmark set of data known as LFW (Labeled Faces in the Wild). In our own in-house evaluation testing, our accuracy results were closer to 95%.
Ultimately, given our need to start with a white box to understand the architecture, the solution we chose was a combination of CycleGAN and an open source FaceNet variant architecture known as InceptionResnet version 1. The ResNet family of deep neural networks uses learned filters, known as convolutions, to extract high-level information from visual data. In other words, the role of deep learning in face recognition is to transform an abstract feature from the image domain, i.e. a subject’s identity, into a domain of vectors (AKA embeddings) such that they can be reasoned about mathematically. The “distance” between the outputs of two images depicting the same subject should be mapped to a similar region in the output space, and two very different regions for input depicting different subjects. It should be noted that the success or failure of our attack is contingent on its ability to manipulate the distance between these face embeddings. To be clear, FaceNet is the pipeline consisting of data pre-processing, Inception ResNet V1, and data separation via a learned distance threshold.
Whoever has the most data wins. This truism is especially relevant in the context of machine learning. We knew we would need a large enough data set to accurately train the attack generation model, but we guessed that it would be smaller than many other use cases. This is because given our goal was simply to take two people, subject A (Jesse) and subject B (Steve) below and minimize the “distance” between the two face embeddings produced when inputted into FaceNet, while preserving a misclassification in either direction. In other words, Jesse needed to look like Jesse in his passport photo, and yet be classified as Steve, and vice versa. We’ll describe facial embeddings and visualizations in detail shortly.
The training was done on a set of 1500 images of each of us, captured from live video as stills. We provided multiple expressions and facial gestures that would enrich the training data and accurately represent someone attempting to take a valid passport photo.
The research team then integrated the CycleGAN + FaceNet architecture and began to train the model.
As you can see from the images below, the initial output from the generator is very rough – these certainly look like human beings (sort of), but they’re not easily identifiable and of course have extremely obvious perturbations, otherwise known as “artifacts.”
However, as we progress through training over dozens of cycles, or epochs, a few things are becoming more visually apparent. The faces begin to clean up some of the abnormalities while simultaneously blending features of both subject A and subject B. The (somewhat frightening) results look something like this:
Progressing even further in the training epochs, and the discriminator is starting to become more satisfied with the generator’s output. Yes, we’ve got some detail to clean up, but the image is starting to look much more like subject B.
A couple hundred training epochs in, and we are producing candidates that would meet the bar for this application; they would pass as valid passport photos.
Fake image of Subject B
Remember, that with each iteration through this training process, the results are systematically fed into the facial recognition neural network and classified as Subject A or Subject B. This is essential as any photo that doesn’t “properly misclassify” as the other, doesn’t meet one of the primary objectives and must be rejected. It is also a novel approach as there are very few research projects which combine a GAN and an additional neural network in a cohesive and iterative approach like this.
We can see visually above that the faces being generated at this point are becoming real enough to convince human beings that they are not computer-generated. At the same time, let’s look behind the curtain and see some facial embedding visualizations which may help clarify how this is actually working.
To further understand facial embeddings, we can use the following images to visualize the concept. First, we have the images used for both training and generation of images. In other words, it contains real images from our data set and fake (adversarial) generated images as shown below:
Model Images (Training – Real_A & Real_B) – Generated (Fake_B & Fake_A)
This set of images is just one epoch of the model in action – given the highly realistic fake images generated here, it is not surprisingly a later epoch in the model evaluation.
To view these images as mathematical embeddings, we can use a visualization representing them on a multidimensional plane, which can be rotated to show the distance between them. It’s much easier to see that this model represents a cluster of “Real A” and “Fake B” on one side, and a separate cluster of “Real B” and “Fake A” on the other. This is the ideal attack scenario as it clearly shows how the model will confuse the fake image of the accomplice with the real image of the attacker, our ultimate test.
White Box and Gray Box Application
With much of machine learning, the model must be both effectively trained as well as able to reproduce and replicate results in future applications. For example, consider a food image classifier; its job being to correctly identify and label the type of food it sees in an image. It must have a massive training set so that it recognizes that a French Fry is different than a crab leg, but it also must be able to reproduce that classification on images of food it’s never seen before with a very high accuracy. Our model is somewhat different in that it is trained specifically on two people only (the adversary and the accomplice), and its job is done ahead of time during training. In other words, once we’ve generated a photorealistic image of the attacker that is classified as the accomplice, the model’s job is done. One important caveat is that it must work reliably to both correctly identify people and differentiate people, much like facial recognition would operate in the real world.
The theory behind this is based on the concept of transferability; if the models and features chosen in the development phase (called white box, with full access to the code and knowledge of the internal state of the model and pre-trained parameters) are similar enough to the real-world model and features (black box, no access to code or classifier) an attack will reliably transfer – even if the underlying model architecture is vastly different. This is truly an incredible concept for many people, as it seems like an attacker would need to understand every feature, every line of code, every input and output, to predict how a model will classify “adversarial input.” After all, that’s how classical software security works for the most part. By either directly reading or reverse engineering a piece of code, an attacker can figure out the precise input to trigger a bug. With model hacking (often called adversarial machine learning), we can develop attacks in a lab and transfer them to black box systems. This work, however, will take us through white box and gray box attacks, with possible future work focusing on black box attacks against facial recognition.
As mentioned earlier, a white box attack is one that is developed with full access to the underlying model – either because the researcher developed the model, or they are using an open source architecture. In our case, we did both to identify the ideal combination discussed above, integrating CycleGAN with various open source facial recognition models. The real Google FaceNet is proprietary, but it has been effectively reproduced by researchers as open source frameworks that achieve very similar results, hence our use of Inception Resnet v1. We call these versions of the model “gray box” because they are somewhere in the middle of white box and black box.
To take the concepts above from theory to the real world, we need to implement a physical system that emulates a passport scanner. Without access to the actual target system, we’ll simply use an RGB camera, such as the external one you might see on desktops in a home or office. The underlying camera is likely quite similar to the technology used by a passport photo camera. There’s some guesswork needed to determine what the passport camera is doing, so we take some educated liberties. The first thing to do is programmatically capture every individual frame from the live video and save them in memory for the duration of their use. After that, we apply some image transformations, scaling them to a smaller size and appropriate resolution of a passport-style photo. Finally, we pass each frame to the underlying pretrained model we built and ask it to determine whether the face it is analyzing is Subject A (the attacker), or Subject B (the accomplice). The model has been trained on enough images and variations of both that even changes in posture, position, hair style and more will still cause a misclassification. It’s worth noting that in this attack method, the attacker and accomplice are working together and would likely attempt to look as similar as possible to the original images in the data set the model is trained, as it would increase the overall misclassification confidence.
The following demo videos demonstrate this attack using our gray box model. Let’s introduce the 3 players in these videos. In all three, Steve is the attacker now, Sam is our random test person, and Jesse is our accomplice. The first will show the positive test.
This uses a real, non-generated image on the right side of the screen of Steve (now acting as our attacker). Our random test person (Sam), first stands in front of the live “passport verification camera” and is compared against the real image of Steve. They should of course be classified as different. Now Steve stands in front of the camera and the model correctly identifies him against his picture, taken from the original and unaltered data set. This proves the system can correctly identify Steve as himself.
Next is the negative test, where the system tests Sam against a real photo of Jesse. He is correctly classified as different, as expected. Then Steve stands in front of the system and confirms the negative test as well, showing that the model correctly differentiates people in non-adversarial conditions.
Finally, in the third video, Sam is evaluated against an adversarial, or fake image of Jesse, generated by our model. Since Sam was not part of the CycleGAN training set designed to cause misclassification, he is correctly shown as different again. Lastly, our attacker Steve stands in front of the live camera and is correctly misclassified as Jesse (now the accomplice). Because the model was trained for either Jesse or Steve to be the adversarial image, in this case we chose Jesse as the fake/adversarial image.
If a passport-scanner were to replace a human being completely in this scenario, it would believe it had just correctly validated that the attacker was the same person stored in the passport database as the accomplice. Given the accomplice is not on a no-fly list and does not have any other restrictions, the attacker can bypass this essential verification step and board the plane. It’s worth noting that a human being would likely spot the difference between the accomplice and attacker, but this research is based off of the inherent risks associated with reliance on AI and ML alone, without providing defense-in-depth or external validation, such as a human being to validate.
Positive Test Video – Confirming Ability to Recognize a Person as Himself
Negative Test Video – Confirming Ability to Tell People Apart
Adversarial Test Video – Confirming Ability to Misclassify with Adversarial Image
What Have we Learned?
Biometrics are an increasingly relied-upon technology to authenticate or verify individuals and are effectively replacing password and other potentially unreliable authentication methods in many cases. However, the reliance on automated systems and machine learning without considering the inherent security flaws present in the mysterious internal mechanics of face-recognition models could provide cyber criminals unique capabilities to bypass critical systems such as automated passport enforcement. To our knowledge, our approach to this research represents the first-of-its-kind application of model hacking and facial recognition. By leveraging the power of data science and security research, we look to work closely with vendors and implementors of these critical systems to design security from the ground up, closing the gaps that weaken these systems. As a call to action, we look to the community for a standard by which can reason formally about the reliability of machine learning systems in the presence of adversarial samples. Such standards exist in many verticals of computer security, including cryptography, protocols, wireless radio frequency and many more. If we are going to continue to hand off critical tasks like authentication to a black box, we had better have a framework for determining acceptable bounds for its resiliency and performance under adverse conditions.