Arvind Narayanan Isn’t Anonymous, and Neither Are You

Arvind Narayanan’s business card is an exercise in brevity. It contains no data except his name and the words “Google me,” a fitting calling card for an academic who specializes in privacy and anonymity research. When you do Google him, his online footprint is robust but highly selective and pruned. There’s a website for his post-doctoral research at Stanford University, where he’s currently based; an online journal of semi-personal musings (like the time he fell asleep jet-lagged and awoke with complete amnesia about not just who he was, but what he was – animal, vegetable, mineral?); a Google Scholar page indicating his work has been cited 849 times; and news articles about high-profile projects he’s worked on. There are also various social networking accounts (Facebook, Google+) that paint a picture of a precise and scientifically calculating, but whimsical, personality – one whose music tastes run the gamut from Queen to Qawwali (Sufi devotional music), and who prefers mind-bending films like Memento and Inception to mind-numbing superhero flicks.

What you won’t find online about Narayanan are party snapshots of him caught in a drunken stupor or inadvisable tweets later deleted on second thought. There’s little about him on the web that he doesn’t specifically want there, and he’s careful to use browser tools to control the digital trail his online activities leave behind. But as a data scientist, Narayanan knows there’s a lot he can’t control — his own work shows that often the steps he and others take to protect themselves online can be easily undone.

Narayanan isn’t much known outside the insular world of data privacy, but he’s likely to be a name that you’ll be seeing more and more, particularly as he’ll be heading to Princeton University next year to join the well-regarded Center for Information Technology Policy, led by computer scientist Ed Felten. In the age of Big Data, where bulk supplies of information about your browsing and other online activities are bought and sold instantaneously in marketplaces each day, and where Target can know your teenage daughter is pregnant before you do, Narayanan is one of the leading hands-on thinkers in exploring how traditional notions of privacy are radically fractured by the collision of big data and cheap analytics. Take, for example, his now-famous Netflix study.

In 2006, Narayanan and a colleague dug into “anonymized” Netflix customer information and showed how little data it took to unmask an anonymized person’s identity. Netflix, as part of a public contest to devise a better movie-recommendation algorithm, released a data set of 100 million movie ratings made by 480,000 of its customers. The online DVD provider anonymized the data before releasing it to contestants, replacing names with random unique identifying numbers to protect the privacy of its customers. But Narayanan and Vitaly Shmatikov were able to unmask some Netflix users simply by taking the anonymized movie ratings – along with timestamps showing when customers submitted them – and comparing them against non-anonymized movie ratings posted at the Internet Movie Database web site. “Even before we looked at the data, we knew right away that this issue was going to exist,” Narayanan says. The research led to a privacy lawsuit against Netflix and a 2010 settlement that scuttled the company’s plans for a second contest that would have involved using even more customer data. Since that study and research paper, he and colleagues have produced four other major studies proving similar points in different contexts.
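
The matching step described above can be sketched in a few lines of Python. This is a toy illustration only, not the scoring method from the Narayanan and Shmatikov paper: the record layout, the sample data, the tolerances (`rating_tol`, `days_tol`) and the function names are all assumptions made for the example.

```python
from datetime import date

# Hypothetical "anonymized" release: random ID -> {title: (rating, date rated)}.
anonymized = {
    1001: {
        "Memento": (5, date(2005, 3, 1)),
        "Fight Club": (4, date(2005, 3, 9)),
        "Brazil": (5, date(2005, 4, 2)),
    },
    1002: {
        "Memento": (2, date(2005, 6, 1)),
        "Titanic": (5, date(2005, 6, 3)),
    },
}

# Hypothetical public ratings posted under a real name on another site.
public_profile = {
    "Memento": (5, date(2005, 3, 2)),
    "Brazil": (4, date(2005, 4, 4)),
}


def match_score(public, record, rating_tol=1, days_tol=14):
    """Count public ratings that roughly agree with a record on both
    the rating value and the date it was submitted."""
    score = 0
    for title, (rating, when) in public.items():
        if title not in record:
            continue
        rec_rating, rec_when = record[title]
        close_rating = abs(rec_rating - rating) <= rating_tol
        close_date = abs((rec_when - when).days) <= days_tol
        if close_rating and close_date:
            score += 1
    return score


def best_match(public, dataset, min_gap=1):
    """Return the ID of the best-scoring record, but only if it beats
    the runner-up by at least min_gap; otherwise return None."""
    scores = sorted(
        ((match_score(public, rec), rid) for rid, rec in dataset.items()),
        reverse=True,
    )
    if not scores:
        return None
    top_score, top_id = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0
    if top_score > 0 and top_score - runner_up >= min_gap:
        return top_id
    return None


candidate = best_match(public_profile, anonymized)
```

Requiring the top candidate to stand clear of the runner-up is one simple way to avoid declaring a match when several records look equally plausible.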


Earlier this year, he and colleagues at Stanford and the University of California at Berkeley published a study about an algorithm designed to unmask “anonymous” internet authors simply by analyzing their word choice and writing styles and comparing these against online content written by writers who published under named bylines. Prior research had looked at making the same connections among a few hundred people, but Narayanan’s study scaled that up to make matches among some 100,000 authors. Now he’s working on a potential new landmark study on DNA and de-anonymization, but he’s reluctant to discuss it before it’s peer-reviewed.
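
A rough sense of how writing style can be compared is given by the sketch below. It is not the algorithm from the study, which is not described here in detail; it assumes a much cruder, classic stylometric trick of comparing how often each text uses a handful of common function words. The word list, sample texts and function names are hypothetical.

```python
import math
from collections import Counter

# Assumed feature set: a few common English function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "but", "is", "it"]


def style_vector(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_authors(anonymous_text, known_texts):
    """Return known author names, most stylistically similar first."""
    target = style_vector(anonymous_text)
    scored = [
        (cosine(target, style_vector(text)), name)
        for name, text in known_texts.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


known = {
    "alice": "the cat and the dog and the bird",
    "bob": "it is that it is but it is",
}
ranking = rank_authors("the fox and the hen", known)
```

At the scale of 100,000 authors, a ranking like this would at best shorten the list of suspects, which fits Narayanan's description of such matching as a first step rather than a final identification.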

“This project is less about what’s happening here and now but kind of about what the world is going to look like probably ten years from now,” he notes. Pressed for more details, he dances carefully around the question. “[With DNA] it’s just a completely new domain of data with new characteristics,” he says. “The data here is very unique, and the connections between people are unique. And the particular de-anonymization threat that I’m considering . . . is very different from any of my past projects. . . . In terms of named versus anonymous samples, think about just pieces of hair at a train station. Is that named, or anonymous? That’s kind of what I mean by considering a different threat model.” In general, he says by way of elaborating, his work has not been about looking at how to distinguish between 1 out of 5 people with very high accuracy but, rather, about looking at distinguishing between 100,000 people, possibly with much less accuracy. “At least it could serve as a first step for an adversary, or some party, to further narrow down the list of possibilities, and then use some other technique to identify the individual,” he says.

For a guy so focused on privacy and anonymity, Narayanan has a strange hobby that at first glance might appear to focus on violating privacy. It involves photographing other people’s license plates. He says he has a collection of about 500 of them and snaps the pics in parking lots only when no one is around or in the car. “There are so many interesting vanity plates, especially in Palo Alto,” he says, referring to the wealthy town where Stanford University resides. His interest in plates began when he was a shy child and “sort of socially maladjusted.” The hustle and bustle of the world gave him cognitive overload, he says, and the letters and numbers on license plates helped him focus by looking for patterns in them. His interest in plates continues to this day, though it sometimes makes for awkward conversation with women he’s dating when he has to explain that he knows their plates out of a habit of memorizing them, not because he’s interested in stalking them. He realized how bizarre his habit appeared to others when he was driving with a friend one day and noted the plate on the car in front of them. “Dude!” he told his friend. “That car has every letter and number either equal to the ones on your plate, or one letter off.” His friend turned to him and they looked silently at one another for several seconds as the words were absorbed. “And the expression on his face – there’s only one way to describe it, which was WTF?” Narayanan recalls. Then they both burst out laughing. “In Silicon Valley . . . you might know a person really well, but everyone has this hidden weirdness; I guess he was thinking, ‘So that’s yours,’” Narayanan says. That focus on data and patterns might seem weird to others, but it has served him well in his research.

Raw Materials

Abstract
We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.

Narayanan chooses his projects based on what he feels he can bring to the research, and whether or not it will produce something valuable to the public or to policy makers. “But the more important criteria for me is that it’s technologically novel, and it’s something that people could not have realized before, without actually doing the work,” he says. “In almost every one of the data anonymization projects that I’ve done, there were at least some people who looked at that before we did it and said, ‘Huh, I don’t think that’s possible.’ So that’s really what gets me going.”

The Netflix study began with a simple question: What would happen to customer data when companies anonymized it in good faith but then passed it to third parties who might combine it with additional data? Could the mere marriage of datasets undo the anonymity of customers? It’s a problem that isn’t limited to Netflix or even the online world. Many companies involved in the collection of customer data or online behavioral tracking insist that it’s okay to collect and share data about customers as long as the data is anonymous at the time it’s collected or is anonymized afterward. But Narayanan thinks that’s naive at best and disingenuous at worst.

Narayanan walking near his office on the Stanford University campus in Northern California.

7 Favorite Movies

1. Adaptation – The deeply self-referential elements in this movie left me in awe.

2. Memento – No film makes you viscerally appreciate the fact that we are our memories better than this one.

3. Eternal Sunshine of the Spotless Mind – I think it’s possible that in a few decades, technology will force us to think about the questions this movie raises.

4. Inception – I take the possibility reasonably seriously that we have cognitive abilities in our dreams that we don’t when awake, so this movie appealed to me at … ahem … more than one level.

5. Usual Suspects – In terms of twist endings and unreliable narration, this one is unsurpassed.

6. Fight Club – Another unreliable-narrator film. What’s not to like? A twist ending, artistic violence, deep themes and phenomenal acting.

7. The Prestige – This movie shook me. The idea of being devoted to one’s art more than one’s life is fascinating and terrifying at the same time.

He recently worked as a consultant on the Heritage Health Prize, a contest awarding $3 million to whoever devises the best way to predict future health outcomes from past hospital visits. In this case, the data passed to contestants was far more sensitive than the data used in the Netflix contest because it included information about who was admitted to hospitals and their diagnoses. Narayanan worked on the project in collaboration with Kaggle, a startup that matches data crunchers with companies seeking insights, to identify the potential “re-identification threat” that might exist if the data were released and combined with other data that’s already publicly available. His work involved looking for other data sets that might be combined with the Heritage one to unmask the identity of patients.

Doing threat analysis is a tricky business, Narayanan acknowledges, because even if you think you’ve identified all existing datasets that might cause a problem for anonymity now, new ones could be released in the future that create additional problems later on. This is why he says that releasing datasets publicly is so risky. “Once you make data available publicly, it’s not going to go away. It’s going to be there forever,” he says. “So if you want to be absolutely sure that nothing bad will ever happen, I don’t really think there is hope for the model where you make it available to the public.”

He’s working on alternatives that would help protect anonymity. He proposes an approach that would avoid handing datasets off to third parties altogether, and instead bring third-party analysts to the dataset. The company that collects the data would retain it and host a query-based computing model, so that analysts would bring their code to the database, instead of the data going to them. It’s not an option that would suit everyone, because it would likely require providing support services to help people who run into technical problems using their code with the dataset. But Narayanan says it wouldn’t result in much more work than already exists with data contests.

“This type of competition is already a huge amount of effort internally, both from the engineering side, and from the legal side, in making sure everything’s okay with respect to privacy and confidentiality,” he says. “So if you look at what companies already have to do, I think it’s certainly going to be additional effort, but … it’s not an order of magnitude more effort…. It has the benefit of being a technically very clean solution. It allows you to apply a lot of privacy protection technologies.”
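
In code, the idea of bringing the analysis to the data might look something like the sketch below. Everything in it is hypothetical: the `HostedDataset` class, its method names and the `min_group_size` threshold are assumptions for illustration, and suppressing small groups is just one example of a protection a host could layer on. A real service would also need to sandbox analyst code, which this sketch does not attempt, and small-group suppression by itself does not guarantee privacy.

```python
class HostedDataset:
    """Holds raw records privately. Analysts submit small functions and
    receive only aggregate answers, never the rows themselves."""

    def __init__(self, records, min_group_size=5):
        self._records = list(records)
        # Assumed policy: refuse to answer about very small groups.
        self._min_group_size = min_group_size

    def count_where(self, predicate):
        """Count records matching an analyst-supplied predicate.
        Returns None when the group is non-empty but too small."""
        n = sum(1 for record in self._records if predicate(record))
        if 0 < n < self._min_group_size:
            return None
        return n

    def mean_of(self, field, predicate=lambda record: True):
        """Average a numeric field over matching records.
        Returns None when too few records match."""
        values = [r[field] for r in self._records if predicate(r)]
        if len(values) < self._min_group_size:
            return None
        return sum(values) / len(values)


# Hypothetical hospital-stay records kept on the host's side.
host = HostedDataset(
    {"age": age, "days": days}
    for age, days in [(30, 1), (40, 2), (50, 3), (60, 4), (70, 5), (80, 6)]
)

over_40 = host.count_where(lambda r: r["age"] >= 40)
over_70 = host.count_where(lambda r: r["age"] >= 70)
average_stay = host.mean_of("days")
```

Because every question passes through the host, the company can add or tighten protections later without recalling data that has already been copied elsewhere.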

But that’s not the only solution. For those who are opposed to hosting data, there are still ways to make the distribution of datasets safer. Narayanan advocates a two-stage process for running data-mining competitions. This would involve making only a limited subset of the data available for the first round of contestants, or distributing a synthetic set of data – data that has the same characteristics as the real data but is actually fabricated. This would give contestants relevant data around which to develop their code and algorithms. Then once the first round of the competition was narrowed down to a set of finalists, this smaller group could sign an NDA to obtain the real data or the larger dataset. This way, the data wouldn’t be released publicly to whoever wanted to grab it.
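
One crude way to fabricate a first-round dataset is sketched below. It is an assumption-laden illustration, not a method attributed to Narayanan: each column is resampled independently from the real values, so the fake rows keep each column's overall distribution but lose the links between columns. The `synthesize` function and the sample rows are hypothetical, and because rare real values can still appear in the output, a production approach would need more care.

```python
import random


def synthesize(real_rows, n, seed=0):
    """Build n fabricated rows by drawing every column independently
    from the values observed in that column of the real data."""
    rng = random.Random(seed)  # seeded so the output is repeatable
    columns = list(real_rows[0].keys())
    pools = {col: [row[col] for row in real_rows] for col in columns}
    return [{col: rng.choice(pools[col]) for col in columns} for _ in range(n)]


# Hypothetical real records that would stay with the contest organizer.
real = [
    {"age_band": "30-39", "diagnosis": "asthma", "days": 2},
    {"age_band": "40-49", "diagnosis": "fracture", "days": 5},
    {"age_band": "60-69", "diagnosis": "asthma", "days": 1},
]

fake = synthesize(real, n=10)
```

Contestants could use rows like these to get their code running against the right column names and value types, with the real data reserved for finalists under an NDA.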

More recently, he’s been looking at online tracking issues and ways to limit what companies collect on consumers in the first place. He collaborated with colleagues at Stanford and New York University to devise a solution that would provide the benefit of personalized ads for users, without allowing a third party to collect data about them. The solution, called Adnostic, involves a browser tool that builds a detailed profile of the user without ever letting that profile leave the browser. The browser itself would decide which personalized ads to serve up to the user.
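
The division of labor described above can be illustrated with a short sketch. It is not Adnostic's actual code or protocol: the `LocalProfile` class, the category labels and the ad identifiers are hypothetical, and the example assumes the ad network sends a batch of labeled candidate ads from which the browser picks one locally.

```python
from collections import Counter


class LocalProfile:
    """Interest profile that lives only on the user's machine."""

    def __init__(self):
        self._interests = Counter()

    def observe(self, page_categories):
        """Record the categories of a page the user just visited."""
        self._interests.update(page_categories)

    def choose_ad(self, candidate_ads):
        """Pick from (ad_id, category) pairs the ad whose category the
        user has seen most often; the first candidate wins ties."""
        best = max(candidate_ads, key=lambda ad: self._interests[ad[1]])
        return best[0]


profile = LocalProfile()
profile.observe(["travel", "film"])
profile.observe(["film"])

# The network sends the same candidates to everyone; the choice is local.
candidates = [("ad-cars", "cars"), ("ad-film", "film"), ("ad-travel", "travel")]
shown = profile.choose_ad(candidates)
```

In this arrangement the network learns nothing about the profile from the request itself, since the selection happens after the candidates arrive.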

Not everyone, of course, is as concerned about privacy and anonymity as Narayanan is. There are mixed studies on how often people opt for privacy choices and tools when they’re available. But Narayanan isn’t bothered by the fact that some people are happy to give up their data to companies and governments. If anonymity research and tools benefit only the people who need them the most — whistleblowers and activists in oppressive regimes, for example — that’s all that matters.

“If the average person doesn’t care that much, but we’re developing tools that are very useful and potentially could make the difference between success and failure — or life and death — for this much smaller demographic,” he says, “then I think we’ve still succeeded as privacy researchers, and we’re having an impact.”

Photography by Michelle Le

