How to Authenticate Large Datasets

Hacked and leaked datasets are more common than ever. Here are some ways to verify they’re real.

Unlike at any other point in history, hackers, whistleblowers, and archivists now routinely make off with terabytes of data from governments, corporations, and extremist groups. These datasets often contain gold mines of revelations in the public interest and in many cases are freely available for anyone to download.

Revelations based on leaked datasets can change the course of history. In 1971, Daniel Ellsberg’s leak of military documents known as the Pentagon Papers helped lead to the end of the Vietnam War. The same year, an underground activist group called the Citizens’ Commission to Investigate the FBI broke into a Federal Bureau of Investigation field office, stole secret documents, and leaked them to the media. This dataset mentioned COINTELPRO, and NBC reporter Carl Stern later used Freedom of Information Act requests to publicly reveal that COINTELPRO was a secret FBI operation devoted to surveilling, infiltrating, and discrediting left-wing political groups. This stolen FBI dataset also led to the creation of the Church Committee, a Senate committee that investigated these abuses and reined them in.

Huge data leaks like these used to be rare, but today they’re increasingly common. More recently, Chelsea Manning’s 2010 leaks of Iraq and Afghanistan documents helped spark the Arab Spring, documents and emails stolen by Russian military hackers helped elect Donald Trump as U.S. president in 2016, and the Panama Papers and Paradise Papers exposed how the rich and powerful use offshore shell companies for tax evasion.

Yet these digital tomes can prove extremely difficult to analyze or interpret, and few people today have the skills to do so. I spent the last two years writing the book “Hacks, Leaks, and Revelations: The Art of Analyzing Hacked and Leaked Data” to teach journalists, researchers, and activists the technologies and coding skills required to do just this. While these topics are technical, my book doesn’t assume any prior knowledge: all you need is a computer, an internet connection, and the will to learn. For practice, you’ll download and analyze real datasets, including those from police departments, fascist groups, militias, a Russian ransomware gang, and social networks. Throughout, you’ll engage head-on with the dumpster fire that is 21st-century current events: the rise of neofascism and the rejection of objective reality, the extreme partisan divide, and an internet overflowing with misinformation.

My book officially comes out January 9, but it’s shipping today if you order it from the publisher here. Add the code INTERCEPT25 for a special 25 percent discount.

The following is a lightly edited excerpt from the first chapter of “Hacks, Leaks, and Revelations” about a crucial and often underappreciated part of working with leaked data: how to verify that it’s authentic.

You can’t believe everything you read on the internet, and juicy documents or datasets that anonymous people send you are no exception. Disinformation is prevalent.

How you go about verifying that a dataset is authentic completely depends on what the data is. You have to approach the problem on a case-by-case basis. The best way to verify a dataset is to use open source intelligence (OSINT), or publicly available information that anyone with enough skill can find. 

This might mean scouring social media accounts, consulting the Internet Archive’s Wayback Machine, inspecting metadata of public images or documents, paying services for historical domain name registration data, or viewing other types of public records. If your dataset includes a database taken from a website, for instance, you might be able to compare information in that database with publicly available information on the website itself to confirm that they match. (Michael Bazzell also has great resources on the tools and techniques of OSINT.)

Below, I share two examples of authenticating data from my own experience: one about a dataset from the anti-vaccine group America’s Frontline Doctors, and another about leaked chat logs from a WikiLeaks Twitter group. 

In my work at The Intercept, I encounter datasets so frequently I feel like I’m drowning in data, and I simply ignore most of them because it’s impossible for me to investigate them all. Unfortunately, this often means that no one will report on them, and their secrets will remain hidden forever. I hope “Hacks, Leaks, and Revelations” helps to change that. 

The America’s Frontline Doctors Dataset

In late 2021, in the midst of the Covid-19 pandemic, an anonymous hacker sent me hundreds of thousands of patient and prescription records from telehealth companies working with America’s Frontline Doctors (AFLDS). AFLDS is a far-right anti-vaccine group that misleads people about Covid-19 vaccine safety and tricks patients into paying millions of dollars for drugs like ivermectin and hydroxychloroquine, which are ineffective at preventing or treating the virus. The group was initially formed to help Donald Trump’s 2020 reelection campaign, and the group’s leader, Simone Gold, was arrested for storming the U.S. Capitol on January 6, 2021. In 2022, she served two months in prison for her role in the attack.

My source told me that they got the data by writing a program that made thousands of web requests to a website run by one of the telehealth companies, Cadence Health. Each request returned data about a different patient. To see whether that was true, I made an account on the Cadence Health website myself. Everything looked legitimate to me. The information I had about each of the 255,000 patients was the exact information I was asked to provide when I created my account on the service, and various category names and IDs in the dataset matched what I could see on the website. But how could I be confident that the patient data itself was real, that these people weren’t just made up?

I wrote a simple Python script to loop through the 72,000 patients (those who had paid for fake health care) and put each of their email addresses in a text file. I then cross-referenced these email addresses with a totally separate dataset containing personally identifying information from members of Gab, a social network popular among fascists, anti-democracy activists, and anti-vaxxers. In early 2021, a hacktivist who went by the name “JaXpArO and My Little Anonymous Revival Project” had hacked Gab and made off with 65GB of data, including about 38,000 Gab users’ email addresses. Thinking there might be overlap between AFLDS and Gab users, I wrote another simple Python program that compared the email addresses from each group and showed me all of the addresses that were in both lists. There were several.
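A comparison like this can be done with a set intersection. The following is a minimal sketch, not the script used in the investigation: the function names and sample addresses are hypothetical, and it assumes each list holds one email address per line.

```python
def normalize(email):
    """Lowercase and trim an address so trivially different spellings match."""
    return email.strip().lower()


def load_emails(lines):
    """Build a set of normalized addresses from an iterable of lines, skipping blanks."""
    return {normalize(line) for line in lines if line.strip()}


def emails_in_both(first, second):
    """Return a sorted list of the addresses that appear in both inputs."""
    return sorted(load_emails(first) & load_emails(second))


# Hypothetical sample data standing in for the two email lists.
patients = ["Alice@example.com\n", "bob@example.com\n", "\n"]
gab_users = ["alice@example.com", "carol@example.com"]

print(emails_in_both(patients, gab_users))  # ['alice@example.com']
```

Because an open text file yields its lines when iterated, the same function can be given two file objects instead of lists. Normalizing case before comparing avoids missing a match just because one dataset stored the address with different capitalization.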

Armed with this information, I started scouring the public Gab timelines of users whose email addresses had appeared in both datasets, looking for posts about AFLDS. Using this technique, I found multiple AFLDS patients who posted about their experience on Gab, leading me to believe that the data was authentic. For example, according to consultation notes from the hacked dataset, one patient created an account on the telehealth site and four days later had a telehealth consultation. About a month after that, they posted to Gab saying, “Front line doctors finally came through with HCQ/Zinc delivery” (HCQ is an abbreviation for hydroxychloroquine).

Having a number of examples like this gave us confidence that the dataset of patient records was, in fact, legitimate. You can read our AFLDS reporting, which led to a congressional investigation into the group, at The Intercept.

The WikiLeaks Twitter Group Chat

In late 2017, journalist Julia Ioffe published a revelation in The Atlantic: WikiLeaks had slid into Donald Trump Jr.’s Twitter DMs. Among other things, before the 2016 election, WikiLeaks suggested to Trump Jr. that even if his father lost the election, he shouldn’t concede. “Hi Don,” the verified @wikileaks Twitter account wrote, “if your father ‘loses’ we think it is much more interesting if he DOES NOT conceed [sic] and spends time CHALLENGING the media and other types of rigging that occurred—as he has implied that he might do.”

A long-term WikiLeaks volunteer who went by the pseudonym Hazelpress started a private Twitter group with WikiLeaks and its biggest supporters in mid-2015. After watching the group become more right-wing, conspiratorial, and unethical, and specifically after learning about WikiLeaks’ secret DMs with Trump Jr., Hazelpress decided to blow the whistle on the whistleblowing group itself. She has since publicly come forward as Mary-Emma Holly, an artist who spent years as a volunteer legal researcher for WikiLeaks.

To carry out the WikiLeaks leak, Holly logged in to her Twitter account, made it private, unfollowed everyone, and deleted all of her tweets. She also deleted all of her DMs except for the private WikiLeaks Twitter group and changed her Twitter username. Using the Firefox web browser, she then went to the DM conversation — which contained 11,000 messages and had been going on for two-and-a-half years — and saw the latest messages in the group. She scrolled up, waited for Twitter to load more messages, scrolled up again, and kept doing this for four hours until she reached the very first message in the group. She then used Firefox’s Save Page As function to save an HTML version of the webpage, as well as a folder full of resources like images that were posted in the group.

Now that she had a local, offline copy of all the messages in the DM group, Holly leaked it to the media. In early 2018, she sent a Signal message to the phone number listed on The Intercept’s tips page. At that time, I happened to be the one checking Signal for incoming tips. Using OnionShare, software that I developed for this purpose, she sent me an encrypted and compressed file, along with the password to decrypt it. After extracting it, I found a 37MB HTML file and a folder with 82MB of resources. The HTML file was so big that it made my web browser unresponsive when I tried opening it, so I later split it into separate files to make it easier to work with.

How could I verify the authenticity of such a huge HTML file? If I could somehow access the same data directly from Twitter’s servers, that would do it; only an insider at Twitter would be in a position to create fake DMs that show up on Twitter’s website, and even that would be extremely challenging. When I explained this to Holly (who, at the time, I still knew only as Hazelpress), she gave me her Twitter username and password. She had already deleted all the other information from that account. With her consent, I logged in to Twitter with her credentials, went to her DMs, and found the Twitter group in question. It immediately looked like it contained the same messages as the HTML file, and I confirmed that the verified account @wikileaks frequently posted to the group.

Following these steps made me extremely confident in the authenticity of the dataset, but I decided to take verification one step further. Could I download a separate copy of the Twitter group myself in order to compare it with the version Holly had sent me? I searched around and found DMArchiver, a Python program that could do just that. Using this program, along with Holly’s username and password, I downloaded a text version of all of the DMs in the Twitter group. It took only a few minutes to run this tool, rather than four hours of scrolling up in a web browser.

Note: After this investigation, the DMArchiver program stopped working due to changes on Twitter’s end, and today the project is abandoned. However, if you’re faced with a similar challenge in a future investigation, search for a tool that might work for you. 
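One way to compare two independently obtained copies of a conversation is to check that every message from one copy also appears in the text of the other. The sketch below is an illustration under stated assumptions, not the method used in this investigation: the function names and sample data are hypothetical, and it assumes each message's text appears contiguously in the saved page once the HTML tags are stripped.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collapse(text):
    """Replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text).strip()


def html_to_text(html):
    """Return the page's text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return collapse(" ".join(parser.chunks))


def missing_messages(messages, html):
    """Return the messages whose text cannot be found in the HTML page."""
    page_text = html_to_text(html)
    return [m for m in messages if collapse(m) not in page_text]


# Hypothetical stand-ins for the saved page and the separately downloaded messages.
page = (
    '<div class="message"><p>First message,\n  wrapped across lines.</p></div>'
    '<div class="message"><p>Second &amp; last message</p></div>'
)
messages = ["First message, wrapped across lines.", "Second & last message", "Not in the page"]

print(missing_messages(messages, page))  # ['Not in the page']
```

An empty result means every message was found in the saved page. A message that is reported missing is not proof of tampering: links, emoji, and other inline markup can split a message's text across tags, so mismatches need manual review.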

The output from DMArchiver, a 1.7MB text file, was much easier to work with than the enormous HTML file, and it also included exact time stamps. Here’s a snippet of the text version:

[2015-11-19 13:46:39] <WikiLeaks> We believe it would be much better for GOP to win.

[2015-11-19 13:47:28] <WikiLeaks> Dems+Media+liberals woudl then form a block to reign in their worst qualities.

[2015-11-19 13:48:22] <WikiLeaks> With Hillary in charge, GOP will be pushing for her worst qualities., dems+media+neoliberals will be mute.

[2015-11-19 13:50:18] <WikiLeaks> She’s a bright, well connected, sadistic sociopath.
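A log in this format is straightforward to process with a short script. The following is a minimal sketch that assumes every message sits on a single line shaped like the snippet above, so a message containing line breaks would need extra handling. The function names and the third sample line are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as: [2015-11-19 13:46:39] <WikiLeaks> message text
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] <([^>]+)> (.*)$")


def parse_line(line):
    """Return (timestamp, sender, text) for a log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, sender, text = match.groups()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), sender, text


def messages_per_sender(lines):
    """Count how many parsed messages each sender posted, skipping unmatched lines."""
    parsed = (parse_line(line) for line in lines)
    return Counter(sender for _, sender, _ in filter(None, parsed))


sample = [
    "[2015-11-19 13:46:39] <WikiLeaks> We believe it would be much better for GOP to win.",
    "[2015-11-19 13:47:28] <WikiLeaks> Dems+Media+liberals woudl then form a block to reign in their worst qualities.",
    "[2015-11-19 13:51:02] <ExampleUser> (hypothetical reply from another member)",
]

counts = messages_per_sender(sample)
print(counts["WikiLeaks"], counts["ExampleUser"])  # 2 1
```

Counting messages per sender like this is one way to see what share of a large chat log comes from a single account before reading it in full.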

I could view the HTML version in a web browser to see it exactly as it had originally looked on Twitter, which was also useful for taking screenshots to include in our final report.

A screenshot of the leaked HTML file.

Along with the talented reporter Cora Currier, I started the long process of reading all 11,000 chat messages, paying closest attention to the 10 percent of them from the @wikileaks account — which was presumably controlled by Julian Assange, WikiLeaks’s editor — and picking out everything in the public interest. We discovered the following details:

  • Assange expressed a desire for Republicans to win the 2016 presidential election.
  • Assange and his supporters were intensely focused on discrediting two Swedish women who had accused him of rape and molestation, as well as discrediting their lawyers. Assange and his defenders spent weeks discussing ways to sabotage articles about his rape case that feminist journalists were writing.
  • After Associated Press journalist Raphael Satter wrote a story about harm caused when WikiLeaks publishes personally identifiable information, Assange called him a “rat” and said that “he’s Jewish and engaged in the ((())) issue,” referring to an antisemitic neo-Nazi meme. He then told his supporters to “bog him down. Get him to show his bias.”

You can read our reporting on this dataset at The Intercept. After The Intercept published this article, Assange and his supporters also targeted me personally with antisemitic abuse, and Russia Today, the state-run TV station, ran a segment about me. 

The techniques you can use to authenticate datasets vary greatly depending on the situation. Sometimes you can rely on OSINT, sometimes you can rely on help from your source, and sometimes you’ll need to come up with an entirely different method.

Regardless, it’s important to explain in your published report, at least briefly, what makes you confident in the data. If you can’t authenticate it but still want to publish your report in case it’s real — or in case others can authenticate it — make that clear. When in doubt, err on the side of transparency.
