🕵️ Detecting Pretraining Data from Large Language Models


University of Washington · Princeton University    *Equal Contribution

We propose Min-K% Prob, a simple and effective method that can detect whether a large language model (e.g., GPT-3) was pretrained on a given piece of text, without knowing the pretraining data.

Min-K% Prob is an effective tool for benchmark example contamination detection, privacy auditing of machine unlearning, and copyrighted text detection in language models' pretraining data.

Abstract

Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions.

  • Pretraining data detection problem. In this paper, we explore the pretraining data detection problem 🕵️: given a piece of text and black-box access to an LLM, without knowing its pretraining data, can we determine whether the model was trained on the provided text?
  • Dynamic benchmark WikiMIA. To aid this study, we present a dynamic benchmark WikiMIA 📖 that uses data created both before and after model training to support gold truth detection.
  • Detection method Min-K% Prob. We also design a new detection method, Min-K% Prob. It is built on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, whereas a seen example is less likely to contain words with such low probabilities. Min-K% Prob operates without any knowledge of the pretraining corpus or any additional training, which distinguishes it from previous detection methods that require training a reference model on data similar to the pretraining data. Moreover, our experiments show that Min-K% Prob achieves a 7.4% improvement on WikiMIA over these previous methods.
  • Real-life use cases. We employ Min-K% Prob in three real-life contexts: benchmark example contamination detection, privacy auditing of machine unlearning, and copyrighted text detection in language models' pretraining data.

Detection Method Min-K% Prob

    What is Min-K% Prob?
    We propose a pretraining data detection method named Min-K% Prob. Our method is based on a simple hypothesis: an unseen example tends to contain a few outlier words with low probabilities, whereas a seen example is less likely to contain words with such low probabilities. Min-K% Prob scores a text by the average log-likelihood of its k% lowest-probability (outlier) tokens.

    How to use Min-K% Prob?
    To check whether a text was included in an LLM's pretraining data:

    1. Evaluate token probabilities in the text.
    2. Pick the k% tokens with minimum probabilities.
    3. Compute their average log likelihood.
    If the average log likelihood is high, the text is likely in the pretraining data. ✅
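
    Below is a minimal sketch of this procedure using the Hugging Face transformers library. The model name and the choice of k = 20% are illustrative assumptions, not settings prescribed by the paper.

    # A minimal sketch of the Min-K% Prob score for a single text, assuming a
    # Hugging Face causal LM. The model name and k are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def min_k_prob(text: str, model, tokenizer, k: float = 0.2) -> float:
        """Average log-probability of the k% lowest-probability tokens in `text`."""
        input_ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(input_ids).logits                 # (1, seq_len, vocab)
        # Log-probability the model assigns to each actual next token.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        token_log_probs = log_probs.gather(
            1, input_ids[0, 1:].unsqueeze(-1)
        ).squeeze(-1)
        # Keep the k% tokens with the lowest probabilities and average them.
        num_keep = max(1, int(len(token_log_probs) * k))
        lowest = torch.topk(token_log_probs, num_keep, largest=False).values
        return lowest.mean().item()

    model_name = "EleutherAI/pythia-160m"   # placeholder model for illustration
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    score = min_k_prob("Example passage to score.", model, tokenizer, k=0.2)
    print(f"Min-K% Prob score: {score:.3f}")  # higher (less negative) => more likely seen

    In practice, scores are compared against a threshold (or aggregated into an AUC) over a set of candidate texts; a single absolute value is not meaningful on its own.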

    See more results in our paper

    Auditing machine unlearning with Min-K% Prob

    Machine Unlearning
    Recent work from Microsoft Research (MSR) shows how LLMs can unlearn copyrighted training data via strategic fine-tuning. They made Llama2-7B-chat unlearn the entire Harry Potter magical world and released the result as Llama2-7B-WhoIsHarryPotter for public scrutiny. But with our Min-K% Prob technique, we found that some “magical traces” still remain, and the model can still produce Harry Potter content! 🧙‍♂️🔮
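
    As a hedged illustration of such an audit (not the paper's exact protocol), one can score the same excerpt with both the original and the unlearned model using the min_k_prob sketch above and compare the results. The model identifiers and the excerpt below are assumptions for illustration.

    # Compare Min-K% Prob scores for one excerpt under the original and unlearned
    # models. Identifiers and the excerpt are illustrative assumptions; loading
    # 7B-parameter models requires substantial memory.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    excerpt = "An excerpt from the Harry Potter series would go here."  # placeholder

    scores = {}
    for name in [
        "meta-llama/Llama-2-7b-chat-hf",         # original chat model (assumed id)
        "microsoft/Llama2-7b-WhoIsHarryPotter",  # unlearned model (assumed id)
    ]:
        tok = AutoTokenizer.from_pretrained(name)
        lm = AutoModelForCausalLM.from_pretrained(name).eval()
        scores[name] = min_k_prob(excerpt, lm, tok, k=0.2)  # reuses the sketch above

    # If the unlearned model's score stays close to the original's, traces of the
    # "forgotten" text likely remain.
    print(scores)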

    [Figure: the process of unlearning Harry Potter content]

    Auditing machine unlearning with Min-K% Prob
    Although unlearned, Llama2-7B-WhoIsHarryPotter still answers questions related to Harry Potter correctly. We manually cross-checked these responses against the Harry Potter book series for verification.
    [Figure: the unlearned model's responses to Harry Potter questions]

    Detecting Copyrighted Books in LLMs with Min-K% Prob

    Top 20 copyrighted books detected by Min-K% Prob in GPT-3's (text-davinci-003) pretraining data; Min-K% Prob achieves an AUC of 0.87 on the validation data. The contamination rate is the percentage of text excerpts from each book identified as part of the pretraining data. (A sketch of how such a rate could be estimated follows the table.)
    Contamination (%) | Book Title | Author | Year
    100 | The Violin of Auschwitz | Maria Àngels Anglada | 2010
    100 | North American Stadiums | Grady Chambers | 2018
    100 | White Chappell Scarlet Tracings | Iain Sinclair | 1987
    100 | Lost and Found | Alan Dean | 2001
    100 | A Different City | Tanith Lee | 2015
    100 | Our Lady of the Forest | David Guterson | 2003
    100 | The Expelled | Mois Benarroch | 2013
    99 | Blood Cursed | Alex Archer | 2013
    99 | Genesis Code: A Thriller of the Near Future | Jamie Metzl | 2014
    99 | The Sleepwalker's Guide to Dancing | Mira Jacob | 2014
    99 | The Harlan Ellison Hornbook | Harlan Ellison | 1990
    99 | The Book of Freedom | Paul Selig | 2018
    99 | Three Strong Women | Marie NDiaye | 2009
    99 | The Leadership Mind Switch: Rethinking How We Lead in the New World of Work | D. A. Benton, Kylie Wright-Ford | 2017
    99 | Gold | Chris Cleave | 2012
    99 | The Tower | Simon Clark | 2005
    98 | Amazon | Bruce Parry | 2009
    98 | Ain't It Time We Said Goodbye: The Rolling Stones on the Road to Exile | Robert Greenfield | 2014
    98 | Page One | David Folkenflik | 2011
    98 | Road of Bones: The Siege of Kohima 1944 | Fergal Keane | 2010
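
    As a rough sketch of how a per-book contamination rate like the one above could be estimated (assuming a list of excerpts from one book and a score threshold calibrated on validation data; this is not the paper's exact pipeline):

    # Fraction of a book's excerpts whose Min-K% Prob score exceeds a calibrated
    # threshold, reusing the min_k_prob sketch above. `excerpts` and `threshold`
    # are assumed inputs.
    def contamination_rate(excerpts, model, tokenizer, threshold, k=0.2):
        flagged = sum(min_k_prob(t, model, tokenizer, k) > threshold for t in excerpts)
        return 100.0 * flagged / len(excerpts)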

    BibTeX

    
    @misc{shi2023detecting,
          title={Detecting Pretraining Data from Large Language Models},
          author={Weijia Shi and Anirudh Ajith and Mengzhou Xia and Yangsibo Huang and Daogao Liu and Terra Blevins and Danqi Chen and Luke Zettlemoyer},
          year={2023},
          eprint={2310.16789},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }