🕵️ Detecting Pretraining Data from Large Language Models


University of Washington · Princeton University    *Equal Contribution

We propose Min-K% Prob, a simple and effective method that can detect whether a large language model (e.g., GPT-3) was pretrained on a given piece of text, without knowing the pretraining data.

Min-K% Prob is an effective tool for benchmark example contamination detection, privacy auditing of machine unlearning, and copyrighted text detection in language models' pretraining data.

Abstract

Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions.

  • Pretraining data detection problem. In this paper, we explore the pretraining data detection problem 🕵️: given a piece of text and black-box access to an LLM, without knowing its pretraining data, can we determine whether the model was trained on the provided text?
  • Dynamic benchmark WikiMIA. To aid this study, we present a dynamic benchmark WikiMIA 📖 that uses data created both before and after model training to support gold truth detection.
  • Detection method Min-K% Prob. We also design a new detection method, Min-K% Prob. It is built on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, whereas a seen example is less likely to contain words with such low probabilities. Min-K% Prob operates without any knowledge of the pretraining corpus or any additional training, which distinguishes it from previous detection methods that require training a reference model on data similar to the pretraining data. Moreover, our experiments show that Min-K% Prob achieves a 7.4% improvement on WikiMIA over these previous methods.
  • Real-life use cases. We employ Min-K% Prob in three real-life contexts: benchmark example contamination detection, privacy auditing of machine unlearning, and copyrighted text detection in language models' pretraining data.

Detection Method Min-K% Prob

    What is Min-K% Prob?
    We propose a pretraining data detection method named Min-K% Prob. Our method is based on a simple hypothesis: an unseen example tends to contain a few outlier words with low probabilities, whereas a seen example is less likely to contain words with such low probabilities. Min-K% Prob scores a text by the average log-likelihood of its k% lowest-probability (outlier) tokens.

    How to use Min-K% Prob?
    To check whether a text was included in an LLM's pretraining data:

    1. Evaluate token probabilities in the text.
    2. Pick the k% tokens with minimum probabilities.
    3. Compute their average log likelihood.
    If the average log likelihood is high, the text is likely in the pretraining data. ✅
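
    Below is a minimal sketch of this procedure using the Hugging Face transformers library. The model name and the choice of k = 20% are illustrative assumptions, not settings prescribed by the paper.

    # A minimal sketch of the Min-K% Prob score for a single text, assuming a
    # Hugging Face causal LM. The model name and k are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def min_k_prob(text: str, model, tokenizer, k: float = 0.2) -> float:
        """Average log-probability of the k% lowest-probability tokens in `text`."""
        input_ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(input_ids).logits                 # (1, seq_len, vocab)
        # Log-probability the model assigns to each actual next token.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        token_log_probs = log_probs.gather(
            1, input_ids[0, 1:].unsqueeze(-1)
        ).squeeze(-1)
        # Keep the k% tokens with the lowest probabilities and average them.
        num_keep = max(1, int(len(token_log_probs) * k))
        lowest = torch.topk(token_log_probs, num_keep, largest=False).values
        return lowest.mean().item()

    model_name = "EleutherAI/pythia-160m"   # placeholder model for illustration
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    score = min_k_prob("Example passage to score.", model, tokenizer, k=0.2)
    print(f"Min-K% Prob score: {score:.3f}")  # higher (less negative) => more likely seen

    In practice, scores are compared against a threshold (or aggregated into an AUC) over a set of candidate texts; a single absolute value is not meaningful on its own.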

    See more results in our paper

    Auditing machine unlearning with Min-K% Prob

    Machine Unlearning
    Recent work from Microsoft Research (MSR) shows how LLMs can unlearn copyrighted training data via strategic fine-tuning. They made Llama2-7B-chat unlearn the entire Harry Potter magical world and released the result as Llama2-7B-WhoIsHarryPotter for public scrutiny. But with our Min-K% Prob technique, we found that some “magical traces” still remain, and the model can still produce Harry Potter content! 🧙‍♂️🔮
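
    As a hedged illustration of such an audit (not the paper's exact protocol), one can score the same excerpt with both the original and the unlearned model using the min_k_prob sketch above and compare the results. The model identifiers and the excerpt below are assumptions for illustration.

    # Compare Min-K% Prob scores for one excerpt under the original and unlearned
    # models. Identifiers and the excerpt are illustrative assumptions; loading
    # 7B-parameter models requires substantial memory.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    excerpt = "An excerpt from the Harry Potter series would go here."  # placeholder

    scores = {}
    for name in [
        "meta-llama/Llama-2-7b-chat-hf",         # original chat model (assumed id)
        "microsoft/Llama2-7b-WhoIsHarryPotter",  # unlearned model (assumed id)
    ]:
        tok = AutoTokenizer.from_pretrained(name)
        lm = AutoModelForCausalLM.from_pretrained(name).eval()
        scores[name] = min_k_prob(excerpt, lm, tok, k=0.2)  # reuses the sketch above

    # If the unlearned model's score stays close to the original's, traces of the
    # "forgotten" text likely remain.
    print(scores)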

    [Figure: the process of unlearning Harry Potter content]

    Auditing machine unlearning with Min-K% Prob
    Although unlearned, Llama2-7B-WhoIsHarryPotter still answers questions related to Harry Potter correctly. We manually cross-checked these responses against the Harry Potter book series for verification.
    [Figure: the unlearned model's responses to Harry Potter questions]

    Detecting Copyrighted Books in LLMs with Min-K% Prob

    Top 20 copyrighted books detected by Min-K% Prob in GPT-3's (text-davinci-003) pretraining data; Min-K% Prob achieves an AUC of 0.87 on the validation data. The contamination rate is the percentage of text excerpts from each book identified as part of the pretraining data. (A sketch of how such a rate could be estimated follows the table.)
    Contamination (%) | Book Title | Author | Year
    100 | The Violin of Auschwitz | Maria Àngels Anglada | 2010
    100 | North American Stadiums | Grady Chambers | 2018
    100 | White Chappell Scarlet Tracings | Iain Sinclair | 1987
    100 | Lost and Found | Alan Dean | 2001
    100 | A Different City | Tanith Lee | 2015
    100 | Our Lady of the Forest | David Guterson | 2003
    100 | The Expelled | Mois Benarroch | 2013
    99 | Blood Cursed | Alex Archer | 2013
    99 | Genesis Code: A Thriller of the Near Future | Jamie Metzl | 2014
    99 | The Sleepwalker's Guide to Dancing | Mira Jacob | 2014
    99 | The Harlan Ellison Hornbook | Harlan Ellison | 1990
    99 | The Book of Freedom | Paul Selig | 2018
    99 | Three Strong Women | Marie NDiaye | 2009
    99 | The Leadership Mind Switch: Rethinking How We Lead in the New World of Work | D. A. Benton, Kylie Wright-Ford | 2017
    99 | Gold | Chris Cleave | 2012
    99 | The Tower | Simon Clark | 2005
    98 | Amazon | Bruce Parry | 2009
    98 | Ain't It Time We Said Goodbye: The Rolling Stones on the Road to Exile | Robert Greenfield | 2014
    98 | Page One | David Folkenflik | 2011
    98 | Road of Bones: The Siege of Kohima 1944 | Fergal Keane | 2010
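
    As a rough sketch of how a per-book contamination rate like the one above could be estimated (assuming a list of excerpts from one book and a score threshold calibrated on validation data; this is not the paper's exact pipeline):

    # Fraction of a book's excerpts whose Min-K% Prob score exceeds a calibrated
    # threshold, reusing the min_k_prob sketch above. `excerpts` and `threshold`
    # are assumed inputs.
    def contamination_rate(excerpts, model, tokenizer, threshold, k=0.2):
        flagged = sum(min_k_prob(t, model, tokenizer, k) > threshold for t in excerpts)
        return 100.0 * flagged / len(excerpts)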

    BibTeX

    
    @misc{shi2023detecting,
          title={Detecting Pretraining Data from Large Language Models},
          author={Weijia Shi and Anirudh Ajith and Mengzhou Xia and Yangsibo Huang and Daogao Liu and Terra Blevins and Danqi Chen and Luke Zettlemoyer},
          year={2023},
          eprint={2310.16789},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }