Making AI models “forget” unwanted data hurts their performance | TechCrunch

So-called “unlearning” techniques are used to make a generative AI model forget specific and unwanted information it has gleaned from training data, such as sensitive personal data or copyrighted material.

But current unlearning techniques are a double-edged sword: they could make a model like OpenAI’s GPT-4o or Meta’s Llama 3.1 405B much less capable of answering basic questions.

That’s according to a new study co-authored by researchers from the University of Washington (UW), Princeton, the University of Chicago, USC, and Google, which found that today’s most popular unlearning techniques tend to degrade models, often to the point where they’re unusable.

“Our assessment suggests that currently applicable unlearning methods are not yet ready for meaningful use or implementation in real-world scenarios,” Weijia Shi, a researcher on the study and a Ph.D. candidate in computer science at the UW, told TechCrunch. “Currently, there are no effective methods that allow the model to forget specific data without significant loss of utility.”

How models learn

Generative AI models have no real intelligence. They are statistical systems that predict words, images, speech, music, video, and other data. Fed a huge number of examples (e.g., movies, voice recordings, essays), AI models learn how likely data is to occur based on patterns, including the context of any surrounding data.

Given an email ending in the snippet “Awaiting…”, for example, a model trained to autocomplete messages might suggest “…your response,” following the pattern of all the emails it has ingested. There is no intentionality there; the model isn’t anticipating anything. It’s just making an educated guess.
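To make that statistical idea concrete, here is a toy sketch (not anything from the study) that “autocompletes” by counting which words tend to follow which in a handful of invented example emails. Real models use neural networks trained on billions of tokens rather than word counts, but the principle of predicting the most likely continuation is the same.

```python
from collections import Counter, defaultdict

# Invented example "emails" for illustration only
emails = [
    "awaiting your response",
    "awaiting your reply",
    "awaiting further instructions",
]

# Count, for each word, which words follow it and how often
following = defaultdict(Counter)
for text in emails:
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def autocomplete(word, length=2):
    """Repeatedly append the most likely next word, given the counts."""
    out = [word]
    for _ in range(length):
        candidates = following[out[-1]]
        if not candidates:
            break
        out.append(candidates.most_common(1)[0][0])
    return " ".join(out)

print(autocomplete("awaiting"))  # most likely: "awaiting your response"
```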

Most models, including flagships like GPT-4o, are trained on data obtained from public websites and datasets on the web. Most vendors developing such models claim that fair use protects their practice of scraping data and using it for training without informing, compensating, or even crediting the owners of the data.

But not every copyright holder agrees. And many, from authors to publishers to record labels, have filed lawsuits against vendors to force change.

The copyright dilemma is one of the reasons unlearning techniques have been getting a lot of attention lately. Google, in partnership with several academic institutions, launched a competition last year aimed at encouraging the creation of new approaches to unlearning.

Unlearning can also provide a way to remove sensitive information from existing models, such as medical records or compromising photos, in response to a request or government order. (Thanks to the way they’re trained, the models tend to collect a lot of personal information, from phone numbers to more problematic examples.) In the past few years, some vendors have implemented tools that allow data owners to request that their data be removed from training sets. But these opt-out tools only apply to future models, not to models trained before their release; unlearning would be a much more thorough approach to data deletion.

Regardless, unlearning isn’t as easy as hitting Delete.

The art of forgetting

Unlearning techniques today rely on algorithms designed to “steer” models away from the data to be unlearned. The idea is to influence the model’s predictions so that it never, or only very rarely, outputs certain data.
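One common family of such algorithms takes a “gradient difference” approach: nudge the model’s weights to raise its loss on the text to be forgotten while keeping its loss low on text it should retain. The sketch below illustrates that general idea only; it is not one of the specific methods the researchers benchmarked, and the gpt2 model and placeholder strings are stand-ins.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the study looks at much larger models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

forget_text = "Passage the model should no longer reproduce."  # placeholder
retain_text = "General knowledge the model should keep."       # placeholder

def lm_loss(text):
    batch = tok(text, return_tensors="pt")
    # For causal LMs, passing labels=input_ids gives the next-token loss
    return model(**batch, labels=batch["input_ids"]).loss

alpha = 1.0  # how hard to push the model away from the forget set
# Ascend on the forget loss (negative sign), descend on the retain loss
loss = -alpha * lm_loss(forget_text) + lm_loss(retain_text)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```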

To see how effective these unlearning algorithms can be, Shi and her collaborators created a benchmark and selected eight different open-source algorithms to test. Called MUSE (Machine Unlearning Six-way Evaluation), the benchmark aims to probe an algorithm’s ability not only to prevent a model from spitting out training data verbatim (a phenomenon known as regurgitation), but also to eliminate the model’s knowledge of that data along with any evidence that it was originally trained on the data.

Doing well on MUSE requires making a model forget two things: books from the Harry Potter series and news articles.

For example, given an excerpt from Harry Potter and the Chamber of Secrets (“‘There’s more in the frying pan,’ said Aunt Petunia…”), MUSE tests whether an unlearned model can recite the entire sentence (“‘There’s more in the frying pan,’ said Aunt Petunia, turning her eyes to her massive son”), answer questions about the scene (e.g., “What does Aunt Petunia say to her son?”, to which the answer is “There’s more in the frying pan”), or otherwise show that it was trained on text from the book.
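In the spirit of that regurgitation test, here is a toy check (not MUSE’s actual code or metrics): prompt a model with the start of a possibly memorized passage and measure how much of the known continuation it reproduces. The gpt2 model and the crude word-overlap score are illustrative stand-ins.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = '"There\'s more in the frying pan," said Aunt Petunia,'
original_continuation = "turning her eyes to her massive son"

# Greedily continue the prompt and keep only the newly generated tokens
inputs = tok(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=12, do_sample=False)
continuation = tok.decode(output[0][inputs["input_ids"].shape[1]:])

# Crude overlap score: fraction of the original continuation's words reproduced
orig_words = original_continuation.split()
overlap = sum(w in continuation for w in orig_words) / len(orig_words)
print(f"verbatim overlap: {overlap:.0%}")
```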

MUSE also tests whether the model retains related general knowledge after unlearning (for example, that J.K. Rowling is the author of the Harry Potter series), which the researchers call the model’s overall utility. The lower the utility, the more related knowledge the model has lost, making it less able to answer questions correctly.

In their study, the researchers found that the unlearning algorithms they tested did make models forget certain information. But they also hurt the models’ general ability to answer questions, presenting a trade-off.

“Designing efficient unlearning methods for models is challenging because knowledge is intricately entangled in the model,” Shi explained. “For example, a model can be trained on copyrighted material, such as Harry Potter books, as well as on freely available content from the Harry Potter Wiki. When existing unlearning methods attempt to remove the copyrighted Harry Potter books, they significantly affect the model’s knowledge of the Harry Potter Wiki as well.”

Are there solutions to the problem? Not yet — and that underscores the need for more research, Shi said.

So far, vendors looking to unlearning as the solution to their training data problems seem to be out of luck. Perhaps a technical breakthrough will make unlearning feasible someday. But for now, vendors will have to find another way to prevent their models from saying things they shouldn’t.