
OpenAI says it’s building a tool to let content creators ‘opt out’ of AI training | TechCrunch

OpenAI says it’s developing a tool to let creators better control how their content is used to train generative AI.

The tool, called Media Manager, will allow content creators and owners to identify their works to OpenAI and specify how they want those works to be included in or excluded from AI research and training.

The goal is to have the tool in place by 2025, OpenAI says, as the company works with “creators, content owners and regulators” on a standard — perhaps through the industry governance committee it recently joined.

“This will require cutting-edge machine learning research to build a first-of-its-kind tool to help us identify copyrighted text, images, audio and video across multiple sources and reflect creator preferences,” OpenAI wrote in a blog post. “We plan to introduce additional options and features over time.”

Media Manager, whatever form it eventually takes, appears to be OpenAI’s response to growing criticism of its approach to AI development, which relies heavily on mining publicly available data from the web. Most recently, eight prominent US newspapers, including the Chicago Tribune, sued OpenAI for IP infringement related to the company’s use of generative AI, accusing OpenAI of stealing articles to train generative AI models, which it then commercialized without compensating or crediting the source publications.

Generative AI models, including OpenAI’s (the kinds of models that can analyze and generate text, images, videos and more), are trained on vast numbers of examples, typically sourced from public sites and datasets. OpenAI and other generative AI providers argue that fair use, the legal doctrine that allows the use of copyrighted works to create a secondary work as long as it’s transformative, shields their practice of scraping public data and using it for model training. But not everyone agrees.

OpenAI, in fact, recently argued that it would be impossible to create useful AI models without copyrighted material.

But in an effort to appease critics and protect itself against future lawsuits, OpenAI has taken steps to meet content creators in the middle.

Last year, OpenAI allowed artists to “opt out” of and remove their work from the datasets the company uses to train its image-generating models. The company also lets website owners indicate, via the robots.txt standard, which gives web-crawling bots instructions about a site, whether content on their site can be scraped to train AI models. And OpenAI continues to sign licensing deals with major content owners, including news organizations, stock media libraries, and question-and-answer sites like Stack Overflow.
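In practice, that robots.txt opt-out amounts to a short plain-text file served at the root of a domain. Here is a minimal sketch, assuming a hypothetical example.com; GPTBot is the user agent OpenAI has publicly documented for its web crawler:

```
# Hypothetical robots.txt served at https://example.com/robots.txt
# Block OpenAI's documented crawler, GPTBot, from the entire site:
User-agent: GPTBot
Disallow: /

# All other crawlers may continue to access everything
# (an empty Disallow directive permits all paths):
User-agent: *
Disallow:
```

Worth noting: robots.txt is a voluntary convention, not an enforcement mechanism, so whether a given crawler honors the file is up to the crawler’s operator.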

However, some content creators say OpenAI hasn’t gone far enough.

Artists have described OpenAI’s opt-out workflow for images, which requires submitting an individual copy of each image to be removed along with a description of it, as burdensome. OpenAI reportedly pays relatively little for licensed content. And as OpenAI itself acknowledged in Tuesday’s blog post, the company’s current solutions don’t address scenarios where creators’ works are quoted, remixed or republished on platforms they don’t control.

Besides OpenAI, a number of third parties are attempting to build universal provenance and opt-out tools for generative AI.

Startup Spawning AI, whose partners include Stability AI and Hugging Face, offers an app that identifies and tracks bots’ IP addresses to block scraping attempts, as well as a database where artists can register their works to disallow training by providers that choose to honor the requests. Steg.AI and Imatag help creators establish ownership of their images by applying watermarks imperceptible to the human eye. And Nightshade, a University of Chicago project, “poisons” image data to render it useless or disruptive to AI model training.
