Microsoft’s AI CEO: Your online content is ‘free’ fodder for training models

Mustafa Suleyman, CEO of Microsoft AI, said this week that machine learning companies can harvest most content posted online and use it to train neural networks because it’s essentially “free.”

Shortly thereafter, the Center for Investigative Reporting sued OpenAI and its largest investor, Microsoft, “for using the non-governmental news organization’s content without permission or offering compensation.”

The suit follows one filed in April by eight newspapers that accuse OpenAI and Microsoft of misappropriating content, and one filed by the New York Times four months before that.

Then there are the two authors who sued OpenAI and Microsoft in January, claiming the companies trained AI models on the authors’ works without permission. Additionally, in 2022, several unidentified developers sued OpenAI and GitHub, alleging that the organizations used publicly released programming code to train generative models in violation of software license terms.

Asked in an interview with CNBC’s Andrew Ross Sorkin at the Aspen Festival of Ideas whether AI companies have effectively stolen the world’s intellectual property, Suleyman acknowledged the controversy and tried to draw a distinction between content people put online and content backed by corporate copyright holders.

“I think in terms of content that’s already on the open web, the social contract for that content since the 1990s is that it’s fair use,” he said. “Anyone can copy it, recreate with it, reproduce with it. It was freeware if you will. That was the understanding.”

Suleyman allowed that there is another category of content: the stuff published by companies with lawyers.

“There’s a separate category where a website or publisher or news organization has specifically said ‘don’t scrape or crawl me for any reason other than to index me,’ so that other people can find that content,” he explained. “But that’s the gray area. And I think it’s going to make its way through the courts.”

That’s putting it mildly. While Suleyman’s remarks seem certain to offend content creators, he’s not entirely wrong: it’s not clear where the legal lines lie when it comes to AI model training and model output.

Most people posting content online as individuals will have compromised their rights in some way by accepting the Terms of Service agreements offered by major social media platforms. Reddit’s decision to license its users’ posts to OpenAI wouldn’t have happened if the social media giant thought its users had a legitimate claim to their memes and manifestos.

The fact that OpenAI and other AI model builders are inking content deals with major publishers shows that a strong brand, deep pockets, and a legal team can bring large tech operations to the negotiating table.

In other words, those who create content and publish it online are making freeware, unless they retain or can attract lawyers willing to challenge Microsoft and its ilk.

In a paper distributed via SSRN last month, Frank Pasquale, professor of law at Cornell Tech and Cornell Law School in the US, and Haochen Sun, associate professor of law at the University of Hong Kong, explore the legal uncertainty surrounding the use of copyrighted data to train AI and whether courts would find such use fair. They conclude that the matter needs to be settled at the policy level, because current law is not adequate to answer the questions now being raised.

“Given that there is considerable uncertainty about the legality of AI vendors using copyrighted works, lawmakers will need to articulate a bold new vision for balancing rights and responsibilities, just as they did after the development of the Internet (which led to the Digital Millennium Copyright Act of 1998),” they argue.

The authors suggest that the continued uncompensated collection of creative works threatens not only writers, composers, journalists, actors and other creative professionals, but also generative AI itself, which will eventually be deprived of training data. They predict that people will stop making work available online if it simply gets used to train AI models that reduce the marginal cost of creating content to zero and deprive creators of the possibility of any remuneration.

This is the future Suleyman is looking forward to. “The information economy is about to change radically because we can reduce the cost of producing knowledge to zero marginal cost,” he said.

All that freeware you may have helped create can be yours for a small monthly subscription fee. ®
