Apple says it has taken a “responsible” approach to training its Apple Intelligence models | TechCrunch

Apple has published a white paper detailing the models it has developed to power Apple Intelligence, a set of generative AI features headed to iOS, macOS and iPadOS over the next few months.

In the document, Apple rejects accusations that it took an ethically questionable approach to training some of its models, reiterating that it did not use private user data and used a combination of publicly available and licensed data for Apple Intelligence.

“[The] pre-training dataset consists of … data we have licensed from publishers, curated publicly available or open datasets, and publicly available information crawled by our web crawler, Applebot,” Apple wrote in the paper. “Given our focus on protecting user privacy, we note that Apple’s private user data is not included in the data mix.”

In July, Proof News reported that Apple used a dataset called The Pile, which contains subtitles from hundreds of thousands of YouTube videos, to train a family of models designed for on-device processing. Many YouTube creators whose subtitles were swept up in The Pile neither knew about this nor consented to it; Apple later released a statement saying that it doesn’t intend to use those models to power any AI features in its products.

The white paper, which lifts the curtain on the models Apple first unveiled at WWDC 2024 in June, called Apple Foundation Models (AFM), emphasizes that the training data for the AFM models was obtained “responsibly,” or at least responsibly by Apple’s definition.

The training data for the AFM models includes publicly available web data as well as licensed data from undisclosed publishers. According to The New York Times, Apple approached several publishers toward the end of 2023, including NBC, Condé Nast and IAC, about multi-year deals worth at least $50 million to train models on the publishers’ news archives. Apple’s AFM models were also trained on open source code hosted on GitHub, specifically Swift, Python, C, Objective-C, C++, JavaScript, Java and Go code.

Training models on code without permission, even open source code, is a point of contention among developers. Some open source codebases aren’t licensed or don’t allow AI training in their terms of use, some developers argue. But Apple says it “license-filtered” the code to try to include only repositories with minimal usage restrictions, such as those under an MIT, ISC or Apache license.
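To make “license-filtered” concrete, here’s a minimal sketch of what such a filter could look like, assuming per-repository license metadata is available. The paper doesn’t describe Apple’s actual pipeline; the field names and allowlist below are illustrative assumptions.

```python
# Hypothetical license filter for a code training corpus (illustrative only).
# Assumes each repository carries a declared SPDX-style license identifier.

ALLOWED_LICENSES = {"mit", "isc", "apache-2.0"}  # minimal-restriction licenses

def keep_repo(repo: dict) -> bool:
    """Keep a repository only if its declared license is on the allowlist."""
    license_id = (repo.get("license") or "").lower()
    return license_id in ALLOWED_LICENSES

repos = [
    {"name": "example/tool", "license": "MIT"},
    {"name": "example/lib", "license": "GPL-3.0"},  # copyleft: filtered out
    {"name": "example/app", "license": None},       # unlicensed: filtered out
]

print([r["name"] for r in repos if keep_repo(r)])  # ['example/tool']
```

A real pipeline would also have to deal with repositories that lack any machine-readable license at all; this sketch simply drops them.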

To improve the AFM models’ math skills, Apple specifically included math questions and answers from web pages, math forums, blogs, tutorials and workshops in the training set, according to the paper. The company also tapped “high-quality, publicly available” datasets (which the paper doesn’t name) with “licenses that allow use to train … models,” filtered to remove sensitive information.

In total, the training dataset for the AFM models weighs in at about 6.3 trillion tokens. (Tokens are bite-sized chunks of data that are generally easier for generative AI models to ingest.) For comparison, that’s less than half the number of tokens (15 trillion) that Meta used to train its flagship text-generating model, Llama 3.1 405B.
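For a rough sense of what a token is, here’s a toy illustration. Real models use learned subword tokenizers (byte-pair encoding and similar schemes), not a naive whitespace split; this is only meant to show how text breaks into countable pieces.

```python
# Toy tokenization for illustration only; production tokenizers split text
# into learned subword units, so real token counts differ from word counts.
text = "Apple Intelligence is headed to iOS, macOS and iPadOS."
tokens = text.split()  # naive whitespace "tokenizer"
print(len(tokens), tokens)
# 9 ['Apple', 'Intelligence', 'is', 'headed', 'to', 'iOS,', 'macOS', 'and', 'iPadOS.']
```

At that granularity, 6.3 trillion tokens against Meta’s 15 trillion works out to a ratio of roughly 0.42, which is where the “less than half” comparison comes from.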

Apple obtained additional data, including human feedback data and synthetic data, to fine-tune the AFM models and try to mitigate any undesirable behaviors, like spouting toxicity.

“Our models are designed to help users carry out everyday activities in their Apple products, grounded in Apple’s core values and rooted in our principles for responsible AI at every stage,” the company says.

There’s no smoking gun or shocking insight in the paper, and that’s by careful design. Papers like these are rarely very revealing, owing to competitive pressures but also because disclosing too much could land companies in legal trouble.

Some companies training models by mining public web data argue that their practice is protected by the fair use doctrine. But it’s a hotly debated issue and the subject of a growing number of lawsuits.

Apple notes in the paper that it allows webmasters to block its crawler from scraping their data. But that leaves individual creators in a bind. What’s an artist to do if, for example, their portfolio is hosted on a site that refuses to block Apple’s scraping?
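For site operators who do want to opt out, the mechanism is the standard robots.txt file. Apple documents an “Applebot-Extended” user agent specifically for excluding a site’s content from AI training, separate from the general-purpose “Applebot” crawler; a minimal entry might look like the following (which paths to disallow is up to the site):

```
# Opt a whole site out of Apple's AI training data collection.
# Applebot (the general crawler) can still index the site for features
# like Siri and Spotlight unless it is blocked separately.
User-agent: Applebot-Extended
Disallow: /
```

This only helps the people who control the server, which is exactly the bind the article describes for creators whose work lives on sites they don’t run.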

Courtroom battles will decide the fate of generative AI models and how they are trained. For now, however, Apple is trying to position itself as an ethical player while avoiding unwanted legal scrutiny.
