Apple Might Have Trained Its AI Models on Thousands of YouTube Videos




Apple, Anthropic, and other major artificial intelligence (AI) companies have reportedly trained AI models on data from hundreds of thousands of YouTube videos. A new report claims that several AI firms used a publicly available dataset called the Pile, which contained the plain text of videos’ subtitles without any video imagery. The data was collected from popular YouTube creators such as MrBeast, Marques Brownlee, and PewDiePie, as well as Indian YouTube creators such as CarryMinati, BB ki Vines, and Ashish Chanchlani.

Several AI Models Reportedly Trained on YouTube Videos

An investigation by Proof News found that subtitle data from as many as 173,536 YouTube videos had been taken from more than 48,000 channels. As per the report, EleutherAI, a non-profit AI research lab, curated this dataset. Later, it was used by companies such as Apple, Anthropic, Nvidia, Salesforce, and more. Notably, the AI lab published a research paper highlighting the details of the dataset.

EleutherAI created an 800GB data repository dubbed the Pile and made it publicly available for those who wanted to train AI models but could not afford large datasets. The majority of the dataset was taken from publicly available sources such as English Wikipedia, e-books, and more. However, it also contained the subtitles from all the videos compiled in a dataset called YouTube Subtitles.

The report claimed that the Pile was used to train Apple’s OpenELM AI model, on the basis of the research paper’s description. The research papers for Salesforce, Nvidia, and Anthropic’s AI models also reportedly mention the use of the dataset.

Anthropic spokesperson Jennifer Martinez told the publication in a statement, “The Pile includes a very small subset of YouTube subtitles. YouTube’s terms cover direct use of its platform, which is distinct from use of the Pile dataset. On the point about potential violations of YouTube’s terms of service, we’d have to refer you to the Pile authors.”

Notably, YouTube’s terms of service prohibit anyone from accessing the videos on the platform using automated means such as robots, botnets or scrapers. YouTube Subtitles would fall under the scraping category. A Google spokesperson told Proof News in an email response that the tech giant has taken “action over the years to prevent abusive, unauthorised scraping.” However, no comments were made about AI companies’ use of the data.

In a post on X (formerly known as Twitter), Marques Brownlee called out Apple for sourcing data from companies that included his videos’ transcripts, but he also highlighted that it was not the iPhone maker’s fault since it did not collect the data.

While this dataset was collected and distributed publicly, there could be other instances of data scraping on platforms such as YouTube. With AI companies scrambling to find more data to train their large language models (LLMs), data procurement might continue to enter similar legally grey areas.






