Investigation reveals unauthorized data scraping from YouTube for AI training

published on 11 September 2025

A recent investigation has unveiled widespread unauthorized data scraping by major technology companies, igniting concerns over data ownership and fairness in the rapidly advancing field of artificial intelligence (AI). According to the report, companies including Meta, Microsoft, and Nvidia have extracted over 15.8 million videos from YouTube without obtaining consent from content creators. These videos, collected from more than 2 million YouTube channels, are being used to train sophisticated generative AI video models, intensifying competition in the AI sector.

This large-scale data extraction, conducted in violation of YouTube’s terms of service, has left many creators feeling betrayed and uncertain about the future of their work. Jon Peters, a woodworker whose videos were among those scraped, voiced his frustration, saying, "Should I keep making things in the hope of connecting with people, or just stop altogether?" His dilemma reflects a broader anxiety within the creator community, as their content, created through significant effort, is being repurposed to develop AI tools that could ultimately compete with them.

The scope of data scraping

The investigation highlights the sheer scale of this unauthorized activity, identifying at least 13 datasets used by leading tech companies, including Amazon, ByteDance, Snap, and Tencent. These findings also corroborate earlier accusations of data scraping by other major players such as Apple and Anthropic. The revelations underscore a troubling pattern of unregulated data collection, even as YouTube attempts to mediate the tension between its creators and the AI ambitions of its parent company, Google.

In response to growing criticism, YouTube introduced new tools in December 2024 to give creators more control over how their work is used. Among these measures is an opt-in setting for allowing content to be used in AI training, which is turned off by default. Additionally, updates to the Content ID system aim to improve transparency by detecting AI-generated faces and voices. However, these efforts have done little to quell concerns, especially as Google continues to utilize YouTube content to develop its own AI models.

The growing outcry has prompted a series of lawsuits against prominent tech firms accused of unfairly profiting from creators’ work. Creators like David Millette have filed lawsuits against companies such as Nvidia and OpenAI, alleging unjust enrichment and unfair competition. These lawsuits form part of a larger legal reckoning for the AI industry, which is increasingly under scrutiny for its methods of acquiring training data.

High-profile cases are also shaping the legal landscape. Disney and Universal have initiated legal action against AI lab Midjourney, accusing it of using stolen intellectual property to train its models. Disney’s general counsel stressed that infringement is no less serious when it is committed by an AI company. Similarly, Anthropic’s proposed $1.5 billion settlement of a copyright infringement lawsuit brought by book authors was challenged by U.S. District Judge William Alsup. The judge raised concerns about the legitimacy of the settlement, highlighting the distinction between training AI models and the problematic acquisition of data.

The race for AI dominance

These revelations come as tech companies pour billions into generative AI technologies, with the market projected to exceed $2.5 billion by 2032. Industry leaders are scrambling to secure high-quality datasets, which are essential for developing cutting-edge AI models. Google, for instance, is advancing its Veo 3 model, capable of generating videos with synchronized audio, while Microsoft offers OpenAI’s Sora model at no cost to users. Meanwhile, Meta has integrated technology from Midjourney to maintain its competitive edge.

However, the relentless drive for innovation has sparked a clash between the rights of content creators and the ambitions of AI developers. As the debate over ethical data use intensifies, creators and advocates are demanding accountability and consent in an industry that relies heavily on uncredited labor.

The ongoing conflict underscores a fundamental tension between the immense value of creator-generated content as a resource for AI development and the ethical questions surrounding its unauthorized use. As the legal and ethical landscape continues to evolve, this controversy raises urgent questions about the future of both AI and content creation.
