Anthropic Faces Massive Damages? The Countless Ways AI Models "Pirate" Content

Deep News
Aug 15

The secret recipe behind large language models may be surprisingly simple: massive amounts of "pirated content." This has become an open secret within the industry.

In 2023, The New York Times sued OpenAI and Microsoft, formally opening this legal battle. The conflict soon spread across Silicon Valley: Meta faced class actions over its Llama model's alleged use of pirated books, Anthropic was likewise sued over the training data for its Claude model, and nearly every major player ended up in the defendant's seat.

The core dispute between large language models and copyright holders centers on whether using vast amounts of copyrighted works as AI training data without authorization constitutes legal "transformative use" or "copyright infringement."

Among the many pending cases, the Anthropic case has moved fastest. In a milestone ruling in June 2025, the court sent an extremely important signal: training a model is highly "transformative," because it creates something entirely new, and may therefore not constitute infringement. But acquiring training data from pirate sites or through unauthorized copying can hardly be excused under the "fair use" doctrine.

By some calculations, Anthropic could face astronomical damages of $750 billion: US statutory damages for willful infringement run as high as $150,000 per work, so at the scale of millions of pirated books the total quickly reaches hundreds of billions. This signal has made every AI company nervous. The "pollute first, clean up later" era of wild growth for model makers may be coming to an end.

**Multiple Data "Piracy" Paths for Large Models**

To feed their endless demand for data, the major model makers have each developed controversial, even "creative" approaches, all of them skirting the edge of legality.

**1. From Public Scraping to Deliberate "Cleaning"**

This is the most primitive and widespread method of AI data accumulation. AI companies deploy powerful web crawlers that cast a giant net across the global internet, indiscriminately capturing content from news websites, professional blogs, academic forums, and social media to build their initial training datasets.

For example, when OpenAI built its famous WebText dataset, it scraped millions of external links shared by users on Reddit, indirectly incorporating massive copyrighted content, including articles from The New York Times.
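As a rough illustration of this scrape-and-extract pattern, the sketch below fetches a page and keeps only its human-readable text. It is a minimal, hypothetical example (the URL, headers, and tag filtering are assumptions), not a reconstruction of OpenAI's actual WebText pipeline.

```python
# Minimal, hypothetical sketch of scraping a page and extracting its
# visible text. Illustrative only -- not OpenAI's actual WebText pipeline.
import requests
from bs4 import BeautifulSoup

def fetch_article_text(url: str) -> str:
    """Download a page and return its human-readable text."""
    resp = requests.get(url, headers={"User-Agent": "research-crawler/0.1"}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Drop non-content tags so only prose remains.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

# Hypothetical seed list; WebText reportedly used outbound links shared on Reddit.
corpus = [fetch_article_text(u) for u in ["https://example.com/article"]]
```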

Beyond scraping itself, the more damaging behavior is "cleaning." In lawsuits brought by The New York Times and the Daily News, plaintiffs alleged that OpenAI actively and systematically removed copyright notices, author attributions, footers, and other copyright management information (CMI) when scraping news content. This transforms the nature of the acquisition from possibly unintentional incidental copying to deliberate "data laundering" with clear intent to evade detection.
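To make the allegation concrete, here is a minimal sketch of what stripping CMI from scraped text could look like. The regex patterns are purely illustrative assumptions; no complaint discloses any company's actual cleaning code.

```python
# Hypothetical sketch of stripping copyright management information (CMI)
# from scraped text. The patterns below are illustrative assumptions, not
# any company's actual cleaning code.
import re

CMI_PATTERNS = [
    r"(?im)^\s*(?:©|\(c\)|copyright)\s+\d{4}.*$",  # copyright notices
    r"(?im)^\s*by\s+[a-z][\w.\- ]+$",              # author bylines
    r"(?im)^\s*all rights reserved\.?\s*$",        # boilerplate rights lines
]

def strip_cmi(text: str) -> str:
    for pattern in CMI_PATTERNS:
        text = re.sub(pattern, "", text)
    return text.strip()
```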

**2. Format Conversion: Extracting Text from Videos and Physical Books**

As high-quality public text data becomes increasingly scarce, manufacturers have turned to content in other formats, using technical means to convert it into plain text suitable for model training. This approach is more covert.

A typical example is OpenAI's "clever use" of its own speech-recognition tool, Whisper. OpenAI reportedly used Whisper to transcribe more than a million hours of YouTube video. The core "linguistic assets" of in-depth interviews, professional courses, and documentary narration were quietly extracted and fed to GPT-4 without the creators' permission, sidestepping audiovisual copyright.
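The core transcription call is simple with the open-source `openai-whisper` package (which also requires `ffmpeg` on the system). The sketch below is a single-file illustration with a placeholder file name; the reported million-hour effort would additionally need a video-download step and massive parallelism.

```python
# Transcription sketch using the open-source Whisper package
# (pip install openai-whisper; ffmpeg must be installed).
# The file name is a placeholder.
import whisper

model = whisper.load_model("base")          # small general-purpose model
result = model.transcribe("interview.mp3")  # any local audio file
print(result["text"])                       # plain text, ready for a corpus
```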

Anthropic adopted a more dramatic approach. Recognizing the enormous legal risk of using pirated book libraries directly, it hired Tom Turvey, former head of Google's book-scanning project, to run a costly and complex "physical-world whitewashing" plan:

First, bulk purchasing: spending huge sums to buy millions of physical books from distributors and retailers, including used books.

Second, physical conversion: transporting these books to service providers, where machines removed bindings, cut pages, then performed high-speed scanning to generate PDF digital files containing images and machine-readable text.

Third, destroying the originals: after scanning, the physical books were simply discarded. The core purpose was to be able to argue in court that this was "format conversion" of lawfully purchased copies rather than the creation of "additional copies," thus avoiding infringement charges.

Fourth, database creation: establishing detailed bibliographic databases for these digitized books and performing tokenization, cleaning, and other complex preprocessing to ultimately form a seemingly "legitimate" high-quality training dataset.
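As a rough sketch of this final step, the snippet below pulls the machine-readable text layer out of a scanned-book PDF and tokenizes it. The `pypdf` and `tiktoken` libraries, the file name, and the tokenizer choice are all assumptions for illustration; nothing here is Anthropic's actual stack.

```python
# Hypothetical sketch of the last preprocessing steps: extracting the
# text layer of a scanned-book PDF, then tokenizing it for training.
from pypdf import PdfReader
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a common GPT-style tokenizer

def pdf_to_tokens(path: str) -> list[int]:
    reader = PdfReader(path)
    # Concatenate the OCR/text layer of every page.
    text = " ".join(page.extract_text() or "" for page in reader.pages)
    # Basic cleaning: collapse the whitespace that scanning leaves behind.
    cleaned = " ".join(text.split())
    return enc.encode(cleaned)

tokens = pdf_to_tokens("scanned_book.pdf")
print(f"{len(tokens)} tokens")
```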

However, the whole whitewashing effort proved two things precisely: first, that AI companies fully recognize the copyright value of high-quality data; and second, that the cost of obtaining compliant data is far more staggering than imagined.

**3. "Shadow Libraries"**

Under intense technological competition and enormous performance pressure, some companies chose the most efficient but highest-risk shortcut: directly embracing outright pirated resource libraries.

Meta was directly accused of using illegal copies of books from "shadow libraries" (such as Library Genesis and Books3) to train its open-source Llama models. Similarly, Anthropic's internal documents showed that one of its co-founders downloaded the pirated Books3 library, containing nearly 200,000 books, in the company's early days, fully aware that the resource was pirated.

**4. Platforms Leveraging Privacy Agreements to Obtain Data**

Unlike the "hardcore" piracy methods above, the platform giants pursue a strategy characteristic of their position: rather than relying on external scraping or pirated libraries, they use their vast user ecosystems and terms of service to "legally" internalize user data as training material.

Alphabet's privacy policy explicitly states that it may use publicly shared user information to train AI models. When ordinary users collaborate on documents in Google Docs, write reviews on Google Maps, or publish articles on Blogger, that content may be incorporated, without their awareness, into Alphabet's AI training pool, helping the company build a data moat competitors find hard to cross.

These varied, legally borderline acquisition methods show that in AI's "land grab" phase, model makers prioritized getting the most data at the lowest cost and highest speed, treating the compliance risk of data sources as secondary. Copyright holders' lawsuits have shattered that calculus, striking precisely at the most vulnerable link: how the raw data was obtained in the first place.

**A More Expensive AI Era Has Arrived**

The real turning point in AI copyright wars was the shift in litigation focus: no longer entangling in how AI "uses" data, but directly targeting where it "obtains" data.

Initially, the legal battles centered on the nature of AI's "use" of data. AI companies argued their behavior was not traditional "copying" but "learning": a model internalizes patterns, grammar, and knowledge from data the way a student forms a writing style by reading vast numbers of books, with the aim of creating something entirely new, which makes the use highly "transformative."

Copyright holders countered that AI's commercial products directly compete with original works, replacing user demand for news subscriptions and book purchases, thereby harming core commercial interests.

However, copyright holders struggled mightily on both fronts. Amid this stalemate, their litigation strategy shifted decisively, finding a more fundamental and lethal point of attack: the legitimacy of data sources.

The courts' preliminary rulings sent a subtle but far-reaching signal. On one hand, early decisions suggested that AI output and the training process itself, being "transformative," might not constitute direct infringement, preserving room for model development rather than strangling the technology outright. On the other hand, courts drew a clear red line on source legitimacy, cracking down hard on the use of pirated resources.

Facing a blizzard of lawsuits, the radical camp among model makers is shifting toward conservatism. Apple represents the conservative camp, having prioritized user privacy and rules from the start: it preferred a later start in the AI race, avoiding legal risk through explicit licensing agreements (such as its partnership with Shutterstock) and proprietary data.

The radicals, like Meta and the early OpenAI, believed in Silicon Valley's "move fast and break things" creed, treating potential lawsuits as a calculable, bearable cost of business. Once entangled in litigation, however, OpenAI quickly turned into an active data "buyer," spending vast sums on content-licensing deals with dozens of media outlets, including the Associated Press and the Financial Times. Anthropic, for its part, went from downloading pirated book libraries to the costly "manual whitewashing" of buying, scanning, and destroying physical books.

These developments mean the golden age of "free data" has gone forever. Data will become a clear, expensive cost item on AI companies' financial statements.

From an industry perspective, publishers and news organizations holding high-quality content will shift from passive victims to key upstream participants in the AI value chain, wielding real bargaining power. This will sharply raise competitive barriers, handing tech giants with strong cash flow and top legal teams an even bigger advantage over AI startups.

AI industry competition has expanded from pure algorithm and computing power races to comprehensive warfare involving data supply chain management, business negotiation, and legal compliance capabilities.

As those controversial pirated "wild approaches" are blocked one by one, a more expensive AI era has already arrived.

