Project Panama: Inside Anthropic’s secret race to scan millions of physical books

In early 2024, the artificial intelligence startup Anthropic initiated a clandestine operation dubbed “Project Panama,” an effort to “destructively scan” nearly every book in existence.

According to court filings obtained by The Washington Post, the company spent tens of millions of dollars over the course of a year to purchase millions of books and physically dismantle them by cutting off their spines. The pages were then scanned to feed vast quantities of information into the AI models powering products like the popular chatbot Claude.

The previously undisclosed details of Project Panama were revealed in more than 4,000 pages of documents related to a copyright lawsuit brought by authors against Anthropic, which investors recently valued at $183 billion. While the company agreed to a $1.5 billion settlement in August to resolve the case, a district judge’s decision last week to unseal a series of related documents has provided a clearer picture of Anthropic’s aggressive pursuit of literary data.

These new filings, alongside documents submitted in other copyright cases against AI firms, illustrate the extraordinary lengths to which technology companies, including Anthropic, Meta, Google, and OpenAI, have gone to acquire massive troves of data for “training” their software. The Anthropic litigation is part of a broader wave of legal action by authors, artists, photographers, and news organizations who claim their creative works are being exploited.

Court records reveal that these companies view books as a premier prize. In a January 2023 document, one of Anthropic’s co-founders suggested that training AI models on books would teach them “to write well,” rather than merely mimicking “low-quality internet slang.” Similarly, a 2024 internal Meta email described access to digital book archives as “essential” for staying competitive with AI rivals.

However, the records also show that these companies found it “impractical” to obtain direct permission from publishers and authors. Instead, Anthropic, Meta, and others devised ways to acquire books in bulk without the authors’ knowledge. According to court records, these methods included downloading pirated copies.

When Anthropic launched Project Panama to purchase and scan physical books, it turned to a Silicon Valley veteran. The company hired Tom Turvey, a former Google executive who two decades ago helped spearhead the famous but legally controversial Google Books project.

Anthropic initially considered sourcing books from libraries or iconic second-hand bookstores like New York City’s Strand, famous for its “18 miles” of new and used titles. A March 2024 document detailing an Anthropic content acquisition meeting noted that the store “was interested in providing second-hand books.” Documents also show Anthropic employees discussed approaching US libraries, including the New York Public Library or even “a chronically underfunded new library.”

It remains unclear which of these proposals, if any, were executed. A spokesperson for Strand, reached via email, stated that the bookstore did not sell any books to Anthropic.

Ultimately, documents indicate that Anthropic purchased millions of books, often in batches of tens of thousands, relying on used-book retailers such as Better World Books and the UK-based World of Books. While the final number of scanned books and the total cost were redacted in the documents, a project proposal from a vendor working with Anthropic specified that the AI firm was seeking a “document scanning service provider experienced in converting 500,000 to two million books over a six-month period.”

The document explained that the scanning firm’s “hydraulic-driven cutting machine” would “neatly cut” the books, and the pages would then be scanned using “high-speed, high-quality, production-level scanners.” Finally, the vendor would arrange a schedule with a “recycling company to collect the completed books.”

Internal messages show that Meta employees repeatedly expressed concerns that downloading millions of books without permission would violate copyright law. An internal email from December 2023, submitted in the copyright case against Meta, noted that the practice was approved after being “communicated to MZ,” an apparent reference to CEO Mark Zuckerberg.

In a recently released legal filing, Anthropic revealed that co-founder Ben Mann personally spent 11 days in June 2021 downloading fiction and non-fiction titles from “LibGen,” a well-known “shadow library” hosting pirated books and other copyrighted content. A screenshot of a web browser included in the files showed Mann using file-sharing software to download the data.

A year later, in July 2022, Mann welcomed the launch of a new website called Pirate Library Mirror, which claimed to host a massive database of books and stated, “we are intentionally violating copyright law in most countries.” Mann sent a link to the site to other Anthropic employees with the message: “just in time!!!”

In legal filings, Anthropic argued that it did not train a commercial AI model for profit using LibGen data and that it never used Pirate Library Mirror to train any completed AI model.

Ed Newton-Rex, a former AI executive and music composer who now leads a non-profit advocating for creators’ rights, said these revelations underscore that AI companies owe creators far more than they have paid to date. “We urgently need a reset in the AI industry so that creators start getting paid fairly for the vital contributions they make,” he said.

Google, Microsoft, and ChatGPT-maker OpenAI face similar copyright lawsuits from authors. Many of these cases remain pending, and the legal questions they raise are unresolved, said James Grimmelmann, a professor of digital and information law at Cornell Tech.

However, in two separate rulings, judges determined that tech companies’ use of books to train AI models without author or publisher permission might be legal under the “fair use” doctrine of copyright law. In June, District Judge William Alsup ruled that Anthropic had the right to use books for training because it processed the works in a “transformative” manner. The judge likened the AI training process to teachers “teaching school children how to write well.”

That same month, District Judge Vince Chhabria ruled in the Meta case that authors failed to prove the company’s AI models could harm the sales of their books.

Nevertheless, companies may still face legal jeopardy over how they acquired the books. In the Anthropic case, the judge accepted the scanning project but ruled that the company may have violated authors’ copyrights by downloading millions of pirated books for free before launching Project Panama. Alsup granted class-action status to authors whose works were included in two shadow libraries that Anthropic downloaded and stored for future use.

Rather than go to trial, Anthropic agreed to pay publishers and authors $1.5 billion without admitting wrongdoing. Authors whose books were downloaded can claim a share of the settlement, estimated at approximately $3,000 per book.

Aparna Sridhar, Anthropic’s deputy general counsel, stated in an email to The Washington Post: “This case has been resolved, but the court’s landmark June 2025 ruling remains valid. Judge Alsup argued that AI training is ‘fundamentally transformative’: Anthropic’s AI models were trained ‘not to copy or replace works, but to get over a difficult hump and create something different.’ What we settled on was how some materials were obtained, not whether we could use them to develop AI models.”

Documents released in the Meta lawsuit suggest that the social media giant’s employees were equally data-hungry and willing to take legal risks to obtain it. While Judge Chhabria sided with Meta on the use of books for training, he allowed authors to proceed with claims that Meta illegally distributed copies of pirated books. The plaintiffs are seeking class-action status for these claims in the Northern District of California.

In that case, authors alleged that Meta’s senior executives considered purchasing books for training but instead opted to download millions of books for free from “torrent” platforms that facilitate online piracy. Internal documents, some previously reported, show Meta employees expressing concerns that their actions were risky or wrong and discussing how to hide their tracks.

One engineer wrote in 2023, “Downloading torrents from a company laptop doesn’t feel right.” The same employee later voiced concern to the legal team that using torrent sites might require sharing pirated works with others, which “might not be legally appropriate.”

A December 2023 email clearly stated that the use of LibGen was approved after Zuckerberg was notified. “After prior notification to MZ, GenAI’s use of LibGen for Llama 3 was approved… with a series of agreed-upon mitigating measures,” the email read, before listing legal and political risks. It noted that media reports suggesting the use of a known pirate dataset like LibGen could “weaken our negotiating position with regulators on these issues.”

By April 2024, internal correspondence showed the company moving to download LibGen and other shadow libraries. Chat logs show one employee asking another why they were using servers rented from Amazon for torrenting instead of Facebook-owned servers. The answer: “to avoid the risk of the activity being traced back to the company.”

In a filing last month, Meta’s lawyers wrote that the company “denies distributing the plaintiffs’ works while downloading training data using torrents.”

In a separate 2023 case, authors accused OpenAI and Microsoft of violating copyright law by using books for AI training. OpenAI, where Mann and Anthropic CEO Dario Amodei worked before founding their own firm, admitted to downloading LibGen but told the court it deleted the files before the launch of ChatGPT.

Justin A. Nelson, an attorney at Susman Godfrey LLP representing authors in both the OpenAI and Anthropic cases, said: “OpenAI fired the opening shot that led to the widespread piracy by AI companies and the exploitation of all human expression.”

Earlier this month, two major publishers applied to join a group of authors and illustrators in a 2023 copyright lawsuit against Google.

Grimmelmann, the Cornell Tech law professor, observed that AI companies “led themselves into a delusion” regarding the use of copyrighted data. The breakthroughs behind tools like ChatGPT began in academic research, where the use of copyrighted material for training is widely accepted, but researchers continued the practice even as AI models became commercialized.

“By the time the tension became apparent, they had invested heavily in incorporating copyrighted data into their workflows and were in a fast-paced, high-stakes competition to launch newer and better models,” Grimmelmann said.
