Anthropic physically scanned millions of print books to train its AI assistant, Claude, then discarded the originals, according to court documents reported by Ars Technica. The legal decision details the acquisition and destructive digitization of these texts, and the company's approach reflects a broader industry demand for high-quality textual information.
Anthropic engaged Tom Turvey, formerly the head of partnerships for Google Books, in February 2024. His mandate was to procure “all the books in the world” for the company. The hire was meant to replicate Google's legally validated book digitization strategy, which had survived copyright challenges and established fair use precedents. Destructive scanning itself is a common digitization practice, but Anthropic carried it out at an unusually large scale. The destructive process was faster and cheaper than non-destructive methods, and the company judged that those savings outweighed preserving the physical books.
Judge William Alsup ruled this destructive scanning operation constituted fair use. This determination was contingent on several factors: Anthropic legally purchased the books, destroyed each print copy post-scanning, and maintained the digital files internally without distribution. The judge analogized the process to “conserv[ing] space” through format conversion, deeming it transformative. Had this method been consistently applied from the outset, it might have established the first legally sanctioned instance of AI fair use. However, Anthropic’s earlier use of pirated material undermined its initial legal standing.
The AI industry exhibits a significant demand for high-quality text, which serves as a fundamental driver behind these data acquisition strategies. Large language models (LLMs), such as those powering Claude and ChatGPT, are trained by ingesting billions of words into neural networks. During this training, the AI system processes the text repeatedly, establishing statistical relationships between words and concepts. The quality of the training data directly influences the capabilities of the resulting AI model. Models trained on well-edited books and articles generally produce more coherent and accurate responses compared to those trained on lower-quality text sources.
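The statistical core of that training can be illustrated with a toy example. The snippet below is a minimal sketch, not Anthropic's training code: it counts which words follow which in a tiny made-up corpus and uses those counts to guess a likely next word, a crude stand-in for the word-and-concept relationships that neural networks learn from billions of words of book text.

```python
from collections import Counter, defaultdict

# Toy "training corpus" standing in for billions of words of book text.
corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat chased the dog ."
).split()

# Count how often each word follows each other word (bigram statistics).
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

# "Generate" by picking the most frequently observed next word.
def most_likely_next(word: str) -> str:
    candidates = following.get(word)
    return candidates.most_common(1)[0][0] if candidates else "."

print(most_likely_next("the"))  # e.g. 'cat', the most common follower of 'the'
print(most_likely_next("sat"))  # 'on'
```

The quality effect is visible even at this scale: garbled or poorly edited input produces garbled statistics, which is part of why well-edited books are prized as training data.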
Publishers retain legal control over content that AI companies seek for training purposes. Negotiating licenses for this content can be complex and time-consuming. The first-sale doctrine provided a legal workaround for Anthropic: once a physical book is purchased, the buyer can dispose of that specific copy, including destroying it. This principle allowed for the legal acquisition of physical books, circumventing direct licensing negotiations. Despite the legality, the procurement of physical books represented a substantial financial outlay.
Initially, Anthropic opted to use digitized versions of pirated books to acquire high-quality training data, a strategy chosen to avoid what CEO Dario Amodei termed the “legal/practice/business slog” of complex licensing negotiations. By 2024, however, Anthropic had become “not so gung ho about” using pirated ebooks for “legal reasons,” and needed a legally safer source of data. Purchasing used physical books offered a way to bypass licensing negotiations entirely while still providing the professionally edited text essential for AI model training, and destructive scanning allowed millions of volumes to be digitized quickly.
Anthropic invested “many millions of dollars” in this book buying and scanning operation, often acquiring used books in bulk. The process involved cutting the books from their bindings, trimming pages to workable dimensions, and scanning them as stacks of pages into PDFs containing machine-readable text and covers. All paper originals were subsequently discarded. Court documents do not indicate that any rare books were destroyed, as Anthropic procured its books in bulk from major retailers. Other methods exist for extracting text from paper while preserving the physical documents; the Internet Archive, for example, developed non-destructive book scanning techniques that keep physical volumes intact while creating digital copies.
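The court filings do not describe the software behind the scanning operation, but the basic step of turning a photographed page into searchable, machine-readable output can be sketched with off-the-shelf open-source tools. The example below is purely illustrative and assumes the Tesseract OCR engine and its pytesseract Python wrapper are installed; "page_001.png" is a hypothetical scan of a single cut page, not a file from the case record.

```python
import pytesseract  # Python wrapper for the open-source Tesseract OCR engine
from PIL import Image

# Hypothetical scan of one page cut from a book's binding.
page = Image.open("page_001.png")

# Extract plain machine-readable text from the page image...
text = pytesseract.image_to_string(page)

# ...or embed an invisible text layer behind the image, producing a
# searchable PDF page of the kind described in the court documents.
pdf_bytes = pytesseract.image_to_pdf_or_hocr(page, extension="pdf")
with open("page_001.pdf", "wb") as f:
    f.write(pdf_bytes)

print(text[:200])  # first 200 characters of the recognized text
```

Run over stacks of loose pages, a pipeline along these lines is what turns pallets of purchased books into the searchable PDFs the ruling describes.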
In a related development, OpenAI and Microsoft announced a collaboration with Harvard's libraries to train AI models on nearly 1 million public domain books, some dating back to the 15th century. Those books are being fully digitized, but the physical volumes are preserved.