The New York Times prohibits AI vendors from devouring its content
In early August, The New York Times updated its terms of service (TOS) to prohibit scraping its articles and images for AI training, reports Adweek. The move comes at a time when tech companies have continued to monetize AI language apps such as ChatGPT and Google Bard, which gained their capabilities through massive unauthorized scrapes of Internet data.
The new terms prohibit the use of Times content—which includes articles, videos, images, and metadata—for training any AI model without express written permission. In Section 2.1 of the TOS, the NYT says that its content is for the reader’s “personal, non-commercial use” and that non-commercial use does not include “the development of any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system.”
Further down, in Section 4.1, the terms say that without NYT’s prior written consent, no one may “use the Content for the development of any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system.”
NYT also outlines the consequences for ignoring the restrictions: “Engaging in a prohibited use of the Services may result in civil, criminal, and/or administrative penalties, fines, or sanctions against the user and those assisting the user.”
As threatening as that sounds, restrictive terms of use have not previously stopped the wholesale gobbling of the Internet into machine learning data sets. Every large language model available today, including OpenAI’s GPT-4, Anthropic’s Claude 2, Meta’s Llama 2, and Google’s PaLM 2, has been trained on large data sets of material scraped from the Internet. Through a process called unsupervised learning (more precisely, self-supervised training), the web data was fed into neural networks, allowing the models to develop a conceptual sense of language by analyzing the statistical relationships between words.
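As a rough, hypothetical illustration of that idea, and nothing like the scale or architecture of a real LLM, the toy Python sketch below “learns” from raw, unlabeled text simply by counting which words tend to follow which; the corpus string is invented for the example:

    from collections import Counter, defaultdict

    # Raw, unlabeled "training data" -- the only supervision signal is
    # the text itself (each word's "label" is simply the word after it).
    corpus = "the times updated the terms and the terms now prohibit ai training"

    # Count how often each word follows each other word.
    follows = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

    # The counts alone encode crude relationships between words:
    # after "the", this toy model expects "times" or "terms".
    for prev in ("the", "terms"):
        total = sum(follows[prev].values())
        print(prev, "->", {w: round(n / total, 2) for w, n in follows[prev].items()})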
The legality of training AI models on scraped data has not been fully resolved in US courts, and the practice has already prompted at least one lawsuit accusing OpenAI of plagiarism. Last week, the Associated Press and several other news organizations published an open letter saying that “a legal framework must be developed to protect the content that powers AI applications,” among other concerns.
OpenAI likely anticipates further legal challenges and has begun making moves that may be designed to get ahead of some of this criticism. For example, the company recently detailed how websites can block its AI-training web crawler, GPTBot, through robots.txt, which prompted several sites and authors to publicly state that they would do so.
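Per OpenAI’s announcement, GPTBot honors the standard Robots Exclusion Protocol, so a site that wanted to opt out entirely could add rules like the following to its robots.txt file (this blocks the crawler site-wide; a real deployment might scope the Disallow rule more narrowly):

    # Block OpenAI's AI-training crawler across the whole site
    User-agent: GPTBot
    Disallow: /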
For now, what has already been scraped is baked into GPT-4, including New York Times content. We may have to wait until GPT-5 to see whether OpenAI or other AI vendors respect content owners’ wishes to be left out. If not, new AI lawsuits—or regulations—may be on the horizon.
https://arstechnica.com/?p=1960621