The New York Times Updates Terms of Service to Prevent AI Scraping Its Content


Among the early use cases of AI within newsrooms appears to be fighting AI itself.

The New York Times updated its terms of service Aug. 3 to forbid the scraping of its content to train a machine learning or AI system.

The content includes but is not limited to text, photographs, images, illustrations, designs, audio clips, video clips, “look and feel” and metadata, including the party credited as the provider of such content.

The updated TOS also prohibits website crawlers, which let pages get indexed for search results, from using content to train LLMs or AI systems.

Defying these rules could result in penalties, per the terms of service, although it’s unclear what the penalties would look like. When contacted for this piece, The New York Times said that it didn’t have any additional comment beyond the TOS.

“Most boilerplate terms of service include restrictions on data scraping, but the explicit reference to training AI is new,” said Katie Gardner, partner at Gunderson Dettmer.

AI models rely on content and data, including journalism pieces and copyrighted art, as a main source of information for generating their output. In some cases, this content is replicated verbatim. Publishers, especially those with paywalls and healthy subscription businesses, are concerned that AI models will undermine their revenue streams by publishing repurposed content without credit, and that they will contribute to misinformation, degrading people’s trust in news.

The confusing case of creepy crawlers

The data behind LLMs like ChatGPT is gathered in much the same way that website crawlers scan content on publishers’ sites and feed that information into search results.

While publishers can see crawlers visiting their sites, they cannot know their exact purposes, whether for search engine optimization or training AI models. Some paywall tech companies are looking at ways to block crawlers, according to Digiday’s reporting.
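To illustrate that limit, here is a minimal Python sketch of what a publisher can read off a crawler’s visit. It assumes the publisher has the User-Agent header of each request; the `KNOWN_CRAWLERS` table, the `identify_crawler` helper and the sample strings are illustrative and not taken from any publisher’s actual tooling. The header can suggest who operates a bot, but nothing in it states whether the pages will feed a search index or a training set.

```python
from collections import Counter

# Substrings that some well-known crawlers include in their User-Agent header.
# The operator labels are this sketch's own shorthand (an assumption), not an
# official registry.
KNOWN_CRAWLERS = {
    "Googlebot": "Google",
    "bingbot": "Microsoft Bing",
    "GPTBot": "OpenAI",
    "CCBot": "Common Crawl",
}


def identify_crawler(user_agent: str) -> str:
    """Return the operator suggested by a User-Agent string, or 'unknown'."""
    ua = user_agent.lower()
    for token, operator in KNOWN_CRAWLERS.items():
        if token.lower() in ua:
            return operator
    return "unknown"


# Illustrative User-Agent strings, not real log data.
SAMPLE_USER_AGENTS = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "CCBot/2.0 (https://commoncrawl.org/faq/)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]


def tally_operators(user_agents):
    """Count visits per suggested operator."""
    return Counter(identify_crawler(ua) for ua in user_agents)


if __name__ == "__main__":
    for operator, visits in tally_operators(SAMPLE_USER_AGENTS).items():
        print(operator, visits)
```

The tally names an operator for each visit and stops there, which is the gap publishers describe. A User-Agent header can also be spoofed, so even the operator is a hint and not proof.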

Crawlers like CommonCrawl, with a data set of 3.15 billion web pages, have brokered deals with OpenAI, Meta, and Google for AI training, per The Decoder.

Earlier this week, OpenAI launched GPTBot, a web crawler that collects data to improve AI models, along with a way for publishers to control GPTBot’s access to their website content. Still, significant players in the field, namely Microsoft’s Bing and Google’s Bard, have not added this functionality to their bots, leaving publishers struggling to control what the crawlers scrape.
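In practice, that control works through a site’s robots.txt file, where a rule addressed to the `GPTBot` user agent tells the crawler which paths it may visit. The Python sketch below shows how such a rule is interpreted, using the standard library’s `urllib.robotparser`. The robots.txt contents, the `publisher.example` domain and the `can_fetch` helper are hypothetical, and the rule only has force if the crawler chooses to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher: GPTBot is barred from
# the whole site, while every other crawler is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Report whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical article URL on the imaginary publisher's site.
    url = "https://publisher.example/2023/08/some-article"
    for agent in ("GPTBot", "Googlebot"):
        print(agent, can_fetch(agent, url))
```

Because the file names one crawler at a time, blocking GPTBot leaves search crawlers untouched, and a bot that offers no separate user agent for AI training cannot be singled out this way.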

While tech companies like OpenAI are reluctant to disclose what they train their AI models on, The Washington Post analyzed Google’s C4 data set, a smaller version of the CommonCrawl data set, to understand what was training the models. It found evidence that content from 15 million websites, including The New York Times, has been used to train LLMs such as Meta’s LLaMA and Google’s T5, an open-source language model that helps developers build software for translation tasks.

All this has spurred other publishers to reevaluate their terms of service, according to Chris Pedigo, svp for government affairs at trade body Digital Content Next, whose members include The New York Times and The Washington Post.

More licensing deals to come

While it’s unclear how AI companies will respond to these updated terms of service, they have a vested interest in shielding themselves from legal repercussions.

As a result, AI companies and major publishers are in discussions to establish licensing agreements, such as the deal between OpenAI and The Associated Press, according to Pedigo.

These deals are primarily designed to have AI companies compensate publishers for their content. However, publishers want them to go beyond just financial matters.

Ongoing negotiations look at how to cite publishers for their content, including aspects like footnotes. Simultaneously, there is a focus on establishing mechanisms such as guardrails and fact-checking processes within AI companies to prevent the generation of factually inaccurate content by the LLMs.

“Publishers would not want to be associated with that, especially if they’re going to have a licensing deal,” said Pedigo. “Publishers want to make sure that information meets the brand level.”

https://www.adweek.com/media/the-new-york-times-updates-terms-of-service-to-prevent-ai-scraping-its-content/