OpenAI searches for an answer to its copyright problems

30 August 2024 News, Press Review

The huge leaps in OpenAI’s GPT model probably came from sucking down the entire written web. That includes entire archives of major publishers such as Axel Springer, Condé Nast, and The Associated Press — without their permission. But for some reason, OpenAI has announced deals with many of these conglomerates anyway.

At first glance, this doesn’t entirely make sense. Why would OpenAI pay for something it already had? And why would publishers, some of whom are lawsuit-style angry about their work being stolen, agree?

I suspect if we squint at these deals long enough, we can see one possible shape of the future of the web forming. Google has been referring less and less traffic outside itself — which threatens the existence of the entire rest of the web. That’s a power vacuum in search that OpenAI may be trying to fill.

Let’s start with what we know. The deals give OpenAI access to publications in order to, for instance, “enrich users’ experience with ChatGPT by adding recent and authoritative content on a wide variety of topics,” according to the press release announcing the Axel Springer deal. The “recent content” part is clutch. Scraping the web means there’s a date beyond which ChatGPT can’t retrieve information. The closer OpenAI is to real-time access, the closer its products are to real-time results.

The terms around the deals have remained murky, I assume because everyone has been thoroughly NDA’d. Certainly I am in the dark about the specifics of the deal with Vox Media, the parent company of this publication. In the case of the publishers, keeping details private gives them a stronger hand when they pivot to, let’s say, Google and AI startup Anthropic — in the same way that not disclosing your previous salary lets you ask for more money from a new would-be employer.

OpenAI has been offering as little as $1 million to $5 million a year to publishers, according to The Information. There’s been some reporting on the deals with publishers such as Axel Springer, the Financial Times, News Corp, Condé Nast, and the AP. My back-of-the-envelope math based on publicly reported figures suggests that the ceiling on these deals is $10 million per publication per year.

On the one hand, this is peanuts, just embarrassingly small amounts of money. (The company’s former top researcher Ilya Sutskever made $1.9 million in 2016 alone.) On the other hand, OpenAI has already scraped all these publications’ data anyway. Unless and until it is prohibited by courts from doing so, it can just keep doing that. So what, exactly, is it paying for?

Maybe it’s API access, to make scraping easier and more current. As it stands, ChatGPT can’t answer up-to-the-moment queries; API access might change that.

But these payments can also be thought of as a way of ensuring publishers don’t sue OpenAI over the material it has already scraped. One major publication has already filed suit, and the fallout could be much more expensive for OpenAI. The legal wrangling will take years.

If OpenAI ingested the entirety of the text-based internet, that means a couple things. First, that there’s no way to generate that volume of data again anytime soon, so that may limit any further leaps in usefulness from ChatGPT. (OpenAI notably has not yet released GPT-5.) Second, that a lot of people are pissed.

Many of those people have filed lawsuits, and the most important was filed by The New York Times. The Times’ lawsuit alleges that when OpenAI ingested its work to train its LLMs, it engaged in copyright infringement. Moreover, the product OpenAI created by doing this now competes with the Times and is meant to “steal audiences away from it.”

The Times’ lawsuit says that it tried to negotiate with OpenAI to permit the use of its work, but those negotiations failed. I’m going to take a wild guess based on the math I did above and say it’s because OpenAI offered insultingly low sums of money to the Times. OpenAI’s excuse for using the work anyway? Fair use, a provision that allows the unlicensed use of copyrighted material under certain circumstances.

If the Times wins its lawsuit, it may be entitled to statutory damages, which start at $750 per work. (I know that figure because, as you may have guessed from my use of “statutory,” it is dictated by law. The paper is also asking for compensatory damages, restitution, and attorneys’ fees.) The Times says that OpenAI ingested 10 million total works, so at $750 apiece that’s an absolute minimum of $7.5 billion in statutory damages alone. No wonder the Times wasn’t going to cut a deal in the single-digit millions.
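For anyone who wants to check that arithmetic, here is a minimal sketch of it. It uses only figures cited in this article: the $750 per-work floor, the Times' count of 10 million works, and my own rough $10 million-a-year ceiling on publisher deals. The variable and function names are illustrative, and the calculation assumes every work is counted once at the minimum, which a court is in no way bound to do.

```python
# Figures as cited in the article; all names here are illustrative.
MIN_STATUTORY_DAMAGES_PER_WORK = 750   # dollars, the per-work floor
WORKS_ALLEGEDLY_INGESTED = 10_000_000  # the Times' stated count of works
DEAL_CEILING_PER_YEAR = 10_000_000     # dollars/year, the author's rough estimate


def minimum_statutory_damages(works: int, per_work: int) -> int:
    """Lowest possible statutory award: every work priced at the floor."""
    return works * per_work


floor = minimum_statutory_damages(
    WORKS_ALLEGEDLY_INGESTED, MIN_STATUTORY_DAMAGES_PER_WORK
)
# How many years of a ceiling-level licensing deal the floor would equal.
years_of_ceiling_deal = floor / DEAL_CEILING_PER_YEAR

print(f"Statutory floor: ${floor:,}")
print(f"Years of a ceiling-level deal: {years_of_ceiling_deal:,.0f}")
```

Put that way, the floor alone works out to 750 years of the richest deal I could estimate, which is a tidy illustration of how far apart the two sides' numbers are.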

So when OpenAI makes its deals with publishers, they are, functionally, settlements that guarantee the publishers won’t sue OpenAI as the Times is doing. They are also structured so that OpenAI can maintain that its previous use of the publishers’ work is fair use, because OpenAI is going to have to argue that in multiple court cases, most notably the one with the Times.

“I do have every reason to believe that they would like to preserve their rights to use this under fair use,” says Danielle Coffey, the CEO of the News Media Alliance. “They wouldn’t be arguing that in a court if they didn’t.”

It seems like OpenAI is hoping to clean up its reputation a little. If you’re introducing a new product you want people to pay for, it simply can’t come with a ton of baggage and uncertainty. And OpenAI does have baggage: to make its fair use defense, it must admit to taking The New York Times’ copyrighted material without permission — which implicitly suggests it’s taken a lot of other copyrighted material without permission, too. Its argument is just that it is legally entitled to do that.

There’s also a question of accuracy. At this point, we all know generative AI makes stuff up. The publisher deals don’t just provide legitimacy — they may also help feed generative AI information that is less likely to result in embarrassing errors.

There’s more at play than just lawsuit prevention and reputation management. Remember how the deals also give OpenAI up-to-date information? OpenAI recently announced SearchGPT, its very own search engine. AI-native web searching is still nascent, but being able to filter out AI-generated SEO glurge in favor of real sources of reliable information would be a leg up.
