The scary truth about AI copyright is nobody knows what will happen next
Generative AI has had a very good year. Corporations like Microsoft, Adobe, and GitHub are integrating the tech into their products; startups are raising hundreds of millions to compete with them; and the software even has cultural clout, with text-to-image AI models spawning countless memes. But listen in on any industry discussion about generative AI, and you’ll hear, in the background, a question whispered by advocates and critics alike in increasingly concerned tones: is any of this actually legal?
The question arises because of the way generative AI systems are trained. Like most machine learning software, they work by identifying and replicating patterns in data. But because these programs are used to generate code, text, music, and art, that data is itself created by humans, scraped from the web and copyright protected in one way or another.
For AI researchers in the far-flung misty past (aka the 2010s), this wasn’t much of an issue. At the time, state-of-the-art models were only capable of generating blurry, fingernail-sized black-and-white images of faces. This wasn’t an obvious threat to any human artist. But in the year 2022, when a lone amateur can use software like Stable Diffusion to copy an artist’s style in a matter of hours, or when companies are selling AI-generated prints and social media filters that are explicit knock-offs of living designers, questions of legality and ethics have become much more pressing.
Generative AI models are trained on copyright-protected data — is that legal?
Take the case of Hollie Mengert, a Disney illustrator who found that her art style had been cloned as an AI experiment by a mechanical engineering student in Canada. The student downloaded 32 of Mengert’s pieces and took a few hours to train a machine learning model that could reproduce her style. As Mengert told technologist Andy Baio, who reported the case: “For me, personally, it feels like someone’s taking work that I’ve done, you know, things that I’ve learned — I’ve been a working artist since I graduated art school in 2011 — and is using it to create art that that [sic] I didn’t consent to and didn’t give permission for.”
But is that fair? And can Mengert do anything about it?
To answer these questions and understand the legal landscape surrounding generative AI, The Verge spoke to a range of experts, including lawyers, analysts, and employees at AI startups. Some said with confidence that these systems were certainly capable of infringing copyright and could face serious legal challenges in the near future. Others suggested, equally confidently, that the opposite was true: that everything currently happening in the field of generative AI is legally above board and that any lawsuits are doomed to fail.
“I see people on both sides of this extremely confident in their positions, but the reality is nobody knows,” Baio, who’s been following the generative AI scene closely, told The Verge. “And anyone who says they know confidently how this will play out in court is wrong.”
Andres Guadamuz, an academic specializing in AI and intellectual property law at the UK’s University of Sussex, suggested that while there were many unknowns, there were also just a few key questions from which the topic’s many uncertainties unfold. First, can you copyright the output of a generative AI model, and if so, who owns it? Second, if you own the copyright to the input used to train an AI, does that give you any legal claim over the model or the content it creates? Once these questions are answered, an even larger one emerges: how do you deal with the fallout of this technology? What kind of legal restraints could — or should — be put in place on data collection? And can there be peace between the people building these systems and those whose data is needed to create them?
Let’s take these questions one at a time.
The output question: can you copyright what an AI model creates?
For the first query, at least, the answer is not too difficult. In the US, there is no copyright protection for works generated solely by a machine. However, it seems that copyright may be possible in cases where the creator can prove there was substantial human input.
In September, the US Copyright Office granted a first-of-its-kind registration for a comic book generated with the help of text-to-image AI Midjourney. The comic is a complete work: an 18-page narrative with characters, dialogue, and a traditional comic book layout. And although it’s since been reported that the USCO is reviewing its decision, the comic’s copyright registration hasn’t actually been rescinded yet. It seems that one factor in the review will be the degree of human input involved in making the comic. Kristina Kashtanova, the artist who created the work, told IPWatchdog that she had been asked by the USCO “to provide details of my process to show that there was substantial human involvement in the process of creation of this graphic novel.” (The USCO itself does not comment on specific cases.)
According to Guadamuz, this will be an ongoing issue when it comes to granting copyright for works generated with the help of AI. “If you just type ‘cat by van Gogh,’ I don’t think that’s enough to get copyright in the US,” he says. “But if you start experimenting with prompts and produce several images and start fine-tuning your images, start using seeds, and start engineering a little more, I can totally see that being protected by copyright.”
Copyrighting an AI model’s output will likely depend on the degree of human involvement
With this rubric in mind, it’s likely that the vast majority of the output of generative AI models cannot be copyright protected. These works are generally churned out en masse with just a few keywords used as a prompt. But more involved processes would make for better cases. These might include controversial pieces, like the AI-generated print that won a state art fair competition. In this case, the creator said he spent weeks honing his prompts and manually editing the finished piece, suggesting a relatively high degree of intellectual involvement.
Giorgio Franceschelli, a computer scientist who’s written on the problems surrounding AI copyright, says this emphasis on human input will be “especially true” for deciding cases in the EU. And in the UK — the other major jurisdiction of concern for Western AI startups — the law is different yet again. Unusually, the UK is one of only a handful of nations to offer copyright for works generated solely by a computer, but it deems the author to be “the person by whom the arrangements necessary for the creation of the work are undertaken.” Again, there’s room for multiple readings (would this “person” be the model’s developer or its operator?), but it offers precedent for some sort of copyright protection to be granted.
Ultimately, though, registering copyright is only a first step, cautions Guadamuz. “The US copyright office is not a court,” he says. “You need registration if you’re going to sue someone for copyright infringement, but it’s going to be a court that decides whether or not that’s legally enforceable.”
The input question: can you use copyright-protected data to train AI models?
For most experts, the biggest questions concerning AI and copyright relate to the data used to train these models. Most systems are trained on huge amounts of content scraped from the web, be that text, code, or imagery. The training dataset for Stable Diffusion, for example — one of the biggest and most influential text-to-image AI systems — contains billions of images scraped from hundreds of domains: everything from personal blogs hosted on WordPress and Blogspot to art platforms like DeviantArt and stock imagery sites like Shutterstock and Getty Images. Indeed, training datasets for generative AI are so vast that there’s a good chance you’re already in one (there’s even a website where you can check by uploading a picture or searching some text).
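To make that concrete, here’s a minimal Python sketch of what such a membership check boils down to, using the publicly released LAION image metadata on Hugging Face. The dataset ID and uppercase column names follow LAION’s published releases, but treat them, and the brute-force scan, as illustrative assumptions; real lookup tools rely on prebuilt search indexes rather than reading billions of rows one by one.

```python
from itertools import islice
from datasets import load_dataset

def appears_in_laion(image_url: str, max_rows: int = 1_000_000) -> bool:
    # Stream the metadata rather than downloading it; even so, scanning
    # billions of rows linearly is impractical, hence the row cap.
    rows = load_dataset("laion/laion2B-en", split="train", streaming=True)
    for row in islice(rows, max_rows):
        if row["URL"] == image_url:
            print(f"Found, with caption: {row['TEXT']!r}")
            return True
    return False

appears_in_laion("https://example.com/my-art.jpg")  # hypothetical URL
```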
The justification used by AI researchers, startups, and multibillion-dollar tech companies alike is that using these images is covered (in the US, at least) by fair use doctrine, which aims to encourage the use of copyright-protected work to promote freedom of expression.
When deciding if something is fair use, there are a number of considerations, explains Daniel Gervais, a professor at Vanderbilt Law School who specializes in intellectual property law and has written extensively on how it intersects with AI. Two factors, though, have “much, much more prominence,” he says. “What’s the purpose or nature of the use and what’s the impact on the market.” In other words: does the use case change the nature of the material in some way (usually described as a “transformative” use), and does it threaten the livelihood of the original creator by competing with their works?
Training a generative AI on copyright-protected data is likely legal, but you could use that same model in illegal ways
Considering the weight placed on these factors, Gervais says “it is much more likely than not” that training systems on copyrighted data will be covered by fair use. But the same cannot necessarily be said for generating content. In other words: you can train an AI model using other people’s data, but what you do with that model might be infringing. Think of it as the difference between making fake money for a movie and trying to buy a car with it.
Consider the same text-to-image AI model deployed in different scenarios. If the model is trained on many millions of images and used to generate novel pictures, it’s extremely unlikely that this constitutes copyright infringement. The training data has been transformed in the process, and the output does not threaten the market for the original art. But, if you fine-tune that model on 100 pictures by a specific artist and generate pictures that match their style, an unhappy artist would have a much stronger case against you.
“If you give an AI 10 Stephen King novels and say, ‘Produce a Stephen King novel,’ then you’re directly competing with Stephen King. Would that be fair use? Probably not,” says Gervais.
Crucially, though, between these two poles of fair and unfair use, there are countless scenarios in which input, purpose, and output are all balanced differently and could sway any legal ruling one way or another.
Ryan Khurana, chief of staff at generative AI company Wombo, says most companies selling these services are aware of these differences. “Intentionally using prompts that draw on copyrighted works to generate an output […] violates the terms of service of every major player,” he told The Verge over email. But, he adds, “enforcement is difficult,” and companies are more interested in “coming up with ways to prevent using models in copyright violating ways […] than limiting training data.” This is particularly true for open-source text-to-image models like Stable Diffusion, which can be trained and used with zero oversight or filters. The company might have covered its back, but it could also be facilitating copyright-infringing uses.
Another variable in judging fair use is whether or not the training data and model have been created by academic researchers and nonprofits. This generally strengthens fair use defenses, and startups know it. So, for example, Stability AI, the company that distributes Stable Diffusion, didn’t directly collect the software’s training data or train the models behind it. Instead, it funded and coordinated this work by academics, and the resulting Stable Diffusion model is licensed by a German university. This lets Stability AI turn the model into a commercial service (DreamStudio) while keeping legal distance from its creation.
Baio has dubbed this practice “AI data laundering.” He notes that this method has been used before with the creation of facial recognition AI software, and points to the case of MegaFace, a dataset compiled by researchers from the University of Washington by scraping photos from Flickr. “The academic researchers took the data, laundered it, and it was used by commercial companies,” says Baio. Now, he says, this data — including millions of personal pictures — is in the hands of “[facial recognition firm] Clearview AI and law enforcement and the Chinese government.” Such a tried-and-tested laundering process will likely help shield the creators of generative AI models from liability as well.
There’s a last twist to all this, though, as Gervais notes that the current interpretation of fair use may actually change in the coming months due to a pending Supreme Court case involving the Andy Warhol Foundation and photographer Lynn Goldsmith. The case concerns Warhol’s use of Goldsmith’s photographs of Prince to create artwork. Was that fair use, or was it copyright infringement?
“The Supreme Court doesn’t do fair use very often, so when they do, they usually do something major. I think they’re going to do the same here,” says Gervais. “And to say anything is settled law while waiting for the Supreme Court to change the law is risky.”
How can artists and AI companies make peace?
Even if the training of generative AI models is found to be covered by fair use, that will hardly solve the field’s problems. It won’t placate the artists angry their work has been used to train commercial models, nor will it necessarily hold true across other generative AI fields, like code and music. With this in mind, the question is: what remedies can be introduced, technical or otherwise, to allow generative AI to flourish while giving credit or compensation to the creators whose work makes the field possible?
The most obvious suggestion is to license the data and pay its creators. For some, though, this will kill the industry. Bryan Casey and Mark Lemley, authors of “Fair Learning,” a legal paper that has become the backbone of arguments touting fair use for generative AI, say training datasets are so large that “there is no plausible option simply to license all of the underlying photographs, videos, audio files, or texts for the new use.” Allowing any copyright claim, they argue, is “tantamount to saying, not that copyright owners will get paid, but that the use won’t be permitted at all.” Permitting “fair learning,” as they frame it, not only encourages innovation but allows for the development of better AI systems.
Others, though, point out that we’ve already navigated copyright disputes of comparable scale and complexity and can do so again. A comparison invoked by several experts The Verge spoke to was the era of music piracy, when file-sharing programs were built on the back of massive copyright infringement and prospered only until legal challenges led to new agreements that respected copyright.
“So, in the early 2000s, you had Napster, which everybody loved but was completely illegal. And today, we have things like Spotify and iTunes,” Matthew Butterick, a lawyer currently suing companies for scraping data to train AI models, told The Verge earlier this month. “And how did these systems arise? By companies making licensing deals and bringing in content legitimately. All the stakeholders came to the table and made it work, and the idea that a similar thing can’t happen for AI is, for me, a little catastrophic.”
Companies and researchers are already experimenting with ways to compensate creators
Wombo’s Ryan Khurana predicted a similar outcome. “Music has by far the most complex copyright rules because of the different types of licensing, the variety of rights-holders, and the various intermediaries involved,” he told The Verge. “Given the nuances [of the legal questions surrounding AI], I think the entire generative field will evolve into having a licensing regime similar to that of music.”
Other alternatives are also being trialed. Shutterstock, for example, says it plans to set up a fund to compensate individuals whose work it has sold to AI companies to train their models, while DeviantArt has created a metadata tag for images shared on the web that warns AI researchers not to scrape them (a sketch of how a scraper might honor such a tag follows below). At least one small social network, Cohost, has already adopted the tag across its site and says that if it finds researchers scraping its images regardless, it “won’t rule out legal action.” These approaches, though, have met with mixed responses from artistic communities. Can one-off license fees ever compensate for lost livelihood? And how does a no-scraping tag deployed now help artists whose work has already been used to train commercial AI systems?
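For the technically curious, here’s a rough Python sketch of how a well-behaved scraper might honor such a tag before collecting an image. The “noai” and “noimageai” directive names follow DeviantArt’s public description of the scheme, which can surface either as an X-Robots-Tag HTTP header or as a robots meta tag in the page; the rest of the logic is an illustrative assumption, not code any of these companies have published.

```python
import requests
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",")}

def allows_ai_training(url: str) -> bool:
    """Return False if the page opts out via a header or meta directive."""
    opt_out = {"noai", "noimageai"}
    resp = requests.get(url, timeout=10)
    header = (resp.headers.get("X-Robots-Tag") or "").lower()
    if opt_out & {d.strip() for d in header.split(",")}:
        return False
    parser = RobotsMetaParser()
    parser.feed(resp.text)
    return not (opt_out & parser.directives)
```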
For many creators, it seems the damage has already been done. But AI startups are at least suggesting new approaches for the future. One obvious step forward is for AI researchers to simply create datasets where there is no possibility of copyright infringement — either because the material has been properly licensed or because it was created for the specific purpose of AI training. Startup Hugging Face, for example, has created “The Stack,” a dataset for training AI designed specifically to avoid accusations of copyright infringement. It includes only code with the most permissive open-source licenses and offers developers an easy way to remove their data on request. Its creators say their model could be used throughout the industry.
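As a concrete, hedged illustration of what opting into such a dataset looks like in practice, the sketch below streams a few samples of The Stack from Hugging Face. The dataset ID and per-language folder layout follow the project’s documentation; the exact field names are assumptions worth verifying.

```python
from itertools import islice
from datasets import load_dataset

# Streaming avoids downloading the multi-terabyte corpus; access may also
# require accepting the dataset's terms on Hugging Face first.
the_stack = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",  # one per-language subset of the corpus
    split="train",
    streaming=True,
)

for example in islice(the_stack, 3):
    print(example["content"][:200])  # first 200 characters of a source file
```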
“The Stack’s approach can absolutely be adapted to other media,” Yacine Jernite, Machine Learning & Society lead at Hugging Face, told The Verge. “It is an important first step in exploring the wide range of mechanisms that exist for consent — mechanisms that work at their best when they take the rules of the platform that the AI training data was extracted from into account.” Jernite says Hugging Face wants to help create a “fundamental shift” in how creators are treated by AI researchers. But so far, the company’s approach remains a rarity.
What happens next?
Regardless of where we land on these legal questions, the various actors in the generative AI field are already gearing up for… something. The companies making millions from this tech are entrenching themselves: repeatedly declaring that everything they’re doing is legal (while presumably hoping no one actually challenges this claim). On the other side of no man’s land, copyright holders are staking out their own tentative positions without quite committing themselves to action. Getty Images recently banned AI content because of the potential legal risk to customers (“I don’t think it’s responsible. I think it could be illegal,” CEO Craig Peters told The Verge last month), while music industry trade group the RIAA declared that AI-powered music mixers and extractors infringe its members’ copyright (though it didn’t go so far as to launch any actual legal challenges).
The first shot in the AI copyright wars has already been fired, though, with the launch last week of a proposed class action lawsuit against Microsoft, GitHub, and OpenAI. The case accuses the three companies of knowingly reproducing open-source code through their AI coding assistant, Copilot, without the required licenses. Speaking to The Verge last week, the lawyers behind the suit said it could set a precedent for the entire generative AI field (though other experts disputed this, saying any copyright challenges involving code would likely be separate from those involving content like art and music).
Guadamuz and Baio, meanwhile, both say they’re surprised there haven’t been more legal challenges yet. “Honestly, I am flabbergasted,” says Guadamuz. “But I think that’s in part because these industries are afraid of being the first one [to sue] and losing a decision. Once someone breaks cover, though, I think the lawsuits are going to start flying left and right.”
Baio suggested one difficulty is that many people most affected by this technology — artists and the like — are simply not in a good position to launch legal challenges. “They don’t have the resources,” he says. “This sort of litigation is very expensive and time-consuming, and you’re only going to do it if you know you’re going to win. This is why I’ve thought for some time that the first lawsuits around AI art will be from stock image sites. They seem poised to lose the most from this technology, they can clearly prove that a large amount of their corpus was used to train these models, and they have the funding to take it to court.”
Guadamuz agrees. “Everyone knows how expensive it’s going to be,” he says. “Whoever sues will get a decision in the lower courts, then they will appeal, then they will appeal again, and eventually, it could go all the way to the Supreme Court.”