The launch of ChatGPT and other deep learning tools quickly led to a flurry of lawsuits against model developers. Legal theories vary, but most are rooted in copyright: plaintiffs argue that using their works to train the models was infringement; developers counter that their training is fair use. Meanwhile, developers are making as many licensing deals as possible to stave off future litigation, and it’s a sound bet that the existing litigation is an elaborate scramble for leverage in settlement negotiations.
These cases can end in one of three ways: rightsholders win, everybody settles, or developers win. As we’ve noted before, we think the developers have the better argument. But that’s not the only reason they should win these cases: while creators have a legitimate gripe, expanding copyright won’t protect jobs from automation. A win for rightsholders, or even a settlement, could also lead to significant harm, especially if it undermines fair use protections for research or artistic protections for creators. In this post and a follow-up, we’ll explain why.
State of Play
First, we need some context, so here’s the state of play:
DMCA Claims
Multiple courts have dismissed claims under Section 1202(b) of the Digital Millennium Copyright Act, stemming from allegations that developers removed or altered attribution information during the training process. In Raw Story Media v. OpenAI, Inc., the Southern District of New York dismissed these claims because the plaintiffs had not “plausibly alleged” that training ChatGPT on their works had actually harmed them, and there was no “substantial risk” that ChatGPT would output their news articles. Because ChatGPT was trained on a “massive amount of information from unnumerable sources on almost any given subject,” the court reasoned, “the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs’ articles seems remote.” Courts granted motions to dismiss similar DMCA claims in Andersen v. Stability AI, Ltd., The Intercept Media, Inc. v. OpenAI, Inc., Kadrey v. Meta Platforms, Inc., and Tremblay v. OpenAI.
Another such case, Doe v. GitHub, Inc., will soon be argued in the Ninth Circuit.
Copyright Infringement Claims
Rightsholders also assert ordinary copyright infringement claims, and the initial rulings are a mixed bag. In Kadrey v. Meta Platforms, Inc., for example, the court dismissed “nonsensical” claims that Meta’s LLaMA models are themselves infringing derivative works. In Andersen v. Stability AI Ltd., however, the court held that copyright claims based on the assumption that the plaintiffs’ works were included in a training data set could go forward, where using the plaintiffs’ names as prompts generated images “similar to plaintiffs’ artistic works.” The court also held, for similar reasons, that the plaintiffs plausibly alleged the model was designed to “promote infringement.”
It’s early in the case—the court was merely deciding if the plaintiffs had alleged enough to justify further proceedings—but it’s a dangerous precedent. Crucially, copyright protection extends only to the actual expression of the author—the underlying facts and ideas in a creative work are not themselves protected. That means that, while a model cannot output an identical or near-identical copy of a training image without running afoul of copyright, it is free to generate stylistically “similar” images. Training alone is insufficient to give rise to a claim of infringement, and the court impermissibly conflated permissible “similar” outputs with the copying of protectable expression.
Fair Use
In most of the AI cases, courts have yet to consider—let alone decide—whether fair use applies. In one unusual case, however, the judge has flip-flopped: he first found that the defendant’s use was fair, then changed his mind. That case, Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence, Inc., concerns legal research technology. Thomson Reuters provides search tools for locating relevant legal opinions and prepares annotations describing those opinions’ holdings. Ross Intelligence hired lawyers to review those annotations and rewrite them in their own words; the rewritten versions were then used to train Ross’s search tool, which ultimately provides users with relevant legal opinions in response to their queries. Originally, the court got it right, holding that if the AI developer used copyrighted works only “as a step in the process of trying to develop a ‘wholly new,’ albeit competing, product,” that’s “transformative intermediate copying,” i.e., fair use.
After reconsidering, however, the judge changed his mind in several respects, essentially disagreeing with prior case law on search engines. We think it’s unlikely that an appeals court would uphold this divergence from precedent. But if it did, it would pose legal problems for AI developers—and for anyone building search tools.
Copyright law favors the creation of new technologies for learning and locating information, even when developing such a tool requires copying books and web pages in order to index them. Here, the search tool provides links to legal opinions; it does not present users with any of Thomson Reuters’ original material. The tool is concerned with noncopyrightable legal holdings and principles, not with supplanting any creative expression embodied in the annotations Thomson Reuters prepares.
Thomson Reuters has often pushed the limits of copyright in an attempt to profit off of the public’s need to access and refer to the law, for instance by claiming a proprietary interest in its page numbering of legal opinions. Unfortunately, the judge in this case enabled them to do so in a new way. We hope the appeals court reverses the decision.
The Side Deals
While all of this is going on, developers that can afford it—OpenAI, Google, and other tech behemoths—have inked multimillion-dollar licensing deals with Reddit, the Wall Street Journal, and myriad other corporate copyright owners. There’s suddenly a $2.5 billion licensing market for training data—even though the use of that data is almost certainly fair use.
What’s Missing
This litigation is getting plenty of attention, and it should: the stakes are high. Unfortunately, the real stakes are getting lost. These cases are not just about who will reap the greatest financial benefit from generative AI. Their outcomes will decide whether a small group of corporations that can afford big licensing fees will determine the future of AI for all of us. More on that tomorrow.
This post is part of our AI and Copyright series. Check out our other post in this series.