Academic Journals v. OpenAI?

Pushback to LLMs mounts.

Three authors file lawsuits against OpenAI

Last month, comedian Sarah Silverman and novelists Mona Awad and Paul Tremblay sued OpenAI, the creator of ChatGPT, for copyright infringement. According to an AP report, Silverman grew suspicious of the AI chatbot because it could generate synopses of every part of her 2010 memoir, The Bedwetter, indicating that the chatbot had absorbed the text of the book itself rather than information about it floating freely online. Indeed, the lawsuit alleges that OpenAI trained its AI models on the book “without consent, without credit, and without compensation,” likely using a pirated copy. 

Awad and Tremblay filed a separate lawsuit against OpenAI for the same reason. Included in their complaint as exhibit B were several responses ChatGPT provided to prompts asking for detailed summaries of their books. The long summaries draw on many specific details about the characters, settings, and plot devices. The novelists similarly allege a violation of copyright law. A lawyer in the AP story opines that this lawsuit will likely face an uphill battle: back in 2016, the Supreme Court declined to hear a copyright challenge to Google’s book digitization project, which gives users “snippet” views of copyrighted materials, leaving intact a lower-court ruling that the project was fair use. 

These lawsuits bring attention to the black box inside of ChatGPT and other Large Language Models (LLMs). Programmers train LLMs using terabytes of data, but users don’t have access to—and likely couldn’t process effectively—the datasets within these black boxes. The AP story raises the plausible possibility that ChatGPT has simply absorbed so many reviews of Silverman’s book that it could synthesize them into granular, detailed summaries of the book without ever having access to the primary text. 

Will academic journals be next to sue OpenAI?

The black box is at the center of a possible pushback among academic journals. Times Higher Ed speculates (fittingly behind a paywall) that large academic journals may soon file their own lawsuits against AI companies for copyright infringement. As debates over open access scholarship have made clear, some academic journals fiercely guard their copyrighted scholarship. Indeed, journals such as Science and companies such as Elsevier have already made statements about limiting or refusing the access of public AI technologies to their content and data. 

Will such efforts be effective? As in the Silverman case, it is possible that LLMs will disregard copyright by accessing repositories of pirated scholarship. But it is equally possible that they wouldn’t need to, because they are capable of compiling enough data from news reports on scholarly findings, citations in open access scholarship, and abstracts to effectively recreate the copyrighted information on their own. Either way, a lawyer in the Times Higher Ed piece suggests that courts may well decide that chatbots don’t infringe copyright at all, because their written expressions of the information are entirely their own. 

Ultimately, the outcomes of the lawsuits from Silverman and Awad and Tremblay will likely inform whether academic journals and publishing companies proceed with legal action, should they find evidence of copyright infringement.

What does this mean for LLMs as research tools?

I call this quagmire to your attention not merely for the ethical or legal stakes of copyright infringement, though they are important. As we formulate how to use LLMs as academic research tools, we need to remain cognizant of the fact that we don’t have a firm grasp of the datasets used to train them. This opacity doesn’t just mean inaccuracies and hallucinations, but also gaps and biases. Should Science, for instance, successfully opt out of LLMs, then what other data will take the place of its scholarship in ChatGPT’s responses? Will it be the most appropriate or relevant research? What if it isn’t? Or consider the claims of women scholars and scholars of color that their research is under-cited in relevant publications. Will LLMs, relying on data such as citation counts, promote only top scholars and recreate problems in academic scholarship by de-prioritizing scholars who are relevant but under-cited due to bias? 

The future of generative AI is uncertain to say the least, and I imagine that few scholars and librarians are comfortable with the idea of relying on ChatGPT as a one-stop shop for their research needs. But these problems offer us an opportunity to begin thinking about how we want to position LLMs among our research tools and how we should teach students about them. Perhaps to avoid major problems down the road, we need to begin conceiving of them as we do Wikipedia: not a bad starting place, but only a starting place. 

🔥 Sign up for LibTech Insights (LTI) new post notifications and updates.

📅 Join us for a free webinar on AI prompt engineering for librarians.

✍️ Interested in contributing to LTI? Send an email to Deb V. at Choice with your topic idea.