Beth Montague-Hellen, in an interesting Insights piece:
When discussing AI, much time has been spent on identifying how the technology can be of use, where it might be dangerous and how we may want to restrict or enable it. There has been considerable discussion about restricting the use of copyrighted material within training datasets and how the use of this material may be breaking existing copyright laws. However, little time has been spent discussing the more positive issue of which materials we would like generative AI to have access to. This article proposes that scholarly communications professionals, including librarians and publishers, should be pushing for the inclusion, rather than the exclusion, of scholarly content in AI training datasets.
It’s fascinating to watch the open-everything ethos of the OA movement contend with the AI training question. Here Montague-Hellen applies the movement’s logic to AI: of course we should make it easier for models to train on published scholarship. There is, at the same time, lots of unease about big tech profiteering off scholars’ work, a concern I share. Montague-Hellen’s main rationale is to save the big models from their hallucination problem, on the grounds that scholarship is vetted, unlike the cesspool of the internet they’ve been trained on. But do we want to help Google, OpenAI, Microsoft, Apple and the rest to take still further control of the knowledge ecosystem? It’s the absence of any serious engagement with big tech, and of any mention of the scholarly publishing oligopoly (which thinks it’s sitting on a gold mine), that makes a piece like this come off as a bit naive.
Either way, the tension between open-everything and stop-profiting-off-consentless-extraction is coming more and more into focus. And there are echoes, with this AI-training question, of the old CC BY vs. CC BY-NC wars in open licensing, though in the AI case it’s not clear that even CC BY permits training. The question in common is whether by “open” we mean “open to profit.”