Roy Kaufman, CEO of the for-profit Copyright Clearance Center (CCC), posted excerpts from the company’s response to the US Copyright Office’s call for comments on AI training data. The Scholarly Kitchen post is fascinating and worthy of notice. It’s hardly surprising that the CCC—the publishing industry’s main licensing and permissions broker—took a muscular stance:
There is certainly enough copyrightable material available under license to build reliable, workable, and trustworthy AI. Just because a developer wants to use “everything” does not mean it needs to do so, is entitled to do so, or has the right to do so. Nor should governments and courts twist or modify the law to accommodate them.
What’s more interesting is CCC’s frontal assault on the fair use rationale that big tech has used to justify its permission-free training-data hoovering. The key passages are a bit technical, filled with references to “transformer architecture” and “tokenizers.” Their plain purpose is to establish that language-model training requires large swaths of text:
It is important to understand the role of word embeddings when it comes to copyright because the embeddings are the representations (or encodings) of entire sentences, paragraphs, and even documents, in a high-dimensional vector space. It is through the embeddings that the AI system captures and stores the meaning and the relationships of the words from the natural language.
Such embeddings, CCC continues, are used in “practically every task” in generative AI. The message? OpenAI and the rest aren’t using mere snippets.
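For readers unfamiliar with the jargon, the embeddings idea CCC invokes can be shown in a toy sketch. The words and vector values below are invented for illustration; real models learn embeddings with hundreds or thousands of dimensions from enormous text corpora, which is exactly CCC’s point about scale:

```python
import math

# Hypothetical 4-dimensional embeddings for three words (values invented
# for illustration; real embeddings are learned from large corpora).
embeddings = {
    "king":  [0.9, 0.8, 0.1, 0.3],
    "queen": [0.8, 0.9, 0.1, 0.3],
    "apple": [0.1, 0.2, 0.9, 0.7],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means
    the vectors point in more nearly the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Related words land closer together in the vector space than
# unrelated ones, which is how "meaning" gets captured geometrically.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["apple"]))
```

The geometry is the takeaway: “king” sits nearer “queen” than “apple,” and those relationships only emerge when the model has digested a great deal of text.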
There’s lots more in the comments, which are (I predict) a preview of how the big commercial publishers will approach litigation and/or licensing negotiations with the Silicon Valley model builders.
Can the likes of OpenAI scrape up copyrighted content into their models, without permission or compensation? The tech companies think so; they’re fresh converts to fair-use maximalism, as revealed by their public comments filed with the US Copyright Office. The companies’ “overall message,” reported The Verge in a round-up, is that they “don’t think they should have to pay to train AI models on copyrighted work.”
It’s time to heat up the popcorn.