The AI-Copyright Wars have begun
Willy Nilly Datasets are mighty bad business.
I'm still in the cold harsh fangs of a non-COVID flu virus that has been swamping my body with a trillion peta-liters of slime behind my face since Christmas, flowing all over my body, flooding my bedroom, whooshing by behind my eyes in yellow-green smears of yuck, worming around my lungs, withering itself into every inch of my breathing flesh, only to be coughed up in grainy bits of fleshy crumbs as some extra yummy breakfast add-ons.
For me, the new year starts with a slippery wet nose full of yellow, to the sick slimy delight of the 12-year-old who lives inside my still-beating heart. As you can see, I'm also having some fun with it.
So, while I was fighting those slime wars behind my face, the New York Times sued OpenAI and Microsoft over use of copyrighted work, which is not very interesting as a copyright lawsuit per se (we have plenty of those, and it's not even the latest, albeit it may be the strongest case yet), but mostly because of its stark contrast with OpenAI's partnership with Springer. Only half a year ago, large publishers were reportedly seeking to form a coalition to address the impact of AI on journalism, and both the NYT and Springer were involved. Obviously that coalition broke apart, and two strategies emerged: "sue into oblivion" and "cooperate for a buck".
That "buck" seems to top out at $5 million a year, which is a hint at why the NYT chose to sue so hard that they're "asking for the destruction of 'All GPT and other LLM models that contain NYT data'", which was first reported by Ars Technica back in June: Potential NYT lawsuit could force OpenAI to wipe ChatGPT and start over. Good times.
In a larger framing, Alberto Romero writes about how The NYT vs OpenAI Is Not Just a Legal Battle, but a battle between morals and progress. I think morality vs progress is the wrong dichotomy here. Historically, it's about innovation vs morals, as shown by MIT economists Daron Acemoğlu and Simon Johnson in their book Power and Progress, and only through that struggle, in which innovation has to be reined in by morals, do we create progress for the whole of society.
You see, morals by and large are societal cooperation maximizers: the consensus on what you can and cannot do, what is taboo, the location and size of the Overton window, and so forth. These moral values are fought over so harshly exactly because they determine how well participants in a society can and will cooperate. Mostly, morals make sense; sometimes they get outdated, fought over, and replaced. They are never canonized and are in constant flux.
Any groundbreaking innovation with society-wide impact disturbs this process, and not only the battle for and against AI-regulation, but also the impact of new forms of digital mass communication and publishing systems (aka Social Media), is testament to this.
Progress (applied innovation for the betterment of the whole of society) happens when that conflict is settled: new morals are developed and accepted in a new consensus, or the innovation gets integrated into the morally accepted ways of doing things.
Progress is something that all of us can get behind, from technolibertarian accelerationists to left-wing tech critics. The fight is about how we get there, and I'm not willing to let go of "old" moral values like freedom and ownership: I decide whether I want my work to be part of the AI-machine and provide training data for Microsoft, and if so, I set the price. This sort of freedom in business relations might be an outdated moral value for e/acc libertarian types, but not for me.
And it's of course not just LLMs and ChatGPT, but all the image generators out there, which already have their own bulk of lawsuits going on. As Gary Marcus and Reid Southen write in their IEEE Spectrum article about how Generative AI Has a Visual Plagiarism Problem, image generators effortlessly produce near-identical copies of copyrighted works, sometimes even without prompting them with IP-including terms. A "cartoon sponge" will give you protected Spongebobs at a very high rate.
The funniest thing about all of this is that the discourse hasn't even started to think about trademarks yet, and I simply repeat myself from one year ago: "All Stable Diffusion-Checkpoint-files 'know' what Batman looks like, because Stability and LAION used tons of Batman-images, without paying a dime to Warner, and sometimes it puts out that data unchanged. That's all the lawyers and their plaintiffs need." This is not just true for Batman, but for all the other trademarked and copyrighted stuff any commercial AI-system is trained on, from Super Mario owned by Nintendo to the Nike Swoosh.
And the legal problems of GenAI don't stop at copyright and trademarks, of course. Just before I went into holiday-slime-mode, "the LAION-5B machine learning dataset used by Stable Diffusion and other major AI products [had to be] removed by the organization that created it after a Stanford study found that it contained 3,226 suspected instances of child sexual abuse material".
If that's not horrible enough for you, here's the always brilliant Eryk Salvaggio on the Original Sin of Generative AI:
In my own analysis of LAION’s content — prior to the dataset’s removal — I was troubled by its inclusion of images of historical atrocities, which are abstracted into unrelated categories. Nazi soldiers are in the training data for “hero,” for example. I refer to these assemblages as “trauma collages,” noting that a single generated image could incorporate patterns learned from images of the Nazi Wehrmacht on vacation, portraits of people killed in the Holocaust, and prisoners tortured at Abu Ghraib, alongside images of scenes from the Archie Comics reboot “Riverdale” and pop culture iconography.
We have little understanding of how these images trickle into the display of these “beautiful” illustrations and images, but there seems to be a failure of cultural reckoning with the fact that these are rotten ingredients.
It doesn't help at all that the org behind LAION was caught over the holidays openly trying to launder their datasets by creating synthetic data from IP-material, supposedly to make it unprotected by law, only to take that synthetic dataset down again, possibly because they realized that IP laws don't work that way. Or that a database of artists used to train Midjourney leaked to the public (here's a PDF). All of this looks mighty shady to me, to say the least, and this is what you get when you just play willy-nilly with the one thing that powers your AI-revolution: data.
Lawmakers just introduced a new bill which would require AI companies to disclose copyrighted training data, and if this or similar legislation goes through, then MS/OAI is in the uncomfortable position of having already launched commercial products, used by millions of users and generating millions of dollars in revenue, that are, under that law, illegal. Move fast, break yourself.
People keep comparing the situation to Napster (disclaimer: I worked for the then-legal streaming service Napster in the mid-2000s), and they're kind of right (centralized servers full of copyrighted stuff get sued into nonexistence) and kind of wrong (decentralized open-source technology will allow copyright infringement for tech-savvy users for a very long time, and this will arguably never go away). The difference here is that a tech-savvy user torrenting Rebel Moon is not the same as Microsoft building an office tool on unlicensed works protected by law. Microsoft (and Midjourney and Stability and others) are gonna learn this the hard way, it seems, and they are in for a hell of a ride. Yes, I'm having some mighty fun watching all of this.
I keep saying here that Generative AI corporations with their millions of dollars in venture capital backing are sitting on a timebomb, and I said early on that LLMs are basically Large Language Warez.
That timebomb just got its biggest explosive component from the Times (heh), and it's ticking in broad daylight now for everyone to see.
Tick. Tick. Tick.