After Ignoring Copyright Concerns For So Long, AI Companies Now Face A Reckoning
Credit: DALL-E
How will artists and content creators make a living in an AI-driven information ecosystem? It’s a question AI companies didn't even want to think about until they were forced to.
That appears to be the key takeaway of a 3,000-word story The New York Times published over the weekend on how Big Tech giants deliberately threw copyright concerns to the wind as they created the generative systems that can conjure up content at the whim of a prompt. According to the report, executives at OpenAI, Google, and Meta all knew they were likely violating copyright, and even their own policies, as they harvested vast datasets without permission, but they did it anyway.
While that revelation is surprising to precisely no one, it helps to see the evidence laid out so clearly. The tech giants had motive (the existential need for ever-more data to train their models), the means (software to interpret and ingest media), and the opportunity (the data being publicly available on the internet).
A big chunk of the piece zeroes in on OpenAI's harvesting of YouTube videos, the content of which is more difficult for large language models (LLMs) to consume than a text-based webpage. OpenAI needed to develop custom software and processes to hoover up over 1 million hours of videos to train GPT-4, according to the Times. The article is essentially a cartoon smoking gun you could put in a thought balloon over the head of OpenAI CTO Mira Murati during her much-maligned interview with The Wall Street Journal's Joanna Stern.
Similarly, the investigation reveals Google sneakily made adjustments to its own terms of service to ensure it could train its AI models on user data (at least some of it). And Meta, too, looked for any path to AI that allowed the company to ignore copyright concerns around the data it needed to make its models competitive with OpenAI. Negotiations with artists and writers would take too long? "Don't even try" was the attitude, the report claims.
There used to be a saying in the digital economy: data is the new oil. The analogy is obvious: just as oil empowered the robber barons of the industrial revolution, data — especially user-generated data — was fuel for the algorithm-powered internet of Web 2.0. That, in turn, inspired the idea of Web3, where every netizen has the power to erect a data derrick in their own backyard.
But what happens when Big Tech siphons all the data away before you even start digging?
Public = Free?
Over the past year, how AI companies have harvested copyrighted data has come under increased scrutiny, particularly at OpenAI since it's the leader in the field. Between the lawsuits, the interviews, and investigations like the Times', it's come out just how contradictory the company's position is: It believes anything on the publicly available internet is fair use, but it also says it will honor "do not train." It doesn't think it needs to pay for publishers' content, yet it's paying for publishers' content. It says the value of any individual data source is negligible, but it obviously can't get enough of them.
As ever, the heart of the issue is how content creators are fairly compensated. When working properly, AI creates wholly transformative content from its training data, but it couldn't do that without the training data. What a healthy AI-mediated information ecosystem looks like, one that respects and incentivizes content creators, is still being figured out.
If the Times article is any guide, Big Tech's vision is a data economy where publicly available content is essentially free. While that's probably not the best choice for a thriving media industry, that may be what the future looks like if it's shaped entirely on their terms.
Outside pressure could change that, and it's building. The New York Times famously sued OpenAI back in December, and it was joined by a few other publishers in February. But for the most part, publishing hasn't been able to muster collective action to bargain with the AI companies.
Why? Medium CEO Tony Stubblebine gave a first-hand account on a recent People vs. Algorithms podcast. Medium apparently tested the waters on a publishers coalition with other content platforms, including Wikipedia, to negotiate a far-reaching licensing deal with OpenAI et al.
It didn't get off the ground.
"We're blocking [our content from] these companies because they screwed up," Stubblebine said. "Just exchange of value — no consent, no credit, no compensation. And then I used that to go to all the other platforms and say, 'Look, we're going to fail unless we form a coalition.' And every single one of them had their own plan."
Stubblebine pointed to mismatched incentives as the culprit that killed the idea of cooperation: Wikipedia wants its information available everywhere, even via AI summarization. Reddit negotiated a deal with OpenAI on its own. An executive at another, unspecified company told Stubblebine they wanted to get in the AI business themselves.
I would add a key factor to that list: The Times lawsuit actually disincentivizes other publishers from suing OpenAI. With the Times taking on the fight, they can just grab some popcorn and wait for a winner. If the Times wins, they benefit from the precedent. And if OpenAI wins, they won't be made to sit in the corner when they come to the bargaining table.
In the end, AI companies clearly believe they have the right to harvest data from the entire internet, but are afraid to say so openly. Media companies want to be compensated for their data, but they're terrified of being shut out by Big Tech.
Cowardice is easy. It's much harder to work together to chart a way forward in good faith that enables content creators to thrive while preserving innovation. We at The Media Copilot are doing our part by creating new rules for journalists and creatives in an AI world via our upcoming manifesto. But given the feelings of existential threat on both sides, the clarity that's sorely needed likely won't come from the industry, but from a judge and jury.