Big tech firms trained their AI models on Australian content without permission, relying on data provided by an open-source AI institute that uses US fair use provisions to circumvent local copyright laws.
And the organisation behind the massive data scrape, EleutherAI, was in part funded by Australian tech unicorn Canva, according to TechCrunch. Canva declined to comment, and it is not listed as a sponsor on the EleutherAI website. However, a Reddit post from a year ago by EleutherAI’s executive director, Stella Biderman, confirmed the connection.
Apple, Anthropic, Nvidia, and Salesforce all trained their AIs on videos taken without permission by EleutherAI, which invokes US fair use in an attempt to sidestep copyright law in jurisdictions beyond the US.
The tech firms may not have been aware of the legalities around how the data was collected.
EleutherAI claims on its website to be a non-profit research institute focused on building large-scale artificial intelligence.
As well as Canva, it has raised funds from Hugging Face and Stability AI, former GitHub CEO Nat Friedman, and Lambda Labs.
Thousands of YouTube videos from Australian publishers, broadcasters, and the government have been used by Apple and Nvidia to train their AIs.
Anthropic also used the data, which was harvested from subtitles on 173,536 videos in breach of Google-owned YouTube’s usage policies.
ABC News, Nine’s 60 Minutes, multiple federal government departments, the military, major universities, charities and religious groups were targeted.
It’s a live issue for Australian media companies, which are still working through how best to respond.
Some, like News Corp, have struck deals with large language model companies like OpenAI, owner of ChatGPT. Under the terms of that deal, “OpenAI will receive access to current and archived content from News Corp’s major news and information publications, including The Wall Street Journal, Barron’s, MarketWatch, Investor’s Business Daily, FN, and New York Post; The Times, The Sunday Times and The Sun; The Australian, news.com.au, The Daily Telegraph, The Courier Mail, The Advertiser, and Herald Sun; and others. The partnership does not include access to content from any of News Corp’s other businesses.”
A spokesperson for Nine meanwhile told Mi3: “Nine is exploring a number of options to ensure we receive fair compensation for both the historical and ongoing use of our content to train Large Language Models.”
For its part, and as Mi3 reported recently, Big Tech has been rewriting the terms and conditions of its platforms to gift itself new rights over what it can do with other companies’ content.
Per an Mi3 report in June: “Tech vendors are rewriting platform usage rulebooks, adjusting terms and conditions – and privacy policies – to give themselves legal cover, and in some cases to gift themselves new rights over customer data and content.”
Hoovered up
An enormous training dataset of 489 million words was siphoned from the platform, and even the US Embassy in Australia and Google Australia were hit.
EleutherAI is part of the open-source community, a widespread but unregulated network of developers who share code, often without payment, to hasten tech advances.
Its contribution to this network is to suck in information from the open web and convert it into massive data sets that can be used to train AIs.
EleutherAI says the goal is to democratise AI by making data and tech available to the developer community to accelerate the adoption of emerging technology.
It calls its largest dataset The Pile. It contains 825GB of data, most of which was collected and made available under the US’s opaque fair use version of copyright law.
EleutherAI’s website says: “The Pile is comprised of 22 different text sources, ranging from original scrapes done for this project to text data made available by the data owners, to third-party scrapes available online.”
It was taken from multiple sources including:
Many of these sources are controversial and may breach copyright law in Australia and every other jurisdiction that does not recognise America’s unique fair use carve-out.
The Pile’s data collection practices have already provoked legal responses. It once contained 197,000 pirated books, until authors launched a court action and the Books3 dataset was removed.
It has also triggered a lawyer’s picnic downstream.
More than 150 authors are suing Nvidia, alleging the $3 trillion chipmaker used The Pile to train its AI. Meta has also acknowledged in court filings that it accessed the dataset.
Zoom in
But it’s the YouTube Subtitles dataset that’s now in focus.
YouTube’s terms of use explicitly ban the scraping of its videos. Yet EleutherAI founder Sid Black wrote on GitHub that he created the YouTube Subtitles dataset by using a script to download them.
He did not have the permission of content creators, and the code used remains freely available to download on the web, and the YouTube data remains in The Pile.
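The technique at the heart of the dataset is straightforward: pull a video’s subtitle track, then strip out the timing markup so only plain text remains for AI training. The sketch below illustrates that general conversion step on a WebVTT subtitle snippet; the function name and the sample text are illustrative assumptions, not EleutherAI’s actual code.

```python
import re

def vtt_to_text(vtt: str) -> str:
    """Strip WebVTT headers, cue timings, and inline tags, leaving plain text.

    Illustrative sketch only -- not EleutherAI's actual pipeline.
    """
    kept = []
    for line in vtt.splitlines():
        line = line.strip()
        # Skip the WEBVTT header, blank lines, and cue timing lines.
        if not line or line == "WEBVTT" or "-->" in line:
            continue
        # Drop inline markup such as <c> styling tags.
        kept.append(re.sub(r"<[^>]+>", "", line))
    return " ".join(kept)

sample = """WEBVTT

00:00:00.000 --> 00:00:02.500
<c>Welcome back to the channel.</c>

00:00:02.500 --> 00:00:05.000
Today we're talking about copyright."""

print(vtt_to_text(sample))
# -> Welcome back to the channel. Today we're talking about copyright.
```

Applied across hundreds of thousands of videos, this kind of conversion turns spoken-word content into exactly the sort of bulk text corpus a large language model is trained on.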
Wikipedia says The Pile has “become widely used to train other models”, including by Microsoft and Meta AI.
My trawl of The Pile unearthed video content taken from myriad Australian outlets – too many to list – however most prominent were:
A joint investigation by Wired and data-driven US publisher Proof News into the scraping practices included the hostile reactions of YouTube creators, including David Pakman, host of The David Pakman Show, a politics channel with two million subscribers.
“No one came to me and said we would like to use this,” he said.
If AI companies are paid, Pakman said, he should be compensated for the use of his data. He pointed out that some media companies have recently penned agreements to be paid for the use of their work to train AI.
“This is my livelihood, and I put time, resources, money, and staff time into creating this content,” Pakman said.
Proof News uncovered internal Apple research documents that confirmed it used Pile data for AI training.
It also found more documents confirming Nvidia, Salesforce, and Anthropic did too.
Proof News has given Mi3 permission to quote its article at length. The lead reporter on the story, Annie Gilbertson, shared a quote from Anthropic confirming it uses The Pile for its AI Claude, but downplaying the significance.
“The Pile includes a very small subset of YouTube subtitles,” the spokesperson claimed. “YouTube’s terms cover direct use of its platform, which is distinct from use of The Pile dataset.
“On the point about potential violations of YouTube’s terms of service, we’d have to refer you to The Pile authors,” the spokesperson told Proof News.
Salesforce also confirmed to Proof News that it used The Pile to build an AI model – Caiming Xiong, VP of AI research, said the dataset was “publicly available”.