The privacy of Australian children is being violated on a large scale by the artificial intelligence (AI) industry, with personal images, names, locations and ages being used to train some of the world’s leading AI models.
Researchers from Human Rights Watch (HRW) discovered the images in a prominent dataset, including a newborn baby still connected to their mother by an umbilical cord, preschoolers playing musical instruments, and girls in swimsuits at a school sports carnival.
“Ordinary moments of childhood were captured and scraped and put into this dataset,” said Hye Jung Han, a children’s rights and technology researcher at HRW.
“It’s really quite scary and astonishing.”
The images were found in LAION-5B, a free online dataset of 5.85 billion images, used to train a number of publicly available AI generators that produce hyper-realistic images.
Researchers were investigating the AI supply chain following an incident at Bacchus Marsh Grammar School, where deepfake nude images of female students were allegedly produced by a peer, using AI.
HRW examined a sample of 5,850 images from the collection, about one in every million of its 5.85 billion, covering a broad range of subject matter — from potatoes to planets to people — and found 190 Australian children, from every state and territory.
“From the sample that I looked at, children seem to be over-represented in this dataset, which is indeed quite strange,” Ms Han said.
“That might give us a clue [as] to how these AI models are able to then produce extremely realistic images of children.”
The images were gathered using a common automated tool called a “web crawler”, which is programmed to scour the internet for certain content.
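The sketch below is a minimal, hypothetical illustration of how such a crawler pairs images with their captions. It is not LAION's actual pipeline, which operated at a vastly larger scale; the URL shown is a placeholder, and the class and function names are invented for illustration.

```python
# A minimal, hypothetical sketch of a web crawler that pairs image links
# with their alt-text captions. This is NOT LAION's actual pipeline; it
# only illustrates the general technique described in the article.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ImageCollector(HTMLParser):
    """Collects (image URL, alt-text caption) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            src = attrs.get("src")
            # The alt text becomes the image's "caption" in a scraped
            # dataset, which is how names, places and ages can end up
            # attached to photos.
            caption = attrs.get("alt") or ""
            if src:
                self.pairs.append((urljoin(self.base_url, src), caption))


def crawl_page(url):
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    collector = ImageCollector(url)
    collector.feed(html)
    return collector.pairs


if __name__ == "__main__":
    # example.com is a placeholder; a real crawler would follow links
    # across many sites, accumulating billions of image-caption pairs.
    for img_url, caption in crawl_page("https://example.com"):
        print(caption, "->", img_url)
```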
HRW believes the images have been taken from popular photo and video-sharing sites including YouTube, Flickr, and blogging platforms, as well as sites many would presume were private.
“Other photos were uploaded [to their own websites] by schools, or by photographers hired by families,” said Ms Han, adding that the images were not easily findable via search, or on public versions of the websites they came from.
Some images also came with highly specific captions, often including children’s full names, where they lived, hospitals they’d attended, and their ages when the photo was taken.
The revelations are a wake-up call for the industry, according to Professor Simon Lucey, Director of the Australian Institute for Machine Learning at the University of Adelaide.
He says AI is in a “wild west” phase.
“If there’s a dataset out there, people are going to use it,” he said.
According to the experts, once an AI model has learned from an image, it cannot simply be made to forget it.
“The AI model has already learned that child’s features and will use it in ways that nobody can really foresee in the future,” Ms Han said.
Additionally, there’s a slim but real risk that AI image models will reproduce elements of their training data — for example, a child’s face.
“There has been quite a lot of research going into this … and it seems to be that there is some leakage in these models,” Professor Lucey said.
There are no known reports of actual children’s images being reproduced inadvertently, but Professor Lucey said the capability was there.
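One way researchers probe for this kind of leakage is to generate many images and flag any whose embedding is nearly identical to that of a training image, a sign the model may have memorised it. The sketch below illustrates that comparison only in outline; the random vectors standing in for embeddings, the function names and the 0.95 threshold are illustrative assumptions, not taken from any particular study.

```python
# A rough, illustrative memorisation check: flag generated images whose
# embeddings sit suspiciously close to a training image's embedding.
# Random vectors stand in for real image embeddings, and the 0.95
# threshold is an arbitrary assumption for demonstration.
import numpy as np


def cosine_similarity(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


def flag_near_duplicates(generated, training, threshold=0.95):
    """Return (i, j) pairs where generated image i is nearly identical
    to training image j under cosine similarity."""
    flagged = []
    for i, g in enumerate(generated):
        for j, t in enumerate(training):
            if cosine_similarity(g, t) >= threshold:
                flagged.append((i, j))
    return flagged


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    generated = rng.normal(size=(5, 512))   # stand-ins for generated images
    training = rng.normal(size=(10, 512))   # stand-ins for training images
    # Unrelated random vectors score near zero, so this prints an empty
    # list; a genuine memorised image would score close to 1.
    print(flag_near_duplicates(generated, training))
```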
He believes there are certain models that should be switched off completely.
“Where you can’t reliably point to where the data has come from, I think that’s a really appropriate thing to do,” he said.
He emphasised though that there were plenty of safe and responsible ways to train AI.
“There’s so many examples of AI being used for good, whether it’s about discovering new medicines [or] things that are going to help with climate change.
“I’d hate to see research in AI stopped altogether,” he said.
The dataset LAION-5B has been used to train many of the world’s leading AI models, including Stable Diffusion and Midjourney, which are used by millions of people globally.
It was created by a German not-for-profit organisation called LAION.
In a statement to the ABC, a LAION spokesperson said its datasets “are just a collection of links to images available on [the] public internet”.
They said, “the most effective way to increase safety is to remove private children’s information from [the] public internet”.
In 2023, researchers at Stanford found hundreds of known images of child sexual abuse material (CSAM) in the LAION-5B dataset.
LAION took its dataset offline and sought to remove the material, before making the collection publicly available again.
LAION’s spokesperson told the ABC, “it’s impossible to make conclusions based on [the] tiny amounts of data analysed by HRW”.
The organisation has taken steps to remove the images discovered by HRW, even though they’ve already been used to train various AI generators.
“We can confirm that we remove all the private children data [sic] reported by HRW.”
HRW didn’t find any new instances of child sexual abuse imagery in the sample it examined, but said the inclusion of children’s images was a risk in its own right.
“The AI model is able to combine what it learns from those kinds of [sexualised] images, and… images of real Australian kids,” Ms Han said.
“[It] essentially learns from both of those concepts … to be able to then produce hyper-realistic images of Australian kids, in sexualised poses.”
While the use of children’s data to train AI might be concerning, experts say the legalities are murky.
“There are very, very few instances where a breach of privacy leads to regulatory action,” said Professor Edward Santow, a former Human Rights Commissioner and current Director at the Human Technology Institute.
It’s also “incredibly difficult” for private citizens who might want to take civil action, he said.
“That’s one of the many reasons why we need to modernise Australia’s Privacy Act,” he said.
The federal government is expected to unveil proposed changes to the Act next month, including specific protections for children online.
Professor Santow said it was a long-overdue update for a law that was mostly written “before the internet was created”.
“We have a moment now where we need governments to really stand up for the community … because pretty soon in the next year or two, that moment will have passed,” he said.
“These [AI] models will all have been created and there’ll just be no easy way of unpicking what has gone wrong.”
HRW is also calling for urgent law reform.
“These things are not set in stone… it is actually possible to shape the trajectory of this technology now,” Ms Han said.