Researchers found child abuse material in the largest AI image generation dataset
source link: https://www.engadget.com/researchers-found-child-abuse-material-in-the-largest-ai-image-generation-dataset-154006002.html
The non-profit behind the LAION-5B dataset has taken it down 'in an abundance of caution.'
Researchers from the Stanford Internet Observatory say that a dataset used to train AI image generation tools contains at least 1,008 validated instances of child sexual abuse material. The Stanford researchers note that the presence of CSAM in the dataset could allow AI models that were trained on the data to generate new and even realistic instances of CSAM.
LAION, the non-profit that created the dataset, told 404 Media that it "has a zero tolerance policy for illegal content and in an abundance of caution, we are temporarily taking down the LAION datasets to ensure they are safe before republishing them." The organization added that, before publishing its datasets in the first place, it created filters to detect and remove illegal content from them. However, 404 points out that LAION leaders have been aware since at least 2021 that there was a possibility of their systems picking up CSAM as they vacuumed up billions of images from the internet.
According to previous reports, the LAION-5B dataset in question contains "millions of images of pornography, violence, child nudity, racist memes, hate symbols, copyrighted art and works scraped from private company websites." Overall, it includes more than 5 billion images and associated descriptive captions. LAION founder Christoph Schuhmann said earlier this year that while he was not aware of any CSAM in the dataset, he hadn't examined the data in great depth.
It's illegal for most institutions in the US to view CSAM for verification purposes. As such, the Stanford researchers used several techniques to look for potential CSAM without viewing it directly. According to their paper, they employed "perceptual hash-based detection, cryptographic hash-based detection, and nearest-neighbors analysis leveraging the image embeddings in the dataset itself." They found 3,226 entries that contained suspected CSAM. Many of those images were subsequently confirmed as CSAM through third-party checks, including Microsoft's PhotoDNA tool and the Canadian Centre for Child Protection.
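For context, here is a minimal sketch of the two hash-based approaches the paper names, using the open-source imagehash library in place of proprietary tools like PhotoDNA. The hash sets, file path, and distance threshold below are hypothetical placeholders, not the researchers' actual pipeline; real reference hash sets of known material are only available to vetted organizations.

```python
# Hypothetical sketch of hash-based matching against a reference list of
# known-image hashes. The reference sets here are empty placeholders.
import hashlib

import imagehash          # pip install imagehash
from PIL import Image

KNOWN_SHA256: set[str] = set()                    # hypothetical digest list
KNOWN_PHASHES: list[imagehash.ImageHash] = []     # hypothetical perceptual hashes

def cryptographic_match(path: str) -> bool:
    """Exact matching: flags only byte-identical copies of a known image."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() in KNOWN_SHA256

def perceptual_match(path: str, max_distance: int = 8) -> bool:
    """Fuzzy matching: also flags resized or re-encoded variants, since
    perceptual hashes of visually similar images differ in only a few bits."""
    h = imagehash.phash(Image.open(path))
    return any(h - known <= max_distance for known in KNOWN_PHASHES)
```

The third technique, nearest-neighbors analysis, works similarly in spirit but compares the CLIP image embeddings already distributed with the dataset rather than hashes, letting it surface previously unknown material that sits close to confirmed examples.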
Stability AI founder Emad Mostaque trained Stable Diffusion using a subset of LAION-5B data. The first research version of Google's Imagen text-to-image model was trained on LAION-400M, but that version was never released; Google says that none of the subsequent iterations of Imagen use any LAION datasets. A Stability AI spokesperson told Bloomberg that it prohibits the use of its text-to-image systems for illegal purposes, such as creating or editing CSAM. "This report focuses on the LAION-5B dataset as a whole," the spokesperson said. "Stability AI models were trained on a filtered subset of that dataset. In addition, we fine-tuned these models to mitigate residual behaviors."
Stable Diffusion 2 (a more recent version of Stability AI's image generation tool) was trained on data that substantially filtered out 'unsafe' materials from the dataset. That, Bloomberg notes, makes it more difficult for users to generate explicit images. However, it's claimed that Stable Diffusion 1.5, which is still available on the internet, does not have the same protections. "Models based on Stable Diffusion 1.5 that have not had safety measures applied to them should be deprecated and distribution ceased where feasible," the Stanford paper's authors wrote.
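As a rough illustration of what that kind of dataset-level filtering looks like, the sketch below drops rows whose predicted-unsafe probability exceeds a cutoff. LAION's public metadata does ship a per-image "punsafe" score, but the metadata file name and the 0.1 threshold here are assumptions for illustration, not Stability AI's actual training configuration.

```python
# Hypothetical sketch: pre-training safety filtering by thresholding a
# per-image "unsafe" probability column in the dataset's metadata.
import pandas as pd

PUNSAFE_CUTOFF = 0.1  # assumed threshold: lower values filter more aggressively

df = pd.read_parquet("laion_subset_metadata.parquet")  # hypothetical path
kept = df[df["punsafe"] < PUNSAFE_CUTOFF]
print(f"kept {len(kept):,} of {len(df):,} rows after safety filtering")
```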
Correction, 4:30PM ET: This story originally stated that Google's Imagen tool used a subset of LAION-5B data. The story has been updated to note that Imagen used LAION-400M in its first research version, but hasn't used any LAION data since then. We apologize for the error.