GitHub - alex000kim/nsfw_data_scraper: Collection of scripts to aggregate image...
source link: https://github.com/alex000kim/nsfw_data_scraper
NSFW Data Scraper
Note: use with caution - the dataset is noisy
Description
This is a set of scripts that allows for an automatic collection of tens of thousands of images for the following (loosely defined) categories to be later used for training an image classifier:
- porn - pornography images
- hentai - hentai images, but also includes pornographic drawings
- sexy - sexually explicit images, but not pornography. Think nude photos, playboy, bikini, etc.
- neutral - safe for work neutral images of everyday things and people
- drawings - safe for work drawings (including anime)
Here is what each script (located under the scripts directory) does:
- 1_get_urls_.sh - iterates through text files under scripts/source_urls, downloading URLs of images for each of the 5 categories above. The Ripme application performs all the heavy lifting. The source URLs are mostly links to various subreddits, but could be any website that Ripme supports. Note: I already ran this script for you, and its outputs are located in the raw_data directory. No need to rerun unless you edit files under scripts/source_urls.
- 2_download_from_urls_.sh - downloads the actual images for the URLs found in the text files in the raw_data directory.
- 3_optional_download_drawings_.sh - (optional) script that downloads SFW anime images from the Danbooru2018 database.
- 4_optional_download_neutral_.sh - (optional) script that downloads SFW neutral images from the Caltech256 dataset.
- 5_create_train_.sh - creates the data/train directory and copies all *.jpg and *.jpeg files into it from raw_data. Also removes corrupted images.
- 6_create_test_.sh - creates the data/test directory and moves N=2000 random files for each class from data/train to data/test (change this number inside the script if you need a different train/test split). Alternatively, you can run it multiple times; each run moves another N images for each class from data/train to data/test.
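The train/test split described for 6_create_test_.sh can be sketched in Python as follows. This is only an illustration of the idea (move N random files per class), not the script's actual implementation; the function name move_random_to_test and its parameters are hypothetical.

```python
import random
import shutil
from pathlib import Path


def move_random_to_test(train_dir, test_dir, n, seed=None):
    """Move up to n randomly chosen files per class from train_dir to test_dir.

    Hypothetical helper: mirrors the behaviour described for
    6_create_test_.sh, but is not taken from the repository.
    """
    rng = random.Random(seed)
    moved = {}
    # Each sub-directory of train_dir is treated as one class.
    for class_dir in sorted(p for p in Path(train_dir).iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        # Never ask for more files than the class actually has.
        chosen = rng.sample(files, min(n, len(files)))
        target = Path(test_dir) / class_dir.name
        target.mkdir(parents=True, exist_ok=True)
        for f in chosen:
            shutil.move(str(f), str(target / f.name))
        moved[class_dir.name] = len(chosen)
    return moved
```

Calling move_random_to_test("data/train", "data/test", 2000) a second time would move another batch of up to 2000 files per class, which matches the "run it multiple times" behaviour described above.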
Prerequisites
- Docker
How to collect data
$ docker build . -t docker_nsfw_data_scraper
Sending build context to Docker daemon 426.3MB
Step 1/3 : FROM ubuntu:18.04
 ---> 775349758637
Step 2/3 : RUN apt update && apt upgrade -y && apt install wget rsync imagemagick default-jre -y
 ---> Using cache
 ---> b2129908e7e2
Step 3/3 : ENTRYPOINT ["/bin/bash"]
 ---> Using cache
 ---> d32c5ae5235b
Successfully built d32c5ae5235b
Successfully tagged docker_nsfw_data_scraper:latest
$ # Next command might run for several hours. It is recommended to leave it overnight
$ docker run -v $(pwd):/root/nsfw_data_scraper docker_nsfw_data_scraper scripts/runall.sh
Getting images for class: neutral
...
...
$ ls data
test train
$ ls data/train/
drawings hentai neutral porn sexy
$ ls data/test/
drawings hentai neutral porn sexy
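Because the dataset is noisy and the run takes hours, it may be worth checking how many images ended up in each class once the run finishes. The sketch below assumes the data/train and data/test layout shown in the listing above; the helper name count_images is hypothetical and is not part of the repository.

```python
from pathlib import Path


def count_images(split_dir, extensions=(".jpg", ".jpeg")):
    """Return {class_name: number_of_image_files} for one split directory."""
    counts = {}
    for class_dir in sorted(p for p in Path(split_dir).iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in extensions
        )
    return counts


if __name__ == "__main__":
    # Assumed layout: data/train/<class>/ and data/test/<class>/
    for split in ("data/train", "data/test"):
        if Path(split).is_dir():
            print(split, count_images(split))
```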
How to train a CNN model
- Install fastai: conda install -c pytorch -c fastai fastai
- Run train_model.ipynb top to bottom
Results
I was able to train a CNN classifier to 91% accuracy with the following confusion matrix:
As expected, drawings and hentai are confused with each other more frequently than with other classes. The same holds for the porn and sexy categories.
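To show how a confusion matrix relates to the accuracy figure, here is a minimal sketch in plain Python. The labels below are made up purely for illustration and are not the author's results; the function names are hypothetical.

```python
CLASSES = ["drawings", "hentai", "neutral", "porn", "sexy"]


def confusion_matrix(y_true, y_pred, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Made-up labels for illustration only (not taken from the real model).
y_true = ["drawings", "drawings", "hentai", "hentai", "neutral",
          "porn", "porn", "sexy", "sexy", "sexy"]
y_pred = ["drawings", "hentai", "hentai", "hentai", "neutral",
          "porn", "sexy", "sexy", "sexy", "porn"]

matrix = confusion_matrix(y_true, y_pred, CLASSES)
for name, row in zip(CLASSES, matrix):
    print(f"{name:>8}: {row}")
print("accuracy:", accuracy(matrix))
```

Off-diagonal cells such as (drawings, hentai) or (porn, sexy) are where the confusions mentioned above would show up.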