scraper/datasets/javascript-libs-from-top-1mm-sites at main · get-set-fetch/scra...

 2 years ago
source link: https://github.com/get-set-fetch/scraper/tree/main/datasets/javascript-libs-from-top-1mm-sites

Javascript Libraries From Top 1 Million Sites

CSV files available as open access dataset

  • getsetfetch-dataset-javascript-libraries.csv.gz (146 MB)

    • Each row contains a page URL followed by script source URLs (absolute or relative) encountered in that page. Inline scripts have an "<inline>" value.
      ex: https://sitemaps.org/,"<inline>","/lang.js"
  • getsetfetch-dataset-javascript-libraries-frequency-count.csv.gz (214 KB)

    • Each row contains a partial script pathname followed by a frequency count. The pathname is split into fragments on "/" and expanded from right to left until the first non-generic fragment is found. If the full pathname contains only generic keywords (index, main, dist, etc.), the script hostname is added as well. Common suffixes like .min and .min.js are removed.
      ex: jquery/ui/core,62554
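
The key derivation described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the hostname/pathname parameters, the way the hostname is joined to the key, and the contents of GENERIC_FRAGMENTS are all assumptions. The README only names index, main and dist as generic; "core" and "ui" are inferred from the jquery/ui/core example.

```typescript
// Assumed set of generic fragments. Only index, main and dist come from the README;
// the rest are added for illustration.
const GENERIC_FRAGMENTS = new Set<string>(["index", "main", "dist", "core", "ui", "js"]);

// Remove a trailing ".js" and then a trailing ".min" from a file name.
function stripSuffixes(name: string): string {
  return name.replace(/\.js$/i, "").replace(/\.min$/i, "");
}

// Build a frequency-count key from a script's hostname and pathname.
function frequencyKey(hostname: string, pathname: string): string {
  const fragments = pathname.split("/").filter((f) => f.length > 0);
  const lastIndex = fragments.length - 1;
  const cleaned = fragments.map((f, i) => (i === lastIndex ? stripSuffixes(f) : f));

  // Walk from right to left, stopping at the first non-generic fragment.
  const kept: string[] = [];
  for (const fragment of cleaned.slice().reverse()) {
    kept.unshift(fragment);
    if (!GENERIC_FRAGMENTS.has(fragment.toLowerCase())) {
      return kept.join("/");
    }
  }

  // Every fragment was generic: prefix the hostname (assumed format).
  return [hostname].concat(kept).join("/");
}
```

With these assumptions, frequencyKey("example.com", "/js/jquery/ui/core.min.js") yields "jquery/ui/core". Entries with the "<inline>" value have no pathname and would need to be skipped before calling it.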

Get Input Data

The project scrapes URLs from Majestic 1 Million (June 5th, 2022).
Download the CSV from the official site.
Keep only the 3rd column, which holds the domain name, and remove the 1st row, which contains the column labels.

cd ansible/files
cut -d, -f 3 downloaded-majestic-million.csv > majestic-million-compact.csv
sed -i '1d' majestic-million-compact.csv
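
Where cut and sed are not available, the same two steps can be approximated in TypeScript. The sketch below is an illustration only: extractDomains is a hypothetical helper, not part of the project, and like cut it splits naively on commas without handling quoted fields.

```typescript
// Return the 3rd column of every non-empty row, skipping the header row.
function extractDomains(csvText: string): string[] {
  return csvText
    .split(/\r?\n/)
    .slice(1) // drop the label row, like sed '1d'
    .filter((line) => line.trim().length > 0)
    .map((line) => line.split(",")[2] ?? ""); // 3rd field, like cut -d, -f 3
}
```

The returned array can be joined with "\n" and written to majestic-million-compact.csv.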

majestic-million-compact.csv is referenced by the Ansible playbook scraper-setup.yml, which uses it to add the URLs to the initial scraping queue.

Scrape in Cloud

See getsetfetch.org/blog/cloud-scraping-running-existing-projects.html for detailed instructions on how to set up Terraform and Ansible, start scraping, monitor progress and export scraped content.

The Terraform module defined in main.tf provisions one central PostgreSQL instance and 20 scraper instances in the DigitalOcean Frankfurt (FRA1) datacenter.

terraform apply \
-var "api_token=${API_TOKEN}" \
-var "public_key_file=<public_key_file>" \
-var "private_key_file=<private_key_file>" \
-parallelism=30

Summarize Scraped Data

cd charts/extract
npx ts-node summarize-js-libs.ts
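
This README does not show what summarize-js-libs.ts does internally. As a rough illustration of how rows of the form key,count (as in the frequency-count dataset) could be produced from a list of script keys, here is a minimal sketch; countFrequencies and toCsvLines are hypothetical names, and the sort order is an assumption.

```typescript
// Count how often each key occurs and sort by descending count, then by key.
function countFrequencies(keys: string[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const key of keys) {
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => {
    if (b[1] !== a[1]) return b[1] - a[1];
    return a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0;
  });
}

// Format each [key, count] pair as one CSV line.
function toCsvLines(rows: Array<[string, number]>): string[] {
  return rows.map(([key, count]) => `${key},${count}`);
}
```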

Generate Chart(s)

Start a basic HTTP server that serves static files from the current directory on localhost:9000.

cd charts
npx ts-node ../../utils/serve-static.ts

Most Used Javascript Libraries (percentage)

Most Used Javascript Libraries
