5

CLI Tool for Downloading Huggingface Models and Datasets

 8 months ago
source link: https://gist.github.com/padeoe/697678ab8e528b85a2a7bddafea1fa4f
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

🤗Huggingface Model Downloader

Considering the lack of multi-threaded download support in the official huggingface-cli, and the inadequate error handling in hf_transfer, this command-line tool smartly utilizes wget or aria2 for LFS files and git clone for the rest.

Features

  • ⏯️ Resume from breakpoint: You can re-run it or Ctrl+C anytime.
  • 🚀 Multi-threaded Download: Utilize multiple threads to speed up the download process.
  • 🚫 File Exclusion: Use --exclude or --include to skip or specify files, save time for models with duplicate formats (e.g., .bin and .safetensors).
  • 🔐 Auth Support: For gated models that require Huggingface login, use --hf_username and --hf_token to authenticate.
  • 🪞 Mirror Site Support: Set up with HF_ENDPOINT environment variable.
  • 🌍 Proxy Support: Set up with HTTPS_PROXY environment variable.
  • 📦 Simple: No dependencies & No installation required.

Usage

First, Download hfd.sh or clone this repo, and then grant execution permission to the script.

chmod a+x hfd.sh

Usage Instructions:

$ ./hfd.sh -h
Usage:
  hfd <model_id> [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool wget|aria2c] [-x threads] [--dataset]

Description:
  Downloads a model or dataset from Hugging Face using the provided model ID.

Parameters:
  model_id        The Hugging Face model ID in the format 'repo/model_name'.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  exclude_pattern The pattern to match against filenames for exclusion.
  --hf_username   (Optional) Hugging Face username for authentication.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be wget (default) or aria2c.
  -x              (Optional) Number of download threads for aria2c.
  --dataset       (Optional) Flag to indicate downloading a dataset.

Example:
  hfd bigscience/bloom-560m --exclude safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken --tool aria2c -x 8
  hfd lavita/medical-qa-shared-task-v1-toy --dataset

Download a model:

./hfd.sh bigscience/bloom-560m

Download a model need login

Get huggingface token from https://huggingface.co/settings/tokens, then

hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME --hf_token YOUR_HF_TOKEN

Download a model and exclude certain files (e.g., .safetensors):

./hfd.sh bigscience/bloom-560m --exclude safetensors

Download with aria2c and multiple threads:

./hfd.sh bigscience/bloom-560m --tool aria2c -x 4

Output: During the download, the file URLs will be displayed:

$ ./hfd.sh bigscience/bloom-560m --exclude safetensors
...
Start Downloading lfs files, bash script:

wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack
# wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/model.safetensors
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx
...

Create an Alias for Convenience

For easier access, you can create an alias for the script:

alias hfd="$PWD/hfd.sh"

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK