How to Speed Up File Downloads With Python

by Maksim Kuznetsov (@exactor), just a Senior Python Developer

Very often admins set limits on the speed of downloading files to reduce the load on the network. At the same time this is very annoying for users, especially when you need to download a large file (1 GB or more) and the speed hovers around 1 megabit per second (about 125 kilobytes per second). Based on these numbers, the download will take at least 8192 seconds (2 hours 16 minutes 32 seconds), even though our bandwidth allows up to 16 Mbit/s (2 MB per second), at which the same download would take only 512 seconds (8 minutes 32 seconds).
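
For reference, here is the arithmetic behind those numbers as a quick sketch (the 8192-second figure corresponds to rounding 1 Mbit/s to 128 KiB per second):

GIB = 1024 ** 3          # 1 GiB in bytes

slow = 128 * 1024        # ~1 Mbit/s, rounded to 128 KiB per second
fast = 2 * 1024 ** 2     # ~16 Mbit/s, i.e. 2 MiB per second

print(GIB / slow)        # 8192.0 seconds -> 2 h 16 min 32 s
print(GIB / fast)        # 512.0 seconds  -> 8 min 32 s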

These values were not chosen by chance: for this kind of download I initially used nothing but a 4G connection.

Use case:

The utility I developed and listed below works only if:

  • You know in advance that your bandwidth is higher than the download speed
  • Even large pages of the site load quickly (the first sign of an artificially limited speed)
  • You are not using a slow proxy or VPN
  • You have a good ping to the site

What are these restrictions for?

  • Optimizing the backend and the serving of static files
  • DDoS protection

How is this slowdown implemented?

Nginx

location /static/ {
   ...
   limit_rate 50k;        # 50 kilobytes per second for a single connection
   ...
}

location /videos/ {
   ...
   limit_rate 500k;       # 500 kilobytes per second for a single connection
   limit_rate_after 10m;  # the limit kicks in only after the first 10 megabytes have been sent
   ...
}

A quirk with zip files

An interesting quirk shows up when downloading a zip file in parts: each part already lets you partially see the files inside the archive. Most archivers will say the file is broken and not valid, but some of the content and the file names will still be displayed.
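
As an illustration, the file names can be pulled out of a truncated part even without an archiver, because every entry in a zip starts with a local file header that carries its name. A minimal sketch (partial.zip is a placeholder path for one of the downloaded parts):

import struct


def list_partial_zip(path):
    # Each zip entry starts with the local file header signature PK\x03\x04;
    # the file name length is stored at offset 26 and the name itself starts at offset 30.
    with open(path, 'rb') as f:
        data = f.read()
    names = []
    pos = data.find(b'PK\x03\x04')
    while pos != -1 and pos + 30 <= len(data):
        name_len, _ = struct.unpack('<HH', data[pos + 26:pos + 30])
        names.append(data[pos + 30:pos + 30 + name_len].decode('utf-8', 'replace'))
        pos = data.find(b'PK\x03\x04', pos + 4)
    return names


print(list_partial_zip('partial.zip'))  # names seen so far, even though the archive is incomplete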

Code walkthrough:

To create this program we need Python along with asyncio, aiohttp, and aiofiles. All of the code is asynchronous to increase performance and minimize memory and speed overhead. The same idea could be implemented with threads or processes, but when downloading a large file split into many parts, the creation of a thread or a process may fail.
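
Both third-party libraries can be installed with pip:

pip install aiohttp aiofiles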

async def get_content_length(url):
    async with aiohttp.ClientSession() as session:
        async with session.head(url) as request:
            return request.content_length

This function returns the length of the file. The request uses HEAD instead of GET, which means we receive only the headers, without the body (the content at the given URL).
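
The whole technique also relies on the server honoring HTTP Range requests. A quick check, which is not part of the original utility, is to look at the Accept-Ranges header in the same HEAD response:

import aiohttp


async def supports_ranges(url):
    # 'Accept-Ranges: bytes' signals that the server will serve partial content;
    # servers that do not support ranges usually send 'none' or omit the header.
    async with aiohttp.ClientSession() as session:
        async with session.head(url) as request:
            return request.headers.get('Accept-Ranges', 'none') == 'bytes'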

def parts_generator(size, start=0, part_size=10 * 1024 ** 2):
    while size - start > part_size:
        yield start, start + part_size
        start += part_size
    yield start, size

This generator yields the byte ranges to download. An important point is to choose a part_size that is a multiple of 1024 so the parts line up with whole megabytes, although it seems that almost any number would do. It did not work correctly with part_size = 1, so I defaulted to 10 MiB per part.
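
For example, for a 25 MiB file the generator yields three ranges:

>>> list(parts_generator(25 * 1024 ** 2))
[(0, 10485760), (10485760, 20971520), (20971520, 26214400)]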

async def download(url, headers, save_path):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as request:
            # aiofiles gives a non-blocking file object; the async context manager
            # makes sure the part file is properly closed after writing.
            async with aiofiles.open(save_path, 'wb') as file:
                await file.write(await request.content.read())

One of the main functions is the file download itself. It works asynchronously; asynchronous file I/O is needed here so that writing to disk does not block the event loop.

async def process(url):
    filename = os.path.basename(urlparse(url).path)
    tmp_dir = TemporaryDirectory(prefix=filename, dir=os.path.abspath('.'))
    size = await get_content_length(url)
    tasks = []
    file_parts = []
    for number, sizes in enumerate(parts_generator(size)):
        part_file_name = os.path.join(tmp_dir.name, f'{filename}.part{number}')
        file_parts.append(part_file_name)
        # HTTP ranges are inclusive, so end at sizes[1] - 1 to avoid overlapping parts
        tasks.append(download(url, {'Range': f'bytes={sizes[0]}-{sizes[1] - 1}'}, part_file_name))
    await asyncio.gather(*tasks)
    with open(filename, 'wb') as wfd:
        for f in file_parts:
            with open(f, 'rb') as fd:
                shutil.copyfileobj(fd, wfd)

The central function takes the file name from the URL, creates a temporary directory next to the final file, and downloads each numbered .part file into it. await asyncio.gather(*tasks) runs all of the collected coroutines concurrently, which significantly speeds up the download. After that, the synchronous shutil.copyfileobj concatenates all of the parts into one file.

async def main():
    if len(sys.argv) <= 1:
        print('Add URLs')
        sys.exit(1)
    urls = sys.argv[1:]
    await asyncio.gather(*[process(url) for url in urls])

The main function receives a list of URLs from the command line and, using the already familiar asyncio.gather, starts downloading several files at the same time.
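
Usage is a single command; assuming the script is saved as downloader.py (the file name is arbitrary) and with placeholder URLs:

python downloader.py https://example.com/file1.iso https://example.com/file2.zip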

Benchmark:

For a benchmark, I downloaded a Gentoo Linux image from the site of a university (a slow server):

  • async: 164.682 seconds
  • sync: 453.545 seconds

Downloading the DietPi distribution (fast server):

  • async: 17.106 seconds best time, 20.056 seconds worst time
  • sync: 15.897 seconds best time, 25.832 seconds worst time

As you can see, on the slow server the result is almost a 3x speedup, and on some files the speedup reached 20-30x; on the fast server the asynchronous and synchronous versions are roughly on par.

Possible improvements:

  • More robust downloads: if an error occurs, retry that part (see the sketch after this list).
  • Memory optimization: one problem is a 2x increase in disk space consumption, because all parts are downloaded and then copied into the final file before the temporary directory is deleted. This is easily fixed by deleting each part file right after its contents have been copied.
  • Some servers track the number of connections and may block such a burst of requests; this calls for pausing between requests or greatly increasing the part size.
  • Adding a progress bar.
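
A minimal sketch for the first point, wrapping the download coroutine from above with a few retries (the attempt count and backoff are arbitrary):

import asyncio

import aiohttp


async def download_with_retries(url, headers, save_path, attempts=3):
    # Re-run the part download a few times before giving up;
    # aiohttp.ClientError covers connection and response errors.
    for attempt in range(1, attempts + 1):
        try:
            return await download(url, headers, save_path)
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == attempts:
                raise
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff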

In conclusion, asynchronous downloading is a way out, but unfortunately it is not a silver bullet when it comes to downloading files. The full code of the utility is listed below.

import asyncio
import os.path
import shutil

import aiofiles
import aiohttp
from tempfile import TemporaryDirectory
import sys
from urllib.parse import urlparse


async def get_content_length(url):
    async with aiohttp.ClientSession() as session:
        async with session.head(url) as request:
            return request.content_length


def parts_generator(size, start=0, part_size=10 * 1024 ** 2):
    while size - start > part_size:
        yield start, start + part_size
        start += part_size
    yield start, size


async def download(url, headers, save_path):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as request:
            # aiofiles gives a non-blocking file object; the async context manager
            # makes sure the part file is properly closed after writing.
            async with aiofiles.open(save_path, 'wb') as file:
                await file.write(await request.content.read())


async def process(url):
    filename = os.path.basename(urlparse(url).path)
    tmp_dir = TemporaryDirectory(prefix=filename, dir=os.path.abspath('.'))
    size = await get_content_length(url)
    tasks = []
    file_parts = []
    for number, sizes in enumerate(parts_generator(size)):
        part_file_name = os.path.join(tmp_dir.name, f'{filename}.part{number}')
        file_parts.append(part_file_name)
        # HTTP ranges are inclusive, so end at sizes[1] - 1 to avoid overlapping parts
        tasks.append(download(url, {'Range': f'bytes={sizes[0]}-{sizes[1] - 1}'}, part_file_name))
    await asyncio.gather(*tasks)
    with open(filename, 'wb') as wfd:
        for f in file_parts:
            with open(f, 'rb') as fd:
                shutil.copyfileobj(fd, wfd)


async def main():
    if len(sys.argv) <= 1:
        print('Add URLs')
        sys.exit(1)
    urls = sys.argv[1:]
    await asyncio.gather(*[process(url) for url in urls])


if __name__ == '__main__':
    import time

    start_code = time.monotonic()
    asyncio.run(main())
    print(f'{time.monotonic() - start_code} seconds!')