How to Speed Up File Downloads With Python

by Maksim Kuznetsov (@exactor), just a Senior Python Developer

Very often admins set limits on the speed of downloading files to reduce the load on the network. At the same time this is very annoying for users, especially when you need to download a large file (1 GB or more) and the speed hovers around 1 megabit per second (about 125 kilobytes per second). Based on these numbers, the download will take at least 8192 seconds (2 hours 16 minutes 32 seconds), even though our bandwidth allows up to 16 Mbit/s (2 MB per second), at which the same download would take only 512 seconds (8 minutes 32 seconds).
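
For reference, here is the arithmetic behind those numbers as a quick sketch (the 8192-second figure corresponds to rounding 1 Mbit/s to 128 KiB per second):

GIB = 1024 ** 3          # 1 GiB in bytes

slow = 128 * 1024        # ~1 Mbit/s, rounded to 128 KiB per second
fast = 2 * 1024 ** 2     # ~16 Mbit/s, i.e. 2 MiB per second

print(GIB / slow)        # 8192.0 seconds -> 2 h 16 min 32 s
print(GIB / fast)        # 512.0 seconds  -> 8 min 32 s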

These values were not chosen by chance: for this kind of download I initially used nothing but a 4G connection.

Use case:

The utility I developed and listed below works only if:

  • You know in advance that your bandwidth is higher than the download speed
  • Even large pages of the site load quickly (the first sign of an artificially limited speed)
  • You are not using a slow proxy or VPN
  • You have a good ping to the site

What are these restrictions for?

  • Optimizing the backend and the serving of static files
  • DDoS protection

How is this slowdown implemented?

Nginx

location /static/ {
   ...
   limit_rate 50k;        # 50 kilobytes per second for a single connection
   ...
}

location /videos/ {
   ...
   limit_rate 500k;       # 500 kilobytes per second for a single connection
   limit_rate_after 10m;  # the limit kicks in only after the first 10 megabytes have been sent
   ...
}

A quirk with zip files

An interesting quirk shows up when downloading a zip file in parts: each part already lets you partially see the files inside the archive. Most archivers will say the file is broken and not valid, but some of the content and the file names will still be displayed.
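
As an illustration, the file names can be pulled out of a truncated part even without an archiver, because every entry in a zip starts with a local file header that carries its name. A minimal sketch (partial.zip is a placeholder path for one of the downloaded parts):

import struct


def list_partial_zip(path):
    # Each zip entry starts with the local file header signature PK\x03\x04;
    # the file name length is stored at offset 26 and the name itself starts at offset 30.
    with open(path, 'rb') as f:
        data = f.read()
    names = []
    pos = data.find(b'PK\x03\x04')
    while pos != -1 and pos + 30 <= len(data):
        name_len, _ = struct.unpack('<HH', data[pos + 26:pos + 30])
        names.append(data[pos + 30:pos + 30 + name_len].decode('utf-8', 'replace'))
        pos = data.find(b'PK\x03\x04', pos + 4)
    return names


print(list_partial_zip('partial.zip'))  # names seen so far, even though the archive is incomplete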

Code walkthrough:

To create this program we need Python along with asyncio, aiohttp, and aiofiles. All of the code is asynchronous to increase performance and minimize memory and speed overhead. The same idea could be implemented with threads or processes, but when downloading a large file split into many parts, the creation of a thread or a process may fail.
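
Both third-party libraries can be installed with pip:

pip install aiohttp aiofiles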

async def get_content_length(url):
    async with aiohttp.ClientSession() as session:
        async with session.head(url) as request:
            return request.content_length

This function returns the length of the file. The request uses HEAD instead of GET, which means we receive only the headers, without the body (the content at the given URL).
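
The whole technique also relies on the server honoring HTTP Range requests. A quick check, which is not part of the original utility, is to look at the Accept-Ranges header in the same HEAD response:

import aiohttp


async def supports_ranges(url):
    # 'Accept-Ranges: bytes' signals that the server will serve partial content;
    # servers that do not support ranges usually send 'none' or omit the header.
    async with aiohttp.ClientSession() as session:
        async with session.head(url) as request:
            return request.headers.get('Accept-Ranges', 'none') == 'bytes'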

def parts_generator(size, start=0, part_size=10 * 1024 ** 2):
    while size - start > part_size:
        yield start, start + part_size
        start += part_size
    yield start, size

This generator yields the byte ranges to download. An important point is to choose a part_size that is a multiple of 1024 so the parts line up with whole megabytes, although it seems that almost any number would do. It did not work correctly with part_size = 1, so I defaulted to 10 MiB per part.
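
For example, for a 25 MiB file the generator yields three ranges:

>>> list(parts_generator(25 * 1024 ** 2))
[(0, 10485760), (10485760, 20971520), (20971520, 26214400)]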

async def download(url, headers, save_path):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as request:
            # aiofiles gives a non-blocking file object; the async context manager
            # makes sure the part file is properly closed after writing.
            async with aiofiles.open(save_path, 'wb') as file:
                await file.write(await request.content.read())

One of the main functions is the file download itself. It works asynchronously; asynchronous file I/O is needed here so that writing to disk does not block the event loop.

async def process(url):
    filename = os.path.basename(urlparse(url).path)
    tmp_dir = TemporaryDirectory(prefix=filename, dir=os.path.abspath('.'))
    size = await get_content_length(url)
    tasks = []
    file_parts = []
    for number, sizes in enumerate(parts_generator(size)):
        part_file_name = os.path.join(tmp_dir.name, f'{filename}.part{number}')
        file_parts.append(part_file_name)
        # HTTP ranges are inclusive, so end at sizes[1] - 1 to avoid overlapping parts
        tasks.append(download(url, {'Range': f'bytes={sizes[0]}-{sizes[1] - 1}'}, part_file_name))
    await asyncio.gather(*tasks)
    with open(filename, 'wb') as wfd:
        for f in file_parts:
            with open(f, 'rb') as fd:
                shutil.copyfileobj(fd, wfd)

The central function takes the file name from the URL, creates a temporary directory next to the final file, and downloads each numbered .part file into it. await asyncio.gather(*tasks) runs all of the collected coroutines concurrently, which significantly speeds up the download. After that, the synchronous shutil.copyfileobj concatenates all of the parts into one file.

async def main():
    if len(sys.argv) <= 1:
        print('Add URLs')
        sys.exit(1)
    urls = sys.argv[1:]
    await asyncio.gather(*[process(url) for url in urls])

The main function receives a list of URLs from the command line and, using the already familiar asyncio.gather, starts downloading several files at the same time.
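
Usage is a single command; assuming the script is saved as downloader.py (the file name is arbitrary) and with placeholder URLs:

python downloader.py https://example.com/file1.iso https://example.com/file2.zip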

Benchmark:

For a benchmark, I downloaded a Gentoo Linux image from the site of a university (a slow server):

  • async: 164.682 seconds
  • sync: 453.545 seconds

Downloading the DietPi distribution (fast server):

  • async: 17.106 seconds best time, 20.056 seconds worst time
  • sync: 15.897 seconds best time, 25.832 seconds worst time

As you can see, on the slow server the result is almost a 3x speedup, and on some files the speedup reached 20-30x; on the fast server the asynchronous and synchronous versions are roughly on par.

Possible improvements:

  • More robust downloads: if an error occurs, retry that part (see the sketch after this list).
  • Memory optimization: one problem is a 2x increase in disk space consumption, because all parts are downloaded and then copied into the final file before the temporary directory is deleted. This is easily fixed by deleting each part file right after its contents have been copied.
  • Some servers track the number of connections and may block such a burst of requests; this calls for pausing between requests or greatly increasing the part size.
  • Adding a progress bar.
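
A minimal sketch for the first point, wrapping the download coroutine from above with a few retries (the attempt count and backoff are arbitrary):

import asyncio

import aiohttp


async def download_with_retries(url, headers, save_path, attempts=3):
    # Re-run the part download a few times before giving up;
    # aiohttp.ClientError covers connection and response errors.
    for attempt in range(1, attempts + 1):
        try:
            return await download(url, headers, save_path)
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == attempts:
                raise
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff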

In conclusion, asynchronous downloading is a way out, but unfortunately it is not a silver bullet when it comes to downloading files. The full code of the utility is listed below.

import asyncio
import os.path
import shutil

import aiofiles
import aiohttp
from tempfile import TemporaryDirectory
import sys
from urllib.parse import urlparse


async def get_content_length(url):
    async with aiohttp.ClientSession() as session:
        async with session.head(url) as request:
            return request.content_length


def parts_generator(size, start=0, part_size=10 * 1024 ** 2):
    while size - start > part_size:
        yield start, start + part_size
        start += part_size
    yield start, size


async def download(url, headers, save_path):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as request:
            # aiofiles gives a non-blocking file object; the async context manager
            # makes sure the part file is properly closed after writing.
            async with aiofiles.open(save_path, 'wb') as file:
                await file.write(await request.content.read())


async def process(url):
    filename = os.path.basename(urlparse(url).path)
    tmp_dir = TemporaryDirectory(prefix=filename, dir=os.path.abspath('.'))
    size = await get_content_length(url)
    tasks = []
    file_parts = []
    for number, sizes in enumerate(parts_generator(size)):
        part_file_name = os.path.join(tmp_dir.name, f'{filename}.part{number}')
        file_parts.append(part_file_name)
        # HTTP ranges are inclusive, so end at sizes[1] - 1 to avoid overlapping parts
        tasks.append(download(url, {'Range': f'bytes={sizes[0]}-{sizes[1] - 1}'}, part_file_name))
    await asyncio.gather(*tasks)
    with open(filename, 'wb') as wfd:
        for f in file_parts:
            with open(f, 'rb') as fd:
                shutil.copyfileobj(fd, wfd)


async def main():
    if len(sys.argv) <= 1:
        print('Add URLs')
        sys.exit(1)
    urls = sys.argv[1:]
    await asyncio.gather(*[process(url) for url in urls])


if __name__ == '__main__':
    import time

    start_code = time.monotonic()
    asyncio.run(main())
    print(f'{time.monotonic() - start_code} seconds!')