
How to Read Extremely Large Text Files Using Python

source link: https://code.tutsplus.com/tutorials/quick-tip-how-to-read-extremely-large-text-files-using-python--cms-25992

Let me start by asking: do we really need Python to read large text files? Wouldn't our normal word processor or text editor suffice? When I say large here, I mean extremely large files!

Well, let's see some evidence on whether we would need Python for reading such files or not.

Obtaining the File

To carry out our experiment, we need an extremely large text file. In this tutorial, we will obtain this file from the UCSC Genome Bioinformatics downloads website. The file we will use is hg38.fa.gz, which that site describes as:

"Soft-masked" assembly sequence in one file. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.

Don't worry if you didn't understand the above statement; it is genetics terminology. What matters in this tutorial is the concept of reading extremely large text files using Python.

Go ahead and download hg38.fa.gz (please be careful, the file is 938 MB). You can use 7-Zip to decompress the file, or any other tool you prefer.

After you unzip the file, you will get a file called hg38.fa. Rename it to hg38.txt to obtain a text file.
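
If you would rather stay in Python for this step too, the standard library can decompress the archive. The following is only a minimal sketch: the helper name decompress_gzip and its chunk_size parameter are made up for illustration, and it assumes hg38.fa.gz sits in the current directory.

```python
import gzip
import shutil


def decompress_gzip(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress a .gz file to dst_path without loading it all into memory."""
    with gzip.open(src_path, 'rb') as compressed, open(dst_path, 'wb') as plain:
        # copyfileobj streams the data in chunks of chunk_size bytes.
        shutil.copyfileobj(compressed, plain, chunk_size)


# Example call (hypothetical paths, assuming the download is in this directory):
# decompress_gzip('hg38.fa.gz', 'hg38.txt')
```

Because the data is copied in chunks, memory use stays small no matter how large the decompressed file turns out to be.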

Opening the File the Traditional Way

What I mean here by the traditional way is using our word processor or text editor to open the file. Let's see what happens when we try to do that.

I first tried using Microsoft Word to open the file, and got the following message:

error message file too large


Although opening the file also didn't work using WordPad and Notepad on a Windows-based machine, it did open using TextEdit on a macOS machine.

But you get the point. Having a guaranteed way to open such extremely large files would be a nice idea. In this quick tip, we will see how to do that using Python.

Reading the Text File Using Python

In this section, we are going to see how we can read our large file using Python. Let's say we wanted to read the first 500 lines from our large text file. We can simply do the following:

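input_file = open('hg38.txt', 'r')
output_file = open('output.txt', 'w')
# Copy the first 500 lines, one line at a time.
for _ in range(500):
    line = input_file.readline()
    output_file.write(line)
# Close both files so that the output is flushed to disk.
input_file.close()
output_file.close()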

We begin by using Python's built-in open() function to open our file and get back a file object. The 'r' passed as the second argument means that we intend to read the contents of hg38.txt. On the next line, we use open() again, but this time we pass 'w' because we want to write the contents of our original file to the new file.

After that, we call readline() 500 times to read the first 500 lines of hg38.txt. readline() returns each line with its trailing newline character intact, and it returns an empty string only when the end of the file has been reached.
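
To see that behavior in isolation, here is a small sketch that uses io.StringIO as a stand-in for a two-line text file (the sample text is made up for the demo, not taken from hg38.txt):

```python
import io

# io.StringIO behaves like an open text file, which keeps the demo self-contained.
sample = io.StringIO('first line\nsecond line\n')

first = sample.readline()    # the trailing newline is kept
second = sample.readline()
at_end = sample.readline()   # nothing left, so an empty string comes back

# A blank line inside a file comes back as '\n', never as '',
# so an empty string reliably signals the end of the file.
print(repr(first), repr(second), repr(at_end))
```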

Notice that we read 500 lines from hg38.txt, line by line, and wrote those lines to a new text file output.txt, which should look as shown below:

>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
.
.
.
.
CTCCAGAGACCTTCTGCAGGTACTGCAGGGCATCCGCCATCTGCTGGACG
GCCTCCTCTCGCCGCAGGTCTGGCTGGATGAAGGGCACGGCATAGGTCTG
ACCTGCCAGGGAGTGCTGCATCCTCACAGGAGTCATGGTGCCTGTGGGTC
GGAGCCGGAGCGTCAGAGCCACCCACGACCACCGGCACGCCCCCACCACA

Our code for reading the text file can be made safer and more readable by using Python's with statement. The file objects act as context managers, so both files are closed automatically when the with block ends, even if an error occurs along the way.

with open('hg38.txt', 'r') as input_file, open('output.txt', 'w') as output_file:
    for lines in range(500):
        line = input_file.readline()
        output_file.write(line)
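
A variation on the same idea, offered only as a sketch: a file object can also be iterated over directly, and itertools.islice can cap the number of lines. The function name copy_first_lines and its parameters are hypothetical, not part of the original tutorial.

```python
from itertools import islice


def copy_first_lines(src_path, dst_path, count=500):
    """Copy at most `count` lines from src_path to dst_path; return the number copied."""
    copied = 0
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        # Iterating over a file object yields one line at a time, and
        # islice stops after `count` lines or at the end of the file.
        for line in islice(src, count):
            dst.write(line)
            copied += 1
    return copied


# Example call (assuming hg38.txt is in the current directory):
# copy_first_lines('hg38.txt', 'output.txt', 500)
```

Unlike a fixed range(500) loop, this version stops cleanly if the file has fewer lines than requested.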

But suppose we want to navigate through the text file directly, without first extracting lines and writing them to another text file. That would be more flexible.

Navigating Through Large Text Files

The previous step let us read a large text file by extracting lines from it and writing them to another text file. Navigating through the large file directly, without the intermediate file, would be preferable.

To do that, we can use Python to print the file to the terminal screen, 50 lines at a time:

with open('hg38.txt', 'r') as input_file:
    while True:
        for _ in range(50):
            # readline() keeps the trailing newline, so suppress print's own.
            print(input_file.readline(), end='')
        user_input = input('Type STOP to quit, otherwise press the Enter/Return key ')
        if user_input == 'STOP':
            break

As you can see from this script, you can now read and navigate through the large text file immediately using your terminal. Whenever you want to quit, you just need to type STOP (case sensitive) in your terminal.
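
One thing the script does not handle is the end of the file: once every line has been shown, readline() keeps returning empty strings and the prompt repeats until you type STOP. Below is a hedged sketch of one way to end the loop there instead. The names iter_pages and browse are hypothetical helpers introduced for illustration only.

```python
from itertools import islice


def iter_pages(file_obj, page_size=50):
    """Yield lists of up to page_size lines until the file is exhausted."""
    while True:
        page = list(islice(file_obj, page_size))
        if not page:  # an empty list means there is nothing left to read
            return
        yield page


def browse(path, page_size=50):
    """Print a file one page at a time, stopping at STOP or at the end of the file."""
    with open(path, 'r') as input_file:
        for page in iter_pages(input_file, page_size):
            # Each line already ends with a newline, so join them as they are.
            print(''.join(page), end='')
            if input('Type STOP to quit, otherwise press the Enter/Return key ') == 'STOP':
                break


# Example call (assuming hg38.txt is in the current directory):
# browse('hg38.txt')
```

Keeping the paging logic in its own generator means it can be reused without the prompt, for example to process the file in fixed-size batches of lines.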

I'm sure that you will notice how smooth Python makes it to navigate through such an extremely large text file without having any issues. Python is again proving itself to be a language that makes our lives easier!

This post has been updated with contributions from Monty Shokeen. Monty is a full-stack developer who also loves to write tutorials and to learn about new JavaScript libraries.

