
Useful Command Line Tools for Data Scientists


An assortment of handy tools for your linux terminal


Working from the command-line can be daunting, but it is an important skill set for any data scientist. When working on a remote linux instance, you no longer have access to your favorite GUI and must instead navigate the instance using the command-line. This is not an introductory guide to getting started with the command-line, but rather a hodgepodge of tools that I find useful to work with, and I hope that you do as well!

Our Good Friend grep

grep is a command-line tool that searches for patterns within a file. grep prints each line of the file that matches the pattern to standard output (the terminal screen). This can be especially useful when we want to model or perform EDA on only the subset of our data that matches a given pattern:

  grep "2017" orders.csv > orders_2017.csv

In the above command, we capture all rows from the “orders.csv” dataset that include the pattern “2017” and write them to a new file titled “orders_2017.csv”. Of course, if we are interested in the order_date column but a different column such as address also contains the pattern “2017” (i.e. 2017 Wallaby Lane), then we could end up with data from the incorrect year; however, this can be quickly sorted out in pandas. One reason to use grep before reaching for pandas right off the bat is that command-line tools are often written in C, so they are very fast. Another reason is that they can easily be placed within a python script using os.system():
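(A minimal sketch, reusing the filenames from above.)

import os

# Hand the same grep pipeline to the shell from inside Python;
# the redirection works exactly as it does at the terminal.
os.system('grep "2017" orders.csv > orders_2017.csv')

For newer code, subprocess.run offers more control over the shell call, but os.system is the quickest way to drop a one-liner like this into a script.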

grep is great and also supports extended regular expressions via the -E option.
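For instance, assuming our dates are stored as YYYY-MM-DD (the date format and output filename here are just illustrative), we could capture every order from 2015 through 2019 in one pass:

grep -E "201[5-9]-[0-9]{2}-[0-9]{2}" orders.csv > orders_2015_2019.csv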

htop

Have you ever worked in pandas and received a memory error? Have you ever run some operation in parallel (i.e. fit an sklearn model using n_jobs=-1) and been unsure whether all CPUs were in use? Well then htop is for you! htop is great because it lets you see the current CPU and RAM usage on your machine.
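There is nothing to configure; you simply run it from the terminal (on Ubuntu, sudo apt install htop will fetch it if it is not already installed):

htop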

At the top of htop’s display, each CPU gets its own usage bar; my machine has four CPUs, so the first four lines show the usage statistics for each of them. The fifth line shows memory usage out of the 8GB that my computer has. Below that is a table of running processes with their associated process id (PID), memory and CPU usage, along with some other useful statistics.

Let’s say we are having memory issues with pandas operations in a Jupyter Notebook; using htop can help us monitor how much each operation contributes to RAM usage. Suppose we have code like this:
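(A minimal sketch; the column name order_total is invented for illustration.)

import pandas as pd

# Each step below materializes another full copy of the data in RAM,
# so peak memory usage climbs with every assignment.
df = pd.read_csv("orders.csv")
df_backup = df.copy()                              # copy #1
df_filtered = df[df["order_total"] > 100].copy()   # copy #2
df_final = df_filtered.reset_index(drop=True)      # copy #3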

We are making a bunch of in-memory copies of our pandas dataframe in the code segment above, and htop can help us stay cognizant of when we are approaching our RAM threshold. Also, if other memory-intensive processes that we do not currently need are running on our machine, we can kill them using the kill command-line tool (the -9 option forces the kill; make sure the process you are killing is non-essential and not a system process). Look at the htop or ps output to get PIDs:

kill -9 {insert PID here}

df

df is a useful tool for checking how much disk space is available on our machine. When used without a file name specified, df returns the space available on all currently mounted file systems; however, if I just want to view how much space is available on the machine as a whole, I run:

df -h /

The -h option returns human-readable sizes, and the “/” asks how much space is available on the root file-system.

When needing to create room on a remote AWS EC2 instance for a large download, df can be helpful in determining how much more room you need to clear. In the case where you are freeing up disk space from an old project with lots of large files, you can run “ls -hlS” to list the files in a given directory in human-readable (-h), long format (-l, so you can see permissions, file size, and date last modified), sorted by file size in descending order (-S). This makes it easy to identify which files, if removed, would free up the most space:

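(~/old_project/ below is just a placeholder for the directory you are cleaning out.)

ls -hlS ~/old_project/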

watch

watch is an easy-to-use command that comes in handy when you need the output of a command to refresh every n seconds. I have used this command in the past when downloading large JSON files from the pushshift reddit comment database. Each unzipped month of comments was a 100GB file, so I wrote a script to download one file at a time, filter out comments from subreddits I was not interested in, write the rest to MongoDB, and delete the large JSON file. In this case, I used watch to repeatedly check that the correct files were being downloaded, unzipped, and then deleted.

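A typical invocation looks something like this (the path is a placeholder; watch simply reruns whatever command you quote):

watch -n 10 "ls -lh ~/downloads"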

The default for watch is to rerun the command every two seconds; the “-n {number_seconds}” option, as used above, changes that interval.

scp

scp stands for “secure copy” and is a useful command for sending files to and from a remote instance.

Send to remote:

scp -i ~/.ssh/path/to_pem_file /path/to/local/file ubuntu@IPv4:./desired/file/location

In the above command, “ubuntu” is the default ssh username for an Ubuntu AMI, so this will change based on which linux distribution you are using. An IPv4 address is typically formatted as four 8-bit fields separated by periods (i.e. 32.14.152.68). The -i option specifies an identity file containing the private key needed for public-key authentication.

Download from remote:

scp -i ~/.ssh/path/to_pem_file ubuntu@IPv4:./path/to/desired/file/ ~/my_projects/

Note that this download command is still run from our local terminal; we have just switched the order in which the local and remote file-systems are written. It is also worth mentioning that with the -r option we can recursively copy entire directories as opposed to just files:

scp -i ~/.ssh/path/to_pem_file -r ubuntu@IPv4:./path/to/desired/folder/ ~/my_projects/

Conclusion

The linux command-line offers a stable of powerful tools that can really boost your productivity as well as your understanding of the current state of your machine (i.e. disk space, running processes, RAM, and CPU usage). Working on a remote linux instance is often a great way to become familiar with the command-line, as you are forced to use it and cannot fall back on Mac’s Finder to navigate the file-system. The tools discussed above are some of my favorite and most-used (besides ls and cd), and I hope that if you did not know of them previously you are able to incorporate them into your workflow. Thank you for reading!

Further Reading

List of useful commands: http://www.slackstu.net/171/171-class-19f.xhtml

Data Science at the Command Line: https://www.datascienceatthecommandline.com/chapter-5-scrubbing-data.html

