An Introduction To Data Science On The Linux Command Line

2019-10-16 - By Robert Elder

     This article will provide the reader with a brief overview of a number of different Linux commands.  A special emphasis will be placed on explaining how each command can be used in the context of performing data science tasks.  The goal is to convince the reader that each of these commands can be extremely useful, and to help them understand what role each command can play when manipulating or analyzing data.

The '|' Symbol

     Many readers are likely already familiar with the '|' symbol, but if not, it's worth pointing it out in advance:  All of the inputs and outputs for the commands discussed in the next few sections can be automatically 'piped' into one another using the '|' symbol!  This means that all of the specialized tasks done by each command can be chained together to make extremely powerful and short mini programs, all directly on the command line!
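
     As a quick illustration (a minimal sketch using a hypothetical log file 'application.log'), the output of 'grep' can be piped straight into the 'wc' command (covered later in this article) to count how many lines mention an error:

# Hypothetical example: count the lines containing 'ERROR' in application.log
grep ERROR application.log | wc -l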

grep

     What is grep?  'grep' is a tool that can be used to extract matching text from files.  You can specify a number of different control flags and options that allow you to be very selective in determining what subset of text you'd like to have extracted from a file or stream.  Grep is generally used as a 'line-oriented' tool, which means that when matching text is found, grep will print all of the text on that line, although you can use the '-o' flag to only print the matched part of the line.

     Why is grep useful?  'grep' is useful because it's the fastest way to search for a particular piece of text in a large number of files.  Some great examples of use cases are:  Filtering accesses to a particular web page out of a huge web server log;  Searching a code base for instances of a specific keyword (this is much faster and more reliable than using the Eclipse Editor's search); Filtering the output from another command in the middle of a Unix pipe.

     How does grep relate to data science?  Grep can be very useful for ad-hoc data science tasks because it allows you to very quickly filter out the information you want from your data set.  It's very likely that your source data has a lot of information that isn't relevant to the question you're trying to answer.  If the data is stored as individual lines in a text file, you can use grep to extract only the lines that you want to work with, provided you can come up with a sufficiently precise search rule for them.  For example, if you had the following .csv file full of sales records, one per line:

item, modelnumber, price, tax
Sneakers, MN009, 49.99, 1.11
Sneakers, MTG09, 139.99, 4.11
Shirt, MN089, 8.99, 1.44
Pants, N09, 39.99, 1.11
Sneakers, KN09, 49.99, 1.11
Shoes, BN009, 449.22, 4.31
Sneakers, dN099, 9.99, 1.22
Bananas, GG009, 4.99, 1.11

     you could use a command like this:

grep Sneakers sales.csv

     to filter out only the sales records that contain the text 'Sneakers'.  Here is the result from running this command:

Sneakers, MN009, 49.99, 1.11
Sneakers, MTG09, 139.99, 4.11
Sneakers, KN09, 49.99, 1.11
Sneakers, dN099, 9.99, 1.22

     You can also use complicated regular expressions with grep to search for text that contains certain patterns.  For example, this command will use grep to extract all model numbers that begin with either 'BN' or 'MN' followed by 3 digits:

grep -o "\(BN\|MN\)\([0-9]\)\{3\}" sales.csv

     Here is the result from running this command:

MN009
MN089
BN009
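
     If all you need is the number of matching records rather than the records themselves, grep's '-c' flag counts the matching lines directly (a small sketch using the same file):

grep -c Sneakers sales.csv

     which would print '4' for the sales data above.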

sed

     What is sed?  Sed is a tool for performing search and replace operations.  For example, you could use the following command:

sed -i 's/dog/cat/g' *

     to replace all instances of the word 'dog' with the word 'cat' in all files in the current directory.

     Why is sed Useful?  'sed' is useful because you can use regexes to perform complex matches and substitutions.  Regex replacements also support backreferences that allow you to match arbitrary patterns and then change only part of the matched text in some way.  For example, this sed command will look for two quoted strings on any given line, and then swap their positions without changing any other part of the text.  It also changes the quotes into parentheses at the same time:

echo 'The "quick brown" fox jumped over the "lazy red" dog.' | sed -E 's/"([^"]+)"([^"]+)"([^"]+)"/(\3)\2(\1)/'

     and here is the result:

The (lazy red) fox jumped over the (quick brown) dog.

     How does sed relate to data science?  The biggest use case for sed in data science comes from the fact that your data probably isn't in exactly the format you need it to be in when you want to do something with it.  For example, if your boss gave you a CSV file 'data.csv' containing thousands of numbers that were erroneously enclosed in double quotes:

age,value
"33","5943"
"32","543"
"34","93"
"39","5943"
"36","9943"
"38","8943"

     you could run this file through the following sed command:

cat data.csv | sed 's/"//g'

     and obtain the following result with all the quotes removed:

age,value
33,5943
32,543
34,93
39,5943
36,9943
38,8943

     This would be useful in a situation where you need to import the numbers into another program that can't work with quotes around numbers.  If you've ever encountered a problem where some simple formatting error was preventing you from importing or properly working with a data set, the chances are good that there is a 'sed' command that could fix your problems.
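
     As another small sketch of what backreferences can do for data cleanup (using a hypothetical file 'dates.txt'), this command rewrites dates stored as MM/DD/YYYY into ISO-style YYYY-MM-DD:

# Hypothetical example: '03/27/2019' becomes '2019-03-27'
sed -E 's/([0-9]{2})\/([0-9]{2})\/([0-9]{4})/\3-\1-\2/' dates.txt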

awk

     What is awk?  Awk is a tool that can do more advanced search and replace operations that may require general purpose computation.

     Why is awk Useful?  Awk is useful because it's basically a general purpose programming language that can easily work with formatted lines of text.  There is some overlap with what 'sed' can do, but 'awk' is much more powerful.  Awk can also be used for changes that require remembering state across different rows.

     How does awk relate to data science?  Let's say you're given a CSV file 'temps.csv' that contains temperature values, but instead of sticking to either Celsius or Fahrenheit, the file contains a mixture of both and denotes the units with 'C' for Celsius and 'F' for Fahrenheit:

temp,unit
26.1,C
78.1,F
23.1,C
25.7,C
76.3,F
77.3,F
24.2,C
79.3,F
27.9,C
75.1,F
25.9,C
79.0,F

     Suppose you want every temperature expressed in Celsius.  You could accomplish this with one simple awk command:

cat temps.csv | awk -F',' '{if($2=="F")print (($1-32)*5/9)",C";else print $1","$2}'

     And the result will be:

temp,unit
26.1,C
25.6111,C
23.1,C
25.7,C
24.6111,C
25.1667,C
24.2,C
26.2778,C
27.9,C
23.9444,C
25.9,C
26.1111,C

     with all temperature values normalized to Celsius.
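
     Because awk keeps state between rows, it can also aggregate.  Here is a minimal sketch (reusing 'temps.csv') that converts each reading to Celsius and prints the average, skipping the header line with awk's built-in NR row counter:

# Convert each row to Celsius, accumulate a running sum, then print the mean
awk -F',' 'NR>1{t=($2=="F")?($1-32)*5/9:$1; sum+=t; n++} END{print sum/n" C"}' temps.csv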

sort

     What is sort?  Sort's name gives it all away:  It's used for sorting!

     Why is sort Useful?  Sorting by itself isn't that useful, but it is an important pre-requisite to a lot of other tasks: Want to find the greatest/least? Just sort them, and take the first or last.  Want the top 10?  Sort them, and take the last 10.  Need numeric sort vs. lexicographical sort?  The sort command does both!  Let's sort the following file of random text 'foo.txt' in a few different ways:

0
1
1234
11
ZZZZ
1010
0123
hello world
abc123
Hello World
9
zzzz

     Here is a command for doing the default sort:

cat foo.txt | sort

     And the result is:

0
0123
1
1010
11
1234
9
abc123
Hello World
hello world
ZZZZ
zzzz

     Notice that the sort above is in lexicographical order instead of numeric order, so the numbers might not be in the order you expect.  We can do a numeric sort instead using the '-n' flag:

cat foo.txt | sort -n

     And here is the result:

0
abc123
Hello World
hello world
ZZZZ
zzzz
1
9
11
0123
1010
1234

     Now the numbers are in the correct order.  Another common requirement is to sort things in reverse order, which you can do with the '-r' flag:

cat foo.txt | sort -r

     And here is the result:

zzzz
ZZZZ
hello world
Hello World
abc123
9
1234
11
1010
1
0123
0

     How does sort relate to data science?  Several of the other data-science-related Linux commands in this article (comm, uniq, etc.) require that you sort the input data first.  Another useful flag of the 'sort' command is the '-R' flag, which re-arranges the lines of the input randomly.  This can be useful for developing a large number of test cases for other software that needs to work no matter what order the lines of a file are in.
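
     The 'sort' command can also key off a specific column, which comes up constantly with CSV data.  Here is a small sketch that sorts the 'sales.csv' file from the grep section by the numeric price in column 3, skipping the header line with 'tail' (covered later in this article):

cat sales.csv | tail -n +2 | sort -t',' -k3 -n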

comm

     What is comm?  Comm is a tool for computing the results of set operations (unions, intersections, and complements) on the lines of text in the input files.

     Why is comm Useful?  Comm is useful when you want to learn something about the lines that are either common or different in two different files.

     How does comm relate to data science?  A great example of how this would be useful in data science is if you had two lists of email addresses:  One file called 'signups.txt' that contains email addresses of people who signed up to your newsletter:

alice@example.com
bob@example.com
carol@example.com
dave@example.com
eve@example.com

     and another file called 'purchases.txt' that contains the email addresses of people who purchased your product:

bob@example.com
eve@example.com
frank@example.com
grace@example.com

     Given these files, you might want to know the answers to three different questions:  1)  Which users signed up and also made a purchase?  2)  Which users signed up for the newsletter, but didn't convert to a purchase?  3)  Which users made a purchase but didn't sign up for the newsletter?  Using the 'comm' command, you can answer all three of these questions easily.  Here's the command we can use to find the users who signed up for the newsletter and also made a purchase:

comm -12 signups.txt purchases.txt

     which produces the following result:

bob@example.com
eve@example.com

     And here's how we can find out who signed up for the newsletter, but didn't convert:

comm -23 signups.txt purchases.txt

     which produces the following result:

alice@example.com
carol@example.com
dave@example.com

     And finally, here's a command that shows people who made a purchase without signing up to the newsletter:

comm -13 signups.txt purchases.txt

     which produces the following result:

frank@example.com
grace@example.com

     The 'comm' command requires that any input you pass to it be sorted first.  Usually, your input files won't be pre-sorted, but you can use the following syntax in bash to use the sort command directly when passing input to 'comm' without needing to create any extra files:

comm -12 <(sort signups.txt) <(sort purchases.txt)

uniq

     What is uniq?  The 'uniq' command helps you answer questions about uniqueness.

     Why is uniq Useful?  If you want to de-duplicate lines and only output the unique ones, uniq will do this.  Want to know how many times each item is duplicated?  Uniq will tell you that.  Want to *only* output the duplicated items (for example, to sanity check input that should already be unique)?  You can do that too.

     How does uniq relate to data science?  Let's say you have a file full of sales data called 'sales.csv':

Shoes,19.00
Shoes,28.00
Pants,77.00
Socks,12.00
Shirt,22.00
Socks,12.00
Socks,12.00
Boots,82.00

     and you'd like a concise list of all unique products that are referenced in the data set.  You can just use awk to grab the product and pipe the result into sort and then uniq:

cat sales.csv | awk -F',' '{print $1}' | sort | uniq

     and here is the result:

Boots
Pants
Shirt
Shoes
Socks

     The next thing you might want to know is how many of each unique item were sold:

cat sales.csv | awk -F',' '{print $1}' | sort | uniq -c

     and here is the result:

      1 Boots
      1 Pants
      1 Shirt
      2 Shoes
      3 Socks

     You can also use the '-d' flag with uniq to get a list of items that occur more than once.  This can be useful when working with lists that are already almost unique.
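
     For example, here is a small sketch that lists only the product names appearing more than once in 'sales.csv', which would print 'Shoes' and 'Socks' for the data above:

cat sales.csv | awk -F',' '{print $1}' | sort | uniq -d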

tr

     What is tr?  The 'tr' command is a tool that can remove or replace individual characters, or character 'sets'.

     Why is tr Useful?  The most common reason that I've found for using the 'tr' command is to remove unwanted carriage return characters in files that were created on Windows machines.  Here's an example that illustrates this and pipes the result into xxd so we can inspect the hexadecimal:

echo -en "Hello\r" | tr -d "\r" | xxd

     You can also use the 'tr' command for other special case corrections that you may need to apply in the middle of some other unix pipe.  For example, sometimes, you may encounter binary data that uses null character delimiting instead of newlines.  You can replace all null characters in a file with newlines using the following tr command:

echo -en "\0" | tr \\0 \\n | xxd

     Note that the double '\' character in the above command is required because tr expects "\0" to denote the null character, but the '\' itself needs to be escaped on the shell.  The above command shows the result being piped into 'xxd' so you can verify the result.  In an actual use case, you probably don't want xxd on the end of this pipe.

     How does tr relate to data science?  The 'tr' command's relationship to data science isn't as profound as some of the other commands listed here, but it is often an essential addition for special-case fixes and cleanups that may be necessary in another stage of processing your data.
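
     For example, here is a minimal sketch (using a hypothetical file 'data_windows.csv' that was exported from a Windows tool) that strips carriage returns so the last field on each line isn't polluted by a trailing '\r' character:

# Remove Windows carriage returns, then print the last field of each row
cat data_windows.csv | tr -d '\r' | awk -F',' '{print $NF}'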

cat

     What is cat?  The 'cat' command is a tool that you can use to concatenate files together and print them to stdout.

     Why is cat Useful?  The cat command is useful whenever you need to stitch multiple files together, or if you want to output file(s) to stdout.

     How does cat relate to data science?  The 'concatenation' feature of the 'cat' command does come up quite a bit when performing data science tasks.  A common case is when you encounter multiple csv files with similarly formatted content that you want to aggregate.  Let's say you have 3 .csv files with email addresses from newsletter signups, purchases, and a purchased list.  You might want to calculate the potential reach you have over all your user data, so you want to count the number of unique emails you have in all 3 of these files.  You can use cat to print them all out together and then use sort and uniq to print out the unique set of emails:

#  Assume column 1 is the email and csv is tab delimited:
cat signups.csv purchases.csv purchased.csv | awk -F'\t' '{print $1}' | sort | uniq
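
     To turn that unique list into an actual count of distinct addresses, you can append the 'wc' command (covered later in this article):

cat signups.csv purchases.csv purchased.csv | awk -F'\t' '{print $1}' | sort | uniq | wc -l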

     It's likely that you're used to seeing people use 'cat' to read a file and pipe it into some other program:

cat file.txt | somecommand

     You will also occasionally see people point out that this is a 'useless' use of cat and isn't necessary because you can use this syntax instead:

somecommand < file.txt

head

     What is head?  The 'head' command allows you to print out only the first few lines (or bytes) of a file.

     Why is head Useful?  This is useful if you want to view a small part of a huge (many GiB) file, or if you want to calculate the 'top 3' results from another part of your analysis.
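
     For example, here is a quick sketch that peeks at just the first five lines of a potentially enormous file without reading the whole thing:

head -n 5 sales.csv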

     How does head relate to data science?  Let's say you have a file 'sales.csv' that contains a list of sales data for products you've sold:

Shoes,19.00
Shoes,19.00
Pants,77.00
Pants,77.00
Shoes,19.00
Shoes,28.00
Pants,77.00
Boots,22.00
Socks,12.00
Socks,12.00
Socks,12.00
Shirt,22.00
Socks,12.00
Boots,82.00
Boots,82.00

     You might want to know the answer to the following question: 'What are the top 3 selling products ordered from most popular to least popular?'.  You can answer that question with the following pipe:

cat sales.csv | awk -F',' '{print $1}' | sort | uniq -c | sort -n -r | head -n 3

     The shell pipe above feeds the sales data into 'awk', which prints only the first column of each line.  We then sort the product names (because 'uniq' requires sorted input) and use 'uniq' to obtain the counts of unique products.  To order the list of product counts from greatest to least, we use 'sort -n -r' for a reverse numeric sort on the counts.  Finally, we pipe the complete list into 'head -n 3' to see only the top 3 items:

      4 Socks
      4 Shoes
      3 Pants

tail

     What is tail?  The 'tail' command is a companion to the 'head' command so you can expect it to work similarly, except it prints the end of a file instead of the start.

     Why is tail Useful?  The 'tail' command is useful for all the same tasks that the 'head' command is useful for.

     How does tail relate to data science?  Here's an example of how you can use the following command to calculate the bottom 3 products by number of sales for the sales data in the previous section:

cat sales.csv | awk -F',' '{print $1}' | sort | uniq -c | sort -n -r | tail -n 3

     And the result is:

      3 Pants
      3 Boots
      1 Shirt

     Note that this might not be the presentation format you want since the lowest count is at the bottom.  To see the lowest count at the top of the output you could use the 'head' command instead without reverse order sorting:

cat sales.csv | awk -F',' '{print $1}' | sort | uniq -c | sort -n | head -n 3

     And the result is:

      1 Shirt
      3 Boots
      3 Pants

     Another great use case for the tail command is to trim off the first line of a file.  For example, if you have CSV data like this:

product,price
Shoes,19.00
Shoes,28.00
Pants,77.00
Socks,12.00
Shirt,22.00

     and you try to count the distinct products using awk and uniq as follows:

cat sales.csv | awk -F',' '{print $1}' | sort | uniq -c

     you'll end up with the following output:

      1 Pants
      1 product
      1 Shirt
      2 Shoes
      1 Socks

     which contains the word 'product' from the header, which we don't actually want.  What we need to do is trim off the header line and start processing only data on the remaining lines (line 2 in our case).  We can do this by using the 'tail' command with the '+' before the line number (1-based indexing) where we want to start outputting data:

cat sales.csv | tail -n +2 | awk -F',' '{print $1}' | sort | uniq -c

     and now we get the desired result with the header omitted:

      1 Pants
      1 Shirt
      2 Shoes
      1 Socks

wc

     What is wc?  The 'wc' command is a tool that you can use to obtain word counts, line counts, and character counts.

     Why is wc Useful?  This command is useful any time you want to quickly answer the questions 'How many lines are there?' or 'How many characters is this?'.

     How does wc relate to data science?  More often than you'd probably expect, many quick questions can be rephrased as 'How many lines are in this file?'  Want to know how many emails are on your mailing list?  You can probably just use this command:

wc -l emails.csv

     and possibly subtract one from the result (if there is a csv header included in the file).
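
     Alternatively, you can reuse the 'tail' trick from earlier to skip the header line so no subtraction is needed:

tail -n +2 emails.csv | wc -l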

     If you have multiple files in the current directory and you want to count the lines of all of them (including a total), you can use a wildcard:

wc -l *.csv

     It's often useful to count the number of characters in a piece of text or in a file.  You can even paste text into an echo statement (use -n to avoid the newline which would increase the count by 1):

echo -n "Here is some text that you'll get a character count for" | wc -c

     And the result is:

55

find

     What is find?  The 'find' command can search for files using a number of different options and it also has the ability to execute commands against each file.

     Why is find Useful?  The find command is useful for searching for files given a number of different options (file/directory type, file size, file permissions, etc.), but one of its most useful features comes from the '-exec' option which lets you execute a command against the file once it's been found.

     How does find relate to data science?  First, let's show an example of how you can use the find command to list all files in and under the current directory:

find .

     As you saw with the 'wc' command above, you can count the number of lines in all files in the current directory.  However, if you want to do something like iterate over all files, directories, and sub-directories to get a total line count of every file (for example to do a total line count in your code base) you can use find to print out the text for every file and then pipe the *aggregate* output of every file into 'wc' to get the line count:

find . -type f -exec cat {} \; | wc -l
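
     If you only care about certain file types, find's '-name' option narrows the search.  For example, here is a sketch that counts the total number of lines across only the .csv files in and under the current directory:

find . -type f -name "*.csv" -exec cat {} \; | wc -l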

     The syntax for the '-exec' option with the find command is probably one of the hardest to remember since it isn't a common pattern repeated elsewhere.  Here is another example of how to use the find command to replace the word 'dog' with the word 'cat' in every file in and under the current directory:

find . -type f -exec sed -i 's/dog/cat/g' {} \;

     Of course, you can run a command just like the one above on another specific directory other than the current directory by changing the '.' to be the directory you want.  Just be careful running find with '-exec' especially if you're running as root!  You could do a lot of damage if you run the wrong command against the '/' directory by accident.

tsort

     What is tsort?  The 'tsort' is a tool that can be used to perform a Topological Sort.

     Why is tsort Useful?  'Topological sorting' is the solution to a number of real-world problems that you probably encounter on a daily basis without noticing it.  One very famous example is the problem of coming up with a schedule to complete a number of tasks that can't be started until a previous task has been completed.  Considerations like this are necessary during construction work since you can't complete the work of painting the walls until the drywall has been installed.  You can't install the drywall until the electrical work has been done, and you can't complete the electrical work until the wall framing has been completed, etc.  You might be able to keep this all in your head if you're just building a house, but large construction projects require more automated methods.  Let's review an example using construction tasks for building a house in the file 'task_dependencies.txt':

wall_framing foundation
foundation excavation
excavation construction_permits
dry_wall electrical
electrical wall_framing
wall_painting crack_filling
crack_filling dry_wall

     In the above file, each line consists of two 'words'.  When the 'tsort' command processes the file, it will assume that the first word describes something that needs to come after the second word.  After all the lines have been processed, 'tsort' will output all of the words ordered from the most downstream task to the least downstream one.  Let's try that now:

cat task_dependencies.txt | tsort

     And here is the result:

wall_painting
crack_filling
dry_wall
electrical
wall_framing
foundation
excavation
construction_permits

     As you may recall from above, you can use the '-R' flag with the 'sort' command to get a random ordering of lines in a file.  If we do repeated 'random' sorts on our list of dependencies and pipe it into tsort, you'll note that the result is always the same even though the output from 'sort -R' is different every time:

cat task_dependencies.txt | sort -R | tsort

     This is because the actual order in which the tasks depend on each other doesn't change even if we re-arrange the lines in this file.

     This only scratches the surface of what topological sorting is useful for, but hopefully this piques your interest enough for you to check out the Wikipedia page on Topological Sorting.

     How does tsort relate to data science?  Topological sorting is a fundamental graph theory problem that comes up in lots of places:  Machine learning; logistics; scheduling; project management etc.
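
     For example, here is a tiny sketch of the same idea applied to a data pipeline: each line lists a derived dataset followed by the dataset it is built from (hypothetical names), and tsort produces a valid processing order, most derived first, just as above:

echo "report cleaned_data
cleaned_data raw_dump
features cleaned_data" | tsort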

tee

     What is tee?  The 'tee' command is a tool that lets you write a copy of a stream to a file while simultaneously passing that stream along to the rest of the pipe.

     Why is tee Useful?  Let's say you want to run a script and watch the output, but also log it to a file at the same time.  You can do that with 'tee':

./myscript.sh | tee run.log
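
     If you'd rather keep appending to the same log file across multiple runs instead of overwriting it, tee's '-a' flag does that:

./myscript.sh | tee -a run.log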

     How does tee relate to data science?  The 'tee' command doesn't actually do anything analytical for you, but it can be very useful if you're trying to debug why a complicated shell pipe isn't working.  Let's take an example from above and put references to the 'tee' command between every stage of the pipe:

cat sales.csv | tail -n +2 | tee after_tail.log | awk -F',' '{print $1}' | tee after_awk.log | sort | tee after_sort.log | uniq -c | tee after_uniq.log

     Now, when you run this command, you'll get 4 files that all show what the output looked like at each stage in the process.  This can be extremely convenient if you want to be able to go back and inspect a shell pipe that experienced infrequent or complicated errors.  Complicated regex patterns that are often employed in pipes like this can sometimes match things you don't expect them to, so using this method you can easily gain more insight into every stage of what's going on.

The '>' Symbol

     What is the '>' Symbol?   The '>' symbol is an output redirection operator that redirects a command's standard output.

     Why is the '>' Symbol Useful?  The '>' symbol can be used to redirect output to a file instead of having it print to the screen:

cat sales.csv | tail -n +2 | awk -F',' '{print $1}' | sort | uniq -c > unique_counts.txt
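
     The related '>>' operator appends to the file instead of overwriting it, which is handy when building up a result file over several commands:

echo "extra line" >> unique_counts.txt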

The '<' Symbol

     What is the '<' Symbol?  The '<' symbol is an input redirection symbol that can direct the contents of a file to the input of a program.  This is an alternative to the 'useless use of cat' pattern discussed above:

grep Pants < sales.csv

Confusing Results With Unicode

     One common problem that you'll run into eventually is related to mixing different Unicode encodings.  Specifically, it's worth noting that many enterprise software providers will choose UTF-16 over UTF-8 when encoding .CSV files or database dumps.

     For example, let's say you want to grep through a bunch of files for all instances of the word 'Hello'.  First, you can check what the file contains:

cat sometext.txt

     and you see that it contains the text 'Hello':

Hello World!

     but then when you use grep to search the file, you may get nothing:

grep Hello sometext.txt

     How could this possibly happen?  The answer becomes a bit more clear when you look at the file in hex:

xxd sometext.txt

     gives the following output:

00000000: fffe 4800 6500 6c00 6c00 6f00 2000 5700  ..H.e.l.l.o. .W.
00000010: 6f00 7200 6c00 6400 2100 0a00            o.r.l.d.!...

     What happened here is that the file 'sometext.txt' is encoded in UTF-16, but your terminal is (probably) set to use UTF-8 by default.  Printing the characters from the UTF-16 encoded text to the UTF-8 terminal doesn't show any apparent problem, because the UTF-16 null bytes aren't rendered on the terminal, and every remaining byte is just a regular ASCII character that looks identical to its UTF-8 encoding.

     As you can see in the above output, this file isn't encoded in UTF-8; it's encoded in UTF-16LE.  The text 'Hello' isn't found because when you grep for 'Hello' on the command line, the characters you type get interpreted using the character encoding currently set in the terminal environment (which is probably UTF-8).  Therefore, the search string doesn't include the extra null bytes that follow each ASCII character, so the search fails.  If you want to search for the UTF-16 characters anyway, you could use this grep search:

grep -aP "H\x00e\x00l\x00l\x00o\x00" sometext.txt

     The 'a' flag is necessary to turn on binary file searching since the null characters in UTF-16 cause the files to be interpreted as binary by grep.  The 'P' flag specifies that the grep pattern should be interpreted as a Perl regex which will cause the '\x' escaping to be interpreted.
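
     If you're not sure in advance which encoding you're dealing with, the 'file' command will usually identify it for you (the exact wording of its output varies between versions):

file sometext.txt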

     Alternatively, if you know that a given file is in UTF-16, but you'd prefer to just convert it into UTF-8 format, you can do so with the following command:

iconv -f UTF-16 -t UTF-8 sometext.txt > sometext-utf-8.txt

     Now you don't need to take any special steps when processing this file since its encoding is likely compatible with your terminal's current encoding.  Inspecting the converted file with 'xxd' confirms that it's now plain single-byte text:

00000000: 4865 6c6c 6f20 576f 726c 6421 0a         Hello World!.

Piping Directly From A Database

     You're not much of a data scientist if you can't work with databases.  Fortunately, most common database applications have some mechanism for running ad-hoc queries directly from the command line.  Note that this practice is quite hacky and not at all recommended for serious investigations, but rather for getting fast, low-fidelity results.  Let's start out with an example using a PostgreSQL server.  Assume that you have a simple database table called 'urls':

DROP TABLE urls;
CREATE TABLE urls (
  id serial NOT NULL PRIMARY KEY,
  url character varying(1000)
);
insert into urls (url) values ('http://example.com/');
insert into urls (url) values ('http://example.com/foo.html');
insert into urls (url) values ('http://example.org/index.html');
insert into urls (url) values ('http://google.ca/');
insert into urls (url) values ('http://google.ca/abc.html');
insert into urls (url) values ('https://google.ca/404.html');
insert into urls (url) values ('http://example.co.uk/');
insert into urls (url) values ('http://twitter.com/');
insert into urls (url) values ('http://blog.robertelder.org/');

     And you'd like to create a list that shows you how common each domain name is among the urls in this table.  You can start off by creating a command to extract the url data (and include commas for similar queries with multiple columns):

psql -d mydatascience -t -A -F"," -c "select url from urls;"

     which produces this output:

http://example.com/
http://example.com/foo.html
http://example.org/index.html
http://google.ca/
http://google.ca/abc.html
https://google.ca/404.html
http://example.co.uk/
http://twitter.com/
http://blog.robertelder.org/

     Now, we can add on to this pipe to use a simple regex to pick out only the domain name:

psql -d mydatascience -t -A -F"," -c "select url from urls;" | sed -E "s/^https?:\/\/([^\/]+).*/\1/"

     Here is the list we're working with now:

example.com
example.com
example.org
google.ca
google.ca
google.ca
example.co.uk
twitter.com
blog.robertelder.org

     And now we can use the sort/uniq tricks we learned about above to come up with our final solution:

psql -d mydatascience -t -A -F"," -c "select url from urls;" | sed -E "s/^https?:\/\/([^\/]+).*/\1/" | sort | uniq -c | sort -n -r

     And here is the final product, all produced from a quick shell one-liner:

      3 google.ca
      2 example.com
      1 twitter.com
      1 example.org
      1 example.co.uk
      1 blog.robertelder.org

     The MySQL client has a similar set of command-line options for extracting data onto the command-line:

mysql ... -s -r -N -e "select 1,2;"
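
     SQLite has a similar facility as well (a sketch assuming a hypothetical database file 'mydatascience.db'):

sqlite3 -csv mydatascience.db "select url from urls;"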

     Of course, you may argue that your favourite query language would be able to do this all directly on the SQL command-line as a single query, but the point here is to show that you can do it on the command-line if you need to.

Conclusion

     As we've discussed in this article, there are a number of Linux commands that can be very useful for quickly solving data science problems.  This article only shows a couple of useful flags for each command, but in reality there are dozens more.  Hopefully, your interest has been piqued enough to research them further.
