Why Every Data Scientist Should Know Command Line Tools

Like many data scientists, I regularly need to clean a lot of data on a short timeline: concatenating many files together, chopping off headers and footers, and counting lines to quickly check the size of a file.
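
To make that concrete, here are the kinds of one-liners I mean (the file names are just placeholders):

    cat part_*.csv > all.csv   # concatenate many files into one
    head -n 20 all.csv         # peek at the first 20 lines
    tail -n 20 all.csv         # peek at the last 20 lines
    wc -l all.csv              # count lines to check the size of the file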

The UNIX command line is great for basic data processing tasks because its tools stream data instead of loading it. If you have a file with millions of rows, the usual approach in a higher-level language pulls the entire file into memory before you can do anything with it, and that can take an unacceptably long time. Command-line utilities process a file line by line, so you can work on the whole thing without worrying about your task taking hours, because it is never necessary to hold the entire file in memory.
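
As a quick illustration (the file name is hypothetical), counting the rows of a multi-gigabyte export never strains memory, because wc streams the file line by line:

    # Memory use stays flat no matter how large sales.csv is,
    # since wc never holds more than one line at a time.
    wc -l sales.csv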

Command line utilities are also useful because of their simplicity. In higher-level languages, reading in data assumes it arrives in a particular format, such as a tab-delimited file. If the data violates that assumption, recognizing and fixing the problem can require a lot of debugging and digging through documentation to find the proper way to handle it.
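
When a file’s format is in doubt, a few seconds of inspection usually settles it. A minimal sketch, assuming a mystery file called data.txt and testing for a comma delimiter:

    # Eyeball the first few lines to spot the delimiter and any preamble.
    head -n 3 data.txt

    # Count comma-separated fields per line; more than one distinct count
    # means some rows don't match the format you expected.
    awk -F',' '{ print NF }' data.txt | sort | uniq -c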

For example, I recently started working with Google AdWords data. When pulling AdWords exports without the API, there is often a size restriction, so the data arrives split across many files. If I had wanted to combine them into one large AdWords file without the command line, this is what the process would have looked like:

  1. Open multi-megabyte file.csv in Excel. Wait a few minutes. Immediately notice that there’s header information.

  2. Try to read file.csv into Python while skipping the header. What’s the argument called again? Google “Pandas read_csv documentation.” Oh okay it’s “skiprows”. Wait how many lines was the header again? Open Excel, wait another few minutes, and get the right number of lines to skip. Reread data with skiprows=6.

  3. TypeError?! What is this?! Check Excel again. Scroll to the bottom and notice that there’s a “total” row containing aggregated summary statistics for the entire file.

  4. Reread the file into Python. What’s the argument to skip the footer? Okay, it’s just skipfooter. What a relief. Load data with skipfooter=1.

  5. Now that I’ve successfully read one file in, how do I concatenate a bunch of files together in Python? After a bunch of Stack Overflow posts, I find a solution using the glob module and pd.concat.

  6. Read every data file into Python. Pray I don’t run out of memory. Concatenate them and export as a .csv.

Now here’s how I would do the same thing using the command line:

  1. head file.csv. In milliseconds, I have a display of the first ten lines of the file. I see that there’s a six-line header I need to get rid of.

  2. tail file.csv. In milliseconds, I see a one-line footer I need to get rid of.

  3. Iterate over all files in the directory: tail -n +7 file_{i}.csv | head -n -1 >> big_file.csv (the full loop is sketched just after this list). Wait a few minutes.

  4. …That’s it?! Surely it can’t be this straightforward!
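
For the record, here’s a minimal sketch of step 3 as an actual loop. The file names are assumptions (your exports could be called anything), and head -n -1, which drops the last line, is a GNU coreutils extension:

    # Skip the six-line header, drop the one-line footer, and append
    # each cleaned file to big_file.csv.
    for f in file_*.csv; do
        tail -n +7 "$f" | head -n -1 >> big_file.csv
    done

Because everything streams, memory use stays constant no matter how many files are sitting in the directory.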

A process that could take half a day in powerful tools like Python and R is reduced to a few minutes on the command line. The command line doesn’t care that your CSV file has extra header or footer lines, and it doesn’t care how big your files are. It does exactly what you need it to do, and that’s precisely why it’s so useful.