Merge large CSV files with headers
If you have ever worked with classic data warehousing tools, you may already know the problem: the CSV export splits the output into multiple files, often with each file containing no more than 1 million observations. For analysis purposes, however, this layout is not always optimal, and the files must be merged into a single one. In bash this is easy with cat, but only if the files have no header. The BigQuery CSV table export, for example, adds a header to each CSV file, and this must be taken into account when merging.
To stay with the BigQuery example, you can extract the header from the first file (cat 0000000000 | head -n1) and then append the contents of all files, skipping the first line of each. These bash commands can process even huge datasets in minutes. Put together, this gives the following command:
{ cat 0000000000 | head -n1 ; for f in 000000000*; do cat "$f" | tail -n+2; done; } > merged.csv
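As an alternative sketch of the same idea, awk can do the merge in a single pass: NR counts lines across all inputs and FNR restarts at 1 for every file, so the condition below keeps the very first line (the header) and drops the first line of every later file. This is a swapped-in technique, not the command from above, and the function name merge_csv_awk is a hypothetical helper chosen for illustration.

```shell
# merge_csv_awk FILE...  (hypothetical helper name)
# Prints the header of the first file once, then the data rows of all files.
merge_csv_awk() {
  # NR == 1 : first line of the whole input, i.e. the header to keep
  # FNR > 1 : every line of a file except that file's own first line
  awk 'FNR > 1 || NR == 1' "$@"
}
```

With the file names used above, it could be called as merge_csv_awk 000000000* > merged.csv. As with the original command, the output file should not match the input pattern.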
For gzip-compressed files, you can simply replace cat with zcat.
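Spelled out, the compressed variant might look like the sketch below. It assumes the exported parts carry a .gz suffix, which the text above does not state, and it wraps the steps in a hypothetical function named merge_csv_gz. The sketch calls gzip -dc, which decompresses to standard output just as zcat does for .gz files; it is used here only because zcat behaves differently on some systems.

```shell
# merge_csv_gz FILE.gz...  (hypothetical helper name)
# Writes the merged, uncompressed CSV to standard output.
merge_csv_gz() {
  local f
  # Header: first line of the first compressed file
  gzip -dc "$1" | head -n 1
  # Data rows: every file, skipping its own header line
  for f in "$@"; do
    gzip -dc "$f" | tail -n +2
  done
}
```

A call could then look like merge_csv_gz 000000000*.gz > merged.csv, with the pattern adjusted to the actual file names.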