Pivot joining 19 million lines of CSV with unix tools
I’m participating in Kodoeba, an open hackathon around the Tatoeba example sentence project.
My plan is to incorporate Mandarin-English example sentence pairs from Tatoeba into the Chinese example sentence search on Chinese Boost, mainly to allow searching them via pinyin with or without tone-marks.
The first challenge was actually to extract the Mandarin-English example sentence pairs from Tatoeba’s data files.
The data is structured in three separate TSV files (tab-separated values, equivalent to CSV):
These have columns of data, most importantly an id column which identifies each sentence:
6539139 cmn 原來如此。
1284 eng I will be back soon.
links.tsv file is like a pivot table that links the pairs of sentences
This would be fairly easy to join together in a naïve way using some loops and hashes, but the total number of lines across the files is quite large:
wc -l cmn_sentences.tsv eng_sentences.tsv links.tsv
62829 cmn_sentences.tsv 1327755 eng_sentences.tsv 17670835 links.tsv 19061419 total
join tool can handle this with a bit of fiddling around, though. The
join tool seems to be using a binary search to do the joining, as it requires
that the input is sorted. We have the
sort tool for that.
Sorting one of the TSV files by the id column with
sort looks like this:
sort -t $'\t' -k 1 cmn_sentences.tsv
-t is indicating that the column separator is a tab character, and the
-k is saying that we want to sort on the first column.
That gives output like:
... 989154 cmn 你能证明吗 ... 989157 cmn 他的故事肯定是真的。 ...
We can then piece that together with
join to join the sorted Mandarin
sentences with the sorted links like this:
join -t $'\t' <(sort -t $'\t' -k 1 cmn_sentences.tsv) <(sort -t $'\t' -k 1 links.tsv)
That gives us a list of rows with the Mandarin sentence and the English sentence id on the end:
995177 cmn 他有太多書。 302096
We can then use that to join with the English sentences, but we need to sort it by the English sentence id on the end first, like this:
join -t $'\t' <(sort -t $'\t' -k 1 cmn_sentences.tsv) <(sort -t $'\t' -k 1 links.tsv) | sort -t $'\t' -k 4
Finally, we can put the whole thing together to join the English and Mandarin sentences via the links:
join -t $'\t' -1 1 -2 4 <(sort -t $'\t' -k 1 eng_sentences.tsv) <(join -t $'\t' <(sort -t $'\t' -k 1 cmn_sentences.tsv) <(sort -t $'\t' -k 1 links.tsv) | sort -t $'\t' -k 4)
This joins across the 19 million rows pretty quickly, according to
real 0m14.032s user 0m6.555s sys 0m0.374s
And we get a final output of Mandarin-English sentence pairs like this:
995333 eng I really like travelling. 984480 cmn 我很喜欢旅游。