Pivot joining 19 million lines of CSV with unix tools
I’m participating in Kodoeba, an open hackathon around the Tatoeba example sentence project.
My plan is to incorporate Mandarin-English example sentence pairs from Tatoeba into the Chinese example sentence search on Chinese Boost, mainly to allow searching them via pinyin with or without tone-marks.
The first challenge was actually to extract the Mandarin-English example sentence pairs from Tatoeba’s data files.
The data is structured in three separate TSV files (tab-separated values, equivalent to CSV):
- cmn_sentences.tsv
- eng_sentences.tsv
- links.tsv
These have columns of data, most importantly an id column which identifies each sentence:
6539139 cmn 原來如此。
1284 eng I will be back soon.
The links.tsv
file is like a pivot table that links the pairs of sentences
together:
995325 989154
This would be fairly easy to join together in a naïve way using some loops and hashes, but the total number of lines across the files is quite large:
wc -l cmn_sentences.tsv eng_sentences.tsv links.tsv
62829 cmn_sentences.tsv
1327755 eng_sentences.tsv
17670835 links.tsv
19061419 total
The join
tool can handle this with a bit of fiddling around, though. The
join
tool seems to be using a binary search to do the joining, as it requires
that the input is sorted. We have the sort
tool for that.
Sorting one of the TSV files by the id column with sort
looks like this:
sort -t $'\t' -k 1 cmn_sentences.tsv
The -t
is indicating that the column separator is a tab character, and the
-k
is saying that we want to sort on the first column.
That gives output like:
...
989154 cmn 你能证明吗
...
989157 cmn 他的故事肯定是真的。
...
We can then piece that together with join
to join the sorted Mandarin
sentences with the sorted links like this:
join -t $'\t' <(sort -t $'\t' -k 1 cmn_sentences.tsv) <(sort -t $'\t' -k 1 links.tsv)
That gives us a list of rows with the Mandarin sentence and the English sentence id on the end:
995177 cmn 他有太多書。 302096
We can then use that to join with the English sentences, but we need to sort it by the English sentence id on the end first, like this:
join -t $'\t' <(sort -t $'\t' -k 1 cmn_sentences.tsv) <(sort -t $'\t' -k 1 links.tsv) | sort -t $'\t' -k 4
Finally, we can put the whole thing together to join the English and Mandarin sentences via the links:
join -t $'\t' -1 1 -2 4 <(sort -t $'\t' -k 1 eng_sentences.tsv) <(join -t $'\t' <(sort -t $'\t' -k 1 cmn_sentences.tsv) <(sort -t $'\t' -k 1 links.tsv) | sort -t $'\t' -k 4)
This joins across the 19 million rows pretty quickly, according to time
:
real 0m14.032s
user 0m6.555s
sys 0m0.374s
And we get a final output of Mandarin-English sentence pairs like this:
995333 eng I really like travelling. 984480 cmn 我很喜欢旅游。