I’m participating in Kodoeba, an open hackathon around the Tatoeba example sentence project. My plan is to incorporate Mandarin-English example sentence pairs from Tatoeba into the Chinese example sentence search on Chinese Boost, mainly to allow searching them via pinyin with or without tone marks.
The first challenge was actually to extract the Mandarin-English example
sentence pairs from Tatoeba’s data files.
The data is structured in three separate TSV files (tab-separated values; like CSV, but with tabs as the separator):
- cmn_sentences.tsv
- eng_sentences.tsv
- links.tsv
The sentence files each have three columns: the sentence id, a language code, and the sentence text. The id is what identifies each sentence:
1284 eng I will be back soon.
The links.tsv file is like a pivot table that links pairs of sentences together: each row holds two ids, a sentence and one of its translations.
This would be fairly easy to join together in a naïve way using some loops and hashes (there is a sketch of that approach after the line counts below), but the total number of lines across the files is quite large:
wc -l cmn_sentences.tsv eng_sentences.tsv links.tsv
62829 cmn_sentences.tsv
1327755 eng_sentences.tsv
17670835 links.tsv
19061419 total
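For contrast, that naïve in-memory approach might look something like the following awk sketch, which holds both sentence files in associative arrays keyed by id and then streams through links.tsv (untested and purely illustrative; it keeps well over a million rows in memory):
awk -F'\t' '
  FNR == 1  { file++ }                 # FNR resets per file, so this counts which input we are on
  file == 1 { cmn[$1] = $0; next }     # cmn_sentences.tsv: id -> whole row
  file == 2 { eng[$1] = $0; next }     # eng_sentences.tsv: id -> whole row
  ($1 in cmn) && ($2 in eng) {         # links.tsv: print each matched pair
    print cmn[$1] "\t" eng[$2]
  }
' cmn_sentences.tsv eng_sentences.tsv links.tsv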
The join tool can handle this with a bit of fiddling around, though. Rather than building anything up in memory, join works as a single streaming merge over its two inputs, which is why it requires that both are sorted on the join field. We have the sort tool for that.
Sorting one of the TSV files by the id column with sort looks like this:
sort -t $'\t' -k 1 cmn_sentences.tsv
The -t flag says that the column separator is a tab character, and -k 1 says to sort on the first column. Note that we sort the ids as plain text rather than numerically, because join expects its inputs in the default collating order that sort produces.
That gives output like:
...
989154 cmn 你能证明吗
...
989157 cmn 他的故事肯定是真的。
...
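As an aside, sort -c is a handy way to check whether a file is already sorted on the key before re-running a step; it exits non-zero and reports the first out-of-order line if not:
sort -t $'\t' -k 1 -c cmn_sentences.tsv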
We can then use join, feeding it the sorted Mandarin sentences and the sorted links via bash process substitution, like this:
join -t $'\t' <(sort -t $'\t' -k 1 cmn_sentences.tsv) <(sort -t $'\t' -k 1 links.tsv)
That gives us rows containing the Mandarin sentence id, language code and text, with the id of the linked English sentence on the end as a fourth column.
We can then use that to join with the English sentences, but first we need to sort it by that English sentence id in the fourth column, like this:
join -t $'\t' <(sort -t $'\t' -k 1 cmn_sentences.tsv) <(sort -t $'\t' -k 1 links.tsv) | sort -t $'\t' -k 4
Finally, we can put the whole thing together to join the English and Mandarin sentences via the links, telling join to match the first column of the English file (-1 1) against the fourth column of the Mandarin-plus-links rows (-2 4):
join -t $'\t' -1 1 -2 4 <(sort -t $'\t' -k 1 eng_sentences.tsv) <(join -t $'\t' <(sort -t $'\t' -k 1 cmn_sentences.tsv) <(sort -t $'\t' -k 1 links.tsv) | sort -t $'\t' -k 4)
This joins across the 19 million rows pretty quickly, according to time:
real 0m14.032s
user 0m6.555s
sys 0m0.374s
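If this ever needs to go faster, a common trick is to force byte-wise collation by setting LC_ALL=C on every sort and join in the pipeline; locale-aware string comparison is much slower, and all that matters is that sort and join agree on the ordering. Untimed here, but something like:
LC_ALL=C join -t $'\t' -1 1 -2 4 <(LC_ALL=C sort -t $'\t' -k 1 eng_sentences.tsv) <(LC_ALL=C join -t $'\t' <(LC_ALL=C sort -t $'\t' -k 1 cmn_sentences.tsv) <(LC_ALL=C sort -t $'\t' -k 1 links.tsv) | LC_ALL=C sort -t $'\t' -k 4)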
And we get a final output of Mandarin-English sentence pairs like this:
995333 eng I really like travelling. 984480 cmn 我很喜欢旅游。
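If only the sentence texts themselves are needed, the output can be trimmed down to the English and Mandarin text columns (fields 3 and 6 of the joined rows) with cut and written to a file (cmn_eng_pairs.tsv is just an illustrative name), for example:
join -t $'\t' -1 1 -2 4 <(sort -t $'\t' -k 1 eng_sentences.tsv) <(join -t $'\t' <(sort -t $'\t' -k 1 cmn_sentences.tsv) <(sort -t $'\t' -k 1 links.tsv) | sort -t $'\t' -k 4) | cut -f 3,6 > cmn_eng_pairs.tsv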