A quick Python script to help me translate classical Chinese poetry

One of my pastimes is translating Chinese poetry and other text into English on my East Asia Student blog. The translations I make are more like annotations, as I try to adhere as closely as I can to the structure and literal meaning of the original text. The idea is that the translation is helpful for reading the original Chinese text, rather than being a full rendering into English.

One aspect of this is the inclusion of pinyin and literal glosses for each line of the poem. For example:


xiāng shuǐ wúqíng diào qǐ zhī

[Xiang] [water] [not have] [feeling] [homage] [how] [know]

the waters of the Xiang are pitiless – if one paid homage who would know?


Again, this is intended to make the original text as accessible as possible, and to show how each line has been interpreted in the translation.

That structure has to be written in a shortcode syntax so that the blog software will produce nice HTML for it on the page. Finding the pinyin, selecting glosses and getting it in the right format is quite time consuming. I had some spare time today so I wrote a rough-and-ready Python script to automate some of that:

#!/usr/bin/env python3

import re
import sys
from typing import List, Dict
from functools import lru_cache
from multiprocessing import Pool

import pinyin
import requests

def process_input_line(input_line: str):
    for poem_line in split_input_line(input_line):

def split_input_line(input_line: str) -> List[str]:
    input_line = re.sub(r",\n*", ",\n", input_line)
    input_line = re.sub(r"。\n*", "。\n", input_line)
    input_line = re.sub(r"\n+", "\n", input_line)
    return [l.strip() for l in re.split("\n", input_line) if l]

def format_poem_line(poem_line: str) -> str:
    return f"""
\{\{\{\{< annotated >\}\}\}\}
  \{\{\{\{< hanzi >\}\}\}\}{poem_line}\{\{\{\{< /hanzi >\}\}\}\}
  \{\{\{\{< reading >\}\}\}\}{pinyin_for_hanzi_line(poem_line)}\{\{\{\{< /reading >\}\}\}\}
  \{\{\{\{< gloss >\}\}\}\}{gloss_for_hanzi_line(poem_line)}\{\{\{\{< /gloss >\}\}\}\}
  \{\{\{\{< yingwen >\}\}\}\}\{\{\{\{< /yingwen >\}\}\}\}
\{\{\{\{< /annotated >\}\}\}\}"""

def pinyin_for_hanzi_line(hanzi_line: str) -> str:
    return " ".join([pinyin.get(hanzi) for hanzi in re.sub(r"[,。]", "", hanzi_line)])

def gloss_for_hanzi_line(hanzi_line: str) -> str:
    with Pool() as p:
        glosses = p.map(gloss_for_hanzi, re.sub(r"[,。]", "", hanzi_line))
    return " ".join(glosses)

def gloss_for_hanzi(hanzi: str) -> str:
    ccdb = ccdb_for_hanzi(hanzi)
    ccdb_definition = ccdb["kDefinition"]
    definitions = re.split(r"[,;]", ccdb_definition)

    return f"[{definitions[0].strip()}]"

def ccdb_for_hanzi(hanzi: str) -> Dict:
    url = f"http://ccdb.hemiola.com/characters/string/{hanzi}?fields=kDefinition,kMandarin"
        response = requests.get(f"http://ccdb.hemiola.com/characters/string/{hanzi}?fields=kDefinition,kMandarin",
                                headers={"User-Agent": "Mozilla Firefox"})
        if response.status_code != 200:
            raise ConnectionError(f"{response.status_code} {response.content}")
        json = response.json()
        return json[0]
    except Exception as err:
        print(f"Error fetching hanzi info for {hanzi} from {url} : {err}", file=sys.stderr)
        return {
            "kDefinition": "",
            "kMandarin": ""

for line in sys.stdin:

For example

echo '狂风吹古月,窃弄章华台。' | ./prepare_poem.py


{{< annotated >}}
  {{< hanzi >}}狂风吹古月,{{< /hanzi >}}
  {{< reading >}}kuáng fēng chuī gŭ yuè ,{{< /reading >}}
  {{< gloss >}}[insane] [wind] [blow] [old] [moon]{{< /gloss >}}
  {{< yingwen >}}{{< /yingwen >}}
{{< /annotated >}}

{{< annotated >}}
  {{< hanzi >}}窃弄章华台。{{< /hanzi >}}
  {{< reading >}}qiè nòng zhāng huá tái 。{{< /reading >}}
  {{< gloss >}}[secretly] [do] [composition] [flowery] [platform]{{< /gloss >}}
  {{< yingwen >}}{{< /yingwen >}}
{{< /annotated >}}

The pinyin will usually require some tweaks, and the glosses will almost all be changed. However, it makes a nice starting point for translating the poem.

This is the kind of programming that I miss when I don’t get the chance to do it: just trying to solve the problem in the quickest way possible without worrying about much else. This script is fragile and boring, but it’s useful to its only user.

Tech mentioned