Creating a Corpus: Issues in the Digital Text Processing of Cantonese, Hakkanese, and Taigi
Abstract
The encoding of texts written with Chinese characters posed a challenge in the early stages of digital technology. In the 21st century, the digital representation of Mandarin-based Standard Written Chinese faces few issues, and those that remain are typically limited to outdated or overly regional software. However, one major barrier remains: the representation of varieties of Chinese that lack a widely accepted or encoded orthography, namely the non-Mandarin varieties. The present text explores some of the challenges faced when creating a multilingual corpus of translations of Le Petit Prince (The Little Prince) by Antoine de Saint-Exupéry into Cantonese, Hakkanese, and Taigi. The results show that, despite progress in Unicode representation, the technological gap between Standard Mandarin and the other varieties remains large.