Creating a Corpus: Issues in the Digital Text Processing of Cantonese, Hakkanese, and Taigi

Date

2024-08

Publisher

Ohio State University. Libraries

Abstract

The encoding of texts written in Chinese characters posed a challenge in the early stages of digital technology. In the 21st century, the digital representation of Mandarin-based Standard Written Chinese faces few issues, and those that remain are typically limited to outdated or overly regional software. One major barrier persists, however: the representation of the non-Mandarin varieties of Chinese, which do not have a widely accepted or encoded orthography. The present text explores some of the challenges faced when creating a multilingual corpus of translations of Le Petit Prince (The Little Prince) by Antoine de Saint-Exupéry into Cantonese, Hakkanese, and Taigi. The results show that, despite progress in Unicode representation, the technological gap between Standard Mandarin and other dialects remains large.

Keywords

Dialect writing, text digitalization, Chinese dialectology, minority language, encoding

Citation

Ueda, Paul, Ka Fai Law, and Marjorie K.M. Chan. "Creating a Corpus: Issues in the Digital Text Processing of Cantonese, Hakkanese, and Taigi." Buckeye East Asian Linguistics, vol. 8 (August 2024), pp. 172-191.