Evolving Pre-processing of Raw Corpus: The Digitization Initiative of Cantonese Material at the Sino-Vietnamese Border in the Late 19th Century

Date

2024-11

Publisher

Ohio State University. Libraries

Abstract

The linguistic diversity of the Gulf of Tonkin (GoT) is richly documented in late Qing materials, notably Lagarrue’s (1900) textbook, which writes Cantonese in the Vietnamese alphabet, departing significantly from the standard use of the Latin alphabet. This valuable resource contains over 2,400 vocabulary items, 2,500 unique characters with pronunciation, pronunciation guides, dialogues, and classical Chinese pleadings with Cantonese phonetics written in the Vietnamese alphabet. The corpus also includes trilingual vocabulary, idioms translated into French, and a comparison with late 19th-century Guangzhou Cantonese. This study develops a comprehensive pre-processing workflow for Lagarrue’s corpus, comprising technology-enhanced text organization (manual organization, optical character recognition (OCR), and machine translation), conversion of Lagarrue’s text to Jyutping++, and extraction of linguistic insights through statistical analysis. The methodology includes a Jyutping++ transcription scheme designed for enhanced reversibility and frequency priority, a Vietnamese alphabet decomposition algorithm, useful regular expression patterns for Jyutping++, and an open-access online corpus with search capabilities for researchers worldwide (got.jyutdict.org). Preliminary linguistic findings (Lai et al., 2023) include the merger of the rhymes 豪 and 侯, the merger of the 陽 rhyme with the colloquial reading of the 梗 class, and noticeable instances of the rising tones 古上聲. These findings highlight significant phonological characteristics of the Cantonese dialect at the Sino-Vietnamese border in the late 19th century. They underscore the importance of the pre-processing workflow in facilitating deeper dialectal exploration, and the significance of digitization and open-source efforts in linguistic research.
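
As a rough illustration of what decomposing Vietnamese-alphabet spellings can involve, the following minimal sketch separates the tone diacritic of a syllable from its base letters using Unicode canonical decomposition. It is not the authors' algorithm, which the abstract does not describe: the function name `decompose_syllable`, the `TONE_MARKS` table, and the choice to treat the five modern Vietnamese tone marks as tone indicators are all assumptions made for this example, and Lagarrue's own conventions may differ.

```python
import unicodedata

# Combining marks that indicate tone in modern Vietnamese orthography.
# Assumption for illustration: the historical text uses the same marks.
TONE_MARKS = {
    "\u0300": "grave",
    "\u0301": "acute",
    "\u0303": "tilde",
    "\u0309": "hook above",
    "\u0323": "dot below",
}


def decompose_syllable(syllable):
    """Return (base, tone): the syllable without its tone mark, and the mark's name.

    Vowel-quality diacritics (circumflex, breve, horn) stay on the base letters;
    tone is None when the syllable carries no tone mark.
    """
    # NFD splits precomposed letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = None
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # NFC recomposes the remaining marks onto their base letters.
    base = unicodedata.normalize("NFC", "".join(kept))
    return base, tone


if __name__ == "__main__":
    for word in ["tiếng", "việt", "ma"]:
        print(word, decompose_syllable(word))
```

Under these assumptions, a syllable such as "tiếng" yields the base "tiêng" and the tone mark "acute", which keeps the tone information separate from the segmental spelling before any further conversion step.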

Keywords

Éléments de Langue Chinoise: Dialecte Cantonais, late Qing Cantonese, Vietnamese alphabet, historical corpus, pre-processing

Citation

Huang, Junxin and Lai, Joeng-zit. "Evolving Pre-processing of Raw Corpus: The Digitization Initiative of Cantonese Material at the Sino-Vietnamese Border in the Late 19th Century." Buckeye East Asian Linguistics, vol. 9 (November 2024), pp. 32-51.