Taiwan's NDAP Language Archives Project: From bronze inscription texts to Austronesian field recording
Cui-xia Weng*, Ru-yng Chang*, Elizabeth Zeitoun*, Chao-jung Chen*, Derming Juang*, Chu-ren Huang*, and Chin-chuan Cheng#
*Academia Sinica and #City University of Hong Kong
0 Abstract
The Language Archives Project is part
of Taiwan's National Digital Archives Program (NDAP). The project
digitizes and archives a wide range of linguistic data, from heritage
texts to endangered Formosan languages. The goal is two-fold: both to
preserve unique cultural heritages and to provide a comprehensive
linguistic infrastructure to support content interpretation of archives.
Based on these two goals, the main challenges of this project are: to
provide versatile yet uniform presentation of different text types, to
account for language change, and to account for language variation.
We take two archives of contrasting characteristics to illustrate how these challenges are met. The Bronze Inscription archives deal with an archaic language preserved in a written form that differs significantly from Modern Chinese writing. The Formosan (i.e. Taiwan Austronesian) archives deal with indigenous languages that are endangered and have no written conventions. We show how OLACMS lays the common ground for content documentation of these contrasting archives.

First, for the Bronze Inscription archives, the fundamental issue is how to represent the archaic inscribed written form while establishing direct correspondences with modern writing systems. We adopt the Intelligent Character System to deal with this issue. Although glyphs vary greatly, the composition of Chinese characters from basic glyphs remains regular. Hence a system based on the composition of basic glyphs will not only help with diachronic Chinese archives but can also deal with cross-lingual variations (e.g. Korean and Japanese Kanji, new characters from Hong Kong, etc.).

Second, the Formosan languages are indigenous languages of Taiwan that are also thought to be close to the common ancestor of the Austronesian languages. The first issue we face is establishing an orthography, which is solved by the common use of IPA among field linguists. The second issue is establishing segmentation and tagging standards. The third is the audio representation of field recordings. The last is mapping the lexicon to a GIS (geographic information system) to represent language variations and contrasts.
1.0 Introduction
The Language Archives Project is part of Taiwan's five-year National Digital Archives Program (NDAP), which was launched in 2002. The NDAP Language Archives Project, carried out primarily at Academia Sinica, digitizes and archives a wide range of linguistic data, from heritage texts to endangered Formosan languages. Given these diverse data types, the project faces two main challenges: how to digitize and annotate the data properly, and how to provide a versatile yet uniform presentation that accounts for language change and language variation. The project has two goals: to preserve unique cultural heritages and to provide a comprehensive linguistic infrastructure to support content interpretation of archives.

In this paper, we first briefly describe each sub-project. We then discuss how OLACMS lays the common ground for content documentation of these contrasting archives. In the more detailed discussion that follows, we focus on two archives of contrasting characteristics to illustrate how we meet the challenges mentioned above. The Bronze Inscription archives deal with an archaic language preserved in a written form that differs significantly from Modern Chinese writing. Hence we take examples from this archive to show how the missing character problem is solved in our project. Lastly, the Formosan (i.e. Taiwan Austronesian) archives deal with indigenous languages that are endangered and have no written conventions. We offer our example of how to create a multimodal archive for endangered languages. The paper ends with a short conclusion.
2.0 Organization of the NDAP Language Archives Project
The Language Archives Project has two branch projects, on Chinese and on Formosan language archives. The former is further divided into five sub-projects, which represent different language usages and historical periods of Chinese. The Formosan language archives project aims to preserve the endangered Formosan Austronesian languages with corpora, lexicons and grammars of each language.

The Formosan languages are Austronesian languages. There is great diversity and complexity among these indigenous languages spoken in Taiwan. Unfortunately, most of them are endangered. Hence we aim not only to preserve their linguistic data and structures but also to preserve some of their cultural heritage through the preservation of the languages. This is why audio storytelling and mapping to GIS are used. More details on our approaches and experience are given in section 5.

Among the Chinese archives, "Early Mandarin Chinese Lexicon" is designed as part of the Lexical Knowledgebase (LKB), which traces the historical changes of the Chinese language. The LKB will contain a series of synchronic lexicons from Pre-Qin to Modern Mandarin. The archived materials include written records of lectures, documents of laws and decrees, and fiction and drama of the Ming-Qing period.

The "Lexicon of Pre-Qin Bronze Inscriptions and Bamboo Scripts (LBB)" project aims to build a lexicon of Yin, Zhou, and Chun Qiu bronze inscriptions (from the 13th century through the 3rd century BC) and the bamboo manuscripts of the Warring States period (475BC-221BC). This will be one of the earliest lexicons in the LKB series on the evolution of the Chinese language. For a long time, manual copying and the reproduction of rubbings were the two ways to preserve archaic written languages. A lexical database that preserves these characteristic ideograms would therefore be a great step forward in archiving ancient Chinese culture. The first difficulty in developing this kind of database is the missing character problem in computers. This project has adopted the Intelligent Character System to solve this problem. We describe this system in section 4.

The "Modern Chinese Corpus and Treebank" project will complete a 10-million-word tagged and balanced corpus of modern Mandarin, as well as a grammatically annotated treebank. The emphasis will be on value-added applications such as information search, retrieval, automatic Q & A, and summarization.

The "New Age Corpus: Linguistic Representations and Archives of Multimedia Data" project documents, in digital multimedia form, the everyday usage of modern Chinese in Taiwan, such as oral communication, discussion topics, lexicons, gestures and facial expressions.

The "Southern-Min Archive: A Database of Historical Change in Language Distribution" project is a new addition in 2003. It aims to provide both historical depth and sociological variation to the archives of Chinese languages in Taiwan.

All subsidiary projects of the Language Archives Project are listed below for easier reference. In addition, the three main components of the Linguistic Anchoring project are included. The Linguistic Anchoring project is an NDAP technology research and development project. Its goal is to provide the infrastructure for language-based knowledge processing and management. The anchoring reference of this project will transform the language archive contents into inter-operable information. The website is at http://LingAnchor.sinica.edu.tw/

Table 1: All subsidiary projects of the Language Archives Project, and the Linguistic Anchoring Project

Diagram 1, using the numeral and alphabetical designation of each subsidiary project given above, illustrates the functional structure of the Language Archives Project. The green column at the center represents the standards and tools supporting the digitizing and archiving of language data. Each of the peripheral circles extending from the column represents a language group or variety. The diagram shows how a sharable and reusable set of technologies can be used to support a wide range of language archives. This is exactly the design feature of OLAC, which we adopt and discuss in the following section.

Diagram 1: Functional Structure of the Language Archives Project
3.0 The Application of OLAC to the NDAP Language Archives Project
The Open Language Archives Community (OLAC) is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. Three primary standards serve to bridge the gaps that now lie between language resources and users: (1) OLACMS: the OLAC Metadata Set (Qualified DC, Dublin Core), (2) OLAC MHP: refinements to the OAI (Open Archives Initiative) protocol, and (3) OLAC Process: a procedure for identifying Best Common Practice Recommendations. In December 2002 an OLAC Workshop (IRCS Workshop on Open Language Archives) was held in Philadelphia, which revised the OLAC standards and controlled vocabularies, reviewed OLAC archives and services, and considered proposals for new activities. The metadata format, the OLAC extensions, and the procedures for defining and documenting a third-party extension are described in version 1.0 of the OLAC Metadata standard.

The NDAP language archives plan to be OLAC compliant. Three of the resultant archives are already registered with the OLAC repository: the Academia Sinica Balanced Corpus of Modern Chinese, the Academia Sinica Formosan Language Archive, and the Academia Sinica Tagged Corpus of Early Mandarin Chinese. These resources have also been registered at the OLAC-compliant Asian Language Resources Repository (hosted by Tokyo Institute of Technology, yet to be released).

Chang and Huang (2002) reported on the application of OLACMS to the Language Archives Project. They found that OLACMS provides a solid basis that will allow productive and in-depth description of our archives, given some extensions and elaborations. The additional information that we need comprises temporal and geographic location, as well as textual information such as style, mode, genre, and medium. The suggested additions and elaborations are discussed in sections 3.1-3.4.
3.1 Temporal and Geographic Location
Since China used a different calendar system until the early 20th century, the temporal descriptions of inherited Chinese archives do not conform to the current DC standard. The Chinese calendar sub-type will therefore include time, dynasty name, state name, and emperor's reign. We may also add other chronological methods, such as the lunar or solar calendar. Take the Academia Sinica Ancient Chinese Corpus for example. Its coverage is Early Mandarin Chinese. Users will be able to refer to a historical calendar and find that this period corresponds to the Yuan, Ming, and Qing dynasties. They will also be able to convert the time to the Western calendar using the conversion table provided by Academia Sinica, which covers the past 2000 years.

When Coverage has a spatial refinement, a location can have different names because of the unit used in cataloguing, as well as because of temporal and linguistic variations. Describing spatial coverage therefore requires more than a place name: Washington State is different from Washington D.C., and Taipei City is different from Taipei County. Hence we need to define sub-types of spatial description that include Continent, Country, Administrative Division, Longitude, Latitude, Address, etc.
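The sub-types proposed above can be pictured as structured records rather than free-text coverage strings. The following Python sketch is an illustration only: the class names, field names, and the `to_fields` helper are assumptions made for this example and are not part of OLACMS, Dublin Core, or the project's actual schema.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional, Tuple


@dataclass
class TemporalCoverage:
    """Hypothetical record for the Chinese-calendar sub-types named in the text."""
    dynasty: Optional[str] = None
    state: Optional[str] = None
    reign: Optional[str] = None      # emperor's reign
    calendar: Optional[str] = None   # e.g. "lunar" or "solar"
    time: Optional[str] = None       # free-form time expression


@dataclass
class SpatialCoverage:
    """Hypothetical record for the spatial sub-types named in the text."""
    continent: Optional[str] = None
    country: Optional[str] = None
    administrative_division: Optional[str] = None
    longitude: Optional[float] = None
    latitude: Optional[float] = None
    address: Optional[str] = None


def to_fields(prefix: str, coverage) -> List[Tuple[str, object]]:
    """Flatten a record into (qualified name, value) pairs, skipping empty sub-types."""
    return [(f"{prefix}.{name}", value)
            for name, value in asdict(coverage).items()
            if value is not None]


# The same place name is disambiguated by its administrative division.
taipei_city = SpatialCoverage(continent="Asia", country="Taiwan",
                              administrative_division="Taipei City")
taipei_county = SpatialCoverage(continent="Asia", country="Taiwan",
                                administrative_division="Taipei County")

# Early Mandarin Chinese corresponds to three dynasties, one record each.
early_mandarin = [TemporalCoverage(dynasty=d) for d in ("Yuan", "Ming", "Qing")]

for name, value in to_fields("spatial", taipei_city):
    print(name, "=", value)
```

Keeping each sub-type as a separate field means two records that share a place name still compare as different, which is the distinction the Taipei City and Taipei County example calls for.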
3.2 Mode and Genre
Each text in the Academia Sinica Balanced Corpus of Modern Chinese (Sinica Corpus) is marked up with five textual parameters: Mode, Genre, Style, Topic and Medium. This is important textual information that needs to be catalogued in the metadata.
Table 2: The relation between Mode and Genre of Sinica Corpus (Chinese Knowledge Information Processing (CKIP) 1993)
3.3 Style
Sinica Corpus differentiates four styles: narrative, argumentative, expository, and descriptive.
3.4 Medium
Sinica Corpus specifies the media of the language resources as: Newspaper, General Magazine, Academic Journal, Textbook, Reference Book, Thesis, General Book, Audio/Visual Medium, and Conversation/Interview.
Table 3: Topic of Sinica Corpus (CKIP 1993)
An example of the adoption follows, for a Sinica Corpus text with a Topic of Arts and a sub-topic of Music:
<topic xml:lang="x-sil-CHN">Art/Music</topic>
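An element like the topic example above could be generated programmatically rather than written by hand. The sketch below uses Python's standard `xml.etree.ElementTree`. The helper name `topic_element` and the convention of joining topic and sub-topic with a slash are assumptions inferred from the single example, not a documented part of the project's tooling.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute in ElementTree's notation.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def topic_element(topic, subtopic=None, lang="x-sil-CHN"):
    """Build a <topic> element; topic and sub-topic are joined with '/' (assumed convention)."""
    element = ET.Element("topic")
    element.set(XML_LANG, lang)
    element.text = topic if subtopic is None else f"{topic}/{subtopic}"
    return element


serialized = ET.tostring(topic_element("Art", "Music"), encoding="unicode")
print(serialized)

# Reading the element back recovers the language code and the topic path.
parsed = ET.fromstring(serialized)
print(parsed.get(XML_LANG), parsed.text.split("/"))
```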
4.0 The Intelligent Character System
The Intelligent Character System, developed by the Chinese Document Processing Lab of the Institute of Information Science at Academia Sinica, contains four main parts: components, glyphs, operators and production rules. Components are the basic units of a glyph. [Glyph images: an example glyph and its components]
There are three basic operators to express the structure of a glyph: horizontal, vertical and contained composition. [Glyph images: the same example expressed with the operators]
Table 4: Decomposition of [glyph image]
Although glyphs vary greatly, their composition from basic components basically follows these three production rules. Hence an encoding scheme based on the composition of basic components will not only help with diachronic Chinese archives but can also deal with cross-lingual variations in writing systems that also use glyphs to compose their characters (e.g. Korean and Japanese Kanji, new glyphs from Hong Kong, etc.).
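The idea of describing a glyph by composing components with the three operators can be sketched as a small expression tree. The Python below is a minimal illustration and not the Intelligent Character System's actual encoding: the operator letters `H`, `V` and `C`, the class names, and the example characters (明, 男, 国, 盟) are all choices made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Component:
    """A basic unit that is not decomposed further."""
    name: str


@dataclass(frozen=True)
class Compose:
    """Two glyphs joined by one of the three operators (hypothetical letter codes)."""
    op: str            # "H" horizontal, "V" vertical, "C" contained
    first: "Glyph"     # left part, top part, or enclosing part
    second: "Glyph"    # right part, bottom part, or enclosed part


Glyph = Union[Component, Compose]
OPERATORS = {"H": "horizontal", "V": "vertical", "C": "contained"}


def components(glyph: Glyph) -> List[str]:
    """List the basic components of a glyph in reading order."""
    if isinstance(glyph, Component):
        return [glyph.name]
    return components(glyph.first) + components(glyph.second)


def expression(glyph: Glyph) -> str:
    """Serialize the composition as a nested operator expression."""
    if isinstance(glyph, Component):
        return glyph.name
    if glyph.op not in OPERATORS:
        raise ValueError(f"unknown operator: {glyph.op}")
    return f"{glyph.op}({expression(glyph.first)},{expression(glyph.second)})"


ming = Compose("H", Component("日"), Component("月"))   # 明: side by side
nan = Compose("V", Component("田"), Component("力"))    # 男: one above the other
guo = Compose("C", Component("囗"), Component("玉"))    # 国: one inside the other
meng = Compose("V", ming, Component("皿"))              # 盟: a composed glyph reused as a part

for glyph in (ming, nan, guo, meng):
    print(expression(glyph), components(glyph))
```

Because a composed glyph can itself appear as a part of another glyph, a character missing from the available character set can still be described and compared through its composition expression.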
4.2 User Interface
The Intelligent Character System provides tools and a Hanzi (Chinese character) glyph database that let users browse and transform missing characters through a Java Applet on the webpage. In addition, users can edit and browse missing characters in the Microsoft Office environment, and a user interface is provided to search and access missing characters. Diagram 2 charts the system's structure, showing the processing between the tools and the database.
Diagram 2: Database and tools for the Intelligent Character System
The basic idea of the Intelligent Character System comes from the traditional study of the "forms" of Chinese characters, especially knowledge about glyphs. With the assistance of modern technology, the system not only solves the missing character problem in the Lexicon of Pre-Qin Bronze Inscriptions and Bamboo Scripts project, but also improves the performance of the existing Hanzi processing system for possible future applications such as data sharing.
5.0 The Formosan Language Digital Archive
The Formosan languages are indigenous languages of Taiwan that are also thought to be close to the common ancestor of the Austronesian languages. According to linguistic studies, 15 languages are still extant (Thao, Kavalan, Pazeh, Atayal, Saisiyat, Bunun, Tsou, Rukai, Paiwan, Puyuma, Amis, Seediq, Saaroa, Kanakanavu, Yami), but they are declining rapidly. So far, this project has built corpora for four of the six Rukai dialects (Mantauran, Tanan, Maga, and Tona), which can be browsed and searched via the internet. Others are being added to the archive progressively. It is hoped that by the end of this project at least nine Formosan languages will have been archived.
The Formosan language archive, which offers both Chinese and English browsing displays, contains three main types of database: (1) corpora with annotated texts, (2) a language GIS (geographic information system), and (3) four bibliographical databases. These databases support a wide range of research and are briefly introduced below.
5.1 Linguistic Corpora
The collection of Formosan language corpora includes folktales, narratives, conversations, songs and elicited sentences. The last two categories are not yet available on the web. An annotated text comprises the transcription of the original language, divided into paragraphs and sentences; glosses; and free translations. IPA symbols are used to transcribe the collected texts, for two reasons: (1) there is no standardized writing system for Formosan languages, and (2) IPA is an international standard for transcribing sound recordings in other archives projects.
Glosses, on the other hand, can be provided at the word level (stems) or at the morphemic level (roots and affixes). For the Rukai corpus, morphemic analysis has been adopted for the annotations. The information tagged on each morpheme covers grammatical functions and lexical and syntactic categories. The tags for grammatical functions follow Formosan linguistics conventions; the tagset of abbreviations of grammatical functions used in the corpora is shown in Table 5. The tags for lexical categories follow the CKIP standard (CKIP 1993), with some reservations.
In addition to the annotations, each sentence can be heard as audio output, which was digitally recorded in the original file and then converted to MP3 format. This audio representation allows users to download recorded sentences, to view and analyze the sound spectrograms, and to process the sound data with sound editing software. Figure 2 gives an example of how a text is displayed on the webpage interface.
Table 5: Abbreviations of grammatical functions (for Rukai as a pilot study)
In addition, a set of metadata describing general information about a text, such as the text profile, fieldwork activity, and management statements, was developed. This will facilitate data access and sharing with other similar resources in the future. Figure 3 shows the metadata information of one text.
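The layered annotation described in this section (transcription, morpheme-level glosses, free translation, and a linked audio file) can be pictured as a small data structure. The Python sketch below is an assumption-laden illustration: the class names, the hyphen used to join morphemes, and the file name are hypothetical, and the forms and glosses are placeholders rather than real Rukai data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Morpheme:
    form: str                       # transcription of a root or affix
    gloss: str                      # grammatical-function tag or lexical gloss
    category: Optional[str] = None  # lexical or syntactic category tag


@dataclass
class Word:
    morphemes: List[Morpheme]

    def form(self) -> str:
        return "-".join(m.form for m in self.morphemes)

    def gloss(self) -> str:
        return "-".join(m.gloss for m in self.morphemes)


@dataclass
class Sentence:
    words: List[Word]
    translation: str                  # free translation
    audio_file: Optional[str] = None  # recording of this sentence, e.g. an MP3


def interlinear(sentence: Sentence) -> str:
    """Render forms and glosses as two aligned lines, followed by the translation."""
    forms = [w.form() for w in sentence.words]
    glosses = [w.gloss() for w in sentence.words]
    widths = [max(len(f), len(g)) for f, g in zip(forms, glosses)]
    form_line = "  ".join(f.ljust(w) for f, w in zip(forms, widths)).rstrip()
    gloss_line = "  ".join(g.ljust(w) for g, w in zip(glosses, widths)).rstrip()
    return "\n".join([form_line, gloss_line, f"'{sentence.translation}'"])


# Placeholder data only: these are not real forms, glosses, or file names.
sentence = Sentence(
    words=[
        Word([Morpheme("xx", "AFX"), Morpheme("yyyy", "stem1")]),
        Word([Morpheme("zz", "stem2")]),
    ],
    translation="placeholder translation",
    audio_file="sentence001.mp3",
)
print(interlinear(sentence))
```

Storing glosses per morpheme lets the same record serve both a word-level and a morphemic display, since the word-level line is derived by joining the morphemes.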
5.2 The Formosan Geographical Information System
In the geographic information database, the language distribution search enables users to learn the geographical distribution of each language and dialect. Another search system, the comparative word search, allows users to spot the distribution of cognates and non-cognates within the Formosan languages and to identify spatial features. In the future, we hope to add two functions to this database: (1) a system to observe the expansion or decrease of a particular linguistic community over the last hundred years; and (2) an audio recording mapping system.
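A comparative word search of the kind described above can be thought of as grouping lexical entries for one meaning by cognate set and returning the locations where each set is attested. The Python sketch below shows that grouping under stated assumptions: the `Entry` fields, the function name, the language labels, the forms, and the coordinates are all hypothetical placeholders, not data or code from the archive.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Entry:
    language: str      # language or dialect label
    meaning: str       # the concept being compared
    form: str          # transcribed word form
    cognate_set: str   # identifier shared by cognate forms
    latitude: float
    longitude: float


def comparative_search(entries: List[Entry],
                       meaning: str) -> Dict[str, List[Tuple[str, float, float]]]:
    """Group the entries for one meaning by cognate set, keeping map coordinates."""
    groups = defaultdict(list)
    for entry in entries:
        if entry.meaning == meaning:
            groups[entry.cognate_set].append(
                (entry.language, entry.latitude, entry.longitude))
    return dict(groups)


# Placeholder entries: labels, forms, and coordinates are invented for the sketch.
entries = [
    Entry("Language A", "water", "form-a", "set-1", 23.0, 121.0),
    Entry("Language B", "water", "form-b", "set-1", 23.5, 120.8),
    Entry("Language C", "water", "form-c", "set-2", 22.7, 120.9),
    Entry("Language A", "fire", "form-d", "set-3", 23.0, 121.0),
]

for cognate_set, places in comparative_search(entries, "water").items():
    print(cognate_set, places)
```

Each group could then be drawn on the map with its own marker, so that cognate and non-cognate areas become visible as spatial patterns.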
5.3 Four Bibliographic Databases
The on-line reference search system gives users access to information on the Formosan languages in four areas: linguistic references, indigenous teaching references, indigenous literature, and music references. This information is regularly updated. It is hoped that complete and abundant Formosan language bibliographic databases will be achieved to satisfy the needs of linguistic workers.
6.0 Conclusion
Unlike artifacts and specimens, languages are non-physical, and this is the biggest challenge to language archives. Advances in technology make it possible for us to digitize ancient written manuscripts, as well as oral or written records of an endangered language. However, preservation goes beyond digitization. To prevent the digitized data from becoming cold and lifeless digital antiques, we emphasize the reusability, sharability, and accessibility of the archives. This philosophy coincides with the vision and mission of OLAC. In describing the NDAP Language Archives Project in Taiwan, we showed that the digitization of two archives as different as early Chinese documents and endangered Formosan languages can be done under the same project and with the same infrastructure. This is important testimony to the open archives initiative vision. We hope that this work can symbolize a small step in the direction where linguistic and cultural diversity can be accepted and shared by all.
References
Referential Websites