Symposium on Language Resources

Julia Hockenmaier (University of Illinois at Urbana-Champaign)

"The future role of language resources for natural language parsing (We won't be able to rely on Pierre Vinken forever... or will we have to?)"

The transformation that natural language parsing has undergone since the nineties would have been impossible without the availability of syntactically annotated corpora such as the Penn Treebank and similar resources for other languages. By now, it has become increasingly difficult to improve parsing accuracy on our standard data sets. But as we move to other domains of text, or aim to recover the richer representations that are required for natural language understanding, it is also clear that parsing is far from being a solved task. In this panel, I would like to initiate a discussion about the kind of language resources needed to advance natural language parsing. I will also reflect on what the translation of existing resources into other grammatical representations has taught us about treebank design.


Julia Hockenmaier is an assistant professor in Computer Science at the University of Illinois. Her research on the translation of the Penn Treebank into Combinatory Categorial Grammar has enabled the creation of the first wide-coverage statistical parsers for CCG. Although her main interest and background are in natural language processing and computational linguistics, she has also shown how ideas from statistical parsing can be used to model how proteins fold, and is currently working with students and colleagues at Illinois at the interface of language processing and computer vision.

Thomas Hun-Tak Lee (The Chinese University of Hong Kong)

"The acquisition of word order in a topic-prominent language: Corpus findings and experimental investigation"

It is well known that Chinese has SVO as its predominant word order, with the variant orders OSV and SOV marking topic and focus, intimately linked to the topic-prominence of the language. Assuming early setting of the head parameter in syntactic acquisition and the peripheral positions of topic and focus in clausal structure, one might hypothesize that Chinese-speaking children will acquire the predominant SVO order early, but develop the variant orders later. Such acquisition findings, if true, would seemingly go against the idea of a topic prominence parameter. This paper explores the development of topic prominence by examining word order in early child Chinese, based on naturalistic longitudinal corpora as well as cross-sectional experiments that investigated the relative accessibility of different word orders.

Our naturalistic data show that the word order of two-year-old Chinese children reflects adherence to canonical mapping of thematic roles to structural positions, as well as sensitivity to the unselectivity of subject and object. While sentences of OSV order appeared around two years of age, double nominative structures were virtually absent before three, suggesting that as a typological characteristic, topic-prominence is not acquired early. Our experimental results show that Mandarin-speaking children by three years of age have established SVO solidly as the dominant word order, in both comprehension and production, but still find the topicalized and fronting orders (OSV, SOV) difficult, indicating that the structures of the left periphery may be acquired at a later stage, and at different times. The implications of these acquisition findings for the topic prominence parameter will be explored.


Thomas Hun-tak Lee is an acquisition researcher teaching at the Department of Linguistics and Modern Languages of The Chinese University of Hong Kong. He received his PhD in Linguistics from UCLA. His research has centered on the acquisition of syntax and semantics and issues of learnability, with special reference to the quantificational competence of Mandarin-speaking and Cantonese-speaking children. He led the construction of CANCORP (The Hong Kong Cantonese Child Language Corpus). In recent years, he has been heading a Chinese Early Language Acquisition (CELA) project in Beijing, Hunan and Hong Kong, investigating how Chinese children acquire the core properties of the target languages from infancy to three years of age.

He is currently Associate Editor of the Journal of Chinese Linguistics, and a member of the editorial boards of the Journal of East Asian Linguistics and the Taiwan Journal of Linguistics. He has held visiting professorships at Guangdong Foreign Studies University, National Chung Cheng University, Hunan University, Beijing Language and Culture University, and Nankai University. He was President of the Linguistic Society of Hong Kong in 1990-1991, and is currently President of the International Association of Chinese Linguistics.

Masataka Goto (National Institute of Advanced Industrial Science and Technology)

"PodCastle: A Spoken Document Retrieval Service Improved by Anonymous User Contributions" (with Jun Ogata)

PodCastle is a public web service that provides full-text searching of speech data (podcasts) on the basis of automatic speech recognition technologies. PodCastle enables users to find podcasts that include a search term and read full texts of their recognition results. However, even state-of-the-art speech recognizers cannot correctly transcribe all podcasts because their content and recording environments vary widely. PodCastle therefore encourages a number of anonymous users to cooperate by correcting speech recognition errors through our original easy-to-use interface so that podcasts can be searched more reliably. Furthermore, by using the resulting corrections to train our speech recognizer, it implements a mechanism whereby speech recognition performance gradually improves. This is an instance of our new research approach, "Speech Recognition Research 2.0", which is aimed at providing users with a web service based on Web 2.0 so that they can experience state-of-the-art speech recognition performance, and at promoting speech recognition technologies in cooperation with anonymous users. We hope that this project will prove the importance and potential of incorporating user contributions into automatic pattern recognition technologies, and that various other projects will follow our approach, thus adding a new dimension to this field of research.


Dr. Masataka Goto is the leader of the Media Interaction Group at the National Institute of Advanced Industrial Science and Technology (AIST), Japan. In 1992 he was one of the first to start work on automatic music understanding, and has since been at the forefront of research in music technologies and music interfaces based on those technologies. Since 1998 he has also worked on speech recognition interfaces. He has published more than 160 papers in refereed journals and international conferences. Over the past 18 years, he has received 25 awards, including the Young Scientists' Prize, the Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology, the Excellence Award in Fundamental Science of the DoCoMo Mobile Science Awards, and the Best Paper Award of the Information Processing Society of Japan (IPSJ). He has served as a committee member of over 60 scientific societies and conferences and was the General Chair of the 10th International Society for Music Information Retrieval Conference (ISMIR 2009), and the Chair of the IPSJ Special Interest Group on Music and Computer (SIGMUS).

Symposium: Toward a Linguistic Theory of Ba: Semantics and Pragmatics of Ba

(This symposium is supported in part by, and organized in conjunction with, JSPS Grant-in-Aid for Challenging Exploratory Research Number 21652041, entitled Construction of Linguistics of 'Ba', Semantics and Pragmatics of 'Ba')

Sachiko Ide (Japan Women's University)

"How and why two strangers can co-create a story: Application of the ‘ba’-theory based approach to the discourse"

This presentation addresses the question: how and why can a pair of teacher-student interactants co-create a story? In analyzing discourse data recorded while interactants try to make a coherent story by arranging cards, the ‘ba’ (field) theory is employed. The ‘ba’-based approach is an innovative framework for thinking that assumes 1) the inside perspective of the subject, 2) dual-mode thinking of the self, 3) a dynamic model like an improvised drama, and 4) two modes of communication, i.e. overt and covert. The data have been analyzed into two phases of discourse: the dialogue discourse and the merging discourse. The former is indexed by interpersonal modalities such as honorifics and sentence-final particles, while the latter is characterized by the dropping of these linguistic features. The sudden drop of the presupposed use of modalities by a teacher is obviously a deviation, but it would serve as a creative use (cf. Silverstein 1967). It is in this merging discourse that the discourse phenomena of repetition, simultaneous utterances, and chaining utterances occur. These phenomena add no information to the conversation, but function to synchronize and entrain the interactants. When synchronization and entrainment are maintained between the interactants, covert communication is maintained, which creates a basis for smooth overt communication. I will argue that it is by the ‘ba’-theory-based approach to discourse that we can illuminate the dynamic processes of co-creating a story by interactants.


Professor Sachiko Ide started her study of linguistics while she was an undergraduate student at Japan Women's University, and continued her studies at the Graduate School of International Christian University in Tokyo. While she was teaching at Japan Women's University, she was engaged in research in linguistics and linguistic anthropology at the University of Wisconsin at Madison, Harvard University and MIT, and at the University of North Carolina at Chapel Hill. Currently she is Professor Emeritus at Japan Women's University and Research Fellow at the Media Network Center of Waseda University, and serves as President of the International Pragmatics Association.

Stanley Peters (Stanford University)

"Listening In"

Communicating agents are commonly thought of as intentionally addressing messages to other agents. A growing body of research exists on the interactive case: natural language dialogue. A somewhat different case, also important in many real life social and work settings, is a person overhearing or intentionally listening in on dialogue among a group of other people. Comparatively little research so far illuminates how, for example, a minute taker for a meeting can comprehend a discussion well enough to accurately record decisions, action items, and other such meeting outcomes, including ones that concern technical matters he does not understand. What prevents the small misunderstandings that frequently creep into discussions, even between active participants, from growing into a gross misunderstanding by the minute taker of the discussion to which he is listening?

This talk will present some similarities and differences between participating in a conversation and listening in on one, with emphasis on how overhearers who lack opportunities to contribute to a discussion target their interpretive efforts in productive ways. Progress in creating artificial agents capable of similar listening feats will be surveyed and research directions assessed.

Hideyuki Nakashima (Future University Hakodate)

"Situated Language: Case of Japanese"

Language and thought shape, or at least influence, each other. The Japanese view of the world differs from that of Western countries, and this is reflected in linguistic structures. Japanese takes an insect's-eye view, in contrast to the Western bird's-eye view. Reflecting this, the Japanese language, and therefore the Japanese thought process, is well suited to the use of situated expressions. This talk elaborates on the point and tries to formalize dialog processes in a situated manner.


Dr. Hideyuki Nakashima has been the president of Future University Hakodate since 2004. Before this, he was the director of the Cyber Assist Research Center, National Institute of Advanced Industrial Science and Technology (AIST). He received his Ph.D. in information science from the University of Tokyo in 1983. His research fields include ubiquitous computing, multiagent systems, artificial intelligence, and cognitive science. He has a long-standing collaborative relationship with researchers at CSLI, Stanford University.

His home page is at http://www.fun.ac.jp/~nakashim/welcome.html.

He served as a program committee co-chair for ICMAS (Int. Conf. on MultiAgent Systems) 2000, a general co-chair for AAMAS (Int. Conf. on Autonomous Agents and MultiAgent Systems) 2006 and a general co-chair for PRICAI (Pacific Rim Int. Conf. on AI) 2006. He has also served as a program committee member for many international conferences including IJCAI (Int. Joint Conf. on AI), ICMAS, AAMAS, and UbiComp (Int. Conf. on Ubiquitous Computing).

He was on the editorial board for Journal of AI Research (1993-1995), an associate editor for Journal of Visual Languages and Computing (2002-2008), and is on the editorial board for Journal of Agent and MultiAgent Systems (1998-2010).

Plenary Speakers

Qun Liu (Institute of Computing Technology, Chinese Academy of Sciences)

"Statistical Translation Model Based On Source Syntax Structure"

Syntax-based statistical translation models have been shown to outperform phrase-based models, especially for language pairs with very different syntactic structures, such as Chinese and English. In this talk I will introduce a series of statistical translation models based on source syntax structure. The tree-based model uses the one-best syntax tree for translation. The forest-based model uses a compact forest, which encodes an exponential number of syntax trees in polynomial space and leads to better performance. The joint parsing and translation model produces source parse trees using the source side of the translation rules instead of separate parsing rules, and simultaneously generates translations on the target side; it outperforms the forest-based model. Some extensions of these models are also introduced.


Qun Liu is a researcher and professor in the Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS). He received his Master's degree in computer science at ICT/CAS and has remained there as a researcher ever since. He received his PhD degree at Peking University in 2004. His research interests are mainly in natural language processing, especially machine translation and Chinese language processing. From 1993 to 2000, he developed a rule-based Chinese-English machine translation system using a unification-based grammar in cooperation with Peking University. He and one of his students developed the most popular open source Chinese word segmenter, ICTCLAS, in 2002. Since 2004 he and his group have been devoted to statistical machine translation research and have published more than 20 papers in the top conferences and journals, including ACL, EMNLP, COLING and the Computational Linguistics journal, among which the COLING-ACL06 paper obtained the ACL-AFNLP Meritorious Asian NLP Paper Award. In these papers he and his group proposed a series of statistical translation models based on source syntax structure. His group has also achieved high ranks in open machine translation competitions such as NIST and IWSLT. He gives Computational Linguistics and Machine Translation courses in CAS every year. He served as machine translation area co-chair for COLING2010 and EMNLP2010. He is now an associate editor of ACM Transactions on Asian Language Information Processing.

Shirley N. Dita (De La Salle University)

"A morphosyntactic analysis of the pronominal system of Philippine languages"

Pronominal orientation is widely argued to be a universal component of human languages. Meanwhile, the pronominal system of Philippine languages has always been an obscure subject of investigation. With approximately 150 living languages in the archipelago, the structures of pronominals are just as numerous. This study attempts to explicate the grammatical functions, along with other known phenomena such as cliticization, homophony, inclusivity/exclusivity, person-deixis interface, and hierarchy, of some languages in the Philippines. Using an ergative-absolutive case marking analysis, this cross-linguistic investigation of Philippine languages presents examples that illustrate the distinctive features of the personal pronouns.

Using a 100,000-word corpus for each language included, the study reveals various similarities and differences: (1) some languages allow encliticization and some do not; (2) homophony, as well as inclusivity/exclusivity, is a persistent feature of the languages; and (3) the strength of hierarchy poses semantic constraints, among others.


Shirley N. Dita is Assistant Professor in the Department of English and Applied Linguistics of De La Salle University, the Philippines. She holds a Ph.D. in Applied Linguistics from the same university, where she graduated With Distinction. Her dissertation, entitled A Reference Grammar of Ibanag, is currently in press as a monograph of Lambert Academic Publishing (LAP), Germany. She is editing a volume entitled Issues in Applied Linguistics in the Philippines: A Decade in Retrospect as part of the De La Salle University Centennial Publication Series. Shirley has been involved in the corpus building of Philippine languages and is doing a project on a dictionary of Itawit, a minor Philippine language, funded by the National Commission of Culture and the Arts (NCCA) of the Government of the Republic of the Philippines. Currently, she serves as Secretary of the Linguistics Society of the Philippines. Shirley considers Austronesian linguistics and world Englishes her areas of interest, on which she has read papers and given lectures at various conferences in the Philippines and abroad.

Hee-Rahk Chae (Hankuk University of Foreign Studies)

"Basic Units of Lexicons and Ontologies: Words, Senses and Concepts"

Dictionaries have been one of the most important resources for linguistic research and applications. Ontologies are also becoming an indispensable resource not only for linguistics but also for other areas dealing with knowledge. In many cases, however, they fall short of our expectations. One reason for this shortfall is that their basic units are not well established. There are two kinds of basic units of dictionaries: head words and (word) senses. Head words have to be words rather than affixes or phrases. The meaning of a word has to be carved into different senses on the basis of objective criteria. The building blocks of ontologies have to be (simple and/or complex) concepts rather than senses. We will examine the morpho-syntactic status of head words in Korean (and Japanese) dictionaries. It will be shown that many head words are phrases and, hence, have to be removed from the list of head words. In addition, many elements that are treated as affixes are actually words and, hence, have to be registered as head words. We need to realize that agglutinative languages like Japanese and Korean have many clitics, i.e. (syntactic) words which have some affixal properties as well. Then, we will consider basic units of ontologies. Some scholars argue that they have to be senses rather than concepts. However, many scholars assume that they have to be concepts rather than senses. We will show, based on a variety of phenomena, that building blocks of ontologies should be concepts.


Hee-Rahk Chae obtained his Ph.D. in linguistics from the Ohio State Univ. in 1992. The title of his thesis is "Lexically Triggered Unbounded Discontinuities in English: An Indexed Phrase Structure Grammar Approach." He is a professor in the Dept. of Linguistics and Cognitive Science at Hankuk Univ. of Foreign Studies, Korea. Since March 2006, he has been leading a Brain Korea 21 team, whose research topic is "A Study of the Language-Neutral Ontology." Currently he is also serving as the President of the Korean Society for Cognitive Science and as the Secretary-in-General of the International Association for Cognitive Science. He has worked on such topics as light verb constructions, concord adverbial constructions, constructions involving clitic elements (case markers, postpositions, delimiters, etc.) and the like. With reference to the ontology project, he has focused on elucidating basic units of lexicons and ontologies, and relationships between them.

