Alastair Butler - Homepage

Contact details:
Theory and Typology Division
National Institute for Japanese Language and Linguistics
10-2 Midori-cho, Tachikawa City, Tokyo, 190-8561, JAPAN

email: ajb129 __AT__ hotmail __DOT__ com

This page offers a summary of my research interests, and then gathers links to my work.

Research interests

I am interested in all aspects of natural language syntax and semantics. My research (see Resources, Tools, Publications) mixes generative linguistics, dynamic semantics, and functional programming, with the aim of building models and methods that help characterise and understand properties of natural language. A particular concern is how dependencies, formally captured with operator bindings, are managed. My research typically involves implementing ideas as computer programs to (i) check whether things work out and (ii) observe consequences at scale. My ambition is to better understand how natural language works while also building tools of practical value, from automated methods that assist parsed corpus annotation through to systems for generating language, extracting information, preparing training data, and checking system performance.

Resources

The following are corpus resources that can be viewed on the web and downloaded.

Keyaki Treebank — annotates phrase structure with functional and zero information for Japanese sentences.

NINJAL Parsed Corpus of Modern Japanese (NPCMJ) — an official product of the National Institute for Japanese Language and Linguistics that extends and develops the Keyaki Treebank, adding lemmas and romanisation, and presents the parsed data through web-based interfaces.

Treebank Semantics Parsed Corpus (TSPC) — contains parsed corpus annotation for a range of English texts (fiction, law, newswire, nonfiction, poetry, textbook, Wikipedia). Annotation follows a scheme informed by both the Penn Historical Corpora scheme (adopting tag labels, construction analysis, and CorpusSearch format) and the SUSANNE scheme (adopting construction analysis, functional and grammatical information, and the forming of complex expressions). It was built primarily as a testing ground for Treebank Semantics.

SUSANNETS — a conversion of Geoffrey Sampson's SUSANNE treebank into the same format as the TSPC.

The Man'yoshu97 Parsed Corpus (M97PC) — provides parsed annotation for the first 97 poems of the Man'yoshu, which contains the oldest attested forms of the Japanese language, Old Japanese (OJ).


Tools

The following tools were developed for parsed corpus building as well as wider research efforts.

Treebank Semantics — automatically obtains meaning representations from utterances of natural language given as parsed expressions following treebank guidelines.

View Semantics — illustrates ways to further process results from Treebank Semantics.

Generation Tools — offers a way to generate natural language from results of Treebank Semantics.

Normalization Tools — offers ways to normalize parsed analyses in various treebank formats.

Combining Tools — offers ways to combine various treebank formats.

Tree Search Tools — provides ways to process trees to obtain search patterns.

HARUNIWA2 Parser — a pipeline for parsing Japanese, trained on data from the Keyaki Treebank.

Ranking Tools — provides an automatic way to select a parse result.

Keyaki-aid — offers assistance with building parsed corpora by providing methods to access, (re-)process, and visualise parsed data.



Last updated: Jan 2, 2018