Section 1 The Keyaki Treebank Manual

1 Introduction

This manual details an annotation scheme for parsing Contemporary Japanese. Syntactic structure is represented with labelled parentheses, adapting the format of the Penn Treebank (Bies et al. 1995) and, more particularly, that of the Penn Historical family of parsed corpora, exemplified by the Annotation manual for the Penn Historical Corpora and the Parsed Corpus of Early English Correspondence (PCEEC) (Santorini 2010). The scheme uses tags that are familiar to generative linguists, eliminates VP structure, has phrasal nodes (NP, PP, ADVP, etc.) that immediately dominate the phrase head (N, P, ADV, etc.), and marks for function all clausal nodes and all NPs that are clause-level constituents. In addition, semantic information for resolving ambiguities has been added.

Annotation practice strives first for observational adequacy. The aim is to present a consistent linguistic analysis for each attestation of an identifiable linguistic relation or process. Relations and processes are treated as uniformly as possible, and their treatments are detailed in this documentation. Setting aside the issue of whether the system of description is theoretically correct, the practice is to render lexical and functional items, parts of speech, constituents of various categories and functions, and constructions defined by combinations of properties in such a way that each is unambiguously identifiable. The documentation sets out basic principles both for the annotator (assigning segmentations, tags, and structural positions) and for the user (searching for classes of items, categories, and relationships between these). Searches combine terminal strings, tag names and extensions, and structural relations between these. Examples of suitable tools for searching the parsed data include CorpusSearch (Randall 2009)^1 and Tregex (Levy and Andrew 2006).^2
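To make the idea of such a search concrete, the following is a minimal Python sketch, not a substitute for CorpusSearch or Tregex. It reads one labelled-bracket tree and looks for nodes by tag name, tag extension, and immediate dominance. The example tree and the function names (`parse`, `find`, `terminals`) are hypothetical illustrations; the tree is a simplified romanized stand-in, not a tree taken from the corpus, and the sketch assumes that extensions are attached to tag names with hyphens.

```python
def parse(text):
    """Parse one labelled-bracket tree into (label, children) tuples.

    Terminal strings are kept as plain Python strings.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must open with a parenthesis"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing parenthesis
        return (label, children)

    return node()


def subtrees(tree):
    """Yield every labelled node of the tree, top-down."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def terminals(tree):
    """Return the terminal strings of the tree in left-to-right order."""
    out = []
    for child in tree[1]:
        if isinstance(child, tuple):
            out.extend(terminals(child))
        else:
            out.append(child)
    return out


def find(tree, base, ext=None, child=None):
    """Find nodes by tag name, optional extension, and optional child tag.

    Assumption: a label such as NP-SBJ is a tag name (NP) followed by
    hyphen-separated extensions (SBJ).
    """
    hits = []
    for label, children in subtrees(tree):
        parts = label.split("-")
        if parts[0] != base:
            continue
        if ext is not None and ext not in parts[1:]:
            continue
        if child is not None and not any(
            isinstance(c, tuple) and c[0] == child for c in children
        ):
            continue
        hits.append((label, children))
    return hits


# Hypothetical example tree (illustrative only, not from the corpus).
EXAMPLE = "(IP-MAT (NP-SBJ (N inu)) (ADVP (ADV yukkuri)) (VB aruku))"
tree = parse(EXAMPLE)

# All NP nodes that immediately dominate an N head.
np_with_head = find(tree, "NP", child="N")
```

Here `find(tree, "NP", child="N")` combines a tag name with a structural relation (immediate dominance), and `find(tree, "NP", ext="SBJ")` would restrict the match by extension instead.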

The current annotation also aims to offer syntactic analysis that can serve as a base for the subsequent generation of predicate-logic-based meaning representations using the methods of Treebank Semantics (Butler 2015).^3 To this end, extra disambiguation information is added to feed the calculation of semantic analyses from the syntactic annotation. One example is the specification of different kinds of clause linkage (i.e., different types of non-final clauses). The annotation identifies two types of subordinate clause linkage with disambiguation tags: CND (conditional) and SCON (non-conditional). Subordinate clause status influences the distribution of empty subject positions within such clauses and the antecedence relationships these positions have with respect to upstairs arguments (according to an antecedent calculation called “control”). These cases contrast with coordinate clause linkage, which is also identified with a disambiguation tag: CONJ. Status as a coordinate clause influences the distribution of arguments shared between that clause and one or more other clauses (according to an antecedent calculation called “Across the Board extraction” (ATB)). With these calculations in place, most antecedent relations in Japanese can be accurately determined without resorting to annotation with overt indexing, provided that the distinction between subordination and coordination is properly annotated. The practice provides a robust basis for calculating semantics, a simplified annotation scheme that is descriptively adequate, and a set of constraints on the distribution of null positions, which have interesting consequences for, for example, the placement of zero pronouns.
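The mapping from disambiguation tag to antecedent calculation described above can be summarized in a small Python sketch. It assumes, purely for illustration, that the disambiguation tag appears as a hyphen-separated extension on the clause label (for example a hypothetical label `IP-ADV-SCON`); the function name `clause_linkage` and the table `LINKAGE` are likewise hypothetical, and the actual calculations in Treebank Semantics are considerably richer than a lookup.

```python
# Disambiguation tag -> (type of clause linkage, antecedent calculation),
# following the three tags named in the text above.
LINKAGE = {
    "CND": ("subordinate", "control"),   # conditional
    "SCON": ("subordinate", "control"),  # non-conditional
    "CONJ": ("coordinate", "ATB"),       # Across the Board extraction
}


def clause_linkage(label):
    """Return (linkage, calculation) for a clause label, or None.

    Assumption: the tag is a hyphen-separated extension of the label.
    """
    for ext in label.split("-")[1:]:
        if ext in LINKAGE:
            return LINKAGE[ext]
    return None
```

Under these assumptions, `clause_linkage("IP-ADV-CONJ")` selects the ATB calculation, while a label carrying none of the three tags yields `None`.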

1. See: http://corpussearch.sourceforge.net/

2. See: http://nlp.stanford.edu/software/tregex.shtml

3. See: http://www.compling.jp/ajb129/ts.html

