4   Segmentation and part-of-speech annotation

The policy for segmentation and part-of-speech labelling is to use terminal nodes that are as large as possible, but not so large that purely lexical elements absorb other elements with functional roles. This policy corresponds closely to the LUW (Long Unit Word) standard of the Corpus of Spontaneous Japanese (CSJ; Maekawa 2003) and the Balanced Corpus of Contemporary Written Japanese (BCCWJ; Maekawa et al. 2014). A SUW (single-morpheme Short Unit Word) corresponds to an entry in the UniDic dictionary (Den et al. 2008). A LUW is composed of one or more SUWs, and complex LUWs containing more than one SUW are common.

     The chunking obtained with LUW analysis is not limited to complex nominal expressions or complex predicates: heterogeneous strings that appear to have undergone grammaticalisation (e.g., some formal noun/particle pairs, some complex modal expressions, etc.) are chunked as well. Complex LUWs are usually incorporated as single segments. For example, numerals are analysed digit-by-digit into component SUWs, but the parsed annotation chunks these into a single segment according to the LUW containing them.
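
As a rough illustration of this chunking step, the following Python sketch concatenates consecutive SUWs that fall under the same LUW. The `SUW` record, its `starts_luw` flag, and the sample input are assumptions made for illustration; they are not the actual BCCWJ or CSJ data format.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SUW:
    surface: str      # surface string of the short unit word
    starts_luw: bool  # hypothetical flag: True if this SUW opens a new LUW


def chunk_by_luw(suws: Iterable[SUW]) -> List[str]:
    """Concatenate consecutive SUWs that belong to the same LUW."""
    segments: List[str] = []
    for suw in suws:
        if suw.starts_luw or not segments:
            # Start a new segment at every LUW boundary.
            segments.append(suw.surface)
        else:
            # Otherwise extend the segment currently being built.
            segments[-1] += suw.surface
    return segments


# Illustrative input (not taken from the corpus): a numeral analysed
# digit-by-digit, followed by a particle that forms its own LUW.
suws = [SUW("三", True), SUW("五", False), SUW("は", True)]
print(chunk_by_luw(suws))  # ['三五', 'は']
```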

     While the LUW chunking of the BCCWJ and CSJ is intended to identify units with significance in syntax, the information is not always rich enough to generate immediate constituency trees approaching descriptive adequacy for syntax. In some circumstances even SUW analysis is not sufficiently fine-grained, making it necessary to split the SUW into more than one segment. For example, the volitional form of a verb may be analysed into a combining stem and a volitional morpheme: (VB 結ぼ) (MD う). Conversely, a series of LUWs may need to be concatenated under one terminal node label (an instance of further chunking). For example, a series of LUWs that together form a complex proper noun may be chunked into one segment. Furthermore, some finer distinctions in morphological analysis that have no consequence for syntax (e.g., the distinction between personal names and place names) are sometimes ignored, while other distinctions deemed important (e.g., sorting instances of items that share the same phonological form into more than one part of speech according to their grammatical function) are introduced. This is a consequence of aiming to expose the basic functional structure of the language, while keeping the structure fairly flat and easily searchable.
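
The volitional split mentioned above can be sketched in Python as follows. The helper names `split_volitional` and `to_brackets` are hypothetical, and the function covers only the pattern shown in the text (a form ending in う); it is not a general morphological analyser.

```python
from typing import List, Tuple

Terminal = Tuple[str, str]  # (part-of-speech label, surface text)


def split_volitional(surface: str) -> List[Terminal]:
    """Split a volitional form ending in う into stem (VB) and morpheme (MD).

    Assumption: only the pattern illustrated in the text is handled.
    """
    if len(surface) < 2 or not surface.endswith("う"):
        raise ValueError(f"not a volitional form ending in う: {surface!r}")
    return [("VB", surface[:-1]), ("MD", "う")]


def to_brackets(terminals: List[Terminal]) -> str:
    """Render terminals in the bracketed notation used in this document."""
    return " ".join(f"({label} {text})" for label, text in terminals)


print(to_brackets(split_volitional("結ぼう")))  # (VB 結ぼ) (MD う)
```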

     The policy is to make chunks as large as possible in the automatic parse, and this is the form in which annotators initially see the segments. But when a string clearly has some constituency that must be expressed by structure, or when there is a need to indicate the semantic effects of structure, chunking may have to be undone. To exemplify the former situation, consider the morpheme 中. Following a “verbal noun” like 旅行, 中 is analysed as a nominalising suffix by UniDic and grouped together with a preceding string to form a LUW. This analysis is appropriate for a situation with a noun modifier:


But in a different context, 旅行 may appear with adverbial elements, and 中 is then better analysed as a formal noun:


( (IP-MAT (PP (NP (NPR 佐藤さん))
              (P は))
          (NP-SBJ *)
          (NP-PRD (IP-EMB (NP-SBJ *pro*)
                          (PP (NP (N 海外))
                              (P を))
                          (NP-OB1 *を*)
                          (VB 旅行))
                  (N 中))
          (AX だ)
          (PU 。))
  (ID 53_misc_EXAMPLE))

Another scenario where chunking may have to be undone involves complex particles. UniDic chunks many verb-particle combinations into complex particles. Note how in the string にしたがって in (4) the part corresponding to the verb 従う has been bleached of its semantics.


( (IP-IMP (-LRB- 「)
          (PP (NP (N 地))
              (P は))
          (NP-SBJ *)
          (PP (NP (N 生き物))
              (P を))
          (NP-OB1 *を*)
          (PP (NP (N 種類))
              (P にしたがって))
          (VB いだせ)
          (PU 。))
  (ID 46_bible_old))
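
To locate chunked particles such as にしたがって for review, the terminal nodes of a bracketed tree can be listed programmatically. The sketch below is a minimal reader written for illustration, not a tool belonging to the corpus; the function name `terminals` is hypothetical, and the input is an abbreviated fragment of the tree above.

```python
import re
from typing import List, Tuple

PARENS = {"(", ")"}


def terminals(tree_text: str) -> List[Tuple[str, str]]:
    """Return (label, text) pairs for every leaf of the form (LABEL text)."""
    # Tokens are single parentheses or runs of other non-space characters.
    tokens = re.findall(r"[()]|[^\s()]+", tree_text)
    leaves = []
    for i in range(len(tokens) - 3):
        opener, label, text, closer = tokens[i:i + 4]
        if (opener == "(" and closer == ")"
                and label not in PARENS and text not in PARENS):
            leaves.append((label, text))
    return leaves


# Abbreviated fragment of the tree above.
fragment = "(IP-IMP (PP (NP (N 種類)) (P にしたがって)) (VB いだせ) (PU 。))"
particles = [text for label, text in terminals(fragment) if label == "P"]
print(particles)  # ['にしたがって']
```

Whether a given particle segment should stay chunked remains a judgement for the annotator; the listing only gathers candidates.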

While the UniDic analysis is frequently correct, there are still instances where the part corresponding to the verb 従う should be treated as a full-fledged verb:


( (IP-MAT (PP (NP (NPR モーセ))
              (P は))
          (NP-SBJ *)
          (IP-ADV (PP (NP (PP (NP (N 主))
                              (P の))
                          (N 命))
                      (P に))
                  (VB したがっ)
                  (P て))
          (CONJ *)
          (PU 、)
          (PP (NP (PP (NP (NPR パラン))
                      (P の))
                  (N 荒野))
              (P から))
          (PP (NP (PRO 彼ら))
              (P を))
          (NP-OB1 *を*)
          (VB つかわし)
          (AXD た)
          (PU 。))
  (ID 408_bible_old))

In a scenario such as (5) the annotator splits the segment and relabels its parts as necessary.
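
The split-and-relabel step can be pictured with a small Python sketch. The helper `split_segment` is hypothetical; it only checks that the new segments spell out the original chunk, and it does not perform the accompanying restructuring (such as the IP-ADV layer seen in (5)), which is left to the annotator.

```python
from typing import List, Tuple

Terminal = Tuple[str, str]  # (part-of-speech label, surface text)


def split_segment(terminal: Terminal, parts: List[Terminal]) -> List[Terminal]:
    """Replace one terminal with several, checking that no text is lost."""
    _, text = terminal
    joined = "".join(part_text for _, part_text in parts)
    if joined != text:
        raise ValueError(f"parts spell {joined!r}, expected {text!r}")
    return list(parts)


chunked = ("P", "にしたがって")
relabelled = split_segment(
    chunked, [("P", "に"), ("VB", "したがっ"), ("P", "て")]
)
print(" ".join(f"({label} {text})" for label, text in relabelled))
# prints: (P に) (VB したがっ) (P て)
```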