The Keyaki Treebank








Thanks to the Japan Science and Technology Agency (JST) PRESTO Sakigake program, in the research area "Synthesis of Knowledge for Information Oriented Society" (Awardee: Alastair Butler, 2010–2014), for supporting the initial design of the Keyaki Treebank. Subsequent funding under an agreement between NTT (Linguistic Intelligence Research Group) and Kei Yoshimoto and Alastair Butler, dated 2014/06/24, sustained annotation through to 2016. Since 2016/04/01, work on the Keyaki Treebank has continued under the internal project "Development of and Linguistic Research with a Parsed Corpus of Japanese" of the National Institute for Japanese Language and Linguistics (NINJAL) (Project Leader: Prashant Pardeshi).

In addition to data sourced directly from Aozora Bunko, Wikipedia, the Japanese Law Translation Database System, and articles of the Kahoku Shimpo newspaper (by permission of KAHOKU SHIMPO PUBLISHING CO.), the corpus includes annotations for data sourced from the following freely available resources:

  • A massively parallel corpus: the Bible in 100 languages (Christos Christodoulopoulos and Mark Steedman, Language Resources and Evaluation, 49(2)).
  • Kyoto-University and NTT Blog (KNB) Corpus Version 1.0 (Chikara Hashimoto, Sadao Kurohashi, Daisuke Kawahara, Keiji Shinzato and Masaaki Nagata. 2011. Construction of a Blog Corpus with Syntactic, Anaphoric, and Sentiment Annotations [in Japanese], Journal of Natural Language Processing, Vol 18, No. 2, pp. 175-201). Available from
  • National Institute for Japanese Language and Linguistics. The Compound Verb Lexicon [Original Data]. Available from
  • The NAIST-NTT TED Talk Treebank Version 1 (Graham Neubig, Katsuhito Sudoh, Yusuke Oda, Kevin Duh, Hajime Tsukada, Masaaki Nagata. The NAIST-NTT TED Talk Treebank. International Workshop on Spoken Language Translation (IWSLT). Lake Tahoe, USA. December 2014). Available from
  • Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles Version 2.01. Available from
  • Tanaka Corpus. Available from
  • Japanese Semantics Test Suite (JSeM) beta (Ai Kawazoe, Ribeka Tanaka, Koji Mineshima, Daisuke Bekki (2015) "An Inference Problem Set for Evaluating Semantic Theories and Semantic Processing Systems for Japanese," Proceedings of the Twelfth International Workshop on Logic and Engineering of Natural Language Semantics (LENLS12)). Available from
  • Spoken Refusal Response Corpus. Tomoko Hotta, Tohoku University.

The corpus also contains annotations for data sourced from the following purchasable resources. Because words have been removed from this data for licensing reasons, the original resources are required to restore them:

  • Mainichi Shinbun 1995 CD-ROM data collection (the same set of data as used by the Kyoto Text Corpus). Available from Nichigai Associates:
  • Balanced Corpus of Contemporary Written Japanese (BCCWJ) DVD edition data (Kikuo Maekawa, Makoto Yamazaki, Toshinobu Ogiso, Takehiko Maruyama, Hideki Ogura, Wakako Kashino, Hanae Koiso, Masaya Yamaguchi, Makiro Tanaka, and Yasuharu Den. 2014. "Balanced Corpus of Contemporary Written Japanese". Language Resources and Evaluation 48(2), pp. 345-371). Available from
  • Corpus of Spontaneous Japanese (CSJ) DVD-ROM data (Maekawa, K., "Corpus of Spontaneous Japanese: Its design and evaluation." Proceedings of the ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR2003), Tokyo, 2003). Available from
  • 同時通訳データベース (SIDB) Simultaneous Interpretation Database (Hitomi Toyama, Shigeki Matsubara, Koichiro Ryu, Nobuo Kawaguchi, Yasuyoshi Inagaki. CIAIR Simultaneous Interpretation Corpus, Proceedings of the Oriental Chapter of the International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques for Speech Input/Output (Oriental COCOSDA), 2004). Available from

Finally, the corpus contains annotations of examples from the following book:

  • Takashi Masuoka and Yukinori Takubo (1992). Kiso Nihongo Bunpou (Basic Grammar of Japanese). Kuroshio Shuppan, Tokyo.

We are deeply indebted to all those who made the above-mentioned resources available.