The Keyaki Treebank
The Keyaki Treebank is a parsed corpus that aims to instantiate a
coherent descriptive grammar of the Japanese language,
allowing searches for a wide variety of grammatical phenomena.
We hope this resource offers a useful foundation for
research into understanding and processing Japanese.
For an instant way to see the annotation, a search query can be entered into the following text area:
Bracket search accepts as a search term a tree specified with traditional Penn Treebank “bracketed notation”. Results will be trees that contain the input reference tree, which the results page will graphically present. While the reference tree might be a full tree, entering a partial tree is also possible, where the partial tree may, but need not, include terminal nodes.
As an introductory example, consider the query:
This will return trees where the noun “人” is inside a noun phrase. Note that the search term will have matched noun phrases that contain elements in addition to “(N 人)”, as well as noun phrases with labels that contain tag extensions, e.g., NP-SBJ, NP-PRD, etc.
Now consider the query:
This will return trees where the noun “人” is inside a noun phrase and followed by a relative clause. Because the query uses the wild card (“__”), a relative clause of any kind will match.
As a final example, consider the query:
This query consists of a partial tree with no commitment for how the tree terminates, and so will find trees where an IP dominates a PP that in turn dominates an IP.
The following tag prefixes can be added to change the search behaviour:
The TGrep language by Richard Pito formulates queries as patterns that consist of expressions to match tree nodes and relationships defining links or negated links to other tree nodes. Nodes of searched trees are matched either with simple character strings or regular expressions (see sections 1 and 2). A complex expression consists of a node expression followed by relationships, as presented in section 3. Possible relationships are illustrated in section 4.
The TGrep functionality available here is referred to as “TGrep-lite”. It is less expressive than the full TGrep language implemented in the original TGrep program. In particular, the relationships that can be expressed between nodes are limited to those detailed in section 4. TGrep-lite is especially weak compared with the enhanced TGrep languages of the TGrep2 and Tregex implementations; notably, it cannot express disjunctions of relations. TGrep-lite also behaves differently from TGrep with regard to how pre-terminal and terminal nodes are specified (described in section 2). Despite these limitations, TGrep-lite is the most accessible way to search the corpus through this on-line interface, and it remains a powerful search language.
Single tree nodes are matched in TGrep-lite with either:
A regular expression is specified as a string and matches a node if it matches any part of the node. For example, “/IP/” matches IP-MAT, IP-SMC, etc. A caret (“^”) as the first character anchors the regular expression to the beginning of a matched node, while a dollar sign (“$”) as the last character anchors it to the end of a matched node. Using both the caret and the dollar sign, as in “/^NP$/”, constrains the match to NP only.
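The anchoring behaviour described above can be tried out with any regular expression engine. The following is a minimal sketch using Python's `re` module, on the assumption that its unanchored `search` behaves like the node matching described here. The label list and the helper `matching` are made up for illustration, and the sketch does not model the case-insensitivity of the on-line interface.

```python
import re

# Illustrative node labels (not taken from any particular tree).
labels = ["IP-MAT", "IP-SMC", "NP", "NP-SBJ", "PP"]


def matching(pattern, candidates):
    """Return the labels in which the pattern matches at any position."""
    rx = re.compile(pattern)
    return [label for label in candidates if rx.search(label)]


unanchored_ip = matching(r"IP", labels)   # pattern may match any part of the label
prefix_np = matching(r"^NP", labels)      # caret: label must start with NP
exact_np = matching(r"^NP$", labels)      # caret and dollar sign: label must be exactly NP

print(unanchored_ip)  # ['IP-MAT', 'IP-SMC']
print(prefix_np)      # ['NP', 'NP-SBJ']
print(exact_np)       # ['NP']
```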
TGrep-lite works by rewriting expressions of the TGrep language into XPath queries over a database of XML encoded trees. The formatting of the XML is such that what is a pre-terminal (part-of-speech) node when captured with bracketed notation or a tree diagram is actually the value of a “pt” attribute of the same node that carries word content (value of a “word” attribute) and lemma information (value of a “lemma” attribute). That is, whereas part-of-speech and text are divided respectively into preterminal and terminal nodes with bracketed notation or a tree diagram, they occupy the same node in XML. Because of this organisation, the rewrite to XPath must distinguish three different “node” kinds expressed with TGrep-lite node patterns:
The wild card (“__”) is exceptional in not needing to distinguish its node kind, since it will match all nodes. For the expression of all other node patterns, their kind is determined based on their character content. In particular, the following order for calculating node kind is followed:
Note that the on-line interface is case insensitive when the node is identified as being either a pre-terminal or phrase-level node, while being case sensitive for terminal (word) nodes. Thus both “/PRO/” and “/pro/” will be interpreted as a pre-terminal pattern that finds the PRO (pronoun) part-of-speech tag. Coincidentally, “*pro*” is the segment used to indicate an underspecified null pronoun at the word level. Accordingly, if you want to search for “*pro*”, you cannot simply search for “/pro/”, because the search will instead look for pre-terminal nodes carrying the PRO (pronoun) part-of-speech tag. To find “*pro*” you may use (i) an underspecified string, as in “/pr/”, (ii) the escape symbol “\” before each special symbol “*”, as in “/\*pro\*/”, or (iii) single-character wild cards, as in “/.pro./”.
TGrep-lite expressions are composed of a node pattern followed by the relationships the node pattern participates in. Because the XML encoding stores word/lemma information on the same node as part-of-speech information, matching a particular word/lemma in combination with a particular part-of-speech requires the “==” (equals) relation, which connects two patterns describing the same underlying node. For example, the following will find instances of “と” with the “P-ROLE” part-of-speech tag.
The following example,
will match an IP node which immediately dominates a PP node and which dominates an IP node. Note the parenthesis to ensure that the second relationship “<< /IP/” refers to the first IP and not to the PP. As another example,
will match an IP which immediately dominates a PP which in turn dominates some IP.
The first node in a pattern, or the first node following a left parenthesis, is a “master” node, to which the relationships on its right apply. Thus, a TGrep-lite pattern consists of a master node for the entire query followed by a series of relationships to other nodes, which can themselves, when enclosed in parentheses, serve as master nodes with relationships to yet other nodes. In the first example above only the first /IP/ is a master node, while in the second example both the first /IP/ and the PP are master nodes.
Relationships define connections between the master node (being defined) and other nodes. There is a complete pairing of forward and backward links, allowing for flexibility in choosing what is the master node. The available relationships are:
A == B     B is the same node as A
A << B     A dominates (is an ancestor of) B
A >> B     A is dominated by (is a descendant of) B
A < B      A immediately dominates (is the parent of) B
A > B      A is immediately dominated by (is the child of) B
A .. B     A precedes B
A ,, B     A follows B
A . B      A immediately precedes B
A , B      A immediately follows B
A $ B      A is a sister of and not equal to B
A $.. B    A is a sister of and precedes B
A $,, B    A is a sister of and follows B
A $. B     A is a sister of and immediately precedes B
A $, B     A is a sister of and immediately follows B
The following presents pictures grouping the above relationships as forward and backward links:
C << __ (dominates, is an ancestor of)
__ >> C (is dominated by, is a descendant of)
E >> __ (is dominated by, is a descendant of)
__ << E (dominates, is an ancestor of)
C < __ (immediately dominates, is the parent of)
__ > C (is immediately dominated by, is the child of)
H .. __ (precedes)
__ ,, H (follows)
H ,, __ (follows)
__ .. H (precedes)
H . __ (immediately precedes)
__ , H (immediately follows)
H , __ (immediately follows)
__ . H (immediately precedes)
F $ __ (sister)
__ $ F (sister)
E $.. __ (sister and precedes)
__ $,, E (sister and follows)
E $. __ (sister and immediately precedes)
__ $, E (sister and immediately follows)
An exclamation mark (!) can be placed immediately before any relationship to negate it. Thus, A !.. B means that A does not precede B, that is, A is not followed by B.
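To make the relationship definitions concrete, the following Python sketch evaluates a few of them over the sample sentence from the Alpino XML description on this page. It is an illustration only, not how TGrep-lite is implemented: the function names are hypothetical, and precedence is assumed to mean that the “end” of one node is no greater than the “begin” of the other.

```python
import xml.etree.ElementTree as ET

# Sample sentence copied from the Alpino XML description on this page.
XML = """<alpino_ds id="2_textbook_kisonihongo;page_13;EN" version="1.3">
  <node cat="ip-mat" id="1" begin="0" end="5">
    <node cat="np-sbj" id="2" begin="0" end="1">
      <node pt="pro" word="He" id="3" begin="0" end="1"/>
    </node>
    <node pt="vbp" word="rolls" id="4" begin="1" end="2"/>
    <node cat="np-ob1" id="5" begin="2" end="4">
      <node pt="d" word="a" id="6" begin="2" end="3"/>
      <node pt="n" word="hoop" id="7" begin="3" end="4"/>
    </node>
    <node pt="pu" word="." id="8" begin="4" end="5"/>
  </node>
  <sentence>He rolls a hoop .</sentence>
</alpino_ds>"""

root = ET.fromstring(XML)
by_id = {n.get("id"): n for n in root.iter("node")}
# Map every <node> to its parent <node> (the topmost node has no entry).
parent = {child: p for p in root.iter("node") for child in p.findall("node")}


def span(n):
    return int(n.get("begin")), int(n.get("end"))


def immediately_dominates(a, b):      # A < B
    return parent.get(b) is a


def dominates(a, b):                  # A << B
    p = parent.get(b)
    while p is not None:
        if p is a:
            return True
        p = parent.get(p)
    return False


def precedes(a, b):                   # A .. B (assumed: a ends before b begins)
    return span(a)[1] <= span(b)[0]


def immediately_precedes(a, b):       # A . B
    return span(a)[1] == span(b)[0]


def sisters(a, b):                    # A $ B
    return a is not b and parent.get(a) is not None and parent.get(a) is parent.get(b)


ip, subj, verb, obj, hoop = (by_id[i] for i in ("1", "2", "4", "5", "7"))
print(dominates(ip, hoop), immediately_dominates(ip, hoop))      # True False
print(precedes(subj, obj), immediately_precedes(subj, obj))      # True False
print(immediately_precedes(subj, verb), sisters(subj, obj))      # True True
print(not precedes(obj, subj))                                   # negation, as in A !.. B: True
```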
TGrep-lite returns the match for the left-most element in the search pattern. The following pattern matches PPs that are immediately dominated by an IP that dominates an IP:
The tree search interfaces on this website present views of the parsed data generated from XML source files using the Alpino XML encoding (see below). The most direct way to search, and the method with the most expressive power, is to enter XPath queries directly into any of the text boxes in the interfaces. Writing XPath queries requires knowing the format in which the data is distributed. Before proceeding further on this page, we suggest that you find an example in the Pattern Browser, open a tree view of the example, then download the “XML tree”, and look at the names of the elements, the names of the attributes, and the values of the attributes. The rest of this section describes the XPath syntax used to query the data.
XPath can refer to the hierarchical information of Alpino XML as the embedding of “node” elements, grammatical categories by the “cat” attribute, parts-of-speech by the “pt” attribute, and surface order with attributes “begin” and “end”.
As an introductory example, the query:
identifies nodes for which the value of the “cat” attribute equals “pp”. The double slash (“//”) notation of the query indicates that this node can appear anywhere in a tree structure. Conditions for this node are given between square brackets and often refer to particular values of particular attributes. Conditions can be combined using the boolean operators “and”, “or” and “not”. For example, the previous query can be extended to require that the PP node has a sentence initial placement:
Brackets can be used to indicate the intended structure of the conditions. For example, the following query will match either PP-SBJ or ADVP nodes that are not in sentence initial positions:
Conditions can also refer to what is outside the node itself. The following query imposes a restriction on the daughter node of a PP, finding all sentences in which a PP occurs with a complement that is not an NP. This works by requiring that the daughter has a “cat” attribute whose value does not start with “np”.
It is possible to shift the selection of the query further down inside a node with single slash notation. For example, the following query will refer to a noun that is inside an NP-OB1:
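The query mechanics above (double slash, bracketed conditions, single-slash steps) can be tried off-line on a downloaded “XML tree”. The sketch below uses Python's standard `xml.etree.ElementTree`, which implements only a subset of XPath, on the sample sentence from the Alpino XML description on this page. Two things are assumptions rather than part of the website's queries: sentence-initial placement is read off `begin="0"`, and the “or”/“not” conditions are done in Python because that XPath subset lacks them.

```python
import xml.etree.ElementTree as ET

# Sample sentence copied from the Alpino XML description on this page.
XML = """<alpino_ds id="2_textbook_kisonihongo;page_13;EN" version="1.3">
  <node cat="ip-mat" id="1" begin="0" end="5">
    <node cat="np-sbj" id="2" begin="0" end="1">
      <node pt="pro" word="He" id="3" begin="0" end="1"/>
    </node>
    <node pt="vbp" word="rolls" id="4" begin="1" end="2"/>
    <node cat="np-ob1" id="5" begin="2" end="4">
      <node pt="d" word="a" id="6" begin="2" end="3"/>
      <node pt="n" word="hoop" id="7" begin="3" end="4"/>
    </node>
    <node pt="pu" word="." id="8" begin="4" end="5"/>
  </node>
  <sentence>He rolls a hoop .</sentence>
</alpino_ds>"""

root = ET.fromstring(XML)

# ".//" plays the role of "//": the node may sit anywhere in the tree.
subjects = root.findall(".//node[@cat='np-sbj']")

# Two conditions in a row act as a conjunction ("and").
# Assumption: a sentence-initial node is one whose begin attribute is "0".
initial_subjects = root.findall(".//node[@cat='np-sbj'][@begin='0']")

# A single slash steps down to a daughter: a noun directly inside an NP-OB1.
object_nouns = root.findall(".//node[@cat='np-ob1']/node[@pt='n']")

# "or" and "not" are outside ElementTree's XPath subset, so they are written in Python:
# NP-SBJ or NP-OB1 nodes that are not sentence initial.
non_initial_nps = [
    n for n in root.iter("node")
    if n.get("cat") in ("np-sbj", "np-ob1") and n.get("begin") != "0"
]

print([n.get("id") for n in subjects])           # ['2']
print([n.get("id") for n in initial_subjects])   # ['2']
print([n.get("word") for n in object_nouns])     # ['hoop']
print([n.get("cat") for n in non_initial_nps])   # ['np-ob1']
```

A full XPath 1.0 engine would accept the “and”, “or” and “not” conditions directly inside the square brackets, as the queries on this page do.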
XPath expressions can select nodes using a number of different axis specifiers. Each axis specifier describes a different set of nodes relative to the context node, where the context node is the central node in the following diagram:
The following is a description of the axis specifiers:
To select nodes from a specific axis, use the syntax axis::e where e is any XPath expression. For example, the next query finds embedded clauses, that is, IP nodes that are not immediately under the root alpino_ds node and whose parent (e.g., a CP) is also not immediately under the root alpino_ds node.
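The embedded-clause condition can be emulated off-line as follows. This is a sketch only: Python's `xml.etree.ElementTree` does not implement axis specifiers, so the parent axis is imitated with a child-to-parent map, and both trees and their tags are hypothetical fragments invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical tree: an IP inside a PP inside the matrix IP.
TREE_A = """<alpino_ds>
  <node cat="ip-mat" id="1" begin="0" end="3">
    <node cat="pp" id="2" begin="0" end="2">
      <node cat="ip-adv" id="3" begin="0" end="1">
        <node pt="vb" word="w1" id="4" begin="0" end="1"/>
      </node>
      <node pt="p" word="w2" id="5" begin="1" end="2"/>
    </node>
    <node pt="vb" word="w3" id="6" begin="2" end="3"/>
  </node>
</alpino_ds>"""

# Hypothetical tree: an IP whose parent CP sits directly under the root.
TREE_B = """<alpino_ds>
  <node cat="cp-que" id="1" begin="0" end="1">
    <node cat="ip-sub" id="2" begin="0" end="1">
      <node pt="vb" word="w1" id="3" begin="0" end="1"/>
    </node>
  </node>
</alpino_ds>"""


def embedded_clauses(xml_text):
    """Return the cat values of IP nodes at least two levels below the top node."""
    root = ET.fromstring(xml_text)
    # Stand-in for the parent axis: map each element to its parent.
    parent = {child: p for p in root.iter() for child in p}
    hits = []
    for n in root.iter("node"):
        if not n.get("cat", "").startswith("ip"):
            continue                      # not an IP (or a leaf without "cat")
        p = parent[n]
        if p is root:
            continue                      # IP immediately under alpino_ds
        if parent[p] is root:
            continue                      # parent immediately under alpino_ds
        hits.append(n.get("cat"))
    return hits


print(embedded_clauses(TREE_A))  # ['ip-adv']
print(embedded_clauses(TREE_B))  # []
```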
This section presents details of how parsed content is encoded in Alpino XML. Under this XML format, first developed for treebanks of Dutch (van Noord et al. 2013), nodes of tree structure are encoded by a recursive XML element “node”. Other information is presented as values of various XML attributes of those nodes, including:
As an example, consider:
<alpino_ds id="2_textbook_kisonihongo;page_13;EN" version="1.3">
  <node cat="ip-mat" id="1" begin="0" end="5">
    <node cat="np-sbj" id="2" begin="0" end="1">
      <node pt="pro" word="He" id="3" begin="0" end="1"/>
    </node>
    <node pt="vbp" word="rolls" id="4" begin="1" end="2"/>
    <node cat="np-ob1" id="5" begin="2" end="4">
      <node pt="d" word="a" id="6" begin="2" end="3"/>
      <node pt="n" word="hoop" id="7" begin="3" end="4"/>
    </node>
    <node pt="pu" word="." id="8" begin="4" end="5"/>
  </node>
  <sentence>He rolls a hoop .</sentence>
</alpino_ds>
This XML can be presented as a tree as follows:
The key innovation of this XML format is that the surface order of nodes is explicitly encoded by the XML attributes “begin” and “end” with each “node” element. This information allows querying the linear axis of the tree. For example, a query that requires a node x to immediately follow a node y can be encoded by requiring that the “begin”-attribute of x equals the “end”-attribute of y. This also enables querying constituent length, what resides at the left and right edges of given constituents, etc.
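As a rough illustration of what the “begin” and “end” attributes make possible, the Python sketch below computes constituent length, the words at the left and right edges of a constituent, and the nodes that immediately follow a given node, using the sample sentence above. The helper names are hypothetical, and the filtering is done in Python rather than in a single XPath query.

```python
import xml.etree.ElementTree as ET

# Sample sentence copied from the example above.
XML = """<alpino_ds id="2_textbook_kisonihongo;page_13;EN" version="1.3">
  <node cat="ip-mat" id="1" begin="0" end="5">
    <node cat="np-sbj" id="2" begin="0" end="1">
      <node pt="pro" word="He" id="3" begin="0" end="1"/>
    </node>
    <node pt="vbp" word="rolls" id="4" begin="1" end="2"/>
    <node cat="np-ob1" id="5" begin="2" end="4">
      <node pt="d" word="a" id="6" begin="2" end="3"/>
      <node pt="n" word="hoop" id="7" begin="3" end="4"/>
    </node>
    <node pt="pu" word="." id="8" begin="4" end="5"/>
  </node>
  <sentence>He rolls a hoop .</sentence>
</alpino_ds>"""

root = ET.fromstring(XML)


def length(node):
    """Constituent length in words: end minus begin."""
    return int(node.get("end")) - int(node.get("begin"))


def words(node):
    """Words covered by a node, in surface order (sorted by begin)."""
    leaves = [n for n in node.iter("node") if n.get("word") is not None]
    return [n.get("word") for n in sorted(leaves, key=lambda n: int(n.get("begin")))]


def immediately_following(y):
    """Nodes x whose begin attribute equals the end attribute of y."""
    return [x for x in root.iter("node") if x.get("begin") == y.get("end")]


subj = root.find(".//node[@cat='np-sbj']")
obj = root.find(".//node[@cat='np-ob1']")

print(length(obj))                                       # 2
print(words(obj)[0], words(obj)[-1])                     # a hoop
print([x.get("id") for x in immediately_following(subj)])  # ['4']
```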
Non-terminal nodes have an attribute “cat” that holds the syntactic tag, including any functional information for the node (e.g., “ip-mat” for matrix clause, “np-sbj” for subject noun phrase, “pp-ob1” for object postpositional phrase). Leaf nodes have an attribute “pt” that holds the part-of-speech tag.
Questions or comments? Please contact firstname.lastname@example.org.