Google Universal Treebanks - 本文 Thesis 総合研究大学院大学学術情報リポジトリ A1835本文

This dataset is a collection of 12 languages treebanks, i.e., Brazilian-Portuguese, English, Finnish, French, German, Italian, Indonesian, Japanese, Korean, Spanish and Swedish. Most treebanks are created by hand in this project except the following two languages:

English: Automatically convert from the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993) (with a different conversion method than the CoNLL dataset).

Swedish: Talbanken05 (Nivre et al., 2006) as in CoNLL dataset.

Basically every treebank follow the annotation guideline based on the Stanford typed depen-dencies as in UD, but contrary to UD, the annotation of Google treebanks is not fully content head-based. As we show in Figure 3.1(c), it annotates specific constructions in function head-based, in particularADPphrases.

We do not summarize the statistics of this dataset here as we use it only in our experiments in Chapter 5 where we will see the statistics of the subset of the data that we use (see Section 5.3.1).

Left-corner Transition-based Dependency Parsing

Based on several recipes introduced in Chapter 2, we now build a left-corner parsing algorithm operating on dependency grammars. In this chapter, we formalize the algorithm as a transition system for dependency parsing (Nivre, 2008) that roughly corresponds to the dependency version of a push-down automaton (PDA).

We have introduced PDAs with the left-corner parsing strategy for CFGs (Section 2.2.3) as well as a conversion method of any projective dependency trees into an equivalent CFG parse (Section 2.1.4). Thus one may suspect that it is straightforward to obtain a left-corner parsing algorithm for dependency grammars by, e.g., developing a CFG parser that will build a CFG parse encoding dependency information at each nonterminal symbol.

In this chapter, however, we take a different approach to build an algorithm in a non-trivial way.

One reason for this is because such a CFG-based approach cannot be an incrementalalgorithm.

On the other hand, our algorithm in this chapter is incremental; that is, it can construct a partial parse on the stack, without seeing the future input tokens. Incrementality is important for assessing parser performance with a comparison to other existing parsing methods, which basically assume incremental processing. We perform such empirical comparison in Sections 4.4 and 4.5.

For example, let us assume to build a parse in Figure 4.1(b), which corresponds to the CFG parse for a dependency tree on “a big dog”. To recognize this parse on the left-corner PDA in Section 2.2.3, after shifting token “a” (which becomes X[a]), the PDA may covert it to the symbol

“X[dog]/X[dog]”. However, for creating such a symbol, we have to know that “dog” will appear on the remaining inputs at this point, which is impossible in incremental parsing. This contrasts with the left-corner parser for phrase-structure grammars that we considered in Section 2.2.3 in which there is only a finite inventory of nonterminal, which might be predicted.

The algorithm we formalize in this chapter does not introduce such symbols to enable incremen-tal parsing. We do so by introducing a new concept, adummynode, which efficiently abstracts the predicted structure of a subtree in a compact way. Another important point is since this algorithm directly operates on a dependency tree (not via a CFG form), we can get intuition into how the left-corner parser builds a dependency parse tree. This becomes important when developing efficient

tabulating algorithm with head-splitting (Section 2.3.5) in Chapter 5.

We formally define our algorithm as atransition system, a stack-based formalization like push-down automata and is the most popular way for obtaining algorithms for dependency grammars (Nivre, 2003; Yamada and Matsumoto, 2003; Nivre, 2008; Gómez-Rodríguez and Nivre, 2013). As we discussed in Section 2.2.3, a left-corner parser can capture the degree of center-embedding of a construction by its stack depth. Our algorithm preserves this property, and its stack depth increases only when processing dependency structures involving center-embedding.

The empirical part of this chapter comprises of two kinds of experiments: First, we perform a corpus analysis to show that our left-corner algorithm consistently requires less stack depth to recognize annotated trees relative to other algorithms across languages. The result also suggests the existence of a syntactic universal by which deeper center-embedding is a rare construction across languages, which has not yet been quantitatively examined cross-linguistically. The second exper-iment is a supervised parsing experexper-iment, which can be seen as an alternative way to assess the parser’s ability to capture important syntactic regularities. In particular, we will find that the parser using our left-corner algorithm is consistently less sensitive to the decoding constraints of stack depth bound across languages. Conversely, the performance of other dependency parsers such as the arc-eager parser is largely affected by the same constraints.

The motivation behind these comparisons is to examine whether the stack depth of a left-corner parser is in fact a meaningful measure to explain the syntactic universal among other alternatives, which would be valuable for other applications such as unsupervised grammar induction that we explore in Chapter 5.

The first experiment is astaticanalysis, which strictly analyzes the observed tree forms in the treebanks, while the second experiment takesparsing errorsinto account. Though the result of the first experiment seems clearer to claim a universal property of language, the result of the second experiment might also be important for real applications. Specifically we will find that the rate of performance drop with a decoding constraint is smaller than the expected value from the coverage result of the first experiment. This suggests that a good approximation of the observed syntactic structures in treebanks is available from a highly restricted space if we allow small portion of parse errors. Since real applications always suffer from parse errors, this result is more appealing for finding a good constraint to restrict the possible tree structures.

This chapter proceeds as follows: Since our empirical concern is the relative performance of our left-corner algorithm compared to existing transition-based algorithms, we begin the discussion in this chapter with a survey of stack depth behavior in existing algorithms in Section 4.2. This discussion is an extension of a preliminary survey about the incrementality of transition systems by Nivre (2004), which is (to our knowledge) the only study discussing how stack elements increase for a particular dependency structures in some algorithm. Then, in Section 4.3, we develop our new transition system that follows a left-corner parsing strategy for dependency grammars and discuss the formal properties of the system, such as the spurious ambiguity of the system and its implications, which are closely relevant to the spurious ambiguity problem we discussed in Section 2.1.4. The empirical part is devoted to Sections 4.4 and 4.5, focusing on the static corpus analysis and supervised parsing experiments, respectively. Finally, we give discussion along with the the relevant previous studies in Section 4.6 to conclude this chapter.

The preliminary version of this chapter appeared as Noji and Miyao (2015), which was itself an

extension of Noji and Miyao (2014). Although these previous versions limited the dataset to the one in the CoNLL shared tasks (Buchholz and Marsi, 2006; Nivre et al., 2007a), we add new analysis on Universal dependencies (Marneffe et al., 2014) (see also Chapter 3). The total number of analyzed treebanks is 38 in total across 26 languages.

4.1 Notations

We first introduce several important concepts and notations used in this chapter.

Transition system Every parsing algorithm presented in this chapter can be formally defined as a transition system. The description below is rather informal; See Nivre (2008) for more details. A transition system is an abstract state machine that processes sentences and produces parse trees. It has a set ofconfigurationsand a set oftransition actionsapplied to a configuration. Each system defines aninitial configurationgiven an input sentence. The parsing process proceeds by repeatedly applying an action to the current configuration. After a finite number of transitions the system arrives at aterminal configuration, and the dependency tree is read off the terminal configuration.

Formally, each configuration is a tuple(σ, β, A); here,σis a stack, and we use a vertical bar to signify the append operation, e.g.,σ=σ^′|σ₁denotesσ₁is the topmost element of stackσ. Further, β is an input buffer consisting of token indexes that have yet to be processed; here, β = j|β^′ indicates thatj is the first element ofβ. Finally,A ⊆ V_w ×V_w is a set of arcs givenV_w, a set of token indexes for sentencew.

Transition-based parser We distinguish two similar terms, a transition system and a transition-based parser in this chapter. A transition system formally characterizes how a tree is constructed via transitions between configurations. On the other hand, a parser is built on a transition system, and it selects the bestaction sequence(i.e., the best parse tree) for an input sentence probably with some scoring model. Since a transition system abstracts the way of constructing a parse tree, when we mention a parsing algorithm, it often refers to a transition system, not a parser. Most of the remaining parts of this chapter is about transition systems, except Section 4.5, in which we compare the performance of several parsers via supervised parsing experiments.

Center-embedded dependency structure The concept of center-embedding introduced in Sec-tion 2.2.1 is originally defined on a constituent structure, or a CFG parse. Remember that a depen-dency tree also encodes constituent structures implicitly (see Figure 4.1) but the conversion from a dependency tree into a CFG parse (in CNF) is not unique, i.e., there is a spurious ambiguity (see Sec-tion 2.1.4). This ambiguity implies there is a subtlety for defining the degree of center-embedding for a dependency structure.

We argue that the tree structure of a given dependency tree (i.e., whether it belongs to center-embedding) cannot be determined by a given tree itself; We can determine the tree structure of a dependency treeonly ifwe have some one-to-one conversion method from a dependency tree to a CFG parse. For example some conversion method may always convert a tree of Figure 4.1(c) into

a big dog (a)

X[dog]

dog X[big]

big X[a]

(b)

dogs run fast (c)

X[run]

X[fast]

fast X[run]

X[run]

run X[dogs]

dogs

(d)

X[run]

X[fast]

fast X[run]

run X[dogs]

dogs

(e)

Figure 4.1: Conversions from dependency trees into CFG parses; (a) can be uniquely converted to (b), while (c) can be converted to both (d) and (e).

the one of Figure 4.1(d). In other words, the tree structure of a dependency tree should be discussed along with such a conversion method. We discuss this subtlety more in Section 4.3.3.

We avoid this ambiguity for a while by restricting our attention to the tree structures like Figure 4.1(a) in which we can obtain the corresponding CFG parse uniquely. For example the dependency tree in Figure 4.1(a) is an example of aright-branchingdependency tree. Similarly we call a given dependency tree is center-embedding, or left- (right-)branching, depending on the implicit CFG parse when there is no conversion ambiguity.

ドキュメント内本文 Thesis 総合研究大学院大学学術情報リポジトリ A1835本文 (ページ 72-76)