Introduction - A Study of Syntactic Typed-Dependency Trees for English and Japanese and Graph-Centrality Measures

The aim of this thesis is to introduce typed-dependency trees for English and Japanese sentences, and to introduce graph-centrality measures that capture the structural characteristics of these trees. A typed-dependency tree is a syntactic structure that represents the dependency relationships among the words in a sentence as a network of words.

The structural characteristics of the network can be captured by a number of measures that have been developed in the field of graph theory and network analysis. Introducing these measures into the typed-dependency trees for sentences allows us to capture the structural characteristics of these sentences as networks of words, and ultimately, this analysis sheds new light on Japanese and English speakers’ syntactic intuitions.

In order to accomplish this aim, this thesis asks several questions. First, what are the dependency relationships among the words in a sentence? Chapters 2 and 3 answer this question.

Chapter 2 introduces the concept of dependency grammar through a discussion of Tesnière’s (1959) seminal assumption about dependency along with more recent theories of dependency grammar proposed by I. Mel’čuk and his colleagues (Iordanskaja & Mel’čuk 2000; Mel’čuk & Pertsov 1987; Mel’čuk 1988, 2003, 2004, 2009, 2011). This chapter also addresses the difference between dependency grammar and phrase-structure grammar. Section 2.2 presents an overview of dependency grammar and Section 2.3 focuses specifically on Tesnière’s (1959) seminal work on dependency grammar. Section 2.4 discusses Mel’čuk’s work on Deep Syntactic Relations and Surface Syntactic Relations as a development of Tesnière’s (1959) concept of dependency. Finally, the difference between dependency and phrase-structure grammar is briefly discussed in Section 2.5 with reference to Osborne, Putnam, & Gross (2011).

Chapter 3 examines whether a typed-dependency tree for a sentence is equivalent to a functional-structure representation according to Lexical-Functional Grammar (LFG) (Bresnan 1978; Bresnan 1982; Kaplan & Bresnan 1982; Bresnan 2001). Both LFG and dependency grammar assume that the individual pieces of lexical information contained in a sentence are integrated into a whole, and both frameworks are concerned with making explicit the process through which these pieces of lexical information are integrated. Dependency grammar does so at only one level of representation (i.e., a dependency tree for a sentence), while LFG does so by connecting multiple levels of representation (i.e., constituent structure, functional structure, argument structure, and phonological structure). The idea of structural correspondence in LFG can be seen as an extension of the typed-dependency tree representation of grammatical knowledge. In this sense, LFG represents one direction of development of the dependency grammar tradition started by Tesnière (1959). Chapter 3 also proposes a revised version of Mel’čuk’s criteria for surface syntactic dependencies; the idea behind this revision is that two words in a sentence establish a dependency relationship if and only if they constitute one fragment functional structure.

Second, what are the graph centrality measures, and how are they calculated? Chapter 4 answers this question by introducing the representation of a typed-dependency syntactic tree as a directed acyclic graph (DAG) and examining the idea of quantifying the structural property of typed-dependency trees in terms of graph centrality. The advantage of dependency grammar representation is that a sentence’s dependency can be interpreted as a DAG, allowing the formal syntactic properties to be defined and analyzed mathematically in terms of graph theory (Oya 2010b, 2011, 2013a, and 2013b). Dependency grammar makes explicit the connections among the words in a sentence, or the network of words (Tesnière 1959). Approaches in the field of graph theory and network analysis can help make salient the characteristics of these networks.
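As a minimal sketch of the idea that a typed-dependency tree can be read as a DAG, the following Python snippet stores a tree as a list of typed edges and checks that it has a single root and no cycles. The example sentence, the function names, and the Stanford-style type labels are illustrative assumptions, not material taken from the thesis.

```python
from collections import defaultdict

# Hypothetical example sentence: "David wrote a letter".
# Each edge is (head, dependent, dependency type). The type labels are
# Stanford-style names chosen for illustration only.
EDGES = [
    ("wrote", "David", "nsubj"),
    ("wrote", "letter", "dobj"),
    ("letter", "a", "det"),
]


def build_graph(edges):
    """Return an adjacency map: head -> list of (dependent, type)."""
    graph = defaultdict(list)
    for head, dependent, dep_type in edges:
        graph[head].append((dependent, dep_type))
    return graph


def find_root(edges):
    """Return the single word that is a head but never a dependent."""
    heads = {head for head, _, _ in edges}
    dependents = {dep for _, dep, _ in edges}
    roots = heads - dependents
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {sorted(roots)}")
    return next(iter(roots))


def is_acyclic(graph):
    """Depth-first search; meeting a node that is still open means a cycle."""
    UNSEEN, OPEN, DONE = 0, 1, 2
    state = defaultdict(int)

    def visit(node):
        state[node] = OPEN
        for child, _ in graph.get(node, []):
            if state[child] == OPEN:
                return False
            if state[child] == UNSEEN and not visit(child):
                return False
        state[node] = DONE
        return True

    return all(state[node] != UNSEEN or visit(node) for node in list(graph))
```

Calling `find_root(EDGES)` returns the verb `"wrote"`, and `is_acyclic(build_graph(EDGES))` confirms that following the arcs from head to dependent never returns to a word already on the current path.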

In other words, the structural properties of networks of words in sentences can be made explicit in dependency grammar and then quantified by applying graph theory. Quantified structural properties are useful for linguistic analyses that have previously relied on the subjective judgment of researchers, such as investigations into stylistic differences across different genres or similarities in syntactic structures across different languages. Quantitative approaches to syntactic structure contribute to these types of linguistic analyses by incorporating more objectivity. The centrality measures used in this thesis are degree centrality and closeness centrality, based on Freeman (1979) and Wasserman & Faust (1994). The degree centrality of a given typed-dependency tree indicates how flat the tree is, while the closeness centrality of a given typed-dependency tree indicates how embedded the tree is (Oya 2010b).
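The intuition behind flatness and embeddedness can be sketched in Python. The snippet below is an illustration under stated assumptions, not the thesis's own formulas (those are defined in Chapter 4): it assumes Freeman's (1979) graph-level degree centralization computed on the undirected tree, and a standardized closeness of the root word, (n - 1) divided by the sum of the root's distances to every other word. The function names and the toy trees are hypothetical.

```python
from collections import defaultdict, deque


def degree_centralization(edges):
    """Freeman-style degree centralization of a tree, ignoring direction.

    `edges` is a list of (head, dependent) pairs. A star-shaped tree
    scores 1.0; lower values mean the arcs are spread over more words.
    """
    degree = defaultdict(int)
    for head, dependent in edges:
        degree[head] += 1
        degree[dependent] += 1
    n = len(degree)
    if n < 3:
        raise ValueError("need at least three words")
    d_max = max(degree.values())
    return sum(d_max - d for d in degree.values()) / ((n - 1) * (n - 2))


def root_closeness(edges, root):
    """Standardized closeness of the root: (n - 1) / sum of distances.

    Distances are counted along head -> dependent arcs with a
    breadth-first search, so deeper trees give smaller values.
    """
    children = defaultdict(list)
    nodes = {root}
    for head, dependent in edges:
        children[head].append(dependent)
        nodes.update((head, dependent))
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in dist:
                dist[child] = dist[node] + 1
                queue.append(child)
    if len(dist) != len(nodes):
        raise ValueError("some words are not reachable from the root")
    total = sum(dist.values())
    if total == 0:
        raise ValueError("need at least one dependent")
    return (len(nodes) - 1) / total


# Two toy four-word trees (hypothetical): one flat, one fully embedded.
FLAT = [("r", "a"), ("r", "b"), ("r", "c")]
CHAIN = [("r", "a"), ("a", "b"), ("b", "c")]
```

For these toy inputs the flat tree scores 1.0 on both measures, while the chain scores 1/3 on degree centralization and 0.5 on root closeness. This shows how the two numbers separate a flat structure from a deeply embedded one of the same length.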

Third, how can we obtain the typed-dependency trees for given sentences, and what are their characteristics? Chapter 5 answers this question for English sentences, and Chapter 6 for Japanese sentences. Chapter 5 introduces the Stanford Parser (Klein & Manning 2003; de Marneffe & Manning 2012), along with the definition of each dependency type according to the revised version of Mel’čuk’s criteria introduced in Chapter 3. The Stanford Parser is a state-of-the-art parser used in this study for acquiring typed-dependency tree representations for English sentences. In traditional analyses, it is time-consuming for the researcher to construct typed-dependency trees for each sentence in a corpus and manually calculate their centrality measures. This chapter proposes the parser as a more efficient method to obtain typed-dependency trees for individual sentences in large corpora. Each dependency type is defined according to the revised version of Mel’čuk’s criteria, so that the definitions rest on the tradition of dependency grammar started by Tesnière and developed by Mel’čuk. The functional structures for example sentences are also provided in this chapter, so as to examine the equivalence, proposed in Chapter 3, between the typed-dependency tree for a sentence and its functional-structure representation.

Chapter 6 introduces a parser for Japanese called KNP (Kurohashi & Nagao 1992, 1994, 1998; Kawahara & Kurohashi 2007), including the dependency-type annotation for KNP output, the definition of each dependency type, and functional structures for sentences containing each dependency type. KNP is a rule-based dependency parser used for automatically generating dependency tree representations for Japanese sentences. The accuracy of this parser has been improved since its use in the development of Kyoto University Text Corpus ver. 4, a parsed corpus of Japanese (Kurohashi & Nagao 1998). Since the parsed output of KNP does not contain the type of each dependency, it is necessary to annotate the parsed output. Doing so allows us to use the KNP output to obtain cross-linguistic typed-dependency tree representations of Japanese. The annotated dependency types must also rest on the tradition of dependency grammar started by Tesnière and developed by Mel’čuk. As with the dependency types of English, the dependency types of Japanese are defined according to the revised version of Mel’čuk’s criteria proposed in Chapter 3. The functional structures for example Japanese sentences are also provided in this chapter, so as to examine the equivalence, proposed in Chapter 3, between the typed-dependency tree for a sentence and its functional-structure representation.

Fourth, from which source are the graph centrality measures obtained, and what is the result? Chapter 7 answers this question. The accuracies of the Stanford Parser and KNP are examined by comparing the typed-dependency trees obtained from the parsed output of the English sentences and their Japanese counterparts in a small-scale parallel corpus (Iida 2010) to their manually corrected typed-dependency trees. Results show that the distributions of both degree centralities and closeness centralities before and after manual correction are almost identical. Thus, the Stanford Parser and KNP are accurate enough to obtain degree centralities and closeness centralities. Next, the distributions of degree and closeness centralities for English typed-dependency trees are compared to those for their Japanese counterparts, and results show that their distributions are different. Thus, the structural properties of the typed-dependency trees for sentences in these two languages are different in terms of their degree centralities (flatness) and closeness centralities (embeddedness). Lastly, the distributions of degree centralities and of closeness centralities obtained from the parsed output of sentences from different genres of texts in the Manually Annotated Sub-Corpus of the American National Corpus (MASC 500k) (Ide, Baker, Fellbaum, Fillmore, & Passonneau 2008) are compared to each other. Sentences from different genres have different distributions of these measures; sentences in the subsections Fiction, Ficlets and Jokes are flatter and more embedded than sentences in other subsections. However, these different distributions depend on the word counts of the sentences. Controlling the word count of the sentences taken from different genres could make explicit that a difference in genre is reflected in the number of sentences with the same degree centrality and with the same closeness centrality.

2. Dependency Grammar
