
5.4 Empirical Analysis

5.4.2 Qualitative analysis

            Verb-or-noun constraint        Verb-otherwise-noun constraint
Language     Unif.   C=2   C=3  βlen=0.1    Unif.   C=2   C=3  βlen=0.1
Basque        44.7  55.2  54.3      46.4     55.8  55.6  54.8      51.0
Bulgarian     73.4  75.8  75.1      64.1     72.7  75.8  75.2      70.6
Croatian      40.1  52.5  41.4      47.3     57.0  52.5  52.5      55.8
Czech         50.7  54.8  64.7      59.2     63.2  54.9  66.3      58.1
Danish        40.9  43.1  41.3      40.9     48.7  46.9  50.1      47.3
English       39.8  55.8  41.3      40.2     57.2  55.2  58.5      53.9
Finnish       26.2  27.7  27.7      28.3     40.3  32.5  34.3      40.4
French        35.7  50.9  49.5      47.0     44.2  55.8  54.6      42.1
German        49.7  47.1  56.0      51.2     49.5  55.7  57.4      49.9
Greek         61.7  70.0  62.1      60.2     60.5  68.8  62.0      60.2
Hebrew        52.9  58.7  60.9      57.5     54.8  62.6  54.2      57.2
Hungarian     68.8  41.6  71.3      63.6     69.2  65.5  72.4      64.8
Indonesian    32.0  58.3  58.1      43.6     50.2  58.6  58.5      59.4
Irish         63.1  64.5  65.2      63.0     63.4  64.4  64.7      63.9
Italian       62.7  77.1  73.6      72.5     69.2  65.2  69.8      72.4
Japanese      56.4  70.5  56.9      73.9     56.9  69.0  57.0      73.5
Persian       46.9  45.1  51.2      39.7     48.0  45.1  51.1      41.7
Spanish       46.8  56.1  57.3      63.1     57.7  56.2  58.5      62.3
Swedish       43.5  44.8  43.2      43.5     57.9  53.3  53.3      56.9
Avg.          49.3  55.2  55.3      52.9     56.7  57.6  58.2      56.9

Table 5.4: Accuracy comparison on UD15 for selected configurations with the hard constraints on possible root POS tags.

or a noun or an adjective if the main verb is a copula. We remove adjectives from this set because we found them to be relatively rare across languages. Interestingly, in this case the stack depth constraints (C = 2 and C = 3) work best. In particular, the average score of the length bias (βlen = 0.1) drops. We inspect the reason for this below.

We next examine the effect of another constraint, the verb-otherwise-noun constraint, which excludes nouns from the root candidates if the sentence contains both a verb and a noun. This probably decreases recall, though we expect it to increase performance, since the majority of predicates are verbs.
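The difference between the two root POS constraints can be illustrated with a minimal sketch. This is not the implementation used in the experiments: the function name `root_candidates`, the use of the tag strings `VERB` and `NOUN` only, and the fallback to all tokens when a sentence contains neither tag are assumptions made for illustration.

```python
from typing import List


def root_candidates(pos_tags: List[str], constraint: str) -> List[int]:
    """Return the token indices allowed to be the root (hypothetical helper)."""
    verbs = [i for i, tag in enumerate(pos_tags) if tag == "VERB"]
    nouns = [i for i, tag in enumerate(pos_tags) if tag == "NOUN"]

    if constraint == "verb-or-noun":
        # Any verb or noun may become the root.
        allowed = sorted(verbs + nouns)
    elif constraint == "verb-otherwise-noun":
        # Nouns are excluded whenever at least one verb exists.
        allowed = verbs if verbs else nouns
    else:
        raise ValueError(f"unknown constraint: {constraint}")

    # Assumption: with no verb and no noun, the root is left unconstrained.
    return allowed if allowed else list(range(len(pos_tags)))


tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
print(root_candidates(tags, "verb-or-noun"))
print(root_candidates(tags, "verb-otherwise-noun"))
```

For a sentence that contains a verb, the second constraint leaves only the verbs as candidates, which is where the expected loss in recall comes from: a sentence whose true root is a noun can no longer be parsed correctly at the root.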

As we expected, with this constraint the average performance of the baseline uniform model increases sharply from 49.2 to 56.7 (+7.5), which is larger than any increase obtained by adding a structural constraint to the original baseline model. In this case, though the change is small, our stack depth constraints again perform best (58.2 with C = 3); the average score with the length bias does not increase.

depth constraints and other models are often quite different. Specifically,

• Greek is the only language on which the baseline uniform model improves its score over the setting with no root POS constraint.

• In other languages, the scores of the uniform model are unchanged or drop when the root POS constraint is added. For example, the score for Czech drops from 61.8 to 50.7.

• The same tendency is observed in the model with βlen = 0.1. Its score for Greek improves from 30.4 to 60.2, while other scores are often unchanged or drop; an exception is French, on which the score improves from 35.7 to 47.0. For other languages, such as Croatian (-7.2), English (-16.4), Indonesian (-6.4), and Swedish (-12.4), the scores drop sharply.

• On the other hand, we observe no significant performance drops in our models with stack depth constraints (i.e., C = 2 and C = 3) when the root POS constraint is added.

These differences between the length constraints and the stack depth constraints are interesting and may shed some light on the characteristics of the two approaches. Here we look into the output parses for English, for which the performance changes are typical, i.e., the score of the βlen = 0.1 model drops while the scores of the other models are almost unchanged.

When we compare the output parses of different models, we notice that the same tree is often predicted by several different models. Figure 5.6 shows examples of output parses of different models. The following observations are based on the errors in those parses. Note that these are typical, in that the same observations can often be made on other sentences as well.

1. One strong observation from Figure 5.6 is that the output of βlen = 0.1 reduces to that of the uniform model when the root POS constraint is added to the model. As can be seen in other parses, every model in fact predicts that the root token is a noun or a verb, which suggests this explicit root POS constraint is completely redundant in the case of English.

2. Contrary to βlen = 0.1, the stack depth constraints, C = 2 and C = 3, are not affected by the root POS constraint. This is consistent with the scores in Tables 5.3 and 5.4; the score of C = 3 is unchanged and that of C = 2 increases by just 1.0 point with the root POS constraint.

3. While the scores of the uniform model and C = 3 are similar in Table 5.3 (38.7 and 41.3, respectively), the properties of their output parses seem very different. The typical errors made by C = 3 are the root tokens, which are in most cases predicted as nouns as in Figure 5.6(d), and arcs between nouns and verbs, which are also typically predicted as NOUN → ^VERB. Contrary to these local mistakes, the uniform model often fails to capture the basic structure of a sentence. For example, while C = 3 correctly identifies that “of two beheading video’s” comprises a constituent, which modifies “screenshots”, which in turn becomes an argument of “took”, the parse of the uniform model is more corrupt in that we cannot identify any semantically coherent units from it. See also Figure 5.7, where we compare the outputs of these models on another sentence.