Discussion

We characterized the underlying information via LDA topics and found that the number of clusters was the same in both data sets, but the keywords composing those clusters came from very different realms. For PubMed we mainly obtained medical terms, whereas for Twitter we obtained generic verbs (“take”, “get”, “feel”) and abbreviations (“im”) that were not useful for describing a set of medical tweets such as the one we used. The use of those verbs is an important finding for two reasons: they are known to be used in informal settings, as opposed to more formal verbs [191], and they are not taken into consideration in Biber’s MD analysis, which suggests that the framework should be extended to account for such variation.

We also assessed the similarity in terms of the different relations between a drug and its related effects (symptoms and diseases), and observed a very low level of similarity between PubMed sentences and tweets containing drug-use reports. In particular, the reports of negative outcomes were far from comparable, and only the relations describing positive effects (i.e. the “Benefit” and “Reason-to-use” relations) showed some similarity, although below 30%. An in-depth study would be required to confirm these findings: if the similarity ratio does not improve with a much larger data set, it would mean that reports from PubMed and Twitter contain complementary information. That would not be a surprising finding but, to our knowledge, it is an observation that has not been presented before.

The comparison of features between the different tweet data sets showed that drug-use tweets made more frequent use of hedging, amplifiers (“absolutely”, “altogether”, “completely”...), sentence relatives (“which”), past participial WHIZ deletions, and seem/appear verbs. These are features that we would expect to see more often in formal texts. On the other hand, generic tweets used more stranded prepositions, discourse particles (“well”, “anyway”...), place adverbs (“above”, “around”...), emphatics (“just”, “really”...) and “wh clauses” (e.g. “I believed what he told me”), all of which are traits expected in informal texts. In terms of the dimensions, we did not expect four of them to differ between the two data sets. In particular, we did not expect the differences to show that the set of drug-related tweets contained traits expected in academic texts: they were less involved and less narrative than generic tweets, while also being more abstract and containing more academic qualifications.
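The feature comparison above rests on how often each feature occurs in each data set. As a minimal sketch of that idea, and not the tagger used in this work, the following counts a few lexical features and normalizes them per 1,000 tokens, a common convention in MD analysis. The word lists contain only the examples quoted in the text, the tokenizer is deliberately crude, and the names `FEATURE_WORDS`, `tokenize` and `feature_rates` are hypothetical. Features that need part-of-speech or syntactic context (for example stranded prepositions or WHIZ deletions) cannot be captured by a word lookup like this one.

```python
import re
from collections import Counter

# Illustrative word lists built only from the examples quoted in the text;
# a real MD tagger uses far larger lists plus part-of-speech context.
FEATURE_WORDS = {
    "amplifiers": {"absolutely", "altogether", "completely"},
    "emphatics": {"just", "really"},
    "discourse_particles": {"well", "anyway"},
    "place_adverbs": {"above", "around"},
}


def tokenize(text):
    """Lowercase the text and split it into runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def feature_rates(texts, per=1000):
    """Return the occurrences of each feature per `per` tokens over all texts."""
    counts = Counter()
    total = 0
    for text in texts:
        tokens = tokenize(text)
        total += len(tokens)
        for name, words in FEATURE_WORDS.items():
            counts[name] += sum(1 for t in tokens if t in words)
    if total == 0:
        # No tokens at all: report zero rates instead of dividing by zero.
        return {name: 0.0 for name in FEATURE_WORDS}
    return {name: per * counts[name] / total for name in FEATURE_WORDS}
```

Calling `feature_rates` once on the drug-use tweets and once on the generic tweets gives two rate dictionaries that can be compared feature by feature. Note that a word such as “well” is counted whenever it appears, whether or not it acts as a discourse particle, so the rates are only rough upper bounds.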

When comparing drug-use reports in Twitter and PubMed, we saw that they were not very dissimilar. Some features were nevertheless more frequent in tweets than in PubMed sentences, namely the verb “do”, first person pronouns and the pronoun “it”, analytic negation (“not”), the adverbial subordinator “because”, predictive modals, and conditional adverbial subordinators. Emphatics and stranded prepositions also appeared more often in tweets, which is in line with informal register traits [8]. In PubMed, sentence relatives, nouns, “and” coordination, suasive verbs, and past participial WHIZ deletions appeared noticeably more often than in tweets. These features are known to signal more complex texts [192].

Looking at the features that behaved similarly in tweets and PubMed sentences containing drug-use reports, we found that, among others, the use of nominalizations, past participial clauses and the type/token ratio were very similar in both sources. This suggests that these particular features are related to the information being conveyed, independently of the linguistic register being used.
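For readers unfamiliar with the type/token ratio, the following is a minimal sketch of how it can be computed: the number of distinct word forms divided by the total number of words. The tokenizer and the optional `window` parameter are assumptions made for illustration. A fixed window is often used because the ratio tends to fall as texts get longer, which matters when comparing short tweets with longer sentences.

```python
import re


def type_token_ratio(text, window=None):
    """Distinct word forms divided by total words, optionally over the first `window` tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if window is not None:
        tokens = tokens[:window]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)
```

For instance, “the drug made the pain worse” has six tokens and five distinct forms, so its ratio is 5/6.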

In terms of the dimensions themselves, we found that the levels of narrativeness and the use of academic qualifications were not clearly different between these two data sets. Those similarities could be helpful in detecting sentences containing drug-use reports, whether they come from formal or informal sources.
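As a reminder of how dimension-level comparisons of this kind are usually obtained in MD analysis, a dimension score is computed by standardizing each feature rate and then adding the standardized rates of the features that load positively on the dimension and subtracting those that load negatively. The sketch below illustrates that arithmetic only. The function names are hypothetical, the caller supplies the feature lists (they are not Biber's actual loadings), and the rates are standardized against the texts passed in, whereas a real analysis may standardize against a reference corpus.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a list of rates to zero mean and unit (population) deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A feature with no variation carries no information for the score.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def dimension_scores(rates_by_text, positive, negative):
    """Return one dimension score per text.

    rates_by_text: list of dicts mapping feature name -> normalized rate.
    positive / negative: feature names loading positively / negatively
    on the dimension (placeholders chosen by the caller).
    """
    features = list(positive) + list(negative)
    z = {f: z_scores([r[f] for r in rates_by_text]) for f in features}
    scores = []
    for i in range(len(rates_by_text)):
        score = sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
        scores.append(score)
    return scores
```

Two data sets can then be compared on a dimension by contrasting the distributions of their per-text scores.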

The main result of the study presented in this chapter is the evidence that Biber’s MD analysis is able to capture differences and similarities related to the linguistic register. It can tell apart drug-use reports in Twitter from generic tweets, and it can also reveal the linguistic features that are most similar in drug-use reports in Twitter and PubMed.

We also found that there are other sets of features not included in Biber’s analysis, such as the use of informal verbs, that differ between the drug-use reports found in Twitter and those found in PubMed. We believe that Biber’s features, either in their raw form or expanded to increase coverage, as well as other politeness features (e.g. to account for the use of informal verbs), have the potential to contribute to pharmacovigilance. These findings can be applied to classification and NER systems, where the gains produced by these features can be validated. Our empirical analyses using these new features are presented in the following chapter, Chapter 5.

Chapter 5

Exploring the register in pharmacovigilance systems

This chapter presents the different experiments we performed using the data sets described in Chapter 3.

Our first experiments tried to answer the question of whether we could build a binary classifier able to detect first-hand drug-use reports in Twitter. We describe this novel task and the set of features we used when building our first binary classifier for detecting those messages.

The subsequent experiments build on that task. We improved our binary classifiers and, besides working on the detection of first-hand experience reports in Twitter, expanded our goals to cover different tasks, while also assessing the improvements produced by the newly added set of register-related features.

The final experiments show the contribution that the set of register-related features makes to a NER system when trying to identify drugs, diseases and symptoms.

5.1 A first approach to binary classification of first-hand
