Contributions - Thesis full text, SOKENDAI (The Graduate University for Advanced Studies) Academic Repository, A1886 full text

This section gives an overview of the three contributions that resulted from our work:

• The development of new resources, i.e. corpora, to perform our studies.

• A linguistic analysis of the differences between formal and informal pharmacovigilance reports.

• An assessment of the impact of register-related features on pharmacovigilance systems.

1.5.1 First contribution

To begin our research, we explored Twitter and curated a corpus of first-hand experience drug use reports. The reasoning is that, to evaluate drug use reports from both formal and informal sources, one key element, also mentioned by Biber [23], is the correct choice of the corpus used in the study. We found that in most cases corpora composed of tweets were not directly available, because the Twitter Terms and Conditions⁹ disallow sharing tweets directly, as they clearly state:

“If you provide Content to third parties, including downloadable datasets of Content or an API that returns Content, you will only distribute or allow download of Tweet IDs and/or User IDs.” This allows researchers to share annotated data by providing the Tweet IDs. Although that is of some use, it can pose a problem: an existing list of shared tweets may be outdated, and some tweets may no longer be online by the time of the download.

Another key element to keep in mind is the set of drugs used in pharmacovigilance studies, which typically varies from one study to another, as does the set of annotated entities and tokens. Annotations tend to target different elements, such as the chemical entity itself, outcomes, symptoms, diseases, or even drug-symptom relations. As a result, even data sets that study the same set of drugs can differ greatly in their annotations; one example is the freely available data sets that only provide binary annotations on ADR mentions.

Bearing those ideas in mind, we first agreed on the set of drugs that would be part of the study and then looked for existing data sets that could help us in our research. None of the available data sets met all our requirements, which motivated the curation of our own resources. To curate our corpora we explored two different approaches to data annotation: we used expert annotators and also laymen to help us label the data.

For our first study we decided to focus on the personal use of popular drugs, i.e. first-hand experience. We found that no other study targeted the same set of drugs that we wanted to include in our research, so we curated our own corpus, which is the first by-product of this work and is presented in Section 3.1. While exploring the potential of Twitter as a source of data for a register study, we try to answer two questions: whether we can use Twitter as a reliable source for building a system to extract first-hand experience reports on drug use, and whether we can rely on laymen to help us curate the corpus for such a system.

9 https://dev.twitter.com/overview/terms/policy

Chapter 1. Introduction 11

Figure 1.2: Non-first-hand experience tweet on drug use (avastin).

To create a similar corpus, we used texts from PubMed and PubMed Central (PMC) and curated a corpus using the same set of drugs as in the previous study. We also constrained the list of extracted sentences to those mentioning some keyword related to patients, under the assumption that those sentences would contain drug use reports. This corpus, known as “Neuroses”, was also produced as part of this study and is another by-product, freely available online as described in Section 3.2.
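The sentence selection described above can be illustrated with a minimal sketch. It assumes simple lowercase word matching; the drug list, the patient keywords, and the function names are hypothetical placeholders chosen for illustration, not the lists or code actually used to build the corpus.

```python
import re

# Hypothetical inputs: the actual drug list and patient keywords are not given here.
DRUGS = {"prozac", "avastin"}
PATIENT_KEYWORDS = {"patient", "patients"}

TOKEN_RE = re.compile(r"[a-z]+")


def mentions_drug_and_patient(sentence, drugs=DRUGS, keywords=PATIENT_KEYWORDS):
    """Return True if the sentence has at least one drug name and one patient keyword."""
    tokens = set(TOKEN_RE.findall(sentence.lower()))
    return bool(tokens & drugs) and bool(tokens & keywords)


def filter_sentences(sentences, drugs=DRUGS, keywords=PATIENT_KEYWORDS):
    """Keep only the sentences that satisfy both constraints, preserving order."""
    return [s for s in sentences if mentions_drug_and_patient(s, drugs, keywords)]


# Invented example sentences, for illustration only.
sentences = [
    "Patients treated with Avastin were followed for six months.",
    "Avastin is a monoclonal antibody.",
    "The patient was discharged the next day.",
]
selected = filter_sentences(sentences)
```

Only the first example sentence is kept, because it is the only one that mentions both a listed drug and a patient keyword. Exact word matching is a simplification: it would miss inflected or misspelled drug names.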

With those data sets ready, we realized that there were three key points that we could improve to produce a corpus of much higher quality for our study:

• In the case of Twitter some drugs appeared much more frequently than others, thus biasing the sample.

• The list of drugs was very focused on two types of drugs, and a more diverse set of medicines could capture more insights on the data.

• We were missing important information by only using first-hand experience reports from Twitter, as some reports from relatives or doctors were left out (see Figures 1.2 and 1.3).

For the first and second points we decided to expand the list of drugs used in our study to include drugs studied by other researchers. We addressed the third point by including any drug use report in which a drug mention appears in a sentence reporting symptoms or diseases, which brings into the study tweets such as the ones presented in Figures 1.2 and 1.3. This improved version of the drug use corpus is explained in detail in Section 3.3.

Figure 1.3: Non-first-hand experience tweet on drug use (prozac).

Additionally, our work to produce these data sets allowed us to detect some entities and relations that caused annotation disagreements between pharmacists. For these findings we explain the problematic elements and the causes of the differences in annotation, and we present our strategy for reducing those disagreements.

1.5.2 Second contribution

While curating this third corpus we studied whether, as stated in Hypothesis 1, the contents found in formal and informal drug use reports were similar. To measure the similarity of the contents we inspected the information that we found in each data set.

Our understanding was that the information to be compared should be what we would record in a database, that is, the relations between drugs, symptoms and diseases found in the sentences. By studying which drug-related reports were mentioned in each source of information we got an idea of the similarity of those drug use reports. This addressed Hypothesis 1: we discovered that there is very little overlap between the drug use reports in each source of information, although in the case of “Outcome-negative” relations the similarity between the drug use reports in PubMed and Twitter texts was strikingly low.
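One way to picture this kind of comparison is to treat each report as a (drug, relation type, target) triple and measure the overlap per relation type. The sketch below uses Jaccard similarity, which is an assumption made for illustration and not necessarily the measure used in this work. The triples are invented placeholders, and "Outcome-positive" is an assumed label alongside the "Outcome-negative" label mentioned above.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_by_relation(source_a, source_b):
    """Group (drug, relation_type, target) triples by relation type and compare sources."""
    grouped_a, grouped_b = defaultdict(set), defaultdict(set)
    for drug, relation, target in source_a:
        grouped_a[relation].add((drug, target))
    for drug, relation, target in source_b:
        grouped_b[relation].add((drug, target))
    relation_types = sorted(set(grouped_a) | set(grouped_b))
    return {rel: jaccard(grouped_a[rel], grouped_b[rel]) for rel in relation_types}


# Placeholder triples; they do not come from the corpora described in this thesis.
pubmed_triples = {
    ("drug_a", "Outcome-negative", "symptom_x"),
    ("drug_a", "Outcome-negative", "symptom_y"),
    ("drug_b", "Outcome-positive", "symptom_z"),
}
twitter_triples = {
    ("drug_a", "Outcome-negative", "symptom_x"),
    ("drug_a", "Outcome-negative", "symptom_w"),
    ("drug_b", "Outcome-negative", "symptom_v"),
}
scores = overlap_by_relation(pubmed_triples, twitter_triples)
```

With these placeholder triples, one of the four distinct "Outcome-negative" pairs is shared, so that score is 0.25, and the "Outcome-positive" score is 0.0 because only one source contains such a relation.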

Hypothesis 2 was motivated by the assumption that drug use reports in informal media probably do not share some of the traits commonly seen in generic social media messages, because these reports convey important content, so that elements typical of social media messages, such as contractions or slang, appear to a much lesser extent in drug use tweets. To address Hypothesis 2 we gathered generic tweets, i.e. tweets retrieved from the API without applying any strong constraint or filter, and drug-related tweets, i.e. the tweets distributed in our TwiMed corpus and presented in Section 3.3. We then compared our data sets using Biber’s approach [8] and found that the tweets containing drug use reports had some features that characterised them as more informative under Biber’s schema.
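As a rough illustration of comparing one register-related feature between two sets of texts, the sketch below counts contractions and normalises the count per 1,000 tokens. It is a deliberately simplified stand-in and not Biber's full method [8], which combines many linguistic features. The regular expressions, the function name, and the example texts are assumptions made for illustration.

```python
import re

# Words with an optional apostrophe suffix; only the straight apostrophe (') is handled.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")
# A small, non-exhaustive set of contraction endings ('s is left out to avoid possessives).
CONTRACTION_RE = re.compile(r"^[A-Za-z]+'(?:t|re|ve|ll|d|m)$", re.IGNORECASE)


def contractions_per_1000(texts):
    """Return the number of contractions per 1,000 tokens across all texts."""
    tokens = [tok for text in texts for tok in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if CONTRACTION_RE.match(tok))
    return 1000 * hits / len(tokens)


# Invented example texts, for illustration only.
generic_texts = ["I can't sleep and I'm tired"]
drug_related_texts = ["The drug caused severe nausea"]

generic_rate = contractions_per_1000(generic_texts)
drug_rate = contractions_per_1000(drug_related_texts)
```

A lower rate in the drug-related set would be in line with the assumption behind Hypothesis 2, but a single feature computed on toy texts says nothing about the real corpora.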

To test Hypothesis 3 we used the set of sentences from PubMed and Twitter included in the TwiMed corpus (Section 3.3) and applied the method proposed by Biber [8] in the same way as for Hypothesis 2. In this case too, we discovered that the most salient differences were in the features related to the informativeness of the texts, and confirmed that PubMed texts were more informative in general. We also observed that some of the features that differed between generic tweets and drug-related tweets also appeared when comparing drug-related tweets and PubMed texts. In this case, Biber’s schema indicated that the set of drug-related tweets was not so different from the set of PubMed sentences.

1.5.3 Third contribution

Our last contribution is aimed at the area of pharmacovigilance: we study positive and negative drug use reports to understand which features help in detecting either kind of report, and to assess which features vary depending on the register in which the reports are written. Addressing Hypothesis 4 showed that some features can provide gains in NER systems for pharmacovigilance and in classifiers aimed at detecting sentences that contain drug use reports describing beneficial as well as negative outcomes, and that the gains provided by these features have a different impact on systems using Twitter and PubMed corpora.
