学術雑誌掲載論文等

(1)

A uthor(s )

Z ielinski, K azimierz; Nielek, R adoslaw; W ierzbicki, A dam;

J atowt, A dam

C itation

Information Processing & Management (2018), 54(1): 14-36

Is s ue D ate

2018-01

UR L

http://hdl.handle.net/2433/230645

R ig ht

©

2017 T he A uthors. Published by E lsevier L td. T his is an

open access article under the C C B Y -NC -ND license.

(http://creativecommons.org/licenses/by-nc-nd/4.0/)

T ype

J ournal A rticle

T extvers ion

publisher

(2)

Contentslistsavailableat ScienceDirect

Information

Processing

and

Management

journalhomepage:www.elsevier.com/locate/infoproman

Computing

controversy:

Formal

model

and

algorithms

for

detecting

controversy

on

Wikipedia

and

in

search

queries

Kazimierz

Zielinski

∗

_,

_Radoslaw

_Nielek,

_Adam

_Wierzbicki,

_Adam

_Jatowt

Polish-JapaneseAcademyofInformationTechnology,Warsaw,Poland,KyotoUniversity,Kyoto,Japan

a

r

t

i

c

l

e

i

n

f

o

Articlehistory:

Received22January2017 Revised11August2017 Accepted23August2017 Availableonline22September2017

Keywords: Controversy Wikipedia Websearch

a

b

s

t

r

a

c

t

Controversyisacomplexconceptthathasbeenattractingattentionofscholarsfrom di-versefields.IntheeraofInternetandsocialmedia,detectingcontroversyandcontroversial conceptsbythemeansofautomaticmethodsisespeciallyimportant.Websearcherscould be alertedwhenthe contentstheyconsumearecontroversialorwhentheyattempt to acquireinformationondisputedtopics.Presentinguserswiththeindicationsand explana-tionsofthecontroversyshouldofferthemchance toseethe“widerpicture” ratherthan letting themobtainone-sided views.Inthiswork wefirstintroduceaformalmodelof controversyasthebasisofcomputationalapproachestodetectingcontroversialconcepts. Thenweproposeaclassificationbasedmethodforautomaticdetectionofcontroversial ar-ticlesandcategoriesinWikipedia.Next,wedemonstratehowtousetheobtainedresults forthe estimationofthecontroversy level ofsearchqueries.Theproposed methodcan beincorporatedintosearchenginesasacomponentresponsiblefordetectionofqueries relatedtocontroversialtopics.Themethodisindependentofthesearchengine’sretrieval andsearchresultsrecommendationalgorithms,andisthereforeunaffectedbyapossible filterbubble.

OurapproachcanbealsoappliedinWikipediaorotherknowledgebasesforsupporting thedetectionofcontroversyandcontentmaintenance.Finally,webelievethatourresults couldbeusefulforsocialscienceresearchersforunderstandingthecomplexnatureof con-troversyandinfosteringtheirstudies.

1. Introduction

Controversy abounds inthe world. At timesit seems that almost anysubject canbe asource of controversysince it isvirtuallyimpossibleforeveryone toagreeonanysubjectofinterest.However, suchassumptionisafallacy:byfocusing onthepresence ofminordisagreements, weignorethe existenceofan organized,encyclopediccollectionofagreed-upon facts.Still,itisquiteclearthatcontroversialtopicsandcontroversialinformationexist,andcomparedwithnon-controversial topicsorinformation,they are usually moreprominentandeasilynoticeable inmedia andontheWeb. Eventopicsthat arewellsupportedbyfactsandevidencemaynotconstituteaguaranteedconsensus.Peopletendtoargueaboutopinions, interpretations,andpointsofview.Thesedisagreements arenotalwayscounterproductive;theymayallowparticipantsto

∗ _{Corresponding}_author.

E-mailaddress:[email protected] (K.Zielinski).

http://dx.doi.org/10.1016/j.ipm.2017.08.005

(3)

adjusttheirpointsofviewortoinspireasearch fortruthorcommonground.Yet,controversycanalsoimpedeorprevent cooperation.Thisproblemhasbeenrecognizedandaddressedbycommunitiesthatgeneratepeer-producedcontent,suchas Wikipedia,whichattemptstolimittheimpactandspreadofcontroversyandimprovethearticleeditingprocess.Therefore, detectionofcontroversymaybecrucialforensuringhigh-qualitypeer-producedcontentandmorebalancedviewpoints.

WhilebrowsingtheWebinsearchforinformationaboutatopicofinterest,individualsmayfindcontentthatsupports onlyonepointofview.Thethreatofreceivingbiasedinformationormisinformationisevenmoreseriouswiththepervasive personalization ofsearch engines andthe associated filterbubbleeffect (Nguyen, Hui, Harper, Terveen, & Konstan, 2014; Pariser,2011).Unlessuserspersevereintheirsearchforacquiringmoreinformationtocorroborateandtofindinformation fromdiverseviewpoints,theyarelikelytolearnaboutonlyone-sideofanyargumentregardingthetopicoftheirinterest. Consequently,apersonmaymake wrongdecisionsasaresultofbeingexposed tomisleading(orone-sided) information aboutacontroversialtopic,withoutevenrealizingthatcontroversyexists.Ifthepersonwasmadeawareofthecontroversy, she/hewould be ableto make more informed andbetter decisions basedon more comprehensiveknowledge about the topic.Thetruism’forewarnedisforearmed’alsoappliestothecontextofWebsearches.

Withtheaboveinmind,atimelywarningabouttheexistenceofcontroversycouldbebeneﬁcialforuserswhomaybe unawareof itbeforethey trust incompleteorbiasedinformationonimportanttopics.In ourstudy,we focusonthe two mostwidelyused“startingpoints” foruserswhoarelookingforinformationontheInternet:Wikipediaandsearchengines. Withcloseto 500million unique visitors per monthWikipediais immenselypopular knowledge basethat haspowerful educationalimpactonoursociety.Oftenusersuseitasaspringboardwhenresearchingvarietyoftopicsincludingcomplex onesinscientiﬁcdomainsormoretrivialoneslikecelebrityinformation.Wikipediaisalsocommonlyusedforsupporting variousdataprocessingtasks.

SearchenginesaregatewaystotheInternet.Theyassistusersinlocatinganydesiredinformationorwebsites.Assuch,it wouldbebeneﬁcialtosearchersiftheygotalertedontopiccontroversywheninteractingwithsearchengines– beforeeven seeingthereturnedresults.Inthisworkweconsideronlysearchqueriesandnottheresultsreturnedbysearchenginesto avoidtheneedtoaccountforthesearchenginetypeandlimitationsassociatedwithpersonalizingthesearchprogress.

1.1. Research objectives

Giventhedescribedimportanceandcomplexityoftheconcept ofcontroversy,we identiﬁedthefollowingresearch ob-jectives.

Whilepriorworkhasalreadydemonstratedmethodsofidentifyingcontroversy,thereisnouniﬁedframeworkthatwould deﬁne the problemand representit formally. We believe that a conceptual modelof controversy is necessary. It would providebasisforcomputationalapproachestowardsbuildingapplicationsaimingatdetecting,estimatingandunderstanding controversyintexts.Suchamodelwouldbeusefulforestablishingdetectionmethods,butitcouldalsoallowustocompare existingapproaches.

GiventheimportanceofWikipediaandsearchenginesinacquiringinformationandlearning,onewouldexpectastrong supportforonlineuserstopreventacquisitionofone-sidedinformation.However, nosuchsupporthasbeenexplicitly of-feredsofar,besidessolutionsthatrelyonmanualdetectionofcontroversyorfocusonrelatedtasks(e.g.,contentrelevance estimation).Toimplementsuchfunctionalities, thedevelopmentofeffectivealgorithmic approachesfortreatmentof con-troversyisrequired.

1.2. Contributions

Inthiswork,wemakethefollowingcontributions:

Definition and formal representation of controversy: We propose a definition of controversy to serve as the basis forourresearch. Thisdefinitionisbasedonthesocial sciencesandepistemology. Itmayalsobeusefulforresearchersin computerscience,asit drawsalinebetweencontroversialandnon-controversialcontentandtopics.Themain advantage ofthisdefinitionisthat thelevelofcontroversycanbequantified. Moreover,ourformalmodelclassifieserrorsashaving threemainsources;suchclassificationisindispensablewhenattemptingtopredictcontroversy.

Estimatingcontroversy ofWikipediaarticles andcategories: We develop newmethods foridentifying controversial articlesonWikipediabyanalyzingthesentimentsofdiscussionsonthearticletalkpages.Usingthesedetectionmethods,we canpinpointthesectionsofarticlesthatgeneratethegreatestcontroversy.Basedontheproposed classiﬁersofWikipedia articlecontroversy, we can estimate the levels ofcontroversy associated with topical categorieson the English-language Wikipedia.Thisapproachcanbethenusedtopredictthelevelofcontroversyinnewarticleswithinacategory.

(4)

different,personalizedresultstothesamequery.Thissituationismostsigniﬁcantforqueries relatedtocontroversial top-ics, because a usermayreceive search resultsthat reﬂect one side ofthe controversy (andanother usermayonly learn abouttheotherside).Incontrast,ourmethodproduces resultsthat areindependentofthepersonalizedrecommendation algorithmthatselectsauser’ssearchresults.

Theremainderofthisarticleisorganizedasfollows:inthenextsection,wereviewtheabundantliteratureonthetopic ofcontroversy andfocuson workrelevant todeﬁning controversy andonalgorithmic methods forcontroversydetection orprediction.In Section3,weproposeourdeﬁnitionofcontroversy.In Section4,wedescribetheWikipediadatasetused inour research.In Section 5,we introduceandevaluate ourmethods forpredictingcontroversy ofWikipediaarticles.In

Section6,weevaluateandrankthelevelofcontroversyoftopicalcategoriesinWikipedia. Section7introducesourmethod forpredictingthecontroversyinWebqueries.Finally,in Section8,wedrawconclusionsanddiscussthefuturework.

2. Relatedwork

2.1. Controversy in social science

Controversyisaubiquitousphenomenon.SincethetimesofancientGreecetothe20thcentury,thestudyonthenature ofcontroversywasa domainofgreatphilosophers. Itisusually deﬁnedaspublicargument/discussion aboutatopicthat has supporters on one side and many people who strongly disagree with them or are somehow shocked by this issue. Withrespect to time,controversy ischaracterized by the factthat no agreement orconsensus isreached onthe subject for a prolonged period oftime. Controversies range from ﬁercepolemicsto polite and orderlydiscussions. What makes controversyresearchanexcitingtopicisthatcontroversiesusuallystemfromissuesofhighsocialrelevance.

Today,the studyofcontroversyis carriedout mainly by philosophers,sociologists, andinformationscientists. Contro-versies areknownto existinall areaswherethereare disputes relevanttocurrentsocieties, includingeconomy,politics, art,religion,science,gender,sexualorientation,race,ethnicity,culture,immigration,educationandmanyotherareas. Con-troversy, by its verynature, isrelated toconﬂict.As early asin1956, thefunctionsof socialconﬂicts were describedby

Coser(1956),whoconsideredtheroleofconflictinestablishingandmaintaininggroupidentitiesandrelationshipsbetween them.Conflictsthatarisefromthefrustrationofspecificdemandsaremorepersistentandleadtoaggressivebehaviorbased onstrongsentiments.Inthoseindividualswhoaredeeplyinvolvedinthecontroversy,feelingsofattractionaswellas hos-tilityarelikelytoarise.Sometimespartiesviewthemselvesasrepresentativesofcollectivesorgroupswhoarefightingnot forthemselvesbutratherforthegoalsandideals ofthegroup.The eliminationofpersonalmotivations tendstointensify conflict.Conflictalsoleadstotheformationofcoalitionsandassociationsbetweenpreviouslyunrelatedparties.

SomeofCoser’sobservationsregardingsocialconflictsareparticularlymeaningfulintheInternetcontroversiesoftoday, whereinwefindthatideologicalconflicts,forexample,areoftenparticularlydifficultandleadtoongoingcontroversies.This isbecauseinsuch casesthecauseisnotrelatedtoanyreal-worldfactorincident,butrathertodiffering worldviewsor identities,whichcorrespondtoCoser’ssocialgroups. Currently,wealsofrequentlyobservetheexpressionofstrong senti-mentsandtheformationofcoalitionsbetweenpartiesinacontroversy.

Koutra,Bennett, andHorvitz (2015) examined the browsingbehavior of searcherson the controversialand polarizing topicofguncontrol,andinthisstudyfocusedontheinfluenceofasingledisruptiveandshockingnewsevent.Theauthors foundthat,overall,mostpeopleusetheWebtoaccessinformationwithwhichtheyagree.Whenanexternaleventthreatens to directly influence users, only then do they explore content outside their filter bubble.Otherwise, most users appear “narrow-minded” with respecttothenumberofvisitedWebdomains, andthesedomains presentmostlyone sideofthe issue,whichistypicalinideologicalconflict(andassociatedcontroversy),asdefinedbyCoser.

In1976, GurrandDuvall(1976)formulatedaformaltheoryofpoliticalconflict.Thistheory,developedinthepre-Internet era,focusesonthephysicalconfrontationbetweensocialcollectiveactorsandtheassociatedmaterialdestructionincurred. Whiletheauthors’mainpurposeindevelopingthetheorywastodescribe socialphenomenasuchasrebellionsandother physical confrontationsbetweenorganizedcollectives, itis applicable toInternet conflicts andcontroversies inthe sense thatthemagnitudeofaconflictstilldependsonindividual potentialsforaction,collectivedispositionstowardaction,and organizationalstrength– thelattertoalesserextentbecauseintheWeb2.0erawe aredealingwithinformalgroupsand coalitionsofInternetusers.

2.2. Role of controversy in media

Controversy plays very important role in media as it typically increases media popularity and coverage. Media pro-grams likereality TV were developed forthe purposeof sparking controversiesby creating moral panic,asdescribedin

Biltereyst(2004).TheMITCenterforCivicMediadevelopedaControversyMapper,whereinmajornewsstoriesare reverse-engineeredto visualize thespread ofideasandthe changeinmedia frames overtime, andto identifywhose voicesare dominating adiscussion. In their investigations ofa major national controversy inthe media,the MITresearchers found that broadcast media continues to be important as an ampliﬁer and gatekeeper of news, but that it is susceptible to media activistsworkingthrough participatorymedia toco-create newsandinﬂuence theframing ofmajor controversies.

(5)

foundmany differencesbetweennewssources, thus enablingthe identificationofcontroversy basedon a comparisonof how the subjectis portrayed by different media groups. An unsupervised method for identifying controversial semantic categoriesinonlinemedia wasproposedby DeClercqetal.(2014).The authorsusedDBpedia toaggregateconcepts into categoriesandthenidentifiedcategoriescharacterizedbysignificantdeviationsinsentimentacrossdifferentonlinemedia.

2.3. Controversy in social media

As we will detail later, controversy needs appropriate community. Research on mining antagonistic communities in social networks was undertaken by Lo, Surian, Prasetyo, Zhang, and Ee-Peng (2013) who modeled direct antagonistic sub-communities using existing positive and negative links between members of Eopinions and Facebook . In Twitter, a widely used participatory media, controversies receive wide coverage very quickly. However, Smith, Zhu, Lerman, and Kozareva(2013)foundthatTwitterisprimarilyusedforspreadinginformationtolike-mindedpeopleratherthanfor debat-ingissues.Usersaremorelikelytorebroadcastinformationthantorespondtoacommunicationbyanotheruser. Individu-alstypicallytakeapositiononanissuepriortomakingpostsaboutitandareunlikelytochangetheiropinion. Yardiand Boyd(2010)observedgroupdynamicsonTwitterbetweenpro-andantiabortionrightsgroupsinrelationtoacertainevent anddiscovered bothhomophilyandheterogeneityinconversations aboutabortion.Althoughpeople todayareexposed to broaderviewpointsthantheywerepreviously,theyarelimitedintheirabilitiestoengageinmeaningfuldiscussion.

IntheirresearchoncontroversiesinvolvingknownentitiesinTwitter, PennacchiottiandPopescu(2010)usedtimelyand historicalscorestoclassifycontroversy.TheauthorsfoundthatmostcontroversiesonTwitterrelatetomicro-events(e.g.,TV shows,awardshows,orsportevents). Garimella,Morales,Gionis,andMathioudakis (2015)usedthesocialmedia network structureonTwittertodevelopanetwork-basedcontroversymeasureandfoundcontentfeaturestoberelativelyunhelpful inthistask. Laterthey focused their research on exploring thetopics ofdiscussion on Twitterandunderstanding which onesarecontroversial(Garimella,Mathioudakis,Morales,&Gionis,2016).

2.4. Controversy in Wikipedia

Wikipediaprovides aknowledge basethat has beenexploitedby researchers fromvariousscientiﬁc domains. The re-searchtrendsinareassuchasinformationretrieval,naturallanguageprocessing,andontologybuildinghadbeenreviewed anddescribedrecentlyby Mehdi,Okoli,Mesgari,Nielsen,andLanamaki(2017).

The first paper addressing the problem of cooperation and conflict between editors on Wikipedia was published in 2004 by Viégas, Wattenberg, and Dave (2004). The authors used the history of edits to develop a tool for visu-alizing patterns of conflict. Based on similar meta-information (i.e., number of edits, unique editors, and anonymous edits), Stvilia, Twidale, Smith, and Gasser (2005) measured article quality. Buriol, Castillo, Donato, Leonardi, and Mil-lozzi(2006)studiedconflictsinWikipediabyfocusingonentriescharacterizedbyaseriesofeditsrevertingthemto previ-ousversionsbetweenauthors.

Wikipediaedits are organized around the concept of revision . An editor will make changes to an articleand publish anewversion (aka revision). Acurrentrevisionmay alwaysbe reverted toan earlierversion (byanyone) andtheentire historyofrevisions isaccessible onWikipedia.Repetitive mutualreverts haveled toedit warsandtypically indicate the existenceofcontroversy(oratleastanargumentaboutsomefacts).Anin-depthstudyofeditwarshasbeenconductedby

Sumi,Yasseri,Rung,Kornai,andKertész(2011a)and Sumi,Yasseri,Rung,Kornai,andKertesz(2011b).Notalleditwars are necessarilynon-constructive.Intheir study, Yasseri,Sumi, Rung,Kornai,andKertész(2012) identiﬁed mutualrevertsthat haveledtoconsensusaswellasthosethathaveremainedinastateofpermanentcontroversy.Later Yasseri,Spoerri, Gra-ham,andKertész(2014) usededitwars to present,visualize andanalyzetopical overlapsofcontroversiesin 10different language versionsof Wikipedia.The authors concludedthat there are both globalcontroversial topicswith cross-culture resonanceandtopicswithmorenarrowinterestlimitedtolanguagecommunitiesorgeographicalareas.

Todetectcontroversialarticles,researcherscanaccessabroadspectrumofmetainformationregardingeditorial behav-iors.Vuongetal.proposedtwocontroversy-rankingmodelsforWikipediaarticles,drawnfromthehistoryofcollaboration andedits(Vuongetal.,2008). Kittur,Suh,Pendleton,andChi(2007)proposedaverygoodmethodbasedoneditdynamics andtalkpages.Recently, RadandBarbosa(2012)compareddifferentmethodsfordetectingcontroversy(someofwhichare basedonthepropertiesofcollaborationnetworks),whichconstitutedthebasisforoneofourrecentresearchefforts Predict- ing Controversy of Wikipedia Articles Using the Article Feedback Tool (Jankowski-Lorek,Nielek,Wierzbicki, &Zieli´nski,2014). FocusingonthecollaborationhistoryofpairsofeditorsSepehri-RadandBarbosaintroducedamodeltopredictcontroversy inasupervisedway(Sepehri-Rad&Barbosa,2015).

(6)

ofcollaborationthatincreasedthearticlequality. Laniado,Kaltenbrunner,Castillo,andMorell(2012)analyzedtheemotional stylesofWikipediaeditors.Finally, Dori-HacohenandAllan(2013)usedsentimentasbasisfordetectingcontroversyonthe WebandcomparedtheseresultswithamethodbasedonataggedcorpusofarticlesfromWikipedia.Wikipediahasreceived muchresearch attention, particularlyinthe ﬁeldsofsocial behaviorandcontentquality.Incontrasttostudies ofconﬂict,

Borzymek, Sydow, andWierzbicki (2009), Turek, Wierzbicki, Nielek, Hupa, and Datta(2010), and Wierzbicki, Turek, and Nielek(2010)investigatedcollaborationandteamworkonWikipediasocialnetwork.

Wikipediahasbeensuccessfullyusedasdatasourceforretrievingsemanticinformationtoimproveresultsfromsearch engines andto categorize texts. Milne andWitten (2008) measured semanticrelatedness by the hyperlink structures of Wikipediaarticles.Theycounted term frequency-inverse document frequency (tf-idf)oflinksweighted bythe probabilityof eachlinkandthencomputedtherelatednessofeacharticle. HajianandWhite(2011)createdamulti-treeforeachentityin theWikipediacategoriesnetwork,combinedthem,andusedamulti-treesimilarityalgorithmtocomputethesimilarityof theentities.Recently, Han(2013)proposedamethodformeasuringsemanticsimilaritythatusesWikipediaasanontology. In additionto semanticsimilarity, the Wikipediacategorygraph (WCG)hasbeen usedin research toimprove adhoc documentretrieval(Kaptein,Koolen,&Kamps,2009),identifydocumentcategories(Schonhofen,2006),andacquire knowl-edge(Nastase&Strube,2008). Medelyan,Milne,Legg,andWitten(2009)publishedanextensiveoverviewofresearchwhich minedWikipedia.In2006, Voss(2006)coinedaterm collaborative thesaurus fortheWikipediacategorystructure.The struc-turedformofWikipediacategoriesallowedfortheautomatedlearningofontology(Yu,Thom,&Tam,2007). Kittur,Chi,and Suh (2009) used a WCG of annotated data to detect contentious topics in Wikipedia. Recently, Biuk-Aghai, Pang, and Si(2014)attemptedtovisualizehumancollaborationinWikipediabyvisualizingWCGsubtreesassimpletrees.

IntheirsearchforcontroversialtopicsinWikipediaarticles, Borraetal.(2015)usedlanguageagnosticprogrammingto develop a tool they calledContropedia. It assignscontroversyscores toevery Wiki link. Thesescores are thenpresented graphicallyasacontroversydashboardshowingboththeWikilinkscoreanditsassociatedtimelineofedits.

2.5. Credibility and controversy

WiththeadventofWeb2.0,onlineevaluationhadbecomeanimportantfeatureinmanyapplicationsinvolving informa-tion.InordertoassessthecredibilityofinformationontheWeb,anumberofevaluationsystemsweredevelopedandmuch researchhadbeendevotedtothedesignofcredibilityassessmentsystems. Ka̧kol,Jankowski-Lorek,Abramczuk,Wierzbicki, andCatasta(2013)studiedtheinﬂuenceofsubjectivityandbiasincredibilityratingsandfoundtheir presencetobeakey design issueinWeb credibility systems.Theconcept ofcontroversy hasbeenstudied withrespect toevaluationsystems asan important inﬂuencingfactor forimprovingoverall credibility. The existence ofsubjectivity andmutualdependency betweenreviewerbiasandobjectcontroversyisbroadlyrecognised.Lauwetal.developedareinforcement-basedmodelto quantifybiasandcontroversywithinanevaluationsystem(Lauw,Lim,&Wang,2006;2008).

2.6. Controversy in Web search

Ennals, Trushkowsky,andAgosta (2010b)developed aDispute Finder browserextension that ﬁnds andhighlights text snippets that correspond with knowndisputes. A databaseof knowndisputes ismanually maintained and a textual en-tailment algorithm is used to ﬁnd snippets that correspond with a known dispute. Initially, the authors populated the databasefromwebsiteslike PolitifactandSnopes,andsubsequentlythey developeda methodforidentifyingdisputes on the Internet based onsome Englishtext patternsfound to be typical in thesedisputes (Ennals, Byler, Agosta, & Rosario, 2010a). Kawahara, Inui, andKurohashi (2010) focused on the recognition andbird’s eye view presentation of contradic-toryandcontrastiverelationshipsbetweenstatementsrelatedtoatopic(searchquery)andthoseexpressedonWebpages (search engineresults). Theirmethod isbased on predicate-argument language structures inJapanese, butis believedto be adaptable to other languages. Finding contradictions through the sentiments expressed by viewerswas attempted by

Tsytsarau,Palpanas,andDenecke(2010).Theirmethodrequiredthetopicinquestiontohaveanumberofonlinereviews oropinions,fromwhicha sentimentcouldfirstbe aggregatedandthen usedtoidentifyacontroversy.Dori-Hacohen and Allan classifiedcontroversiesonInternet pages,basedon theirsimilarity toWikipediaarticlesknown tobe controversial. Initially,the authorsbasedtheir classificationson a manuallyannotated setofarticles (Dori-Hacohen & Allan,2013), but thenthey generalizedtheir methodforautomaticclassification(Dori-Hacohen&Allan, 2015).Theauthors’recentresearch (Dori-Hacohen,Jensen,&Allan,2016)proposedanewstackedmodelandacorresponding methodofclassificationof con-troversial pagesby creatingsubnetworks oftopically relatedpages.Theyachievedimprovement inAUC ofthedeveloped classifierwhencomparedwiththeclassifierbasedonasubnetworkofrandompages,thusprovingthatcontroversyexists intopical neighborhoods. JangandAllan(2016) addressedtheunavailabilityofedithistoryforsome pages bysmoothing fromthe scoresof theneighborswith moreestablished edithistory,which helpedthem improvethebinarycontroversy classification. Jang,Foley,Dori-Hacohen,andAllan(2016)investigatedtheprobabilityofcontroversyinadocumentand de-velopedalanguagemodelforcontroversy. Yamamoto(2012)developedaWebquerysupportsystembycollectingdisputed sentencesrelatedtosearchqueriesandthenprovidingtheuserwithsome ofthemosttypicalandrelevantdisputed sen-tencestoenhanceusers’awarenessofsuspiciousstatements.Despitetheseadvances,moreproblemsarise withrespectto controversialqueries andcontroversialresultsifourgoal istoprovide assistanceto search engineusers.As describedby

(7)

vs.popularbeliefs,theneedformoraljudgmentwhenfacedwithseveralpossiblecorrectanswers,andtheculturaland so-cialsettingsofthepersonseekinginformation.GoingfurtherbeyondcontroversydetectionDori-Hacohenproposedresearch onautomaticstancedetection(Dori-Hacohen,2015).

3. Formalmodelofcontroversy

3.1. Formal model of controversy

Manyresearchershavestudiedcontroversybut, surprisingly,noformalcomputational modelfordetectingcontent con-troversyhasyetbeenproposedorwidelyaccepted.Themain reasonforthisistheissue’scomplexitycausedby multiple aspectsoftheproblems.Diversemethods,heuristicsandalgorithmsfordetectingcontroversyproposedinrecentyears ap-pliedtodifferenttypesofobjects(startingfromsentencesandendingwithproductsorvideoclips).However,itisdifficult tocreateacomprehensivetheoreticalframeworkforthem.Ontheotherhandsuchamodeliscrucialforthedevelopment andevaluationofalgorithmsfordetectingcontroversy.Inthissectionweoutlinethemodelanddiscusspotentialdifficulties inusingittoshapeandpredictcontroversies.

The formal model of controversy proposed in this section cannot be calculated per se without adding content- and community-dependentmethods of calculating(or estimating) empirical distributionof communityopinions about evalu-atingsubjectandselectingtheappropriatemetricformeasuringdisagreement(itmightbeassimpleasLeik’sordinal con-sensusorascomplexasapplicationofclusteringandearthmoverdistanceproposedby Rafalak,Deja,Wierzbicki,Nielek,& Kakol,2016).Itisalsoworthtonoticethatproposedmodelisuniversalandcanbeappliedtoanytypeofobjectsandtheir features– e.g.controversyaboutthetrustworthinessofwebpagesorthetasteoficecream.

The Merriam-WebsterDictionary defines controversyas an “argument that involves many people who strongly disagree about something: strong disagreement about something among a large group of people”.From thisdefinition,we canextract thetwobuildingblocksofcontroversy:alargegroupofpeopleandtheopinionsorevaluationstheyexpress.Ifwecombine thesetwoaspects withthe factthat controversymustbeaboutsomething (e.g.,text,event, orpicture)we obtaina triad composedoftheobject,agroupofpeople,andtheir opinionaboutthegivenobject.Separately,neitheragroupofpeople (community)noranobjectissufficienttogeneratecontroversy.Formally,controversycanbedefinedasafunctionofthree variables:

f

(

ob,com,Eob

com

)

→

{

uncontro

v

ersial,contro

v

ersial

}

, (1)

where

O –set of objects

ob ∈O –single object of controversy (e.g., web page, picture, opinion, etc.), C –set of communities,

com ∈C –community, E ob

com –empirical distribution of opinions given by members of community comfor object ob.

Object ob iscontroversialforagivencommunity com onlyifthereisstrongdisagreement initsevaluation E ob com.

Inter-rater agreementmeasures (e.g., Fleiss’kappa or intra-class correlationcoefficient) ordispersion measures of results(e.g., standard deviation) are typically used to assess agreement but are not suitable for use with a Likert-type ordinal scale (Jamiesonetal.,2004).ItistypicallyusedtoevaluateobjectsontheWeb.Historically,theﬁrstmeasuretailoredtodealwith ordinalscaleswastheLeiksordinalconsensus,introducedin1966by Leik(1966).Othermetricswereproposedby Tastleand Wierman(2007) and Van der Eijk (2001). Yet another approach is to measure the distancebetween the distribution of opinion E ob

com andthereferencedistributiontoidentifyacontroversialclass(e.g.,u-shapeddistribution).Theearthmover’s

distanceandtheBhattacharyyadistanceare twoofmanypossiblemetricsfordiscretedistributions. Althoughcontroversy mayvaryinintensity,forsimplicityweassume thata proposed functionreturnsonlyone oftwo values– eitheragiven object ob iscontroversialornot– butthesameapproachmaybeusedforfunctionsincorporatingmoreshadesofgray.

Eq.(1)canbesimpliﬁedbyremoving objects ob and com fromthefunction,becauseinourcalculationweonlyusethe object ob evaluationsandnottheobjectorcommunitythemselves.Asaresultweobtainthefollowing:

f

(

Eob

com

)

→

{

uncontro

v

ersial,contro

v

ersial

}

. (2)

Followingtheabovedeﬁnitions,itbecomesclearthatthecontroversycannotbeanalyzedwithoutregardtothe commu-nity.Oneconsequenceofthisisthatvariouscommunitiesmayhavedifferentcontroversiesonthesameobject.InWikipedia, anobviousexampleofthesedisjunctivecommunitiescanbeseeninitsdifferentlanguageversions.Anexampleofhowthe controversyononetopicchangesacrossWikipedialanguageversionsisdescribedinthe Section5. Fig.1depictstwoactual examplesofevaluationdistributions fordifferentarticles intheEnglish Wikipedia.Fromthesegraphs, wesee thatJustin Bieberseemstobe,atleastfortheWikipediacommunity,muchmorecontroversialthanAlexandreDumas.

3.2. Meta approaches for estimating object controversy

(8)

Fig. 1. DistributionsofvotesontrustworthyratingobtainedviaArticleFeedbackToolfortwoarticles.

somedataismissingandmustbe compensatedforinsomeway.Belowwe presentthreetypicalreal-worldscenariosand a brief analysis of the potential sources oferrors. We note that there is one more important aspect of computer-based controversy. Bothinthe model we presentin Section 3 andinour discussion below,we assume that neither theobject nor thecommunity changesover time. Web pages evolve,people change their opinions,andcommunities grow. Eachof theseprocesses maystronglyinﬂuence thecalculation results of Eq.(1)andimpose yetother limitations oncontroversy predictionmethods.

3.2.1. Estimation by similarity to other objects

In thisscenario, we have high-qualitydata regarding the controversyabout object ob a fora community com i butno

informationaboutobject ob b’sevaluation, forwhichwe wanttoestimate thecontroversy. So,we cantryto measurethe

similarity between objects ob a and ob b and, ifwe ﬁnd them to be very similar, we can assume that the results ofthe

controversyfunctiondeﬁnedin Eq.(1)willbe thesameforboth.Themainproblemhereisthat calculatingthesimilarity betweenrealobjects can bedone in manydifferentwaysandwe donot knowa prioriwhichof thesemethodsis most accurateinapproximatingthecontroversyfunctionvalues.

Ourattempts to apply this approachto texts by usingtf-idf formeasuring the similarity oftexts, however, exceptin some rare cases,were not successful. Thisisnot surprisingif youconsider that thepresence ofeven asingle wordina lengthy text mightsparkcontroversy.Forother typesofobjects,such aspictures orvideo,thisissuebecomesevenmore challenging.

3.2.2. Estimation of distribution and community

Acommonapproachtostudyingcontroversyinthecomputersciencesistoaskpeopletolabelwhetheragivenobjectis controversialandthentousetheselabelsforfurtherstudy(e.g.,applyingmachine-learningalgorithmsandbuildingmodels (Dori-Hacohen&Allan,2013;Dori-Hacohenetal., 2015)).Researchers,beingawarethat controversyisan inherentlyfuzzy concept,collectmanyevaluationsforeachobjectandthencalculatetheinter-raterreliability.Whenpeopleareaskedabout thecontroversyofagivenobject,theyhavetojudgetheobjectandestimatetwoadditionalvariablesrequiredtocalculate

Eq. (1): com and E ob

com. People tend to overestimate the popularity of their own opinion in a community andfollowing availability heuristic think of themselves as a typical communitymember. Bothof these potential errorsources may be somewhatcontrolledbygatheringanswersfrommanypeopleinsteadoffromasingleindividual,butasyetitisunknown whetherthisismoreeffectivethancollectingsubjectiveevaluationsaboutanobject.

3.2.3. Bias sources

There are quitea few serviceson theInternet on whichpeople can rateobjects,including movies,books, restaurants andothers. Eventhe English Wikipediahada mechanism forratingarticles (see Section 4.1). In that casewe knowthe community and the object,and we have evaluations of that object.Therefore in theory we should be able to precisely calculatethecontroversyfunction(Eq.(1)).Inrealityitisnotthatsimple.

OfthemillionsofWikipediareaders,onlyatinyfractionevaluateagivenarticle.Moreover,thissmallsamplemightnot berepresentativeofthewholecommunitysinceitisself-selected.Peopletendtoevaluateobjectsthatinspiretheirintense emotions (either positive ornegative). Therefore,the opinions of thesilent majority are usually not well representedin collected evaluations. Rafalak etal. (2016) found the distributions ofevaluations to have relatively robust samplingbias withrespecttosocio-demographicvariables.Theover-representationofaparticularpointofviewinoursamplemightalso beduetointentionalmanipulation– evenasmallminorityofwell-organizeduserscanstronglyinﬂuencethedistribution ofevaluations.

4. Datasets

(9)

Fig. 2. ArticleFeedbackToolv4.

Fig. 3. Histogramofratingscores(0–5)oftrustworthydimensionperclass.

4.1. Article Feedback Tool

Article FeedbackTool (AFT) – a Wikimediasurvey forarticle feedback– engaged readers inthe assessmentof article quality.AFTwasusedonWikipediainﬁveversionsandwasrecentlydiscontinued.Inourresearch,we useAFTv4,which allowedreadersoftheEnglishWikipediatoratearticlesonfouraspects,regardingwhetherthey wereperceived asbeing trustworthy,objective, complete,and well-written. Users were presentedwith a survey (Fig. 2) atthe endof every arti-cleandcould submittheir ratingsforeach dimension ona starscaleof 0to 5.Inthisresearch, we consideredonly the

t rust worthy dimensionofAFTv4 inorderto createuniversal controversydetectionclassiﬁersthat canbe applied without directlyasking usersaboutthe objectivityofthe text.Ouroriginal AFTv4 datadump containedover11 M articleratings basedon5.6 Mdifferentrevisionsofmorethan1.5 MdistinctarticlescollectedbetweenJuly2011andJuly2012.

4.2. Articles dataset

TheprimarysourceofWikipediaarticlestaggedascontroversialisalistcomposedbyWikipediaadministrators.Forthe EnglishWikipedia,this“listofcontroversialissues” (theofficiallisttitle)contains963articles,lessthan0.04%ofallarticles. Thisverysmallnumberofcontroversialarticlesmayindicatethatonlyafractionofarticlesiscorrectlytagged(anexample ofone decidedly controversialarticle that is not onthe listis givenin Section 5). Acareful examinationof the listalso revealsthat manyofthesearticlesare no longercontroversial(althoughthey were controversialinthe pastandmaybe soagaininthe future).We testedmanyalternativewaysforcollecting a listofcontroversialarticles (includinga dataset sharedby Dori-Hacohen& Allan,2013),butdecided tocontinuetousealistcomposedbythe Wikipedians,asitwasthe easiestto replicateforother researchersand, atthesametime,represented theclosestexampleofthe widercommunity understandingofcontroversy.

Fromatotal numberof963controversialarticles,we chose219that hadatleastthreeevaluationsintheAFTdataset. Since theapplication ofmachine learningtechniques requiresboth positive andnegative examples (preferablyinsimilar quantities),wealsorandomlyselectednon-controversialarticlesinawaythatmaintainedthesamedistributionoflength(in characters)inbothclasses.Theﬁnaldatasetthuscontainedrecordsof438articles,bothcontroversialandnon-controversial.

(10)

Fig. 4. Histogramofnumberofratingsperarticle.

4.3. Sections dataset

Todeterminewhetheritispossibletoautomaticallyidentifycontroversialissueswithinanarticle,werandomlyselected 512sectionsfromthetalkpages ofcontroversialarticlesandmanuallyannotatedthem,usingbinaryvalues(1for contro-versialand0fornon-controversial).Surprisingly,only19.5%ofthesectionsturnedouttobecontroversial.

5. DetectingcontroversyofWikipediaarticles

Tomanagetheeditingofcontroversialarticles,Wikipediahasestablisheda setofrules.Althoughtheserulescanvary amongdifferentlanguageversions,inprincipletheyarebasedonlimitingthenumberofuserswhoareauthorizedtoedit. Ononehand,restrainingtheeditingofcontroversialarticlesreduceseditwars,butontheotherhanditslowsthe publica-tionofqualitycontent.Earlydetectionofcontroversymayhelptolimitthecostandimpactofeditwars andincreasethe reliabilityofWikipediaarticles.OnWikipedia,atopicthatiscontroversialinonelanguageversionmaynotbecontroversial inanother. One exampleofthisis anarticle aboutthehistoric bombingeventof thePolish townof Wielunby the Ger-manLuftwaffeattheoutbreakofWWII.ThecontroversybetweensomePolishandGermanhistorianscanonlybefoundin theEnglishWikipedia.AmuchlongerarticlecoveringthiseventexistsonthePolishWikipedia,whichisnotcharacterized by anycontroversy, whilethereisno sucharticleatall ontheGermanWikipedia.Thisexampleﬁts ourformalmodel of controversyandreﬂectsthedependencyofcontroversyonthecommunity.

Below, we present three classiﬁers for the detection of controversy on Wikipedia. We are proposing the ﬁrst two (Sections5.1and 5.2), whereas weare using theother one (Section 5.3.1), whichwasintroduced by Kittur etal.(2007), asabaselineforcomparison.

5.1. AFT-based controversy detection

Forevery articleinour datasetwe computedthefrequencyofeach ofﬁveavailable AFTratings, theLeik’sdispersion measure (Leik, 1966), andthe total number ofratings. We used the resultsas features in a machine learningalgorithm (Random Forest).We foundthemostimportantfeaturestobe thenumberof1-starratings,thenumberof2-starratings, andthetotalnumberofratings.

5.2. Emotion polarity-based controversy detection

(11)

Table 1

PerformancecomparisoninAFTdataset.

AFT EP Kittur Combined F-measure 80.37% 69.05% 81.48% 84.10%

PrecisionControversial 79.96% 67.66% 79.77% 83.14%

PrecisionNon-controversial 80.78% 70.78% 83.42% 85.11%

ROC-AUC 0.88 0.73 0.89 0.91

5.3. Baseline

Ascomparativebaselineforouralgorithms,weusedthemethodproposedby Kitturetal.(2007).Inaddition,we deter-minedwhethercombiningornotthesemethodwithouralgorithmsincreaseditsperformance.

5.3.1. Meta information

Kitturetal.proposedthemetaclassifier(Kittur etal.,2007),whichweusedasone baselineforourstudy.Ithadbeen comparedwithfiveother classifiersinaprevious study(Rad&Barbosa,2012).Wechose thismethodbecauseofitsgood performanceandtheeasinessofdatasetconstruction.Basedonastatisticalanalysisoftherevisionhistoriesforarticlesand theirtalkpages,wefoundsevenfeaturesoutof30tobethemostimportant:

• numberofrevisionsofthediscussionpage

• numberofminoreditsofthediscussionpage

• numberofuniqueeditorsofthediscussionpage • numberofrevisionsofthearticle

• numberofuniqueeditorsofthearticle

• numberofrevisionsofthediscussionpagebyanonymouseditors • numberofrevisionsofthearticlebyanonymouseditors

Basedontheabove,wetrainedaclassiﬁertodistinguishbetweencontroversialandnon-controversialarticles.

5.4. Performance of classifiers

Thebaselineclassiﬁerweusedtoevaluatetheperformanceofourproposedclassiﬁersisdescribedin Section5.3.1above. Weusedtherandomforest(Dori-Hacohen&Allan,2015)implementationinRformachinelearning.Duetothestochasticity ofthisalgorithm,weran thelearningprocess foreach model50timeswithdifferentseedvaluesandthenwe calculated theaverageoftheconfusionmatrix.

5.5. Detecting controversial articles

Table 1summarizes the performance of the algorithms we describe in this section. The algorithm proposed by

Kitturetal.(2007)yieldsthebestresultsbutitsoveralladvantageoverourAFTalgorithmisnegligible(slightlymorethan 1%fortheF-measure(accuracy)and0.01forthereceiveroperatingcharacteristic-areaunderthecurve(ROC-AUC)).TheEP methodperformssubstantiallyworse(F-measureof69.05%).TheCombinedmethod(Kittur,AFTandsentiment)worksbest forourdatasetandreached85.11%precisionfornon-controversialarticles.Thus wecanclaimthattheAFT-basedmethod, whichfollowsourformalmodelofcontroversyperformsapproximatelyasgoodasKittur’sbaselinemethod.

Formany applications,the misclassiﬁcation cost is not symmetric, i.e., identifying a truly non-controversialarticle as controversialislesscostlythan doingthe opposite. Fig. 5showstheROCforall thetestedalgorithms. The AUCpeaksat 0.91forthealgorithmthatcombinesthemethodsdescribedinthispaper,buttheyarealsoquitehighfortheKitturandAFT algorithms(0.88and0.89,respectively).Again,thedifferencebetweentheKitturandAFTalgorithmsisveryslight,butwe canobserveaninterestingdifferenceintheshapesoftheROCcurvesforthesealgorithms.AFTworksbetterforlowerfalse positiveratiosandKittursurpassesAFTathigherfalsepositiveratios.Thismayexplainwhycombiningthesetwomethods improvestheperformance.Ingeneral, thesteep riseintheROCfunctionsforall thealgorithms (exceptemotionpolarity) indicates that by acceptingslightly morefalse positives, for whichmisclassiﬁcation is not costly, almost all controversial articlescanbedetected.

5.6. Classification performance based on emotion polarity

(12)

Fig. 5. ROCofclassiﬁersinarticledataset.

Fig. 6. ROCofsectionsclassiﬁer.

(13)

Fig. 7. F-measurebyratioofcontroversialsections.

Tousethesectionsclassifier forfurtherclassification,wechose acut-off valueforwhichits recallinthecontroversial class was79% andits F-measure was68.6%. Fig. 7 showsclassification performancesbased onthe ratio ofpre-classified sections.Wecanseethatweobtainedthebestcontroversyclassificationforarticleshavingaratioofcontroversialsections of26%ormore.

WecanalsousetheclassifierbasedontheEPofsectionsforclassifyingthecontroversyofentirepages.However,ifwe compareperformancedatafrom Fig.7withthosefrom Table1,we canseethatthe directuseofsentimentmeasures on entiretalkpagesis moreefficientdespite thedilutioneffectdescribedabove.We can usethisclassifierto locate contro-versialtopics inthearticle’s talkpage.Bydoing so,we can better determinethe“essence” of theparticular controversy, becausemanycontroversialarticlesconsistlargelyofconsensustext,whichhasbeenstrippedofcontroversy.UsingourEP classifier,wecanidentifyandlocateotherwiseinvisiblecontroversyinthetalkpage.

ThenoveltyofourapproachistheincreasedgranularityofcontroversydetectioninWikipedia,suchastheonewithan AUCashighas0.80andwithoutanyavailablemetadata.Wecannowdetectcontroversialissuesinarticlesatthesections levelofthediscussionpage.

6. ControversialcategoriesinWikipedia

HavingdescribedourapproachfordetectingcontroversyinWikipedia,inthissectionwe explainhowtodetect contro-versialWikipediacategories.

TheWikipediacategorysystemisadirectedcyclicgraph,inwhicharticlesandcategoriesarenodesanddirectededges indicatetheparent-childrelationship.Everycategoryorarticlecanbeachildofmanyothercategoriesandcanhavemany subcategories.ThemainWikipediacategoryis Content ,whichhassevensubcategories.Currentlythereareover1.5 M sub-categoriesin Wikipedia.A categorymay consist ofboth articles andsubordinate categories.Leaf categories inthe graph consistofonlyarticles.

Whilewehaveagoodmodelfordeterminingifasinglearticleiscontroversial,we needtoaggregate microscoresand determine thecontroversy of each node (i.e., category)in the Wikipedia category graph,which will make it possible to detectnewandpotentiallycontroversialarticles.Foreachcategory,wetakeallitsdirectlysubordinatearticlesandcalculate ameanpredictioncontroversyscore.Thiswebcontentcontroversydetectionsystemwaspresentedby Jankowski-Lorekand Zieli´nski(2015).

6.1. Dataset

Currently,theEnglishWikipediaconsistsofroughly5 M articles.Wecomputedalltherequiredfeatures andusedthe modeldescribedintheprevioussectiontodeterminethecontroversyscoreandthisscore’sconﬁdencelevelforeacharticle. Forfurtherresearch,thisdatasetispubliclyavailableandcanbedownloadedfromtheﬁgshare.comwebsite.1

(14)

Fig. 8. Histogramofratioofcontroversialarticlesincategories.

Only0.5%(23,103)ofarticlesareclassifiedascontroversial.Weconsiderthistobeaplausibleresult,astheofficiallistof controversialWikipediaarticlescontainsjust963entriesandinthediscussion pagesweidentified 2153entriesusingthe controversialarticletemplate.Thislistisnotoftenupdatedanddefinitivelydoesnotcontainallthecontroversialtopicsand articles.Foreacharticle,weretrievedallthecategoriestowhichitisassigned.Asdescribedabove,anarticlecanbelongto severalcategoriesanddifferentlevelsoftheWikipediacategorygraph.Mostarticleshavefewerthan20assignedcategories butsome havemorethan 200.Thereare 354,940categorieswithonly oneortwo pagesassigned.We willdiscardthose categoriesinfuturestudiesasitisnotmeaningfultoaggregatesuchasmallnumberofarticlespercategory.

6.2. Controversy of Wikipedia categories

Fig.8showsthenumberofcategorieswithagivenpercentageofcontroversialarticles.Thetrendindistributionisnearly exponential forratios smaller than 0.5. Above 0.5, i.e., where the number of controversial articles is greater than non-controversial ones, the distributionis irregular witha notable peak at 15categories witha ratio equalto 1.0. This peak correspondstocategoriescontainingonlycontroversialarticles.Only346categorieshave50%ormorecontroversialarticles. Itisreasonabletotreatthesecategoriesascontroversial.

Fig.9showsahistogramofmeanconfidencelevelofcontroversyinWikipediacategories.Forcategorieswiththemean confidencelevellowerthan0.5thetrendofthedistributionofthehistogramisnearlylogarithmic.Thistrendismoresteady thanin Fig.8.Thereare514categoriesthathavethemeanconfidencelevelofcontroversyof0.5orhigher.Itconfirmsthat thevastmajorityofWikipediacontentisnon-controversialandtheprobabilityoffindingarandomcontroversialwikipedia categoryislow.Thismeansthatingeneral,Wikipediashouldbeconsideredasnotcontroversial.

Therightsidesofbothhistogramsin Figs.8and 9(forvaluesonx-axisbetween0.5and1.0)looklessregularinterms ofthedistributiontrend.Thismaybeattributabletoamuchlowertotalnumberofcategoriesintheseintervals,preventing anycleartrendfromappearing.

(15)

Fig. 9. Histogramofmeanconﬁdencelevelofcontroversyincategories.

Table 2

Top20controversialcategories.

Rank Category MeanConﬁdence #ofArticles

1 G8nations 0.9978125 8

2 G7nations 0.9975 7

3 G20nations 0.952368421 19

4 Hindustani-speakingcountriesandterritories 0.935833333 3 5 NUTS1statisticalregionsoftheUnitedKingdom 0.905 3 6 PeoplebannedfromenteringChina 0.888333333 3 7 Federalconstitutionalrepublics 0.8875 9 8 Slaviccountriesandterritories 0.883076923 13 9 PeopleoftheAmericanEnlightenment 0.87375 6 10 NearEasterncountries 0.873333333 9 11 WarsinvolvingQatar 0.861666667 3 12 MemberstatesoftheUnionfortheMediterranean 0.860064103 39 13 MemberstatesofNATO 0.855535714 28 14 MemberstatesoftheSouthAsianAssociationforRegionalCooperation 0.85375 8 15 MiddleEasterncountries 0.842916667 18

16 Mormonism 0.841666667 3

17 NortheastAsiancountries 0.8325 7 18 MemberstatesoftheCouncilofEurope 0.828928571 14 19 Democratic-RepublicanPartyPresidentsoftheUnitedStates 0.828125 4 20 WesternAsiancountries 0.827368421 19

6.2.1. Empirical validation

Basedonthereal-worldunderstandingthat sometopicsare generallyknowntobe controversial,e.g., politics,religion, racism,tonameafew,wemanuallytestedthelistsofcontroversialandnon-controversialcategories.

(16)

Fig. 10. Percentageofcorrectlyclassiﬁedcontroversialandnon-controversialarticlesincorrespondencewiththechosenthresholdvalue.

controversialcategoriesbased onmean confidence,we verified82% asbeingcontroversial.Out of100 randomlyselected non-controversialcategories,only5%were infactcontroversialtopics(falsenegatives).Thisvalidationresultconfirmsthat wecanusethemeanconfidencelevelofthecontroversyofarticlestofindcontroversialcategories.

6.2.2. Cross validation

Forthe second validation,we randomly split ourdatasetinto trainingandtest subsetsat aratio of0.7. We usedthe trainingsubsettocalculatethecontroversylevelofcategoriesbasedontheratioofcontroversialtonon-controversialarticles andthemeancontroversy confidencelevel.Then weused thesecategoriesto determinewhicharticles inthetest subset werecontroversial,basedontheaveragelevelofconfidenceofallthecategoriestowhichtheywereassigned.Inthetesting dataset,therewere1,399,303articles,6926ofwhichwereoriginallyclassifiedascontroversial.

Fig.10showsthepercentageofcorrectlyclassiﬁedcontroversialandnon-controversialarticlesinrelationtothechosen thresholdvalue. An articlewas assignedasbeingcontroversialif themean conﬁdencecontroversy levelof itscategories exceededthisthreshold.

Wecanseethatthepercentageofcorrectlyclassifiedcontroversialarticlesrapidlyincreasesuponloweringthethreshold valuetobelow0.2andforvaluesupto0.06,whereasthepercentageofcorrectlyclassifiednon-controversialarticlesholds steadyuntil0.12andthenbeginstoslightlydecreaseandthendropsrapidlyafterreachingthe0.05thresholdvalue.Based ontheseresults,wechose0.06asthebestthresholdvalue,withwhichwe candetect92% ofallcontroversialarticlesand produceonly10.3% offalsepositives. Theresultsofoursecond validationalsoconfirmedthatwe canusethemeanlevel ofconfidence inthe controversyofarticles toidentify controversialcategories,although inorderto correctlydetect new controversialarticleswemustchooseanappropriatethresholdvalue.

7. ControversyofWebqueries

(17)

theﬁrstresultpage(Pass,Chowdhury,&Torgeson,2006).Incaseofcontroversy,anearlywarningshouldbeissuedbefore anylinkisfollowed,becausemanyusersneverreturntotheSERPoncetheyhavefollowedaresultlink.

User’s search intent isreﬂectedinthe search query .Themappingofthesearchintentintothesearchqueryisoutsideof thiswork.Inthisresearch,wefocusonthesearchqueryanditsrelationtoknowncontroversies.

A search query will be referred to as controversial when it is related to a topic that is known to be controversial .In our approach,werepresentthistopicbyaWikipediacategory.Asdescribedin Section6,wehaveassignedacontroversyscore toeverycategory,basedonthecalculationofthemeanscoreofthecontroversyclassiﬁer(see Section5.3.1)forallarticles belongingtoa givencategory. Weshallnowcomputeasimilarityfunctionofaqueryandacategory, anduseathreshold todeterminewhetherthequeryisrelatedtothecategory.

7.1. Queries dataset

Today’ssearch-engineproviders consider adatabaseofsearch terms tobe a veryvaluable andprivacy-sensitiveasset. Probablyforthesereasons,theonlylargerscaledatasetofsearchenginequerylogsfreelyavailableforresearchpurposesis theAOLdatasetcollectedduring2006andextensivelydescribedby Passetal.(2006). This collection consists of 21 million web queries, 10M of which are unique .Thequerylogswerecollectedfrom 650kUS-basedusersoverathree-monthperiod. Thisdataset has served as basefor manyresearchers since its publishing in 2006 to recentyears (e.g., Shokouhi, 2013; Whiting& Jose,2014). However, ourresearch doesnot particularlydepend onthisspeciﬁc dataset.In fact,we could use otherquerylogswithourmethodonequalterms,i.e.,asasourceofrealusers’queries.

7.2. Similarity function

Next,we needafunctionforcalculatingthesimilaritybetweenaqueryandaWikipediacategory,whichwe deﬁneas describedbelow.Firstweconcatenatethetextsofall articlesbelongingtothecategoryandﬁndkeywordsusingthetf-idf vectorsin thecorpus of all Wikipediaarticles. Then, we matchthe top-N keywords against the words inthe queryand calculatetheaggregatedtf-idfvaluesfor M matchingwordsand N keywords,respectively.

QSim

(

q,c

)

=

M i=1Qi

N i=1Ci

(3)

Q _i...tf-idfvalueof i-th matchingwordofquery q C i...tf-idfvalueoftop-i-th keywordofcategory c M ≤N

Q _Simtakesvaluesfromtherange[0,1].Avalueof0meansthereisnosimilarity,while1indicates perfect similarity.A scoreof1canbeobtainedonlyifall N keywordsinthecategoryarepresentinthequery.Becausemostqueriesareshorter thanfour words,fourisa good value for N .Higher numbers for N arealso good,buttheFormula (3)resultwillyield a smallernumber.Sinceweusethisnumberforranking,theexactvalueof N doesnotreallymatter.

Whenauserentersa queryinnaturallanguage,thenumberofmeaningfulwordsinthequerycan easilyexceedfour. Insuchcases,morethan N keywords shouldbeusedinthesimilarityformula. Ifthematchofthetop N keywords isnot

perfect ,thenwe canextendtheoutlooktoadditionaltop-L keywords.Ifthenumberofadditionalmatchingkeywords is K

thentheirtf-idfvaluesaresummedto R :

R= K

i=1

Qi (4)

Then

QSim

(

q,c

)

=

M i=1Qi+R

N i=1Ci+R

(5)

InFormula (5),keywords1..N are mandatory ,i.e., theabsenceofa mandatory keywordinthequerydecreasesthe sim-ilarityscore,and keywords N +1..L are optional ,i.e., the presenceof an optional keyword inthe queryincreasesthe score obtainedbyusingonlythe mandatory keywords.Therangeofthefunction Q Simis[0,1].

7.3. Early warning algorithm

Havingdeﬁnedthe Q Sim functionin Section7.2,wecannowmeasurethe similarityofanysearchtermtoanyconcept

representedbyaWikipediacategory,e.g.,foreveryquerywecancalculateitssimilarityscoresforallcategories.However,as wealreadyknowthecontroversyscoreforeach Wikipediacategory,andwearelookingonlyforcontroversy,forpractical reasonswe limit this taskto categorieswith acontroversy score exceeding some arbitrarythreshold value. Wedeﬁne a controversysub-scoreonthelevelquery-categorybythefollowingformula:

(18)

Fig. 11. QContscorefortop1000controversialqueries.

Foreveryquery,wecanhaveasmany Sub Contscoresaswehavecategoriesexceedingthethresholdvalue.Again,because

we are looking only forcontroversy, we consider only those pairs (query, category) for which their Q Sim exceeds some

arbitrarythreshold.Asa result,we canhave0to j Sub Cont scores foreach query.Then, wecan introducethecontroversy

scoreofaqueryasfollows:

QCont

(

q

)

=

(

1 −

α

)

·max

(

SubCont

(

q,ci

))

+

α

· j

i=1

(

SubCont

(

q,ci

))

. (7)

Bymanipulatingthevalueofparameter

α

between0.0and1.0,wecanbalancetheimpactofthehighestmatchagainstthe aggregatedvalueofallmatchesforagivenquery.

Wecalculated Q Cont scorefortop1000controversialqueriesandranked thembythedecreasingscores,i.e.,thehighest

score ﬁrst. Fig. 11showsthe distribution of Q Cont scoreby the decreasing rankforthe top 1000controversial queries.A

rapidfallof Q Cont scorecanbeobservedonthelogarithmicy-axiswiththedecreasingrankonx-axis.Thisindicatesthata

binaryclassiﬁcationbyacut-off valueof Q Cont mayperformquitewell.

7.4. Empirical validation

Wehavemanuallyvalidatedthetop-ranked resultsofthe Early Warning Algorithm fortheexistenceofcontroversy.The validation wasperformedby ﬁveannotators who were askedtoevaluate controversyof topicsthat they associated with the query,using their ownknowledge regarding thesetopics.Annotators wereasked tolook atthetopics fromthe per-spectiveoftheir communities(annotatorswerePolish,butﬂuentintheEnglishlanguage andfamiliarwithcontemporary globalculture).Ifannotatorshadnoknowledgeregardingtopicsassociatedwiththequery,theywereallowedtouseGoogle searches.

Annotators assigneda binaryevaluation ofquery controversy. Ifcontroversy wasdiscovered, they were also askedto brieflyjustifytheirevaluationbydescribingthetopicthatwasrelatedtothequeryandcausedlotsofcontroversy.Thefinal decisionregardingaquery’sassociationwithcontroversialtopicswasmadebasedonmajoritydecisionoffiveannotators.

Table3 liststhequeries andtherelatedtopicsthatwere thereasonfortheevaluationofthequeryascontroversial. Sig-niﬁcantly,withinthetop25highest-rankedqueriesusingtheproposedcontroversyranking,theannotatorsfoundonlytwo falsepositives(truepositiverateof92%).Allremainingquerieswereevaluatedascontroversial.

Ontheotherendoftherankedresultlistthereare queriesthatareleastlikelytorefertocontroversy.Within25such queriesshownin Table4onlythreefalsenegativeswerefound(truenegativerateof88%).Someoftheexamplesofthetop 25queries relatedtocontroversialtopicsclearlyillustratethe potentialofan “earlywarning” about controversy. Consider the query“selena” (position 8inthe controversyranking). A superﬁcial Google search (lookingjustatthetop 10 search results) allows theuser to ﬁnd informationabout the pop-starSelena Gomez.2 _However, _without _digging_deeper _in _the

(19)

Table 3

Topcontroversialqueriesfoundindataset.

Query Controversy Notes

1 puertorico YES migrationrelatedissuesinUSA 2 continentalairlines YES pastairplanecrashes

3 harrypotter YES occultandsatanicsubtextsinnovelforchildren 4 vietnamwar YES strongUS-Americanoppositionagainstthewar

5 abortion YES moralissuebroughtuponpoliticalagendainmanycountries 6 thomasjefferson YES relationshipofUSpresidentwithhismixed-raceslaveSallyHemings 7 ﬁdelcastro YES popularchampionorcrueldictator

8 selena YES disrespectfulphotoinmosque

9 microsoftwindows YES monopolistbusinesspractices,securityissues 10 rushlimbaugh YES anti-immigrationstatements

11 korea YES KoreanhistorytextbookinSouthKorea,int’lcriticismofNorthKoreanpolitics 12 billclinton YES Lewinskyscandal

13 popejohnpaulii YES lackofresponsetoabuseofchildreninchurch 14 startrek NO Notcontroversial.Sexualityinoriginal“StarTrek” series. 15 sierraleone YES Ebolalockdownpolicy

16 osamabinladen YES deathconspiracytheories 17 hillaryclinton YES privateuseofgovernmentalemail 18 mormons YES bookofMormonorigins,polygamy 19 mariogames YES occult,animalcruelty,transgenderism 20 nigger YES racistinsult

21 ameliaearhart YES deathconspiracytheories

22 airlineﬂights NO Notcontroversial.Relationshiptoterrorism. 23 victoriabeckham YES Useofskinnymodelsinfashionshows

24 haileselassie YES dictatorandreligiousleaderoverthrownbyMarxistcoup 25 spaceshuttlechallenger YES CriticismoftheSpaceShuttleprograminUS

Table 4

Exampleleastcontroversialqueriesindataset.

Query Controversy Notes

1 b-box NO babyessentialsmanufacturer 2 boxesandmore NO packagingproductssupplier 3 howtobox NO

4 republicancandidates NO 5 whereispoitiers NO 6 whatismanga NO 7 nixonelection YES 8 donatecomputer NO 9 yankees NO 10 3mcorp NO 11 centercity NO 12 developmentofthebrain NO 13 whattodoinaﬂood NO

14 smokinganditseffects YES Smokingofmarihuana 15 whatisba NO

16 avocadowhatisinit NO

17 starsymbols YES Occult,politicalcontroversies 18 morethancoffee NO cafechain

19 j.a.c.redford NO 20 artandmusic NO 21 concretehowto NO 22 howtopaintfurniture NO 23 messinaitaly NO

24 cityofpeace YES Jerusalem,Baghdad,Auschwitz 25 wildstallions NO

Googlesearch results,orusingthequery“selena mosque”,theuserwouldnot learn abouta controversyaboutSelenain Muslimcountries(referringtoallegedlyinappropriatebehaviorinamosque).

Anotherexampleisthequery“mariogames” (position19inthecontroversyranking).Again,theGoogle top10search resultswillnot revealinformationaboutcontroversy.However, thequery“mariogamescruel” canretrieveaWebpageof the“mariowiki”3 _that_contains_a_list_of_{controversies}_related_to_the_game,_including_concerns _about_animal_cruelty, trans-genderismandtheoccult. Oneofthe falsepositivesinthe top25is thequery“StarTrek”.Interestingly,it seemsthat in the1960s and1970s, whenthe original “StarTrek” series wasrunning,it mayhave generatedsigniﬁcant controversy in

(20)

Fig. 12. TPRofbinaryclassiﬁcationbycut-off value.

theUnitedStatesbecauseofreferencesorsuggestionsofhomosexuality,polyamorousrelationshipsetc.4_It_seems_that_since then thiscontroversyhadbeenreduced, butits tracesstillremain ontheWebandWikipedia.Therelationbetween con-troversyandtimeisaninterestingissuethatweplantoinvestigatemoreinthefuture.Likealmosteverything,controversy tootendstoﬂuctuateandcanoccurovercertainperiodsonly.Theproposedapproachmainlydetectscontroversiesexisting currentlyorintherecentpast.

Someofthefalsenegativesinthebottom25ofthequerylistdemonstratedifficultiesofautomaticcontroversy evalua-tion.Forexample,thequery“starsymbols” wasclassiﬁedasnon-controversial.However,starsymbolsincludethe“Starof David” (symbolofIsraelwhichsometimesappearsinanti-semiticcontroversies)andthepentagram(usedinoccult).Such culturalknowledgeisdifficulttoteachtoan algorithm.Anotherfalsenegative,thequery“smokinganditseffects” can re-fertotobacco(in thiscontext,itmaynolongerbecontroversial),butitcanalsorefertomarijuana.Thealgorithmcannot simply guessthis association,andthe wording ofthe querywas apparently understoodby the algorithm asreferring to smokingoftobacco.

Havingranked the top 1000 results(see Fig.11) we now need a cut-off value forthe binaryclassiﬁcation: if Q Cont>

cut-off,thenthequeryisclassifiedcontroversial,otherwiseit isclassifiednot controversial.Inordertochoosethecut-off value wefirsthadtoestimatetheperformance ofsuchaclassification,forwhichwehadtousehumanannotationagain. Wedividedthetop250rankedqueriesin10bins,25querieseach.Eachbinspreadsoveradecreasingrangeof Q Cont.The

controversyforeach querywasmanually annotatedinbins 1,2,5,6,7,8 and10.Then wecalculated TruePositiveRate (TPR)foreachoftheannotatedbinsandpresenteditgraphicallyin Fig.12.Thisplotletsusestimatethatwithanexample cut-off valueof Q Contscoreof25weobtainTPRof0.7.Thepositionofthesameexamplecut-off pointisalsoshownonthe

rankingplot,see Fig.11.

7.5. Robustness to manipulation

One concernabouttheproposed approachrelatesto adversarialmanipulationof controversycomputation.It could be possiblethatsomeentitiestrytoimpacttheresultstoeithermakecertainWikipediaarticlesseemlesscontroversialor,in contrast,to“render” thenon-controversialonessuchthat theybecome controversial.However, thiswouldrequire manip-ulatingdataonseverallevels.Inparticular,notonlytheperceptionsoftrustworthinessgivenbyWikipediavisitorsshould be impactedbutalsothe sentimentlevels ofWikipediatalk pages,whichrecord communicationbetweenthe editors, as well asthe numbers ofedits,revisions, unique editors, etc. Thisis ratherdifficult, especially, consideringthat Wikipedia categoriesmay containlarge numberof individual pages.To summarize, theuseof severalindependent variablesby the controversyclassiﬁers,aswell astheaggregationofarticlecontroversyscoresintocategoriesmakes ourmethodquite ro-busttoadversarialmanipulation.

8. Conclusionsandfuturework

Detectingcontroversyisachallengethatreachesfarbeyondnaturallanguageprocessingalgorithmsandrelatedresearch. It requires a holistic approachand differenttools drawn fromWeb science andsocial informatics. The resultspresented heredemonstrate that ourformal model,introduced in Section 3, correctlyidentiﬁes potential problems associated with predictingcontroversy. The proposed formal model ofcontroversy might also be usefulas a basis forfurther study and introducesacommondenominatorforusebyverydifferentresearchperspectives.

Wikipediaarticlesprovideagoodstartingpointforlearningaboutcontroversyduetothevastamountofmeta informa-tionthat canbeusedforpredictingandexplainingsourcesofdisagreement.Asshownin Table1in Section5,combining

(21)

ourmethodwithpreviously proposedapproachesbooststheF-measure to84.10%andconstitutesa goodbasisforfurther research.WeusedthepredictedcontroversyofWikipediaarticlestocalculatethelistofthemostcontroversialcategories. Thislistmightbeusefulfortheselectionoftopicsformorecarefulmonitoring.ThisisasubjectimportanttotheWikipedia community,becauseWikipediaaimstorealizeaneutralpointofviewandexcludesexcessivecontroversialviews.

In Section7,weshowedthatevaluatingcontroversyofWikipediacontentmayalsobeusedforidentifyingcontroversial queriesinsearchengines.Wetestedourproposedalgorithmonrealqueriessubmittedbypeople.Ourmanuallyconducted validationshowsthatitworksquite wellformostcontroversialquerieswhile notraisingtoomanyfalse alarms.We em-phasizethat aquerywithacontroversywarningshouldnotbeconsidered asnegativeorharmful,butratherthewarning shouldbeseenasausefulrecommendationforuserstobecarefulwhenstudyinganysearchresults.Moreover,becauseof theheavypersonalization andtheﬁlterbubbleeffecton searchresults,aparticularusermightonlyretrieveone pointof view,soacontroversywarningindicatesthatsomeimportantinformationmaybelackinginthesearchresults.

Controversiesareinevitable.Therefore,furtherresearchoncontroversyontheInternetisneededtobetterunderstandits dynamics.Newalgorithmsfordetectingandmitigatingcontroversieswilldeﬁnitelybedevelopedintheforeseeablefuture. Inourfuturework,we plantoverifywhethertheuseofcontentandmetainformationfromQ&Awebsitesmighthelpto identifycontroversywithrespectto rapidlychangingtopicsortopicsthatare beyondthecoverage ofWikipedia.Wealso plantoincludetimeandcommunitydynamicsinourformalmodelofcontroversy.

Theremainderofthissectiondiscusses thelimitationsofourstudy,possiblesocialramiﬁcations ofalertingWeb users toquerycontroversy,andfuturework.

8.1. Study limitations

Ourstudyfollowsanempiricalresearchmethodology.Inthissection,we describethelimitationsofourresearch meth-ods,andpossiblewaysofaddressingthem.

8.1.1. AOL Query Log

AOL Query Log was published in 2006, hence, it might seem outdated. We have utilized AOL Query Log due to its availabilityandrelativelylargesize.Tothebestofourknowledge,nootherpubliclyavailabledatasetofWebsearchqueries hasa similar size (about 10 million unique queries).While the set of queries is over 10 years old andthe Web search behaviorofusersmayhavechangedsincethen,ourestimations ofcontroversyoftopicsrelatedtothe querieshavebeen performedonthecurrentversionsofWikipediaarticles. Similarly,the evaluationdescribedin Section7.4wasperformed undercurrentcircumstances (i.e.,asifthequerieswereissuedcurrently).Hence,thereisnoproblemoftimedifferenceor thelackofsynchronizationbetweentheknowledgebaseusedforcomputingtheresultsandthetimeofevaluation.

8.1.2. Web users vs. Wikipedia users

OurresearchhasmadetheassumptionthatcontroversyevaluationbasedonWikipediawillbeusefulforallWebusers. Thisisthesameasassumingthat thecommunityofWikipediauserswillprovideausefulcontroversyevaluationforthe communityofallWebusers.Evenifwe restrictthesecommunities bylanguage(English-languageWikipediaand English-languageWeb),theremaybeWebuserswhodonotuseWikipedia.However,accordingtoarecentstudy,5_despite_a_drop_in GooglesearchrankingsofWikipediapages,over50%ofsearchquerieshaveaWikipediapageinthetop10ofGooglesearch results.Wikipediahasabout18billionviewspermonth6_(as_compared_to_about₁₀₀_billion_Google_searches_monthly)._The numberofusersishardertoestimate,butthenumberofuniquedevicesthataccesstheEnglish-languageWikipediadaily isabout60million.7 _These_statistics_show_that _while_the_community_of_Wikipedia_users_is_smaller_than_the_community_of allWebusers(fortheEnglishlanguage),itshouldbesufficientlylargetobetreatedasalargesample(atleastontheorder oftensofmillionsinsize).Nevertheless,littleisknownaboutthecommunityofWebusersthatdoesnotuseWikipedia, andmoreresearchofthissubjectisrequired.

WhatismoresigniﬁcantisthataccordingtoWikipediapolicy,Wikipediaonlyexcludescontentthatisnotencyclopedic. Wikipedia is adding 20,000 new articles each month. It reacts very quickly to current newsworthy events, such as the deathofMichaelJacksonortheFukushima catastrophe.Forthesereasons,itislikelythat controversiesthatoccur inthe communityofWebuserswilleventually(rathersoonerthanlater)bereﬂectedintheWikipedia.

8.1.3. Cultural biases

Naturally,culture andbackground ofa person impacts theway inwhich sheor hejudgesandperceives controversy. ConformistsocietiessuchasJapanesesocietymayhavedifferentattitudeandpropensitythanindividualisticsocietiessuch astheUSAtowards evaluating andmeasuringcontroversy. Theproposed method canbe adapted todifferentcultures by utilizingdifferentlanguage versionsof Wikipedia(e.g.,Japanese Wikipedia). Webelieve that thiscould partially mitigate theproblemofculturalimpactandbias.

5 _{https://www.stonetemple.com/google-still-loves-wikipedia-more-than-its-own-properties/}_{, September 23, 2015.} 6_{http://www.pewresearch.org/fact-tank/2016/01/14/wikipedia-at-15/}_.