• 検索結果がありません。

情報検索技術に基づくブロッククローン検出

N/A
N/A
Protected

Academic year: 2021

シェア "情報検索技術に基づくブロッククローン検出"

Copied!
8
0
0

読み込み中.... (全文を見る)

全文

(1)Vol.2017-SE-196 No.19 2017/7/20. ৘ใॲཧֶձ‫ڀݚ‬ใࠂ IPSJ SIG Technical Report. ৘ใ‫ٕࡧݕ‬ज़ʹ‫ͮ͘ج‬ϒϩοΫΫϩʔϯ‫ݕ‬ग़ ԣҪ Ұً1,a). ቐ Ըಬ2. ٢ా ଇ༟3. Ҫ্ ࠀ࿠1. ֓ཁɿιϑτ΢ΣΞอकʹ͓͚Δ໰୊ͷ̍ͭͱͯ͠ίʔυΫϩʔϯʢιʔείʔυதʹଘࡏ͢ΔಉҰ·ͨ ͸ྨࣅͨ͠෦෼Λ࣋ͭίʔυยʣ͕ࢦఠ͞Ε͍ͯΔɽ‫ط‬ଘͷ‫͍͓ͯʹڀݚ‬ɼ৘ใ‫ٕࡧݕ‬ज़ʹ‫਺͍ؔͯͮج‬ ୯ҐͰίʔυΫϩʔϯΛ‫ݕ‬ग़͢Δख๏͕ఏҊ͞Ε͕ͨɼ‫ݕ‬ग़ཻ౓͕େ͖͍ͨΊ‫ݕ‬ग़࿙ΕΛ‫͏͍ͱ͢͜ى‬໰ ୊఺͕͋Δɽͦ͜Ͱຊ‫Ͱڀݚ‬͸ɼ৘ใ‫ٕࡧݕ‬ज़Λར༻ͨ͠ϒϩοΫΫϩʔϯʢίʔυϒϩοΫ୯Ґͷίʔ υΫϩʔϯʣͷ‫ݕ‬ग़ख๏ΛఏҊ͢Δɽ͜ΕʹΑΓɼैདྷͷؔ਺୯ҐͷίʔυΫϩʔϯʹՃ͑ͯΑΓཻ౓ͷ খ͍͞ίʔυΫϩʔϯ͕‫ݕ‬ग़Ͱ͖Δͱߟ͑ΒΕΔɽຊख๏Ͱ͸ɼιʔείʔυதͷࣝผࢠ΍༧໿‫ʹޠ‬ར༻ ͞ΕΔ୯‫ʹޠ‬ରͯ͠ॏΈ෇͚Λߦ͍ɼίʔυϒϩοΫΛಛ௃ϕΫτϧʹม‫͢׵‬Δɽͦͯ͠ɼಛ௃ϕΫτϧ ؒͷྨࣅ౓Λ‫ٻ‬ΊɼϒϩοΫΫϩʔϯͷ‫ݕ‬ग़Λߦ͏ɽ·ͨධՁ࣮‫Ͱݧ‬͸ɼ‫ط‬ଘͷίʔυΫϩʔϯ‫ݕ‬ग़ख๏ ͱൺֱΛߦ͍ɼߴ଎ʹߴ͍ਫ਼౓Ͱ‫ݕ‬ग़Λߦ͏͜ͱ͕Ͱ͖ͨɽ ΩʔϫʔυɿίʔυΫϩʔϯɼιϑτ΢ΣΞอकɼ৘ใ‫ࡧݕ‬. Block Clone Detection Based on Information Retrieval Techniques Kazuki Yokoi1,a). Eunjong Choi2. Norihiro Yoshida3. 1. ·͕͖͑. Katsuro Inoue1. ຯతʹॲཧ͕ྨࣅͨؔ͠਺୯ҐͷίʔυΫϩʔϯʢؔ਺Ϋ ϩʔϯʣΛ‫ݕ‬ग़͢Δख๏ΛఏҊͨ͠ [13]ɽ৘ใ‫ٕࡧݕ‬ज़ͱ. ιϑτ΢ΣΞͷอकʹ͓͚Δ໰୊ͷͻͱͭͱͯ͠ɼίʔ. ͸ɼେྔͷσʔλ‫͔܈‬Β໨తʹ߹கͨ͠΋ͷΛऔΓग़͢. υΫϩʔϯ͕ࢦఠ͞Ε͍ͯΔ [3]ɽίʔυΫϩʔϯͱ͸ɼ. ͜ͱͰ͋Γɼࣗવ‫͔ॻͰޠݴ‬ΕͨจॻͷॲཧͳͲʹར༻. ιʔείʔυதʹ‫·ؚ‬ΕΔ‫ʹ͍ޓ‬Ұக·ͨ͸ྨࣅͨ͠෦෼. ͞ΕΔ [2]ɽίʔυย୯ҐͰ‫ݕ‬ग़Λߦ͏৔߹ɼߏจͷෆ‫׬‬. Λ࣋ͭίʔυยͷ͜ͱͰ͋ΓɼҰൠతʹɼίʔυΫϩʔϯ. શͳ෦෼Ͱऴྃ͢ΔίʔυยͳͲɼू໿Λߦ͏͜ͱ͕ࠔ. ͷଘࡏ͸ιϑτ΢ΣΞͷอकΛࠔ೉ʹ͢Δͱ‫ݴ‬ΘΕ͍ͯΔɽ. ೉ͳίʔυΫϩʔϯ͕ଟ͘‫ݕ‬ग़͞ΕΔ͜ͱ͕͋Δ [14]ɽҰ. ίʔυΫϩʔϯʹର͢Δ༷ʑͳอक΍؅ཧͷํ๏͕ఏҊ. ํɼؔ਺Ϋϩʔϯ͸ॲཧͷ಺༰͕·ͱ·͍ͬͯΔͨΊɼ։. ͞Ε͍ͯΔ͕ɼιʔείʔυͷ‫ن‬໛͕େ͖͘ͳΔͱιʔε. ൃऀʹͱͬͯू໿ͷର৅ʹͳΓ΍͍͢ίʔυΫϩʔϯΛ‫ݕ‬. ίʔυதʹ‫·ؚ‬ΕΔίʔυΫϩʔϯ΋๲େͳྔͱͳΓɼख. ग़Ͱ͖ΔɽࢁதΒͷख๏Ͱ͸ɼ৘ใ‫ٕࡧݕ‬ज़ͷҰछͰ͋Δ. ࡞‫ͦͰۀ‬ΕΒΛ؅ཧ͢Δ͜ͱ͸ࠔ೉ͱͳΔɽͦ͜Ͱɼίʔ. TF-IDF ๏ [2] Λ༻͍ͯɼιʔείʔυதͷࣝผࢠ΍༧໿‫ޠ‬. υΫϩʔϯΛࣗಈతʹ‫ݕ‬ग़͢Δ͜ͱΛ໨తͱ༷ͨ͠ʑͳ. ʹར༻͞ΕΔ୯‫ʹޠ‬ରͯ͠ॏΈ෇͚Λߦ͏ɽͦͯ͠ɼॏΈ. ίʔυΫϩʔϯ‫ݕ‬ग़ख๏͕ఏҊ͞Ε͍ͯΔ [5], [11]ɽ. ෇͚ʹ‫਺֤͍ؔͯͮج‬Λಛ௃ϕΫτϧʹม‫͠׵‬ɼಛ௃ϕΫ. ࢁதΒ͸৘ใ‫ٕࡧݕ‬ज़ [2] Λར༻͢Δ͜ͱʹΑͬͯɼҙ. τϧؒͷྨࣅ౓Λ‫͢ࢉܭ‬Δ͜ͱʹΑͬͯɼؔ਺Ϋϩʔϯͷ ‫ݕ‬ग़Λߦ͏ɽ·ͨɼۙࣅ࠷ۙ๣୳ࡧΞϧΰϦζϜͷҰछͰ. 1 2 3 a). େࡕେֶ Osaka University ಸྑઌ୺Պֶٕज़େֶӃେֶ Nara Institute of Science and Technology ໊‫ݹ‬԰େֶ Nagoya University [email protected]. c 2017 Information Processing Society of Japan . ͋Δ LSHʢLocality-Sensitive HashingʣΞϧΰϦζϜ [6] Λ༻͍ͯɼಛ௃ϕΫτϧΛΫϥελϦϯά͢Δ͜ͱʹΑΓɼ ‫ݕ‬ग़ͷߴ଎ԽΛߦ͍ͬͯΔɽ ͔͠͠ɼࢁதΒͷख๏Ͱ͸‫ݕ‬ग़ཻ౓͕ؔ਺୯ҐͷͨΊɼ. 1.

(2) Vol.2017-SE-196 No.19 2017/7/20. ৘ใॲཧֶձ‫ڀݚ‬ใࠂ IPSJ SIG Technical Report. ௕͍ؔ਺಺ͷҰ෦ʹίʔυΫϩʔϯ͕‫·ؚ‬Εͨ৔߹ɼ‫ݕ‬ग़ ࿙Ε͕ੜ͡ΔՄೳੑ͕͋Δɽ͜ͷΑ͏ͳ‫ݕ‬ग़࿙ΕΛ‫ݮ‬Β͢ ͨΊʹ͸ɼؔ਺୯ҐΑΓখཻ͍͞౓ͰίʔυΫϩʔϯͷ‫ݕ‬ ग़Λߦ͏΂͖Ͱ͋Δɽͦ͜Ͱɼຊ‫Ͱڀݚ‬͸ίʔυϒϩοΫ ୯ҐͷίʔυΫϩʔϯʢϒϩοΫΫϩʔϯʣΛ‫ݕ‬ग़͢Δख ๏ΛఏҊ͢Δɽ͜͜Ͱ͸ίʔυϒϩοΫΛɼؔ਺ͱɼؔ਺ ಺෦ͷ ifɼforɼwhile จ౳ͷத‫Ͱހׅ‬ғ·Εͨ෦෼ͱఆٛ͢ Δɽຊख๏Ͱ͸ɼ·ͣιʔείʔυʹରͯ͠ߏจղੳΛߦ. λΠϓ 3 λΠϓ 2 ͷҧ͍ʹՃ͑ͯɼจͷૠೖ΍࡟আɼม ߋͳͲ͕ߦΘΕ͍ͯΔίʔυΫϩʔϯɽ λΠϓ 4 ྨࣅͨ͠ॲཧΛ࣮ߦ͢Δ͕ɼߏจ্ͷ࣮૷͕ҟ ͳΔίʔυΫϩʔϯ λΠϓ 4 ͷίʔυΫϩʔϯͱͯ͠ɼҎԼͷ΋ͷ͕‫͛ڍ‬Β ΕΔɽ. • ৚݅෼‫ॲذ‬ཧ΍‫܁‬Γฦ͠ॲཧͳͲͷ੍‫ߏޚ‬଄ͷ࣮૷͕ ҟͳΔɽ. ͍ɼίʔυϒϩοΫͷநग़Λߦ͏ɽͦͷ‫ޙ‬ɼநग़֤ͨ͠ίʔ. • தؒഔհม਺ͷར༻ͷ༗ແ͕ଘࡏ͍ͯ͠Δɽ. υϒϩοΫʹରͯ͠ TF-IDF ๏Λ༻͍ͯಛ௃ϕΫτϧʹม. • จͷฒͼସ͕͑ൃੜ͍ͯ͠Δɽ. ‫͠׵‬ɼಛ௃ϕΫτϧؒͷྨࣅ౓Λ‫͢ࢉܭ‬Δ͜ͱͰϒϩοΫ Ϋϩʔϯͷ‫ݕ‬ग़Λߦ͏ɽ·ͨɼMulti-Probe LSH [9] Λ࢖༻. 2.1 ؔ਺Ϋϩʔϯ‫ݕ‬ग़๏. ͯ͠ಛ௃ϕΫτϧͷΫϥελϦϯάΛߦ͏ɽMulti-Probe. ࢁதΒ͸৘ใ‫ٕࡧݕ‬ज़Λར༻͢Δ͜ͱʹΑͬͯɼҙຯత. LSH ͱ͸ɼैདྷͷ LSH ͷϝϞϦ࢖༻ྔ͕ଟ͍ͱ͍͏໰୊. ʹॲཧ͕ྨࣅͨؔ͠਺ΫϩʔϯΛ‫ݕ‬ग़͢Δख๏ΛఏҊ͠. ఺Λվྑͨ͠ΞϧΰϦζϜͰ͋ΔɽʢҎ߱ɼ୯ʹ“LSH”ͱද. ͨɽίʔυย୯ҐͰίʔυΫϩʔϯͷ‫ݕ‬ग़Λߦ͏৔߹ɼߏ. ‫ͨ͠ه‬৔߹͸ LSH ΞϧΰϦζϜશൠΛࢦ͠ɼLSH Ξϧΰ. จͷෆ‫׬‬શͳ෦෼Ͱऴྃ͢ΔίʔυยͳͲɼू໿Λߦ͏͜. ϦζϜΛ۠ผ͢Δࡍ͸“Multi-Probe LSH”ɼ“ैདྷͷ LSH”. ͱ͕ࠔ೉Ͱ͋ΔίʔυΫϩʔϯ͕ଟ͘‫ݕ‬ग़͞ΕΔ‫ڪ‬Ε͕͋. ͱ͍͏ද‫ه‬Λ༻͍Δʣ͜ͷ͜ͱͰɼΑΓগͳ͍ϝϞϦ࢖༻. Δ [14]ɽҰํɼؔ਺Ϋϩʔϯ͸ॲཧͷ಺༰͕·ͱ·͍ͬͯ. ྔͰߴ଎ͳ‫ݕ‬ग़Λ࣮‫ͨ͠ݱ‬ɽ. ΔͨΊɼ։ൃऀʹͱͬͯू໿ͷର৅ʹͳΓ΍͍͢ίʔυΫ. ධՁ࣮‫Ͱݧ‬͸ɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ [13] ͱ CCFinder [7]. ϩʔϯ͕‫ݕ‬ग़Ͱ͖Δɽ·ͨؔ਺Ϋϩʔϯ‫ݕ‬ग़๏͸ɼλΠϓ. ͷ 2 ͭͱ‫ݕ‬ग़ਫ਼౓ͱ‫ݕ‬ग़࣌ؒͷ‫͔఺؍‬ΒൺֱΛߦͬͨɽ. 1 ͔ΒλΠϓ 4 ·ͰͷίʔυΫϩʔϯΛ‫ݕ‬ग़ՄೳͰ͋Δɽ. CCFinder ͸ਆ୩Β͕։ൃͨ͠ίʔυΫϩʔϯ‫ݕ‬ग़πʔϧ. ͜ͷख๏͸ɼ·ͣɼೖྗ͞Εͨιʔείʔυதͷϫʔυʹ. Ͱ͋Γɼࣈ۟୯ҐͷίʔυΫϩʔϯ‫ݕ‬ग़͕ՄೳͰ͋Δɽ3. ‫਺֤͍ؔͯͮج‬Λಛ௃ϕΫτϧʹม‫͢׵‬Δɽ͜͜Ͱϫʔυ. ͭͷ C ‫ޠݴ‬ͷϓϩδΣΫτʹରͯ͠ద༻ͨ݁͠Ռɼຊख๏. ͱ͸ɼҎԼͷ 2 ͭΛର৅ͱ͢Δɽ. ͕૯߹తʹߴ͍ਫ਼౓ͰΑΓଟ͘ͷίʔυΫϩʔϯΛ‫ݕ‬ग़͢. • ม਺΍ؔ਺ͳͲʹ͚ͭΒΕͨࣝผࢠ໊Λߏ੒͢Δ୯‫ޠ‬. Δ͜ͱ͕Ͱ͖ͨɽ·ͨɼຊख๏ͷ‫ݕ‬ग़ʹ͔͔Δ࣌ؒ͸ 3 ෼. • ৚݅จ΍‫܁‬Γฦ͠จͳͲͷߏจʹར༻͞ΕΔ༧໿‫ޠ‬. ҎԼͱͳΓɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ͱ CCFinder ΑΓ΋ߴ଎. ͦͯ͠ɼಛ௃ϕΫτϧؒͷྨࣅ౓Λ‫ٻ‬ΊΔ͜ͱʹΑͬͯΫ. ʹίʔυΫϩʔϯΛ‫ݕ‬ग़͢Δ͜ͱ͕Ͱ͖ͨɽ·ͨɼ‫ݕ‬ग़࣌. ϩʔϯϖΞͷू߹ΛϦετͱͯ͠ग़ྗ͢Δɽ·ͨɼྨࣅ౓. ؒͱεέʔϥϏϦςΟͷධՁ΋ߦ͍ɼ100MLOC ͷେ‫ن‬໛. ͷ‫ࢉܭ‬ͷ௚લʹ LSH [6] Λར༻͠ɼಛ௃ϕΫτϧͷΫϥε. ϓϩδΣΫτʹରͯ͠ 4 ࣌ؒఔ౓Ͱ‫ݕ‬ग़Ͱ͖Δ͜ͱΛ֬ೝ. λϦϯάΛߦ͏͜ͱʹΑͬͯɼ‫ݕ‬ग़ͷߴ଎ԽΛߦ͍ͬͯΔɽ. ͨ͠ɽ Ҏ߱ɼ2 ষͰ͸ɼίʔυΫϩʔϯͱؔ਺Ϋϩʔϯ‫ݕ‬ग़๏. 2.2 ؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ͷ໰୊఺. ʹ͍ͭͯड़΂Δɽ3 ষͰ͸ɼຊ‫Ͱڀݚ‬ఏҊ͢ΔϒϩοΫΫ. 2.1 અͰ͸ɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ͷ֓ཁʹ͍ͭͯઆ໌͠. ϩʔϯ‫ݕ‬ग़๏ʹ͍ͭͯड़΂Δɽ4 ষͰ͸ɼຊख๏ͷධՁ࣮. ͨɽ͔͠͠ɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ʹରͯ͠ҎԼͷ 2 ͭͷ໰. ‫͍ͯͭʹݧ‬ड़΂Δɽ࠷‫ʹޙ‬ɼ5 ষͰ·ͱΊͱࠓ‫ޙ‬ͷ՝୊ʹ. ୊఺͕‫͛ڍ‬ΒΕΔɽ. ͍ͭͯड़΂Δɽ. 2. ίʔυΫϩʔϯ ຊষͰ͸ɼຊ‫ڀݚ‬ͷഎ‫ͯ͠ͱܠ‬ίʔυΫϩʔϯɼ͓Αͼ ࢁதΒͷؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ͱɼͦͷ໰୊఺ʹ͍ͭͯड़΂. 1 ͭ໨͸ɼؔ਺୯Ґͷ‫ݕ‬ग़ʹΑΔ໰୊఺Ͱ͋Δɽ͜ͷख ๏͸ɼؔ਺શମͰ͸ͳ͘ɼҰ෦ͷΈ͕ίʔυΫϩʔϯʹ ͳ͍ͬͯΔ΋ͷΛ‫ݕ‬ग़͢Δ͜ͱ͕Ͱ͖ͳ͍ɽྫ͑͹ਤ 1 ͷ Α͏ʹɼ௕͍ؔ਺಺ͷҰ෦ʹίʔυΫϩʔϯ͕‫·ؚ‬ΕΔ৔ ߹ɼ‫ݕ‬ग़࿙Ε͕ੜ͡ΔՄೳੑ͕͋Δɽ. ΔɽRoy Β͸ίʔυΫϩʔϯͷఆٛͱͯ͠ɼίʔυΫϩʔ. 2 ͭ໨͸ɼϝϞϦ࢖༻ྔ͕ଟ͍ͱ͍͏໰୊఺Ͱ͋Δɽؔ. ϯؒͷҧ͍ͷ౓߹͍ʹ‫͖ͮج‬ҎԼͷ 4 ͭͷλΠϓʹ෼ྨ͠. ਺Λಛ௃ϕΫτϧʹม‫͢׵‬Δํ๏ͱͯؔ͠਺Ϋϩʔϯ‫ݕ‬ग़. ͍ͯΔ [12]ɽ. ๏͸ TF-IDF ๏Λ༻͍͍ͯΔ͕ɼTF-IDF ๏Ͱ͸શؔ਺Ͱ. λΠϓ 1. ۭന΍λϒͷ༗ແɼίʔσΟϯάελΠϧɼί. ग़‫ͨ͠ݱ‬ϫʔυͷछྨ਺͕࣍‫ͳͱ਺ݩ‬ΔͨΊɼಛ௃ϕΫτ. ϝϯτͷ༗ແͳͲͷҧ͍Λআ͖‫׬‬શʹҰக͢Δίʔυ. ϧͷ࣍‫͕਺ݩ‬ඇৗʹେ͖͘ͳΔ܏޲ʹ͋Δɽಛ௃ϕΫτϧ. Ϋϩʔϯ.. ͸֤ؔ਺ʹ 1 ‫ͭͣݸ‬༩͑ΔͨΊɼ࣍‫਺ݩ‬ͷେ͖͍ಛ௃ϕΫ. λΠϓ 2. λΠϓ 1 ͷҧ͍ʹՃ͑ͯɼม਺໊ͳͲͷϢʔβʔ. ఆ໊ٛɼม਺ͷ‫͕Ͳͳܕ‬ҟͳΔίʔυΫϩʔϯ.. c 2017 Information Processing Society of Japan . τϧΛଟ਺อ࣋͢Δ͜ͱʹͳΓɼϝϞϦ࢖༻ྔ͕େ͖͘ͳ Δɽ·ͨɼߴ଎ʹ‫ݕ‬ग़Λߦ͏ͨΊʹ LSH Λ༻͍ͯΫϥελ. 2.

(3) Vol.2017-SE-196 No.19 2017/7/20. ৘ใॲཧֶձ‫ڀݚ‬ใࠂ IPSJ SIG Technical Report ˆ— –‹‘Ƈ. ˆ— –‹‘Ƈ. 㛗. ‹ˆſˆŽ‰ ƀƇ ƒʰɥŚ „ʰɨŚ ƈ. 厦 㛵. ‹ˆſˆŽ‰ ƀƇ ƒʰɥŚ „ʰɨŚ ƈ. ୍⮴㒊ศ. ᩘ ƈ. ਤ 1. ͦͷͨΊɼಛ௃ϕΫτϧͷΫϥελϦϯάͰ͸ɼLSH [6] ͷ ҰछͰ͋Γɼۭؒ‫ྔࢉܭ‬ͷվྑΛߦͬͨ Multi-Probe LSH. [9] Λద༻ͨ͠ɽ 3.1 ༻‫ޠ‬ͷఆٛ. ƈ. ௕͍ؔ਺಺ͷҰ෦ʹίʔυΫϩʔϯΛ‫ؚ‬Ήྫ. Fig. 1 An example of parts of long function containing code clones. 3.1.1 ίʔυϒϩοΫ ϓϩάϥϛϯά‫͍͓ͯʹޠݴ‬ɼෳ਺ͷ໋ྩจΛҰ‫ׅ‬Γʹ ·ͱΊͨ΋ͷΛίʔυϒϩοΫͱ͍͏ɽଟ͘ͷϓϩάϥϛ ϯά‫Ͱޠݴ‬͸ɼίʔυϒϩοΫΛೖΕࢠߏ଄ʹ͢Δ͜ͱ͕ Ͱ͖ɼม਺ͷείʔϓͱͯ͠ͷҙຯΛ࣋ͭ͜ͱ͕͋Δɽ. ϦϯάΛߦ͍ͬͯΔ͕ɼLSH ͸ਫ਼౓Λ্͛ΔͱϝϞϦ࢖༻ ྔ͕େ͖͘ͳΔಛ௃͕͋ΔɽΑͬͯɼେ‫ن‬໛ϓϩδΣΫτ ʢLinux Kernel ౳ʣʹରͯ͠ίʔυΫϩʔϯͷ‫ݕ‬ग़Λߦͬ ͨ৔߹ɼϝϞϦෆ଍Ͱ‫ݕ‬ग़Λ‫ڪ͍ͳ͖Ͱྃ׬‬Ε͕͋Δɽ ͦ͜Ͱɼ্ͷ 2 ͭͷ໰୊఺Λ౿·͑ͨ৽͍͠ίʔυΫ ϩʔϯ‫ݕ‬ग़๏ͷඞཁੑ͕ߟ͑ΒΕΔɽຊ‫Ͱڀݚ‬͸ؔ਺୯Ґ. ຊख๏Ͱ͸ɼҎԼͷ 2 ͭͷ৚݅ͷ͍ͣΕ͔Λຬͨ͢ίʔ υϒϩοΫΛ‫ݕ‬ग़ର৅ͱ͢Δɽର৅‫ޠݴ‬͸ C ‫ ͱޠݴ‬Java ‫͢ͱޠݴ‬Δɽ ৚݅ 1 ؔ਺ͷ“{ }”Ͱғ·Εͨൣғ ৚݅ 2 ifɼelseɼforɼwhileɼdo-whileɼswitch จͷ“{ }” Ͱғ·Εͨൣғ. ΑΓ‫ݕ‬ग़ཻ౓Λখͨ͘͞͠ɼίʔυϒϩοΫ୯Ґͷίʔυ. ͨͩ͠ɼ‫ݱ͕”} {“ʹޙ‬Εͳ͍୯จͷ໋ྩจ͸ίʔυϒϩο. Ϋϩʔϯ‫ݕ‬ग़๏ΛఏҊͨ͠ɽίʔυϒϩοΫ୯ҐͰ‫ݕ‬ग़Λ. Ϋͱͯ͠ͷవ·Γ͕ͳ͍ͨΊ‫ݕ‬ग़ର৅ͱ͠ͳ͍ɽ·ͨਤ 3. ߦ͏͜ͱͰɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏Ͱ͸‫ݕ‬ग़Ͱ͖ͳ͔ͬͨ. ͷ Block A ʹର͢Δ Block B ΍ Block C ͷΑ͏ʹɼೖΕ. ίʔυΫϩʔϯΛ‫ݕ‬ग़͕ՄೳʹͳΔɽ·ͨɼۭؒ‫ྔࢉܭ‬ͷ. ࢠߏ଄ͷ಺ଆͷίʔυϒϩοΫ΋‫ݕ‬ग़ՄೳͰ͋Γɼ‫ݕ‬ग़ର. গͳ͍ಛ௃ϕΫτϧͷ࣮૷ํ๏΍ LSH ʹมߋ͢Δ͜ͱͰɼ. ৅Λ࠶‫ؼ‬తʹ୳ࡧ͢Δɽ. େ‫ن‬໛ϓϩδΣΫτ΁ͷద༻͕ՄೳͱͳΔɽ. 3.1.2 ϫʔυ. 3. ఏҊ͢Δ‫ݕ‬ग़ख๏ ຊ‫Ͱڀݚ‬͸ɼ2.1 અͰઆ໌ͨ͠ࢁதΒͷؔ਺Ϋϩʔϯ‫ݕ‬ ग़๏Λ‫ʹج‬ɼίʔυϒϩοΫ୯ҐͷΫϩʔϯ‫ݕ‬ग़ʹରԠ͢ ΔΑ͏มߋͨ͠ख๏ʢϒϩοΫΫϩʔϯ‫ݕ‬ग़๏ʣΛఏҊ͢ Δɽຊख๏ͷ֓ཁΛਤ 2 ʹࣔ͢ɽຊख๏͸ओʹҎԼͷ 5 ͭ ͷεςοϓͰ࣮ߦ͞ΕΔɽ. STEP 1 ߏจղੳΛߦ͍ιʔείʔυ͔Βந৅ߏจ໦ Λੜ੒͠ɼੜ੒ͨ͠ந৅ߏจ໦͔ΒίʔυϒϩοΫ ʢ3.1.1 અࢀরʣΛऔΓग़͢ɽ. STEP 2 STEP 1 Ͱநग़֤ͨ͠ίʔυϒϩοΫ͔Βɼϫʔ υʢ3.1.2 અࢀরʣͷநग़Λߦ͏ɽ. ຊख๏Ͱ͸ҎԼͷ৚݅ 3ɼ4 ͷ͍ͣΕ͔Λຬͨ͢΋ͷΛ ϫʔυͱͯ͠ఆٛ͢Δɽ ৚݅ 3 ༧໿‫ޠ‬ ৚݅ 4 ࣝผࢠ໊Λߏ੒͢Δ୯‫ޠ‬ ࣝผࢠ໊͕ෳ਺ͷ୯‫͔ޠ‬Βߏ੒͞ΕΔ৔߹ɼҎԼͷํ๏ Ͱϫʔυ୯Ґʹ෼ׂ͢Δɽ. • ϋΠϑϯ΍ΞϯμʔείΞͳͲͷ۠੾Γ‫ʹ߸ه‬ΑΔ ෼ׂ. • ࣝผࢠ໊தͷେจࣈʹͳ͍ͬͯΔΞϧϑΝϕοτʹΑ Δ෼ׂ ·ͨɼ2 จࣈҎԼͷࣝผࢠ͸ɼͦΕΒΛ·ͱΊͯಉҰͷϝ λϫʔυͱͯ͠ѻ͏ɽྫ͑͹ɼ‫܁‬Γฦ͠จ౳ͰΑ͘ར༻͞. STEP 3 TF-IDF ๏Λར༻͠ɼSTEP 2 Ͱநग़ͨ͠ϫʔ. ΕΔ i ΍ j ͱ͍ͬͨม਺͸ɼҙຯ৘ใ͕ࠐΊΒΕ͍ͯͳ͍. υʹॏΈ෇͚Λߦ͍ɼ֤ίʔυϒϩοΫΛಛ௃ϕΫτ. ม਺ͱͯ͠ѻ͏ͨΊͰ͋Δɽ͞Βʹɼ৚݅෼‫͍༻ʹذ‬ΒΕ. ϧʹม‫͢׵‬Δɽ. Δ if ΍ whileɼ‫܁‬Γฦ͠ʹ༻͍ΒΕΔ for ΍ while ౳ͷ༧໿. STEP 4 LSH Λར༻͠ɼSTEP 3 Ͱ‫ٻ‬Ί֤ͨίʔυϒ ϩοΫʹର͢Δಛ௃ϕΫτϧͷΫϥελϦϯάΛߦ͏ɽ. STEP 5 STEP 4 Ͱ‫ٻ‬ΊͨίʔυϒϩοΫͷ֤Ϋϥελ ͷதͰɼಛ௃ϕΫτϧؒͷྨࣅ౓ͷ‫ࢉܭ‬Λߦ͍ɼϒ ϩοΫΫϩʔϯʢ3.1.3 અࢀরʣΛ‫ݕ‬ग़͢Δɽ ຊख๏ͱؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ͷओͳ૬ҧ఺͸ɼίʔυΫ ϩʔϯͷ‫ݕ‬ग़ཻ౓Ͱ͋Δɽؔ਺Ϋϩʔϯ‫ݕ‬ग़๏Ͱ͸ؔ਺ͷ ‫ݕ‬ग़ͷΈΛߦ͏͕ɼຊख๏Ͱ͸ಛ௃ϕΫτϧͷ‫ํࢉܭ‬๏΋ มߋ͠ɼؔ਺ͱؔ਺಺ͷίʔυϒϩοΫͷ྆ํΛ‫ݕ‬ग़͢Δɽ ͔͠͠ɼຊख๏͸‫ݕ‬ग़ཻ౓Λখ͘͢͞Δ͜ͱͰ‫ݕ‬ग़ର৅ ਺͕૿͑ɼͦΕʹ൐͍‫ݕ‬ग़࣌ؒɼϝϞϦ࢖༻ྔ͕૿େ͢Δɽ. c 2017 Information Processing Society of Japan . ‫ޠ‬΋ϫʔυͱͯ͠ѻ͏ɽͳ͓ɼ֤ϫʔυͷେจࣈͱখจࣈ ʹΑΔ۠ผ͸͚ͭͣɼಉҰͷϫʔυͱͯ͠ѻ͏ɽ. 3.1.3 ϒϩοΫΫϩʔϯ ৚݅ 5 ίʔυϒϩοΫؒͷྨࣅ౓͕ᮢ஋Ҏ্. sim(CB1 , CB2 ) ≥ p. (0 ≤ p ≤ 1). ৚݅ 6 ίʔυϒϩοΫؒʹ‫ڞ‬௨෦෼͕ͳ͍. CB1 ∩ CB2 = φ CB1 ɼCB2 ͦΕͧΕΛਅʹแ‫ؚ͢‬Δ೗ԿͳΔίʔυϒ ϩοΫ΋ϒϩοΫΫϩʔϯϖΞͰͳ͍ͱ͖ɼCB1 ɼCB2 Λ. 3.

(4) Vol.2017-SE-196 No.19 2017/7/20. ৘ใॲཧֶձ‫ڀݚ‬ใࠂ IPSJ SIG Technical Report. ^dWϭ. ^dWϮ. ^dWϯ. ^dWϰ. ^dWϱ. 䝤䝻䝑䜽 . ὁὊἛ ‫ׅ‬ૠ ᶖᶖᶖ ᵑ ᶗᶗᶗ ᵐ Ḡ Ḡ. 䝤䝻䝑䜽  ଵ , ଶ , ଷ , ⊃. 䝤䝻䝑䜽 䝤䝻䝑䜽. 䝤䝻䝑䜽. 䝋䞊䝇䝁䞊䝗. ὁὊἛ ‫ׅ‬ૠ ᶖᶖᶖ ᵒ ᶗᶗᶗ ᵐ Ḡ Ḡ. ᢳ㇟ᵓᩥᮌ. 䝤䝻䝑䜽 䝤䝻䝑䜽 䝤䝻䝑䜽. 䝤䝻䝑䜽 ଵ , ଶ , ଷ , ⊃. 䝽䞊䝗䝸䝇䝖 ≉ᚩ䝧䜽䝖䝹 ਤ 2. 䜽䝷䝇䝍. ᫏˩ࡇ ⇼∓⇩⇕‫⇕ݣ‬∓∞∙ ⇼∓⇩⇕″ ‫ݱ‬ •†‫‧‫‬ ⇼∓⇩⇕‴ ⇼∓⇩⇕‵ •† • ⇼∓⇩⇕‶ ⇼∓⇩⇕‶ ‫ݱ‬ •†‫•‫‬ ⇼∓⇩⇕‷ ⃴ ⃴ ⃴. 䜽䝻䞊䞁䝨䜰䝸䝇䝖. ఏҊख๏ͷ֓ཁ. Fig. 2 An overview of our detection approach. ůŽĐŬ. ˆ— –‹‘Ƈ. ˆ— –‹‘Ƈ. ‹ˆſƀƇ. ‹ˆſƀƇ. ůŽĐŬ. ™Š‹Ž‡ſƀƇ ƒʰɥŚ ƈ. ůŽĐŬ. ƈ. ůŽĐŬ . ™Š‹Ž‡ſƀƇ ƒʰɥŚ ƈ. „ʰɨŚ. ƈ. ƈ. ਤ 3. 3.2 ίʔυϒϩοΫͱϫʔυͷநग़ ຊख๏Ͱ͸ߏจղੳΛߦ͍ɼίʔυϒϩοΫͷநग़Λߦ ůŽĐŬ . ෼͚ΒΕΔɽຊख๏Ͱ͸ɼߏจղੳʹ ANTLR v4*1 Λར༻. „ʰɨŚ. ͨ͠ɽ. ƈ. ೖΕࢠߏ଄ʹ͋Δ. STEP I ιʔείʔυʹରͯ͠ߏจղੳΛߦ͍ɼந৅ߏ. ਤ 4. ‫ڞ‬௨෦෼͕͋Δ. ίʔυϒϩοΫ. ίʔυϒϩοΫϖΞ. Fig. 3 Code blocks in. ͏ɽίʔυϒϩοΫநग़ͷख๏͸ҎԼͷ 6 ͭͷεςοϓʹ. Fig. 4 Code block pair. nested structure. จ໦Λੜ੒͢Δɽ. STEP II ந৅ߏจ໦͔Βؔ਺ʢ3.1.1 અͷ৚݅ 1 Λຬͨ ͢ίʔυϒϩοΫʣͷ෦෼໦ΛऔΓग़͢ɽ. sharing a common part. STEP III STEP II ͰऔΓग़ͨ͠෦෼໦Λ࠷΋֎ଆͷίʔ ˆ— –‹‘Ƈ ‹ˆſƀƇ. 

(5) . ƈ. ਤ 5. STEP IV STEP II ͰऔΓग़ͨ͠෦෼໦͔Βɼίʔυϒ. ‹ˆſƀƇ. 

(6) . ™Š‹Ž‡ſƀƇ ƒʰɥŚ ƈ. ƈ. υϒϩοΫͱͯ͠நग़͢Δɽ. ˆ— –‹‘Ƈ. ™Š‹Ž‡ſƀƇ ƒʰɥŚ ƈ. 

(7) . „ʰɨŚ. ƈ. ϩοΫʢ3.1.1 અͷ৚݅ 2 Λຬͨ͢ίʔυϒϩοΫʣͷ. 

(8) . „ʰɨŚ. ෦෼໦ΛऔΓग़͢ɽ. STEP V STEP IV ͰऔΓग़ͨ͠෦෼໦ΛೖΕࢠؔ܎ʹ ͋ΔίʔυϒϩοΫͱͯ͠நग़͢Δɽ. ƈ. ‫ۃ‬େίʔυϒϩοΫϖΞͱॏෳͨ͠ίʔυϒϩοΫϖΞ. STEP VI Ҏ߱ɼਂ͞༏ઌ୳ࡧͰந৅ߏจ໦͔Βίʔυ ϒϩοΫΛநग़͢Δɽ. Fig. 5 Code block pair overlapped. ίʔυϒϩοΫͷநग़‫ޙ‬ɼ֤ίʔυϒϩοΫ಺ʹ‫·ؚ‬Ε. with a maximam code block pair. Δϫʔυͷநग़Λߦ͏ɽ ‫ۃ‬େϒϩοΫΫϩʔϯͱ‫Ϳݺ‬ɽຊख๏Ͱ͸ɼ‫ۃ‬େϒϩοΫ ΫϩʔϯΛϒϩοΫΫϩʔϯͱఆٛ͢Δɽ. 3.3 ಛ௃ϕΫτϧͷ‫ࢉܭ‬. ৚݅ 6 Ͱࣔͨ͠Α͏ʹɼϒϩοΫΫϩʔϯϖΞ͸ίʔυ. ಛ௃ϕΫτϧͷ‫Ͱࢉܭ‬͸ɼϫʔυʹର͠ TF-IDF ๏ [2]. ϒϩοΫؒʹ‫ڞ‬௨෦෼͕ͳ͍͜ͱ͕৚݅Ͱ͋Δɽίʔυϒ. Λར༻ͯ͠ॏΈΛ‫͠ࢉܭ‬ɼͦͷ஋Λಛ௃ྔͱ֤ͯ͠ίʔυ. ϩοΫؒʹ‫ڞ‬௨෦෼͕ଘࡏ͢Δ৔߹ɼҰํͷίʔυϒϩο. ϒϩοΫΛಛ௃ϕΫτϧʹม‫͢׵‬ΔɽΑͬͯɼ֤ίʔυϒ. Ϋ͕ଞํΛแ‫ؚ͍ͯ͠‬Δ͜ͱΛ͍ࣔͯ͠Δɽྫ͑͹ɼਤ 4. ϩοΫͷಛ௃ϕΫτϧͷ࣍‫਺ݩ‬͸ιʔείʔυதʹଘࡏ͢. ͷίʔυϒϩοΫ A ͱ B ͸‫ڞ‬௨෦෼͕ଘࡏ͠ɼแ‫܎ؚؔ‬. Δશϫʔυͷछྨ਺ͱͳΔɽຊख๏Ͱ͸ɼtf ஋͸ίʔυϒ. ʹ͋ΔͨΊϒϩοΫΫϩʔϯϖΞͰͳ͍ɽ. ϩοΫதͷϫʔυͷग़‫ݱ‬ස౓Λɼidf ஋͸ιʔείʔυத. ·ͨɼ‫ۃ‬େϒϩοΫΫϩʔϯΛϒϩοΫΫϩʔϯͱఆٛ ͢Δͱ͸ɼ‫͑׵͍ݴ‬ΔͱͦΕͧΕೖΕࢠؔ܎ʹ͋Δίʔυ. ͷϫʔυͷ‫ر‬গ͞Λද͍ͯ͠Δɽtf ஋͸ࣜ (1)ɼidf ஋͸ࣜ. (2) Ͱ༩͑ΒΕΔɽ. ϒϩοΫͷྨࣅ౓͕ᮢ஋Ҏ্ͷ৔߹ɼ࠷΋֎ଆͷίʔυϒ. tf i,j = . ϩοΫϖΞΛϒϩοΫΫϩʔϯͱ͢Δͱ͍͏ҙຯͰ͋Δɽ ྫ͑͹ɼਤ 5 ͷίʔυϒϩοΫ A ͱ CɼB ͱ D ͦΕͧΕ. idf i = log. ͷྨࣅ౓͕ᮢ஋Ҏ্ͱͳΔ৔߹ɼ࠷΋֎ଆͷίʔυϒϩο Ϋ A ͱ C ΛϒϩοΫΫϩʔϯͱ͢Δɽ. c 2017 Information Processing Society of Japan . ni,j k∈bj nk,j. *1. |F | |{f : f  wi }|. (1) (2). http://www.antlr.org/. 4.

(9) Vol.2017-SE-196 No.19 2017/7/20. ৘ใॲཧֶձ‫ڀݚ‬ใࠂ IPSJ SIG Technical Report ද 1. ͜͜Ͱ͸ɼni,j ͸ίʔυϒϩοΫ bj ಺ʹ͓͚Δϫʔυ wi  ͷग़‫ݱ‬ճ਺ɼ k∈bj nk,j ͸ίʔυϒϩοΫ bj ʹ͓͚Δશ. ‫ݕ‬ग़ର৅ϓϩδΣΫτ. Table 1 Target projects. ϫʔυͷग़‫ݱ‬ճ਺ͷ࿨ɼ|F | ͸શؔ਺ͷ਺ɼ|{f : f  wi }| ͸ϫʔυ wi ͕ग़‫͢ݱ‬Δؔ਺ͷ਺Λ͍ࣔͯ͠Δɽ ؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ͱൺֱͯ͠ɼtf ஋ͷ‫ٻ‬ΊํΛίʔυ. ϓϩδΣΫτ. όʔδϣϯ. ‫ޠݴ‬. ‫ن‬໛. Apache HTTPD*3. 2.2.14. C/C++. 343 KLOC. PostgreSQL*4. 8.5.1. C/C++. 937 KLOC. Python*5. 2.5.1. C/C++. 435 KLOC. ϒϩοΫ୯Ґʹมߋ͕ͨ͠ɼidf ஋ͷ‫ٻ‬Ίํ͸ίʔυϒϩο Ϋ୯Ґʹมߋͤͣؔ਺୯Ґͷ··Ͱ͋ΔɽͳͥͳΒɼίʔ υϒϩοΫ୯ҐͰ idf ஋Λ‫ٻ‬ΊΔͱϫʔυͷॏΈ෇͚ʹภ. LSH ͷ࣮૷ͱͯ͠ FALCONN [1]*2 ϥΠϒϥϦΛར༻ͨ͠ɽ. Γ͕ੜͯ͡͠·͏͔ΒͰ͋Δɽ͋Δϫʔυ͕͍ͭ͘ͷίʔ υϒϩοΫʹ‫·ؚ‬ΕΔ͔͸ग़‫ݱ‬৔ॴʹΑͬͯҟͳΔɽྫ͑. 3.5 ಛ௃ϕΫτϧͷྨࣅ౓ͷ‫ࢉܭ‬. ͹ɼਤ 4 ͷؔ਺಺ͷม਺ a ͸ϒϩοΫ A ͱ B ͷ 2 ‫ݸ‬ͷίʔ. ຊख๏Ͱ͸ίαΠϯྨࣅ౓Λ༻͍ͯΫϩʔϯϖΞͷ൑ఆ. υϒϩοΫʹ‫·ؚ‬ΕΔ͕ɼม਺ b ͸ϒϩοΫ A ͷΈʹ͔͠ ‫·ؚ‬Εͳ͍ɽͦͷͨΊɼιʔείʔυதͷग़‫ݱ‬ճ਺͸ಉ͡. Λߦ͏ɽίαΠϯྨࣅ౓͸ଟ࣍‫ݩ‬ϕΫτϧͷྨࣅ౓Λද͢ ई౓Ͱ͋Γɼ࣍‫ ͕ݩ‬V Ͱ͋Δ 2 ͭͷಛ௃ϕΫτϧ a, b ؒͷ. ʹ΋ؔΘΒͣɼग़‫͢ݱ‬ΔίʔυϒϩοΫͷ਺͕ҟͳͬͯ͠. ྨࣅ౓͸ҎԼͷࣜ (3) Ͱ༩͑ΒΕΔɽ. |V | i=1 ai bi  sim(a, b) = cos(a, b) =  |V | 2 |V | 2 a i=1 i i=1 bi. ·͏ɽidf ஋͸ϫʔυͷ‫ر‬গੑʹ‫͍ͯͮج‬ॏΈ෇͚Λߦͬ ͍ͯΔͨΊɼภΓΛແͨ͘͢Ίʹؔ਺୯ҐͰ‫ٻ‬Ίͨɽ. 3.4 ಛ௃ϕΫτϧͷΫϥελϦϯά. (3). TF-IDF ๏ͷ‫ࣜࢉܭ‬ΑΓɼಛ௃ྔ͸ৗʹਖ਼ͷ஋ΛऔΔͨΊɼ. ಛ௃ϕΫτϧͷΫϥελϦϯάͱͯ͠ɼؔ਺Ϋϩʔϯ‫ݕ‬. ίαΠϯྨࣅ౓͸ 0 ͔Β 1 ͷൣғͱͳΔɽίαΠϯྨࣅ. ग़๏ͱಉ༷ʹ LSH Λ༻͍ͨɽΫϥελϦϯάΛߦ͏͜ͱ. ౓͕ᮢ஋Ҏ্Ͱ͋Ε͹ɼͦΕΒ 2 ͭͷίʔυϒϩοΫ͸Ϋ. ʹΑͬͯɼΫϩʔϯϖΞͱͳΓ͏ΔީิΛߜΔ͜ͱ͕Ͱ͖ɼ. ϩʔϯϖΞͰ͋Δͱ൑ఆ͢Δɽ. ߴ଎ͳΫϩʔϯϖΞͷ‫ݕ‬ग़͕ՄೳͱͳΔɽ͔͠͠ɼେ‫ن‬໛ ͷσʔληοτΛ LSH ΞϧΰϦζϜΛ༻͍ͯߴ͍ਫ਼౓Ͱ. 4. ධՁ࣮‫ݧ‬ ຊষͰ͸ɼຊ‫Ͱڀݚ‬ఏҊͨ͠ϒϩοΫΫϩʔϯ‫ݕ‬ग़๏ͷ. ‫ٻ‬ΊΑ͏ͱ͢ΔͱɼϝϞϦ࢖༻ྔ͕ඇৗʹେ͖͘ͳΔͱ͍ ͏໰୊఺͕͋Δ [8], [9].. LSH ͷΞϧΰϦζϜ͸ɼۭؒతʹۙ઀ͨ͠ೋ఺͕ಉ͡ ϋογϡ஋ʹͳΔ֬཰͕ߴ͘ͳΔΑ͏ͳϋογϡؔ਺Λ. ධՁ࣮‫͍ͯͭʹݧ‬ड़΂ΔɽධՁ࣮‫Ͱݧ‬͸ɼຊख๏ͱ‫ط‬ଘख ๏ͱͷൺֱͱɼຊख๏ͷ‫ݕ‬ग़࣌ؒͱεέʔϥϏϦςΟͷධ ՁΛߦͬͨɽ. ༻͍ɼಉ͡ϋογϡ஋ΛऔΔ఺Λಉ͡όέοτʹೖΕΔ͜ ͱͰΫϥελϦϯάΛߦ͏ɽ͔͠͠ LSH ͸֬཰తख๏Ͱ. 4.1 ؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ͱ CCFinder ͱͷൺֱ. ͋ΔͨΊɼۙ઀ͨ͠ೋ఺͕‫ۮ‬વʹผͷόέοτʹೖΔՄೳ. ຊઅͰ͸ؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ͱ CCFinder [7] ͷ 2 ͭͷ. ੑ͕͋Δɽैདྷͷ LSH Ͱ͸ෳ਺ͷϋογϡؔ਺Λ༻͍Δ. ‫ط‬ଘख๏ͱͷൺֱ࣮‫͍ͯͭʹݧ‬ड़΂Δɽຊ࣮‫Ͱݧ‬͸ɼର৅. ͜ͱͰ֬཰తͳ‫ࠩޡ‬Λগͳ͘͠ਫ਼౓Λ্͍͛ͯΔɽͦͷͨ. ϓϩδΣΫτ͔Β࡞੒ͨ͠ϕϯνϚʔΫʹର͢Δ‫ݕ‬ग़ਫ਼. Ίɼେ‫ن‬໛ͳσʔληοτʹରͯ͠͸ΑΓଟ਺ͷϋογϡ. ౓ͱ‫ݕ‬ग़࣌ؒͷ‫͔఺؍‬ΒൺֱΛߦͬͨɽCCFinder ͸ࠃ಺. ؔ਺͕ඞཁͱͳΓɼ͜ΕʹΑΓ LSH ͷϝϞϦ࢖༻ྔͷ૿. ֎ͷ‫ۀا‬ɾେֶͰ࢖༻͞Ε͍ͯΔͨΊɼൺֱର৅ʹ௥Ճ͠. େʹͭͳ͕͍ͬͯΔɽ. ͨɽຊ࣮‫ݕͰݧ‬ग़ର৅ͱͨ͠ϓϩδΣΫτͷҰཡΛද 1 ʹ. ͦ͜ͰɼLv Β͸ϝϞϦ࢖༻ྔͷվྑΛߦͬͨ Multi-Probe. LSH[9] ΛఏҊͨ͠ɽMulti-Probe LSH ͸ɼ͋Δ఺͕ೖΔό. ࣔ͢ɽ. 4.1.1 ϕϯνϚʔΫͷ࡞੒ํ๏ ϕϯνϚʔΫͷ࡞੒͸ҎԼͷ 3 εςοϓͰߦͬͨɽ. έοτ͚ͩͰͳ͘ɼۭؒతʹۙ઀ͨ͠όέοτ‫܈‬΋ௐ΂Δ ͱ͍͏΋ͷͰ͋Δɽ͜ΕʹΑΓɼগͳ͍ϋογϡؔ਺Ͱ΋. ( 1 ) ද 1 ͷϓϩδΣΫτʹର͠ɼຊख๏ɼؔ਺Ϋϩʔϯ. ‫ۮ‬વผͷόέοτʹೖͬͨ఺ͷ‫ݟ‬མͱ͠Λ๷͍Ͱ͍Δɽ࣮. ‫ݕ‬ग़๏ɼCCFinder ͷ 3 ͭͷख๏ͰίʔυΫϩʔϯΛ. ࡍʹɼ192 ࣍‫ݩ‬ͷσʔληοτʹରͯ͠ಉ͡࠶‫( ཰ݱ‬0.90). ‫ݕ‬ग़ɽ. ΛಘΔͨΊʹɼैདྷͷ LSH ͸ 49 ‫ݸ‬ͷϋογϡؔ਺͕ඞཁ. ( 2 ) ֤ख๏͕֤ϓϩδΣΫτ͔Β‫ݕ‬ग़ͨ͠ΫϩʔϯϖΞ͔. ͩͬͨͷʹର͠ɼMulti-Probe LSH ͸ 3 ‫ݸ‬ͷϋογϡؔ਺. Βɼ30 ‫ݸ‬ͷΫϩʔϯϖΞΛϥϯμϜαϯϓϦϯά͠ɼ. Ͱ‫ؒ࣌ࡧݕ‬Λ΄΅མͱͣ͞ʹୡ੒ͨ͠ͱ͍͏݁Ռ͕ใࠂ͞. ߹‫ ܭ‬270 ‫ݸ‬ͷΫϩʔϯϖΞू߹Λ࡞੒ɽ. Ε͍ͯΔ [9]ɽ ຊख๏Ͱ͸ϝϞϦ࢖༻ྔͷվྑΛߦ͏ͨΊɼMulti-Probe. LSH Λ༻͍ͯΫϥελϦϯάΛߦ͏ɽͳ͓ɼMulti-Probe. ( 3 ) (2) Ͱ࡞੒ͨ͠ 270 ‫ݸ‬ͷΫϩʔϯϖΞʹର͠ɼ໨ࢹͰ *2 *3 *4 *5. c 2017 Information Processing Society of Japan . https://falconn-lib.org/ http://httpd.apache.org/ http://www.postgresql.org/ http://www.python.org/. 5.

(10) Vol.2017-SE-196 No.19 2017/7/20. ৘ใॲཧֶձ‫ڀݚ‬ใࠂ IPSJ SIG Technical Report ද 2. ू໿·ͨ͸ಉ࣌मਖ਼ͷอकର৅ͱͳΔίʔυΫϩʔϯ. Table 2 Evaluation of detection accuracy. ͔ͷ൑அΛߦ͍ɼϕϯνϚʔΫΛ࡞੒ɽ ͳ͓ɼϕϯνϚʔΫʹ٬‫؍‬ੑΛ࣋ͨͤΔͨΊɼΞϯέʔ. ‫ݕ‬ग़ख๏. ‫ݕ‬ग़ର৅. ద߹཰. ࠶‫཰ݱ‬. F஋. Apache HTTPD. 0.90. 0.74. 0.81. PostgreSQL. 0.57. 0.87. 0.69. Python. 0.90. 0.53. 0.67. ߹‫ܭ‬. 0.68. 0.70. 0.69. Apache HTTPD. 0.87. 0.53. 0.66. ؔ਺Ϋϩʔϯ. PostgreSQL. 0.83. 0.74. 0.78. ‫ݕ‬ग़๏. Python. 0.30. 0.21. 0.25. ߹‫ܭ‬. 0.67. 0.47. 0.55. Apache HTTPD. 0.70. 0.55. 0.62. PostgreSQL. 0.13. 0.33. 0.19. Python. 0.87. 0.63. 0.73. ߹‫ܭ‬. 0.57. 0.52. 0.54. τʹΑΓୈࡾऀʹίʔυΫϩʔϯͷ൑அΛґཔͨ͠ɽΞϯ ຊख๏. έʔτͷ֓ཁΛҎԼʹࣔ͢ɽ ௐࠪର৅. ҎԼͷ 3 ໊ʹґཔ. • ίʔυΫϩʔϯͷ‫ ऀڀݚ‬1 ໊ • ίʔυΫϩʔϯͷ‫͍ͯ͠ࣄैʹڀݚ‬ΔେֶӃੜ 2 ໊ ࣭໰಺༰. ू໿·ͨ͸ಉ࣌मਖ਼ͷอकର৅ͱͳΔ͔. ճ౴ํࣜ. ೋ୒ʢ͸͍/͍͍͑ʣ. ্‫ه‬ͷ࣭໰Λɼ‫ݕ‬ग़݁Ռ͔ΒαϯϓϦϯάͨ͠ 270 ‫ݸ‬ͷΫ CCFinder. ϩʔϯϖΞʹରͯ͠ߦͬͨɽͦͯ͠ɼա൒਺Ͱ͋Δ 2 ਓҎ ্͕ʮ͸͍ʯͱճ౴ͨ͠ΫϩʔϯϖΞΛਖ਼ղͱ͠ɼຊ࣮‫Ͱݧ‬ ༻͍Δਖ਼ղΫϩʔϯϖΞू߹ͱͯ͠ϕϯνϚʔΫΛ࡞੒͠ ͨɽຊ࣮‫Ͱݧ‬͸ɼ֤ϓϩδΣΫτͷਖ਼ղΫϩʔϯϖΞ਺͸ɼ. ද 3. Apache HTTPD ͕ 74 ‫ݸ‬ɼPostgreSQL ͕ 46 ‫ݸ‬ɼPython. ຊઅͰ͸ɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ͱ CCFinder ͱͷൺֱ࣮. ‫ݕ‬ग़࣌ؒͷൺֱ. Table 3 Comparison of detection time. ͕ 62 ‫ͨͬͳͱݸ‬ɽ. 4.1.2 ൺֱ݁Ռ. ‫ݕ‬ग़ਫ਼౓ͷධՁ. ‫ݕ‬ग़ର৅. ຊख๏. ؔ਺Ϋϩʔϯ‫ݕ‬ग़๏. CCFinder. Apache HTTPD. 1m 39s. 4m 7s. 2m 1s. PostgreSQL. 2m 27s. 8m 47s. 5m 30s. Python. 1m 15s. 3m 33s. 3m 10s. ‫ݧ‬ͷ݁Ռʹ͍ͭͯड़΂Δɽຊ࣮‫Ͱݧ‬͸ɼϕϯνϚʔΫΛ༻ ͍ͨ‫ݕ‬ग़ਫ਼౓ͱ‫ݕ‬ग़࣌ؒͷ‫͔఺؍‬ΒൺֱΛߦͬͨɽ‫ݕ‬ग़ਫ਼ ౓ͷࢦඪͱͯ͠ɼద߹཰ɼ࠶‫཰ݱ‬ɼF ஋ͷ 3 ͭͷࢦඪΛ༻. ૯߹తʹ࠶‫͕֬ͱ͍͜ߴ͕཰ݱ‬ೝͰ͖ͨɽ. ͍ͨɽຊ࣮‫͚͓ʹݧ‬Δ 3 ͭͷࢦඪΛද 2 ʹࣔ͢ɽ. ‫ݕ‬ग़࣌ؒ. ద߹཰ ຊख๏Ͱ͸ɼApache HTTPD ͱ Python ʹ͓͍ͯɼؔ਺. ‫ݕ‬ग़࣌ؒͷൺֱͰ͸ɼද 1 ͷ 3 ͭͷϓϩδΣΫτʹର͢ Δ‫ݕ‬ग़࣌ؒΛଌఆͨ͠ɽຊ࣮‫ݧ‬ͷ࣮ߦ‫ڥ؀‬͸ɼCPU Intel. Ϋϩʔϯ‫ݕ‬ग़๏΍ CCFinder ΑΓߴ͍ద߹཰͕ಘΒΕͨɽ. Xeon 2.80GHz 4coreɼϝϞϦ 16GBɼϋʔυσΟεΫυϥ. ·ͨɼPostgreSQL ʹ͓͍ͯ͸ɼCCFinder ΑΓ͸ߴ͍ద. ΠϒɼOS Windows 10 64bit Ͱ͋Δɽ. ߹཰͕ಘΒΕ͕ͨɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ΑΓ௿͍ద߹཰ͱ. ‫ݕ‬ग़࣌ؒͷൺֱ݁ՌΛද 3 ʹࣔ͢ɽຊख๏Ͱ͸ɼ‫ݕ‬ग़ର. ͳͬͨɽ3 ͭͷશͯͷϓϩδΣΫτͷ߹‫Ͱܭ‬͸ద߹཰ 0.68. ৅શͯͷϓϩδΣΫτʹରͯ͠ 3 ෼ҎԼͰίʔυΫϩʔϯ. ͱɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ͱಉఔ౓ͷద߹཰ɼCCFinder Α. Λ‫ݕ‬ग़͕Ͱ͖ͨɽ·ͨɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ʹରͯ͠ 3∼4. Γߴ͍ద߹཰Ͱ͋Δ͜ͱ͕֬ೝͰ͖ͨɽ. ׂఔ౓ɼCCFinder ʹରͯ͠ 4∼8 ׂఔ౓ͱɼൺֱख๏ΑΓ. ࠶‫཰ݱ‬. ΋୹࣌ؒͰ‫ݕ‬ग़͢Δ͜ͱ͕֬ೝͰ͖ͨɽ. ຊख๏Ͱ͸ɼApache HTTPDɼPostgreSQL ʹ͓͍ͯɼ ؔ਺Ϋϩʔϯ‫ݕ‬ग़๏΍ CCFinder ΑΓߴ͍࠶‫͕཰ݱ‬ಘΒ Εͨɽ·ͨɼPython ʹ͓͍ͯ͸ɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏Α Γ͸ߴ͍࠶‫͕཰ݱ‬ಘΒΕ͕ͨɼCCFinder ΑΓ௿͍࠶‫཰ݱ‬. 4.2 ‫ݕ‬ग़࣌ؒͱεέʔϥϏϦςΟͷධՁ ຊઅͰ͸ɼఏҊख๏ͷ‫ݕ‬ग़‫ن‬໛͝ͱͷ‫ݕ‬ग़࣌ؒͱεέʔ ϥϏϦςΟͷධՁʹ͍ͭͯड़΂Δɽ‫ݕ‬ग़ର৅ͷ‫ن‬໛ͷࢦඪ. ͱͳͬͨɽ3 ͭͷશͯͷϓϩδΣΫτͷ߹‫Ͱܭ‬͸ɼ࠶‫཰ݱ‬. ʹ͸ίϝϯτ΍ۭߦΛআ͍ͨ LOC Λ༻͍Δ͜ͱͱͨ͠ɽ. 0.70 ͱɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ͱ CCFinder ΑΓ૯߹తʹ࠶. ‫ܭ‬ଌʹ͸ cloc*6 Λ༻͍ͨɽ‫ݕ‬ग़ର৅͸ɼIJaDataset 2.0*7 ͔. ‫͕֬ͱ͍͜ߴ͕཰ݱ‬ೝͰ͖ͨɽ. ΒϥϯμϜʹϑΝΠϧΛબ୒͠ɼද 4 ʹࣔͨ͠ LOC ͝ͱ. F஋. ͷαϒηοτγεςϜΛ࡞੒ͨ͠ɽຊ࣮‫ݧ‬ͷ࣮ߦ‫ڥ؀‬͸ɼ. ຊख๏Ͱ͸ɼApache HTTPD ʹ͓͍ͯ F ஋ 0.81 ͱɼؔ. CPU Intel Xeon 2.80GHz 4coreɼϝϞϦ 32GBɼιϦο. ਺Ϋϩʔϯ‫ݕ‬ग़๏΍ CCFinder ΑΓߴ͍ F ஋͕ಘΒΕͨɽ. υεςʔτυϥΠϒɼOS Windows 10 64bit Ͱ͋Δɽ·. PostgreSQL ʹ͓͍ͯ͸ F ஋ 0.69 ͱɼCCFinder ΑΓ͸ߴ. ͨɼJava Ծ૝ϚγϯͷελοΫྖҬΛ 1GBɼώʔϓྖҬ. ͍ F ஋͕ಘΒΕ͕ͨɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ΑΓ௿͍ F ஋. Λ 15GB ͱઃఆ࣮ͯ͠ߦͨ͠ɽ. ͱͳͬͨɽ·ͨɼPython ʹ͓͍ͯ͸ F ஋ 0.67 ͱɼؔ਺Ϋ. ‫ݕ‬ग़࣌ؒͷධՁ݁ՌΛද 4 ʹࣔ͢ɽ͜ΕʹΑΓɼຊख๏. ϩʔϯ‫ݕ‬ग़๏ΑΓ͸ߴ͍ F ஋͕ಘΒΕ͕ͨɼCCFinder Α. Ͱ͸ 100MLOC ͷ‫ݕ‬ग़ର৅ʹରͯ͠ 4 ࣌ؒఔ౓Ͱ‫ݕ‬ग़Մೳ. Γ௿͍ F ஋ͱͳͬͨɽ3 ͭͷશͯͷϓϩδΣΫτͷ߹‫ܭ‬ͷ. F ஋͸ 0.69 ͱͳΓɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ͱ CCFinder ΑΓ. c 2017 Information Processing Society of Japan . *6 *7. http://cloc.sourceforge.net http://secold.org/projects/seclone. 6.

(11) Vol.2017-SE-196 No.19 2017/7/20. ৘ใॲཧֶձ‫ڀݚ‬ใࠂ IPSJ SIG Technical Report ද 4. ‫ݕ‬ग़‫ن‬໛͝ͱͷ‫ݕ‬ग़࣌ؒ. Table 4 Detection time for varying input size LOC. 1K. 10K. 100K. 1M. 10M. 100M. ࣌ؒ. 1s. 2s. 15s. 2m 53s. 32m 59s. 4h 5m 17s. Ͱ͋Δ͜ͱ͕֬ೝͰ͖ͨɽ. ɪɪɫśɏſƒ’”ɏ•–ƒ–—•ɏ–ƀƒ’”ɏˆ‹Ž‡ɏˆŽ—•Šſƒ’”ɏˆ‹Ž‡ɏ– Ƌ–Їˆ‹Ž‡ƀ ɪɪɬśƇ ɪɪɭśƒ’”ɏ•–ƒ–—•ɏ– ”˜ ʰɏŚ ɪɪɮś ɪɪɯś‹ˆſ–Їˆ‹Ž‡Şʴ„—ˆˆ‡”‡†ƀƇ ɪɪɰśˆ‹Ž‡ɏŽ‘ ſ–Їˆ‹Ž‡ƀŚ ɪɫɥś”˜ ʰƒ’”ɏˆ‹Ž‡ɏˆŽ—•ŠɏŽ‘ ‡†ſ–Їˆ‹Ž‡ƀŚ ɪɫɨśˆ‹Ž‡ɏ—Ž‘ ſ–Їˆ‹Ž‡ƀŚ ɪɫɩśƈ ɪɫɪśŵƋ ɪɫɫśƋ 䠄䝁䝯䞁䝖┬␎䠅 ɪɫɬśƋŵ ɪɫɭś”‡–—””˜Ś ɪɫɮśƈ. 4.3 ϒϩοΫΫϩʔϯͷ࣮ྫ ຊઅͰ͸ɼΞϯέʔτʹͯอकର৅ͷίʔυΫϩʔϯͱ. (a) httpd/srclib/apr/file io/unix/readwrite.c. ൑அ͞ΕͨΫϩʔϯϖΞͷத͔Βɼຊख๏ʹΑͬͯ‫ݕ‬ग़͠ ͨϒϩοΫΫϩʔϯͷ࣮ྫΛࣔ͢ɽ ಉؔ͡਺಺ʹଘࡏ͢ΔϒϩοΫΫϩʔϯ ؔ਺Ϋϩʔϯ‫ݕ‬ग़๏Ͱ͸ؔ਺୯Ґͷ‫ݕ‬ग़ͷͨΊɼಉؔ͡ ਺಺ͰίϐʔΞϯυϖʔετΛߦ͏ͳͲͯؔ͠਺಺Ͱൃੜ ͨ͠ίʔυΫϩʔϯΛ‫ݕ‬ग़Ͱ͖ͳ͔͕ͬͨɼຊख๏Ͱ͸‫ݕ‬ ग़ՄೳͰ͋ΔྫΛ֬ೝͰ͖ͨɽ ௕͍ؔ਺಺ͷҰ෦͕Ұக͢ΔϒϩοΫΫϩʔϯ. 90 ߦҎ্ͷ௕͍ؔ਺಺ͷҰ෦͕Ұக͢ΔϒϩοΫΫϩʔ ϯͰ͋Δɽؔ਺୯ҐͰ͸ҟͳΔॲཧΛߦ͍ͬͯΔͨΊɼؔ ਺Ϋϩʔϯ‫ݕ‬ग़๏Ͱ͸͜ͷΑ͏ͳίʔυΫϩʔϯΛ‫ݕ‬ग़Ͱ ͖ͳ͔ͬͨɽ͔͠͠ɼຊख๏Ͱ͸‫ݕ‬ग़ՄೳͰ͋ΔྫΛ֬ೝ Ͱ͖ͨɽ จͷૠೖ͕ߦΘΕͨϒϩοΫΫϩʔϯ ؔ਺୯ҐͰ͸ɼྨࣅͨ͠ॲཧΛߦ͏λΠϓ 4 ͷίʔυΫ. ɪɫɰśɏſƒ’”ɏ•–ƒ–—•ɏ–ƀƒ’”ɏˆ‹Ž‡ɏ•› ſƒ’”ɏˆ‹Ž‡ɏ– Ƌ–Їˆ‹Ž‡ƀ ɪɬɥśƇ ɪɬɨśƒ’”ɏ•–ƒ–—•ɏ– ”˜ ʰɏŚ ɪɬɩś ɪɬɪśˆ‹Ž‡ɏŽ‘ ſ–Їˆ‹Ž‡ƀŚ ɪɬɫś ɪɬɬś‹ˆſ–Їˆ‹Ž‡Şʴ„—ˆˆ‡”‡†ƀƇ ɪɬɭś”˜ ʰƒ’”ɏˆ‹Ž‡ɏˆŽ—•ŠɏŽ‘ ‡†ſ–Їˆ‹Ž‡ƀŚ ɪɬɮś ɪɬɯś‹ˆſ”˜ ŠʰɏƀƇ ɪɬɰśˆ‹Ž‡ɏ—Ž‘ ſ–Їˆ‹Ž‡ƀŚ ɪɭɥś”‡–—””˜Ś ɪɭɨśƈ ɪɭɩśƈ ɪɭɪś ɪɭɫś‹ˆſˆ•› ſ–Їˆ‹Ž‡Şʴˆ‹Ž‡†‡•ƀƀƇ ɪɭɬś”˜ ʰƒ’”ɏ‰‡–ɏ‘•ɏ‡””‘”ſƀŚ ɪɭɭśƈ ɪɭɮś ɪɭɯśˆ‹Ž‡ɏ—Ž‘ ſ–Їˆ‹Ž‡ƀŚ ɪɭɰś ɪɮɥś”‡–—””˜Ś ɪɮɨśƈ. (b) httpd/srclib/apr/file io/unix/readwrite.c ਤ 6. ϑΝΠϧͷग़ྗॲཧΛߦ͏ϒϩοΫΫϩʔϯ. Fig. 6 Block clones writing data to a file. ϩʔϯͰ͋Δ͕ɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏Ͱ͸࣮ࡍʹ‫ݕ‬ग़Ͱ͖ ͳ͔ͬͨɽ͔͠͠‫ݕ‬ग़ཻ౓ΛԼ͛Δ͜ͱͰɼຊख๏Ͱ͸‫ݕ‬. ‫ݕ‬ग़๏ͷ༗༻ੑ͕֬ೝͰ͖Δ [10]ɽ͔͠͠ɼؔ਺Ϋϩʔϯ. ग़ՄೳͰ͋ΔྫΛ֬ೝͰ͖ͨɽ. ‫ݕ‬ग़๏Ͱ͸ؔ਺୯Ґͷ‫ݕ‬ग़ͷͨΊʹ‫ݕ‬ग़࿙Ε͕ੜ͍ͯ͡Δ. ྨࣅͨ͠ॲཧΛߦ͏ϒϩοΫΫϩʔϯ. ՄೳੑΛߟ͑ɼ‫ݕ‬ग़ཻ౓ΛίʔυϒϩοΫ୯Ґʹখ͘͞͠. ਤ 6 ͸ɼϑΝΠϧͷग़ྗॲཧΛߦ͏λΠϓ 4 ͷϒϩοΫ. ͨຊख๏ΛఏҊͨ͠ɽຊ࣮‫ݧ‬ͷ݁Ռɼ3 ͭͷϓϩδΣΫτ. ΫϩʔϯͰ͋Δɽ੺৭ͷίʔυยͱ྘৭ͷίʔυย͕Ұக. ͷ߹‫ʹܭ‬ରͯ͠ద߹཰ɼ࠶‫཰ݱ‬ɼF ஋ͷ 3 ͭͷࢦඪͰؔ਺. ՕॴΛද͍ͯ͠ΔɽͲͪΒ΋ϑΝΠϧͷೖग़ྗؔ࿈ͷॲཧ. Ϋϩʔϯ‫ݕ‬ग़๏΍ CCFinder ΑΓߴ͍஋͕ಘΒΕɼ‫ݕ‬ग़ਫ਼. Λߦ͏ؔ਺͕ͩɼਤ 6(b) ͷ apr file sync ؔ਺͕ɼਤ 6(a). ౓ͷ‫Ͱ఺؍‬ຊख๏ͷ༗༻ੑΛ֬ೝͰ͖ͨɽ. ͷ apr file flush ؔ਺ʹՃ͑ͯಠࣗͷॲཧΛߦ͏ͨΊɼؔ਺. ‫ݕ‬ग़࣌ؒͱεέʔϥϏϦςΟ. Ϋϩʔϯ‫ݕ‬ग़๏Ͱ͸‫ݕ‬ग़Ͱ͖ͳ͔ͬͨɽ͔͠͠‫ݕ‬ग़ཻ౓Λ. ຊख๏͸ؔ਺Ϋϩʔϯ‫ݕ‬ग़๏΍ CCFinder ΑΓߴ଎ʹ‫ݕ‬. Լ͛Δ͜ͱͰ֤ؔ਺ͷ‫ڞ‬௨෦෼Λ‫͚ͭݟ‬ग़͠ɼਤ 6 ͷΑ͏. ग़Ͱ͖Δ͜ͱ͕֬ೝͰ͖ͨɽ·ͨɼ100MLOC ͱ͍͏େ‫ن‬. ʹ‫ݕ‬ग़ՄೳͰ͋ΔྫΛ֬ೝͰ͖ͨɽ. ໛ϓϩδΣΫτʹରͯ͠ 4 ࣌ؒఔ౓Ͱ‫ݕ‬ग़Ͱ͖Δ͜ͱ΋֬ ೝͰ͖ͨɽ‫ݕ‬ग़ͷߴ଎Խͷཧ༝ͱͯ͠ɼMulti-Probe LSH. 4.4 ߟ࡯ ຊઅͰ͸ɼఏҊख๏ɼ͓ΑͼධՁ࣮‫ݧ‬ͷ݁Ռʹ͍ͭͯͷ. Λ༻͍ͯΫϥελϦϯάΛߦ͏͜ͱͰɼϝϞϦ࢖༻ྔ͕‫ݮ‬ ΓɼΑΓޮ཰తͳ‫͑ߦ͕ࢉܭ‬Δ͜ͱʹΑΓߴ଎Խʹ‫͕ͬܨ‬. ٞ࿦Λߦ͍ɼຊख๏ͷ༗༻ੑɼ֦ுੑɼ͓ΑͼධՁ࣮‫ݧ‬ͷ. ͨͱߟ͑ΒΕΔɽ. ଥ౰ੑʹ͍ͭͯͷߟ࡯Λߦ͏ɽ. ϒϩοΫΫϩʔϯͷ࣮ྫ. ‫ݕ‬ग़ਫ਼౓. ϒϩοΫΫϩʔϯͷ࣮ྫʹΑͬͯɼ௕͍ؔ਺಺ͷҰ෦͕. ࢁதΒ͕ߦͬͨ Tempero ΒͷίʔύεΛ༻͍ͨධՁ࣮. Ұக͢ΔίʔυΫϩʔϯ΍ɼಉؔ͡਺಺ʹଘࡏ͢Δίʔυ. ‫Ͱݧ‬͸ɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏͸ద߹཰͕ 90%Λ௒͑Δͱ͍. ΫϩʔϯͳͲɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏Ͱ͸‫ݕ‬ग़Ͱ͖ͳ͔ͬͨ. ͏ใࠂ͕͞Ε͍ͯΔ [13]ɽ·ͨɼপాΒ͕ߦͬͨόάΛ‫ؚ‬. ίʔυΫϩʔϯΛ֬ೝͰ͖ͨɽΑͬͯɼؔ਺Ϋϩʔϯ‫ݕ‬ग़. Ήίʔυยʹର͢Δؔ਺Ϋϩʔϯ‫ݕ‬ग़ख๏ͱ CCFinder ͷ. ๏ʹΑΔ‫ݕ‬ग़࿙Εͷ࡟‫ݮ‬Λࣔͤͨɽ. ൺֱ࣮‫ʹݧ‬ΑΓɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏͕ CCFinder ΑΓߴ. ຊख๏ͷ֦ுੑ. ͍ద߹཰ΛಘΒΕΔ͜ͱ΋ใࠂ͞Ε͓ͯΓɼؔ਺Ϋϩʔϯ. c 2017 Information Processing Society of Japan . ຊख๏ͷ࣮૷͸ɼ‫ ࡏݱ‬C ‫ ͱޠݴ‬Java ‫ʹޠݴ‬ͷΈରԠ͠. 7.

(12) Vol.2017-SE-196 No.19 2017/7/20. ৘ใॲཧֶձ‫ڀݚ‬ใࠂ IPSJ SIG Technical Report. ͍ͯΔɽ͔͠͠ɼຊख๏Ͱ͸ ANTLR Λ༻͍ͯߏจղੳΛ. ༻ੑΛධՁ͢Δඞཁ͕͋Δɽ͞ΒʹɼMeCC ͳͲͷଞ. ߦ͓ͬͯΓɼANTLR ʹͯߏจղੳΛߦ͏ͨΊͷจ๏ϑΝ. ͷπʔϧͱͷൺֱΛߦ͏ඞཁ͕͋Δɽ. Πϧ͕ 100 छྨҎ্༻ҙ͞Ε͍ͯΔ͜ͱ͔Βɼଞͷ‫΁ޠݴ‬ ͷ֦ு͕༰қʹՄೳՄೳͰ͋Δɽ ·ͨɼຊख๏Ͱ͸ίʔυϒϩοΫΛ“{ }”ʹғ·Εͨൣ ғʹΑͬͯநग़͍ͯ͠Δɽ͔͠͠ɼࣈԼ͛ʹΑΔίʔυϒ. ँࣙ ຊ‫༷͍͓ͯʹڀݚ‬ʑͳ͝‫ྗڠ‬Λ͍͍ͨͩͨ೔ຊి ‫ࣜגؾ‬ձࣾ લా௚ਓࢯɼौ୩݈հࢯʹਂ͘‫͢ँײ‬Δɽຊ‫ݚ‬ ‫ڀ‬͸ JSPS Պ‫ݚ‬අ 25220003ɼ16K16034ɼ15H06344 ͷॿ ੒Λड͚ͨɽ. ϩοΫͷநग़΍ɼ͋ΔҰఆߦ਺ͷ·ͱ·ΓΛίʔυϒϩο Ϋͱͯ͠நग़͢ΔͳͲɼநग़ํ๏ʹଞͷख๏Λద༻͢Δ͜. ࢀߟจ‫ݙ‬. ͱ΋ՄೳͰ͋Δɽ. [1]. ධՁ࣮‫ݧ‬ͷଥ౰ੑ ຊ࣮‫Ͱݧ‬͸ɼ3 ͭͷ C ‫ޠݴ‬ͷϓϩδΣΫτͷΈʹରͯ͠ ൺֱΛߦ͏͜ͱʹΑͬͯຊख๏ͷ༗༻ੑΛࣔͨ͠ɽ͔͠ ͠ɼࠓ‫ޙ‬ɼଞͷ‫͞૷࣮Ͱޠݴ‬Εͨଟ͘ͷϓϩδΣΫτʹର. [2]. ͯ͠ద༻͠ɼҰൠੑΛࣔ͢ඞཁੑ͕͋Δɽ·ͨɼࠓճ͸ϕ ϯνϚʔΫͷ࡞੒ʹ͓͍ͯɼΞϯέʔτʹͯա൒਺Ͱ͋Δ. [3]. 2 ਓҎ্͕อकର৅ͱͳΓ͏ΔίʔυΫϩʔϯͱ൑ఆͨ͠ ৔߹Λਖ਼ղͱͯ͠ѻͬͨɽ͔͠͠ɼΑΓਖ਼֬ੑΛ্͛Δͨ Ίʹ͸ɼҙ‫ׂ͕ݟ‬ΕͨΫϩʔϯϖΞʹ͍ͭͯٞ࿦Λߦ͏ඞ. [4]. ཁੑ΋ߟ͑ΒΕΔɽ. 5. ·ͱΊͱࠓ‫ޙ‬ͷ՝୊ ຊ‫Ͱڀݚ‬͸ɼ৘ใ‫ٕࡧݕ‬ज़Λར༻ͨ͠ϒϩοΫΫϩʔϯ. [5]. [6]. ‫ݕ‬ग़ख๏ͷఏҊΛߦͬͨɽຊख๏Ͱ͸ɼߏจղੳΛߦ͍ ίʔυϒϩοΫͷநग़Λߦ͍ɼίʔυϒϩοΫதͷࣝผࢠ ΍༧໿‫ʹޠ‬ར༻͞Ε͍ͯΔ୯‫͔ޠ‬ΒϫʔυΛநग़͢Δɽͦ. [7]. ͯ͠ɼTF-IDF Λར༻֤ͯ͠ϫʔυʹର͢ΔॏΈΛ‫͠ࢉܭ‬ɼ ͦͷॏΈΛಛ௃ྔͱ֤ͯ͠ίʔυϒϩοΫΛಛ௃ϕΫτ ϧʹม‫͢׵‬Δɽͦͷ‫ޙ‬ɼಛ௃ϕΫτϧؒͷྨࣅ౓Λ‫͢ࢉܭ‬. [8]. Δ͜ͱʹΑͬͯɼҙຯతʹॲཧ಺༰͕ྨࣅͨ͠ϒϩοΫΫ ϩʔϯͷ‫ݕ‬ग़Λߦ͏ɽ·ͨɼLSH ΞϧΰϦζϜΛ༻͍ͯ͋. [9]. Β͔͡Ίಛ௃ϕΫτϧͷΫϥελϦϯάΛߦ͏͜ͱʹΑͬ ͯɼߴ଎ͳϒϩοΫΫϩʔϯͷ‫ݕ‬ग़Λ࣮‫ͨ͠ݱ‬ɽ ධՁ࣮‫Ͱݧ‬͸ɼ3 ͭͷ C ϓϩδΣΫτʹର͠ɼ‫ݕ‬ग़ਫ਼౓ ͱ‫ݕ‬ग़࣌ؒͷ‫͔఺؍‬Βɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏ͱ CCFinder. [10]. ͷ 2 ͭͷख๏ͱൺֱΛߦͬͨɽͦͷ݁Ռɼຊख๏͕૯߹త ʹߴ͍ਫ਼౓Ͱଟ͘ͷίʔυΫϩʔϯΛ‫ݕ‬ग़Ͱ͖Δ͜ͱ͕֬ ೝͰ͖ͨɽ·ͨɼ‫ݕ‬ग़࣌ؒ͸ 3 ෼ҎԼͱͳΓɼؔ਺Ϋϩʔ. [11]. ϯ‫ݕ‬ग़๏ͱ CCFinder ΑΓߴ଎ʹίʔυΫϩʔϯΛ‫ݕ‬ग़Ͱ ͖ͨɽ͞Βʹɼؔ਺Ϋϩʔϯ‫ݕ‬ग़๏Ͱ͸‫ݕ‬ग़Ͱ͖ͳ͔ͬͨ. [12]. ίʔυϒϩοΫ୯ҐͷίʔυΫϩʔϯΛ‫ݕ‬ग़Ͱ͖ͨɽ ࠓ‫ޙ‬ͷ՝୊ͱͯ͠ɼҎԼ͕‫͛ڍ‬ΒΕΔɽ. • ϫʔυͷॏΈͷ‫ʹࢉܭ‬ɼLSIʢLatent Semantic Index-. [13]. ingʣ[2] ΍ɼLDAʢLatent Dirichlet Allocationʣ[4] Λ ༻͍ͨख๏ͱൺֱΛߦ͏ඞཁ͕͋Δɽ. • ຊख๏Ͱ͸“{ }”ʹΑͬͯίʔυϒϩοΫͷఆٛΛߦͬ ͍ͯΔ͕ɼίʔυϒϩοΫͷఆٛ΍நग़ํ๏Λ࠶ߟ͢ Δඞཁ͕͋Δɽ. [14]. Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I. and Schmidt, L.: Practical and Optimal LSH for Angular Distance, Proceedings of the 28th International Conference on Neural Information Processing Systems, pp. 1225–1233 (2015). Baeza-Yates, R. and Ribeiro-Neto, B.: Modern information retrieval: The concepts and technology behind search, Addison-Wesley (2011). Baxter, I. D., Yahin, A., Moura, L., Anna, M. S. and Bier, L.: Clone detection using abstract syntax trees, Proceedings of International Conference on Software Maintenance, pp. 368–377 (1998). Blei, D. M., Ng, A. Y. and Jordan, M. I.: Latent dirichlet allocation, Journal of machine Learning research, Vol. 3, No. Jan, pp. 993–1022 (2003). ං‫ޙ‬๕थɼೇຊਅೋɼҪ্ࠀ࿠ɿίʔυΫϩʔϯ‫ݕ‬ग़ͱͦ ͷؔ࿈ٕज़ɼిࢠ৘ใ௨৴ֶձ࿦จࢽɼVol. J91-D, No. 6, pp. 1465–1481 (2008). Indyk, P. and Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality, Proceedings of the 30th annual ACM symposium on Theory of computing, pp. 604–613 (1998). Kamiya, T., Kusumoto, S. and Inoue, K.: CCFinder: a multilinguistic token-based code clone detection system for large scale source code, IEEE Transactions on Software Engineering, Vol. 28, No. 7, pp. 654–670 (2002). ‫ݹ‬լ‫ࢤٱ‬ɿϋογϡΛ༻͍ͨྨࣅ‫ٕࡧݕ‬ज़ͱͦͷԠ༻ɼ ిࢠ৘ใ௨৴ֶձ‫ૅج‬ɾ‫ڥ‬քιαΠΤςΟ Fundamentals Reviewɼ Vol. 7, No. 3, pp. 256–268 (2014). Lv, Q., Josephson, W., Wang, Z., Charikar, M. and Li, K.: Multi-probe LSH: efficient indexing for highdimensional similarity search, Proceedings of the 33rd international conference on Very large data bases, pp. 950–961 (2007). Numata, S., Yoshida, N., Choi, E. and Inoue, K.: On the Effectiveness of Vector-Based Approach for Supporting Simultaneous Editing of Software Clones, pp. 560–567, Springer International Publishing (2016). Rattan, D., Bhatia, R. and Singh, M.: Software clone detection: A systematic review, Information and Software Technology, Vol. 55, No. 7, pp. 1165–1199 (2013). Roy, C. K., Cordy, J. R. and Koschke, R.: Comparison and evaluation of code clone detection techniques and tools: A qualitative approach, Science of Computer Programming, Vol. 74, No. 7, pp. 470–495 (2009). ࢁத༟थɼቐɹԸಬɼ٢ాଇ༟ɼҪ্ࠀ࿠ɿ৘ใ‫ٕࡧݕ‬ज़ ʹ‫ߴͮ͘ج‬଎ͳؔ਺Ϋϩʔϯ‫ݕ‬ग़ɼ৘ใॲཧֶձ࿦จࢽɼ Vol. 55, No. 10, pp. 2245–2255 (2014). Yamanaka, Y., Choi, E., Yoshida, N., Inoue, K. and Sano, T.: Applying clone change notification system into an industrial development process, Proceedings of the 21st International Conference on Program Comprehension, pp. 199–206 (2013).. • ଞͷେ‫ن‬໛ϓϩδΣΫτʹରͯ͠ద༻͠ɼຊख๏ͷ༗ c 2017 Information Processing Society of Japan . 8.

(13)

Fig. 1 An example of parts of
Fig. 2 An overview of our detection approach
Table 2 Evaluation of detection accuracy ݕग़ख๏ ݕग़ର৅ ద߹཰ ࠶ݱ཰ F ஋ Apache HTTPD 0.90 0.74 0.81 ຊख๏ PostgreSQL 0.57 0.87 0.69 Python 0.90 0.53 0.67 ߹ܭ 0.68 0.70 0.69 Apache HTTPD 0.87 0.53 0.66 ؔ਺Ϋϩʔϯ PostgreSQL 0.83 0.74 0.78 ݕग़๏ Python 0.30 0.21 0.25 ߹ܭ 0.67
Table 4 Detection time for varying input size

参照

関連したドキュメント

Here is the “surprise”: the validity of assumption (2.14) on Claim 2.3 for some hyperbolic/Petrowski-type systems is verified (see Section 4) by precisely the same hard analysis

Here is the “surprise”: the validity of assumption (2.14) on Claim 2.3 for some hyperbolic/Petrowski-type systems is verified (see Section 4) by precisely the same hard analysis

Bae, “Blind grasp and manipulation of a rigid object by a pair of robot fingers with soft tips,” in Proceedings of the IEEE International Conference on Robotics and Automation

T´oth, A generalization of Pillai’s arithmetical function involving regular convolutions, Proceedings of the 13th Czech and Slovak International Conference on Number Theory

Moreover, it is important to note that the spinodal decomposition and the subsequent coarsening process are not only accelerated by temperature (as, in general, diffusion always is)

de la CAL, Using stochastic processes for studying Bernstein-type operators, Proceedings of the Second International Conference in Functional Analysis and Approximation The-

In this paper, we obtain some better results for the distance energy and the distance Estrada index of any connected strongly quotient graph (CSQG) as well as some relations between

This is applied in Section 3 to linear delayed neutral difference- differential equations and systems, with bounded operator-valued coefficients: For weighted LP-norms or