
Efforts Toward Ultra Low Power High Performance Computing


Academic year: 2021



Research Activity for Ultra Low Power HPC

Satoshi OHSHIMA (1, a), Luo Cheng (2), Shoichi HIRASAWA (3), Takahiro KATAGIRI (1), Reiji SUDA (2), Hiroki HONDA (4)

Abstract: Ultra low power high performance computing (ULPHPC), which aims to obtain high computing performance while reducing power consumption, is becoming increasingly important for realizing next-generation supercomputers as well as high-performance personal computers and workstations. Moreover, because there is strong social demand to reduce peak power and energy consumption, technologies for building and operating low-power computers are increasingly required. We have pursued ULPHPC in the “ULP-HPC: Ultra Low-Power, High Performance Computing via Modeling and Optimization of Next Generation HPC Technologies” project. In this project, we developed a power measurement method and a power measurement API, and we developed and applied auto-tuning technologies that use power information. In this report, we describe the activities and results of our research on ULPHPC.

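1 Information Technology Center, The University of Tokyo
2 Graduate School of Information Science and Technology, The University of Tokyo
3 Graduate School of Information Sciences, Tohoku University
4 Graduate School of Information Systems, The University of Electro-Communications
a) [email protected]

1. Introduction

Power-related issues are becoming increasingly important in the design and development of computer systems. For example, when building or installing a supercomputer, power-related constraints such as generators and power supply facilities are among the major limitations. In recent years, environmental concerns and the domestic electricity situation have also created demand for reducing total power consumption and peak power, so low-power and low-energy technologies are growing in importance for computer systems of every size.

Power saving in computer systems is the subject of a great deal of research and product development in many fields, both in Japan and abroad. There are many approaches. On the hardware side, examples include developing devices that operate at lower voltages, making hardware run faster at the same power so that execution time and therefore total energy shrink, lowering standby power to reduce total energy, and developing accelerators with good power efficiency for specific kinds of processing. On the software and algorithm side, changing the degree of parallelism or the number and granularity of memory accesses stimulates or restrains hardware activity, which can in some cases lower power or energy consumption. Building a mechanism that selects the most suitable hardware for a given computation is another example.

Many researchers have also studied power saving in the CREST research area 「情報システムの超低消費電力化を目指した技術革新と統合化技術」 (roughly, "Technological innovation and integration aimed at ultra low power information systems") [1]. The authors of this report joined the project "ULP-HPC: Ultra Low-Power, High Performance Computing via Modeling and Optimization of Next Generation HPC Technologies" (research director: Satoshi Matsuoka, professor at Tokyo Institute of Technology) and have mainly worked on approaches from the side of algorithms and programming environments. This report describes our group's activities toward ultra low power high performance computing (ULPHPC) and their results.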

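2. High Performance Computing and Power Saving

Computer simulation has become indispensable as the third science (the third method) of research and development, alongside experiment and theory, and there is demand for faster, larger, and more accurate simulations. Meeting that demand requires higher-performance hardware and software.

Power is a strong constraint on improving hardware performance. For example, building and operating a large-scale computer system such as a supercomputer requires a stable supply of power commensurate with its scale. The K computer [2] of the RIKEN Advanced Institute for Computational Science, which entered full operation in 2012, needs about 20 MW to run the whole system. For ExaFLOPS-class supercomputers (1 Exa = 1000 peta; the performance of the K computer is about 10 petaFLOPS), which are being studied worldwide with a target of around 2018, power is cited as one of the biggest barriers to realization. Scaling existing technology to reach ExaFLOPS would require an unrealistic amount of power, so technological innovation is needed. Personal computers and workstations likewise face strong demand for higher performance while their power consumption cannot be raised much, so higher power efficiency and better power-saving characteristics are required of them as well.

Software can also improve the power-saving characteristics of a computer system. Today, mainly in battery-powered devices such as notebook PCs, hardware such as the CPU and memory is put into a low-power standby state or partially powered off when the load is light. Making full use of these techniques requires support from programs. For example, on a multi-core CPU, when using all cores does not yield scalable speedup, using fewer cores to hold down power can reduce total energy and improve power efficiency. Arranging data movement during computation so that it stays within the cache as far as possible, thereby limiting accesses to memory and storage, also reduces power consumption. As a simpler example, if power consumption is assumed to be roughly constant, optimizing a program to shorten its execution time reduces total energy. Cooperation between hardware and software is therefore important for saving power and energy.

Using dedicated hardware that executes specific processing with high efficiency, in addition to a general-purpose CPU, is another common approach. In HPC in particular, the use of GPUs (GPGPU [3], GPU computing) is attracting much attention, and many applications have already been ported to GPUs. A GPU has far more compute cores than a CPU, and its memory performance is higher than that of an ordinary PC. For highly parallel problems it can therefore deliver much higher performance than a CPU. On the other hand, GPU cores are simpler and less general than CPU cores, so programs without sufficient parallelism cannot be executed quickly. A GPU does not run every application faster or more power-efficiently than a CPU, and the two must be used appropriately.

The GPUs currently used in HPC are mainly NVIDIA GPUs, chiefly the GeForce and Tesla series. Programs for these GPUs are often written in CUDA [4], an extension of C/C++. The CUDA language specification itself is not very unusual, but optimization requires advanced knowledge, so the cost of program optimization (the effort of learning and using it) cannot be ignored. For this reason OpenACC [5], which allows GPU programming to be done easily with directives, is also attracting attention. The use of GPUs is an extremely important theme in our project as well.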
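The constant-power argument and the multi-core example above can be written down as a short calculation. The following is only a minimal sketch: the type `run_config` and the functions `energy_joules` and `min_energy_index` are hypothetical names introduced here for illustration, and the sketch assumes that average power is roughly constant over a run, so that energy is simply power multiplied by time.

```c
#include <assert.h>
#include <stddef.h>

/* One candidate configuration (hypothetical type, not from the paper). */
typedef struct {
    int cores;       /* number of cores used */
    double power_w;  /* average power in watts */
    double time_s;   /* execution time in seconds */
} run_config;

/* Energy in joules, assuming power is roughly constant over the run. */
double energy_joules(double power_w, double time_s)
{
    return power_w * time_s;
}

/* Return the index of the configuration with the smallest energy,
   or -1 if the list is empty. Ties keep the earliest entry. */
int min_energy_index(const run_config *cfg, size_t n)
{
    assert(cfg != NULL || n == 0);
    if (n == 0) return -1;
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (energy_joules(cfg[i].power_w, cfg[i].time_s) <
            energy_joules(cfg[best].power_w, cfg[best].time_s)) {
            best = i;
        }
    }
    return (int)best;
}
```

With made-up values of 60 W for 10 s on 2 cores, 80 W for 6 s on 4 cores, and 120 W for 5 s on 8 cores, the 8-core run is the fastest but the 4-core run uses the least energy, which is the situation the text describes for programs that do not scale well.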

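3. Development of a Power Measurement Method and API

Saving power on real hardware requires a mechanism for measuring power consumption and referring to it from programs. However, while ordinary computer systems such as today's PCs can often report battery consumption and temperature, they have no mechanism for obtaining power consumption. Luo, Suda, and colleagues, who are among the authors of this report, have therefore studied a mechanism for measuring and referring to power on GPU-equipped PCs. Details are given in references [6], [7], so only an overview is given here.

The overall power consumption of general electrical equipment, including computer systems, can be measured with a so-called watt checker. The power of a whole PC can be measured the same way, but this method cannot measure smaller units such as the CPU or the GPU. Inside a PC, the power supply unit is connected to each component by DC power cables, so these cables can be used to measure power per component. A clamp-type power meter (clamp meter) can measure the power on an individual DC power cable, and some products allow a PC to read the values over USB.

Power is supplied to PCI-Express card GPUs, the usual form in which GPUs are provided, through two paths. One is a DC power cable from the power supply unit to a connector on the GPU card. The other is the PCI-Express bus that connects the GPU to the motherboard. Power on the former path can be measured with a clamp meter. A clamp meter cannot be attached to the latter path, so another method is needed. Luo, Suda, and colleagues extended the PCI-Express bus wiring with a riser card and modified the DC power lines of the extended part so that a clamp meter could be attached to them, which made the power measurable (Figure 1).

Figure 1: Example of hardware for power measurement

Furthermore, to keep the measurement itself from perturbing power consumption, Luo, Suda, and colleagues adopted a design in which the measured computer and the computer that performs the measurement are separate machines, and they packaged operations such as starting and stopping measurement and storing the measured data in a database as an API (Figure 2). With these, an application programmer can measure how much power any part of any program consumes and make use of the results.

Example of user API
  //marker kernel;
  int P_marker();
  //send start monitor request;
  int P_startMonitor();
  //send stop monitor request;
  int P_stopMonitor();
  //wait for data
  int P_waitData();
  //receive data
  int P_receveData(int sock_fd, struct record* rec);
  //store data into database
  int P_storeData(struct record* rec);
  //half-half benchmark
  int P_half_test(int time);
  //computing intensive benchmark
  int P_compute_test(int time);
  //memory access intensive benchmark
  int P_memory_test(int time);
  //return error information
  int P_getError();

Example of database access API
  //open database connection
  int dbOpen(char* dbName);
  //read records by task ID
  int dbReadByID(int ID);
  //read records by task name
  int dbReadByTaskName(char* name);
  //query all the user names in database
  int dbGetUser();
  //query all task information of user
  int dbGetTaskByUser(char* user);
  //delete records by task ID
  int dbDeleteByID(int ID);
  //delete records by task name
  int dbDeleteByTaskName(char* taskName);
  //delete records by user name
  int dbDeleteByUser(char* user);
  //close database connection
  int dbClose();

Figure 2: Developed API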
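Once power readings for a region of a program have been collected, they are typically reduced to a single energy value. The layout of `struct record` is not given in this report, so the sketch below does not use the API in Figure 2. It assumes instead that the readings have already been copied into a plain array of watt values taken at a fixed interval. The function names `energy_from_samples` and `average_power` are hypothetical, and the trapezoidal rule is one reasonable choice rather than the method used by the authors.

```c
#include <assert.h>
#include <stddef.h>

/* Integrate equally spaced power samples (watts) into energy (joules)
   with the trapezoidal rule. dt_s is the sampling interval in seconds.
   Returns 0.0 when fewer than two samples are available. */
double energy_from_samples(const double *watts, size_t n, double dt_s)
{
    assert(dt_s > 0.0);
    if (watts == NULL || n < 2) return 0.0;
    double sum = 0.0;
    for (size_t i = 1; i < n; i++) {
        sum += 0.5 * (watts[i - 1] + watts[i]) * dt_s;
    }
    return sum;
}

/* Average power in watts over the sampled window, which spans
   (n - 1) intervals of dt_s seconds each. */
double average_power(const double *watts, size_t n, double dt_s)
{
    assert(dt_s > 0.0);
    if (watts == NULL || n < 2) return 0.0;
    return energy_from_samples(watts, n, dt_s) / ((double)(n - 1) * dt_s);
}
```

In a setup like the one described above, the samples would come from the measuring machine after the stop request has been sent, and the resulting energy would be stored per task in the database.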

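4. Development of a Directive-Based Auto-Tuning Mechanism that Takes Power Information into Account

When vector computers were the mainstream, optimization by automatic parallelization was the principal technique. In recent years multi-core CPUs with hierarchical caches have become the mainstream, and accelerators such as GPUs have come into wide use, so the number of execution environments (hardware) and applications for which automatic parallelization does not give good performance is increasing. Obtaining maximum performance for an arbitrary execution environment and application, however, requires a high level of skill and effort, and from the standpoint of program portability and readability as well, demand for highly productive development methods and tools is growing. Program optimization using directives is therefore attracting attention.

In directive-based program optimization, directives are inserted into the program in the form of a kind of comment, and a corresponding language processor (a compiler, translator, or similar) optimizes the program (List 1). Because directives are generally inserted as a kind of comment, a compiler that does not support them can often still compile the program by ignoring them. Other reasons for the attention are that a program can be optimized without greatly changing its structure, and that directives are easy to apply to an existing program step by step.

List 1: Example of a program written with directives (loop parallelization with OpenMP)

  // In this example, the for loop immediately following the directive
  // (pragma omp parallel for) is executed in parallel by threads.
  // N and the arrays A, B, C are not declared in the original listing;
  // the declarations below are assumed here so that the example compiles.
  #define N 1000
  double A[N], B[N], C[N];

  int main(){
      int i;
  #pragma omp parallel for
      for(i=0; i<N; i++){
          C[i] = A[i] + B[i];
      }
      return 0;
  }

OpenMP is widely used for directive-based program optimization. OpenMP targets shared-memory parallel computers and is mainly used to parallelize and speed up loops. Among directive-based methods that support GPUs, OpenACC in particular is attracting attention. Research on making OpenMP work on GPUs includes work by the authors of this report (OMPCUDA [8]) and OpenMPC [9].

Katagiri and colleagues, who are among the authors of this report, have studied the description of auto-tuning with directives. ABCLibScript [10], which they developed, adds auto-tuning functions such as selection of the best algorithm to a target program when directives are written in it. Building on that work, this research is developing an auto-tuning mechanism that supports GPUs and takes power information into account. For GPU support, as shown in Figure 3, we aim to realize CPU+GPU optimization using power information, starting from code written for the CPU. At present we are studying and implementing optimization with hand-written CUDA programs and GPU porting with OMPCUDA, and we are also considering OpenACC in order to draw out more of the GPU's performance. For a program in which the corresponding directives are written, the mechanism outputs source code that includes the power-related API described in Section 3. As a result, application programmers can be expected to adopt power-efficient implementations without writing any power-measurement code themselves, let alone needing knowledge of power measurement.

Figure 3: Processing flow with ABCLibScript

5. Evaluation Experiment

This section shows an example of analyzing and optimizing a program with the techniques described in Sections 3 and 4. In the experiment, the main computational part of a stencil program that performs iterative computation was implemented for both the CPU and the GPU, and we checked whether the ABCLibScript directives with power support, which are under development, allow the program to be executed so that energy consumption is minimized. The computing environment used in the experiment is shown in Table 1. In this experiment a clamp meter was placed between the wall outlet and the experimental machine, and the power of the whole PC was measured with the aim of minimizing energy consumption (Figure 4).

Table 1: Experimental environment
  CPU          Xeon E5-2620 x2 (SandyBridge-E)
  Main memory  DDR3-1333 4GB x8
  GPU          Tesla K10 x4 (8 GPUs)
  Software     CentOS 6.2 x86_64, CUDA 5.0 RC

Figure 4: Configuration of the experimental environment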
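The stencil program used in the experiment is not listed in this report. To make the term concrete, the following is a generic sketch of an iterative stencil computation, assuming a one-dimensional three-point averaging stencil with fixed boundary values. The function names `stencil_sweep` and `stencil_iterate` are hypothetical, and the actual experimental code may differ in dimension, coefficients, and boundary handling.

```c
#include <assert.h>
#include <stddef.h>

/* One sweep of a 1D three-point averaging stencil.
   Boundary values are copied unchanged; in and out must not alias. */
void stencil_sweep(const double *in, double *out, size_t n)
{
    assert(in != out);
    if (n == 0) return;
    out[0] = in[0];
    out[n - 1] = in[n - 1];
    for (size_t i = 1; i + 1 < n; i++) {
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    }
}

/* Repeat the sweep nstep times, swapping the two buffers after each
   sweep. Returns a pointer to the buffer holding the final result. */
double *stencil_iterate(double *a, double *b, size_t n, int nstep)
{
    for (int s = 0; s < nstep; s++) {
        stencil_sweep(a, b, n);
        double *tmp = a;
        a = b;
        b = tmp;
    }
    return a;
}
```

The number of sweeps, `nstep` here, plays the role of the iteration count that is varied in the experiment. The work per sweep is fixed, so the iteration count directly controls how long the kernel runs on either device.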

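Figure 5: AT description

For the experimental program, the auto-tuning (AT) description written with directives is shown in Figure 5, and the main part of the automatically generated code is shown in List 2. In addition to the AT description above, the generated code also contains calls to ATMathCoreLib [11], which is used to stabilize power measurement. This program measures the execution time and power consumption for the CPU case and for the GPU case. In this experiment we compared which of the CPU and the GPU consumes less energy as the number of iterations of the iterative computation is changed.

The results are as follows. First, regarding execution time, the GPU's execution time grows more slowly with the number of iterations than the CPU's, so it tends not to lengthen much even when the number of iterations becomes large. Next, regarding power consumption, CPU power tends to increase as the number of iterations increases, whereas GPU power stays almost constant (flat). From these results the GPU was expected to be more advantageous (lower) in energy consumption when the number of iterations is large. When we checked the actual energy consumption, the order reversed at around 500 iterations: the CPU consumed less energy below 500 iterations, and the GPU consumed less energy at 500 iterations or more (Figure 6). These results confirm that the best execution hardware can be selected from the CPU and the GPU according to the number of iterations of the target problem.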
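The selection step described above amounts to comparing measured energies and, if desired, estimating where the two curves cross. The sketch below illustrates that idea only. The names `device`, `select_device`, and `crossover_iterations` are hypothetical, linear interpolation between neighbouring samples is an assumption made here, and the sketch is not the ABCLibScript or ATMathCoreLib implementation.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DEV_CPU = 0, DEV_GPU = 1 } device;

/* Pick the device with the lower measured energy for one problem size.
   Ties go to the CPU. */
device select_device(double cpu_energy, double gpu_energy)
{
    return (gpu_energy < cpu_energy) ? DEV_GPU : DEV_CPU;
}

/* Given energies measured at increasing iteration counts, estimate the
   iteration count at which the GPU becomes cheaper than the CPU, using
   linear interpolation between the two neighbouring samples.
   Returns -1.0 if no such crossover lies within the sampled range. */
double crossover_iterations(const double *iters, const double *cpu_e,
                            const double *gpu_e, size_t n)
{
    assert(iters != NULL && cpu_e != NULL && gpu_e != NULL);
    for (size_t i = 1; i < n; i++) {
        double d0 = cpu_e[i - 1] - gpu_e[i - 1]; /* <= 0: CPU not worse */
        double d1 = cpu_e[i] - gpu_e[i];         /* > 0: GPU cheaper */
        if (d0 <= 0.0 && d1 > 0.0) {
            double t = -d0 / (d1 - d0);
            return iters[i - 1] + t * (iters[i] - iters[i - 1]);
        }
    }
    return -1.0;
}
```

A tuner could call `select_device` for each sampled iteration count and use `crossover_iterations` to choose a threshold for problem sizes that were not measured. Any numbers fed to these functions in testing are illustrative and are not the measurements reported in Figure 6.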

List 2: Example of automatically generated code
  Lines beginning with "+": statements related to the power API
  Lines beginning with "=": statements related to ATMathCoreLib

  for( iloop_n = OAT_STARTTUNESIZE;
       iloop_n <= OAT_ENDTUNESIZE;
       iloop_n += OAT_SAMPDIST ){
    nstep = iloop_n;
    ...
  = exdesign_t exdes = new_exdesign(2, OAT_EPM_KAPPA);
  = for( iloop_idx=0;
         iloop_idx < OAT_EPM_MAXSAMP; iloop_idx++){
      double atmeter = 0.0;
  +   double avtmp = P_getTemp();
  =   int cand_idx = exdes_osa_atm(exdes, avtmp, &atmeter);
  =   if (cand_idx >= 2) break; // enough information obtained
      iusw1 = fpos[cand_idx];
      // --- power monitoring setup
  +   P_initial();
  +   P_start();
      t1 = OAT_Wtime();
      for(i=0; i<OAT_MAXREPEAT; i++) {
        OAT_InstallSelectPhase(
          mgn, nx, ny, n, dx, dy, delta, nstep, nout,
          calc_coef_a_w_pmobi, init, model, iusw1);
      }
      t2 = OAT_Wtime();
      t_all2 = (t2 - t1)/(double)OAT_MAXREPEAT;
      // --- power monitoring finalizing
  +   P_stop();
      // --- get power data
  +   num = P_recv(&rec);
  +   P_recv(&rec);
  +   P_close();
      if (num == 0) { // failed to get power
        ...
      }
      t_all = P_compPower(num, rec);
  =   update_exdes(exdes, cand_idx, t_all2, t_all, avtmp);
  =   iBestSw1 = getbest_exdes(exdes);
      assert(iBestSw1 >= 0);
      iBestSw1 = fpos[iBestSw1];
  =   del_exdesign(exdes);
      ...
    }
  }

6. Conclusion

This report introduced the work toward ultra low power high performance computing that the authors carried out in the CREST project "ULP-HPC: Ultra Low-Power, High Performance Computing via Modeling and Optimization of Next Generation HPC Technologies". It also showed an example of optimization using power information (minimization of energy consumption) with the mechanism we developed. For future improvements in computer system performance, power saving is a very important issue in every kind of system, from mobile devices to supercomputers. We plan to develop the work of this project further and continue to advance power saving in computer systems.

Acknowledgments  Part of this research was supported by "ULP-HPC: Ultra Low-Power, High Performance Computing via Modeling and Optimization of Next Generation HPC Technologies" (CREST research area 「情報システムの超低消費電力化を目指した技術革新と統合化技術」).

Figure 6: Experimental results: number of iterations and energy consumption (axes: Number of iterations, Energy; series: CPU, GPU)

References

[1] Japan Science and Technology Agency (JST), Strategic Basic Research Programs: CREST, 情報システムの超低消費電力化を目指した技術革新と統合化技術, http://www.ulp.jst.go.jp/index.html

[2] RIKEN AICS, K computer (京コンピュータ), http://www.aics.riken.jp/k/

[3] GPGPU.org, General-Purpose computation on Graphics Processing Units, http://gpgpu.org/

[4] NVIDIA, Developer Zone (CUDA ZONE), http://developer.nvidia.com/category/zone/cuda-zone

[5] OpenACC Home, http://www.openacc.org/

[6] Reiji Suda, Da Qi Ren: “Accurate Measurements and Precise Modeling of Power Dissipation of CUDA Kernels toward Power Optimized High Performance CPU-GPU Computing”, Proceedings of Workshop on Ultra Performance and Dependable Acceleration Systems (UPDAS), Hiroshima, pp.432-438 (2009).

[7] Cheng Luo, Kamil Rocki, Reiji Suda: “A precise measurement tool for power dissipation of CUDA kernels”, IPSJ SIG Technical Reports, Vol.2012-HPC-133 No.2, 133rd HPC research meeting, Arima View Hotel Urara, March 26-27 (26) (2012).

[8] Satoshi OHSHIMA, Shoichi HIRASAWA, Hiroki HONDA: “OMPCUDA : OpenMP Execution Framework for CUDA Based on Omni OpenMP Compiler”, 6th International Workshop on OpenMP, Epochal Tsukuba, June 14-16 (2010).

[9] Seyong Lee, Rudolf Eigenmann: “OpenMPC: Extended OpenMP Programming and Tuning for GPUs”, SC10: Proceedings of the 2010 ACM/IEEE conference on Supercomputing (2010).

[10] Takahiro Katagiri, Kenji Kise, Hiroki Honda, Toshitsugu Yuba: “ABCLibScript: A Directive to Support Specification of An Auto-tuning Facility for Numerical Software”, Parallel Computing, Vol.32, Issue 1, pp.92-112 (2006).

[11] Reiji Suda: “自動チューニング数理基盤ライブラリ ATMathCoreLib” (ATMathCoreLib, a mathematical core library for auto-tuning), IPSJ SIG Technical Reports, HPC-129-14 (2011).
