Research Activities for Ultra Low Power HPC

Satoshi OHSHIMA^1,a), Luo Cheng^2, Shoichi HIRASAWA^3, Takahiro KATAGIRI^1, Reiji SUDA^2, Hiroki HONDA^4

Abstract: Ultra low power high performance computing (ULPHPC), which aims to obtain high computing performance while reducing power consumption, is becoming increasingly important for realizing next-generation supercomputers as well as high-performance personal computers and workstations. Moreover, in light of the recent domestic power situation, demand is growing for technologies to build and operate power-saving computers, for example to reduce peak power and energy consumption. Our group has been working toward ULPHPC in the CREST project "ULP-HPC: Ultra Low-Power, High Performance Computing via Modeling and Optimization of Next Generation HPC Technologies" (principal investigator: Prof. Satoshi Matsuoka, Tokyo Institute of Technology). In this project, we have developed a power measurement method, created a power measurement API, and developed auto-tuning technologies that use power information and applied them to applications. This report describes our group's activities toward ULPHPC and their results.
1. Introduction
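Power-related issues are becoming increasingly important in the design and development of computer systems. For example, when a supercomputer is built or installed, power-related constraints such as generators and power-supply facilities are among the major limitations. In recent years, environmental concerns and the domestic power situation have also led to demands for reducing total power consumption and peak power, so low-power and low-energy technologies are gaining importance for computer systems of every size.

A great deal of research and development on power saving for computer systems is under way in various fields, both in Japan and abroad.

There are many approaches to power saving. On the hardware side, examples include developing devices that operate at low voltage, shortening execution time (and thus reducing total energy) by operating faster at the same power, reducing total energy by lowering standby power, and developing accelerators that are power-efficient for specific kinds of processing. On the software and algorithm side, changing the degree of parallelism or the number and granularity of memory accesses can promote or suppress hardware activity and thereby reduce power or energy consumption. Another example is building a mechanism that selects the most suitable hardware for the target processing.

In the CREST research area "Technological Innovation and Integration for Information Systems with Ultra Low Power Consumption" [1], many researchers have also carried out a variety of studies on power saving. The authors of this report participate in "ULP-HPC: Ultra Low-Power, High Performance Computing via Modeling and Optimization of Next Generation HPC Technologies" (principal investigator: Prof. Satoshi Matsuoka, Tokyo Institute of Technology) and have mainly taken approaches from the side of algorithms and programming environments. This report describes our research group's activities toward ultra low power high performance computing (ULPHPC) and their results.

1 Information Technology Center, The University of Tokyo
2 Graduate School of Information Science and Technology, The University of Tokyo
3 Graduate School of Information Sciences, Tohoku University
4 Graduate School of Information Systems, The University of Electro-Communications
a) [email protected]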
2. High Performance Computing and Power Saving
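Computer simulation has become indispensable as the third science (the third method) of research and development, alongside experiment and theory, and faster, larger, and more accurate simulations are in demand. This requires higher-performance hardware and software.

Power is a strong constraint on improving hardware performance. For example, building and operating a large-scale computer system such as a supercomputer requires a stable supply of power that matches its scale. The K computer [2] of the RIKEN Advanced Institute for Computational Science, which began official operation in 2012, needs about 20 MW to run the full system. For ExaFLOPS-class supercomputers (1 Exa = 1000 peta; the performance of the K computer is 10 petaFLOPS), which are being studied worldwide with a target of around 2018, power is likewise cited as one of the biggest barriers. Reaching ExaFLOPS by scaling existing technology would require an unrealistic amount of power, so technological innovation is needed. Personal computers and workstations face the same issue: the demand for higher performance is strong, but power consumption cannot be raised substantially, so higher power efficiency and better power saving are required.

The power-saving capability of a computer system can also be improved by software. Today, mainly in battery-powered devices such as notebook PCs, techniques are used that put hardware such as the CPU and memory into a low-power standby state under light load or that partially cut the power supply. Making full use of these techniques requires support from programs. For example, when using all cores of a multi-core CPU does not yield a scalable performance gain, using fewer cores to hold down power consumption can reduce total energy consumption and improve power efficiency. Keeping data movement within the cache during computation, and thereby reducing accesses to memory and storage, also helps reduce power consumption. As a simpler example, if power consumption is assumed to be roughly constant, then optimizing a program to shorten its execution time also reduces total energy consumption. Cooperation between hardware and software is therefore important for saving power and energy.

In addition to general-purpose CPUs, dedicated hardware that executes specific processing with high efficiency is widely used. In the HPC field in particular, the use of GPUs (GPGPU [3], GPU computing) is attracting much attention, and many applications have already been ported to GPUs. Compared with a CPU, a GPU carries a large number of compute cores, and its memory performance is higher than that of a typical PC. For highly parallel computations it can therefore deliver far higher performance than a CPU. On the other hand, GPU cores are simpler and less general-purpose than CPU cores, so programs without sufficient parallelism cannot be executed fast. A GPU thus does not run every application faster or more power-efficiently than a CPU, and the two must be used appropriately.

The GPUs currently used in the HPC field are mainly NVIDIA GPUs, chiefly the GeForce and Tesla series. Programs for these GPUs are commonly written in CUDA [4], an extension of C/C++. The CUDA language specification itself is not very unusual, but optimization requires advanced knowledge, so the cost of program optimization (the effort of learning and using it) cannot be ignored. For this reason, OpenACC [5], which makes GPU programming easy through directives, is also attracting growing attention. The use of GPUs is an extremely important theme in our project as well.

3. Development of a Power Measurement Method and API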
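Saving power on real hardware requires a mechanism for measuring power consumption and referring to it from programs. However, on the typical computer systems in use today, such as PCs, battery consumption and temperature information can often be obtained, whereas no mechanism for obtaining power consumption is provided. Luo and Suda, who are among the authors of this report, have therefore studied a mechanism for measuring and referring to power on PCs equipped with GPUs. The details are given in references [6] and [7], so only an overview is presented here.

The overall power consumption of general electrical equipment, including computer systems, can be measured with a device commonly called a watt checker. The power consumption of a whole PC can be measured in the same way, but this method cannot measure smaller units such as the CPU or GPU. Power is supplied to each PC component through DC power cables from the power supply unit, so these cables can be used to measure power consumption per component. Clamp-type power meters (clamp meters) can measure power consumption per DC power cable, and some models allow a PC to read the values over USB.

A GPU on a PCI-Express card, the most common form in which GPUs are provided, receives power through two routes. One is a DC power cable connected from the power supply unit to a connector on the GPU card; the other is the PCI-Express bus that connects the GPU to the motherboard. Power on the former route can be measured with a clamp meter. A clamp meter cannot be attached to the latter route, so another method is needed. Luo and Suda extended the PCI-Express bus wiring with a riser card and modified the DC power lines in the extended section so that a clamp meter could be attached, which made it possible to measure this power as well (Figure 1).

Figure 1: Example of hardware for power measurement

Furthermore, to prevent the measurement itself from perturbing power consumption, Luo and Suda adopted a design that uses a separate computer for measurement in addition to the computer being measured, and packaged operations such as starting and stopping measurement and storing the measured data in a database as an API (Figure 2). With these, an application programmer can measure how much power a given part of any program consumes and make use of the results.

Example of the user API

    // marker kernel
    int P_marker();
    // send start monitor request
    int P_startMonitor();
    // send stop monitor request
    int P_stopMonitor();
    // wait for data
    int P_waitData();
    // receive data
    int P_receveData(int sock_fd, struct record* rec);
    // store data into database
    int P_storeData(struct record* rec);
    // half-half benchmark
    int P_half_test(int time);
    // computing intensive benchmark
    int P_compute_test(int time);
    // memory access intensive benchmark
    int P_memory_test(int time);
    // return error information
    int P_getError();

Example of the database access API

    // open database connection
    int dbOpen(char* dbName);
    // read records by task ID
    int dbReadByID(int ID);
    // read records by task name
    int dbReadByTaskName(char* name);
    // query all the user names in database
    int dbGetUser();
    // query all task information of user
    int dbGetTaskByUser(char* user);
    // delete records by task ID
    int dbDeleteByID(int ID);
    // delete records by task name
    int dbDeleteByTaskName(char* taskName);
    // delete records by user name
    int dbDeleteByUser(char* user);
    // close database connection
    int dbClose();

Figure 2: The developed API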
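As a rough illustration of how measured values might be used, the sketch below turns a series of timestamped power readings into energy with the trapezoidal rule. It is not part of the API described above: the `power_sample` type and the `energy_joules` function are hypothetical names introduced only for this example, and the layout of the API's `struct record` is not given in this report, so it is not used here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sample type (not the API's struct record):
 * one power reading in watts taken at time t_sec (seconds). */
struct power_sample {
    double t_sec;
    double watts;
};

/* Integrate power over time with the trapezoidal rule.
 * Samples must be sorted by time. Returns energy in joules;
 * fewer than two samples give 0. */
double energy_joules(const struct power_sample *s, size_t n)
{
    double e = 0.0;
    for (size_t i = 1; i < n; i++) {
        double dt = s[i].t_sec - s[i - 1].t_sec;
        assert(dt >= 0.0); /* time must not go backwards */
        e += 0.5 * (s[i].watts + s[i - 1].watts) * dt;
    }
    return e;
}
```

In practice the samples would come from the readings collected between the start and stop of a measurement; a finer sampling interval gives a closer approximation of the consumed energy.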
4. Development of a Directive-Based Auto-Tuning Mechanism that Considers Power Information
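In the past, when vector computers were mainstream, optimization by automatic parallelization was a major technique. Today, multi-core CPUs with hierarchical caches are mainstream and accelerators such as GPUs are widely used, so there are more and more execution environments (hardware) and applications for which automatic parallelization does not give good performance. Obtaining maximum performance for an arbitrary execution environment and application, however, takes a high level of skill and effort, and demand for highly productive development methods and tools is growing, also from the viewpoint of program portability and readability. Program optimization using directives is therefore attracting increasing attention.

Listing 1: Example of a program written with a directive (loop parallelization with OpenMP)

    // In this example, the for loop immediately after the directive
    // (pragma omp parallel for) is executed in parallel by threads.
    #define N 1024
    double A[N], B[N], C[N];

    int main(void) {
        int i;
    #pragma omp parallel for
        for (i = 0; i < N; i++) {
            C[i] = A[i] + B[i];
        }
        return 0;
    }

In directive-based program optimization, directives are inserted into a program in the form of a kind of comment, and the program is optimized by a processing system that supports them, such as a compiler or translator (Listing 1). Because directives are generally inserted as a kind of comment, a compiler that does not support them can usually still compile the program by ignoring them. Other reasons for the attention paid to this approach are that a program can be optimized without large changes to its structure and that it is easy to apply step by step to an existing program.

OpenMP is widely used as a directive-based optimization method. OpenMP targets shared-memory parallel computers and is mainly used to parallelize and speed up loops. Among directive-based methods that support GPUs, OpenACC in particular is attracting growing attention. Research on making OpenMP work on GPUs includes work by the authors of this report (OMPCUDA [8]) and OpenMPC [9].

Figure 3: Processing flow with ABCLibScript

Table 1: Experimental environment

    CPU          Xeon E5-2620 x2 (SandyBridge-E)
    Main memory  DDR3-1333 4GB x8
    GPU          Tesla K10 x4 (8 GPUs)
    Software     CentOS 6.2 x86_64, CUDA 5.0 RC

Figure 4: Configuration of the experimental environment

Katagiri, one of the authors of this report, and his colleagues have studied the description of auto-tuning with directives. ABCLibScript [10], which they developed, adds auto-tuning functions such as selection of the best algorithm to a target program when directives are written in it. Building on this existing work, we are developing an auto-tuning mechanism that supports GPUs and takes power information into account. For GPU support, as shown in Figure 3, we aim to start from CPU code and achieve CPU+GPU optimization using power information. At present we are studying and implementing optimization with hand-written CUDA programs and GPU porting with OMPCUDA, and we are also considering OpenACC in order to draw out more GPU performance. For a program in which the supported directives are written, the mechanism outputs source code that includes the power-related API described in Section 3. Application programmers are therefore expected to be able to adopt power-efficient implementations without writing any power-measurement code themselves, let alone needing knowledge of power measurement.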
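The idea of selecting the best implementation by auto-tuning can be sketched in a few lines: run each candidate, record a cost such as execution time or measured energy, and keep the cheapest. The code below is only an illustration under that assumption and is not the code that ABCLibScript generates; `cost_fn`, `select_best`, and `table_cost` are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: runs candidate `idx` once and returns its
 * measured cost (for example seconds or joules). */
typedef double (*cost_fn)(size_t idx, void *ctx);

/* Evaluate every candidate once and return the index of the cheapest.
 * Ties keep the lowest index. */
size_t select_best(size_t n_candidates, cost_fn measure, void *ctx)
{
    assert(n_candidates > 0 && measure != NULL);
    size_t best = 0;
    double best_cost = measure(0, ctx);
    for (size_t i = 1; i < n_candidates; i++) {
        double c = measure(i, ctx);
        if (c < best_cost) {
            best_cost = c;
            best = i;
        }
    }
    return best;
}

/* Demonstration callback: looks the cost up in a table
 * instead of running real code. */
double table_cost(size_t idx, void *ctx)
{
    const double *table = (const double *)ctx;
    return table[idx];
}
```

Measuring every candidate exhaustively is the simplest strategy; when a single measurement is expensive or noisy, a tuner may instead sample candidates selectively, which this sketch does not attempt.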
5. Evaluation Experiment
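This section shows an example of analyzing and optimizing a program with the techniques described in Sections 3 and 4. In the experiment, the main computation of an iterative stencil program was implemented for both the CPU and the GPU, and we checked whether the ABCLibScript directives with power-information support, currently under development, could execute the program so that energy consumption is minimized. The computing environment used is shown in Table 1. In this experiment a clamp meter was placed between the wall outlet and the experimental machine, and the power of the whole PC was measured with the aim of minimizing energy consumption (Figure 4).

For the experimental program, Figure 5 shows the AT description written with directives, and Listing 2 shows the main part of the automatically generated code. In addition to the AT description, the generated code contains calls to ATMathCoreLib [11], which is used to stabilize the power measurement. The program measures execution time and power consumption for the CPU version and the GPU version. In this experiment we varied the number of iterations and compared which of the CPU and the GPU consumed less energy.

Figure 5: AT description

The results are as follows. Regarding execution time versus the number of iterations, the execution time of the GPU grew more slowly than that of the CPU and tended not to lengthen much as the number of iterations increased. Regarding power consumption, the power of the CPU tended to increase with the number of iterations, whereas that of the GPU stayed almost constant (flat). From these results the GPU was expected to be more advantageous (lower) in energy consumption when the number of iterations is large. Checking the actual energy consumption showed that the two curves cross at around 500 iterations: below 500 iterations the CPU consumed less energy, and at 500 iterations or more the GPU consumed less (Figure 6). These results confirm that the best execution hardware can be selected from the CPU and the GPU according to the number of iterations of the target problem.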
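The crossover described above can be illustrated with a deliberately simplified model. The sketch below assumes that the energy of each device grows linearly with the number of iterations, E(n) = fixed + per_iter * n; this linear form is an assumption made here for illustration and is not a model fitted in this report. The names `energy_model`, `model_energy`, `crossover_iterations`, and `prefer_gpu` are hypothetical.

```c
#include <assert.h>

/* Hypothetical linear energy model: E(n) = fixed + per_iter * n. */
struct energy_model {
    double fixed;    /* energy independent of the iteration count */
    double per_iter; /* additional energy per iteration */
};

double model_energy(struct energy_model m, double n_iter)
{
    return m.fixed + m.per_iter * n_iter;
}

/* Iteration count at which both models predict the same energy.
 * A negative result means there is no crossover for n > 0
 * (parallel lines return -1). */
double crossover_iterations(struct energy_model cpu, struct energy_model gpu)
{
    double slope_diff = cpu.per_iter - gpu.per_iter;
    if (slope_diff == 0.0) {
        return -1.0;
    }
    return (gpu.fixed - cpu.fixed) / slope_diff;
}

/* Returns 1 if the GPU model predicts energy no higher than the CPU model. */
int prefer_gpu(struct energy_model cpu, struct energy_model gpu, double n_iter)
{
    return model_energy(gpu, n_iter) <= model_energy(cpu, n_iter);
}
```

For example, with invented coefficients of 0 + 0.5n for the CPU and 125 + 0.25n for the GPU (chosen only so that the lines meet at n = 500, not taken from the measurements), the CPU is preferred below 500 iterations and the GPU from 500 iterations on.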
6. Conclusion
This report introduced the authors' activities toward ultra low power high performance computing in the CREST project "ULP-HPC: Ultra Low-Power, High Performance Computing via Modeling and Optimization of Next Generation HPC Technologies". It also showed an example of optimization using power information (minimization of energy consumption) with the mechanism we developed. In improving the performance of future computer systems, power saving is a very important issue for every kind of system, from mobile devices to supercomputers. We plan to develop the work done in this project further and continue to advance power saving in computer systems.

Acknowledgments: Part of this research is supported by "ULP-HPC: Ultra Low-Power, High Performance Computing via Modeling and Optimization of Next Generation HPC Technologies" (CREST research area "Technological Innovation and Integration for Information Systems with Ultra Low Power Consumption").

Listing 2: Example of automatically generated code
Lines starting with +: code related to the power API
Lines starting with =: code related to ATMathCoreLib

      for (iloop_n = OAT_STARTTUNESIZE;
           iloop_n <= OAT_ENDTUNESIZE;
           iloop_n += OAT_SAMPDIST) {
        nstep = iloop_n;
        ...
    =   exdesign_t exdes = new_exdesign(2, OAT_EPM_KAPPA);
    =   for (iloop_idx = 0;
             iloop_idx < OAT_EPM_MAXSAMP; iloop_idx++) {
          double atmeter = 0.0;
    +     double avtmp = P_getTemp();
    =     int cand_idx = exdes_osa_atm(exdes, avtmp, &atmeter);
    =     if (cand_idx >= 2) break;
          // enough information obtained
          iusw1 = fpos[cand_idx];
          // --- power monitoring setup
    +     P_initial(); P_start();
          t1 = OAT_Wtime();
          for (i = 0; i < OAT_MAXREPEAT; i++) {
            OAT_InstallSelectPhase(
              mgn, nx, ny, n, dx, dy, delta, nstep, nout,
              calc_coef_a_w_pmobi, init, model, iusw1);
          }
          t2 = OAT_Wtime();
          t_all2 = (t2 - t1) / (double)OAT_MAXREPEAT;
          // --- power monitoring finalizing
    +     P_stop();
          // --- get power data
    +     num = P_recv(&rec);
    +     P_recv(&rec);
    +     P_close();
          if (num == 0) { // failed to get power
            ...
          }
          t_all = P_compPower(num, rec);
    =     update_exdes(exdes, cand_idx, t_all2, t_all, avtmp);
    =     iBestSw1 = getbest_exdes(exdes);
          assert(iBestSw1 >= 0);
          iBestSw1 = fpos[iBestSw1];
    =     del_exdesign(exdes);
          ...
        }
      }
Figure 6: Experimental result: number of iterations and energy consumption (x-axis: Number of iterations, 100 to 1000; y-axis: Energy, 20 to 200; series: CPU, GPU)

References

[1] Japan Science and Technology Agency (JST), Strategic Basic Research Programs (CREST): "Technological Innovation and Integration for Information Systems with Ultra Low Power Consumption", http://www.ulp.jst.go.jp/index.html
[2] RIKEN AICS, K computer, http://www.aics.riken.jp/k/
[3] GPGPU.org, General-Purpose computation on Graphics Processing Units, http://gpgpu.org/
[4] NVIDIA, Developer Zone (CUDA ZONE), http://developer.nvidia.com/category/zone/cuda-zone
[5] OpenACC Home, http://www.openacc.org/
[6] Reiji Suda, Da Qi Ren: "Accurate Measurements and Precise Modeling of Power Dissipation of CUDA Kernels toward Power Optimized High Performance CPU-GPU Computing", Proceedings of Workshop on Ultra Performance and Dependable Acceleration Systems (UPDAS), Hiroshima, pp.432-438 (2009).
[7] Cheng Luo, Kamil Rocki, Reiji Suda: "A precise measurement tool for power dissipation of CUDA kernels", IPSJ SIG Technical Reports, Vol.2012-HPC-133 No.2, 133rd HPC research meeting, Arima View Hotel Urara, March 26-27 (26) (2012).
[8] Satoshi OHSHIMA, Shoichi HIRASAWA, Hiroki HONDA: "OMPCUDA : OpenMP Execution Framework for CUDA Based on Omni OpenMP Compiler", 6th International Workshop on OpenMP, Epochal Tsukuba, June 14-16 (2010).
[9] Seyong Lee, Rudolf Eigenmann: "OpenMPC: Extended OpenMP Programming and Tuning for GPUs", SC10: Proceedings of the 2010 ACM/IEEE conference on Supercomputing (2010).
[10] Takahiro Katagiri, Kenji Kise, Hiroki Honda, Toshitsugu Yuba: "ABCLibScript: A Directive to Support Specification of An Auto-tuning Facility for Numerical Software", Parallel Computing, Vol.32, Issue 1, pp.92-112 (2006).
[11] Reiji Suda: "ATMathCoreLib: A Mathematical Foundation Library for Auto-Tuning" (in Japanese), IPSJ SIG Technical Reports, HPC-129-14 (2011).