طଘͷฒྻԽख๏Λ༻͍ͨ
GPGPU
ϓϩάϥϛϯά
େ ౡ ૱ ࢙
†ฏ ᖒ ক Ұ
†ຊ ଟ ߂ थ
†GPU ͷੑೳ্ʹ͍ɼGPU ͷੑೳΛ༷ʑͳ༻్ʹ׆༻͢Δ GPGPU ͕͞Ε͍ͯΔɽGPGPU ಛʹฒྻϓϩάϥϜʹ͓͍ͯ CPU Λ͑Δߴ͍ੑೳ͕ظ͞ΕΔҰํɼGPGPU ϓϩάϥϛϯ άಛ༗ͷख๏Λ༻͍Δඞཁ͕͋ΔͨΊιϑτΣΞͷ࡞͕༰қͰͳ͍ɽຊߘͰ GPGPU ϓϩ άϥϛϯάΛ༰қʹ͢Δ̍ͭͷख๏ͱͯ͠طଘͷฒྻԽख๏Λར༻͢Δ͜ͱΛఏҊ͢Δɽ·ͨ۩ମత ͳ࣮ʹ͚ͯɼۙར༻͞Ε࢝Ίͨ GPGPU ͚ϓϩάϥϛϯάݴޠ CUDA Λར༻͠ɼGPU ্ ͷॲཧΛطଘͷฒྻԽख๏Ͱ͋Δ SIMD ໋ྩɼOpenMPɼMPI Λ༻͍ͯهड़͢Δํ๏Λݕ౼͢Δɽ
GPGPU Programming
Using Existing Parallelizing Method
Satoshi OHSHIMA,
†Shoichi HIRASAWA
†and Hiroki HONDA
† GPGPU utilizing GPU’s performance for general-purpose computation is attracting much attention. GPGPU is expected to effect higher performance than CPU. However, creating GPGPU programming is not easy because programming methods peculiar to GPGPU pro-gramming are needed. In this paper, we propose to use existing parallellizing method as one of a new method making GPGPU programming easier. Also we consider writing programs running on GPUs with SIMD instruction, OpenMP and MPI based on the new GPGPU programming language of CUDA.1. ͡ Ί ʹ
ۙɼߴͳը૾ॲཧͷཁٻʹ͍GPU(Graphics
Processing Unit) ͷੑೳ͕ஶ্͍ͯ͘͠͠Δ1)ɽ
GPU CPU(Central Processing Unit) ͱ ൺ ͯฒྻॲཧϕΫτϧॲཧʹదͨ͠ϋʔυΣΞ ߏ Ͱ ͋ Δ ͜ ͱ ͔ Β ɼGPU Λ ൚ ༻ ԋ ࢉ ʹ ར ༻ ͢ Δ GPGPU(General-Purpose computation using GPUs)2)ͷ͕ߴ·͍ͬͯΔɽ ࠓͰίϯγϡʔϚPCͷଟ͘ʹGPU͕ࡌ͞ Ε͍ͯΔͷͱൺֱ͢ΔͱɼGPGPUͷ׆༻ࣄྫݶΒ Ε͍ͯΔɽͦͷେ͖ͳཧ༝ͷ̍ͭʹGPGPUϓϩά ϥϛϯάͷ͠͞ɼ͢ͳΘͪGPUฒྻॲཧʹద͠ ͨϋʔυΣΞͰ͋Γͳ͕ΒطଘͷฒྻԽख๏Λ༰қ ʹར༻Ͱ͖ͳ͍͜ͱ͕ڍ͛ΒΕΔɽैདྷͷGPGPU ϓϩάϥϛϯάʹ͓͍ͯɼάϥϑΟοΫεAPIͱ γΣʔμݴޠΛ༻͍ͨάϥϑΟοΫεϓϩάϥϛϯά ͷٕज़͕ඞཁͱ͞Ε͖ͯͨɽ͜͏ͨ͠ϓϩάϥϜ࡞ ख๏GPGPUϓϩάϥϛϯάʹಛ༗ͷͷͰ͋Δ † ిؾ௨৴େֶ େֶ ӃใγεςϜֶ ݚڀՊ
Graduate School of Information Systems, The Univer-sity of Electro-Communications ͨΊɼଞͷͷϢʔβʹͱͬͯGPGPUͷ׆༻ ༰қͰͳ͍ɽݱࡏͰCUDA3)ͷΑ͏ʹGPGPU ϓϩάϥϛϯάͷಛघੑΛӅณ͢Δϓϩάϥϛϯάݴ ޠͳͲొ͍ͯ͠Δ͕ɼಛఆͷGPUͰͷΈར༻Մ ೳͳ͏͑ʹɼGPUΞʔΩςΫνϟͷཧղͱ৽͍͠ݴ ޠͷशಘ͕ඞཁͰ͋Δɽ ͦ ͜ Ͱ ຊ ߘ Ͱ ɼط ଘ ͷ ฒ ྻ Խ ख ๏ Λ ༻ ͍ ͨ GPGPUϓϩάϥϛϯάΛఏҊ͢ΔɽGPUฒྻ ੑͷߴ͍ϓϩάϥϜͷ࣮ߦʹద͍ͯ͠ΔͨΊɼطଘͷ ฒྻԽख๏Λ༰қʹGPUద༻͢Δखஈ͕͋Εɼ GPUΛฒྻϓϩάϥϜͷ࣮ߦڥͱͯ͠༗ޮ׆༻Մ ೳͱͳΔ͜ͱ͕ظͰ͖Δɽ ຊߘͷߏΛҎԼʹࣔ͢ɽ2ষͰຊݚڀͷഎܠʹ ͋ͨΔGPUͱGPGPUʹ͍ͭͯ؆୯ʹઆ໌͠ɼݱঢ় ͷΛ໌Β͔ʹ͢Δɽ3ষͰղܾҊͱͯ͠طଘ ͷฒྻԽख๏Λར༻ͨ͠GPGPUϓϩάϥϛϯάΛఏ Ҋ͢Δɽ4ষͰఏҊʹର͢Δ࣮ྫͱͯ͠ɼCUDA Λ༻͍ͯGPU͚ʹطଘͷฒྻԽख๏Λ࣮͢Δํ ๏Λݕ౼͢Δɽ5ষͰؔ࿈ݚڀʹ৮Εɼ6ষ·ͱ Ίͱࠓޙͷ՝ͷষͱ͢Δɽ
2. GPU ͷΞʔΩςΫνϟͱطଘͷ GPGPU
ϓϩάϥϛϯάख๏
GPUຊདྷɼߴʹը૾ඳըΛߦ͏ͨΊʹൃల͠ ͨϋʔυΣΞͰ͋ΔɽݱࡏͰಈը࠶ੜࢧԉͳͲͷ ॲཧ౷߹͞ΕΔʹ͋ΔͨΊɼੑೳͷࠩҟେ͖ ͍ͷͷɼҰൠతͳPCͷଟ͘ʹࡌ͞Ε͍ͯΔɽ ਤ1ʹ౷తͳGPUͷϋʔυΣΞߏͱը૾ ඳըͷͨΊͷओͳॲཧͷରԠΛࣔ͢ɽGPU͕ߦ͏ը ૾ඳըͷͨΊͷओͳॲཧɼฒྻܭࢉϕΫτϧܭࢉ ʹΑΔߴԽʹద͍ͯ͠ΔɽͦͷͨΊGPU্ͷॲཧ ϢχοτฒྻԽ͕ਐΜͰ͓ΓɼϕΫτϧܭࢉʹద͠ ͨ෦ߏͱͳ͍ͬͯΔɽ·ͨߴͳը૾දݱʹඳ ը༰ʹԠͨ͡ෳࡶͳܭࢉ͕ඞཁͳͨΊɼॲཧϢχο τͷϓϩάϥϚϒϧԽ͕ਐΜͰ͍ΔɽGPUΛ༻͍ͨ ϓϩάϥϜΛ࡞͢ΔʹɼDirectXOpenGLͱ ͍ͬͨάϥϑΟοΫεAPIΛ༻͍ͯGPUͷಈ࡞λΠ ϛϯά੍ޚCPU-GPUؒͷσʔλ௨৴ͳͲΛߦ͍ɼ HLSLɼGLSLɼCgͱ͍ͬͨγΣʔμݴޠΛ༻͍ͯܭ ࢉϢχοτͷߦ͏ॲཧΛهड़͢Δඞཁ͕͋Δɽ ਤ1 ౷తͳ GPU ͷϋʔυΣΞߏͱը૾ඳըॲཧͷରԠFig. 1 Hardware construction of traditional GPU and relationship with graphics processing
GPGPUɼGPUͷಛΛར༻༷ͯ͠ʑͳॲཧ(൚ ༻ܭࢉɼGeneral-Purpose computation)Λߦ͏ͷ Ͱ͋ΔɽGPUͷϋʔυΣΞੑೳΛ׆͔ͤΔ༻్ɼ͢ ͳΘͪฒྻܭࢉϕΫτϧܭࢉʹదͨ͠ରʹର ͯ͠ɼCPUΛѹ͢ΔԋࢉੑೳΛಘΔ͜ͱ͕Ͱ͖ Δɽ͔͠͠ͳ͕ΒɼGPGPUΞϓϦέʔγϣϯͷ࣮ ༰қͰͳ͍ɽྫ͑GPUΛ༻͍ͯܭࢉΛߦ ͏ͨΊʹɼܭࢉΞϧΰϦζϜΛը૾ඳըͷγε ςϜʹରԠ࣮ͤͯ͢͞Δඞཁ͕͋Δ(ਤ2)ɽͦͷͨ ΊʹܭࢉϓϩάϥϛϯάͱάϥϑΟοΫεϓϩ άϥϛϯάͷ྆ํʹਫ਼௨͍ͯ͠Δඞཁ͕͋Γɼߴ͍ੑ ೳΛಘΔͨΊʹฒྻԽख๏GPUΞʔΩςΫνϟ ʹ͍ͭͯͷࣝॏཁͰ͋ΔɽݱࡏͰCUDAͳͲ άϥϑΟοΫεϓϩάϥϛϯάͷࣝΛඞཁͱ͠ͳ͍ GPGPUϓϩάϥϛϯάͷͨΊͷݴޠϥΠϒϥϦ ͋Δ͕ɼ͜ΕΒΛ༻͍ͯΞϓϦέʔγϣϯΛ࡞͢ ΔʹGPUΞʔΩςΫνϟͷ͕ࣝඞཁͰ͋ͬͨΓɼ طଘͷϓϩάϥϛϯάख๏ͱҟͳΔख๏Λशಘ͢Δ ඞཁ͕͋ͬͨΓ͢Δɽ
ࠓͰMAC OS XͷAqua/QuartzWindows VistaͷAeroGlassΛ͡Ίͱͯ͠OSσεΫτο ϓڥͷϨϕϧͰ͋ΔఔߴੑೳͳGPUΛཁٻ͢Δ έʔε͕૿Ճ͍ͯ͠Δ4)∼7)ɽߋʹಈը࠶ੜࢧԉػೳ Λ࣋ͭGPU(ϏσΦΧʔυ)૿Ճ͍ͯ͠ΔͨΊɼί ϯγϡʔϚ͚PCͷଟ͘ʹGPU͕ࡌ͞Ε͍ͯΔɽ GPUͷ࣋ͭཧԋࢉੑೳͷߴ͞Λߟྀ͢ΔͱɼGPU ͷ࣋ͭશͯͷੑೳΛ׆༻͢Δ͜ͱͰ͖ͳͯ͘ɼ͋ Δఔͷੑೳ͕׆༻Ͱ͖Ε༷ʑͳΞϓϦέʔγϣϯ ͷߴԽ͕ظͰ͖Δɽ͔͠͠ͳ͕Βɼݱࡏը૾ॲཧ Ҏ֎ͷʹ͓͍ͯGPGPU͕׆༻͞Ε͍ͯΔͷ Ұ෦ͷܭࢉՊֶٕज़ܭࢉʹݶΒΕ͍ͯΔɽͦͷ ࠷ͨΔཧ༝ͱͯ͠ɼGPUΛ׆༻͢ΔϓϩάϥϜͷ࡞ ͕༰қͰͳ͍͜ͱ͕ڍ͛ΒΕΔɽGPGPUʹৄ͘͠ ͳ͍ଟ͘ͷΞϓϦέʔγϣϯϓϩάϥϚʹͱͬͯɼ GPGPU͕༰қʹར༻Ͱ͖Δ͜ͱ͕ॏཁͰ͋Δɽ͜Ε ͪΖΜίϯγϡʔϚPCʹݶͬͨͰͳ͘ɼಛ ʹߴੑೳͳܭࢉڥΛٻΊΔHPC༻్Ͱڞ௨ͷ՝ Ͱ͋Δɽ
3. ఏ Ҋ ༰
3.1 طଘͷฒྻԽख๏Λར༻ͨ͠GPGPUϓϩ άϥϛϯάͷఏҊ GPUCPUͱൺͯߴ͍ϋʔυΣΞϨϕϧͷฒ ྻੑΛඋ͍͑ͯΔͨΊɼGPUʹΑΔԋࢉੑೳͷ্ ͕ظ͞Ε͍ͯΔͷͬͺΒߴ͍ฒྻੑΛ࣋ͭΞϓ ϦέʔγϣϯͰ͋ΔɽҰํͰฒྻԽSIMDɼSMPɼਤ2 ඳըॲཧΛར༻ͯ͠ܭࢉॲཧΛߦ͏ϓϩάϥϜͷྫ
Fig. 2 Example of numerical culculation programming using graphics programming
ਤ3 GeForce8000 γϦʔζͷΞʔΩςΫνϟ
Fig. 3 Architecture of GeForce8000 series
PCΫϥελɼGridͳͲطʹ༷ʑͳݚڀ͕ਐΊΒΕͯ ͍ΔςʔϚͰ͋ΓɼݱࡏϚϧνίΞCPUͷීٴͳ ͲʹΑΓ͞Ε͍ͯΔɽطଘͷฒྻԽख๏Λ༻͍ͯ GPGPUϓϩάϥϛϯάΛߦ͑ΔΑ͏ʹ͢Δ͜ͱ͕Ͱ ͖ΕɼGPUΛطଘͷฒྻڥʹΑΓ͍ۙͷͱ͠ ͯѻ͑ΔΑ͏ʹͳΔͱߟ͑ΒΕΔɽGPU͕ۙͰ ͍͍͢ͷͱͳΔ͜ͱͰɼଟ͘ͷϢʔβ͕GPUͷ ࣋ͭߴ͍ܭࢉੑೳΛ༗ޮʹ׆༻Ͱ͖ΔΑ͏ʹͳΔ͜ͱ ͕ظͰ͖Δɽ ͦ͜ͰɼGPGPUϓϩάϥϛϯάʹ͓͍ͯطଘͷฒ ྻԽख๏Λར༻͢Δ͜ͱΛఏҊ͢ΔɽຊষͷΓͷ෦ Ͱɼ࠷৽GPUͷϋʔυΣΞߏΛ֬ೝͨ͠͏ ͑Ͱɼطଘͷฒྻϓϩάϥϛϯάʹ͓͍ͯฒྻԽର ͷཻʹԠ༷ͯ͡ʑͳฒྻԽख๏͕༻͍ΒΕ͍ͯΔ ͜ͱΛߟྀ͠ɼGPUͷϋʔυΣΞߏͱฒྻԽख ๏ͱͷରԠ͚Λݕ౼͢Δɽ·ͨ࣍ষͰຊఏҊʹର ͢Δ࣮ͷྫͱͯ͠ɼCUDA͚ʹطଘͷฒྻԽख ๏Λ࣮͢Δ͜ͱΛݕ౼͢Δɽ 3.2 GPUʹର͢ΔطଘͷฒྻԽख๏ͷׂΓͯ ͷݕ౼ ࠷৽GPUͷϋʔυΣΞߏΛ֓؍͠ɼطଘͷฒ ྻԽख๏ΛGPUʹద༻͢ΔʹͲ͏͢ΕΑ͍͔ɼ طଘͷฒྻԽख๏Ͱ༻͍Δ࣮ߦϞσϧͱGPUϋʔ υΣΞΛͲͷΑ͏ʹؔ࿈͚Δ͔ʹ͍ͭͯͷݕ౼Λ ߦ͏ɽ ຊߘࣥච࣌ͷ࠷৽ੈGPUͰ͋ΔGeForce8000 γϦʔζ͓ΑͼRadeonHD2000/3800γϦʔζͷ ෦ߏͷ֓ཁΛਤ3͓Αͼਤ4ʹࣔ͢ɽ͜ΕΒͷGPU ͰϓϩάϥϚϒϧॲཧϢχοτͷฒྻԽ͕ਐΜͰ͓ ΓɼGPUશମͰ࠷େ100Ҏ্ͷԋࢉΛฒྻ࣮ߦՄೳ
ਤ4 RadeonHD2000/3800 γϦʔζͷΞʔΩςΫνϟ
Fig. 4 Architecture of RadeonHD2000/3800 series
ͱͳ͍ͬͯΔɽͨͩ͠ɼGPU্ͷશͯͷԋࢉث͕ಉ ͡ػೳΛ࣋ͪۉʹஔ͞Ε͍ͯΔΘ͚Ͱͳ͍ɽ͍ ͔ͭ͘ͷԋࢉث͕ू·ͬͯॲཧϢχοτΛߏ͓ͯ͠ ΓɼߋʹॲཧϢχοτ͕ू·ͬͯGPUΛߏ͍ͯ͠ Δɽ·ͨϝϞϦʹ͍ͭͯɼGPUશମ͔Βಁաతʹ ΞΫηεՄೳͳ(άϩʔόϧͳ)ϝϞϦͷΈΛඋ͑ͯ ͍ΔΘ͚Ͱͳ͘ɼॲཧϢχοτ͝ͱʹϩʔΧϧͳϝ ϞϦඋ͍͑ͯΔɽ͢ͳΘͪݱࡏͷGPUɼϋʔυ ΣΞϨϕϧͰͷߴ͍ฒྻੑΛඋ͍͑ͯΔͷΈͳΒͣɼ ࡌ͍ͯ͠ΔԋࢉثͱϝϞϦʹ֊ੑ͕උΘ͍ͬͯ Δͱଊ͑Δ͜ͱ͕Ͱ͖Δɽ طଘͷฒྻϓϩάϥϛϯάʹ͓͍ͯɼσʔλϨϕ ϧͷฒྻੑɼ໋ྩϨϕϧͷฒྻੑɼεϨουϨϕϧͷ ฒྻੑɼϓϩηεϨϕϧͷฒྻੑͱ͍ͬͨฒྻੑ͕ར ༻͞Ε͍ͯΔɽ·ͨ۩ମతͳฒྻϓϩάϥϛϯάͷख ஈͱͯ͠SIMD໋ྩɼOpenMP͓Αͼ֤छεϨο υϥΠϒϥϦɼMPIͳͲ͕༻͍ΒΕ͍ͯΔ(ਤ5)ɽͦ ͜ͰɼGPUͷ֊ੑͱطଘͷฒྻԽख๏ͷ֊ੑͱ ΛରԠ͚ɼGPU͚ʹ͜ΕΒͷؔɾϥΠϒϥϦ Λ࣮͢Δ͜ͱΛߟ͑Δɽ 3.2.1 GPU͚SIMD໋ྩͷݕ౼ SIMD໋ྩ̍ͷԋࢉͰෳͷσʔλʹରͯ͠ॲ ཧΛߦ͏͍ΘΏΔϕΫτϧԋࢉ໋ྩͰ͋Γɼσʔλฒ ྻ͚ͷࡉཻͳฒྻԽʹར༻͞Ε͍ͯΔɽGPUϓ ϩάϥϛϯάʹ͓͍ͯɼը૾ඳըͷࡍʹߦ͏ ৭ͷԋࢉʹ4࣍ݩϕΫτϧԋࢉ͕ద͍ͯ͠Δ͜ͱ͔Βɼ ϋʔυΣΞϨϕϧͰϕΫτϧԋࢉʹద͍ͯ͠ΔͷΈ ͳΒͣɼγΣʔμݴޠʹϕΫτϧԋࢉ͚ͷΠϯλʔ ਤ5 CPU ʹ͓͚Δ֊ܕฒྻϓϩάϥϛϯά
Fig. 5 Hierarchical parallel programming on CPU
ϑΣΠε͕උΘ͍ͬͯΔɽϕΫτϧԋࢉʹΑΔߴԽ ͕ߦ͑ΔΑ͏ʹϓϩάϥϜΛΉ͜ͱGPGPUϓ ϩάϥϛϯάͷجຊͰ͋ΔɽҰํCPUϓϩάϥϛϯ άʹ͓͍ͯMMXSSEɼVMXͳͲ֤CPUͷ උ͑Δ໋ྩηοτΛར༻ͯ͠SIMD໋ྩΛར༻͢Δɽ ͦͷͨΊCPUͷରԠ͢Δ໋ྩηοτ͝ͱʹҟͳΔϓ ϩάϥϜهड़Λ͢Δඞཁ͕͋Δ্ʹɼΠϯϥΠϯΞη ϯϒϥΛهड़͢Δඞཁ͕͋Δ߹͋ΔͨΊɼGPU ͚ͷSIMD໋ྩ࡞͠ʹ͍͘ɽͦ͜ͰɼதΒ8)
ʹΑΔهड़ํࣜΛར༻͢Δ͜ͱΛߟ͑Δɽ͜ͷهड़ํ ࣜϢʔβ(ΞϓϦέʔγϣϯϓϩάϥϚ)ʹରͯ͠ SIMD໋ྩηοτ͝ͱͷهड़ͷҧ͍ΛӅณ͠ɼ࣮ߦ ڥΛΘͣ౷Ұతͳهड़ʹΑͬͯSIMDԽΛߦ͏͜ ͱͷͰ͖ΔํࣜͰ͋Δɽ࣮తʹఏҊ͢Δهड़ํࣜ Λ༻͍ͯॻ͔ΕͨϓϩάϥϜʹର֤ͯ͠CPUͷରԠ ͢ΔSIMD໋ྩͷϓϩάϥϜมΛߦ͏ͷͰ͋ ΔͨΊɼهड़ํ͔ࣜΒGPUͷରԠ͢ΔϕΫτϧԋࢉ ͷมػೳΛ࣮͢Δ͜ͱͰGPU͚SIMD໋ ྩͷ࣮͕ՄೳͱͳΔɽ 3.2.2 GPU͚OpenMPͷݕ౼ OpenMP͓Αͼ֤छεϨουϥΠϒϥϦΛ༻͍ͨ ฒྻԽɼڞ༗ϝϞϦΛඋ͑ͨڥʹదͨ͠ࡉཻͷ ฒྻԽख๏Ͱ͋ΔɽOpenMP֤छεϨουϥΠϒϥ Ϧ(ओʹpthread)ͷ্Ґʹ࣮͞Ε͓ͯΓ͔ͭهड़͕ ༰қͰ͍͍ͨ͢Ίɼ͜͜ͰGPU͚OpenMP ͷݕ౼ʹΛߜΔɽOpenMPͰஞ࣍ϓϩάϥϜͷ ιʔείʔυʹϓϥάϚΛૠೖ͠ɼίϯύΠϧ࣌ʹϓ ϥάϚΛղऍͯ͠ڞ༗ϝϞϦΛ༻͍ͨεϨουฒྻϓ ϩάϥϜʹม͢Δɽϧʔϓมͷࣗಈॻ͖͑ʹΑ ΔϧʔϓͷฒྻԽͱ͍ͬͨࡉཻͷฒྻԽʹΘΕΔ ͜ͱ͕ଟ͘ɼஞ࣍ϓϩάϥϜͷஈ֊తͳฒྻԽʹద ͍ͯ͠ΔɽOpenMPΛར༻͢ΔʹOpenMPͷϓ ϥάϚΛղऍ͢Δ͜ͱ͕ՄೳͳίϯύΠϥ͕ඞཁͰ ͋ΔɽରԠίϯύΠϥͷදతͳͷͱͯ͠Omni
OpenMP Compiler9)ɼIntel C++ Compiler,GCCͳ Ͳ͕͋Γɼଞʹ༻ίϯύΠϥͷ͍͔͕ͭ͘ରԠ͠ ͍ͯΔɽOpenMPίϯύΠϥͷجຊతͳ࣮ɼϓϥ άϚ͕ૠೖ͞Εͨ෦ʹεϨουͷ੍ޚΛߦ͏هड़Λ ૠೖ͢Δ͜ͱͰ࣮ݱͰ͖ΔɽGPUʹڞ༗ϝϞϦ͕ ࡌ͞Ε͍ͯΔͨΊɼOpenMPʹ͓͚ΔεϨουͷ ੍ޚʹ૬͢Δهड़ΛGPU্Ͱ฿Ͱ͖ΕGPU ͚OpenMPͷ࣮͕ՄೳͱͳΔɽ 3.2.3 GPU͚MPIͷݕ౼ MPIϝοηʔδ௨৴Λར༻͠ɼෳϊʔυؒʹ͓ ͚Δσʔλ੍ޚͷΓऔΓΛߦ͏͜ͱͰฒྻϓϩά ϥϜΛܗ͢ΔͨΊͷن֨Ͱ͋Δɽϊʔυ͝ͱʹऔಘ ͨ͠rankใʹج͍ͮͯॲཧΛৼΓ͚Δͱ͍ͬͨ ͍ํʹద͓ͯ͠ΓɼOpenMPͱൺͯૄཻͷฒ ྻԽʹΘΕΔ͜ͱ͕ଟ͍ɽMPIࢄϝϞϦڥ ͚ͷฒྻԽख๏Ͱ͋Δ͕ɼڞ༗ϝϞϦڥͰಛʹ ͳ͘ར༻͢Δ͜ͱ͕Ͱ͖ΔɽΉ͠Ζ௨৴ͷࡉ͔͍ ੍ޚ͕ՄೳͳͨΊɼڞ༗ϝϞϦڥͰOpenMPΑ Γߴ͍ੑೳ͕ಘΒΕΔ͜ͱ͕͋ΔɽMPIͷ࣮ί ϯύΠϥෳࡶͳιʔείʔυมػߏΛඞཁͱ͠ͳ ͍ΘΓʹɼ(ෳͷPCʹލͬͯ)ෳͷϓϩηεΛ ্ཱͪ͛ΔͨΊͷػߏΛඋ͍͑ͯΔɽMPIͷ࣮ ͱͯ͠mpich͓Αͼmpich2,LAM10),11)ͳͲ͕ڍ ͛ΒΕΔɽGPU͚MPIͷ࣮ʹ͍ͭͯɼԋࢉϢ χοτؒʹ͓͚Δڞ༗ϝϞϦΛհͨ͠σʔλ੍ޚͷ ΓऔΓʹMPIͷΠϯλʔϑΣΠεΛར༻Ͱ͖ΔΑ ͏ʹ͢Δͱ͍͏࣮͕ߟ͑ΒΕΔɽ·ͨɼCPU-GPU ؒͷ௨৴ʹMPIͷΠϯλʔϑΣΠεΛར༻Ͱ͖ΔΑ ͏ʹ͢Δ͜ͱͰɼCPU্ͷϝϞϦͱGPU্ͷϝϞ Ϧͱ͍͏ࢄϝϞϦͷཧΛطଘͷࢄϝϞϦʹ͍ۙ هड़ʹΑͬͯѻ͑ΔΑ͏ʹ͢Δ͜ͱ͕Ͱ͖Δɽ Ҏ্ͷΑ͏ʹɼطଘͷ༷ʑͳฒྻԽख๏ʹGPUΛ ରԠ͚Δ͜ͱͰɼGPU෦ͷԋࢉثϝϞϦ͕࣋ ͭ֊ੑΛ׆༻ͨ͠GPU͚ฒྻϓϩάϥϜΛ༰ қʹ࡞ՄೳͱͳΔ͜ͱ͕ظͰ͖Δɽ·ͨSMP ϚϧνίΞCPUΛࡌͨ͠ϊʔυʹΑͬͯߏ͞Ε ΔPCΫϥελͳͲʹ͓͍ͯɼϊʔυͷฒྻԽΛ OpenMPɼϊʔυؒͷฒྻԽΛMPIʹΑͬͯߦ͏ϋ ΠϒϦουͳฒྻԽͳͲʹΑͬͯޮతͳฒྻԽ͕ߦ ΘΕ͍ͯΔɽGPUΛ༻͍Δ߹ʹ͓͍ͯɼGPU ͷ֊తͳฒྻੑΛར༻ͨ͠ϋΠϒϦουͳฒྻԽɼ GPUͱCPUͷฒྻॲཧͳͲ༷ʑͳϨϕϧͰͷฒྻ ॲཧ͕༰қʹߦ͑ΔΑ͏ʹͳΔ͜ͱ͕ظͰ͖Δɽ
4. CUDA Λ༻͍ͨ GPU ͚طଘͷฒྻԽ
ख๏ͷ࣮ʹ͚ͨݕ౼
ຊষͰɼCUDAΛ༻͍ͯGPU͚ʹطଘͷฒ ྻԽख๏Λ࣮͢Δํ๏ʹ͍ͭͯΑΓ۩ମతͳݕ౼Λ ߦ͏ɽ ͜Ε·ͰGPUͷػೳΛར༻͢ΔʹάϥϑΟοΫ εAPIΛհ͢Δඞཁ͕͋ͬͨͨΊɼGPGPUʹ͓͍ͯ શͯͷॲཧΛάϥϑΟοΫεϓϩάϥϛϯάͷํࣜ ʹ͋Θ࣮ͤͯ͢Δඞཁ͕͋ͬͨɽͦͷͨΊGPUϓ ϩάϥϛϯάΛߦ͏ͨΊʹάϥϑΟοΫεϓϩάϥ ϛϯάΛशಘ͢Δඞཁ͕͋ͬͨɽ͜Εʹରͯ͠CUDA Ͱଟ͘ͷॲཧΛΑΓײతʹهड़͢Δ͜ͱ͕Ͱ͖Δ (ਤ6)ɽ·ͨ͜Ε·ͰGPUͷ෦ใʹ͍ͭͯά ϥϑΟοΫεϓϩάϥϛϯάʹඞཁͳఔͷใͷΈ ͕ެ։͞Ε͍͕ͯͨɼCUDAͱͱʹଟ͘ͷใ͕ఏ ڙ͞ΕΔΑ͏ʹͳͬͨɽ͜ΕʹΑΓGPUϓϩάϥϜ ͷσόοά࠷దԽʹ༗ӹͳใ͕ೖख͘͢͠ͳͬ ͨͱݴ͑Δɽ ਤ7ʹCUDAͷฒྻ࣮ߦϞσϧ͓ΑͼϝϞϦϞσ ϧΛࣔ͢ɽ͜ΕΛݩʹɼCUDAΛ༻͍ͯطଘͷฒྻԽख๏Ͱ͋ΔSIMD໋ྩɼOpenMP͓ΑͼMPIͷ
ਤ6 CUDA Λ༻͍ͨ GPGPU ϓϩάϥϛϯά
Fig. 6 GPGPU programming using CUDA
ਤ7 CUDA ͷฒྻ࣮ߦϞσϧͱϝϞϦϞσϧ
Fig. 7 Parallel execution model and memory model of CUDA
4.1 CUDA͚SIMD໋ྩͷ࣮
͡ΊʹCUDA͚SIMD໋ྩͷ࣮ʹ͍ͭͯߟ
͑ΔɽCUDAʹ͓͚Δ֤BlockͷThreadɼର
ͷσʔλ܈ʹରͯ͠ಉҰͷॲཧΛಉ࣌ʹ࣮ߦ͢Δ͜ ͱ͕ՄೳͰ͋Δɽ͜ΕయܕతͳSIMD໋ྩͷॲཧ ͱಉҰͰ͋ΔͨΊɼThreadΛ༻͍ͨԋࢉʹରͯ͠ط ଘͷSIMD໋ྩηοτ(ڞ௨هड़ํࣜ)ʹରԠͨ͠Π ϯλʔϑΣΠεΛ༻ҙ͢Δ͜ͱͰCUDA͚SIMD ໋ྩ͕࣮ݱͰ͖Δͱߟ͑ΒΕΔɽ 4.2 CUDA͚OpenMPͷ࣮ ଓ͍ͯ CUDA ͚OpenMP ʹ͍ͭͯߟ͑Δɽ OpenMPΛ༻͍ͨฒྻॲཧʹɼ֤ԋࢉث͕ࣗ༝ʹ ΞΫηε͢Δ͜ͱͷͰ͖Δڞ༗ϝϞϦ͕ඞཁͰ͋Δɽ CUDAʹڞ༗ϝϞϦͱͯ͠ར༻Մೳͳෳछྨͷ
ϝϞϦ͕ଘࡏ͢ΔɽGlobal Memory, Constant
Mem-ory, Texture MemoryશͯͷBlock͓Αͼશͯͷ
Thread͔Βڞ༗ϝϞϦͱͯ͠ར༻ՄೳͰ͋ΓɼShared
MemoryಉҰBlockͷશͯͷThread͔Βڞ༗ ϝϞϦͱͯ͠ར༻Մೳͱͳ͍ͬͯΔɽ͜ΕΒͷ͏ͪɼ
Constant MemoryͱTexture MemoryGPU͔Β ݟΔͱಡΈࠐΈઐ༻ͷϝϞϦͰ͋ΔͨΊɼࠓճར༻
ର͔Βআ֎͢ΔɽΔGlobal MemoryͱShared
Memoryʹ͍ͭͯɼGlobal MemoryΛར༻͢Ε
ThreadϨϕϧͱBlockϨϕϧํͰɼShared
Mem-oryΛར༻͢ΕBlockϨϕϧͰOpenMPͷฒྻॲ
ཧΛஔ͖͑ΒΕΔͱߟ͑ΒΕΔɽΑͬͯɼ͜ΕΒͷ ϝϞϦΛར༻ͯ͠OpenMPͷػೳΛ࣮ݱ͢Δ͜ͱͰɼ CUDA͚OpenMP͕࣮Ͱ͖Δͱߟ͑ΒΕΔɽ OpenMP ࡉཻͳฒྻॲཧʹར༻͢ΔͨΊɼ BlockϨϕϧͷฒྻॲཧΑΓThreadϨϕϧͷฒ ྻॲཧʹద͍ͯ͠Δͱߟ͑Δ͜ͱ͕Ͱ͖ΔɽҰํͰ GPU෦ͷϝϞϦసૹCPU-ϝΠϯϝϞϦؒ ΑΓߴͰ͋Γɼ·ͨBlockϨϕϧͷฒྻॲཧ ThreadϨϕϧͷॲཧͱൺͯॲཧͷࣗ༝͕ߴ͍ɽ
ਤ8 CUDA Λ༻͍ͨϋΠϒϦουฒྻϓϩάϥϜͷಈ࡞Πϝʔδ
Fig. 8 Executive image of hybrid parallel program using CUDA
ͦͷͨΊCUDA͚OpenMPͷ࣮ʹ͍ͭͯɼ ThreadϨϕϧͱBlockϨϕϧͷͦΕͧΕͰ࣮ͯ͠ ੑೳൺֱΛߦ͍ɼΑΓྑ͍࣮Λ୳Δ͜ͱʹ͢Δɽ 4.3 CUDA͚MPIͷ࣮ ࠷ޙʹCUDA͚MPIͷ࣮ʹ͍ͭͯߟ͑Δɽ CUDAͷฒྻ࣮ߦϞσϧͷ͏ͪɼThreadϨϕϧͷฒ ྻॲཧSIMDʹۙ͘ɼҟछͷԋࢉΛฒྻ࣮ߦ͢Δ͜ ͱ͕Ͱ͖ͳ͍ɽ͜ΕૄཻͷฒྻॲཧΛߦ͏MPI ʹ͓͍ͯக໋తͳͰ͋ΔͨΊɼBlockϨϕϧͷฒ ྻॲཧΛར༻͢Δ͜ͱʹ͢ΔɽMPIΛ༻͍ͨฒྻॲ ཧʹϊʔυؒͷ௨৴͕ෆՄܽͰ͋Γɼ·ͨಉظ௨৴ ؔΛ฿͢ΔʹBlockؒͷಉظػߏ͕ඞཁͰ͋ ΔɽCUDAʹBlockؒͰతͳ௨৴Λߦ͏ػߏ ͕ଘࡏ͠ͳ͍͕ɼGlobal MemoryΛհͯ͠σʔλΛ ΓऔΓ͢Δ͜ͱͰBlockؒͷ௨৴Λ฿͢Δɽߋʹ CUDAʹBlockؒͷಉظΛऔΔػߏ͕༻ҙ͞Εͯ ͍ͳ͍ͨΊɼGlobal MemoryΛར༻ͯ͠ಉظॲཧΛ ฿͢ΔɽҎ্ʹΑͬͯCUDA͚MPI͕࣮Ͱ ͖Δͱߟ͑ΒΕΔɽ ҰํɼCPU-GPUؒͷσʔλసૹΛߦ͏ؔΛఆٛ ͠ɼCPUͱGPUͱ͍͏ҟͳΔϋʔυΣΞؒͷࢄ ϝϞϦΛMPIͷΠϯλʔϑΣΠεͰཧՄೳʹ͢Δ ͱ͍͏ద༻ํ๏ߟ͑ΒΕΔɽ͜ͷࡍͷͱͯ͠ɼ
CUDAͷ࣮ߦϞσϧͱͯ͠Grid(CPU͔ΒGPUʹ
ରͯ͠ԋࢉࢦࣔΛߦ͏ࡍͷ୯Ґ)ͷ్தͰCPU-GPU ؒͷσʔλૹड৴͕Ͱ͖ͳ͍͜ͱɼͦͷͨΊGridͷ ׂͳͲ͕ඞཁͱͳΔՄೳੑ͕ڍ͛ΒΕΔɽ͔͠͠ͳ ͕ΒطଘͷΠϯλʔϑΣΠεͰཧͰ͖ΔϝϦοτ େ͖͍ͨΊɼ͜ͷద༻ํ๏࣮͢ΔՁ͕͋Δɽ Ҏ্ͷΑ͏ʹThreadϨϕϧͷฒྻԽͱBlockϨϕ ϧͷฒྻԽΛదʹ͍͚Δ͜ͱͰɼCUDA͚
ͷSIMD໋ྩɼOpenMPɼMPIΛ࣮͢Δ͜ͱ͕Ͱ
͖Δͱߟ͑ΒΕΔɽߋʹɼطଘͷCPU͚ฒྻϓϩ άϥϛϯάʹ͓͚Δ֊ܕͷ(ϋΠϒϦουͳ)ฒྻ ϓϩάϥϜΛ฿͢Δ͜ͱՄೳͱͳΔɽਤ8ʹզʑ ͕ݕ౼ΛਐΊ͍ͯΔCUDAΛ༻͍ͨϋΠϒϦουฒ ྻϓϩάϥϜͷಈ࡞ΠϝʔδΛࣔ͢ɽ͜ΕΒΛ༻͍Δ ͜ͱͰɼGPUͷੑೳΛΑΓ༰қʹ͔ͭޮྑ͘ൃش Ͱ͖ΔΑ͏ʹͳΔ͜ͱ͕ظͰ͖Δɽ
5. ؔ ࿈ ݚ ڀ
ݱࡏɼޮྑ͘GPGPUϓϩάϥϛϯάΛߦ͏ͨ ΊͷݴޠϥΠϒϥϦͷݚڀ͕͍͔ͭ͘ߦΘΕ͍ͯ ΔɽதʹBrookGPU12)RapidMind13)ͷΑ͏ʹɼGPUͷΈͰͳ͘Cell Broadband Engine
ϚϧνίΞCPUͳͲͷద༻ΛࢹʹೖΕͨͷ
ଘࡏ͢Δɽ͜ΕΒGPGPUϓϩάϥϛϯάΛ༰қ
ʹ͍ͨ͠ͱ͍͏ཁٻຊݚڀͱಉ༷Ͱ͋Δɽ͔͠͠ͳ ͕ΒɼC/C++ʹର͢Δ֦ுͱ͍͏ܗΛͱΓطଘͷϓ ϩάϥϛϯάݴޠʹ͍ۙهड़͕Ͱ͖ΔΑ͏ʹ͍ͯ͠Δ
ͱ͍͑ɼ৽͍͠ݴޠϥΠϒϥϦΛ࡞͍ͯ͠Δͨ ΊطଘͷΞϓϦέʔγϣϯϓϩάϥϚʹର͢Δशಘί ετͷେ͖͞൱Ίͳ͍ɽҰํɼରΞʔΩςΫνϟ ʹదͨ͠ݴޠΛ࡞͢Δ͜ͱͰطଘͷݴޠΛ༻͍ΔΑ Γߴ͍ੑೳΛಘ͍͢Մೳੑ͋ΔͨΊɼهड़शಘ ͷ͢͠͞ͱੑೳʹ͍ͭͯͷٞɾධՁΛߦ͏Ձ͕ ͋Δɽ ຊݚڀͰ࣮ʹར༻͍ͯ͠ΔCUDAɼGeForce8000 γϦʔζͷΞʔΩςΫνϟʹڧ͘ґଘͨ͠ϥΠϒϥϦ Ͱ͋ΔɽCUDAࠓޙNVIDIAͷϦϦʔε͢Δ৽͠ ͍GPUͰར༻Ͱ͖Δͱ͞Ε͍ͯΔ͕ɼࠓͷͱ͜Ζ ଞࣾͷGPUͰ༻͍Δ͜ͱͰ͖ͳ͍ͨΊɼGPUϓϩ άϥϜΛͯ͢CUDAͰهड़͢Δͱ͍͏ͷݱ࣮త Ͱͳ͍ɽ͜Εʹର͠ɼNVIDIAͱGPUͷγΣΞΛ ೋ͍ͯ͠ΔAMDClose-to-the-Metal(CTM)14) ͷఏڙΛද໌͍ͯ͠ΔɽGPUϓϩάϥϜͷهड़͠ ͢͞ੑೳͷൃش͢͠͞ͷ໘͔ΒɼࠓޙGPUϕ ϯμʔ͕GPGPU͚ʹ։ൃڥࢿྉΛఏڙ͠ଓ ͚ΔՄೳੑߴ͍ɽҰํͰɼCUDACTM͕ݴޠ ༷ͷมߋͳ͘ܧଓͯ͠ఏڙ͞ΕΔ͔ෆ໌Ͱ͋Γɼ ·ͨGPUͷछྨ͝ͱʹݸผͷϓϩάϥϜΛ࡞͢Δ ඞཁ͕͋Δ͜ͱɼGPGPUϓϩάϥϛϯάΛ༰қ ʹ͢Δͱ͍͏؍͔Βେ͖ͳোͰ͋ΔɽຊఏҊ GPUͷҧ͍Λٵऩ͠ڞ௨ʹར༻Ͱ͖ΔڥΛఏڙ͢ ΔͷͰ͋Γɼ͜͏ͨ͠ΛղܾͰ͖ΔՄೳੑ͕͋ Δͱݴ͑Δɽ
6. ͓ Θ Γ ʹ
ຊߘͰGPGPUΛ༰қʹར༻͢ΔͨΊͷ৽͍͠ ख๏ͱͯ͠ɼطଘͷฒྻԽख๏Λ༻͍ͨGPGPUϓϩ άϥϛϯάͷఏҊΛߦͬͨɽ·ͨCUDAΛ༻͍࣮ͨ ྫʹ͍ͭͯݕ౼ΛߦͬͨɽຊఏҊʹΑͬͯGPGPU ϓϩάϥϛϯάʹطଘͷฒྻԽख๏͕ར༻Ͱ͖ΔΑ ͏ʹͳΕɼ༷ʑͳΞϓϦέʔγϣϯʹରͯ͠༰қʹ GPUΛ׆༻Ͱ͖ΔΑ͏ʹͳΔɽߋʹGPUͱ͍͏ಛघ ͳϋʔυΣΞ͚ͷϓϩάϥϛϯάΛطଘͷCPU ͚ϓϩάϥϛϯάख๏Ͱߦ͏ྫͱͯ͠ߟ͑Δ͜ͱͰɼ ฒྻԽख๏ʹͱͲ·Β༷ͣʑͳCPU͚ͷϓϩάϥ ϛϯάख๏ΛGPU͚ʹ׆༻Ͱ͖ΔՄೳੑΛࣔͯ͠ ͍ΔɽຊఏҊɼࠓޙ·͢·͢ͷੑೳ্͕ظ͞Ε ΔGPUΛ༰қʹ׆༻͢Δ৽͍͠ՄೳੑΛࣔ͢ͷͰ ͋Δɽ ݱࡏఏҊ༰ͷ࣮ΛਐΊ͍ͯΔɽࠓޙ࣮Λ ਐΊɼ֤छΞϓϦέʔγϣϯʹద༻ͯ͠ੑೳධՁΛߦ ͏ɽ·ͨ͜ΕΒΛ௨ͯ͠GPGPUϓϩάϥϛϯάʹ طଘͷฒྻԽख๏Λ༻͍Δ͜ͱࣗମʹ͍ͭͯධՁΛ ߦ͏͜ͱͰɼΑΓྑ͍GPGPUϓϩάϥϛϯάख๏ ʹ͍ͭͯͷ͕ٞਂ·Δ͜ͱ͕ظͰ͖Δɽࢀ
ߟ จ
ݙ
1) gpgpu.org: SIGGRAPH 2007 GPGPU COURSE, http://www.gpgpu.org/s2007/.
2) gpgpu.org: General-Purpose computation on GPUs(GPGPU), http://gpgpu.org/.
3) NVIDIA: CUDA Programming Guide 1.0 (CUDA NVIDIA Homepage),
http://developer.nvidia.com/cuda/. 4) Apple: Apple Human Interface Guidelines,
http://developer.apple.com/documentation/ UserExperience/Conceptual/OSXHIGuidelines/. 5) Apple: Graphics & Imaging - Quartz,
http://developer.apple.com/ graphicsimaging/quartz/.
6) Microsoft: Experience Windows Vista: Win-dows Aero, http://www.microsoft.com/ windows/products/windowsvista/features/ experiences/aero.mspx.
7) RenderingProject/aiglx: RenderingProject/aiglx - Fredora Project Wiki, http://fedoraproject. org/wiki/RenderingProject/aiglx.
8) த༔,ᬑܒਖ਼,ฏᖒকҰ,ຊଟ߂थ:ίʔυͷ
ੑೳՄൖੑΛఏڙ͢ΔSIMD͚ڞ௨هड़ํࣜ,
ใॲཧֶձจࢽ ίϯϐϡʔςΟϯάγες Ϝ, Vol. 48, pp. 95–105 (2007).
9) M.Sato, S.Satoh, K.Kusano and Y.Tanaka: Design of OpenMP Compiler for an SMP Clus-ter, EWOMP ’99 , pp. 32–39 (1999).
10) Burns, G., Daoud, R. and Vaigl, J.: LAM: An Open Cluster Environment for MPI,
Proceed-ings of Supercomputing Symposium, pp. 379–
386 (1994).
11) Squyres, J. M. and Lumsdaine, A.: A Com-ponent Architecture for LAM/MPI,
Proceed-ings, 10th European PVM/MPI Users’ Group Meeting, Lecture Notes in Computer Science,
No. 2840, Venice, Italy, Springer-Verlag, pp. 379–387 (2003).
12) BrookGPU: BrookGPU, http://graphics. stanford.edu/projects/brookgpu/.
13) RapidMind Inc.: RapidMind, http://www.rapidmind.net/.
14) ATI: ATI CTM Guide, http://ati.amd.com/ companyinfo/researcher/documents.html.