パフォーマンス解析ツール長岡技術科学大学電気電子情報工学専攻出川智啓

(1)

パフォーマンス解析ツール

(2)

今回の内容



プロファイリング



_{Compute Profiler}



_(Nsight)



_{Warpと並列実行の制約}



_{Occupancy Calculator}



ベクトル和のプロファイリング

(3)

プログラムの挙動



経験的なプログラム高速化手法



スレッド数は128か256



メモリはアドレス順にアクセス

等



プログラム内部で挙動がどのように変化するか



パフォーマンスカウンタ機能による確認



プロファイリング

(4)

プロファイリング



パフォーマンスカウンタの数値を出力



プログラム実行時の挙動の記録



ボトルネックの特定



プロファイラ



コマンドラインで実行するプロファイラ



_{Compute Profiler}



_{nvprof(CUDA 5以降)}



_{GUIから操作するプロファイラ}



_{Compute Visual Profiler}

(5)

Compute Profiler



_{CUDAの実行環境に組み込まれたプロファイラ}



環境変数を設定するだけで有効化



使い方

1. 環境変数の設定

$ export CUDA_PROFILE=1

2. プログラムの実行

$./a.out

3. プロファイル結果(標準はcuda_profile_0.log)の確認

(6)

Compute Profilerの環境変数



_{CUDA_PROFILE}



_{1でプロファイル実行，0でプロファイル停止}



_{CUDA_PROFILE_CONFIG}



プロファイル項目を記述したファイルの名前



_{CUDA_PROFILE_LOG}



出力ファイル名（標準はcuda_profile_0.log）



_{CUDA_PROFILE_CSV}



_{1で出力ファイルをCSV（カンマ区切り）形式に変更}

(7)

差分法のプロファイル



_{$ export CUDA_PROFILE=1}



_{$ nvcc ‐arch=sm_20 differentiate.cu}



_$ ./a.out



_{cuda_profile_0.logが出力される}



_{$cat cuda_profile_0.log}

(8)

プロファイル結果の確認



標準で出力される値



_method

カーネルや関数（API）の名称



_gputime

_{GPU上で処理に要した時間（s単位）}



_cputime

_{CPUで処理（=カーネル起動）に要した時間}

実際の実行時間=cputime+gputime



_{occupancy 同時実行されているWarp数と実行可能な最大}

$cat cuda_profile_0.log # CUDA_PROFILE_LOG_VERSION 2.0 # CUDA_DEVICE 0 Tesla M2050 # TIMESTAMPFACTOR fffff614a81cd038 method,gputime,cputime,occupancy method=[ memcpyHtoD ] gputime=[ 3285.216 ] cputime=[ 3657.000 ] method=[ _Z13differentiatePdS_ ] gputime=[ 197.408 ] cputime=[ 18.000 ] occupancy=[ 1.000 ] method=[ memcpyDtoH ] gputime=[ 3598.752 ] cputime=[ 4505.000 ]

(9)

GPUの並列処理の実際



_{GPUのスレッドを管理}



_{Streaming Multiprocessor内のCUDA Core数}



_Tesla世代

₈



Fermi世代

32 

Kepler世代

192 

Maxwell世代

128 

_{Warp（ウォープ）という処理の単位の導入}



_{32スレッドをまとめて1 Warp}



_{1 Warpごとに処理を実行}

256スレッドの同時実行は不可能

(10)

GPUの並列処理の実際



処理はWarp単位で実行



_{32スレッドごとに処理を実行}



_{1ブロックあたりのスレッド数が256=Warp 8個}



ある命令を発行すると，Warp単位で同じ命令を実行



_{Warp内でCUDA CoreはSIMD的に動作}



複数のまとまったスレッドが同じ演算を同時に実行



SIMT(Single Instruction Multiple Threads)



厳密なSIMDではない

(11)

GPUの並列処理の実際



パイプライン実行



あるWarpがグローバルメモリへアクセス，データ待ち



その他の実行可能なWarpが計算を実行



グローバルメモリへのアクセスのレイテンシを隠蔽

計算

メモリアクセス

計算

Warp0

Warp1

Warp2

Warp3

実行開始

処理時間

計算

メモリアクセス

計算

メモリアクセス

計算

メモリアクセス

・・・

(12)

GPUの並列処理の実際



_{Warpの同時実行}



同時に実行されているWarp(Active Warp)が多いほどレイテ

ンシ隠蔽が容易



_{Active Warpを多くすることで高速化}



_{Warpの同時実行数はブロック内で使用している資源（レジスタ，}

共有メモリ）によって変化



占有率(Occupancy)



同時実行されているWarp数と実行可能な最大Warp数の比



一般的には1に近い方が高速

(13)

並列実行に対する制約



_{GPUの構成（資源の量，スケジューラ）に起因するハード}

ウェア的・ソフトウェア的制約

Tesla世代

Fermi世代

Warpサイズ

32スレッド

1ブロックあたりのスレッド数

512 1024

1SMあたりの最大スレッド数

1024

1536

1SMあたりのWarp数

32

48 1SMあたりのブロック数

8

8 1SMあたりの共有メモリサイズ

16384 byte

49152 byte

1SMあたりの32bitレジスタ数

16384本

32768本

https://www.softek.co.jp/SPG/Pgi/TIPS/public/accel/gpu‐accel2.html

(14)

並列実行に対する制約



_{1ブロックあたりのスレッド数 256}



_{1ブロックあたりのWarp数}

256thread/block / 32thread/Warp = 8 Warp/block



_{1SMが処理するブロック数（1SMあたり最大48Warpを処理）}

48 Warp/SM / 8 Warp/block = 6 block/SM（<8）



_{1SMあたり6ブロックを並列処理すればよい（最大8ブロック）}



同時実行されるWarp数は

8 Warp/block × 6 block/SM = 48 Warp/SM.



_{1SMあたり最大48Warp同時実行できるので，占有率は}

48 Warp/SM /48 Warp/SM = 1(=100%)

太字：利用するGPUによって決定

下線：ユーザの設定により決まる値

(15)

並列実行に対する制約



_{1ブロックあたりのスレッド数 64}



_{1ブロックあたりのWarp数}

64thread/block / 32thread/Warp = 2 Warp/block



_{1SMが処理するブロック数（1SMあたり最大48Warpを処理）}

48 Warp/SM / 2 Warp/block = 24 block/SM（

>8

）



_{1SMあたり8ブロックを並列処理（24ブロックの処理は不可能）}



同時実行されるWarp数は

2 Warp/block × 8 block/SM = 16 Warp/SM.



_{1SMあたり最大48Warp同時実行できるので，占有率は}

16 Warp/SM /48 Warp/SM = 0.333(=33%)

太字：利用するGPUによって決定

下線：ユーザの設定により決まる値

(16)

並列実行に対する制約



並列実行されるブロックの数



_{1ブロックあたりの共有メモリやレジスタの使用量で制限}



_{1ブロックあたりのスレッド数 Nt}



_{1ブロックが利用する共有メモリS[byte]}

49152/S [block/SM]



_{1スレッドが利用するレジスタR[本]}

32768/(Nt×R) [block/SM]



並列実行されるブロック数

min{8, 48/(Nt/32), 49152/S, 32768/(Nt×R)}

太字：利用するGPUによって決定

下線：ユーザの設定により決まる値

斜体

：処理内容やコンパイラによって決定

(17)

Occupancyの計算



_{CUDA Occupancy Calculator}



_{NVIDIAが提供*}



_{Occupancy等の値を実行前に推定}



必要な情報



_GPUの世代



共有メモリのサイズ設定



64kBのうち何kBを共有メモリとするか



_{1ブロックあたりのスレッド数}



_{1スレッドあたりのレジスタ使用数}



_{1ブロックあたりの共有メモリ使用量}

*http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls

(18)

CUDA Occupancy Calculator

(19)

共有メモリやレジスタ数使用量の確認



コンパイルオプションを付加



_{‐Xptxas –v もしくは ‐‐ptxas‐options=‐v}



実行可能バイトコードの情報を出力

$ nvcc ‐arch=sm_20 differentiate.cu ‐Xptxas ‐v

ptxas info : Compiling entry function '_Z13differentiatePdS_' for 'sm_20'

ptxas info : Function properties for _Z13differentiatePdS_

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info :

Used 19 registers

,

48 bytes cmem[0]

,

12 bytes cmem[16]

使用レジスタ数

コンスタントメモリ使用量（サイズ）

(20)

共有メモリやレジスタ数使用量の確認



共有メモリを利用したカーネル



コンスタントメモリを利用したカーネル

$ nvcc ‐arch=sm_20 differentiate_shared.cu ‐Xptxas ‐v

ptxas info : Compiling entry function '_Z13differentiatePdS_' for 'sm_20'

ptxas info : Function properties for _Z13differentiatePdS_

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info :

Used 18 registers

,

2064+0 bytes smem

, 48 bytes cmem[0],

4 bytes cmem[16]

共有メモリ使用量（サイズ）

$ nvcc ‐arch=sm_20 differentiate_constant.cu ‐Xptxas ‐v

ptxas info : Compiling entry function '_Z13differentiatePdS_' for 'sm_20'

ptxas info : Function properties for _Z13differentiatePdS_

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

(21)

CUDA Occupancy Calculator

(22)

レジスタ数の制約

*



_{Compute Capalicity2.0世代のGPUのレジスタ利用}



_{1 Warpあたり64本単位でレジスタを確保}



レジスタ数が問題となってOccupancy100%の達成が困難



_{Occupancy 100%を達成するためには}



ハードウェア的制約



_{1SMあたりのレジスタ数}

_{32768本以下}



_{1SMあたりのWarp数}

_{48(=48×32スレッド)以下}



_{1 Warpあたり64本単位でレジスタを確保}

*http://www.gdep.jp/column/view/33

(23)

レジスタ数の制約

*



_{Occupancy 100%を達成するためには}



_{1スレッドあたりの利用可能レジスタ数}

32768本/(48Warp/SM×32thread/Warp)=21.3=21本/thread



_{1スレッドあたり21本以下}



_{1 Warpが確保するレジスタ数}

21本/thread×32thread/Warp=672→

704(64の倍数)に切り上げ



_{1SMあたりの使用レジスタ数}

704本/Warp×48Warp/SM=33792本/SM (

>32768

)



_{1スレッドあたりの使用レジスタ数が20本以下でないと}

100%を達成できない

*http://www.gdep.jp/column/view/33

(24)

詳細なプロファイリング



_{gputime, cputime, occupancy以外のプロファイル}



設定ファイルに測定する項目を記述



設定ファイルを環境変数CUDA_PROFILE_CONFIGで指定



_{prof.confがよく使われる}



_{$ export CUDA_PROFILE_CONFIG=prof.conf}



設定後，CUDA_PROFILE=1にしてプログラムを実行



_{$ export CUDA_PROFILE=1}

(25)

詳細なプロファイリング



_{prof.confの内容}



_{Profiler User's Guide}

http://docs.nvidia.com/cuda/profiler‐users‐guide/index.html



4.3. Command Line Profiler Configuration中のTable 2



_{9. Metrics Reference中のTable 3}



_{nvprofを利用することも可能（CUDA 5以降）}



_{$ nvprof ‐‐query‐eventsの結果を参照}

(26)

カーネル起動に関する情報

名称

内容

gpustarttimestamp

カーネルもしくはメモリ転送開始時のタイムスタンプ(ns，64bit16進数)

gpuendtimestamp

カーネルもしくはメモリ転送完了時のタイムスタンプ(ns，64bit16進数)

timestamp

カーネル起動もしくはメモリ転送開始時のタイムスタンプ(s，単精度実数)

gridsize

カーネル起動時のブロック数(x,y)

gridsize3d

カーネル起動時のブロック数(x,y,z)

threadblocksize

カーネル起動時のスレッド数(x,y,z)

dynsmemperblock

動的に割り当てられた共有メモリのサイズ

stasmemperblock

静的に割り当てられた共有メモリのサイズ

regperthread

1スレッドあたりのレジスタ数

memtransferdir

メモリ転送の方向(0. host‐>device, 1. device‐>host)

memtransfersze

メモリ転送サイズ

memtransferhostmemtype

ホストメモリの種類（ページング可能かページロックか）

streamid

カーネルを起動したストリームのID

cacheconfigrequested

_{キャッシュ設定の要求}

共有メモリ：L1キャッシュ

(27)

カーネル実行時のイベント



全てのGPUで取得できる情報

名称

内容

local_load

_{ローカルメモリからの読込回数（1SM内のWarpあたり）}

local store

_{ローカルメモリからの書込回数（1SM内のWarpあたり）}

gld_request

_{グローバルメモリからの読込回数（1SM内のWarpあたり）}

gst_request

_{グローバルメモリへの書込回数（1SM内のWarpあたり）}

divergent_branch

_{divergent branchの発生回数}

branch

プログラム中の分岐命令の数

sm_cta_launched

_{1SM内で実行されたブロックの数}

(28)

カーネル実行時のイベント



_{Compute Capability 1.xのGPUで取得できる情報}

名称

内容

gld_incoherent

非コアレスアクセスによるグローバルメモリからの読込回数

gld_coherent

コアレスアクセスによるグローバルメモリからの読込回数

gld_32b

32バイト単位で行われたグローバルメモリからの読込回数

gld_64b

64バイト単位で行われたグローバルメモリからの読込回数

gld_128b

128バイト単位で行われたグローバルメモリからの読込回数

gst_incoherent

非コアレスアクセスによるグローバルメモリへの書込回数

gst_coherent

コアレスアクセスによるグローバルメモリへの書込回数

gst_32b

32バイト単位で行われたグローバルメモリへの書込回数

gst_64b

64バイト単位で行われたグローバルメモリへの書込回数

gst_128b

128バイト単位で行われたグローバルメモリへの書込回数

instructions

実行された命令数

warp_serialize

共有メモリ，コンスタントメモリでコンフリクトした回数

(29)

カーネル実行時のイベント



_{Compute Capability 2.xのGPUで取得できる情報}

名称

内容

instructions_issued

（バージョンによってはinst_issued）

命令発行数（命令のreplay*を含む）

instructions_issued1_0

命令群0の1命令発行に要するサイクル数

instructions_issued2_0

命令群0の2命令発行に要するサイクル数

instructions_issued1_1

命令群1の1命令発行に要するサイクル数

instructions_issued2_1

命令群1の2命令発行に要するサイクル数

instructions_executed

（バージョンによってはinst_executed）

命令発行数（replayを含まない）

warps_launched

起動されたWarpの数（1SMあたり）

threads_launched

起動されたスレッドの数（1SMあたり）

active_cycles

1SM内で最低でも1 Warpが実行されたサイクル数

active_warps

1SM内でアクティブになったWarpの数の合計

_{サイクルごとにアクティブなWarpの数（0~48の間で）を加算}

shared_load

共有メモリからの読込回数（1SM内のWarpあたり）

shared_store

共有メモリへの書込回数（1SM内のWarpあたり）

*同一Warp内のスレッド

が同じ命令を逐次的に

実行

バンクコンフリクト等が

原因で並列実行できな

い場合に発生

http://on‐demand.gputechconf.com/ gtc‐express/2011/presentations/In st_limited_kernels_Oct2011.pdf

(30)

カーネル実行時のイベント



_{Compute Capability 2.xのGPUで取得できる情報}

名称

内容

l1_global_load_hit

L1キャッシュにヒットしたグローバルメモリからの読込回数

l1_global_load_miss

L1キャッシュでキャッシュミスしたグローバルメモリからの読込回数

l1_local_load_hit

L1キャッシュにヒットしたローカルメモリからの読込回数

l1_local_load_miss

L1キャッシュでキャッシュミスしたローカルメモリからの読込回数

l1_local_store_hit

L1キャッシュにヒットしたローカルメモリへの書込回数

l1_local_store_miss

L1キャッシュでキャッシュミスしたローカルメモリへの書込回数

l1_shared_bank_conflict

共有メモリ内でバンクコンフリクトした回数

uncached_global_load_transacti

on

キャッシュされないグローバルメモリからの読込回数

32,64,128ビットアクセスに対して1,2,4で加算

global_store_transaction

グローバルメモリへの書込回数

32,64,128ビットアクセスに対して1,2,4で加算

(31)

カーネル実行時のイベント



合計で70個程度

(※動作は未確認)

•elapsed_cycles_sm •global_store_transaction •l1_global_load_hit •l1_global_load_miss •l1_local_load_hit •l1_local_load_miss •l1_local_store_hit •l1_local_store_miss •l1_shared_bank_conflict •sm_cta_launched •tex0_cache_sector_misses •tex0_cache_sector_queries •uncached_global_load_transaction •fb_subp0_read_sectors •fb_subp0_write_sectors •fb_subp1_read_sectors •fb_subp1_write_sectors •l2_subp0_read_hit_sectors •l2_subp0_read_sector_misses •l2_subp0_read_sector_queries •l2_subp0_read_sysmem_sector_queries •l2_subp0_read_tex_hit_sectors •l2_subp0_read_tex_sector_queries •l2_subp0_total_read_sector_queries •l2_subp0_total_write_sector_queries •l2_subp0_write_sector_misses •l2_subp0_write_sector_queries l2 •l2_subp0_write_sysmem_sector_queries •l2_subp1_read_hit_sectors •l2_subp1_read_sector_misses •l2_subp1_read_sector_queries •l2_subp1_read_sysmem_sector_queries •l2_subp1_read_tex_hit_sectors •l2_subp1_read_tex_sector_queries •l2_subp1_total_read_sector_queries •l2_subp1_total_write_sector_queries •l2_subp1_write_sector_misses l2 subp1 •l2_subp1_write_sector_queries l2 subp1 •l2_subp1_write_sysmem_sector_queries •gld_inst_128bit •gld_inst_16bit •gld_inst_32bit •gld_inst_64bit •gld_inst_8bit •gst_inst_128bit •gst_inst_16bit •gst_inst_32bit •gst_inst_64bit •gst_inst_8bit •active_cycles •active_warps •atom_count •branch •divergent_branch •gld_request •gred_count •gst_request •inst_executed •inst_issued •local_load •local_store •prof_trigger_00 •prof_trigger_01 •prof_trigger_02 •prof_trigger_03 •prof_trigger_04 •prof_trigger_05 •prof_trigger_06 •prof_trigger_07 •shared_load •shared_store •thread_inst_executed_0 •thread_inst_executed_1 •threads_launched •warps_launched http://www.ics.uci.edu/~roblim1/docs/cupticounters.txt

(32)

プログラム挙動の確認



スレッド数の変化によるメモリアクセスの変化



グローバルメモリからの読込(prof_gld)



_{gld_request}



_{l1_global_load_hit}



_{l1_global_load_miss}



グローバルメモリへの書込(prof_gst)



_{gst_request}



_{global_store_transaction}



命令



inst_issued

(prof_inst_issued)



_{inst_executed (prof_inst_executed)}

(33)

#define N (512000)

#define Nbytes (N*sizeof(float))

#define NT (256)

#define NB (N/NT)

//=2000

global void init(float *a,

float b, float c){

int i = blockIdx.x*blockDim.x

+ threadIdx.x;

a[i] = 1.0;

b[i] = 2.0;

c[i] = 0.0;

}

global void add(float *a,

float b, float c){

int i = blockIdx.x*blockDim.x

+ threadIdx.x;

c[i] = a[i] + b[i];

}

int main(void){

float a,b,*c;

cudaMalloc((void **)&a, Nbytes);

cudaMalloc((void **)&b, Nbytes);

cudaMalloc((void **)&c, Nbytes);

init<<< NB, NT>>>(a,b,c);

add<<< NB, NT>>>(a,b,c);

return 0;

}

ベクトル和（コアレスアクセス版）

vectoradd_coalesce.cu

(34)

#define N (512000)

#define Nbytes (N*sizeof(float))

#define NT (250)

#define NB (N/NT)

//=2048

global void init(float *a,

float b, float c){

int i = blockIdx.x*blockDim.x

+ threadIdx.x;

a[i] = 1.0;

b[i] = 2.0;

c[i] = 0.0;

}

global void add(float *a,

float b, float c){

int i = blockIdx.x*blockDim.x

+ threadIdx.x;

c[i] = a[i] + b[i];

int main(void){

float a,b,*c;

cudaMalloc((void **)&a, Nbytes);

cudaMalloc((void **)&b, Nbytes);

cudaMalloc((void **)&c, Nbytes);

init<<< NB, NT>>>(a,b,c);

add<<< NB, NT>>>(a,b,c);

return 0;

}

ベクトル和（非コアレスアクセス版）

(35)

export CUDA_PROFILE=1

nvcc ‐arch=sm_20 vectoradd_coalesce.cu #

export CUDA_PROFILE_CONFIG=prof_gld

export CUDA_PROFILE_LOG=vectoradd_c_profile_gld.log

./a.out #

export CUDA_PROFILE_CONFIG=prof_gst

export CUDA_PROFILE_LOG=vectoradd_c_profile_gst.log

./a.out #

export CUDA_PROFILE_CONFIG=prof_inst_issued export CUDA_PROFILE_LOG=vectoradd_c_profile_inst_issued.log

./a.out #

export CUDA_PROFILE_CONFIG=prof_inst_executed export CUDA_PROFILE_LOG=vectoradd_c_profile_inst_executed.log

./a.out # # # nvcc ‐arch=sm_20 vectoradd_noncoalesce.cu # export CUDA_PROFILE_CONFIG=prof_gld

export CUDA_PROFILE_LOG=vectoradd_n_profile_gld.log

./a.out #

export CUDA_PROFILE_CONFIG=prof_gst

export CUDA_PROFILE_LOG=vectoradd_n_profile_gst.log

./a.out #

export CUDA_PROFILE_CONFIG=prof_inst_issued export CUDA_PROFILE_LOG=vectoradd_n_profile_inst_issued.log

./a.out #

export CUDA_PROFILE_CONFIG=prof_inst_executed export CUDA_PROFILE_LOG=vectoradd_n_profile_inst_executed.log

./a.out #

コンパイルと実行（シェルスクリプト）

(36)

実行結果

（ベクトル和のみ抜粋

, Occupancyはどちらも1）



_{l1_global_load_missの値が増加}



_{globa_store_transactionの値が大幅に増加}

内容

コアレスアクセス

非コアレスアクセス

gputime+cputime

（4回の平均）

56.808+103

=159.808

76.616+122.5

=199.116

gld_request

2336

2368

l1_global_load_hit

0 1941

l1_global_load_miss

2336

2577

gst_request

1136

1168

global_store_transaction

1136

2252

inst_issued

24113

31399

inst_executed

19720

19992

(37)

メモリアクセスの評価



_{transactions/request}



_{1 Warpがアクセス要求に対してキャッシュラインをいくつ使うか}



理想値



32 thread/Warp × variable size[byte/thread]

/ 128 byte/line



_{4バイト変数では1，8バイト変数では2}



読込



_{(l1_global_load_hit+l1_global_load_miss)/gld_re}

quest



書込



_{global_store_transaction/gst_requet}

(38)

transactions/request



理想値



_{float型（4バイト）の読込 1}



_{float型（4バイト）の書込 1}



非コアレスアクセスではメモリ転送の効率が50%に低下

読込

書込

コアレスアクセス

_{(2336+0)/2336=1}

_1136/1136=1

非コアレスアクセス

_{(1941+2577)/2368≈2}

_{2252/1168≈2}

(39)

プログラム挙動の確認



命令の実行に要したサイクルの評価



実行した命令数



_{inst_executed (prof_inst_executed)}



_{Warpが起動したサイクル数}



_{active_cycles (prof_cycle)}

(40)

#define N (8*1024)

//64kBに収める

#define Nbytes (N*sizeof(float))

#define NT (256)

#define NB (N/NT)

global void init(float *a,

float b, float c){

int i = blockIdx.x*blockDim.x

+ threadIdx.x;

a[i] = 1.0;

b[i] = 2.0;

c[i] = 0.0;

}

global void add(float *a,

float b, float c){

int i = blockIdx.x*blockDim.x

+ threadIdx.x;

c[i] = a[i] + b[i];

int main(void){

float a,b,*c;

cudaMalloc((void **)&a, Nbytes);

cudaMalloc((void **)&b, Nbytes);

cudaMalloc((void **)&c, Nbytes);

init<<< NB, NT>>>(a,b,c);

add<<< NB, NT>>>(a,b,c);

return 0;

}

GPUプログラム（グローバルメモリ利用）

(41)

#define N (8*1024)

#define Nbytes (N*sizeof(float))

#define NT (256)

#define NB (N/NT)

constant float a[N],b[N];

global void init(float *c){

int i = blockIdx.x*blockDim.x

+ threadIdx.x;

c[i] = 0.0f;

}

global void add(float *c){

int i = blockIdx.x*blockDim.x

+ threadIdx.x;

c[i] = a[i] + b[i];

}

int main(void){

float *c;

float host_a,host_b;

int i;

host_a=(float *)malloc(Nbytes);

host_b=(float *)malloc(Nbytes);

cudaMalloc((void **)&c,Nbytes);

for(i=0;i<N;i++){

host_a[i] = 1.0f;

host_b[i] = 2.0f;

}

cudaMemcpyToSymbol

(a,host_a,Nbytes);

cudaMemcpyToSymbol

(b,host_b,Nbytes);

init<<< NB, NT>>>(c);

add<<< NB, NT>>>(c);

return 0;

}

コンスタントメモリ（単純な置き換え）

vectoradd_constant.cu

(42)

#define N (8*1024)

#define Nbytes (N*sizeof(float))

#define NT (256)

#define NB (N/NT)

constant float a, b;

global void init(float *c){

int i = blockIdx.x*blockDim.x

+ threadIdx.x;

c[i] = 0.0f;

}

global void add(float *c){

int i = blockIdx.x*blockDim.x

+ threadIdx.x;

c[i] = a + b;

}

int main(void){

float *c;

host_a=1.0f;

host_b=2.0f;

cudaMalloc((void **)&c,Nbytes);

//host_a,host_bが配列ではないので

//アドレスを取り出すために&を付ける

cudaMemcpyToSymbol

(a,&host_a,sizeof(float));

cudaMemcpyToSymbol

(b,&host_b,sizeof(float));

init<<< NB, NT>>>(c);

add<<< NB, NT>>>(c);

return 0;

}

コンスタントメモリ（同一アドレス参照）

(43)

export CUDA_PROFILE=1 #

nvcc ‐arch=sm_20 vectoradd.cu #

export CUDA_PROFILE_CONFIG=prof_inst_executed export CUDA_PROFILE_LOG=vectoradd_profile_inst_executed.log

./a.out #

export CUDA_PROFILE_CONFIG=prof_cycle

export CUDA_PROFILE_LOG=vectoradd_profile_cycle.log

./a.out # # nvcc ‐arch=sm_20 vectoradd_constant.cu # export CUDA_PROFILE_CONFIG=prof_inst_executed export CUDA_PROFILE_LOG=vectoradd_constant_profile_inst_executed.log

./a.out #

export CUDA_PROFILE_LOG=vectoradd_constant_profile_cycle.log

./a.out # # nvcc ‐arch=sm_20 vectoradd_broadcast.cu # export CUDA_PROFILE_CONFIG=prof_inst_executed export CUDA_PROFILE_LOG=vectoradd_broadcast_profile_inst_executed.log

./a.out #

export CUDA_PROFILE_LOG=vectoradd_broadcast_profile_cycle.log

./a.out #

コンパイルと実行（シェルスクリプト）

(44)

実行結果

（ベクトル和のみ抜粋

, Occupancyはどちらも1）



_{inst_issuedの値が減少}



_{active_cyclesの値が大幅に増加}



同じ処理の実行に余分なサイクルが必要



_{1サイクル（クロック）あたりの命令実行数(≠演算回数)}



グローバルメモリ 408/ 658=0.62



コンスタントメモリ 336/2984=0.11

内容

グローバルメモリ

コンスタントメモリ

ブロードキャスト

gputime+cputime

（2回の平均）

2.96+49

=51.96

7.216+53.5

=60.716

2.96+49.5

=52.46

inst_executed

408

336

288 active_cycles

658 2984

599

(45)

簡易的なプロファイリング

(CUDA 5以降)



実行と同時にプロファイル結果を出力



_{$ nvprof ./a.out}



_{$ nvprof ‐‐query‐events}



プロファイルできるイベント一覧を表示



_{$ nvprof ‐events inst_executed ./a.out}

(46)

Nsight



デバッグ・プロファイルツール



https://developer.nvidia.com/nsight‐eclipse‐edition



https://developer.nvidia.com/nvidia‐nsight‐visual‐studio‐edition



_{Visual StudioやEclipseと統合し，開発→デバッグ→}

プロファイリングが可能

(47)

Available Events



The profiler supports logging of following counters during kernel execution

only on GPUs with Compute Capability 2.0 or higher

Name

description

instructions_issued

Number of instructions issued including replays

instructions_issued1_0

Number of cycles that issue one instruction for instruction group 0

instructions_issued2_0

Number of cycles that issue two instructions for instruction group ₀

instructions_issued1_1

Number of cycles that issue one instruction for instruction group 1

instructions_issued2_1

Number of cycles that issue two instructions for instruction group ₁

instructions_executed

Number of instructions executed, do not include replays

warps_launched

Number of warps launched in a SM

threads_launched

Number of threads launched in a SM

active_cycles

Count of cycles in which at least one warp is active in a _{multiprocessor}

active_warps

Accumulated count of no. of warps which are active per cycle in a multiprocessor. Each cycle increments it by the number of warps active in that cycle (in range 0‐48)

shared_load

Number of executed shared load instructions per warp in a SM

shared_store

Number of executed shared store instructions per warp in a SM http://www.cs.cmu.edu/afs/cs/academic/class/15668‐s11/www/cuda‐doc/Compute_Profiler.txt

(48)

Available Events



The profiler supports logging of following counters during kernel execution

only on GPUs with Compute Capability 2.0 or higher

Name

description

l1_global_load_hit

Number of global load hits in L1 cache

l1_global_load_miss

Number of global load misses in L1 cache

l1_local_load_hit

Number of local load hits in L1 cache

l1_local_load_miss

Number of local load misses in L1 cache

l1_local_store_hit

Number of local store hits in L1 cache

l1_local_store_miss

Number of local store misses in L1 cache

l1_shared_bank_conflict

Count of no. of bank conflicts in shared memory

uncached_global_load_transacti

on

Number of uncached global load transactions. This increments by 1, 2 or 4 for 32, 64 and 128 bit accesses respectively

uncached_global_load_transacti

on

Number of uncached global load transactions. This increments by 1, 2, or 4 for 32, 64 and 128 bit accesses respectively

(49)

Available Events



The profiler supports logging of following counters during kernel execution

only on GPUs with Compute Capability 2.0 or higher

Name

description

l2_subp0_write_sector_misses

Accumulated write sector misses from L2 cache for slice 0 for all _{the L2 cache units}

l2_subp1_write_sector_misses

Accumulated write sectors misses from L2 cache for slice 1 for all _{the L2 cache units}

l2_subp0_read_sector_misses

Accumulated read sectors misses from L2 cache for slice 0 for all _{the L2 cache units}

l2_subp1_read_sector_misses

Accumulated read sectors misses from L2 cache for slice 1 for all _{the L2 cache units}

l2_subp0_write_sector_queries

Accumulated write sector queries from L1 to L2 cache for slice 0 of _{all the L2 cache units}

l2_subp1_write_sector_queries

Accumulated write sector queries from L1 to L2 cache for slice 1 of _{all the L2 cache units}

l2_subp0_read_sector_queries

Accumulated read sector queries from L1 to L2 cache for slice 0 of _{all the L2 cache units}

l2_subp1_read_sector_queries

Accumulated read sector queries from L1 to L2 cache for slice 1 of _{all the L2 cache units}

パフォーマンス解析ツール 長岡技術科学大学電気電子情報工学専攻出川智啓