PowerPoint プレゼンテーション

(1)

「京」における OpenFOAM の性能評価

1

2014.12.19

RIST 神戸センター

産業利用推進室（ＡＰ東京）

井上義昭

(2)

基本プロファイラ採取 / 分析 (fipp)

9

• 採取範囲の指定：初期処理などを外してのfipp_start/stopでrange指定を行う.

• コンパイルflag: -Nline （.soでの行情報追加）を追加、soのサイズが巨大化に要注意.

• 実行環境：動作メモリが増加、今回は、メモリに余裕を持たせて,通常8MPI_proc/node

を4MPI_proc/node割り当てで採取した

• 実行シェル： -L sharedに加えて、-mメモリ指定も必要.

fipp -C -Ihwm,call -P userfunc -d dir-name -m 120000 -L shared -Srange mpiexec …

• 実行時間：今回、fippなしで通常500secの実行が -Nline付では5000secに増加.

• 採取データの表示： IDE/profiler、及び、fipppx -A -Icall -d dir-name -l50 -p0,limit=1 -tcsv > file.csv で実施

（参考）採取方法TIPS

• ホット . スポット調査

• MPI 処理の割合

• 処理グループの調査

• グループ内の関数コスト分布

• コスト大関数とソースファイルの紐付け

(10)

MPI数が大きくなるにつれて、

UIPstream(MPI_recv)/UOPstream(_send),waitRequests(_wait)やallReduceなどMPI_通信系ラッパー）に上位を占められてしまう！

MPI数が320程度と少ない内は、まだ、演算系関数 (DICPre,lduMatrix

,PCG)がコスト上

位だが….

ホット . スポット調査 (fipp/GUI 表示）

10

320mpi

 fipp Cost情報: 基本プロファイラfippで採取したデータを、プログラミング

支援ツール（GUI)のProfilerアイコンをクリックして表示.

ホット.スポット関数一覧（cost順,内MPIcost）

補）Barチャート (赤：MAX,緑：AVG, 青：MIN)は、

MPI_rankでの

imbaranceを表現.

1280mpi

(11)

MPI 処理の割合

320mpi

1280mpi

11

演算処理コスト

MPI.cost

演算処理コスト rank

rank cost

cost

MPI数が320から1280になった結果、

MPI関連の関数のコスト (MPI.cost)が、支配的になってしまっている

 fipp Cost情報(GUI表示でのCostStack Chart)

all.cost

(12)

level application(proc.0) 320mpi

0 main 4,272

1 Foam::fvMatrix<double>::solve( 2,693 1 Foam::gMax<double>(const Fo 832

1 Foam::fvc::surfaceSum<double 197

1 Foam::MULES::explicitSolve(Fo 141

1 Foam::incompressible::RASMod 96

12

処理グループの調査

 fipp call.graph機能

Process 0 ---+ (例：proc.0のcall.graph)

| 0% <0> main [0 / 4272]

| 0% <1> Foam::gMax(….) [1 / 832]

| 0% <2> Foam::reduce ….. [0 / 831]

| 0% <3> Foam::Pstream::gather ... [0 / 831]

###| 19% <4> Foam::UIPstream::read(…) [831]

| 0% <1> Foam:: … [a/b]

 mainからの呼び出し関数一覧（ネストレベル１の関数のb値をcost順にソート. これをグループ単位とする)

A B

2つの関数処理グループA, B、MPI数増加で２つの傾向. CostはProc.0での値としているため、B の様に減少(scal)するのが正常だが、A の大幅増加(12倍）は問題！

<n> : 呼び出しネストレベルを表し、上

下の関数は、その呼び出し関係を示す例：main>gMax>reduce>gather>read

[a/b]: aは当該関数自身のcostであり、

bは配下の関数（含a)のcostの総和.

[情報の見方]

leve application(proc.0) 1280mpi 0 main 3,253

1 Foam::MULES::explicitSolve 1,745

1 Foam::fvMatrix<double>::sol 854

1 Foam::gMax<double>(const 203

1 Foam::fvc::surfaceSum<dou 182

1 Foam::DimensionedField<do 134

1 Foam::incompressible::RASM 35 関数の呼び出し.ネスト関係（親/子/孫）と関数配下のコストを調査.

(13)

1.MULES::explicitSolve >2. ::explicitSolve::limit > 3. ::limiter(35->7)

>4.syncTools::syncFaceList

>5::syncBoundaryFaceList >6. streamBuffers::finishedSends >7.Pstream::exchange(->2)

> 8.combineReduce

>9.combineGather

> 10.IPstream > 11.UIPstream >12.UIPstream (44->415)

> 10.operator > 11.operator >12.UIPstream::read(1->9) > 9.combineScatter

> 10.Opstream > 11.UOPstream

> 12.UOPstream > 13.::write(1->27) > 10.operator (1) >11.operator

> 12.UOPstream:write(12->233)

> 13.List::setSize(2->80) >8.::waitRequests(25->904)

>2. ::explicitSolve > 3.GeometricField > 4.Upstream::waitRequests(3->57) …….

A. MULES 関数 Group （コスト増加）

13

explicitSolve配下の関数のコスト分布のMPI数増加での

変化を調査. （括弧内:当該関数コストの変化）

①

② ①，②、③配下のMPI通信処理でのコストがMPI数増加で急増！

 fipp callgraph 情報をもとに、 MULES.Group 内の関数構造及びコストを調査

これらの関数でのMPI通信の発生や性能の妥当性の調査が必要！詳細プロファイラfapp採取

MPI処理関数制御関数

※MPI数によるコストの変化

{320mpi(141)->1280mpi(1745)/proc0}

③

(14)

B. fvMatrix 関数 Group （コスト Scal ）

1.fvMatrix::solve

> 2.::solveSegated

> 3. PCG::solve(398>88) > 4.lduMatrix::solver

> 5.gAverage > 6.sumReduce > 7.allReduce (1) > 5.gSum > 6.reduce > 7.allReduce (1) > 4.lduMatrix::Amul (494>162)

> 5.::updateMatrixInterfaces(60>13)

> 6.waitRequests(103>10)

> 5. ::initMatrixInterfaces(25>8) > 6. Thunk for > 7.::read(1>3) > 7. ::write(4>5) > 4. reduce > 5.allReduce(496>281)

> 4. DICPreconditioner(833>199) > 4. sumMag (170>43)

> 3.sumProd (104>38) ……

｛320mpi(2693)>1280mpi(854)/proc.0｝

次の詳細プロファイラ（fapp)では、これらの各関数の詳細特性を採取する 14

MPI系関数の割合は、それほど大きくなく、

演算系の関数で、ホットスポットとして

リストアップされていた関数が、このグループ配下でコストを占めている.

MPI処理関数演算関数

 fipp callgraph 情報をもとに、 fvMatrix 配下の関数の構造及びコストを調査

Preconditioning_CG法

行列ベクトル積

対角ベース不完全コレスキー分解

(15)

コスト大関数とソースファイルの紐付け

15

 fippのsourceView機能：ソースファイル別のコストや行コストが調査できる

ソース行レベルでのCOSTを調査する場合、ソールファイルのパスを指定

次の詳細プロファイラfappの区間設定時などにも,この情報は便利！

ソースファイル.Cとコストの関係が簡単を調査！

UIPread.Cは、

45%のコストを持ち MPI_Probe処理が重い

DICPreconditioner .Cは、6%のコストで左の２つのLINEにコストが集中

などが簡単に調査

(16)

（参考）採取方法 TIPS

16

• 採取区間の設定: fapp_start(“id”,n,level);_stop();で区間を挟む. “id”,nでユニークであり、

区間の重複は可能. levelは概要を1として、詳細の段階毎に値を大きくしておく.

• 採取条件：採取のレベルに合わせて実行時、レベル値を設定する.

• 採取実行：fapp -C -d dir-name -Ihwm,mpi –Lレベル値 -Hevent=Statistics mpiexec ……

• 採取時間：1回のHWモニター採取で数usec程度のoverheadとなるため、呼び出し回数

が多い(>10^6)関数の場合、区間の設定には注意が必要.

• データ表示: 次の様なfapppxコマンド、あるいは、IDL/profilerで表示 fapppx -A -Ihwm -d F -p0,limit=1 -tcsv > c1.csv

詳細プロファイラ採取 / 分析 (fapp)

• main プログラムでの採取区間の設定 (top.down)

• main プログラムでの区間特性の分析

• コスト大関数の区間設定と検証 (bottom.up)

(17)

in t main (in t argc , c h ar * argv[]) { fapp_start(" in te rFoam" ,1 ,1 ) ; #include "setRootCase.H"

#include "createTime.H"

#include …

fapp_stop(" in te rFoam" ,1 ,1 ) ;

Info<< "\nStarting time loop\n" << endl;

fapp_start(" in te rFoam" ,2 ,1 );

wh ile (ru n Time .ru n ()) {

#include "readTimeControls.H"

#include "CourantNo.H"

#include … ru n Time + + ;

Info<< "Time = " << runTime.timeName() << nl << endl;

twoPhaseProperties.correct();

fapp_start(" in te rFoam" ,3 ,1 );

# in c lu de " alph aEqn Su bCyc le .H"

fapp_stop(" in te rFoam" ,3 ,1 ) ; interface.correct();

// --- Pressure-velocity PIMPLE corrector loop while (pimple.loop())

{

#include "UEqn.H"

// --- Pressure corrector loop while (pimple.correct())

{ fapp_start(" in te rFoam" ,4 ,1 );

# in c lu de " pEqn .H"

fapp_stop(" in te rFoam" ,4 ,1 ) ; } if (pimple.turbCorr())

{ turbulence->correct(); } }

runTime.write();

Info<< "ExecutionTime = " << runTime.elapsedCpuTime() <<

<< " ClockTime = " << runTime.elapsedClockTime() << "

}

fapp_stop(" in te rFoam" ,2 ,1 ) ; Info<< "End\n" << endl;

main プログラムでの採取区間設定 (top.Down ）

17

interFoam_1

（初期処理）

 interFOAM_mainソース（含fapp区間設定行）

interFoam_4

(pEqn >fvMatrix）

interFoam_3(alphaEqnSubCycle > MULES:explicitSolve ) interFoam_2(イタレーションloop処理）

シュミレーションTimeの更新 fapp区間設定(fapp_start,同_stop)

は、”ID”+番号と、採取条件で指定

 UEqn:速度の計算

 pEqn:圧力の計算

 流体率(α）の計算

 実行時間(CPU,Elapse)のlog表示

runTime==指定EndTime でプログラム終了 (今回15イタレーションloopと短処理）

runTime++

 処理フロー(２層流解析）

全ての処理を網羅するよう区間指定

(18)

main プログラムでの区間特性の分析

18

fapp区間毎のElapse時間を、mpi数を変えて計測してみた.Loop処理(interFoam_2)の性能を

確認、pEqn.Hはscalしていくが、aphaEqnsubCycleが並列数に比例して増加、全体性能のボ

トルネックになってしまっている。次は、区間内の関数の特定が必要！

interFoam_2 (Loop処理）

(初期処理）

interFoam_1 alphaEqn

SubCycle

>MULES Time.sec

mpi.proc

pEqn.H

>fvMatrix

Proc平均

(19)

 区間設定

fippで調査した処理グループ（MULES,fvMatrix）内で、コスト大の関数に対して、fipp

SourceView情報を元に、fapp区間（関数の開始～終了）を設定し、実行時間/HWモニター情

報/MPI特性を採取. MPI通信の場合、上位の管理関数に設定したり、必要に応じて関数内の処理ブロック/コスト集中のloop処理にも設定(区間の重複には解析時注意が必要).

 測定（MPI数を変化）

コスト大関数の区間設定（ bottom.UP)

19

mpi並列数が640を超えた辺りから、

実行時間が急増!!

演算系処理

(pEqn >fvMatrix)

Time(s)

MPI並列数

データ交換処理 (alphaEqnCycle >MULES）

fvMatrix内の関数が綺麗にscalしているのと対照的に、

MULES内exchange 関数はmpi数増加により急増

設定した関数レベルでの区間特性で性能問題が再現

Proc平均

(20)

コスト大関数の区間検証 (top.down/bottom.up)

20 Time(s)

MPI並列数(640)

283

mainで設定(top.down) した区間と、コスト大関数 (bottom.up)での区間の実行時間を比較検証

mainでの両処理（pEqn, alphaEqn)とも7割程度を、

コスト大関数のコストでカバーしている事が検証できた！

Proc平均

Mainプログラムで設定した区間の実行時間

コスト大関数の区間での実行時間

pEqn.H

alphaEqn

MULES ｆｖMatrix

今後は、コスト大関数の詳細な特性調査にフォーカスする

(21)

 それにしても、どうやれば通信時間が、こんな

「ノコギリ型の形状」になるのか？（謎解きの始まり）

fapp_ID/MPI_Comm Kind Elapsed(s) Wait(s) Byte Call

in te rFoam 2 615,573 100,757 ---- 797,308,842

MPI::Comm::Allre du c e AVG 8 1 7 9 16 52,797

MAX 1 3 9 1 3 7 16 52,797

MIN 4 5 4 3 16 52,797

MPI::Comm::Isend AVG 1 0 8,505 273,295

MAX 3 0 18,260 659,820

MIN 0 0 4,856 94,260

MPI::Comm::Irecv AVG 0 0 8,505 273,295

MAX 1 0 18,260 659,820

MIN 0 0 4,856 94,260

MPI::Re qu e st::Waitall AVG 9 2 0 0 18,852

MAX 3 3 3 0 0 18,852

MIN 3 0 0 18,852

MPI::Comm::Se n d AVG 8 0 8 3 3 ,6 6 1 1,329

MAX 1 2 8 0 1 ,8 3 1 ,7 7 6 6,622

MIN 0 0 3 ,5 5 3 728

MPI::Comm::Recv AVG 1 0 8 3 3 ,6 6 1 1,329

MAX 2 0 1 ,8 3 1 ,7 7 6 8,008

MIN 1 0 1 4 0 ,1 0 7 602

MPI::Comm::Probe AVG 2 9 7 0 0 336

MAX 4 2 6 0 0 1,848

MIN 8 1 0 0 168

通信処理系の分析・改善

(interfoam2 >alphaEqnSubCycle/MULES)

21

Allreduce処理：16Byte,Elapsed大だが、waitも大から、殆どは他mpi_proc待ち時間！

非同期send/revcと、その完了待ちwaitall、

Elapse大は、他mpi_proc待ち?

同期sendと同期recvであるが、recvの待ちを probeが吸収してElapse大、Msg長が大.

proc

 区間interFoam2(イタレーションloop)のfapp_MPI情報(1280mpi)

Probe_Elapse時間(proc分布）

t

A

B

C

(22)

void Pstre am::e xc h an ge forAll(sendBufs, procI)

{ nsTransPs[procI] = sendBufs[procI].size( } fapp_start("exchage",3,1) ;

// Send sizes across. Note: blocks.

c ombin e Re du c e (size s, UPstre am::listEq(), tag);

fapp_stop("exchage",3,1) ;

label startOfRequests = Pstream::nRequests();

// Set up receives

recvBufs.setSize(sendBufs.size());

forAll(sizes, procI)

{ label nRecv = sizes[procI][UPstream::myProcNo()];

if (procI != Pstream::myProcNo() && nRecv > 0) { recvBufs[procI].setSize(nRecv);

UIPstre am::re ad (Upstre am::n on Bloc kin g,proc I,…..);

} }

// Set up sends forAll(sendBufs, procI)

{ if (procI != Pstream::myProcNo() && sendBufs[procI].size() > 0) { if(!UOPstre am:write (Upstre am::n on Bloc kin g,proc I,….)) }

}

fapp_start("exchage",6,1) ; // Wait for all to finish if (block)

{ Pstre am::waitRe qu e sts(startOfRe qu e sts); } fapp_stop("exchage",6,1) ;

}

fapp_MPI情報(1 2 8 0 mpi) Elapsed(s) Byte Call

e xc h age 3 397,008 ---- 2,187,090

MPI::Comm::Send AVG 8 3 ,2 9 9 ,8 5 9 342

MAX 129 6 ,5 6 3 ,8 4 9 1,881

MIN 0 1 5 ,3 6 9 171

MPI::Comm::Recv AVG 0 3 ,2 9 9 ,8 5 9 342

MAX 1 6 ,5 6 3 ,8 4 9 1,881

MIN 0 6 0 5 ,5 6 5 171

MPI::Comm::Probe AVG 3 0 2 0 342

MAX 4 3 4 0 1,881

MIN 8 2 0 171

exchange 処理

22

[MPI_Irecv]

[MPI_Isend]

[MPI_waitall ]

上流でのimbalanceの吸収のために待ち時間が大に

(※barrire挿入での激減を確認済み）なっているだけ

異常なProbe*時間とMsg長大のS/R特性

>> combineReduceコードを調査

非同期 Send/Recv の完了待ち

(*Recvの完了確認）

e xc h age 6 93,280 ---- 218,880

MPI::Request::Waitall AVG 7 3 0 171

MAX 3 3 4 0 171

MIN 0 0 171

 exchangeプログラムソース（含fapp区間設定行）

（※barrire挿入確認箇所）

C

B

(23)

template<class T, class CombineOp>

void Pstre am::c ombin e Gath e r fapp_start("combine",1,2);

// Get my communication order

const commsStruct& myComm = comms[UPstream::myProcNo()];

// Receive from my downstairs neighbours forAll(myComm.be low(), be lowI) {

label belowID = myComm.below()[be lowI];

T value;

UIPstre am::re ad (Upstream:scheduled,belowID,….);

cop(Value, value);

}

fapp_stop("combine",1,2);

fapp_start("combine",2,2);

// Send up Value

if (myComm.above() != -1) {

UOPstre am::write (Upstream:scheduled,myComm.above(),…);

}

template<class T>

void Pstre am::c ombin e Sc atte r if (UPstream::parRun())

{ fapp_start("combine",3,2);

// Get my communication order

const UPstream::commsStruct& myComm = comms[UPstream::myProcNo(

// Reveive from up if (myComm.above() != -1) {

UIPstream::read(Upstream:scheduled,myComm.above(),…) }

fapp_start("combine",4,2);

// Send to my downstairs neighbours forAll(myComm.be low(), be lowI) {

label belowID = myComm.below()[belowI];

UOPstream::write(Upsream::scheduled,belowID,…);

}

combine_reduce>gather/Scatter 処理

23

①[MPI_Prebe&recv]

②[MPI_send]

①配下(below)の子rank達からのRecv

待ち※. 全部、集めたら..

②上位(above)の親rankにsendする.

③[MPI_Probe&recv]

③Recvを発行し、親rank（above)からの受信待ち. 受け取ったら..

④配下(below)の子rank達へsendする※

④[MPI_send]

※①、④で、どの rank 順序で処理するかが、性能面で重要 !

Gather及びScater処理を実行. OpenFOAMではこれらをバイナリ.ツリー形式の同期send/revcで実装！

 ソースコード（含fapp採取行）

 処理フロー

データを集めるデータばら撒き

(24)

配下の子rankの送信では、より上位proc

（より分配先を持つ）から下位procへと処理する（proc0では、S8>s4>s2>s1の順番で）

sample scatter 　

proc 1 2 3 4 0 s8 s4 s2 s1

1 r0

2 r0 s3

3 r2

4 r0 s6 s5

5 r4

6 r4 s7

7 r6

8 r0 s12 s10 s9

9 r8

10 r8 s11

11 r10

12 s8 s14 s13

13 r12

14 r12 s15

15 r14

sample gather

proc 1 2 3 4

0 r1 r2 r4 r8

1 s0

2 r3 s0

3 s2

4 r5 r6 s0

5 s4

6 r7 s4

7 s6

8 r9 r10 r12 s0

9 s8

10 r11 s8 11 s10

12 r13 r14 s8

13 s12

14 r15 s12 15 s14

（参考）一般的な gather/scatter 処理

24

①Gather：配下の子rank達からの受信 ④Scatter：配下の子rank達に送信

※効率よく多重で処理が実行される通信の順序は以下のもの（16procでの例）

ｔｔ

sn:send n ｒn:recv n

n:procNo ⁽^上位⁾

（下位) (上位) （下位)

配下の子rankの受信では、より下位proc

（より集配先を持たない）から上位procへと

処理する. （OpenFOAMも同順序と確認）

Dataの流れ

並列収集並列分配

(25)

sample scatter

proc 0 1 2 3 4 5 6 7 8 9 10

0 s1 s2 s4 s8

1 r0 r0

2 r0 r0 s3

3 r2 r2

4 r0 r0 s5 s6

5 r4 r4

6 r4 r4 s7

7 r6 r6

8 r0 r0 s9 s10 s12

9 r8 r8

10 r8 r8 s11

11 r10 r10

12 ｒ8 r8 s13 s14

13 r12 r12

14 r12 r12 s15

15 r14 r14

ところが、現状の combine_Scatter では

t

Recv発行

各procのSend実行

Recv完了

（MPI_Probe時間)

Gatherと同じ下位procから,Sendを開始しているため、2分岐での並列通信が効率よく実施されていない.

例えばproc0のs8 やproc4のs6等の Tree構造の分岐点の出遅れ発生！

ｆapp MPI情報(MPI_Probe.Elapse) 1280MPI/IDE表示

proc t

Probe時間

25

※ノコギリ形状

（下位) (上位)

sn:send n ｒn:recv n n:procNo

BASE

（例16proc)

Dataの流れ

(26)

sample scatter reverse

proc 0 1 2 3 4

0 s8 s4 s2 s1

1 r0 r0

2 r0 r0 s3

3 r2 r2

4 r0 r0 s6 s5

5 r4 r4

6 r4 r4 s7

7 r6 r6

8 r0 r0 s12 s10 s9

9 r8 r8

10 r8 r8 s11

11 r10 r10

12 ｒ8 r12 s14 s13

13 r12 r12

14 r12 r12 s15

15 r14 r14

// Send to my downstairs neighbours forAll(myComm.be low(), be lowI)

forAllRe ve rse (myComm.be low(), be lowI) {

label belowID = myComm.below()[belowI];

UOPstream::write (

UPstream::scheduled, belowID,

reinterpret_cast<const char*>(&Value), sizeof(T),

tag );

}

改善 combine_Scatter

（MPI_Probe時間)

fapp MPI情報(MPI_Probe.Elapse) 1280MPI/IDE表示

t

Probe時間

proc 26 myComm（proc.8の場合)

below

above ^proc0 12 proc9

10

①

②

Sendの順序をGather時とは逆順(下位からではなく上位 procから開始）に変更.

コードとしては、forAllマクロ(①前からLoop）を、forAllReverse(後ろから②）マクロに修正

• Combine_Scatter修正code

(上位) （下位)

(上位) (下位)

1行修正

10

s15 r14

BASE

t

（例16proc)

sn:send n ｒn:recv n n:procNo

Dataの流れ

(27)

性能向上

27

exchange 時間

が大幅短縮

しかし、 exchange 時間

の増加は止められず、

抜本的な対策が不可欠

BASE: オリジナル

REV 版 :combine_Scatter ( ｆ orAllRevese に修正 )

Time(s)

mpi mpi

Proc平均

※320mpi以下では、exchangeの占める時間自体が小さく、性能向上の効果は微小

(28)

演算処理系の分析・改善 (pEqn/fvMatrix)

28

fapp ID 640mpi Elapsed(s) MFLOPS MFLOPS /PEAK(%)

_chip(GB/

S) /PEAK(%) SIMD(%)

all.0(program) 560 231 1.4 19 29 1

interFoam.1 (init) 42 156 1.0 12 19 1

interFoam.2 (loop) 506 243 1 .5 20 31 1

interFoam.3 > alphaEqn 44 109 0.7 9 15 0

interFoam.4 > pEqn 405 276 1 .7 22 3 4 1

PCG 1 10 532 3 .3 47 73 13

PCG 2 21 488 3 .0 44 69 17

lduMat 66 506 3 .2 31 48 1

DICP 110 437 2 .7 32 50 0

sumMag 56 95 0 .6 4 7 0

sumProd 16 643 4 .0 34 53 0

 fapp_HWモニタ情報

(640mpi:REV版）

イタレーションloop処理特性

(pEqn/fvMatrixグループ特性)

 演算系関数の実行性能は、

一部を除き3-4%と健全.

 高メモリバンド幅使用率と低SIMD化率が特長

 fapp HWモニター情報（Hevent=Statistics）

内

訳 /fvMatrix

 コンパイラーの最適化メッセージを表示

（-Koptmsg=2)、SIMD化やSWP（ソフトウエアパイプライン）の状況を調査

MFLOPS実行効率%

 MPIスケラビリティ

SuperScal

低性能  BASE版は、全関数を共通のコンパイラ

flag(-O3)でビルド（-Kfastの副作用によ

る結果不正を回避するため）

Proc平均

(29)

Foam::solve rPe rforman c e Foam::PCG::solve // --- Solver iteration

do {

fapp_start("PCG",3,2) ;

// --- Store previous wArA wArAold = wArA;

// --- Precondition residual

pre c on Ptr- > pre c on dition (wA, rA, c mpt);

// --- Update search directions:

wArA = gSu mProd(wA, rA);

fapp_stop("PCG",3,2) ; fapp_start("PCG",1,1) ;

if (solverPerf.nIterations() == 0)

{ for (register label cell=0; cell<nCells; cell++) { pAPtr[cell] = wAPtr[cell]; }

} else

{ scalar beta = wArA/wArAold;

for (register label cell=0; cell<n Ce lls; cell++)

{ pAPtr[c e ll] = wAPtr[c e ll] + beta*pAPtr[c e ll]; } }

fapp_stop("PCG",1,1) ;

fapp_start("PCG",4,2) ;

// --- Update preconditioned residual

matrix_.Amu l(wA, pA, in te rfac e Bou Coe ffs_, in te rfac e sc alar wApA = gSu mProd(wA, pA);

fapp_stop("PCG",4,2) ; // --- Test for singularity

if (solverPerf.checkSingularity(mag(wApA)/normFactor)) break fapp_start("PCG",2,1) ;

// --- Update solution and residual:

scalar alpha = wArA/wApA;

for (register label cell=0; cell<n Ce lls; cell++) {

psiPtr[cell] += alpha*pAPtr[c e ll];

rAPtr[cell] -= alpha*wAPtr[c e ll];

}

fapp_stop("PCG",2,1) ; fapp_start("PCG",5,2) ;

solve rPe rf.fin alRe sidu al() = gSu mMag(rA)/ n ormFac to fapp_stop("PCG",5,2) ;

} while

( solverPerf.nIterations()++ < maxIter_

&& !(solverPerf.checkConvergence(tolerance_, relTol_))

PCG

29

PCG1/2とも単純な1重loopでのメモリ連続accessの演算でLoop長も十分

概ね良好の性能を発揮.

> DICPreconditioner呼び出し > lduMatrix_Amul呼び出し

mpi分割数 nCells(K) nFaces(K)

320 290 856

640 145 428

1280 73 214

 コンパイルMSGから、SIMDされているが未SWP化状態と判明

jwd6001s-i "../test_wmake/PCG.C", line 151: ループ制御変数'cell'のループをSIMD化しました。

jwd8662o-i "../test_wmake/PCG.C", line 151: スケジューリング結果を得られなかったため、ソフトウェアパイプライニングを適用できません。

jwd8202o-i "../test_wmake/PCG.C", line 151: このループを展開数4回でループアンローリングしました。

(30)

void Foam::ldu Matrix::Amu l {

// Initialise the update of interfaced interfaces fapp_start("lduMat",4,2) ;

initMatrixInterfaces fapp_stop("lduMat",4,2) ; fapp_start("lduMat",1,1) ;

register const label nCells = diag().size();

fapp_start("lduMat",3,2) ;

for (register label cell=0; cell<nCells; cell++) {

ApsiPtr[cell] = diagPtr[cell]*psiPtr[cell];

}

fapp_stop("lduMat",3,2) ; fapp_start("lduMat",2,2) ;

for (register label face=0; face<n Fac e s; face++) {

ApsiPtr[uPtr[face]] += lowerPtr[face]*psiPtr[lPtr[face]];

ApsiPtr[lPtr[face]] += upperPtr[face]*psiPtr[uPtr[face]];

}

fapp_stop("lduMat",2,2) ; fapp_stop("lduMat",1,1) ; fapp_start("lduMat",5,2) ; // Update interface interfaces updateMatrixInterfaces

tpsi.clear();

fapp_stop("lduMat",5,2) ;

void Foam::DICPre c on dition e r::pre c on dition { fapp_start("DICP",1,1) ;

fapp_start("DICP",2,2);

for (register label cell=0; cell<nCells; cell++) {

wAPtr[cell] = rDPtr[cell]*rAPtr[cell];

}

fapp_stop("DICP",2,2) ; fapp_start("DICP",3,2);

for (register label face=0; face<n Fac e s; face++) {

wAPtr[u Ptr[fac e ]] -= rDPtr[u Ptr[fac e ]]*upperPtr[face]*wAPtr[lPtr[fac e ]];

}

fapp_stop("DICP",3,2) ; fapp_start("DICP",4,2);

for (register label face=n Fac e sM1 ; face>=0; face--) {

wAPtr[lPtr[fac e ]] -= rDPtr[lPtr[fac e ]]*upperPtr[face]*wAPtr[u Ptr[fac e ]];

}

fapp_stop("DICP",4,2) ; fapp_stop("DICP",1,1) ;

lduMat/DICP

loop長情報

30

mpi分割数 nCells(K) nFaces(K)

320 290 856

640 145 428

1280 73 214

lduMat.DICPともコスト大のloopはindirectアクセスの演算であり、高メモリバンド幅の特性は、これに起因している. また、順アクセスのloopも有り. コンパイルMSGから、SWP及び、SIMD化が出来ていない事が判明(indirectのため不可）

jwd8662o-i "DICPreconditioner.C", line 129: スケジューリング結果を得られなかったため、ソフトウェアパイプライニングを適用できません。

jwd8202o-i "DICPreconditioner.C", line 129: このループを展開数6回でループアンローリングしました。

jwd6202s-i "DICPreconditioner.C", line 131: ループ中で変数'wAPtr'を定義する順序が逐次実行と異なるため、SIMD化できません。

(31)

421 template<class Type>

422 scalar su mMag(const UList<Type>& f) 423 {

424 if (f.size())

425 { fapp_start("sumMag",1,1) ; 426 scalar SumMag = 0.0;

4 2 7 TFOR_ALL_S_OP_FUNC_F(sc alar, Su mMag, + = , mag, Type , f) 428 fapp_stop("sumMag",1,1) ;

429 return SumMag;

sumMag/sumProd

31

#define TFOR_ALL_S_OP_FUNC_F(typeS, s, OP, FUNC, typeF, f) ¥ /* set access to f at end of field */ ¥

List_CONST_ACCESS(typeF, f, fP); ¥ /* loop through field performing s OP f */ ¥ List_FOR_ALL(f, i) ¥ (s) OP FUNC(List_ELEM(f, fP, i)); ¥ List_END_FOR_ALL

register const Type* __restrict__ fp = (f).begin()； register label i = (f).size();

while (i--) {(SumMag) += mag((*fp++)); }

マクロ展開

91 template<>

92 scalar su mProd(const UList<scalar>& f1, const UList<scalar>& f2) 93 {

94 if (f1.size() && (f1.size() == f2.size())) 95 { fapp_start("sumProd",1,1) ;

96 scalar SumProd = 0.0;

9 7 TFOR_ALL_S_OP_F_OP_F(sc alar, Su mProd, + = , sc alar 98 fapp_stop("sumProd",1,1) ;

99 return SumProd;

#define TFOR_ALL_S_OP_F_OP_F(typeS, s, OP1, typeF1, f1, OP2, typeF2, f2) ¥ /* set access to f1 and f2 at end of each field */ ¥

List_CONST_ACCESS(typeF1, f1, f1P); ¥ List_CONST_ACCESS(typeF2, f2, f2P); ¥ /* loop through field performing s OP f */ ¥ List_FOR_ALL(f1, i) ¥ (s) OP1 List_ELEM(f1, f1P, i) OP2 List_ELEM(f2, f2P, i); ¥ List_END_FOR_ALL

register const Type* __restrict__ f1P = (f1). begin();

register label i = (f1). size();

while (i--) { (SumProd) += (*f1P++) && (*f2P++); } マクロ展開

 コンパイルMSGを調査、while文のloopでのlist処理のため、SWP/SIMD化されていない

jwd6101s-i "scalarField.C", line 97: ループ内にSIMD化の制約となる文が存在します。

jwd8663o-i "scalarField.C", line 97: ソフトウェアパイプライニングの効果がないループと判断したため、ソフトウェアパイプライニングを適用しません。

jwd8202o-i ".scalarField.C", line 97: このループを展開数8回でループアンローリングしました。

 TFOR_ALL_S_OP_FUNC_Fマクロ定義

 sumMagソースコード（含fapp採取行）

 展開ソース

更にList_Xマクロを参照

 sumProdソースコード（同左）

 TFOR_ALL_S_OP_F_OP_Fマクロ定義

 展開ソース更にList_Xマクロを参照

(32)

register label i = (f1). size();

while (i--) { (SumProd) += (*f1P++) && (*f2P++); } register const Type* __restrict__ fp = (f).begin()； register label i = (f).size();

while (i--) {(SumMag) += mag((*fp++)); } const Type* const __restrict__ fP = (f). begin();

register const label _ni = (f). size();

for (register label i=0; i<_ni; i++) { (SumMag) += mag((fP[i])); }

./ Ope n FOAM/ c on tain e rs/ Lists/ List/ ListLoopM.H # i f d e f v e c t o r M a c h i n e

/ / E l e m e n t a c c e s s l o o p i n g u s i n g [ ] f o r v e c t o r m a #define L i s t _ F O R _ A L L ( f , i ) \ register const label _n##i = (f).size(); \ for (register label i=0; i<_n##i; i++) \ {

#define L i s t _ E N D _ F O R _ A L L }

#define L i s t _ E L E M ( f , f p , i ) (fp[i]) #define L i s t _ A C C E S S ( t y p e , f , f p ) \ type* const __restrict__ fp = (f).begin() #define L i s t _ C O N S T _ A C C E S S ( t y p e , f , f p ) \ const type* const __restrict__ fp = (f).begin()

# e l s e

/ / P o i n t e r l o o p i n g f o r s c a l a r m a c h i n e s

#define L i s t _ F O R _ A L L (f, i) \ register label i = (f).size(); \ while (i--) \ { \ #define L i s t _ E N D _ F O R _ A L L }

#define L i s t _ E L E M (f, fp, i) (*fp++) #define L i s t _ A C C E S S (type, f, fp) \

register type* __restrict__ fp = (f).begin()

#define L i s t _ C O N S T _ A C C E S S (type, f, fp) \

register const type* __restrict__ fp = (f).begin()

CPP マクロ展開の変更 (-DvectorMachine ）

32

②スカラマシン用

①vectorMachine用  CPPフラグに-DvectorMachineを指定

（sumMag)

（sumProd)

const scalar* const __restrict__ f1P = (f1). begin();

const scalar* const __restrict__ f2P = (f2). begin();

register const label _ni = (f1). size();

for (register label i=0; i<_ni; i++) { (SumProd) += (f1P[i]) * (f2P[i]); }

 コンパイルMSG(*1）からSWP/SIMD化を確認！

whileループでのスカラ向けのポインタ参照②から、

forループ.ベクトル向け[添え字]参照①に変更

 List_ x マクロ定義

jwd6004s-i scalarField.C", line 97: リダクション演算を含むループ制御変数'i'のループをSIMD

jwd8208o-i scalarField.C", line 97: ループ内の総和または乗積演算の計算方法を変更

jwd8204o-i scalarField.C", line 97: ループにソフトウェアパイプライニングを適用

jwd8205o-i scalarField.C", line 97: ループの回転数が81回以上の時、ソフトウェアパイプライニングを適用したループが実行時に選択

(*1)同時に-Kfastも指定

 展開ソース

①

②

①

②

(33)

C++ 最適化 Flag

Source code 修正前最適化flags 備考

DICPreconditioner.C

-O3

-Kfast

lduMatrixATmul.C -Kfast

PCG.C -Kfast

-DvectorMachine*1 (*1):sumProd,sumMagのマクロ

展開をvecter型に変更

-Kfast, nofp_relaxed*2

-DvectorMachine -Kfastだけでは、最適化の副作

用で異常終了 scalarField.C

PBiCG.C -Kfast,

-DvectorMachine sumProd,sumMag使用で

/fields/Fields/Field/FieldFunctio ns.Cをincludeしている

smoothSolver.C -Kfast,

-DvectorMachine

33

効果が期待できるコスト大の演算関数だけに限定して、最適化flagの-Kfastを指定する.

一部の.Cファイル内のコードを、特定optionでビルドする手順は、以下のものである.

①Allwmakeのlogから該当.Cのビルド部分を切り出し、特定Optionを指定しコンパイルする.

②.Oの作成を確認したら、Allwmakeを実行し、関連する.SOが一括作成される. その際、ビルド時間を短縮したい場合、wmakeで、影響範囲の.SOだけを作成する方法がある.

（参考）TIPS: ^(*2):^{逆数近似の最適化停止}

(34)

最適化効果

640mpi

Time(s)

sumMag,sumProdは、マクロ展開変更

でのソース生成で、最適化オブジェクト

(SIMD化＋SWP化）で実行され、大幅

性能向上！

DICP,lduMatも、-Kfastの最適化により若干の性能UPの成果.（indirect演算の

ため、SIMD/SWP化は不可であるが、

演算数が大幅減少）

実行効率：関数個別では3～7%、 pEqn処理全体では、2.3%を発揮.

34

実行効率%

12.3倍 1.7倍

1.2倍

1.1倍

PCGは、SIMD化率が大幅UPしたが、性能には効果がなかった.

いずれも、更なる性能向上には、loop毎の解析、TUNINGが不可欠.

Proc平均

(35)

性能比較

35

BASE

最適化版（REV_Kfast_Dvect）

Time.sec _Mflops実行効率

MPI通信系及び、演算系の最適化により、

両者とも大幅性能向上！

しかし、これ以上の並列数には、更なる通信系の高度化が不可欠となる

Proc平均

(36)

実行時間 log

³⁶

Time(s) scal

(1.3倍)

(1.4倍) (1.7倍) (2.1倍)

MPI並列数

(BASE時間/[REV+Kfast]比)

(1.3倍)

logファイルでのExecution Timeの比較では、最適版REV+KfastはBASEに対して、

1.3-2.1倍の性能向上だが、960mpi以上でscalは頭打ちの状況. 更なる改善が必要！！

（一対一通信での集団通信の実装を、京のMPI集団通信に変更を検討中）

*1:Logファイルの最終Execution Time(sec) 更なる改善

(37)

まとめ

• 大規模 C++ 、 OSS アプリケーション OpenFOAM の性能評価を、「京」での高度化支援を通じて、大規模データ (9700 万格子）を用いて実施 .

• 「京」での性能分析ツール (fipp,fapp) を駆使して、高並列 (320mpi-1280mpi) 実行での性能分析を行った .

• 結果、 MPI 通信処理のボトルネックの検出 / 改善と、演算処理関数への最適コンパイラ flag の指定等の簡易な ASIS レベルでの最適化により、大幅な性能向上

（ 1.3~2.1 倍）を得る事ができ、実行効率も 1.4% から 2.1% （ 640mpi 時）に向上 .

 又、更なる高並列化には、 OpenFOAM の集団通信の実装方式（バイナリーツリー方式よる一対一通信での実装）の抜本的な見直しが必要となると予想

 最後に、今回のボトルネックの検出手順の実例が、他の大規模 C++ アプリの分析作業に微力ながら役立つことを期待したい .

37

(38)

END

38

PowerPoint プレゼンテーション

「京」における OpenFOAM の性能評価

2014.12.19

RIST 神戸センター

産業利用推進室（ＡＰ東京）

井上 義昭

contents

• OpenFOAM とは

• 「京」での OpenFOAM 状況

• 評価環境と性能課題

• 基本プロファイラ採取 / 分析 (fipp)

• 詳細プロファイラ採取 / 分析 (fapp)

• 通信処理系の分析・改善

• 演算処理系の分析・改善

• 性能向上効果

• まとめ

 Open Source Field Operation and Manipulation の略、

英国の Nabla 社（現 OpenCFD 社）が開発販売してい

た商用コードを、 2004 年 12 月にオープンソース . ソフ トウェアとして公開（ライセンス費なし）

 オブジェクト指向言語 C++ で開発された汎用のプロ グラミングツール . ボックス（標準ソルバ、クラスライブ ラリ、ユーティリティなど完備）

 CFD （ Computational Fluid Dynamics ）など数値流体 力学の分野を中心に使用

OpenFOAM とは

コード規模

C++ の巨大ソースでコードの内容は、完全ブラック BOX ！！

さて、どうやって分析するか？ が最大の課題 .

「京」の OpenFOAM 状況

 2013 年 6 月 : RIST にて 2.1.0 の移植を支援、「京」版パッチを提供 .

 2013 年 10 月 : 第一回 OpenFOAM ワークショップ（ー OpenFOAM を

「京」で使おうー ) を開催 (RIST 主催） https://www.hpci- office.jp/pages/ws_openfoam_130927

 2014 年 1 月： 「京」にて OpenFOAM2.2.1 版ビルドに成功、「京」利用 者へ提供、「高度化支援」依頼に対応開始 .

 2014 年 4 月～： RIST にて実データをもとにした本格的に性能分析を 実施、ボトルネック改善での性能向上に成功 .

 2014 年 7 月 : 「高度化支援」申請課題に対して、 RIST 最適版を活用し て、大幅な性能向上を達成 .

 2014 年 10 月 : 第２回 OpenFOAM ワークショップ (RIST 主催 ) にて、高度 化実例や課題支援状況を報告 . https://www.hpci-

office.jp/pages/ws_openfoam2_141017

OpenFOAM 支援状況

評価環境

• OpenFOAM バージョン : 2.2.1

• OpenFOAM 使用 solver: interFoam –parallel (2 層流解析）

• 「京」言語環境 : Env_base_1.2.0-16-3

• 並列化： FlatMPI(8proc/node)

• mpi 数 : 320/640/960/1280 プロセス

• 評価方式： Strong scal

• データ規模： 9700 万 cell (450x450x600)

• 分割方式： scotch 分割

• 実行 Time_step 数： 0~ 0.00003 （ 15

)

実行時間 log （ BASE)

性能課題 : ６４０ mpi 以上

スケールしない

基本プロファイラ採取 / 分析 (fipp)

• ホット . スポット調査

• MPI 処理の割合

• 処理グループの調査

• グループ内の関数コスト分布

• コスト大関数とソースファイルの紐付け

ホット . スポット調査 (fipp/GUI 表示）

MPI 処理の割合

処理グループの調査

A. MULES 関数 Group （コスト増加）

 fipp callgraph 情報をもとに、 MULES.Group 内の関数構造及びコストを調査

B. fvMatrix 関数 Group （コスト Scal ）

1.fvMatrix::solve

> 2.::solveSegated

> 3. PCG::solve(398>88) > 4.lduMatrix::solver

> 5.gAverage > 6.sumReduce > 7.allReduce (1) > 5.gSum > 6.reduce > 7.allReduce (1) > 4.lduMatrix::Amul (494>162)

> 5.::updateMatrixInterfaces(60>13)

> 6.waitRequests(103>10)

> 5. ::initMatrixInterfaces(25>8) > 6. Thunk for > 7.::read(1>3) > 7. ::write(4>5) > 4. reduce > 5.allReduce(496>281)

> 4. DICPreconditioner(833>199) > 4. sumMag (170>43)

> 3.sumProd (104>38) ……

 fipp callgraph 情報をもとに、 fvMatrix 配下の関数の構造及びコストを調査

コスト大関数とソースファイルの紐付け

（参考）採取方法 TIPS

詳細プロファイラ採取 / 分析 (fapp)

• main プログラムでの採取区間の設定 (top.down)

• main プログラムでの区間特性の分析

• コスト大関数の区間設定と検証 (bottom.up)

main プログラムでの採取区間設定 (top.Down ）

main プログラムでの区間特性の分析

コスト大関数の区間設定（ bottom.UP)

井上義昭

た商用コードを、 2004 年 12 月にオープンソース . ソフトウェアとして公開（ライセンス費なし）

 オブジェクト指向言語 C++ で開発された汎用のプログラミングツール . ボックス（標準ソルバ、クラスライブラリ、ユーティリティなど完備）

 CFD （ Computational Fluid Dynamics ）など数値流体力学の分野を中心に使用

さて、どうやって分析するか？が最大の課題 .

 2014 年 1 月：「京」にて OpenFOAM2.2.1 版ビルドに成功、「京」利用者へ提供、「高度化支援」依頼に対応開始 .

 2014 年 4 月～： RIST にて実データをもとにした本格的に性能分析を実施、ボトルネック改善での性能向上に成功 .

 2014 年 7 月 : 「高度化支援」申請課題に対して、 RIST 最適版を活用して、大幅な性能向上を達成 .

 2014 年 10 月 : 第２回 OpenFOAM ワークショップ (RIST 主催 ) にて、高度化実例や課題支援状況を報告 . https://www.hpci-

• 大規模 C++ 、 OSS アプリケーション OpenFOAM の性能評価を、「京」での高度化支援を通じて、大規模データ (9700 万格子）を用いて実施 .

• 「京」での性能分析ツール (fipp,fapp) を駆使して、高並列 (320mpi-1280mpi) 実行での性能分析を行った .

• 結果、 MPI 通信処理のボトルネックの検出 / 改善と、演算処理関数への最適コンパイラ flag の指定等の簡易な ASIS レベルでの最適化により、大幅な性能向上

 又、更なる高並列化には、 OpenFOAM の集団通信の実装方式（バイナリーツリー方式よる一対一通信での実装）の抜本的な見直しが必要となると予想

 最後に、今回のボトルネックの検出手順の実例が、他の大規模 C++ アプリの分析作業に微力ながら役立つことを期待したい .