Parallel Programming in MPI
part 2
Today's Topic
• ノンブロッキング通信
Non-Blocking Communication
• 通信の完了を待つ間に他の処理を行う
Execute other instructions while waiting for the completion of a communication.
• 集団通信関数の実装
Implementation of collective communications • MPIプログラムの時間計測
Measuring execution time of MPI programs • デッドロック Deadlock
Today's Topic
• ノンブロッキング通信
Non-Blocking Communication
• 通信の完了を待つ間に他の処理を行う
Execute other instructions while waiting for the completion of a communication.
• 集団通信関数の実装
Implementation of collective communications • MPIプログラムの時間計測
Measuring execution time of MPI programs • デッドロック Deadlock
ノンブロッキング通信関数
Non-blocking communication functions
• ノンブロッキング
= ある命令の完了を待たずに次の命令に移る
Non-blocking = Do not wait for the completion of an instruction and
proceed to the next instruction
•
Example) MPI_Irecv & MPI_Wait
MPI_Recv
Wait for the arrival of data
MPI_Irecv
Proceed to the next instruction without waiting for the data data Blocking next instructions next instructions MPI_Wait data Non-Blocking
MPI_Irecv
• Non-Blocking Receive
• Parameters:
start address for storing received data, number of elements, data type,
rank of the source, tag (= 0, in most cases),
communicator (= MPI_COMM_WORLD, in most cases), request
• request: 通信要求 Communication Request
• この通信の完了を待つ際に用いる
Used for Waiting completion of this communication
• Example)
MPI_Request req; ...
MPI_Irecv(a, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, &req); ...
MPI_Wait(&req, &status);
Usage:
int MPI_Irecv(void *b, int c, MPI_Datatype d, int src, int t, MPI_Comm comm, MPI_Request *r);
MPI_Isend
• Non-Blocking Send
• Parameters:
start address for sending data, number of elements, data type,
rank of the destination, tag (= 0, in most cases),
communicator (= MPI_COMM_WORLD, in most cases), request
• Example)
MPI_Request req; ...
MPI_Isend(a, 100, MPI_INT, 1, 0, MPI_COMM_WORLD, &req); ...
MPI_Wait(&req, &status);
Usage:
int MPI_Isend(void *b, int c, MPI_Datatype d, int dest, int t, MPI_Comm comm,
Non-Blocking Send?
• Blocking send (MPI_Send):
送信データが別の場所にコピーされるのを待つ
Wait for the data to be copied to somewhere else.
• ネットワークにデータを送出し終わるか、一時的にデータのコピーを作成するまで。 Until completion of the data to be transferred to the network
or, until completion of the data to be copied to a temporal memory.
• Non-Blocking send (MPI_Isend):
待たない
Notice: ノンブロッキング通信中はデータが不定
Data is not sure in non-blocking communications
• MPI_Irecv:
• 受信データの格納場所と指定した変数の値は MPI_Waitまで不定
Value of the variable specified for receiving data is not fixed before MPI_Wait MPI_Irecv to A ... ~ = A ... MPI_Wait 10 A 50 A arrived data 50
Value of
A
at here
can be 10 or 50
~ = AValue of
A
is 50
Notice: ノンブロッキング通信中はデータが不定
Data is not sure in non-blocking communications
• MPI_Isend:
• 送信データを格納した変数を MPI_Waitより前に書き換えると、実際に送信さ れる値は不定
If the variable that stored the data to be sent is modified before MPI_Wait, the value to be actually sent is unpredictable.
MPI_Isend A ... A= 50 ... MPI_Wait 10 A 50 A data sent 10 or 50 A= 100
Modifying value of
A
here
causes incorrect
communication
You can modify value of
A
at
MPI_Wait
• ノンブロッキング通信(
MPI_Isend、 MPI_Irecv)の完了を待つ。
Wait for the completion of MPI_Isend or MPI_Irecv
• 送信データの書き換えや受信データの参照が行える Make sure that sending data can be modified,
or receiving data can be referred. • Parameters:
request, status
•
status:
MPI_Irecv 完了時に受信データの statusを格納
The status of the received data is stored at the completion of
MPI_Irecv
Usage:
MPI_Waitall
• 指定した数のノンブロッキング通信の完了を待つ
Wait for the completion of specified number of non-blocking
communications
• Parameters:
count, requests, statuses
•
count:
ノンブロッキング通信の数
The number of non-blocking communications
•
requests, statuses:
少なくとも
count個の要素を持つ MPI_Request と MPI_Statusの配列
Arrays of MPI_Request or MPI_Status that consists at least 'count'
number of elements.
Usage:
int MPI_Waitall(int c,
Today's Topic
• ノンブロッキング通信
Non-Blocking Communication
• 通信の完了を待つ間に他の処理を行う
Execute other instructions while waiting for the completion of a communication.
• 集団通信関数の実装
Implementation of collective communications • MPIプログラムの時間計測
Measuring execution time of MPI programs • デッドロック Deadlock
集団通信関数の中身
Inside of the functions of collective communications
• 通常,集団通信関数は,
MPI_Send, MPI_Recv, MPI_Isend, MPI_Irecv
等の一対一通信で実装される
Usually, functions of collective communications are
implemented by using message passing functions.
Inside of MPI_Bcast
• One of the most simple implementations
int MPI_Bcast(char *a, int c, MPI_Datatype d, int root, MPI_Comm comm)
{
int i, myid, procs; MPI_Status st; MPI_Comm_rank(comm, &myid); MPI_Comm_size(comm, &procs); if (myid == root){ for (i = 0; i < procs) if (i != root) MPI_Send(a, c, d, i, 0, comm); } else{
MPI_Recv(a, c, d, root, 0, comm, &st); }
Another implementation: With MPI_Isend
int MPI_Bcast(char *a, int c, MPI_Datatype d, int root, MPI_Comm comm)
{
int i, myid, procs, cntr; MPI_Status st, *stats; MPI_Request *reqs;
MPI_Comm_rank(comm, &myid); MPI_Comm_rank(comm, &procs); if (myid == root){
stats = (MPI_Status *)malloc(sizeof(MPI_Status)*procs); reqs = (MPI_Request *)malloc(sizeof(MPI_Request)*procs); cntr = 0;
for (i = 0; i < procs) if (i != root)
MPI_Isend(a, c, d, i, 0, comm, &(reqs[cntr++])); MPI_Waitall(procs-1, reqs, stats);
free(stats); free(reqs); } else{
MPI_Recv(a, c, d, root, 0, comm, &st); }
Flow of the Simple Implementation
Rank 0 Rank 1 Rank 2 Rank 3 Rank 4 Rank 5 Rank 6 Rank 7
Isend to 1 Irecv from 0 Isend to 2 Irecv from 0 Irecv from 0 Irecv from 0 Irecv from 0 Irecv from 0 Irecv from 0 wait wait wait wait wait wait Isend to 3 Isend to 4 Isend to 5 Isend to 6 Isend to 7 wait waitall
Time for Simple Implementation
•
1 link can transfer 1 message at a time
0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 Total Time = T * (P-1)
Another implementation: Binomial Tree
int MPI_Bcast(char *a, int c, MPI_Datatype d, int root, MPI_Comm comm)
{
int i, myid, procs; MPI_Status st;
int mask, relative_rank, src, dst; int tag = 1, success = 0;
MPI_Comm_rank(comm, &myid); MPI_Comm_rank(comm, &procs); relative_rank = myid - root; if (relative_rank < 0)
relative_rank += procs; mask = 1;
while (mask < num_procs){ if (relative_rank & mask){
src = myid - mask;
if (src < 0) src += procs;
MPI_Recv(a, c, d, src, 0, comm, &st); break;
}
mask >>= 1;
while (mask > 0){
if (relative_rank + mask < procs){ dst = myid + mask;
if (dst >= procs) dst -= procs; MPI_Send (a, c, d, dst, 0, comm); }
mask >>= 1; }
return 0; }
Flow of Binomial Tree
• Use 'mask' to determine when and how to Send/Recv
Rank 0 Rank 1 Rank 2 Rank 3 Rank 4 Rank 5 Rank 6 Rank 7 mask = 1 mask = 2 mask = 4 mask = 4 mask = 2 mask = 1 Send to 4 Send to 2 Send to 1 mask = 1 Recv from 0 mask = 1 mask = 2 mask = 1 mask = 2 mask = 4 mask = 1 mask = 2 mask = 1 Recv from 2 mask = 1 Recv from 4 mask = 1 Recv from 6 Recv from 0 Recv from 0 Recv from 4 mask = 1 Send to 3 mask = 2 Send to 6 mask = 1 Send to 7 mask = 1 Send to 5
Time for Binomial Tree
•
Use multiple links at a time
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7
Total Time = T * log2P
T: Time for transferring 1 message P: Number of processes
Today's Topic
• ノンブロッキング通信
Non-Blocking Communication
• 通信の完了を待つ間に他の処理を行う
Execute other instructions while waiting for the completion of a communication.
• 集団通信関数の実装
Implementation of collective communications • MPIプログラムの時間計測
Measuring execution time of MPI programs • デッドロック Deadlock
MPIプログラムの時間計測
Measure the time of MPI programs
• MPI_Wtime
• 現在時間(秒)を実数で返す関数 Returns the current time in seconds. • Example) Measure time here
...
double t1, t2;
...
t1 = MPI_Wtime();
処理
t2 = MPI_Wtime();
並列プログラムにおける時間計測の問題
Problem on measuring time in parallel programs
• プロセス毎に違う時間を測定:
どの時間が本当の所要時間か?
Each process measures different time.
Which time is the time we want?
Read Read Send Read Send Rank 0 Receive Receive Rank 1 Rank 2 t1 = MPI_Wtime(); t1 = MPI_Wtime(); t1 = MPI_Wtime(); t1 = MPI_Wtime(); t1 = MPI_Wtime(); t1 = MPI_Wtime(); Measure time here
集団通信
MPI_Barrierを使った解決策
Use MPI_Barrier
• 時間計測前にMPI_Barrierで同期
Synchronize processes before each measurement
• For measuring total execution time.
Read Read Send Read Send Rank 0 Receive Receive Rank 1 Rank 2 t1 = MPI_Wtime(); MPI_Barrier MPI_Barrier MPI_Barrier MPI_Barrier MPI_Barrier Measure time here
より細かい解析
Detailed analysis
•
Average
• MPI_Reduce can be used to achieve the average:
•
MAX and MIN
• Use MPI_Gather to gather all of the results to Rank 0. • Let Rank 0 to find MAX and MIN
double t1, t2, t, total;
t1 = MPI_Wtime();
...
t2 = MPI_Wtime();
t = t2 – t1;
MPI_Reduce(&t, &total, 1, MPI_DOUBLE, MPI_SUM,
0, MPI_COMM_WORLD);
if (myrank == 0)
最大
(Max)、平均(Ave)、最小(Min)の関係
Relationships among Max, Ave and Min
• プロセス毎の負荷(仕事量)のばらつき検証に利用
Can be used for checking the load-balance.
Max – Ave
is large
Max – Ave
is small
Ave – Min
is large
NG
Mostly OK
Ave – Min
通信時間の計測
Measuring time for communications
double t1, t2, t3, t4 comm=0;
t3 = MPI_Wtime();
for (i = 0; i < N; i++){
computation
t1 = MPI_Wtime();
communication
t2 = MPI_Wtime(); comm += t2 – t1;
computation
t1 = MPI_Wtime();
communication
t2 = MPI_Wtime(); comm += t2 – t1;
}
t4 = MPI_Wtime();
Analyze computation time
• Computation time = Total time - Communication time
• Or, just measure the computation time
• 計算時間のばらつき = 負荷の不均衡の度合い
Balance of computation time shows
balance of the amount of computation
• 注意: 通信時間には、負荷の不均衡によって生じた待ち時間が含まれ
るので、単純な評価は難しい
Communication time is difficult to analyze since
it consists waiting time caused by load-imbalance.
Today's Topic
• ノンブロッキング通信
Non-Blocking Communication
• 通信の完了を待つ間に他の処理を行う
Execute other instructions while waiting for the completion of a communication.
• 集団通信関数の実装
Implementation of collective communications • MPIプログラムの時間計測
Measuring execution time of MPI programs • デッドロック Deadlock
Deadlock
• 何らかの理由で、プログラムを進行させることができなくなった状態
A status of a program in which it cannot proceed by some reasons.•
MPIプログラムでデッドロックが発生しやすい場所:
Places you need to be careful for deadlocks:
1. MPI_Recv, MPI_Wait, MPI_Waitall
2. Collective communications
if (myid == 0){
MPI_Recv from rank 1 MPI_Send to rank 1 }
if (myid == 1){
MPI_Recv from rank 0 MPI_Send to rank 0 }
if (myid == 0){
MPI_Irecv from rank 1 MPI_Send to rank 1 MPI_Wait
}
if (myid == 1){
MPI_Irecv from rank 0 MPI_Send to rank 0 MPI_Wait
}
Wrong case: One solution:
Summary
• ノンブロッキング通信の効果
Effect of non-blocking communication
• 通信開始と通信完了待ちを分離
Split the start and the completion of a communication • 通信と計算のオーバラップを可能にする
Enable overlapping of communication and computation .
• 集団通信の実装
Implementation of collective communication.
• 内部で送信と受信を組み合わせて実装
Construct algorithms with sends and receives. • 所要時間はアルゴリズムに依存
Time depends on the algorithm.
• MPIプログラムの時間計測
Measuring execution time of MPI programs
• 並列プログラムではデッドロックに注意
Report) Make Reduce function by yourself
• 次のページのプログラムの my_reduce関数の中身を追加してプロ
グラムを完成させる
Fill the inside of 'my_reduce' function in the program
shown in the next slide
• my_reduce: MPI_Reduceの簡略版 Simplified version of MPI_Reduce
• 整数の総和のみ. ルートランクは 0限定. コミュニケータは MPI_COMM_WORLD
Calculates total sum of integer numbers. The root rank is always 0. The communicator is always MPI_COMM_WORLD.
• アルゴリズムは好きなものを考えてよい
Any algorithm is OK.
#include <stdio.h> #include <stdlib.h> #include "mpi.h" #define N 20
int my_reduce(int *a, int *b, int c) {
return 0; }
int main(int argc, char *argv[]) {
int i, myid, procs; int a[N], b[N]; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &myid); MPI_Comm_size(MPI_COMM_WORLD, &procs); for (i = 0; i < N; i++){ a[i] = i; b[i] = 0; } my_reduce(a, b, N); if (myid == 0) for (i = 0; i < N; i++)
printf("b[%d] = %d , correct answer = %d¥n", i, b[i], i*procs); MPI_Finalize();
return 0;