Microsoft PowerPoint ppt [互換モード]

(1)

Graduate School of Information Science, Nagoya Univ. Embedded and Real-Time Systems Lab.

ERTL

計算機アーキテクチャ特論

• 前半（並列アーキテクチャの基本、枝廣） – 10/1, 10/15, 10/22, 10/29, 11/5, 11/12（⽇程は予定） – 内容（変更の可能性あり） • 序論（マルチコア＝並列アーキテクチャ概論） • キャッシュ・コヒーレンシ、メモリ・コンシステンシ • 並列アーキテクチャモデル、ＯＳモデル • スケーラビリティに関する法則 • 並列プログラミングモデル、⾔語 – 講義のWWWサイト http://www.pdsl.jp/class/ から計算機アーキテクチャ特論のページに⼊る資料配布をしないので、事前にダウンロードして必要ならば印刷してくるように。資料は前⽇にはアップロードする予定 Page 1

２０１２年１０月２２日枝廣

(2)

⽬次

• 並列アーキテクチャモデルとOSモデル、

プログラムモデル

(3)

ERTL

3

マルチコアプロセッサの分類

• ヘテロジニアス vs. ホモジニアス

（ハードウェア・アーキテクチャのAMP vs. SMP） – ヘテロジニアス：異種コアによるマルチプロセッサ – ホモジニアス：同種コアによるマルチプロセッサ（注：同種コアでも性能が異なればヘテロジニアスとよばれる場合がある）

• AMP vs. SMP

_{（システムのAMP vs. SMP）}

– AMP (Asymmetric Multi-Processor、⾮対称型)

• 各コアが別々のソフトを実⾏（機能分散） – SMP (Symmetric Multi-Processor、対称型) • OSが複数ソフトウェアを複数コアに負荷分散しながら実⾏（SMPはホモジニアス型のみ） • ただし、最近の組込みシステム向けSoCでは様々な専⽤エンジンを搭載しつつ複数CPUを持つため、上記が混在している場合がある。

DSP

CPU

HW

CPU

ヘテロジニアスホモジニアス

(4)

SMP型システムの定義

SMP OS

CPU1 CPU2 CPU3 CPU4 SW1 SW2 SW3 SW4

ポイント１：一つの

OS

ポイント２：

対称的に実行可能＝

別の

CPUに移すことが可能

(CPU1で動作させ中断していたものをCPU2 で再開することが可能。問題はキャッシュに一時保存しているデータの扱い）様々な定義がある。ここでは以下のように考える。一つのOSで管理され，全ての処理（タスク，プロセス，スレッドなど）が全ての CPUにおいて対称的に実行可能であるようなマルチプロセッサによる並列処理方式同じバイナリが全てのCPUで動作する必要がある（少なくとも命令セットは）同じCPUである 必要がある＝ホモジニアスホモジニアス・AMP型とSMP型の違い システムモデル（次頁）キャッシュ一貫性に対するハードウェアサポート（後述） SMP型 AMP型 SMP型以外のもの

(5)

ERTL

5

マルチコアプロセッサの分類

ヘテロジニアス・AMP

ホモジニアス・SMP

SMP OS

CPU CPU CPU CPU SW1 SW2 SW3 SW4

ホモジニアス・AMP

HW&SW一体でサブシステム最適化

・サブシステム内変更が他に影響しにくい → リアルタイム性確保や、テストで有利・ヘテロジニアスでは電力・性能・コスト面で最適なHWを選択・ホモジニアスではHWを同一化。SW環境を同じにしつつサブシステム分離性確保・SW-HWの割り当ては固定的 → HW能力に分割損が発生しがち

OSがSWモジュール（スレッド）を

動的にHWへマッピング

•SWモジュール変更が全体性能に影響 → リアルタイム性確保やテストで不利 •SW機能のマッピング自由度が大 → HW能力の分割損は発生しにくい CPU SW1 CPU SW2 CPU SW3 CPU SW4 OS OS OS OS CPU SW1 DSP SW2 HW1 SW3 HW2 SW4 OS OS

(6)

AMPとSMPの違い（まとめ）

• SMP型＝「⼀つのOSで管理され，すべての処理（タスク，

プロセス，スレッドなど）がすべてのCPUにおいて対称的に

実⾏可能であるようなマルチプロセッサによる並列処理⽅

式」

– SMP型にはホモジニアスしかなく、⼀つのOSがすべてのソフトウェアを動的に負荷分散しながら実⾏する – AMP型は、各プロセッサにOS＊を持ち、各プロセッサが実⾏するソフトウェアが静的に決められている * プロセッサ管理のみの簡易的な基本ソフトウェアも含む – AMP型はそれぞれのプロセッサがサブシステムとなる • リアルタイム性保証やテストなどにメリットがあり、現状組込みシステムでは AMP型の⽅が多いと⾔われている – AMP型とSMP型ではキャッシュの⼀貫性に関するハードウェア機構に違いがある • 混合型もあり、組込みプロセッサには対応したハードウェア機構を持つものもある

(7)

ERTL

7

コヒーレント・キャッシュとシステムモデル

• ノンコヒーレント・キャッシュ：CPU間の分離性が良くなる →AMP型システムに向く • コヒーレント・キャッシュ：CPU間でのデータ共有オーバーヘッド⼩ →SMP型システムに向く • 組込みシステムでは両⽅の性質を使いたい要求（例えば⼀部のタスクのみリアルタイム性を確保したい）があり、組込みプロセッサではスヌープ機構を部分的に切れるようになっているものもある CPU1 CPU2 コヒーレント・キャッシュ CPU1 CPU2 キャッシュ OS1 キャッシュスケジューラ OS2 スケジューラ SW1 SW2 SW3 SW4 AMP型システム SMP OS スケジューラ SW1 SW2 SW3 SW4 SMP型システム CPU1 CPU2 コヒーレント・キャッシュ SMP OS スケジューラ SW1 SW2 SW3 CPU3 キャッシュ OS2 スケジューラ SW4 AMP/SMP混合型システム

(8)

AMP型マルチコアシステムのリアルタイム性

a) 1CPU b) 3CPU

Delay from Scheduled Time (= Points above 40ms) →Discontinuity of Audio & Video

Many Delays

Time Execute Time for Periodical Processes

NO DELAY

(9)

ERTL

Page 9 © NEC Corporation 2010

ＳＭＰ型マルチコアシステムのスケーラビリティ

• ⼿ぶれ補正処理 – シャッタースピードを遅くすると⼿ぶれ発⽣。速くすると暗くなる (I, II) – シャッタースピードを速くしつつ、複数枚撮影し、画像補正 (III) – 画質を⾼くすればするほど⾼いCPU性能が必要 0 0.5 1 1.5 2 2.5 3 3.5 1 2 3 4 Number of Processors Speedup QVGA VGA ▐ 並列性能向上率 (1CPU対4CPU) --- VGA: 2.94倍, QVGA: 3.15倍

(II) fast shutter: dark but not blurry (I) slow shutter:

bright but blurry

(III) bright and not blurry Image

Stabilizer

(10)

タスク７タスク３タスク５タスク２タスク１タスク４タスク６タスク７タスク３タスク５タスク２

AMP型とSMP型のプログラムモデル

• AMP型はプロセッサごとの（別々のOS上の）プログラムとなり、プログラム間の同期・通信を記載する。CPUへのタスク（スレッド）割り当てはプログラム時に静的に⾏われる • SMP型はSMP OS上の⼀つのプログラムとなり、同期・通信も含め、並列化⽀援⾔語・APIとして記載する。SMP OSが負荷分散を考慮しながら動的にタスク（スレッド）をプロセッサに割り当てる OS CPU１ OS CPU２ OS CPU３ SMP OS

CPU１ CPU２ CPU３

CPU１向けプログラム CPU２向けプログラム CPU３向けプログラムタスク１タスク４タスク６並列化プログラム

AMP型

SMP型

(11)

ERTL

11

プログラムが並列・並⾏実⾏可能に記述

• AMP型のプログラム

–

同期・通信以外は通常のソフトウェア

• SMP型のプログラム

–

スレッド・プログラミング

(12)

SMP型マルチコア向けスレッド化プログラミング

• OSが提供するスレッドライブラリ

– pthread

• IEEEのPOSIX Section 1003.1c規格。Linuxなどで標準的にサポート

POSIX: Portable Operating System Interface

– Windows API • Windowsでサポート

• ⾔語仕様内、⾔語拡張のスレッドライブラリ

– Java Thread • Java⾔語の中に標準で定義 – OpenMP • C/C++/FORTRANを並列プログラム可能にするために⽶国コンパイラベンダグループによって作られた指⽰⽂。パソコン向けの開発環境などで標準的にサポート – TBB • Intel社が開発した⾔語。C/C++で使える。動的な負荷分散などをランタイムで⾏う – TPL • Microsoft社の⾔語。.NETに含まれておりC#, VBで使える – Cilk • MITで開発された⾔語。ANSI Cで使える。Intelなどがサポートしはじめている

(13)

ERTL

13

OS スレッドライブラリ

• pthread

–

IEEE POSIX Section 1003.1c

POSIX: Portable Operating System Interface

–

Nichols, Buttlar, and Farrell: Pthreads

Programming, OʼREILLY, 1998.

–

Linuxなどで標準

–

pthread_create, pthread_join

• Windows Thread API

(14)

#include <stdio.h> #include <math.h> #define DATA_NUM 100 int main() { BOOL primes[DATA_NUM]; int i;

Example2: Calculate Primes

/* Check */

for (i = 0; i < DATA_NUM; i++) { primes[i] = TRUE; limit = (int)sqrt((double)i); for (j = 2; j <= limit; j++) if (primes[j] && i % j == 0) { primes[i] = FALSE; break; } } /* Output */

for (i = 2; i < DATA_NUM; i++) { if (primes[i] == 1) printf("%d ", i); }

printf("¥n"); return 0; }

If primes[i] is TRUE (j is a prime), and (i % j == 0) ( i is multiple number of j), i is an prime.

If j is not a prime, we don’t have to check if I is multiple number of j. Why?

(15)

ERTL

15 #include <stdio.h> #include <stdlib.h> #include <stdbool.h> #include <math.h> #include <pthread.h> #define THREAD_NUM 3 #define DATA_NUM 100

typedef struct _thread_arg { int id; bool *primes; } thread_arg_t; マルチコアCPUのための 並列プログラミング（秀和システムズ）より

void thread_func(void *arg) {

thread_arg_t* targ = (thread_arg_t *)arg; int c_start, c_end, range, limit;

int i, j;

/* Determine Range of Values to be Checked */ range = (DATA_NUM - 2) / THREAD_NUM + 1; c_start = 2 + targ->id *range;

c_end = 2 + (targ->id+1) *range;

if (c_end > DATA_NUM) c_end = DATA_NUM; /* Check */

for (i = c_start; i < c_end; i++) { limit = (int)sqrt((double) i); for (j = 2; j <= limit; j++) if (targ->primes[j] && i % j == 0) { targ->primes[i] = false; break; } } return; }

Pthread (1/2)

Calc Primes

(16)

/* Wait for All Threads */

for (i = 0; i < THREAD_NUM; i++) pthread_join(handle[i], NULL);

/* Output */

for (i = 2; i < DATA_NUM; i++) if (primes[i]) printf("%d ", i); printf("¥n"); return 0; int main() { pthread_t handle[THREAD_NUM]; thread_arg_t targ[THREAD_NUM]; bool primes[DATA_NUM]; int i; /* Initialize */

for (i = 0; i < DATA_NUM; i++) primes[i] = true; /* Start */ for (i = 0; i < THREAD_NUM; i++) { targ[i].id = i; targ[i].primes = primes; pthread_create(&handle[i], NULL, (void*)thread_func, (void*)&targ[i]); }

Pthread (2/2)

(17)

ERTL

17 #include <stdio.h> #include <windows.h> #include <math.h> #define THREAD_NUM 3 #define DATA_NUM 100

typedef struct _thread_arg { int id; BOOL *primes; } thread_arg_t; マルチコアCPUのための 並列プログラミング（秀和システムズ）より

void thread_func(void *arg) {

int i, j;

/* Determine Range of Values to be Checked */ range = (DATA_NUM - 2) / THREAD_NUM + 1; c_start = 2 + targ->id * range;

c_end = 2 + (targ->id + 1) * range;

if (c_end > DATA_NUM) c_end = DATA_NUM; /* Check */

for (i = c_start; i < c_end; i++) { limit = (int)sqrt((double)i); for (j = 2; j <= limit; j++) if (targ->primes[j] && i % j == 0) { targ->primes[i] = FALSE; break; } } return; }

Windows thread (1/2)

Calc Primes

(18)

WaitForMultipleObjects(THREAD_NUM, handle, TRUE, INFINITE);

/* Output */

for (i = 2; i < DATA_NUM; i++) { if (primes[i] == 1) printf("%d ", i); } printf("¥n"); return 0; } int main() { HANDLE handle[THREAD_NUM]; thread_arg_t targ[THREAD_NUM]; BOOL primes[DATA_NUM]; int i;

for (i = 0; i < DATA_NUM; i++) { primes[i] = TRUE;

}

for (i = 0; i < THREAD_NUM; i++) { targ[i].id = i;

targ[i].primes = primes;

handle[i] = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)thread _func, (void *)&targ[i], 0, NULL);

}

(19)

ERTL

19

OpenMP

• OS スレッドライブラリは低レベル

–

プログラマはアーキテクチャを考慮し、粒度や負荷

分散を考えながら⾃分でプログラムを切って記載す

る必要がある。

• OpenMP

–

C/C++/FORTRANの指⽰⾏として並列を記載

–

USのコンパイラベンダが集まって開発

–

PC向けの開発環境などでサポートされている

–

http://www.openmp.org/

–

Fork-Join Model

–

粒度はランタイムによって決められる

(20)

#include <stdio.h> #include <math.h> #include <omp.h> #define DATA_NUM 100 int main() { BOOL primes[DATA_NUM]; int i; /* Initialize */

#pragma omp parallel for

for (i = 0; i < DATA_NUM; i++) primes[i] = TRUE;

OpenMP

Calc Primes

/* Check */

#pragma omp parallel for

for (i = 0; i < DATA_NUM; i++) { primes[i] = TRUE; limit = (int)sqrt((double)i); for (j = 2; j <= limit; j++) if (primes[j] && i % j == 0) { primes[i] = FALSE; break; } } /* Output */

(21)

ERTL

21

Software for SMP （OpenMP)

• Example of OpenMP (Banking)

–

Execute ‘section’s in Parallel

within ‘sections’ block

#pragma omp parallel sections

{

#pragma omp section

main();

#pragma omp section

withdraw();

#pragma omp section

deposit();

#pragma omp section

balance();

}

–‘sections’

ブロックのʻ}’で同期

(すべてのsectionはʼ}ʼで同期)

Banking

main()

Main thread

withdraw()

thread

deposit()

thread

balance()

thread

Customer Requests

(22)

Software for SMP (OpenMP)

• Example of OpenMP (Video Decode)

–for-loop with ‘for’ Directive

is executed in Parallel

#pragma omp parallel for for(i=1; i<=N; i++)

Decode#i; –その他の指⽰⽂ • 総和 • バリア • アトミック • …

Video Decode

Decod e#1 Decod

e#2 Decod_e#3

Decod e#5 Decod e#8 Decod e#7 Decod e#4 Decod e#5

(23)

ERTL

23

排他制御に関する⽤語

• クリティカルセクション

–

⼀度に⼀つのプロセスまたはスレッドのみが実⾏可

能なプログラムの部分

–

例：グローバル変数の書換

(素数の数のカウント)

• 共有リソース

–

メモリ、周辺デバイスなど

(24)

排他制御

時間

その他の

処理

クリティカル

セクション

その他の

処理

一度に一つのプロセス（スレッド）

のみが実行可能

例：グローバル変数の書換

共有リソースの利用

(25)

ERTL

25

Lock - Unlock

時間

その他の

処理

クリティカル

セクション

その他の

処理

「ロック変数」 v を宣言

Thread A

Lock v

Thread A

は実行可能

Unlock v

Thread B

vがUnlock

されるまで

_Wait

STOP

(26)

排他制御の例

• Mutex (= Mutual Exclusion)

–

ある変数のLock/Unlock

• セマフォ

–

リソースが複数ある場合に利⽤

–

利⽤可能なリソース数を保持し、リソースが残って

いる限りプログラムはクリティカルセクションに⼊

れる

–

Mutexはリソース数が⼀つの特殊ケースと考えられ

る

(27)

ERTL

27

pthread, POSIX セマフォ

• pthread mutex

–

pthread_mutex_init

• ロック変数の初期化

–

pthread_mutex_lock, pthread_mutex_unlock

–

pthread_destroy

• POSIX セマフォ

–

sem_init

–

sem_wait, sem_post

–

sem_destroy

(28)

Windows Thread API

• クリティカルセクション

–

InitializeCriticalSection

–

EnterCriticalSection, LeaveCriticalSection

–

DeleteCriticalSection

• セマフォ

–

CreateSemaphore

–

WaitForSingleObject, ReleaseSemaphore

–

CloseHandle

(29)

ERTL

29 #include <stdio.h> #include <math.h> #define DATA_NUM 100 int main() { BOOL primes[DATA_NUM]; int I, count;

Example2’: Calculate Primes

and count # of Primes

/* Check */

for (i = 0; i < DATA_NUM; i++) { primes[i] = TRUE; limit = (int)sqrt((double)i); for (j = 2; j <= limit; j++) if (primes[j] && i % j == 0) { primes[i] = FALSE; break; } if (j > limit) count++; } /* Output */

(30)

#include <stdio.h> #include <stdlib.h> #include <stdbool.h> #include <math.h> #include <pthread.h> #define THREAD_NUM 3 #define DATA_NUM 100 typedef struct _thread_arg {

int id;

bool *primes;

pthread_mutex_t *mutex; } thread_arg_t;

int count;

void thread_func(void *arg) {

int i, j;

/* Determine Range of Values to be Checked */ range = (DATA_NUM - 2) / THREAD_NUM + 1; c_start = 2 + targ->id *range;

c_end = 2 + (targ->id+1) *range;

if (c_end > DATA_NUM) c_end = DATA_NUM; /* Check */

for (i = c_start; i < c_end; i++) { limit = (int)sqrt((double) i); for (j = 2; j <= limit; j++) if (targ->primes[j] && i % j == 0) { targ->primes[i] = false; break; } if(j > limit) { pthread_mutex_lock(targ->mutex); count++; pthread_mutex_unlock(targ->mutex); } } return; }

Pthread (1/2)

Calc Primes

(31)

ERTL

31

/* Wait for All Threads */

for (i = 0; i < THREAD_NUM; i++) pthread_join(handle[i], NULL);

/* Destroy Mutex Variable */

pthread_mutex_destroy(&mutex);

/* Output */

for (i = 2; i < DATA_NUM; i++) if (primes[i]) printf("%d ", i); printf("¥n"); return 0; int main() { pthread_t handle[THREAD_NUM]; thread_arg_t targ[THREAD_NUM]; bool primes[DATA_NUM]; int i; pthread_mutex_t mutex; /* Initialize */

for (i = 0; i < DATA_NUM; i++) primes[i] = true;

/* Initialize mutex variable */ pthread_mutex_init(&mutex, NULL);

/* Start */

for (i = 0; i < THREAD_NUM; i++) { targ[i].id = i; targ[i].primes = primes; targ[i].mutex = &mutex; pthread_create(&handle[i], NULL, (void*)thread_func, (void*)&targ[i]); }

Pthread (2/2)

(32)

#include <stdio.h> #include <windows.h> #include <math.h>

#define THREAD_NUM 3 #define DATA_NUM 100 typedef struct _thread_arg {

int id;

BOOL *primes;

CRITICAL_SECTION *cs; } thread_arg_t;

int count;

void thread_func(void *arg) {

int i, j;

/* Determine Range of Values to be Checked */ range = (DATA_NUM - 2) / THREAD_NUM + 1; c_start = 2 + targ->id * range;

c_end = 2 + (targ->id + 1) * range;

if (c_end > DATA_NUM) c_end = DATA_NUM; /* Check */

for (i = c_start; i < c_end; i++) { limit = (int)sqrt((double)i); for (j = 2; j <= limit; j++) if (targ->primes[j] && i % j == 0) { targ->primes[i] = FALSE; break; } if(j > limit) { EnterCriticalSection(targ->cs); count++; LeaveCriticalSection(targ->cs); } } return; }

Windows thread (1/2)

Calc Primes

(33)

ERTL

33

WaitForMultipleObjects(THREAD_NUM, handle, TRUE, INFINITE);

/* Destroy critical section Variable */ DeleteCriticalSection(&cs);

/* Output */

for (i = 2; i < DATA_NUM; i++) { if (primes[i] == 1) printf("%d ", i); } printf("¥n"); return 0; } int main() { HANDLE handle[THREAD_NUM]; thread_arg_t targ[THREAD_NUM]; BOOL primes[DATA_NUM]; int i; CRITICAL_SECTION cs;

for (i = 0; i < DATA_NUM; i++) { primes[i] = TRUE;

}

/* Initialize critical section variable */ InitializeCriticalSection(&cs);

for (i = 0; i < THREAD_NUM; i++) { targ[i].id = i;

targ[i].primes = primes; targ[i].mutex = &cs;

handle[i] = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)thread_ func, (void *)&targ[i], 0, NULL);

}

(34)

OpenMP

• Clause

–

付加情報

–

private, shared (変数)

–

reduction (演算)

• #pragma omp critical

• #pragma omp atomic

(35)

ERTL

35

Reduction

Thread 1

counting

Thread 2

counting

Thread 3

counting

Thread 4

counting

Count

Final Result

(36)

#include <stdio.h> #include <math.h> #include <omp.h> #define DATA_NUM 100 int main() { BOOL primes[DATA_NUM]; int I, count; /* Initialize */

#pragma omp parallel for for (i = 0; i < DATA_NUM; i++) primes[i] = TRUE;

OpenMP

Calc Primes

/* Check */

#pragma omp parallel for reduction(+;count) private(limit, j)

for (i = 0; i < DATA_NUM; i++) { limit = (int)sqrt((double)i); for (j = 2; j <= limit; j++) if (primes[j] && i % j == 0) { primes[i] = FALSE; break; } if (j > limit) count++; } /* Output */

Microsoft PowerPoint ppt [互換モード]

ERTL

ERTL

計算機アーキテクチャ特論

２０１２年１０月２２日 枝廣

⽬次

•

並列アーキテクチャモデルとOSモデル、

プログラムモデル

ERTL

ERTL

マルチコアプロセッサの分類

•

ヘテロジニアス vs. ホモジニアス

•

AMP vs. SMP

DSP

CPU

HW

CPU

CPU

CPU

CPU

CPU

SMP型システムの定義

ポイント１：一つの

OS

ポイント２：

対称的に実行可能＝

別の

CPUに移すことが可能

ERTL

ERTL

マルチコアプロセッサの分類

ヘテロジニアス・AMP

ホモジニアス・SMP

ホモジニアス・AMP

HW&SW一体でサブシステム最適化

OSがSWモジュール（スレッド）を

動的にHWへマッピング

AMPとSMPの違い（まとめ）

•

SMP型＝「⼀つのOSで管理され，すべての処理（タスク，

プロセス，スレッドなど）がすべてのCPUにおいて対称的に

実⾏可能であるようなマルチプロセッサによる並列処理⽅

式」

ERTL

ERTL

コヒーレント・キャッシュとシステムモデル

AMP型マルチコアシステムのリアルタイム性

ERTL

ERTL

ＳＭＰ型マルチコアシステムのスケーラビリティ

AMP型とSMP型のプログラムモデル

AMP型

SMP型

ERTL

ERTL

プログラムが並列・並⾏実⾏可能に記述

•

AMP型のプログラム

–

同期・通信以外は通常のソフトウェア

•

SMP型のプログラム

–

スレッド・プログラミング

SMP型マルチコア向けスレッド化プログラミング

•

OSが提供するスレッドライブラリ

•

⾔語仕様内、⾔語拡張のスレッドライブラリ

ERTL

ERTL

OS スレッドライブラリ

•

pthread

–

IEEE POSIX Section 1003.1c

POSIX: Portable Operating System Interface

２０１２年１０月２２日枝廣