
2018/10/16 10:25-12:10


(1)

2018/10/16 10:25-12:10

(2)

1. 9/25

2. 10/2

3. 10/9

4. 10/16

5. 10/23

6. 10/30

8. 11/20

9. 11/27

10. 12/4

11. 12/11

12. 12/18 (??)

13. 1/8: RB-H

(3)

• MPI

(4)

1.

2.

3.

4.

5.

6.

(5)
(6)

1 1

1

1

• →

100

(7)

l

(8)


(9)

32 /

112 2

1M /2

(10)


(11)

(L3: Cache )

64 65 66 67

2 2 2 2

(12)

Intel OmniPath Architecture

12.


(13)

• Knights Landing Overview

Chip: 36 tiles interconnected by a 2D mesh
Tile: 2 cores + 2 VPU/core + 1 MB L2
Memory: MCDRAM: 16 GB on-package, high bandwidth; DDR4: 6 channels @ 2400, up to 384 GB
IO: 36 lanes PCIe Gen3; 4 lanes of DMI for chipset
Node: 1-socket only
Fabric: Omni-Path on-package (not shown)
Vector peak performance: 3+ TF DP and 6+ TF SP flops
Scalar performance: ~3x over Knights Corner
STREAM Triad (GB/s): MCDRAM: 400+; DDR: 90+


Source Intel: All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. KNL data are preliminary based on current expectations and are subject to change without notice. 1Binary Compatible with Intel Xeon processors using Haswell Instruction Set (except TSX). 2Bandwidth numbers are based on STREAM-like memory access pattern when MCDRAM used as flat memory. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

[Figure: KNL package diagram (Omni-Path not shown). 36 tiles connected by a 2D mesh interconnect; MCDRAM devices attached through EDCs; two DDR memory controllers (DDR MC), each with 3 DDR4 channels; PCIe Gen 3 (2 x16, 1 x4) and x4 DMI.]

HotChips27 KNL

Potential future options subject to change without notice.

All timeframes, features, products and dates are preliminary forecasts and subject to change without further notification.

Three products: KNL Self-Boot (Baseline), KNL Self-Boot w/ Fabric (Fabric Integrated), KNL Card (PCIe-Card)

[Figure: one tile = 2 cores, each with 2 VPUs, sharing a 1 MB L2]

MCDRAM: 490 GB/s

DDR4: 115.2 GB/s (= 8 Byte × 2400 MHz × 6 channels)

MCDRAM: 16 GB

+ DDR4

(14)


(15)
(16)

• 0.4

(17)

• 5

0.4 0.4 0.4 0.4 0.4

(18)

• 2 1

• 2.4 2

• 2.8 3

• 3.2 4

• 3.4 5

• 3.8 6

• 0.63

0.4

(19)

1.

• 1.

2.

2.

• 1.

2.

(20)

• -

for (j=0; j<n; j++)
  for (i=0; i<n; i++) {
    y[j] += A[j][i] * x[i];
  }

A[0][0]  x[0]  A[0][0]*x[0]  y[0]
A[0][1]  x[1]  A[0][1]*x[1]  y[0]
A[0][2]  x[2]

(21)

A[0][0]  x[0]  A[0][0]*x[0]  y[0]
A[0][1]  x[1]  A[0][1]*x[1]  y[0]
A[0][2]  x[2]  A[0][2]*x[2]  y[0]
A[0][3]  x[3]  A[0][3]*x[3]  y[0]
A[0][4]  x[4]  A[0][4]*x[4]  y[0]

… l

N

→N

(22)

1.

→CPU CPU→

i, j

2.

3.

(23)

1.

2.

• Intel Pentium4

• CPU CPU

Xeon Phi: KNC 7 ,

• Broadwell 14-19 (?)

• KNL 14

(24)

• 32

• AVX512 8

16

• AVX512

• 16 2 AVX512 32

• 1.4

1.4 GHz * 32 = 44.8 GFLOPS/core

• 68

44.8 * 68 = 3046.4 GFLOPS/node

= 3.0464 TFLOPS/node
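The arithmetic on this slide can be checked with a few lines of C. This is a minimal sketch, not part of the lecture material: the function names are hypothetical, and the breakdown of the 32 flops per cycle (8 double-precision lanes per AVX-512 register, 2 flops per fused multiply-add, 2 VPUs per core) is an assumption that matches the numbers shown above.

```c
#include <assert.h>
#include <stdio.h>

/* Double-precision flops per cycle per core, assuming
   8 DP lanes (AVX-512) x 2 flops per FMA x 2 VPUs per core. */
int knl_flops_per_cycle(void)
{
    const int dp_lanes = 8;
    const int flops_per_fma = 2;
    const int vpus_per_core = 2;
    return dp_lanes * flops_per_fma * vpus_per_core;
}

/* Theoretical peak in GFLOPS:
   clock (GHz) x flops per cycle per core x number of cores. */
double peak_gflops(double clock_ghz, int flops_per_cycle, int cores)
{
    assert(clock_ghz > 0.0 && flops_per_cycle > 0 && cores > 0);
    return clock_ghz * flops_per_cycle * cores;
}

/* Prints the per-core and per-node peak for 1.4 GHz and 68 cores. */
void print_knl_peak(void)
{
    int fpc = knl_flops_per_cycle();
    printf("per core: %.1f GFLOPS\n", peak_gflops(1.4, fpc, 1));
    printf("per node: %.1f GFLOPS\n", peak_gflops(1.4, fpc, 68));
}
```

Calling `print_knl_peak()` from a `main` function prints the per-core and per-node values computed from the slide's inputs.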

(25)
(26)

1.

2.

(27)

(28)

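• k-loop unrolled by 2 (assuming n is divisible by 2)

for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    for (k=0; k<n; k+=2)
      C[i][j] += A[i][k]*B[k][j] + A[i][k+1]*B[k+1][j];

→ The number of k-loop iterations (and loop-condition tests) is halved.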

(29)

• j-loop unrolled by 2 (assuming n is divisible by 2)

for (i=0; i<n; i++)
  for (j=0; j<n; j+=2)
    for (k=0; k<n; k++) {
      C[i][j  ] += A[i][k]*B[k][j  ];
      C[i][j+1] += A[i][k]*B[k][j+1];
    }

→ A[i][k] can be kept in a register and reused.

(30)

• i-loop unrolled by 2 (assuming n is divisible by 2)

for (i=0; i<n; i+=2)
  for (j=0; j<n; j++)
    for (k=0; k<n; k++) {
      C[i  ][j] += A[i  ][k]*B[k][j];
      C[i+1][j] += A[i+1][k]*B[k][j];
    }

→ B[k][j] can be kept in a register and reused.

(31)

• i-loop and j-loop both unrolled by 2 (assuming n is divisible by 2)

for (i=0; i<n; i+=2)
  for (j=0; j<n; j+=2)
    for (k=0; k<n; k++) {
      C[i  ][j  ] += A[i  ][k]*B[k][j  ];
      C[i  ][j+1] += A[i  ][k]*B[k][j+1];
      C[i+1][j  ] += A[i+1][k]*B[k][j  ];
      C[i+1][j+1] += A[i+1][k]*B[k][j+1];
    }

→ A[i][k], A[i+1][k], B[k][j], B[k][j+1] can be kept in registers and reused.

(32)

l

for (i=0; i<n; i+=2)
  for (j=0; j<n; j+=2) {
    dc00 = C[i  ][j  ]; dc01 = C[i  ][j+1];
    dc10 = C[i+1][j  ]; dc11 = C[i+1][j+1];
    for (k=0; k<n; k++) {
      da0 = A[i  ][k]; da1 = A[i+1][k];
      db0 = B[k][j  ]; db1 = B[k][j+1];
      dc00 += da0*db0; dc01 += da0*db1;
      dc10 += da1*db0; dc11 += da1*db1;
    }
    C[i  ][j  ] = dc00; C[i  ][j+1] = dc01;
    C[i+1][j  ] = dc10; C[i+1][j+1] = dc11;
  }
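The unrolled loops above assume that n is divisible by 2. The following is a minimal sketch, not taken from the lecture's sample program, of how the 2x2-unrolled kernel with scalar temporaries might be completed with cleanup loops for odd n and checked against the plain triple loop. The function names and the integer-valued test data are hypothetical; C99 variable-length array parameters are assumed to be available.

```c
#include <assert.h>

/* Reference triple loop: C += A * B for n x n matrices. */
void matmul_naive(int n, double A[n][n], double B[n][n], double C[n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* i- and j-loops unrolled by 2 with scalar temporaries.
   The row and column left over when n is odd go through cleanup loops. */
void matmul_unroll2x2(int n, double A[n][n], double B[n][n], double C[n][n])
{
    assert(n > 0);
    int n2 = n - (n % 2);   /* largest even number <= n */

    for (int i = 0; i < n2; i += 2) {
        for (int j = 0; j < n2; j += 2) {
            double dc00 = C[i][j],     dc01 = C[i][j + 1];
            double dc10 = C[i + 1][j], dc11 = C[i + 1][j + 1];
            for (int k = 0; k < n; k++) {
                double da0 = A[i][k], da1 = A[i + 1][k];
                double db0 = B[k][j], db1 = B[k][j + 1];
                dc00 += da0 * db0; dc01 += da0 * db1;
                dc10 += da1 * db0; dc11 += da1 * db1;
            }
            C[i][j] = dc00;     C[i][j + 1] = dc01;
            C[i + 1][j] = dc10; C[i + 1][j + 1] = dc11;
        }
    }
    if (n2 < n) {
        /* Last column for rows 0..n2-1, then the whole last row. */
        for (int i = 0; i < n2; i++)
            for (int k = 0; k < n; k++)
                C[i][n2] += A[i][k] * B[k][n2];
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[n2][j] += A[n2][k] * B[k][j];
    }
}

/* Returns 1 if both kernels give identical results for small
   integer-valued inputs (exactly representable, so == is safe). */
int unroll_matches_naive(int n)
{
    assert(n > 0);
    double A[n][n], B[n][n], C1[n][n], C2[n][n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            A[i][j] = i + j;
            B[i][j] = i - j + 1;
            C1[i][j] = C2[i][j] = 0.0;
        }
    matmul_naive(n, A, B, C1);
    matmul_unroll2x2(n, A, B, C2);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (C1[i][j] != C2[i][j]) return 0;
    return 1;
}
```

Keeping the cleanup outside the main loop nest leaves the unrolled body free of conditionals, which is the usual reason for splitting it this way.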

(33)

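• k-loop unrolled by 2 (assuming n is divisible by 2)

do i=1, n
  do j=1, n
    do k=1, n, 2
      C(i, j) = C(i, j) + A(i, k)*B(k, j) &
                        + A(i, k+1)*B(k+1, j)
    enddo
  enddo
enddo

→ The number of k-loop iterations (and loop-condition tests) is halved.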

(34)

• j-loop unrolled by 2 (assuming n is divisible by 2)

do i=1, n
  do j=1, n, 2
    do k=1, n
      C(i, j  ) = C(i, j  ) + A(i, k)*B(k, j  )
      C(i, j+1) = C(i, j+1) + A(i, k)*B(k, j+1)
    enddo
  enddo
enddo

→ A(i, k) can be kept in a register and reused.

(35)

• i-loop unrolled by 2 (assuming n is divisible by 2)

do i=1, n, 2
  do j=1, n
    do k=1, n
      C(i  , j) = C(i  , j) + A(i  , k)*B(k, j)
      C(i+1, j) = C(i+1, j) + A(i+1, k)*B(k, j)
    enddo
  enddo
enddo

→ B(k, j) can be kept in a register and reused.

(36)

• i-loop and j-loop both unrolled by 2 (assuming n is divisible by 2)

do i=1, n, 2
  do j=1, n, 2
    do k=1, n
      C(i  , j  ) = C(i  , j  ) + A(i  , k)*B(k, j  )
      C(i  , j+1) = C(i  , j+1) + A(i  , k)*B(k, j+1)
      C(i+1, j  ) = C(i+1, j  ) + A(i+1, k)*B(k, j  )
      C(i+1, j+1) = C(i+1, j+1) + A(i+1, k)*B(k, j+1)
    enddo
  enddo
enddo

→ A(i,k), A(i+1,k), B(k,j), B(k,j+1) can be kept in registers and reused.

(37)

l

do i=1, n, 2
  do j=1, n, 2
    dc00 = C(i  , j  ); dc01 = C(i  , j+1)
    dc10 = C(i+1, j  ); dc11 = C(i+1, j+1)
    do k=1, n
      da0 = A(i  , k); da1 = A(i+1, k)
      db0 = B(k, j  ); db1 = B(k, j+1)
      dc00 = dc00 + da0*db0; dc01 = dc01 + da0*db1
      dc10 = dc10 + da1*db0; dc11 = dc11 + da1*db1
    enddo
    C(i  , j  ) = dc00; C(i  , j+1) = dc01
    C(i+1, j  ) = dc10; C(i+1, j+1) = dc11
  enddo
enddo

(38)
(39)

i j

}

i, j

1 5 9 13

2 6 10 14

3 7 11 15

4 8 12 16

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

i

j

i

j
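The two 4x4 grids above appear to number the matrix elements in memory order, once column by column (1 5 9 13 in the first row) and once row by row (1 2 3 4 in the first row). Assuming that reading, which corresponds to Fortran (column-major) and C (row-major) array storage, the offset of element (i, j) can be sketched as follows. The function names are hypothetical and the indices are 0-based.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i, j) when the matrix is stored row by row,
   as C does for A[i][j]: consecutive j values are adjacent. */
size_t row_major_offset(size_t i, size_t j, size_t n_cols)
{
    assert(j < n_cols);
    return i * n_cols + j;
}

/* Offset of element (i, j) when the matrix is stored column by column,
   as Fortran does for A(i, j): consecutive i values are adjacent. */
size_t col_major_offset(size_t i, size_t j, size_t n_rows)
{
    assert(i < n_rows);
    return j * n_rows + i;
}
```

With this layout, the innermost loop should run over j in C and over i in Fortran so that successive accesses touch adjacent memory locations.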

(40)

… …

(41)

i

(42)

2.

(43)

• →

(44)

(45)

2

(46)

1.

• (&A)++

• 2.

3.

(47)

• Sparc64 Iv L1 2Way

• OpenMP

• KNL cache L3 Direct Mapping(=1 Way)

• OpenMP Static

Dynamic(Cyclic) L2

L1

• !$OMP DO SCHEDULE(static,1)

SACSIS2012

http://sacsis.hpcc.jp/2012/files/SACSIS2012-tutorial1-pub.pdf

(48)

-

(49)

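• Start emacs: emacs <file name>

• ^x ^s : save the file (^ denotes the Control key)

• ^x ^c : exit emacs

^z : suspend emacs without exiting

• ^g : cancel the current command

• ^k : delete (kill) from the cursor to the end of the line

• ^y : paste (yank) the text removed by ^k

• ^s : search forward

• M-x goto-line : jump to a given line number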

(50)

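• rm <file> : delete a file

• rm *~ : delete backup files such as test.c~, whose names end in ~ (matched by *~)

• ls : list the files in the current directory

• cd <directory> : change to a directory

• cd .. : move up to the parent directory

• cd ~ : move to the home directory

• cat <file> : print the contents of a file

• make : build the executable according to the Makefile

• make clean : remove the build products (requires a clean target in the Makefile)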

(51)

• less (cat

• : 1

• / :

• q

(52)

• C Fortran

Mat-Mat-noopt-ofp.tar.gz

• mat-mat-noopt.bash

lecture-flat lecture7-flat

gt00 gt17 pjsub

• lecture-flat :

• lecture7-flat:

(53)

• $ cd /work/gt17/t170xx

$ cp /work/gt17/z30105/Mat-Mat-noopt-ofp.tar.gz ./

$ tar xvfz Mat-Mat-noopt-ofp.tar.gz

$ cd Mat-Mat-noopt

$ cd C   : for the C version

$ cd F   : for the Fortran version

• $ make

• $ pjsub mat-mat-noopt.bash

• $ cat mat-mat-noopt.bash.oXXXX   (XXXX is a number)

(54)

• (C )

N = 512

Mat-Mat time = 12.511196 [sec.]

21.455619 [MFLOPS]

OK!

N = 512

Mat-Mat time = 13.501827 [sec.]

19.881417 [MFLOPS]

OK!

DDR4

MCDRAM
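The MFLOPS figures in the output are consistent with counting 2N^3 floating-point operations for an N x N matrix product (N^3 multiplications and N^3 additions). That the sample program uses exactly this formula is an assumption, but it reproduces the reported values; the function names below are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Flop count of a dense n x n matrix product:
   n^3 multiplications + n^3 additions. */
double matmat_flops(int n)
{
    assert(n > 0);
    double dn = (double)n;
    return 2.0 * dn * dn * dn;
}

/* Achieved rate in MFLOPS for a measured wall-clock time in seconds. */
double matmat_mflops(int n, double seconds)
{
    assert(seconds > 0.0);
    return matmat_flops(n) / seconds / 1.0e6;
}

/* Prints the rate for the first C run shown above. */
void print_example_rate(void)
{
    printf("%f [MFLOPS]\n", matmat_mflops(512, 12.511196));
}
```

For N = 512 the flop count is 268,435,456, so dividing by the measured 12.511196 seconds gives about 21.46 MFLOPS, in line with the first run.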

(55)

• (Fortran )

N = 512

Mat-Mat time[sec.] = 24.4274609088898 MFLOPS = 10.9890854527813

OK!

N = 512

Mat-Mat time[sec.] = 27.0449259281158 MFLOPS = 9.92553856630092

OK!

DDR4

MCDRAM

(56)

• (DDR4 )

• mpiexec.hydra -n ${PJM_MPI_PROC} ./mat-mat-noopt

• MCDRAM

• mpiexec.hydra -n ${PJM_MPI_PROC} numactl -m 1 ./mat-mat-noopt

(57)

• #define N 512

• #define DEBUG -

• MyMatMat

• N N

(58)

integer, parameter :: NN=512

(59)

• MyMatMat

(60)

1. [L ] -

2. [L ] - i, j, k

•L00:

•L10

•L20

•L30

•L40

•L50

L

(61)

1.

,

2. Kevin Dowd

C

,

(62)
