Implementation Overview - Proposed Parallelizer

4.3 Proposed Parallelizer

4.3.4 Implementation Overview

We implemented our parallelizer on top of COINS¹. Program analysis and transformation were imple-mented at the HIR level, where the AST of a given C program almost remains as the HIR. Figure 4.6 shows the compilation pipeline of our parallelizer. In the rest of this subsection, we describe each com-pilation step briefly along the pipeline described in Figure 4.6 by using the program described in Figure 4.1 as a running example.

Pragma Processing

After the C frontend construct HIRs, we first find and interpret#pragma.

If any #pragma commSemiring is not given, our parallelizer uses only built-in semirings. If a user-defined function is used in#pragma commSemiring, we generate a unique operator for the function and then replace its function calls with binary expressions internally in HIRs. As a result, we can handle the expressions over all commutative semirings uniformly on HIRs. Note that binary expressions with

1http://coins-compiler.sourceforge.jp/

4.3. PROPOSED PARALLELIZER 39

# p r a g m a c o m m S e m i r i n g max " + " N I N F 0

# p r a g m a e x c l u d e max N I N F e x t e r n int N I N F ;

e x t e r n int max ( int a , int b );

p a i r _ i n t m i s _ f b t ( B N i n t * t ) { p a i r _ i n t ret ;

if ( t - > l == N U L L && t - > r == N U L L ) { ret . _1 = N I N F ;

ret . _2 = 0;

} e l s e {

p a i r _ i n t l = m i s _ f b t ( t - > l ) , r = m i s _ f b t ( t - > r );

ret . _1 = t - > v + l . _2 + r . _2 ;

ret . _2 = max ( l . _1 , l . _2 ) + max ( r . _1 , r . _2 );

}

r e t u r n ret ; }

Figure 4.5: mis_fbt, which calculates the maximum independent sum of a given full binary tree.

generated operators are transformed back to function calls just before code generation. For example, after our parallelizer processes #pragma commSemiring in Figure 4.1, the max function is interpreted as an operator. Then, our parallelizer assumes that the maxfunction and the + operator consist of a commutative semiring.

The symbols declared#pragma pure and #pragma excludeare registered in this step. The former information is used in matrix extraction and the latter information is used in code generation.

Case Extraction

Our parallelizer tries to parallelize all given function definitions. It first constructs from each target definition, program slices in four cases that correspond to the four primary operators for reduce over general binary trees (see Section 4.4). Specifically, the slice s₀₀ in the case of no child (i.e., Leaf), the slices₁₀ in the case of only the left child (i.e.,Left), the slices₀₁in the case of only the right child (i.e., Right), and the slices₁₁ in the case of both children (i.e.,Branch) are constructed.

In the rest of compilation process, we focus on themisfunction definition. For example, the following is the slices10 constructed frommis.

p a i r _ i n t mis ( B N i n t * t ) {

p a i r _ i n t ret , l = { NINF ,0} , r = { NINF , 0 } ; l = mis ( t - > l );

ret . _1 = t - > v + l . _2 + r . _2 ;

ret . _2 = max ( l . _1 , l . _2 ) + max ( r . _1 , r . _2 );

r e t u r n ret ; }

In this step, we require consideration of short-circuit logical operators &&and||in COINS. The C frontend translates both into if statements containing goto/label statements. Because the goto/label statements complicate control flow, we eliminate them by replicating the then/else part of if statements containing them. As a result, the implementation of case extraction becomes simple.

Figure 4.6: Compilation pipeline of our parallelizer.

Symbolic Execution

Program slices represent computations in the form of statements. Computations in the form of pure expressions are appropriate to operators for skeletons. We therefore transform the program slices s₀₀, s₀₁,s₁₀, and s₁₁respectively into expressionse₀₀,e₀₁,e₁₀, ande₁₁ through symbolic execution.

Specifically, symbolic execution is to update a symbolic environment (i.e., a mapping from variables to expressions) in a step-by-step abstract interpretation where assignments cause symbolic substitution and if statements construct ternary conditional expressions. The resultant expression is a return expression to which we apply symbolic substitution based on a final symbolic environment.

After we obtain a single expression corresponding to a slice, we abstract the payload value of a current node, the result of the recursive call for a left child, and that for a right child, by using special variablesv, l, andr. These special variables correspond to the formal parameters of operators for skeletons. Finally, constructed expressionse00,e01,e10, ande11become equivalent to the body expressions of the primary operatorsk00,k01,k10, andk11forreduce over general binary trees (see Section 4.4).

For example, the s10ofmisis transformed into the followinge10:

_ p a i r _ i n t ( v + l . _2 + 0 , max ( l . _1 , l . _2 ) + max ( NINF , 0)),

where the function _pair_int is a helper function that works as the constructor of pair_int. Our parallelizer generates such helper functions for convenience.

After e₀₀, e₀₁, e₁₀, and e₁₁ are obtained, we validate them. If we find local variables in expressions, these values are unspecified. We regard such expressions as undefined cases and do not use the expressions themselves in later steps. We, however, register undefined cases for generating halt operators.

Matrix Extraction

To obtain auxiliary operators, we have to formalize a target function as a multilinear computation. This corresponds to transforming the primary operators used for internal nodes, i.e.,e01,e10, ande11into the

4.3. PROPOSED PARALLELIZER 41 matrix-vector product like the equation (4.1). That is, we have to extract coeﬃcient matrices.

By assuming the linearity of expressions, we can implement the extraction of coeﬃcient matrices simply as partial diﬀerentiation over a commutative semiring. If we obtain nonlinear a partial derivative, we judge that matrix extraction fails. That is, partial diﬀerentiation also works as linearity checking.

Specifically, we extract four coeﬃcient matricesA_l,A_r,A_s_l, andA

rsthat are respectively used in the auxiliary operatorsψ_l,ψ_r,ψ_s_l, and ψ

rsforreduce over general binary trees (see Section 4.4). A_l consists of the partial derivatives ofe11with respect tol;Arconsists of the partial derivatives ofe11with respect tor;A

srconsists of partial derivative ofe01with respect tor;A_s_lconsists of the partial derivatives ofe10

with respect tol.

More precisely, augmented coeﬃcient matrices in the equation (4.2) are extracted. The extraction of the part of bi is immediate. It is suﬃcient to substitute l or rwith a null vector. For example, from thee10ofmis, we can extract the following augmented matrix:

NINF v NINF

0 0 NINF

NINF NINF 0

‚.

The p1,2q-component here is the partial derivative of the first memberv + l._2 + 0 of e10, with re-spective to the second memberl._2 ofl.

Our parallelizer tries matrix extraction successively with respect to each commutative semiring until all coeﬃcient matrices are successfully obtained. In this step, our parallelizer also performs constant folding over a target commutative semiring. Although this constant folding would improve the runtime performance of generated operators, its main aim is to simplify the construction of abstract matrices in matrix materialization.

Matrix Materialization

Extracted matrices often consist much of 0 and 1 over a semiring. In particular, augmented coeﬃcient matrices, which our parallelizer actually extracts, contain 0 and 1 by definition. Some components of such matrices remain 0 or 1 in matrix-matrix multiplication. Then, we can eliminate such invariant components from dynamic (i.e., runtime) data and instead embed constants into the code of auxiliary operators. In the step of matrix materialization, we find such invariant components and define the concrete data types of matrices.

To determine invariant components of 0 or 1, we have only to distinguish 0, 1, and dynamic values, i.e, the values of variables. Letting V be the abstract value of variables, we consider a three-valued commutative semiring pt0,1, Vu,`,ˆ,0,1q, where V `a “ V for any a P t0,1, Vu and V ˆV “ V. We construct abstract matrices overpt0,1, Vu,`,ˆ,0,1qfrom extracted matrices. Then, the supremum of all possible matrix products derived from the abstract matrices suﬃces for determining necessary components as dynamic data. To calculate supremums, we define the join operator\over t0,1, Vu as V \c “V andc\c¹ “V for any cP t0,1u. In summary, we can calculate the supremum of abstract matrix products as the least fixed point of matrix multiplication over pt0,1, Vu,`,ˆ,0,1qand matrix updating overpt0,1, Vu,\q with all abstract matrices.

For example, the supremum of abstract matrix products derived from augmented coeﬃcient matrices ofmisis the following:

V V 0

0 0 1

‚.

Then, the four components of V suﬃces for the dynamic data of a matrix. Our parallelizer therefore defines the data type of matrices as the followingstruct.

s t r u c t _ m i s _ m a t {

int _0_0 , _0_1 , _1_0 , _ 1 _ 1 ; };

The member names here correspond to the zero-based numbering of component indices.

Operator Generation and Code Generation

Since e00, e01, e10, and e11 are equivalent to their bodies, it is immediate to generate the primary operators. After augmented coeﬃcient matrices and matrix data types are obtained, we can generate auxiliary operators simply by using symbolic matrix multiplication, following the general forms described in Section 4.2. Note that the special variables v, l, and r are captured as the formal parameters of generated operators.

The final code generation is straightforward. Because of high compatibility between C89 and C++03, we have been able to reuse the C code generator of COINS for code generation. The code generation of helper functions and matrix data types is completely performed by the C code generator. We implemented an ad hoc C++ code generator for entry functions and operator definitions as a simple extension of the C code generator because they necessitate C++ functionality.

From the augmented coeﬃcient matrix above of mis, generated is the following function (more pre-cisely, a member function of C++ function objects) that corresponds toψ

rs.

h i r _ t _ v o i d o p e r a t o r ()( s t r u c t _ m i s _ m a t & m , c o n s t h i r _ t _ i n t v ) c o n s t {

h i r _ s _ p r i v a t e h i r _ v _ a u t o h i r _ t _ i n t _ v a r 4 3 ; h i r _ s _ p r i v a t e h i r _ v _ a u t o h i r _ t _ i n t _ v a r 4 5 ; h i r _ s _ p r i v a t e h i r _ v _ a u t o h i r _ t _ i n t _ v a r 4 7 ; h i r _ s _ p r i v a t e h i r _ v _ a u t o h i r _ t _ i n t _ v a r 4 9 ; _ v a r 4 3 = ( h i r _ _ _ A D D ( v , m . _ 1 _ 0 ));

_ v a r 4 5 = ( h i r _ _ _ A D D ( v , m . _ 1 _ 1 ));

_ v a r 4 7 = max ( m . _0_0 , m . _ 1 _ 0 );

_ v a r 4 9 = max ( m . _0_1 , m . _ 1 _ 1 );

m . _0 _ 0 = _ v a r 4 3 ; m . _0 _ 1 = _ v a r 4 5 ; m . _1 _ 0 = _ v a r 4 7 ; m . _1 _ 1 = _ v a r 4 9 ; }

The matrix argumentmis overwritten with a return value because passed data is unnecessary inreduce.

Right before code generation, our parallelizer excludes symbols inconvenient to C++ programs. For example,wchar_t is a typedefed type in C but a keyword in C++. Ifwchar_t were generated as C, resultant programs would have a syntactic error in C++. We therefore have to exclude wchar_t from code generation. Similarly, we exclude symbols specified in#pragma excludefrom code generation.

ドキュメント内 Structured Parallel Programming with Trees （木を用いた構造化並列プログラミング） (ページ 50-54)