Conclusion - Structured Parallel Programming with Trees （木を用いた構造化並列プログラミング）

In this chapter, we have presented a syntax-directed method for constructing a value graph from a given AST. Our method performs in a simple yet precise single-pass manner to the while language with break/continue (or equivalent goto/label) statements. Our method utilizes reduced ASTs for tolerating malignant goto/label statements with respect to precision or cost. The technical novelty of our method is very limited because it is based only upon existing notions and techniques. Our work, however, gives a concrete guideline for implementation. Our approach will be a basis for implementation of high-level compiler optimizations.

We have developed the prototype implementation of C and CA on top of COINS⁷. We also have confirmed that these are actually feasible in a realistic cost. We plan to implement value numbering by using our method and evaluate it practically. Incorporating our approach with high-level compiler optimizations is left for future work.

7http://coins-compiler.sourceforge.jp/

Chapter 9

Lessons from Syntax-Directed Program Analysis

We have developed syntax-directed methods for program analysis in Chapters 7 and 8. In this chapter, we examine this practice from the perspective of syntax-directed programming.

Data-flow analysis (DFA) in Chapter 7 and value-graph construction (VGC) in Chapter 8 are similar.

Both formalize program fragments as functions and calculate results by composing them. This function composition is an interpretation of statement sequencing.

The production rule of statement sequencing is identical to that ofJoin inJListαand both interpre-tations are similar. The interpretation of Join is the concatenation of two lists and that of statement sequencing is the sequential conjunction of two computations. It is therefore natural to interpret state-ment sequencing as function composition in syntax-directed program analysis. The formalization of programs with statement sequencing brings associativity useful for load balancing. This justifies our observation that formalization with trees should be first.

Our syntax-directed method for DFA can be parallelized on the basis of the independence of siblings and the associativity of function composition. We do not use segmented trees for parallelizing it because the balancing of statement sequencing suﬃces for the balancing of a given whole AST. In this case, a extremely deep nesting of if/while statements causes poor load balancing. Because such a nesting in computer programs is highly improbable, segmented trees are not worth using. This justifies our observation that it is important to consider the underlying data of trees.

We thus have justified our observations mentioned in Chapter 5 in this practice. Furthermore, the diﬀerence between DFA and VGC mentioned in Chapter 8 has brought an unsurprising yet important observation: true concerns in problems would not be the structures of given trees.

Recall that the diﬀerence of methods for taming goto/label statements. Its root cause is the diﬀerence of the concerns with a given program. DFA inherently requires the regular paths of a given program and does not use the control flow per se. Meanwhile, VGC inherently requires the control flow of a given program and its dominance relation for ϕ-placement. We cannot determine the dominance relation of a program fragment in the presence of unknown control flow, while we can determine the regular paths around unknown control flow. In the presence of goto/label statements, we therefore can construct more compact intermediate results of DFA than ones of VGC. This diﬀerence between DFA and VGC is irrelevant to given trees and relevant to their problem specifications.

Our syntax-directed methods merely utilize the structure of a given tree for discovering easy parts of the whole computation. Tree computations per se are not essential. Use of trees is merely an issue on algorithmic engineering. This observation is not limited to our methods. For example, the purpose of use of search trees in various applications is acceleration, which is an issue on algorithmic engineering. From this practice, we conjecture that most of tree-based algorithms use trees for algorithmic engineering and not for problem specification.

Part III

Programming With Neighborhood

Chapter 10

Neighborhood Computations

The observations described in Chapter 5 suggest that it is unreasonable to consider from patterns of tree computations first. We should consider application domains first. In Part II, we have selected program analysis for an application domain because its computational structure is similar to list/tree skeletons.

In this part, we attach importance to practical applicability and select spatial computation based on neighborhood, i.e., neighborhood computation, for an application domain.

A typical neighborhood computation is stencil computation [KBB^`07, RMCKB97, BHMS91], which is extensively used in scientific computing. In the standard stencil computation, the space of a domain is divided into a uniform grid, and each cell of the grid is updated by using its neighbor cells. By mapping the grid into multidimensional arrays, the stencil computation is implemented as a regular array-based computation. This stencil computation is embarrassingly parallel because the updating of each cell. A naive parallelization is, however, insuﬃcient in practice because this stencil computation often repeats over time steps in scientific computing. If we perform the stencil computation of each time step sequentially, it incurs a lot of cache misses because of poor locality, and results in poor performance. The existing work [KBB^`07, DMV^`08, TCK^`11] on stencil computation therefore combined parallelization and locality enhancement.

Another typical neighborhood computation is the querying of space-partitioning trees, where a query prunes far subspaces and finds neighbors. It contains range queries to databases and nearest-neighbor queries to statistical datasets. Practical algorithms with reasonable approximation such as the Barnes-Hut algorithm forN-body problems and the photon mapping for global illumination utilize this querying.

Because a suﬃcient number of independent queries are given in practice, this computation is also em-barrassingly parallel. However, a naive parallel querying also has poor locality and incurs a lot of cache misses in iterative traversal of space-partitioning trees. There was also the work [JK11, JK12, JGK13]

to combine parallelization and locality enhancement.

Programming for neighborhood computations thus raises the issue of locality enhancement, which is diﬀerent form load balancing. Nevertheless, programming in a divide-and-conquer manner is highly appropriate because the divide-and-conquer paradigm is known to be highly advantageous both to load balancing and locality enhancement [FLPR99, FS06, BGS10]. We therefore deal with cache-eﬃcient divide-and-conquer approaches to neighborhood computations in this part.

We first deal with stencil computation (Chapter 11). Although it is not an irregular algorithm based on trees, we investigate locality enhancement that promotes a simple divide-and-conquer computation.

We secondly deal with iterative traversal of space-partitioning trees (Chapter 12). We investigate locality enhancement for both general space-partitioning trees and a skeleton that abstracts an iterative querying.

Chapter 11

Time Contraction for Optimizing Stencil Computation

This chapter is self-contained and is a revised and extended version of our unpublished paper [SI11b].

11.1 Introduction

Stencil computation appears extensively in practical applications. It appears particularly in solvers for partial diﬀerential equations (PDEs) discretized through a finite diﬀerence method (FDM). In such cases, stencil computation means to update each spatial grid point by using its temporal and spatial neighborhood. For example, the following is a typical one-dimensional three-point stencil: x^pt`1q_i “ c1x^ptq_i´1`c2x^ptq_i `c3x^ptq_i`1, where xi denotes thei-th cell of discretized space,x^ptq_i denotes a value ofxi at thet-th step of discretized time, and every cj is a constant that is defined by physical conditions.

Owing to the practical importance of stencil computation, its optimization is a long-term research topic. Locality optimization [Won00, MW99, KBB^`07, OG09, FS05, FS07, FS06], vectorization [HSP^`11], and auto-tuning [DMV^`08] for it have been investigated. Even domain-specific compilers for it have been developed [BHMS91, RMCKB97, TCK^`11]. Because memory bandwidth very often becomes a major bottleneck in stencil computation, locality optimization is the most important. Time skewing [Won00, MW99], which is a specialized combination of tiling [WL91] and skewing, is the standard ap-proach to optimizing the locality of stencil loops. Intuitively, time skewing is a reordering of computations such that a small part of the spatial grid progresses by several time-steps.

In this paper, we present a novel approach to optimizing stencil loops. First of all, notice that a stencil loop can often be represented by a recurrence equation with a linear transformation: x^pt`1q “Ax^ptq, wherex^p^t^q denotes a vector of spatial grid points at thet-th time-step, and Ais a sparse matrix whose non-zero elements regularly occur. For example, the following represents the above three-point stencil:

˝ x1

x₂ ... x_N_´1

x_N

‹

‚

pt`1q

“

˝ c2 c3

c₁ c₂ c₃ . .. . .. . ..

c₁ c₂ c₃ c₁ c₂

‹

‚

˝ x1

x₂ ... x_N_´1

x_N

‹

‚

ptq

where the boundary condition is assumed to be zero-constant; i.e., x0 “ xN`1 “ 0. Then, from this recurrence equation, we obtainx^pt`kq“A^kx^ptq. Our key observation is that by computingA^k, we can obtain a wider stencil that computesktime-steps at once. That is,ktime-step iterations are contracted into one. However, if we compute A^k and store this result naively, this contracted loop is too costly.

To solve it, on the basis of the regularity of A, we have developed a cheap way to compute A^k and a compact representation of this result. As a result, this contracted loop becomes eﬃcient, and then it

causes fewer cache misses than the original loop since fewer time-step iterations are performed. This contraction of time-steps thus functions to eliminate redundant memory write and arithmetic, as well as to improve locality of stencil loops.

This paper makes the following contributions:

• We describeloop contraction, a technique to optimize a loop that recurrently applies a linear trans-formation, by contracting multiple iterations into one (Section 11.2). It is easy and mathematically trivial; similar formulations are found in [IKF05, LAAK06, LTA03]. However, we have not found exactly the same one in the context of loop transformations. We therefore believe that, even though not novel, it is a distinct perspective in loop optimizations.

• We have developedtime contraction, a novel approach to optimizing stencil loops. It is a special-ization of loop contraction. We describe its concept, techniques to implement it, and its eﬀects on complexity, as well as its pros and cons (Section 11.3). Time contraction is quite diﬀerent from time skewing [Won00, MW99] and is eﬀective for reducing the cache complexity of stencil loops.

This is the most significant contribution of our work.

• We have experimentally demonstrated the eﬀectiveness of time contraction by using several imple-mentations of a one-dimensional three-point stencil derived from the heat equation (Section 11.5).

In all cases, the time contraction outperformed time skewing (i.e., the standard approach) in per-formance improvement: in the best settings, the time-contracted version achieved 72.4 % better performance then the time-skewed one.

• We describe tuning techniques for stencil loops on the basis of time contraction (Section 11.4).

Time contraction facilities tuning of stencil loops. We have been implementing an experimental FDM library that generates tuned stencil loops by means of these techniques. In a preliminary experiment, a generated loop achieved up to 60 % better performance than a manually vectorized and time-contracted one (Section 11.5).

ドキュメント内 Structured Parallel Programming with Trees （木を用いた構造化並列プログラミング） (ページ 98-106)