• 検索結果がありません。

東北大学機関リポジトリTOUR

N/A
N/A
Protected

Academic year: 2021

シェア "東北大学機関リポジトリTOUR"

Copied!
6
0
0

読み込み中.... (全文を見る)

全文

(1)

Masanori Hariyama and Michitaka Kameyama Graduate School of Information Sciences

Tohoku University

Aoba 6-6-05, Aramaki, Aoba, Sendai, 980-8579, Japan {hariyama,kameyama}@ecei.tohoku.ac.jp

Yasuhiro Kobayashi

Oyama National College of Technology Nakakuki 771, Oyama, Tochigi, 323-0806, Japan

[email protected]

Abstract

One major issue in designing image processors is to de-sign a memory system that supports parallel access with a simple interconnection network. This paper presents a de-sign methodology for a logic-in-memory architecture where each of memory modules is connected to its dedicated pro-cessing element(PE). An efficient memory allocation to min-imize the number of memory modules and PEs under a time constraint is proposed based on regularity.

1

Introduction

Highly-parallel image processors requires a complex in-terconnection network between memory modules and pro-cessing elements(PEs) for parallel memory access. The complex interconnection network causes significant over-head in delay and power in deep-submicron and more ad-vanced technologies since the delay and the power of in-terconnection units are more dominant than those of logic units.

This paper presents a design methodology for a logic-in-memory architecture where each of memory modules is connected to its dedicated PE. The advantage of the logic-in-memory architecture is its simple interconnection net-work between memory modules and PEs. Its disadvantage is that it may require a large number of PEs when a large number of memory modules are required for parallel access. To solve this problem, this paper also presents a memory allocation technique to enable parallel access with the min-imum number of memory modules. From a practical point, a memory allocation should have a simple address function that calculates which memory module a pixel is stored in. Hence, we consider a periodical memory allocation where a memory allocation for a whole image is given by repeat-ing a memory allocation for a partial image. A periodical memory allocation has a simple address function because of its regularity. The memory allocation problem is usually mapped onto the clique partitioning problem that is known as an NP-complete problem[1]. An efficient search method

is presented based on the regularity of a periodical memory allocation. As a result, an optimal memory allocation for a practical image size is found in much less than 1 second on Pentium4@2GHz.

The proposed method is used for a road-extraction VLSI for highly-safe vehicles and a stereo-matching VLSI[2]. In the former and latter cases, the number of memory modules is reduced to 91.7% and 19.4%, respectively.

2

Problem formulation

2.1

Target algorithms

We consider window-type algorithms. In these algo-rithms, the output/intermediate output depends on a small neighborhood of an input image, where the neighborhood size is fixed and given as a window as shown in Fig. 1. Algorithms of this type frequently appear in practical situ-ations: spatial filter, morphology, and image matching, and so on. Moreover, they usually have high degree of paral-lelism, and are suited for VLSI implementations.

We use window-serial-and-pixel-parallel scheduling as shown in Fig. 2. In this scheduling, operations are per-formed in parallel with pixels in a window, whereas opera-tions are performed in a serial manner with windows. Fig-ure 2(a) shows the location of the window at each step. The thick line denotes a window. Figure 2(b) shows the sched-uled data-flow graph(SDFG) corresponding to Fig. 2(a). A node in the SDFG denotes a operation. The labels A and

B on operations denote the operation types of the nodes.

There is an edge from node V1 to nodeV2 when the

out-put ofV1 is used as an input ofV2. At Step 1 in Fig. 2, pixels:(0, 0), (0, 1), (0, 2) and (1, 1) are used as inputs of operations of type A. The pixels must be accessed in par-allel since the A-type operations are performed in parpar-allel. Their results are used as inputs of a B-type operation. As the location of the window changes, the input pixels change.

(2)

Figure 1. Window.

(0,0) (0,1) (0,2) (1,1)

(1,0) (1,1) (1,2) (2,1)

(0,1) (0,2) (0,3) (1,2)

(1,1) (1,2) (1,3) (2,2)

Figure 2. Data-flow graph of a pixel-parallel-and-window-serial schedule.

2.2

Target architecture

Figure 3 shows the basic structure of the logic-in-memory architecture. A single dedicated PE is connected to each memory modules. Pixels of input images are dis-tributed among the memory module. The PEs performs operations of type A according to the window-serial-and-pixel-parallel scheduling shown in Fig. 2(b). Their outputs are used as inputs of the unit for type-B operations. In order to extend the architecture model, you can add inputs to PEs as required.

2.3

Periodical memory allocation

From a practical point, a memory allocation should have a simple address function to map the coordinates of a pixel onto the memory module number. If the address function is complex, its overhead in area and delay become serious. For example, a look-up table is required for address mapping of all pixels in the worst case. Therefore, we consider a periodical memory allocation where a memory allocation

Figure 3. Logic-in-memory architecture.

for a whole image is given by repeating a memory allocation for a partial image. The periodical memory allocation has a simple address function because of its regularity. LetK be the number of memory modules. LetNi be the number of pixels that are allocated to the memory moduleMi for 1 ≤ i ≤ K. Let (xji, yji) be the coordinates of a pixel that is allocated to the memory moduleMifor1 ≤ j ≤ Ni. Then, we exactly define a periodical memory allocation as a memory allocation where the coordinates(xji, yji) of pixels allocated toMiare expressed as

 xji yji  = sji  Ux Uy  + tji  Vx Vy  +  x0 i y0 i  (1)

where U= [UxUy]T and V= [VxVy]T are vectors to rep-resent periods(called period vectors); the variablessji and tji are integers; the coordinatesx0i, yi0are those of the ref-erence pixel allocated toMi. Note that you can select an arbitrary pixel as the reference one from the ones allocated to the same memory module.

Fig. 4(a) shows an example of a periodical memory al-location for the SDFG shown in Fig. 2. The label on a pixel denotes the module number where the pixel is allocated. For example, the pixels: (0, 0),(0, 1),(1, 0) and (1, 1) are allo-cated toM1,M2,M3 andM4, respectively. From Fig. 4 (b), the coordinates of the pixels allocated toM1are given by  xj1 y1j  = sj1  1 2  + tj1  2 0  +  0 0  (2) where the coordinates of the reference pixel and the period vectors are[0 0]T, U = [1 2]T and V=[2 0]T, respectively. Each of the pixels with label 1 in Fig. 4 (b) are given by (sj1, tj1)=(0, 0), (1, 0), (2, 0), (0, 1), (1, 1) or (2, 1). Figures 4-(c), 4-(d) and 4-(e) show the pixels allocated toM2,M3

andM4, respectively. These figures show that the same pe-riod vectors asM1are used forM2,M3andM4. They also show that the coordinates of the reference pixels forM2, M3andM4are[0 1]T,[1 0]T and[1 1]T, respectively. The

memory allocation shown in Fig. 4 satisfies Eq.( (1)), that is, the definition of a periodical memory allocation.

(3)

y

x

y

x

4 4 4

x

y

x

y

4 4

x

4 4 4

Figure 4. Example of a periodical memory al-location.

As mentioned at the beginning of this section, the advan-tage of the periodical memory allocation is its simple ad-dress function. For the periodical memory allocation shown in Fig. 4(a), the address function that maps the coordinates (x, y) of a pixel to the module number is given by

f(x, y) = (2x + y) mod 4 + 1 (3)

2.4

Optimal memory allocation

Let us minimize the hardware amount when the SDFG is specified. In the logic-in-memory architecture, the hard-ware amount is determined by a memory allocation task since the number of PEs is determined by the number of memory modules. The SDFG imposes the degree of paral-lelism in memory access. For example, in the SDFG shown in Fig. 2, the degree of parallelism in memory access is 4. From these observations, minimizing the hardware amount is formulated as finding a memory allocation that minimizes the number of memory modules under a specified degree of parallelism in memory access. Given a window, the optimal memory allocation is defined as the memory allocation that satisfies the following conditions:

C1: For an arbitrary location of the window, pixels in the

window can be retrieved in parallel. In other words, the

modules.

C2: Each pixel is allocated to a single memory module.

This condition ensures that the total memory capacity is minimized.

C3: The number of memory module is minimized. This

condition ensures that the hardware amount is mini-mized in the logic-in-memory architecture.

C4: The memory allocation is a periodical one. This

condi-tion ensures that the hardware for the address funccondi-tion is small.

For example, Fig. 4(a) is the optimal memory allocation for the window shown in Fig. 1 from the following reasons. The condition C1 is satisfied since pixels in the window are allocated to different pixels for an arbitrary location of the window. The condition C2 is satisfied since each pixel has a single label. The condition C3 is satisfied since the number of memory module is 4 and is equal to the minimum num-ber of memory modules required for parallel access, i.e. the number of pixels in the window. The condition C4 is satis-fied since the memory allocation is periodical as mentioned in section 2.3.

2.5

Overview of search method

In the periodical memory allocation, the period vectors

U and V determine which pixels are allocated to the same

memory module as shown in Fig. 4. The period vectors make a parallelogram. Given the period vectors, the num-ber of memory modules is estimated by the area of the par-allelogram. The areaS of the parallelogram is given by

S = |Ux· Vy− Uy· Vx|. (4)

For example shown in Fig. 4,

S = |1 · 0 − 2 · 2| = 4,

andS is exactly same as the number of memory modules. This is because the memory allocation for a whole image is given by repeating the one for the parallelogram, and the parallelogram must be filled with pixels allocated to differ-ent memory modules. From these observations, finding the optimal memory allocation is reduced to finding period vec-tors that make the minimum parallelogram still satisfying the parallel access condition.

Basically, the search for the optimal period vectors for

Uminand Vminis performed by repeating following steps for U and V within possible coordinates of pixels.

Step 1 Check the parallel access condition, that is, whether

the window includes the pixels allocated to the same memory module.

Step 2 If the parallel access condition is satisfied, calculate

(4)

Figure 5. Road extraction.

Figure 6. Window for road extraction . Step 3 If S is smaller than current minimum area Smin,

thenSmin← S, Umin← U, Vmin← V.

To reduce the computational amount, we can limit the reference pixel to[0 0]T as shown in Fig. 4(b). Moreover, we can limit the search space using the branch-and-bound method such that

S = |Ux· Vy− Uy· Vx| < Smin,

where Smin is the current minimum area of the parallelo-gram made by the current period vectors.

3

Design examples

3.1

Road extraction VLSI

Let us design a road extraction VLSI for highly-safe hicles. The “road” is defined as a set of regions that a ve-hicle can climb over. We assume that the 3-D information of the land surface is obtained from pre-processing such as stereo vision. Given the 3-D coordinates of the land surface and the size of the wheel of a vehicle, let us examine a 3-D pixel to see if it is a 3-D pixel on the “road”. LetR be the radius of the wheel. As shown in Fig. 5, letA be the contact between the wheel and the land. LetB be an arbitrary 3-D pixel within the radius fromA. The 3-D pixel A is a part of the “road” if the vertical interval|Hb− Ha| is less than acceptable rangeHmax. This condition is given by

−Hmax≤ Hb− Ha≤ Hmax (5) whereHaandHbare thez-coordinates of A and B, respec-tively. Considering the traveling direction of the vehicle,

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 32 33 34 35 36 37 38 39 31 40 41 23 24 25 26 27 28 29 30 42 43 44 52 53 54 55 56 57 58 59 51 60 61 45 46 47 48 49 50 62 6364 72 73 74 75 76 77 78 79 10 71 80 81 85 86 65 66 67 68 69 70 828384 92 93 94 95 96 97 98 99 91 F0 F1 F5 F6 F7 F8 87 88 89 90 F2F3F4 F9 G0 15 16 17 18 19 20 21 22 37 38 39 40 41 42 43 44 59 60 61 626364 81 85 86 65 66 828384 F5 F6 F7 F8 87 88 F4 F3 F9 G0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 32 33 34 35 36 31 23 24 25 26 27 28 29 30 52 53 54 55 56 57 58 51 45 46 47 48 49 50 72 73 74 75 76 77 78 79 71 80 67 68 69 70 92 93 94 95 96 97 98 99 91 F0 F1 89 90 F2 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 32 33 34 35 36 37 38 39 31 40 41 29 30 42 43 44 52 53 54 55 56 57 58 59 51 60 61 626364 73 74 75 76 77 78 79 80 81 85 86 65 66 828384 95 96 97 98 99 F0 F1 F5 F6 F7 F8 87 88 F2F3F4 F9 G0 1 2 3 4 5 6 23 24 25 26 27 28 45 46 47 48 49 50 72 71 67 68 69 70 92 93 94 91 89 90 7 8 9 10 11 12 13 14 15 16 17 18 19 20 32 33 34 35 36 37 38 39 31 40 41 29 30 42 52 53 54 55 56 57 58 59 51 60 61 6263 64 73 74 75 76 77 78 79 80 8182838485 86 95 96 97 98 99 F0 F1F2F3F4F5 F6 F7 F8 21 22 43 44 65 66 87 88 F9 G0 1 2 3 4 5 6 23 24 25 26 27 28 45 46 47 48 49 50 72 71 67 68 69 70 92 93 94 91 89 90 Fk:100+k, Gk: 110+k

Figure 7. Optimal allocation for Fig. 6.

Figure 8. Logic-in-memory architecture for road extraction.

Eq. (5) is checked forB within a front half circle centered onA as shown in Fig. 6.The black pixel and gray pixels denoteA and B, respectively.

Let us find the optimal memory allocation for the win-dow shown in Fig. 6. Suppose that the area for road ex-traction is 10[m] × 50[m] ahead and that the area is di-vided into a set of squares of 5[cm] × 5[cm]. Then, the 3-D information of the land can be considered as an image of 200 × 1000 pixels. Figure 7 shows the optimal mem-ory allocation. It shows the result for a partial image since the space is limited. The labels Fk andGk mean mem-ory modulesM100+kandM110+k, respectively. The period vectors are U = [14 5]T and V= [6 10]T. The number of memory modules is 110. The search time for the optimal al-location is 40ms on a PC(Pentium4@2GHz, 2G-byte main memory, OS:Window XP). Figures 8 and 9 show the logic-in-memory architecture for road extraction and the block di-agram of the PE, respectively. Note that the interconnection between memory modules and PEs is extended for road ex-traction. The road extraction VLSI is designed in a 0.5μm CMOS technology. Its performance is 1200 times higher than that of Pentium4@2GHz.

Let us compare the proposed method with the conven-tional approach from the point of the number of memory

(5)

Figure 9. Block diagram of the PE.

Figure 10. Rectangular memory allocation.

modules. For the window-type image processing, one ef-ficient memory allocation method is a rectangular memory allocation[3] as shown in Fig. 10. A minimum bounding rectangle approximates a given window(Fig. 10(a)). The optimal memory allocation for the rectangle window is ob-tained by the rectangular memory allocation(Fig. 10(b)). The number of memory modules isK × L for a rectangle window of sizeK × L. The window shown in Fig. 6 is ap-proximated by a rectangle window of size8 × 15. The rect-angular memory allocation requires8×15(= 120) memory modules. As a result, the number of memory modules is re-duced to 91.7%(110/120×100).

3.2

Stereo matching VLSI processor using

multi-resolution images

A coarse-to-fine approach using multi-resolution images is one of important techniques in image processing such as pattern matching. This section describes the optimal mem-ory allocation for stereo matching using multi-resolution images. The method proposed in Section 2 is extended to the case with several different windows in the following.

Stereo matching is one efficient method to obtain 3-D in-formation of real scene. It uses two images taken from two different cameras at the same time. After correspondence between the two images is established, the 3-D information of the scene is computed based on the triangular method. The correspondence search is time-consuming. Figures 11 and 12 show the correspondence search and the DFG, re-spectively. Figure 14 shows the logic-in-memory architec-ture for stereo matching. As a similarity measure between a reference window and a candidate window, a sum of abso-lute differences(SAD) is used. Given a reference window,

VR

VL Left image Right image

Object Candidate windows Reference windows Epipolar line Q L R

Figure 11. Search for a corresponding point.

AD AD ADAD ADAD AD AD ADAD ADAD AD AD ADAD ADAD MVD MVD MVD MVD Q Q ADs Q Q ADs Q Q ADs S1 S2 S3 Sn Sn+1 Time MVD MVD AD AD : Absolute Difference

: Minimum Value Detection

Figure 12. Data flow graph of stereo matching using multi-resolution images.

SADs are computed for all the possible candidate windows. The candidate window with the minimum SAD is deter-mined to be the corresponding pixel.

Multi-resolution images are used to reduce the compu-tational amount. Given sampling periods SR = SRmax, SRmax−1,...,2, 1, reduced images are made by sampling

the original images everySR pixels. Beginning with the lowest-resolution image (SR = SRmax), the resolution is iteratively increased untilSR = 1. The possible location of candidate windows at a higher resolution are limited by using the matching result at a lower resolution. Hence, the computational amount is reduced[2].

Figure 13 shows the windows at multi-resolution images for window sizeQ = 3. As shown in Fig. 13(a), the win-dow of3 ×3 at SR = 1 corresponds to the window of 3 ×3 on the original image. This is because the image atSR = 1 is exactly same as the original image. As shown in Fig. 13(b), the window of3 × 3 at SR = 2 corresponds to the window of5 × 5 on the original image. Note that only 3 × 3 gray pixels must be accessed in parallel. In order to find the optimal memory allocation for such different windows, we make a single window by the union of the windows of Q × Q atSR = SRmax, SRmax−1, ..., 2, 1 on the original

image.

Figure 15 shows the resulting optimal memory alloca-tion forQ = 3 and SRmax = 3. The period vectors are

U= [5 0]T and V= [0 5]T. The number of memory mod-ules is 25. Figures 15 (a), (b) and (c) corresponds to SR=1,

(6)

Figure 13. Window.

ADD

ADD

ADD

M

0

M

1

M

K REG

AD

REG REG

AD

REG REG

AD

REG REG REG

SAD

Mi (i=0,1,..,K):

Memory modules for a candidate image

Figure 14. Logic-in-memory architecture for stereo matching.

2 and 3, respectively. The label(x, y) and k(= 1, 2, · · · 25) on a pixel denote the coordinates of the pixel on the origi-nal image and the memory module number where the pixel is allocated, respectively. You can see that all the pixels in a3 × 3 window of an arbitrary location are distributed be-tween different memory modules at each sampling period. In other words, parallel memory access for a3 × 3 window is achieved at each sampling period. Table 1 shows com-parison between the proposed and the rectangular memory allocation forQ = 4 and SRmax= 8, i.e. a more practical

parameters. The number of memory modules (AD units) are reduced to 19.4%. The search time is 30ms on the PC.

4

Conclusion

This paper presents an interconnection-aware design methodology for image processors. A key to success is a optimal memory allocation for a logic-in-memory architec-ture. The method is also useful for FPGA implementations where interconnection overhead in delay is significant large.

(a) Original image(SR=1)

(b) Reduced image (SR=2) (c) Reduced image (SR=3) (0,0) (0,0) (0,0) (0,1) (1,0) (1,1) (2,0) (2,0) (2,1) (3,0) (3,0) (3,1) (4,0) (4,0) (4,1) (5,0) (5,1) (6,0) (6,0) (6,0) (6,1) (7,0) (7,1) (8,0) (8,0) (8,1) (9,0) (9,0) (9,1) 6 7 8 9 10 11 12 12 13 14 14 15 16 17 17 18 19 20 20 1 2 3 4 5 6 6 6 7 8 8 9 9 10 10 11 12 13 14 15 16 16 17 18 18 19 20 20 21 21 22 23 24 24 25 1 2 3 4 5 6 7 7 7 8 9 9 10 10 11 12 13 14 15 16 17 17 18 19 19 20 21 22 22 23 24 25 25 1 1 1 2 3 3 4 4 5 5 6 7 8 9 10 11 11 12 13 13 14 15 15 16 16 17 18 19 19 20 21 21 22 23 23 24 25 25 21 22 22 23 24 24 25 1 2 2 2 3 4 4 5 5 (0,2) (0,2) (1,2) (2,2) (2,2) (3,2) (4,2) (4,2) (5,2) (6,2) (6,2) (7,2) (8,2) (8,2) (9,2) (0,9) (0,9) (1,9) (2,9) (3,9) (3,9) (4,9) (5,9) (6,9) (6,9) (7,9) (8,9) (9,9) (9,9) (0,3) (0,3) (1,3) (2,3) (3,3) (3,3) (4,3) (5,3) (6,3) (6,3) (7,3) (8,3) (9,3) (9,3) (0,4) (0,4) (1,4) (2,4) (2,4) (3,4) (4,4) (4,4) (5,4) (6,4) (6,4) (7,4) (8,4) (8,4) (9,4) (0,5) (1,5) (2,5) (3,5) (4,5) (5,5) (6,5) (7,5) (8,5) (9,5) (0,6) (0,6) (0,6) (1,6) (2,6) (2,6) (3,6) (3,6) (4,6) (4,6) (5,6) (6,6) (6,6) (6,6) (7,6) (8,6) (8,6) (9,6) (9,6) (0,7) (1,7) (2,7) (3,7) (4,7) (5,7) (6,7) (7,7) (8,7) (9,7) (0,8) (0,8) (1,8) (2,8) (2,8) (3,8) (4,8) (4,8) (5,8) (6,8) (6,8) (7,8) (8,8) (8,8) (9,8)

Figure 15. Optimal memory allocation for multi-resolution images.

Table 1. Comparison between the proposed method and the rectangular allocation.

Proposed Rectangular Allocation Number of modules 121 625 Number of AD units 121 625 Number of adders 123 624

Acknowledgment

This work was supported in part by industrial technology research grant program from New Energy and Industrial Technology Development Organization (NEDO) of Japan.

References

[1] D. Gajski, N. Dutt, A. Wu, and S. Lin, “HIGH-LEVEL SYNTHESIS -Introduction to Chip and System De-sign,” Kluwer Academic Publishers, pp.277-278(1992). [2] M. Hariyama, H. Sasaki, and M. Kameyama, ”Archi-tecture of a Stereo Matching VLSI Processor Based on Hierarchically Parallel Memory Access”, IEICE Trans. Info. and Syst., Vol. E88-D, No.7,pp.1486-1491(2005). [3] M. Hariyama, S. Lee, and M. Kameyama, “Highly-Parallel Stereo Vision VLSI Processor Based on an Op-timal Parallel Memory Access Scheme”, IEICE Trans. Electron., Vol.E84-C, No.5, pp.382-389(2003).

Figure 1. Window.
Figure 4. Example of a periodical memory al- al-location.
Figure 6. Window for road extraction .
Figure 11. Search for a corresponding point.
+2

参照

関連したドキュメント

In section 2 we present the model in its original form and establish an equivalent formulation using boundary integrals. This is then used to devise a semi-implicit algorithm

Theorem 4.8 shows that the addition of the nonlocal term to local diffusion pro- duces similar early pattern results when compared to the pure local case considered in [33].. Lemma

Kilbas; Conditions of the existence of a classical solution of a Cauchy type problem for the diffusion equation with the Riemann-Liouville partial derivative, Differential Equations,

Since the boundary integral equation is Fredholm, the solvability theorem follows from the uniqueness theorem, which is ensured for the Neumann problem in the case of the

The study of the eigenvalue problem when the nonlinear term is placed in the equation, that is when one considers a quasilinear problem of the form −∆ p u = λ|u| p−2 u with

Classical Sturm oscillation theory states that the number of oscillations of the fundamental solutions of a regular Sturm-Liouville equation at energy E and over a (possibly

It leads to simple purely geometric criteria of boundary maximality which bear hyperbolic nature and allow us to identify the Poisson boundary with natural topological boundaries

In all cited papers, the existence of (strong) steady-state solutions to the quan- tum hydrodynamic equations is shown for sufficiently small current densities j 0 &gt;.. In fact,