東北大学機関リポジトリTOUR

(1)

Masanori Hariyama and Michitaka Kameyama Graduate School of Information Sciences

Tohoku University

Aoba 6-6-05, Aramaki, Aoba, Sendai, 980-8579, Japan {hariyama,kameyama}@ecei.tohoku.ac.jp

Yasuhiro Kobayashi

Oyama National College of Technology Nakakuki 771, Oyama, Tochigi, 323-0806, Japan

Abstract

One major issue in designing image processors is to de-sign a memory system that supports parallel access with a simple interconnection network. This paper presents a de-sign methodology for a logic-in-memory architecture where each of memory modules is connected to its dedicated pro-cessing element(PE). An efﬁcient memory allocation to min-imize the number of memory modules and PEs under a time constraint is proposed based on regularity.

1 Introduction

Highly-parallel image processors requires a complex in-terconnection network between memory modules and pro-cessing elements(PEs) for parallel memory access. The complex interconnection network causes signiﬁcant over-head in delay and power in deep-submicron and more ad-vanced technologies since the delay and the power of in-terconnection units are more dominant than those of logic units.

This paper presents a design methodology for a logic-in-memory architecture where each of memory modules is connected to its dedicated PE. The advantage of the logic-in-memory architecture is its simple interconnection net-work between memory modules and PEs. Its disadvantage is that it may require a large number of PEs when a large number of memory modules are required for parallel access. To solve this problem, this paper also presents a memory allocation technique to enable parallel access with the min-imum number of memory modules. From a practical point, a memory allocation should have a simple address function that calculates which memory module a pixel is stored in. Hence, we consider a periodical memory allocation where a memory allocation for a whole image is given by repeat-ing a memory allocation for a partial image. A periodical memory allocation has a simple address function because of its regularity. The memory allocation problem is usually mapped onto the clique partitioning problem that is known as an NP-complete problem[1]. An efﬁcient search method

is presented based on the regularity of a periodical memory allocation. As a result, an optimal memory allocation for a practical image size is found in much less than 1 second on Pentium4@2GHz.

The proposed method is used for a road-extraction VLSI for highly-safe vehicles and a stereo-matching VLSI[2]. In the former and latter cases, the number of memory modules is reduced to 91.7% and 19.4%, respectively.

2 Problem formulation

2.1 Target algorithms

We consider window-type algorithms. In these algo-rithms, the output/intermediate output depends on a small neighborhood of an input image, where the neighborhood size is ﬁxed and given as a window as shown in Fig. 1. Algorithms of this type frequently appear in practical situ-ations: spatial ﬁlter, morphology, and image matching, and so on. Moreover, they usually have high degree of paral-lelism, and are suited for VLSI implementations.

We use window-serial-and-pixel-parallel scheduling as shown in Fig. 2. In this scheduling, operations are per-formed in parallel with pixels in a window, whereas opera-tions are performed in a serial manner with windows. Fig-ure 2(a) shows the location of the window at each step. The thick line denotes a window. Figure 2(b) shows the sched-uled data-ﬂow graph(SDFG) corresponding to Fig. 2(a). A node in the SDFG denotes a operation. The labels A and

B on operations denote the operation types of the nodes.

There is an edge from node V1 to nodeV2 when the

out-put ofV₁ is used as an input ofV₂. At Step 1 in Fig. 2, pixels:(0, 0), (0, 1), (0, 2) and (1, 1) are used as inputs of operations of type A. The pixels must be accessed in par-allel since the A-type operations are performed in parpar-allel. Their results are used as inputs of a B-type operation. As the location of the window changes, the input pixels change.

(2)

Figure 1. Window.

(0,0) (0,1) (0,2) (1,1)

(1,0) (1,1) (1,2) (2,1)

(0,1) (0,2) (0,3) (1,2)

(1,1) (1,2) (1,3) (2,2)

Figure 2. Data-ﬂow graph of a pixel-parallel-and-window-serial schedule.

2.2 Target architecture

Figure 3 shows the basic structure of the logic-in-memory architecture. A single dedicated PE is connected to each memory modules. Pixels of input images are dis-tributed among the memory module. The PEs performs operations of type A according to the window-serial-and-pixel-parallel scheduling shown in Fig. 2(b). Their outputs are used as inputs of the unit for type-B operations. In order to extend the architecture model, you can add inputs to PEs as required.

2.3 Periodical memory allocation

From a practical point, a memory allocation should have a simple address function to map the coordinates of a pixel onto the memory module number. If the address function is complex, its overhead in area and delay become serious. For example, a look-up table is required for address mapping of all pixels in the worst case. Therefore, we consider a periodical memory allocation where a memory allocation

Figure 3. Logic-in-memory architecture.

for a whole image is given by repeating a memory allocation for a partial image. The periodical memory allocation has a simple address function because of its regularity. LetK be the number of memory modules. LetN_i be the number of pixels that are allocated to the memory moduleM_i for 1 ≤ i ≤ K. Let (xj_i, yj_i) be the coordinates of a pixel that is allocated to the memory moduleM_ifor1 ≤ j ≤ N_i. Then, we exactly deﬁne a periodical memory allocation as a memory allocation where the coordinates(xj_i, yj_i) of pixels allocated toM_iare expressed as

xj_i yj_i = sj_i Ux Uy + tj_i Vx Vy + x0 i y0 i (1)

where U= [UxUy]T and V= [VxVy]T are vectors to rep-resent periods(called period vectors); the variablessj_i and tj_i are integers; the coordinatesx0_i, y_i0are those of the ref-erence pixel allocated toM_i. Note that you can select an arbitrary pixel as the reference one from the ones allocated to the same memory module.

Fig. 4(a) shows an example of a periodical memory al-location for the SDFG shown in Fig. 2. The label on a pixel denotes the module number where the pixel is allocated. For example, the pixels: (0, 0),(0, 1),(1, 0) and (1, 1) are allo-cated toM₁,M₂,M₃ andM₄, respectively. From Fig. 4 (b), the coordinates of the pixels allocated toM₁are given by xj₁ y₁j = sj₁ 1 2 + tj₁ 2 0 + 0 0 (2) where the coordinates of the reference pixel and the period vectors are[0 0]T, U = [1 2]T and V=[2 0]T, respectively. Each of the pixels with label 1 in Fig. 4 (b) are given by (sj₁, tj₁)=(0, 0), (1, 0), (2, 0), (0, 1), (1, 1) or (2, 1). Figures 4-(c), 4-(d) and 4-(e) show the pixels allocated toM2,M3

andM4, respectively. These ﬁgures show that the same pe-riod vectors asM1are used forM2,M3andM4. They also show that the coordinates of the reference pixels forM₂, M3andM4are[0 1]T,[1 0]T and[1 1]T, respectively. The

memory allocation shown in Fig. 4 satisﬁes Eq.( (1)), that is, the deﬁnition of a periodical memory allocation.

(3)

y

x

y

x

4 4 4

x

y

x

y

4 4

x

4 4 4

Figure 4. Example of a periodical memory al-location.

As mentioned at the beginning of this section, the advan-tage of the periodical memory allocation is its simple ad-dress function. For the periodical memory allocation shown in Fig. 4(a), the address function that maps the coordinates (x, y) of a pixel to the module number is given by

f(x, y) = (2x + y) mod 4 + 1 (3)

2.4 Optimal memory allocation

Let us minimize the hardware amount when the SDFG is specified. In the logic-in-memory architecture, the hard-ware amount is determined by a memory allocation task since the number of PEs is determined by the number of memory modules. The SDFG imposes the degree of paral-lelism in memory access. For example, in the SDFG shown in Fig. 2, the degree of parallelism in memory access is 4. From these observations, minimizing the hardware amount is formulated as finding a memory allocation that minimizes the number of memory modules under a specified degree of parallelism in memory access. Given a window, the optimal memory allocation is defined as the memory allocation that satisfies the following conditions:

C1: For an arbitrary location of the window, pixels in the

window can be retrieved in parallel. In other words, the

modules.

C2: Each pixel is allocated to a single memory module.

This condition ensures that the total memory capacity is minimized.

C3: The number of memory module is minimized. This

condition ensures that the hardware amount is mini-mized in the logic-in-memory architecture.

C4: The memory allocation is a periodical one. This

condi-tion ensures that the hardware for the address funccondi-tion is small.

For example, Fig. 4(a) is the optimal memory allocation for the window shown in Fig. 1 from the following reasons. The condition C1 is satisfied since pixels in the window are allocated to different pixels for an arbitrary location of the window. The condition C2 is satisfied since each pixel has a single label. The condition C3 is satisfied since the number of memory module is 4 and is equal to the minimum num-ber of memory modules required for parallel access, i.e. the number of pixels in the window. The condition C4 is satis-fied since the memory allocation is periodical as mentioned in section 2.3.

2.5 Overview of search method

In the periodical memory allocation, the period vectors

U and V determine which pixels are allocated to the same

memory module as shown in Fig. 4. The period vectors make a parallelogram. Given the period vectors, the num-ber of memory modules is estimated by the area of the par-allelogram. The areaS of the parallelogram is given by

S = |Ux· Vy− Uy· Vx|. (4)

For example shown in Fig. 4,

S = |1 · 0 − 2 · 2| = 4,

andS is exactly same as the number of memory modules. This is because the memory allocation for a whole image is given by repeating the one for the parallelogram, and the parallelogram must be filled with pixels allocated to differ-ent memory modules. From these observations, finding the optimal memory allocation is reduced to finding period vec-tors that make the minimum parallelogram still satisfying the parallel access condition.

Basically, the search for the optimal period vectors for

U_minand Vminis performed by repeating following steps for U and V within possible coordinates of pixels.

Step 1 Check the parallel access condition, that is, whether

the window includes the pixels allocated to the same memory module.

Step 2 If the parallel access condition is satisﬁed, calculate

(4)

Figure 5. Road extraction.

Figure 6. Window for road extraction . Step 3 If S is smaller than current minimum area S_min,

thenS_min← S, U_min← U, V_min← V.

To reduce the computational amount, we can limit the reference pixel to[0 0]T as shown in Fig. 4(b). Moreover, we can limit the search space using the branch-and-bound method such that

S = |Ux· Vy− Uy· Vx| < Smin,

where S_min is the current minimum area of the parallelo-gram made by the current period vectors.

3 Design examples

3.1 Road extraction VLSI

Let us design a road extraction VLSI for highly-safe hicles. The “road” is deﬁned as a set of regions that a ve-hicle can climb over. We assume that the 3-D information of the land surface is obtained from pre-processing such as stereo vision. Given the 3-D coordinates of the land surface and the size of the wheel of a vehicle, let us examine a 3-D pixel to see if it is a 3-D pixel on the “road”. LetR be the radius of the wheel. As shown in Fig. 5, letA be the contact between the wheel and the land. LetB be an arbitrary 3-D pixel within the radius fromA. The 3-D pixel A is a part of the “road” if the vertical interval|H_b− H_a| is less than acceptable rangeHmax. This condition is given by

−Hmax≤ Hb− Ha≤ Hmax (5) whereH_aandH_bare thez-coordinates of A and B, respec-tively. Considering the traveling direction of the vehicle,

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 32 33 34 35 36 37 38 39 31 40 41 23 24 25 26 27 28 29 30 42 43 44 52 53 54 55 56 57 58 59 51 60 61 45 46 47 48 49 50 62 6364 72 73 74 75 76 77 78 79 10 71 80 81 85 86 65 66 67 68 69 70 828384 92 93 94 95 96 97 98 99 91 F0 F1 F5 F6 F7 F8 87 88 89 90 F2F3F4 F9 G0 15 16 17 18 19 20 21 22 37 38 39 40 41 42 43 44 59 60 61 626364 81 85 86 65 66 828384 F5 F6 F7 F8 87 88 F4 F3 F9 G0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 32 33 34 35 36 31 23 24 25 26 27 28 29 30 52 53 54 55 56 57 58 51 45 46 47 48 49 50 72 73 74 75 76 77 78 79 71 80 67 68 69 70 92 93 94 95 96 97 98 99 91 F0 F1 89 90 F2 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 32 33 34 35 36 37 38 39 31 40 41 29 30 42 43 44 52 53 54 55 56 57 58 59 51 60 61 626364 73 74 75 76 77 78 79 80 81 85 86 65 66 828384 95 96 97 98 99 F0 F1 F5 F6 F7 F8 87 88 F2F3F4 F9 G0 1 2 3 4 5 6 23 24 25 26 27 28 45 46 47 48 49 50 72 71 67 68 69 70 92 93 94 91 89 90 7 8 9 10 11 12 13 14 15 16 17 18 19 20 32 33 34 35 36 37 38 39 31 40 41 29 30 42 52 53 54 55 56 57 58 59 51 60 61 6263 64 73 74 75 76 77 78 79 80 8182838485 86 95 96 97 98 99 F0 F1F2F3F4F5 F6 F7 F8 21 22 43 44 65 66 87 88 F9 G0 1 2 3 4 5 6 23 24 25 26 27 28 45 46 47 48 49 50 72 71 67 68 69 70 92 93 94 91 89 90 Fk:100+k, Gk: 110+k

Figure 7. Optimal allocation for Fig. 6.

Figure 8. Logic-in-memory architecture for road extraction.

Eq. (5) is checked forB within a front half circle centered onA as shown in Fig. 6.The black pixel and gray pixels denoteA and B, respectively.

Let us ﬁnd the optimal memory allocation for the win-dow shown in Fig. 6. Suppose that the area for road ex-traction is 10[m] × 50[m] ahead and that the area is di-vided into a set of squares of 5[cm] × 5[cm]. Then, the 3-D information of the land can be considered as an image of 200 × 1000 pixels. Figure 7 shows the optimal mem-ory allocation. It shows the result for a partial image since the space is limited. The labels Fk andGk mean mem-ory modulesM100+kandM110+k, respectively. The period vectors are U = [14 5]T and V= [6 10]T. The number of memory modules is 110. The search time for the optimal al-location is 40ms on a PC(Pentium4@2GHz, 2G-byte main memory, OS:Window XP). Figures 8 and 9 show the logic-in-memory architecture for road extraction and the block di-agram of the PE, respectively. Note that the interconnection between memory modules and PEs is extended for road ex-traction. The road extraction VLSI is designed in a 0.5μm CMOS technology. Its performance is 1200 times higher than that of Pentium4@2GHz.

Let us compare the proposed method with the conven-tional approach from the point of the number of memory

(5)

Figure 9. Block diagram of the PE.

Figure 10. Rectangular memory allocation.

modules. For the window-type image processing, one ef-ﬁcient memory allocation method is a rectangular memory allocation[3] as shown in Fig. 10. A minimum bounding rectangle approximates a given window(Fig. 10(a)). The optimal memory allocation for the rectangle window is ob-tained by the rectangular memory allocation(Fig. 10(b)). The number of memory modules isK × L for a rectangle window of sizeK × L. The window shown in Fig. 6 is ap-proximated by a rectangle window of size8 × 15. The rect-angular memory allocation requires8×15(= 120) memory modules. As a result, the number of memory modules is re-duced to 91.7%(110/120×100).

3.2 Stereo matching VLSI processor using

multi-resolution images

A coarse-to-ﬁne approach using multi-resolution images is one of important techniques in image processing such as pattern matching. This section describes the optimal mem-ory allocation for stereo matching using multi-resolution images. The method proposed in Section 2 is extended to the case with several different windows in the following.

Stereo matching is one efﬁcient method to obtain 3-D in-formation of real scene. It uses two images taken from two different cameras at the same time. After correspondence between the two images is established, the 3-D information of the scene is computed based on the triangular method. The correspondence search is time-consuming. Figures 11 and 12 show the correspondence search and the DFG, re-spectively. Figure 14 shows the logic-in-memory architec-ture for stereo matching. As a similarity measure between a reference window and a candidate window, a sum of abso-lute differences(SAD) is used. Given a reference window,

VR

VL _{Left image} _{Right image}

Object Candidate windows Reference windows Epipolar line Q L R

Figure 11. Search for a corresponding point.

AD AD ADAD ADAD AD AD ADAD ADAD AD AD ADAD ADAD MVD MVD MVD MVD Q Q ADs Q Q ADs Q Q ADs S1 S2 S3 Sn Sn+1 Time MVD MVD AD AD : Absolute Difference

: Minimum Value Detection

Figure 12. Data ﬂow graph of stereo matching using multi-resolution images.

SADs are computed for all the possible candidate windows. The candidate window with the minimum SAD is deter-mined to be the corresponding pixel.

Multi-resolution images are used to reduce the compu-tational amount. Given sampling periods SR = SRmax, SRmax−1,...,2, 1, reduced images are made by sampling

the original images everySR pixels. Beginning with the lowest-resolution image (SR = SR_max), the resolution is iteratively increased untilSR = 1. The possible location of candidate windows at a higher resolution are limited by using the matching result at a lower resolution. Hence, the computational amount is reduced[2].

Figure 13 shows the windows at multi-resolution images for window sizeQ = 3. As shown in Fig. 13(a), the win-dow of3 ×3 at SR = 1 corresponds to the window of 3 ×3 on the original image. This is because the image atSR = 1 is exactly same as the original image. As shown in Fig. 13(b), the window of3 × 3 at SR = 2 corresponds to the window of5 × 5 on the original image. Note that only 3 × 3 gray pixels must be accessed in parallel. In order to ﬁnd the optimal memory allocation for such different windows, we make a single window by the union of the windows of Q × Q atSR = SRmax, SRmax−1, ..., 2, 1 on the original

image.

Figure 15 shows the resulting optimal memory alloca-tion forQ = 3 and SR_max = 3. The period vectors are

U= [5 0]T and V= [0 5]T. The number of memory mod-ules is 25. Figures 15 (a), (b) and (c) corresponds to SR=1,

(6)

Figure 13. Window.

ADD

M

0

M

1

M

K REG

AD

REG REG

AD

REG REG

AD

REG REG REG

SAD

Mi (i=0,1,..,K):

Memory modules for a candidate image

Figure 14. Logic-in-memory architecture for stereo matching.

2 and 3, respectively. The label(x, y) and k(= 1, 2, · · · 25) on a pixel denote the coordinates of the pixel on the origi-nal image and the memory module number where the pixel is allocated, respectively. You can see that all the pixels in a3 × 3 window of an arbitrary location are distributed be-tween different memory modules at each sampling period. In other words, parallel memory access for a3 × 3 window is achieved at each sampling period. Table 1 shows com-parison between the proposed and the rectangular memory allocation forQ = 4 and SRmax= 8, i.e. a more practical

parameters. The number of memory modules (AD units) are reduced to 19.4%. The search time is 30ms on the PC.

4 Conclusion

This paper presents an interconnection-aware design methodology for image processors. A key to success is a optimal memory allocation for a logic-in-memory architec-ture. The method is also useful for FPGA implementations where interconnection overhead in delay is signiﬁcant large.

(a) Original image(SR=1)

(b) Reduced image (SR=2) (c) Reduced image (SR=3) (0,0) (0,0) (0,0) (0,1) (1,0) (1,1) (2,0) (2,0) (2,1) (3,0) (3,0) (3,1) (4,0) (4,0) (4,1) (5,0) (5,1) (6,0) (6,0) (6,0) (6,1) (7,0) (7,1) (8,0) (8,0) (8,1) (9,0) (9,0) (9,1) 6 7 8 9 10 11 12 12 13 14 14 15 16 17 17 18 19 20 20 1 2 3 4 5 6 6 6 7 8 8 9 9 10 10 11 12 13 14 15 16 16 17 18 18 19 20 20 21 21 22 23 24 24 25 1 2 3 4 5 6 7 7 7 8 9 9 10 10 11 12 13 14 15 16 17 17 18 19 19 20 21 22 22 23 24 25 25 1 1 1 2 3 3 4 4 5 5 6 7 8 9 10 11 11 12 13 13 14 15 15 16 16 17 18 19 19 20 21 21 22 23 23 24 25 25 21 22 22 23 24 24 25 1 2 2 2 3 4 4 5 5 (0,2) (0,2) (1,2) (2,2) (2,2) (3,2) (4,2) (4,2) (5,2) (6,2) (6,2) (7,2) (8,2) (8,2) (9,2) (0,9) (0,9) (1,9) (2,9) (3,9) (3,9) (4,9) (5,9) (6,9) (6,9) (7,9) (8,9) (9,9) (9,9) (0,3) (0,3) (1,3) (2,3) (3,3) (3,3) (4,3) (5,3) (6,3) (6,3) (7,3) (8,3) (9,3) (9,3) (0,4) (0,4) (1,4) (2,4) (2,4) (3,4) (4,4) (4,4) (5,4) (6,4) (6,4) (7,4) (8,4) (8,4) (9,4) (0,5) (1,5) (2,5) (3,5) (4,5) (5,5) (6,5) (7,5) (8,5) (9,5) (0,6) (0,6) (0,6) (1,6) (2,6) (2,6) (3,6) (3,6) (4,6) (4,6) (5,6) (6,6) (6,6) (6,6) (7,6) (8,6) (8,6) (9,6) (9,6) (0,7) (1,7) (2,7) (3,7) (4,7) (5,7) (6,7) (7,7) (8,7) (9,7) (0,8) (0,8) (1,8) (2,8) (2,8) (3,8) (4,8) (4,8) (5,8) (6,8) (6,8) (7,8) (8,8) (8,8) (9,8)

Figure 15. Optimal memory allocation for multi-resolution images.

Table 1. Comparison between the proposed method and the rectangular allocation.

Proposed Rectangular Allocation Number of modules 121 625 Number of AD units 121 625 Number of adders 123 624

Acknowledgment

This work was supported in part by industrial technology research grant program from New Energy and Industrial Technology Development Organization (NEDO) of Japan.

References

[1] D. Gajski, N. Dutt, A. Wu, and S. Lin, “HIGH-LEVEL SYNTHESIS -Introduction to Chip and System De-sign,” Kluwer Academic Publishers, pp.277-278(1992). [2] M. Hariyama, H. Sasaki, and M. Kameyama, ”Archi-tecture of a Stereo Matching VLSI Processor Based on Hierarchically Parallel Memory Access”, IEICE Trans. Info. and Syst., Vol. E88-D, No.7,pp.1486-1491(2005). [3] M. Hariyama, S. Lee, and M. Kameyama, “Highly-Parallel Stereo Vision VLSI Processor Based on an Op-timal Parallel Memory Access Scheme”, IEICE Trans. Electron., Vol.E84-C, No.5, pp.382-389(2003).