Speculative Selection in Adaptive Routing on Interconnection Networks
全文
(2) 148. IPSJ Transactions on Advanced Computing Systems. mediately to reduce the latency. The other status that should be considered is the next node utilization (or the network condition on forward direction of a channel). This status would be useful in case of there are multiple free channels available to forward the message on. The number of messages, which currently requests this channel, also importantly affects the channel profitability. The higher is this number, the lower is channel’s profitability. We propose the speculative selection function that uses latest network information, which is communicated between adjacent nodes, to speculate the highest profitability output channel. Thus, the message can dynamically select the best travel path that contains least busy nodes leading to improve network performance. There are two issues with the speculation. The first is the usefulness of communicated information. Because this status communication between nodes takes time, the current node does not know exactly present status of the next nodes. In addition, the current node cannot preserve next node status until the message reached it along the selected channel. Thus, there is a risk that the channel with calculated highest profitability may not be the best channel any more in current network status. The second issue is the scope of speculation. The calculated channel profitability might not be accurate because we cannot afford to take into account the condition of the whole network. When the miss-speculation occurs, the network takes loss in message latency. The algorithm should be constructed to reduce the message latency in overall, despite of losses in some miss-speculation. To understand those problems by experiments, we constructed a simulator to mimic network and router operation in detail. Then, we compared the network performance with various speculation parameters to design the algorithm. There are many ways to implement speculative selection, but we can classify by: • How deep is the level of speculation, we have the one-hop and multi-hop speculation. The one-hop is simple to implement while the multi-hop allows speculation that is more accurate. • How does the information be updated between adjacent nodes, we have periodically and event updated exchanging. The event updated exchanging provides more accu-. Fig. 1. Aug. 2003. Level of speculation.. rate information, but the cost of exchanging time increases corresponding with the number of incoming message, that will significantly reduce network performance in high load region. In the experiment, we chose to implement the two-hop configuration in periodical update mechanism. 2.2 Level of Speculation The level of speculation defines how far a node is looking forward to do selection as illustrated in Fig. 1. In the one-hop configuration, the current node A is only collecting information from its neighboring nodes B, C, D and E. This configuration is very simple and allows easy implementation. However, its inherent disadvantage is that this configuration has too small range about the around conditions, so the probability of miss-speculation is high, leading to small improvement that do not compensate with the cost of calculation and additional hardware for speculation. An interesting configuration is the two-hop case. Logically, we can see that, the two-hop far away nodes of node A consist of 12 nodes noted from B to M. The difficult problems appear at the nodes in the corners F, G, H, I that each of those nodes can send its information to the node A by two alternative routes. For example, the information of G has to go through B or C to reach A. The hardware implementation in A to decide where to receive G’s information from (B or C) should be quite hard. Luckily, considering a message from A wants to go North-East, the node A considers the channel AB and AC for profitability. Because the roles G’s informa-.
(3) Vol. 44. No. SIG 11(ACS 3). Speculative Selection in Adaptive Routing on Interconnection Networks. 149. tion on channel AB and AC are symmetric, we can eliminate G’s information in the consideration. Similarly, we can also safely eliminate the node F, H, I from this configuration. This configuration then consists of current node A receiving information from 8 neighboring nodes B, C, D, E and J, K, L, M and becomes much easier to implement. Theoretically, the twohop configuration supports much more accurate speculation with its extended looking-forward distance and this was confirmed by the experiments. 3. Network Condition Exchange 3.1 The Local Utilization Counter For a simple implementation, we define the Local Utilization Counter (LUC) in each node which represents the network information that nodes use to do the selection. How and when LUC is calculated in each node could much affect the performance of the algorithm. LUC indicates how much a node is utilized, that shows how many messages are passing through this node at the moment. LUC is a counter that counts up when a message reaches the node and counts down when a message leaves the node. 3.2 Conditions Exchange Mechanism Figure 2 (a) illustrates the structure of network condition exchange in a node with speculative routing in one-hop scheme for mesh network. Each router has a LUC and four Network Condition Buffers (NCBs) to store adjacent node’s LUCs. The way that a router transmits LUC is similar to the transmission of message flit. In Fig. 2 (b), supposed that the physical link previously has two virtual channels (VC1 and VC2), the exchange hardware uses a third virtual channel to send LUC from a node to the adjacent. The operation of this third virtual channel is different from other normal virtual channels that this virtual channel is periodically or evently used only when router needs to transmit LUC to other node’s NCBs. In the configuration, this third virtual channel holds the physical channel until all values (local LUC and other’s node LUC) are completely transmitted in sequence. Additionally, we do not need a header in this transmission because of the length and order of data is fixed, combining with the fact that those information only send from a node to the adjacent nodes. By using this mechanism, the exchange process does not. Fig. 2. Network condition exchange.. disrupt the normal message flit transmission in the case of wormhole routing. Routers in interconnection network update these buffer values periodically or evently depending on the update scheme. In the periodical update scheme, the router temporary hangs all message-flit transmissions in all physical channels for several cycles to broadcast its own LUC to all adjacent routers. Simultaneously, when a router receives update-request signal from an adjacent node, it stores that value to its own appropriate NCB. This process is repeated every 100 clock cycles in simulation version of the algorithm. In event update scheme, the router sends a update request signal to all adjacent nodes just when it do the selection function for an incoming message. The adjacent nodes, then hang the message transmission to the current node (if exist) to transmit their own LUC to the current node. The event update scheme seems to save some network cycles in very low load (small number of messsage). It also has the newest network conditions to do the selection. However, when the load increases, this scheme consumes network cycle correspondingly that would substantially degrade network performance in the saturated region. In addition, the event update.
(4) 150. IPSJ Transactions on Advanced Computing Systems. scheme is very difficult to implement in multihop configuration. On the contrary, the periodic update scheme consumes a predefined amount of network cycle. The periodic update scheme also supports fast selection computation because the selection function does not have to wait for information to be updated. In addition, periodic update scheme is easier to implement in multihop configuration. With those apparent advantages, the periodic update scheme was chosen to implement. Simulation results and many of other studies show that the network is saturated highest at about 70% of link utilization. This means in every 100 cycles, the physical links are idle (do not physically transmit message flits or messages are blocked waiting for free buffers) in average 30 cycles. Using those redundant cycles to exchange network condition in a special virtual channel, we could implement the periodic update scheme that costs less than what we gain in balanced network traffic with speculation. 4. One-hop and Multi-hop Exchange The level of speculation in this algorithm is defined by the number of hops from a node to the farthest node from which it receives network information. With one-hop configuration, the node1 passes 1-hop information to node2 and in return, node2 passes 1-hop information to node1 as shown in Fig. 3. In the two-hop example, in each router, the NCBs consist of two groups: NCB 1-hop and NCB 2-hop to store LUCs from the corresponding 1-hop and 2-hop far-ahead routers. In the Tn transaction, LUC’s value of node1 is transmitted to node2’s NCB in 1-hop group. In the Tn+1 transaction, LUC’s value of node1 at the Tn time slice, is transmitted from NCBs of node2 to NCB 2-hop of node3. Simultaneously, the NCB 1-hop in the node2 is updated with new the value of node1’s value at the Tn+1 time slice. The process is iterated to keep the information up-to-date. The one-hop speculation allows us to construct a quite simple hardware with minimum cost in computing cycles but provides little information about the next nodes. Meanwhile, the multi-hop speculation will provide more information but requires complex hardware for the selection function. Extensive hardware re-. Fig. 3. Aug. 2003. Two-hop information exchanges.. source would be required if we increase the level of speculation to three-hop and above. 5. The Speculative Selection Algorithm The speculative selection function is divided into selection function for the one-hop and twohop functions. In order to calculate the final output, we have to define the set of input status: • The link’s (virtual channel) availability F= free, busy. • The link’s dimension D: D = 1 if the message came from the same link’s dimension, otherwise D = 0. • The link’s current number of requests R. • The next node LUCs belong to the link values N1 and N2, which are corresponding to one-hop-far and two-hop-far values. The selection function operation tries to select an output in the following steps: ( 1 ) Calculate each channel’s evaluation value E = D + R + N1 + kN2 where k = 0 in one-hop function and k = 0.5 in two-hop function. ( 2 ) Choose a free channel with minimum E value. ( 3 ) If there is no free channel, choose the channel with minimum E value. 6. Implementation 6.1 Router Structure Figure 4 shows a simple router structure implementing Speculative Selection Routing (SSR) for the mesh network. The input and.
(5) Vol. 44. No. SIG 11(ACS 3). Speculative Selection in Adaptive Routing on Interconnection Networks Table 1. 151. Simulation parameters.. Simulation Time Network Network Size Number of Virtual Channels Packet Length Routing Method Flit Transfer Time. 100,000 clock cycles Mesh network 16 x 16 (256 nodes) 2 20 flits (fixed) Wormhole 3 clock cycles. have virtual channel. The messages that need to go further West can not use the virtual channel set I of the North-South. The Duato’s protocol is simpler by using subadaptive network concept. The mesh network is divided into two virtual channels, named a and b. While the a virtual network is adaptive, the b virtual network is restricted to dimension order. When restricted to minimal path, the function is also deadlock free with wormhole routing. Because it uses only sub-adaptive network, performance would be inferior compared with fully adaptive algorithms. 7. Simulation Fig. 4. Router structure.. output circuits control virtual channels to adjacent routers. These virtual channels and the network condition buffers share the same physical channel to receive its own information in time-sharing manner. The routing function circuit decodes messages header and gives the output to selection function circuit. Based on local information and network conditions, the selection function actually commands the Crossbar to connect an input to appropriate output or queues that commands until the requiring output is available. LUC repeatedly broadcasts its value to adjacent routers. In those cycles, additional exchange hardware controls all physical output channels and normal message passing is temporary hanged. After all, exchange hardware frees physical channel to normal operation. 6.2 Routing Functions To evaluate SSR, we implemented SSR with two routing functions in mesh network: The Optimal Fully Adaptive Minimal Wormhole Routing for Mesh (Opt-y) 3) and Duato’s Protocol for Mesh 6) . Opt-y offers fully adaptive, deadlock free and requires only the minimum number of virtual channels. Opty-y algorithm divides the NorthSouth channels into two virtual channel sets named I and II, while the West-East does not. 7.1 Simulation Conditions A simulator has been written in C and be able to simulate SSR in various conditions. In this experiment, 3 traffic patterns have been used: • Uniform random traffic selects random destinations that make the traffic uniformly distributed over the network. • Hotspot traffic is similar to uniform random except that it creates 10 random hotspot destinations, which receive four times as much the number of messages as other nodes. • Matrix transpose traffic sends messages from a source node at (x, y) to a destination at (y, x). Table 1 defines simulation parameters. The period of first 10,000 cycles of the simulation is ignored because of unstable conditions in network initialization. Then we measure the network throughput and latency versus network load which is defined as follows: • Load presents how many messages are injected to a node in a period of time. • Latency shows the average required time for an injected message to traverse from a source node to a destination in the network. This study does not include the message’s queuing time waiting to be injected at the source node. • Throughput shows how much the network link is utilized (or actively transmit data).
(6) 152. IPSJ Transactions on Advanced Computing Systems. Fig. 5. Throughput of SSR and oblivious selections with Opt-y in uniform random traffic.. Fig. 6. Latency of SSR and oblivious selections with Opt-y in uniform random traffic.. at loads. 7.2 Simulation Results Figures 5–8 evaluate the performance of SSR compared with oblivious selection using Opt-y routing function in uniform random and matrix transpose traffics. The results show better performance for SSR in both traffics. In Figs. 5 and 6, SSR archives 5% higher throughput and average 7% lower latency with the uniform random. In Figs. 7 and 8, SSR gains about 10% of throughput and decreases latency by 11% in the matrix transpose. In both traffics, SSR has the same throughput with oblivious selection in the load from 0.1 to 0.3 (Figs. 5 and 7). As the network enters saturation from load 0.4, SSR becomes more effective. Meanwhile, the SSR latency gets smaller from load 0.3 in both traffics (Figs. 6 and 8) compared to oblivious selection. In Figs. 9 and 10, we compare the effectiveness of SSR with two different routing functions. Aug. 2003. Fig. 7. Throughput of SSR and oblivious selections with Opt-y in matrix transpose traffic.. Fig. 8. Latency of SSR and oblivious selections with Opt-y in matrix transpose traffic.. of Opt-y and Duato’s Protocol in hotspot traffic. Figure 9 shows that, SSR achieves overall better performance with both routing functions. SSR achieves around 10% better performance for Opt-y and 5% for Duato’s Protocol from 0.4 load and continues into the saturation. Similarly, Fig. 10 shows smaller latency in favor for SSR. The latencies of oblivious selection and SSR are the same in 0.2 load. However, from 0.3 to 0.7 load, SSR retains around 12% lower latency for Opt-y and 4% for Duato’s Protocol. With the Opt-y routing function, latency differences become smaller as the network enters saturation at 0.5 load, and continue to reduce as load increases. Meanwhile, with the Duato’s Protocol, SSR retains lower latency in high load. Also in Figs. 9 and 10, the effect of speculation is higher in Opt-y than in Duato’s Protocol. Figures 11 and 12 compare throughput and latency of SSR with Least Recently Used.
(7) Vol. 44. Fig. 9. No. SIG 11(ACS 3). Speculative Selection in Adaptive Routing on Interconnection Networks. Fig. 11 Throughput of SSR with Opt-y and Duato’s Protocol routing functions in hotspot traffic.. Fig. 12 Fig. 10. Latency of SSR with Opt-y and Duato’s Protocol routing functions in hotspot traffic.. (LRU) selection algorithm 8) in matrix transpose traffic. In the simulation, LRU behaves much the same with SSR that it gains throughput from load 0.4 and reduces latency from load 0.3 compared to oblivious selection. SSR has archived about 5% higher throughput than LRU as shown in Fig. 11. In the latency aspect, SSR has smaller latency than LRU in the saturation area from load 0.5 to 0.7. In the highly saturated region of 0.7 load, the LRU’s latency reaches the level of oblivious selection, while SSR still remains lower latency. Figures 13 and 14 illustrate the visualization of node utilization for Opt-y without and with speculative selection in hotspot traffic. With the oblivious straight selection function in Fig. 13 (b), Opt-y behavior shows strong hotspot peaks (3D view), while the speculative selection has reduced those peaks as shown in Fig. 14 (b). Figure 14 (b) also shows that while the hotspot peak levels are reduced, the. 153. Throughput of SSR and LRU with Opt-y in matrix transpose traffic.. Latency of SSR and LRU with Opt-y in matrix transpose traffic.. nodes around the hotspots show higher utilization and smoother transition. From the topview of those figures, SSR in Fig. 14 (a) shows a larger area of average utilization compared with Fig. 13 (a). This means the speculation spreads traffic to outer areas of the network causing more balanced traffic than the oblivious straight one. 8. Related Work and Discussion 8.1 Selection Flexibility and Performance Selection flexibility is defined by how flexible is the selection function in choosing an output channel. Dally, et al. showed that minimum congestion and straight-line selection give good performance while the maximum-flexibility has much higher latency and saturates at lower traffic level 1) . In addition, Upadhyay, et al. discussed that the turn biased and multiplex-turn biased selection policies perform better than the random selection policy 5) . These researches.
(8) 154. Fig. 13. IPSJ Transactions on Advanced Computing Systems. Node utilization for Opt-y with oblivious selection in hotspot traffic.. show that, with maximum flexibility and random selection, too much of flexibility in selection function degrades the performance. With the speculation, we are trying to achieve a general selection function that fit to any routing function by speculating network conditions. The speculation gives messages high flexibility in selection with the compromise of some more delay in information exchange process, and results in more complex hardware. The speculative decision adapts with the network conditions dynamically. With current experimental results, SSR shows overall better performance compared with the oblivious selection function of Opt-y and Duato’s Protocol in various traffic patterns. The throughput remains equal in the region of before saturation but stays higher in the saturation region. This is due to the fact that network is more balanced in spite of concentrating in the central region of mesh and messages can avoid hotspots in travel (Fig. 14). This means the speculation makes the network balance both in local and global scope. This network balancing also presents in. Fig. 14. Aug. 2003. Node utilization for Opt-y with SSR selection in hotspot traffic.. the smaller latencies of SSR. Thus, in overall, the network’s links are more efficiently used. The selection flexibility is smartly utilized to boost the performance. 8.2 Universality Schwiebert evaluated various routing functions with two selection functions: zigzag and no-turn 7) . They concluded that selection functions can have a substantial impact on the average message latency and saturation behavior of an adaptive routing algorithm, and the best selection function depends on the routing algorithm. Koibuchi, et al. proposed a selection function called Minimal Multiplex and Least Recently Used 9) that demonstrated better performance for torus network with *channel algorithm. These researches proved that improving selection functions leads to increasing overall performance, but those selection functions have to stick to specific routing functions to get the desired performance. Although SSR improves performance in both routing functions, SSR with Opt-y achieves almost 10% gain while as little as 5% when using.
(9) Vol. 44. No. SIG 11(ACS 3). Speculative Selection in Adaptive Routing on Interconnection Networks. with Duato’s Protocol. This should be due to the adaptivity of the routing functions. Opty provides maximum adaptivity compared with sub-network adaptivity in Duato’s Protocol. So that, SSR uses the channels more effectively (with more channels to compare and select) in Opt-y. In addition, SSR is effective with various traffic patterns. With uniform random traffic, SSR achieves a small 5% gain in throughput due to the non-hotspot nature of this traffic compared to oblivious selection. The gain of SSR in this traffic mainly stays in traffic balancing in global scope. With other traffics of hotspot and matrix transpose, which create highly concentrated traffic regions, SSR gets nearly 10% increase in performance by balancing traffic in both global and local scope. Thus, we can consider SSR is a universal selection function, which does not depend on routing functions and traffics to achieve better performance. 8.3 Trade-off In this experiment, SSR has been compared with LRU selection function to illustrate two methods of constructing selection function. SSR tries to utilize network information while LRU smartly uses local information. Both methods show gain in performance compared to oblivious selection. SSR shows about 5% better performance than LRU in our simulation. The trade-off between SSR and LRU represents in a more complex hardware in selection function and information exchange logic of SSR compared to very simple hardware design in LRU. However, there is a possibility that we could combine SSR with other selection functions (such as LRU) to create a selection function that smartly utilize both local and global network information. The SSR’s selection function hardware, if need to be performed in 1 cycle, consists of hardwired adders, shifters, and special buffer construction that allows multiple simultaneous data accesses. Moreover, this complexity increases as the level of speculation increases. When we develop the level of speculation to 3, the number of neighboring nodes increases to 20 nodes that, obviously, this configuration would be extremely difficult to implement in both hardware and algorithm. With recent swift advance in semicocnductor technology, the complexity level of router chip would not be problematic in operating fre-. 155. quency and total cost. The problem stays in the routing algorithms that do not fully utilize the ability of physical network. SSR uses the physical network more efficiently despite of high complexity compared to other selection functions. 9. Conclusion This paper presented a new approach to construct universal selection functions with speculation for multicomputer direct interconnection networks. The results showed overall better performance compared with oblivious selection function when using with the Opt-y and Duato’s Protocol routing functions. SSR improves the throughput in the saturated region as much as 10% with Opt-y and about 5% with Duato’s Protocol. In addition, SSR reduces the network latency around 12% and 4% with Opt-y and Duato’s Protocol, respectively. With SSR, the traffic is more balanced all over the mesh network. In local scope, the hotspot is eliminated, while in global scope, traffic is spread out to the edge of the network. The effect of SSR does not depend on routing function. However, SSR achieves better results with the flexible Opt-y than with Duato’s Protocol. Experiments with other network topologies such as torus and k-ary n-cube will be done in the future. SSR’s mechanism of information exchange will be improved to allow the network condition to be more accurate, such as using special communication links. Moreover, the prospect of combining SSR with other selection such as LRU seems to be feasible to create a better selection function that fully uses both the local and global information of the network. References 1) Dally, W.J. and Aoki, H.: Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels, IEEE Trans. Parallel and Distributed Systems, Vol.4, No.4, pp.466– 475 (1993). 2) Duato, J.: A New Theory of Deadlock-Free Adaptive Routing in Wormhole Networks, IEEE Trans. Parallel and Distributed Systems, Vol.4, No.12, pp.1320–1331 (1993). 3) Schwiebert, L. and Jayasimha, D.N.: Optimal Fully Adaptive Minimal Wormhole Routing for Meshes, Journal of Parallel and Distributed Computing, Vol.27, No.1, pp.56–70 (1995). 4) Dally, W.J. and Seitz, C.L.: Deadlock Free Message Routing in Multiprocessor Interconnection networks, IEEE Trans. Computer,.
(10) 156. IPSJ Transactions on Advanced Computing Systems. Vol.36, No.5, pp.547–553 (1997). 5) Upadhyay, J., Varavithya, V. and Mohapatra, P.: A Traffic-Balanced Adaptive Wormhole Routing Scheme for Two-Dimensional Meshes, IEEE Trans. Computer, Vol.46, No.2, pp.190– 197 (1997). 6) Duato, J., Yalamanchili, S. and Ni, L.: Interconnection Network: an Engineering Approach, IEEE Press, ISBN 0-8186-7800-3 (1997). 7) Schwiebert, L.: A Performance Evaluation of Fully Adaptive Wormhole Routing including Selection Function Choice, IEEE International Performance, Computing, and Communications Conference, pp.117–123 (2000). 8) Koibuchi, M., Funahashi, A., Jouraku, A. and Amano, H.: The Impact of Output Selection Function on Adaptive Routing, IPSJ Journal, Vol.42, No.4, pp.704–713 (2001). 9) Koibuchi, M., Jouraku, A. and Funahashi, A.: An Output Selection Function on Adaptive Routing, JSPP2001, IPSJ Symposium Series, Vol.2001, No.6, pp.255–262 (2001). 10) So, T.C., Oyanagi, S. and Yamazaki, K.: Speculative Selection in Adaptive Routing on Interconnection Networks, Symposium on Advanced Computing Systems and Infrastructures (SACSIS2003 ), IPSJ Symposium Series, Vol.2003, No.8, pp.29–36 (2003).. (Received January 30, 2003) (Accepted May 14, 2003). Aug. 2003. Tran Cong So is a research student at the Department of Computer Science, Faculty of Science and Engineering, Ritsumeikan University in Japan. His research interests include computer architecture, adaptive routing, and hardware/software codesign. He received an MS in computer science from Ritsumeikan University in 1999. Shigeru Oyanagi is a professor at the Department of Computer Science, Faculty of Science and Engineering, Ritsumeikan University in Japan. His interests include parallel processing, computer architecutre, database, and data mining. He graduated Kyoto University in 1972, and received a doctor of engineering from Kyoto University in 1979. He is a member of IEICE, IPSJ, ACM and IEEE. Katsuhiro Yamazaki is a professor at the Department of Computer Science, Faculty of Science and Engineering, Ritsumeikan University in Japan. His research interests include parallel computing, hardware/ software codesign, and multimedia streaming systems. He received a Ph.D. in computer science from Kyoto University in 1986. He is a member of IEICE, IPSJ, ACM and IEEE..
(11)
図
関連したドキュメント
Because of the knowledge, experience, and background of each expert are different and vague, different types of 2-tuple linguistic variable are suitable used to express experts’
A Tabu search procedure is then used to select a subset of financial ratio variables which best predict bankruptcy from among a larger initial set of 20 variables, and use that
Evolution by natural selection results in changes in the density of phenotypes in the adaptive space.. An adaptive trait is a set of values (say height, weight) that a
13 proposed a hybrid multiobjective evolutionary algorithm HMOEA that incorporates various heuristics for local exploitation in the evolutionary search and the concept of
Yin, “Markowitz’s mean-variance portfolio selection with regime switching: a continuous-time model,” SIAM Journal on Control and Optimization, vol... Li,
VRP is an NP-hard problem [7]; heuristics and evolu- tionary algorithms are used to solve VRP. In this paper, mutation ant colony algorithm is used to solve MRVRP with
In this paper we develop the semifilter approach to the classical Menger and Hurewicz properties and show that the small cardinal g is a lower bound of the additivity number of
We present sufficient conditions for the existence of solutions to Neu- mann and periodic boundary-value problems for some class of quasilinear ordinary differential equations.. We