A Split-Central-Buffered Load-Balancing Clos-Network Switch with In-Order Forwarding

Oladele Theophilus Sule; Roberto Rojas-Cessa; Ziqian Dong; Chuan-Bi Lin

arXiv:1812.11650·cs.NI·August 29, 2019

TL;DR

This paper introduces a load-balancing Clos-network switch with split central buffers that ensures high throughput, in-sequence forwarding, and low complexity without requiring memory speedup or central-stage expansion.

Contribution

It proposes a novel configuration scheme for a load-balancing Clos-network switch with split central modules, achieving 100% throughput and in-sequence forwarding with low complexity.

Findings

01 Achieves 100% throughput under uniform and nonuniform admissible i.i.d. traffic.

02 Ensures in-sequence forwarding without memory speedup.

03 Demonstrates high performance through simulation studies.

Abstract

We propose a configuration scheme for a load-balancing Clos-network packet switch that has split central modules and buffers in between the split modules. Our split-central-buffered Load-Balancing Clos-network (LBC) switch is cell based. The switch has four stages, namely input, central-input, central-output, and output stages. The proposed configuration scheme uses a pre-determined and periodic interconnection pattern in the input and split central modules to load-balance and route traffic. The LBC switch has low configuration complexity. The operation of the switch includes a mechanism applied at input and split-central modules to forward cells in sequence. The switch achieves 100% throughput under uniform and nonuniform admissible traffic with independent and identical distributions (i.i.d.). These high switching performance and low complexity are achieved while performing…

Tables

TABLE I: Notations used in the description of the LBC switch.

Term	Description
$N$	Number of input/output ports.
$n$	Number of input/output ports for each IM and OM.
$m$	Number of CIMs and COMs.
$k$	Number of IMs and OMs, where $k = \frac{N}{n}$ .
$I P (i, s)$	Input port $s$ of $I M (i)$ , where $0 \leq i \leq k - 1, 0 \leq s \leq n - 1$ .
$I M (i)$	Input module $i$ .
$O M (j)$	Output module $j$ , where $0 \leq j \leq k - 1$ .
$C I M (r)$	Central Input Module $r$ , where $0 \leq r \leq m - 1$ .
$C O M (r)$	Central Output Module $r$ .
$V O Q (i, s, j, d)$	VOQ at $I P (i, s)$ that stores cells destined to $O P (j, d)$ , where $0 \leq d \leq n - 1$ .
$L_{I M} (i, r)$	Output link of $I M (i)$ connected to $C I M (r)$ .
$L_{C I M} (r, p)$	Output port $p$ of $C I M (r)$ , where $0 \leq p \leq k - 1$ .
$I_{C} (r, p)$	Input port $p$ of $C O M (r)$ .
$L_{C O M} (r, j)$	Output link of $C O M (r)$ connected to $O M (j)$ .
$V O M Q (r, p, j)$	VOMQ at output of CIMs that stores cells destined to $O M (j)$ .
$V O P Q (r, p, j, d)$	VOPQ at output of CIMs that stores cells destined to $O P (j, d)$ .
$C B (r, j, d)$	Crosspoint buffer at $O M (j)$ that stores cells going through $C O M (r)$ and destined to $O P (j, d)$ .
$O P (j, d)$	Output port $d$ at $O M (j)$ .

TABLE II: Example of configuration of modules in a $9\times 9$ LBC switch.

Configuration
Time slot	$I M (0)$	$C I M (0)$	$C O M (0)$
$t = 0$	$I P (0, 0) \to L_{I M} (0, 0)$	$L_{I M} (0, 0) \to L_{C I M} (0, 0)$	$I_{c} (0, 0) \to L_{C O M} (0, 0)$
	$I P (0, 1) \to L_{I M} (0, 1)$	$L_{I M} (1, 0) \to L_{C I M} (0, 1)$	$I_{c} (0, 1) \to L_{C O M} (0, 1)$
	$I P (0, 2) \to L_{I M} (0, 2)$	$L_{I M} (2, 0) \to L_{C I M} (0, 2)$	$I_{c} (0, 2) \to L_{C O M} (0, 2)$
$t = 1$	$I P (0, 0) \to L_{I M} (0, 1)$	$L_{I M} (0, 0) \to L_{C I M} (0, 1)$	$I_{c} (0, 0) \to L_{C O M} (0, 2)$
	$I P (0, 1) \to L_{I M} (0, 2)$	$L_{I M} (1, 0) \to L_{C I M} (0, 2)$	$I_{c} (0, 1) \to L_{C O M} (0, 0)$
	$I P (0, 2) \to L_{I M} (0, 0)$	$L_{I M} (2, 0) \to L_{C I M} (0, 0)$	$I_{c} (0, 2) \to L_{C O M} (0, 1)$
$t = 2$	$I P (0, 0) \to L_{I M} (0, 2)$	$L_{I M} (0, 0) \to L_{C I M} (0, 2)$	$I_{c} (0, 0) \to L_{C O M} (0, 1)$
	$I P (0, 1) \to L_{I M} (0, 0)$	$L_{I M} (1, 0) \to L_{C I M} (0, 0)$	$I_{c} (0, 1) \to L_{C O M} (0, 2)$
	$I P (0, 2) \to L_{I M} (0, 1)$	$L_{I M} (2, 0) \to L_{C I M} (0, 1)$	$I_{c} (0, 2) \to L_{C O M} (0, 0)$

TABLE III: Notations for in-sequence analysis.

$c_{y, τ}$	The $τ$th cell of flow $y$ from $I P (i, s)$ to $O P (j, d)$ .
$t_{a_{y, τ}}$	Arrival time of $c_{y, τ}$ in $V O Q (i, s, j, d)$ at $I P (i, s)$ .
$q_{1_{y, τ}}$	Queuing delay of $c_{y, τ}$ at $V O Q (i, s, j, d)$ .
$d_{1_{y, τ}}$	Departure time of $c_{y, τ}$ from $V O Q (i, s, j, d)$ at $I P (i, s)$ .
$q_{2_{y, τ}}$	Queuing delay of $c_{y, τ}$ at $V O M Q (r, p, j)$ .
$d_{2_{y, τ}}$	Departure time of $c_{y, τ}$ from $V O M Q (r, p, j)$ at $L_{C O M} (r, j)$ .
$q_{3_{y, τ}}$	Queuing delay of $c_{y, τ}$ at $C B (r, j, d)$ of $O P (j, d)$ .
$d_{3_{y, τ}}$	Departure time of $c_{y, τ}$ from $C B (r, j, d)$ .

TABLE IV: Example of back-to-back arrivals of one burst of $k$ flows.

Cell arrival time
$t_{x}$	$t_{x + 1}$	$t_{x + 2}$	$t_{x + 3}$	$t_{x + 4}$
$c_{1, 1}$	$c_{1, 2}$	$c_{1, 3}$
	$c_{2, 1}$	$c_{2, 2}$	$c_{2, 3}$
		$c_{3, 1}$	$c_{3, 2}$	$c_{3, 3}$

TABLE V: Time slots in which cells arrive to VOMQs of a single $k$-cell burst.

Time slots cells arrive at the VOMQs
$t_{x}$	$t_{x + 1}$	$t_{x + 2}$	$t_{x + 3}$	$t_{x + 4}$	$t_{x + 5}$	$t_{x + 6}$	$t_{x + 7}$	$t_{x + 8}$	$t_{x + 9}$	$t_{x + 10}$	$t_{x + 11}$
	$c_{1, 1}$	$c_{1, 2}$	$c_{1, 3}$
		$c_{2, 1}$				$c_{2, 2}$	$c_{2, 3}$
			$c_{3, 1}$							$c_{3, 2}$	$c_{3, 3}$

TABLE VI: Time slots when cells depart VOMQs in example of the in-sequence forwarding mechanism.

Cell departure time slots from VOMQs
$t_{x}$	$t_{x + 1}$	$t_{x + 2}$	$t_{x + 3}$	$t_{x + 4}$	$t_{x + 5}$	$t_{x + 6}$	$t_{x + 7}$	$t_{x + 8}$	$t_{x + 9}$	$t_{x + 10}$	$t_{x + 11}$	$t_{x + 12}$
Cells: $c_{1, 1}$, $c_{1, 2}$, $c_{1, 3}$, $c_{2, 1}$, $c_{2, 2}$, $c_{2, 3}$, $c_{3, 1}$, $c_{3, 2}$, $c_{3, 3}$

Equations

$$r = (s + t) \bmod m$$

$$p = (i + t) \bmod k$$

$$j = (p - t) \bmod k$$

$$R_{LCIM} = \frac{1}{m} \sum_{i=0}^{k} \lambda_{i,s,j,d}$$

$$R_{CB} = \frac{1}{mk} \sum^{k} \sum_{i=0}^{k} \lambda_{i,s,j,d}$$

$$\lambda_{i,s,j,d} = \frac{1}{N}$$

$$R_{CB} = \frac{1}{k^{2}} \sum^{k} \sum_{i=0}^{k} \frac{1}{N} = \frac{1}{N} = \frac{1}{k^{2}}$$

$$\frac{1}{k} \leq S_{CB} \leq 1$$

$$S_{CB} = \frac{1}{k}$$

$$\lambda_{i,s,j,d} = \frac{1}{k}$$

$$R_{CB} = \frac{1}{m} \frac{1}{k} \sum^{k} \sum_{i=0}^{k} \frac{1}{k} = \frac{1}{k}$$

$$\lambda_{i,s,j,d} = 1$$

$$R_{LCIM} = \frac{1}{m} \lambda_{i,s,j,d} = \frac{1}{m}$$

$$R_{CB} = \frac{1}{m} \frac{1}{k} \sum^{k} \lambda_{i,s,j,d} = \frac{1}{m} = \frac{1}{k}$$

$$\mathbf{R_{1}} = [\lambda_{u,v}]$$

$$u = ik + s$$

$$v = jm + d$$

$$\sum_{u=0}^{N-1} \lambda_{u,v} \leq 1, \quad \sum_{v=0}^{N-1} \lambda_{u,v} \leq 1$$

$$\pi_{u,\upsilon} = \begin{cases} 1 & \text{for any } u,\ \upsilon = rk + p \\ 0 & \text{elsewhere.} \end{cases}$$

$$\mathbf{P_{1}} = \sum^{k} \mathbf{\Pi}(t)$$

$$\mathbf{R_{2}} = \frac{1}{k} ((\mathbf{R_{1}} * 1) \circ \mathbf{P_{1}})$$

$$\mathbf{R_{2}} = \sum_{j=0}^{j=k-1} \mathbf{R_{2}}(j)$$

$$\phi_{u,v} = \begin{cases} 1 & \text{for any } u,\ v = jk + r \\ 0 & \text{elsewhere.} \end{cases}$$

$$\mathbf{P_{2}} = \sum^{k} \mathbf{\Phi}(t)$$

$$\mathbf{R_{3}}(j) = \mathbf{R_{2}}(j) \circ \mathbf{P_{2}}$$

$$\mathbf{R_{3}}(j) = \sum_{d=0}^{d=k-1} \mathbf{R_{3}}(j,d)$$

$$D_{s} = [1, \dots, 1]$$

$$A = [1 \dots 0]$$

$$A_{s} = [A_{s_{1}}, \dots, A_{s_{k}}]^{T}$$

$$A_{s} = [A, \dots, A]^{T}$$

Full text

Oladele Theophilus Sule, Roberto Rojas-Cessa, Ziqian Dong, Chuan-Bi Lin

This paper is an extended version of that published in IEEE Trans. on Networking. *(Corresponding author: Oladele Theophilus Sule)* O. T. Sule and R. Rojas-Cessa are with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102. Email: {ots5, rojas}@njit.edu. Z. Dong is with the Department of Electrical and Computer Engineering, New York Institute of Technology, New York, NY 10023. C. Lin is with the Department of Information and Communication Engineering, Chaoyang University of Technology, Wufeng District, Taichung, 41349, Taiwan. This work was partially supported by the National Science Foundation (NSF) under Grant No. (CNS) 1641033.

(January-21-2018)

Abstract

We propose a configuration scheme for a load-balancing Clos-network packet switch that has split central modules and buffers in between the split modules. Our split-central-buffered Load-Balancing Clos-network (LBC) switch is cell based. The switch has four stages, namely input, central-input, central-output, and output stages. The proposed configuration scheme uses a pre-determined and periodic interconnection pattern in the input and split central modules to load-balance and route traffic. The LBC switch has low configuration complexity. The operation of the switch includes a mechanism applied at input and split-central modules to forward cells in sequence. The switch achieves 100% throughput under uniform and nonuniform admissible traffic with independent and identical distributions (i.i.d.). These high switching performance and low complexity are achieved while performing in-sequence forwarding and without resorting to memory speedup or central-stage expansion. Our discussion includes throughput analysis, where we describe the operations that the configuration mechanism performs on the traffic traversing the switch, and proof of in-sequence forwarding. A simulation study is presented as a practical demonstration of the switch performance on uniform and nonuniform i.i.d. traffic.

Index Terms:

Clos-network switch, load-balancing switch, in-order forwarding, high performance switching, packet scheduling, packet switching.

I Introduction

Clos-network switches are attractive for building large-size switches [1]. These switches mostly employ three stages, where each stage uses switch modules as building blocks. Each module is a small- or medium-size switch. Modules of the first, second, and third stages are often called input, central, and output modules, and they are denoted as IM, CM, and OM, respectively. Overall, Clos-network switches require fewer crosspoint elements, each of which is the atomic switching unit of a packet switch, than a single-stage switch of equivalent size, and thus they may require less hardware. This trait of a Clos network often comes at the cost of increased configuration complexity. The term configuration here means the local interconnection between inputs and outputs of a module. In general, a Clos-network switch requires the configuration of the modules in every stage before packets are forwarded through. Moreover, owing to the multi-stage architecture of such a switch, the time for switch reconfiguration increases as the number of stages holding dependences increases. In a multi-stage switch, there is a dependence when the configuration of a module is affected by the configuration of another. The required configuration time dictates the internal data transmission time, which in turn defines the minimum size of the internal data unit. For example, switches that require a long configuration time may need to use a long internal segment and time to transmit data, while switches with fast configuration times may use a smaller segment size. Therefore, the configuration time of a switch must be kept as short as possible for fast and efficient reconfiguration [2].

In the remainder of this paper, we consider the proposed packet switch to be cell based; that is, upon arrival at an input port of a switch, packets of variable size are segmented into fixed-size cells. Cells are forwarded through the switch to their destination outputs. Packets are re-assembled at the outputs of the switch. The selection of the cell length is left for the implementation of the LBC switch. However, as in any other switch, the cell length is decided by the time required to reconfigure IMs and CIMs and memory speed (of central queues or CBs). Cell length may be selected such that cell transmission time is equal to or greater than the largest of the switch configuration or memory response times. Additionally, the cell length can be increased if the average Internet packet is longer than the configuration time to reduce segmentation/reassembly processing [2].

Based on the design of its switching modules, each stage of a Clos-network switch can be categorized as either space-based (S) or memory-based (M), where space switching modules are bufferless while memory switching modules are buffered. Space switching refers to the use of a level of parallelism where multiple cells can be switched at the same time slot by using multiple connections. Memory switching refers to the use of memory to store cells when they cannot be forwarded to the outputs (or next stage). Some of these categories are SSS (or S3) [3, 4], MSM [5, 6, 7, 8], MMM [9, 10, 11, 12], SMM [13], and SSM [14, 15], among the most popular ones. Out of those, S3 switches require small amounts of hardware, but their configuration has proven challenging because the input-to-output path setup must be resolved before cells are transmitted. On the other hand, inclusion of memory in modules may relax the configuration complexity. However, configuration complexity has remained high despite using memory in every switch module because of internal blocking and the multiplicity of input-output paths associated with diverse queuing delays [9, 16]. Specifically, switches with buffered central or output stages are prone to forwarding packets out of sequence, making re-sequencing or in-sequence transmission mechanisms an added feature. Moreover, the number and size of queues in a module are restricted by the available on-chip real estate. This restriction plus the adopted in-sequence measures may exacerbate internal blocking that, in turn, may lead to performance degradation [11].

Minimizing the complexity of the central module of a Clos-network switch has been of research interest in recent years. Hassen et al. proposed a Clos-network switch that combines different switching stages [17]. In that work, central modules are replaced with multi-directional networks-on-chip (MDN) modules. The switch uses a static dispatching scheme from the input/output modules, for which every input constantly delivers packets to the same MDN module, and adopts inter-central-module routing to enable forwarding of the cells to the final destination. However, this switch may forward cells to the output ports out of sequence if cells from the same flow are routed through different paths on the central modules.

Load balancing traffic prior to routing it towards the destination output is a technique that not only improves switching performance but also reduces the configuration complexity of a packet switch when the load-balancing and routing follow a deterministic schedule [18]. Such a schedule may be obtained as an application of matrix decomposition [19, 20]. This technique enables high performance not only on switches but also on a large number of network applications [21].

A switch that load-balances traffic may need at least two stages to operate; one for load balancing and the other for routing cells to their destination outputs [18]. A switch with such a deterministic and periodic schedule may require the use of queues between the load-balancing and routing stages. However, placing such queues and enabling multiple interconnection paths between an input and an output make load-balancing switches susceptible to forwarding cells out of sequence [18]. This issue has been addressed by introducing either re-sequencing buffers at the output ports [22] or mechanisms that prevent out-of-sequence forwarding [23, 24]. However, these approaches are either complex or degrade switching performance.

Load balancing has been applied to Clos-network switches [9, 25]. For example, Zhang et al. [25] proposed an SMM switch that adopts the two-stage load-balanced Birkhoff-von Neumann switch in each central module but has no input port buffers. Here, a central module consists of two $k\times k$ bufferless crossbar switches and $k$ buffers in between the crossbars. The switch performs load balancing at the input module and the first stage of the load-balanced Birkhoff-von Neumann switch. Each of these buffers accommodates up to one cell to guarantee the transmission of cells in sequence. However, the distance between modules in a large switch requires larger queue sizes, with which this switch would suffer from out-of-sequence forwarding.

The switches discussed above suffer from either limited switching performance, high complexity, or out-of-sequence forwarding. These drawbacks then raise the question, can a load-balancing Clos-network switch achieve high switching performance, low configuration complexity, and in-sequence cell forwarding without resorting to memory speedup?

In this paper, we aim at answering this question by proposing a split-central-buffered Load-Balancing Clos-network (LBC) switch. The switch has a split central module and queues in between. The switch employs predetermined and periodic interconnection patterns to interconnect the inputs and outputs of the switch modules. The switch load balances the incoming traffic and switches the cells towards the destination outputs, both with minimum configuration complexity. The result is a switch that attains high throughput under admissible traffic with independent and identical distribution (i.i.d.) and uses a configuration scheme with $O(1)$ complexity. The switch also adopts an in-sequence forwarding mechanism at the input queues to keep cells in sequence despite the presence of buffers between the split CMs.

Different from existing switching architectures, as discussed above, the LBC switch achieves high performance, configuration simplicity, and in-sequence service, all attained without memory speedup or central-module expansion.

We analyze the performance of the proposed switch by modeling the effect of each stage on the traffic passing through the switch. In addition, we study the performance of the switch through traffic analysis and computer simulation. We show that the throughput of the switch approaches 100% under several admissible traffic models, including traffic with nonuniform distributions, and demonstrate that the switch forwards cells to the output ports in sequence. The high performance and the in-sequence forwarding of packets of the switch are both achieved without resorting to speedup throughout the switch.

In summary, the contributions of this paper are as follows: 1) the proposal of a configuration scheme for a split-central-buffered load-balancing switch such that the attained throughput is 100% under admissible traffic while having $O(1)$ scheduling complexity, 2) the proposal of an in-sequence mechanism for forwarding of cells in sequence throughout the switch, 3) the presentation of throughput analysis of the LBC switch for each of the stages that shows that the switch achieves 100% throughput under i.i.d. admissible traffic, and 4) proof of the in-sequence capability of the proposed in-sequence forwarding mechanism.

The remainder of this paper is organized as follows: Section II introduces the LBC switch. Section III analyzes the throughput performance of the proposed switch. Section IV analyzes the in-sequence forwarding property of the LBC switch. Section V presents a simulation study on the performance of the proposed switch. Section VI presents our conclusions.

II Switch Architecture

The LBC switch has $N$ inputs and $N$ outputs, each denoted as $IP(i,s)$ and $OP(j,d)$ , respectively, where $0\leq i,~{}j\leq k-1$ , $0\leq s,~{}d\leq n-1$ , and $N=nk$ . Figure 1 shows the architecture of the LBC switch. This switch has $k$ $n\times m$ IMs and $k$ $m\times n$ OMs. Each central module is split into two modules called central-input and -output modules, denoted as CIMs and COMs, respectively. The switch has $m$ CIMs and the same number of COMs. Each CIM and COM is a $k\times k$ switch. In the remainder of this paper, we set $n=k=m$ for symmetry and cost-effectiveness. The IMs, CIMs, and COMs are bufferless crossbars while the OMs are buffered ones.

The use of a split central module on this switch enables preserving staggered symmetry and in-order delivery [26] by using a pre-determined configuration in the IMs, CIMs, and COMs with a mirror sequence between CIMs and COMs. Staggered symmetry and in-order delivery refer to the fact that at time slot $t$, $IP(i,s)$ connects to $COM(r)$, which connects to $OM(j)$. Then at the next time slot $(t+1)$, $IP(i,s)$ connects to $COM((r+1) \bmod m)$, which also connects to $OM(j)$. This property enables the configuration of IMs/CIMs and COMs to be easily represented with a pre-determined compound permutation that repeats every $k$ time slots. This property also ensures that cells experience the same amount of delay under uniform traffic and allows the incorporation of a simple in-sequence mechanism. A switch with queues between IMs and CMs but without a split central module may require more complex load balancing and routing configurations to achieve the same objective.
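The staggered symmetry can be checked numerically. The following is a minimal sketch, assuming $n=k=m$, the configuration rules $r=(s+t) \bmod m$, $p=(i+t) \bmod k$, and $j=(p-t) \bmod k$ of Section II-A, and that all three stages are evaluated at the same time slot. The function names `path_at_slot` and `staggered_symmetry_holds` are illustrative and not part of the paper.

```python
from itertools import product


def path_at_slot(i, s, t, k):
    """Return (r, p, j) for IP(i, s) at time slot t, assuming n = k = m."""
    r = (s + t) % k  # IM(i) connects IP(i, s) to CIM(r)
    p = (i + t) % k  # CIM(r) connects L_IM(i, r) to L_CIM(r, p)
    j = (p - t) % k  # COM(r) connects I_C(r, p) to OM(j)
    return r, p, j


def staggered_symmetry_holds(k):
    """Check that r advances by one per slot while the reachable OM stays fixed."""
    for i, s, t in product(range(k), range(k), range(2 * k)):
        r0, _, j0 = path_at_slot(i, s, t, k)
        r1, _, j1 = path_at_slot(i, s, t + 1, k)
        if r1 != (r0 + 1) % k or j1 != j0:
            return False
        # The compound configuration repeats every k time slots.
        if path_at_slot(i, s, t + k, k) != path_at_slot(i, s, t, k):
            return False
    return True


print(staggered_symmetry_holds(3))
```

Under these assumptions the OM reached in the same slot depends only on the IM index $i$, while the COM index steps through all $m$ values, which is what lets the pattern be stored as a short periodic table.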

Each input port has $N$ virtual output queues (VOQs), denoted as $VOQ(i,s,j,d)$, to store cells destined to output port $d$ at $OM(j)$. The combination of IMs and CIMs forms a compound stage, called the IM-CIM stage. The COMs and OMs operate as single stages. There are queues placed between CIMs and COMs to store cells coming from an IM and destined to OMs. These central queues may be implemented as virtual output port queues (VOPQs), as shown in Figure 2(a). Each VOPQ, denoted as $VOPQ(r,p,j,d)$, stores cells coming for $OP(j,d)$ through $L_{CIM}(r,p)$. As an alternative, to reduce the number of VOPQs for a large switch, we consider the use of virtual output module queues (VOMQs) instead, as shown in Figure 2(b). A VOMQ, denoted as $VOMQ(r,p,j)$, stores cells for all OPs at $OM(j)$. Each of these queues stores cells coming from $L_{CIM}(r,p)$ and destined to $OM(j)$. Compared to VOPQs, VOMQs introduce the possibility of head-of-line (HoL) blocking. However, as we show in Section II-F, such an HoL effect is not a concern when the switch is loaded with admissible traffic. The remainder of this paper considers VOMQs, as this option stresses the load-balancing feature of LBC.

Every CIM has $k$ $L_{CIM}$ ports. Every $L_{CIM}(r,p)$ of a CIM is connected to one input $I_{C}(r,p)$ of the corresponding COM. Each $L_{CIM}$ includes a set of $k$ VOMQs, one per OM. Each OP has $m$ crosspoint buffers, each denoted as $CB(r,j,d)$. A flow control mechanism, described in Section II-E, operates between VOMQs and VOQs, and between CBs and VOMQs, to avoid buffer overflow. The VOMQs are off-chip. The switch has $N$ $L_{CIM}$s, and therefore $N$ sets of $k$ VOMQs each. Table I lists the notations used in the description of the LBC switch.
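To make the queue counts concrete, the following sketch tallies the buffers implied by the description above for given $n$, $k$, and $m$. The class name `LbcDimensions` and its property names are hypothetical helpers introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LbcDimensions:
    n: int  # input/output ports per IM and OM
    k: int  # number of IMs and OMs
    m: int  # number of CIMs and COMs

    @property
    def ports(self) -> int:
        return self.n * self.k  # N = nk

    @property
    def voqs_per_input(self) -> int:
        return self.ports  # one VOQ per output port

    @property
    def lcim_ports(self) -> int:
        return self.m * self.k  # m CIMs with k output ports each

    @property
    def vomqs_total(self) -> int:
        return self.lcim_ports * self.k  # k VOMQs (one per OM) at every L_CIM

    @property
    def vopqs_total(self) -> int:
        return self.lcim_ports * self.ports  # VOPQ alternative: one queue per OP

    @property
    def cbs_per_output(self) -> int:
        return self.m  # one crosspoint buffer per COM


dims = LbcDimensions(n=3, k=3, m=3)
print(dims.ports, dims.vomqs_total, dims.vopqs_total, dims.cbs_per_output)
```

For the $9\times 9$ example with $n=k=m=3$, the VOMQ option needs 27 central queues where the VOPQ option would need 81, which illustrates why VOMQs are attractive for large switches.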

The following is a walk-through description of how the switch operates: After arriving at the IP, a cell is placed at the VOQ corresponding to its destination OP. The IP arbiter selects a VOQ to be served in a round-robin manner. When a VOQ is selected, the HoL cell is forwarded to a VOMQ at the $L_{CIM}$ identified by the current configuration of the IM and CIM. The VOMQ is the one associated with the OM that includes the destination OP of the cell. When the configuration of the COM permits forwarding to the destination OM, the cell is forwarded to the OM and stored at the crosspoint buffer (CB) allocated for cells from the source COM. The OP arbiter selects CBs in a round-robin manner. Upon selection of a CB, the HoL cell is forwarded from the CB to the OP.
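The walk-through can be traced for a single cell with a short sketch. It assumes $n=k=m$, the rules $r=(s+t) \bmod m$, $p=(i+t) \bmod k$, and $j=(p-t) \bmod k$ of Section II-A, an otherwise empty VOMQ, and that a cell stored at slot $t$ can leave the VOMQ no earlier than slot $t+1$. The last two points are simplifying assumptions of this illustration, and `CellPath` and `trace_cell` are hypothetical names.

```python
from typing import NamedTuple, Tuple


class CellPath(NamedTuple):
    vomq: Tuple[int, int, int]  # (r, p, j): the cell waits in VOMQ(r, p, j)
    com_slot: int               # slot in which COM(r) connects I_C(r, p) to OM(j)
    cb: Tuple[int, int, int]    # (r, j, d): the cell is stored in CB(r, j, d)


def trace_cell(i, s, j, d, t, k):
    """Trace a cell of flow IP(i, s) -> OP(j, d) that leaves its VOQ at slot t."""
    r = (s + t) % k  # CIM/COM selected by the IM configuration
    p = (i + t) % k  # CIM output port, equal to the COM input port
    # Smallest wait w >= 0 such that (p - (t + 1 + w)) mod k == j.
    # Python's % always returns a non-negative result for a positive modulus.
    wait = (p - j - (t + 1)) % k
    return CellPath((r, p, j), t + 1 + wait, (r, j, d))


print(trace_cell(i=0, s=0, j=2, d=1, t=0, k=3))
```

Because the COM pattern repeats every $k$ slots, the wait in this simplified model is at most $k-1$ slots for a cell at the head of its VOMQ.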

II-A Module Configuration

The IMs and CIMs in the IM-CIM stage are configured based on a pre-determined sequence of disjoint permutations, applying one permutation every time slot. We call a permutation disjoint from the set of permutations if an input-output pair is interconnected in one and only one of the permutations. This pre-determined sequence of permutations repeats every $k$ time slots. Cells at the inputs of IMs are forwarded to the outputs of the CIMs determined by the configuration of that time slot. A cell is then stored in the VOMQ corresponding to its destination OM.
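The disjointness property can be verified by enumeration. The sketch below assumes $n=k=m$ and models the IM-CIM stage with $r=(s+t) \bmod m$ and $p=(i+t) \bmod k$; the names `im_cim_output` and `check_disjoint_permutations` are illustrative.

```python
def im_cim_output(i, s, t, k):
    """L_CIM(r, p) reached by IP(i, s) at time slot t, assuming n = k = m."""
    return (s + t) % k, (i + t) % k


def check_disjoint_permutations(k):
    """Check that each slot is a permutation and no input-output pair repeats in a period."""
    inputs = [(i, s) for i in range(k) for s in range(k)]
    seen_pairs = set()
    for t in range(k):
        outputs = [im_cim_output(i, s, t, k) for i, s in inputs]
        # A permutation maps distinct inputs to distinct outputs.
        if len(set(outputs)) != len(inputs):
            return False
        pairs = set(zip(inputs, outputs))
        # Disjoint: a pair used in this slot was not used in an earlier slot.
        if pairs & seen_pairs:
            return False
        seen_pairs |= pairs
    return True


print(check_disjoint_permutations(3))
```

Over one period each input port reaches $k$ different $L_{CIM}$ ports, one in each CIM, which is how its traffic is spread evenly across the central modules.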

The COMs follow a configuration similar to that of the CIMs, but in a mirror (i.e., reverse order) sequence. The HoL cell at the VOMQ destined to $OM(j)$ is forwarded to its destination when the input of the COM is connected to the input of the destination $OM(j)$ . Else, the HoL cell waits until the required configuration takes place. The forwarded cell is queued at the CB of its destination OP once it arrives in the OM. At the OP, a CB (i.e., HoL cell of that queue) is selected from all non-empty CBs by an output arbitration scheme.

The specific configurations of the bufferless modules, IM, CIM, and COM, are as follows.

At time slot $t$ , $IM(i)$ is configured to interconnect input $IP(i,s)$ to $L_{IM}(i,r)$ , with:

$$r=(s+t) \bmod m \tag{1}$$

Similarly, CIM input $L_{IM}(i,r)$ is interconnected to CIM output $L_{CIM}(r,p)$ at time slot $t$ with:

$$p=(i+t) \bmod k \tag{2}$$

The configuration of COMs is similar to that of IMs, but in a reverse sequence. At time slot $t$ , COM input $I_{C}(r,p)$ is interconnected to output $L_{COM}(r,j)$ with:

$$j=(p-t) \bmod k \tag{3}$$

Round-robin could also be used to select VOMQs and configure COMs. OM buffers allow forwarding a cell from a VOMQ to the destination output without requiring port matching [14].

Figure 3 shows an example of the configuration of a $9\times 9$ LBC switch. As $k=3$ , the example shows the configuration of three consecutive time slots, after which the configuration pattern repeats. Because similar connections are set for all the IMs and CIMs and a different connection pattern is set for all COMs at each time slot, Table II describes the configuration on the figure for $IM(0)$ , $CIM(0)$ , and $COM(0)$ at each time slot. In this example, we use $\rightarrow$ to denote an interconnection.
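The entries of Table II can be regenerated from the three configuration rules $r=(s+t) \bmod m$, $p=(i+t) \bmod k$, and $j=(p-t) \bmod k$. This is a small sketch with a hypothetical helper, `lbc_configuration`, that returns index pairs rather than the link names used in the table.

```python
def lbc_configuration(t, k):
    """Connections of IM(0), CIM(0), and COM(0) at slot t as (input, output) index pairs."""
    im0 = [(s, (s + t) % k) for s in range(k)]   # IP(0, s)   -> L_IM(0, r)
    cim0 = [(i, (i + t) % k) for i in range(k)]  # L_IM(i, 0) -> L_CIM(0, p)
    com0 = [(p, (p - t) % k) for p in range(k)]  # I_C(0, p)  -> L_COM(0, j)
    return im0, cim0, com0


for t in range(3):
    im0, cim0, com0 = lbc_configuration(t, 3)
    print(f"t={t} IM(0)={im0} CIM(0)={cim0} COM(0)={com0}")
```

At $t=0$ every module connects input index $x$ to output index $x$; in later slots the IM and CIM patterns rotate forward while the COM pattern rotates backward, which is the mirror sequence described above.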

II-B Arbitration at Output Ports

An output port arbiter selects a HoL cell from the crosspoint buffers in a round-robin fashion. Because there is one cell from each flow at these buffers, out-of-sequence forwarding is not a concern at this stage. We discuss this case in Section IV. Here, a flow is the set of cells from $IP(i,s)$ destined to $OP(j,d)$ . The round-robin schedule ensures fair service for different flows.
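A round-robin arbiter over the $m$ crosspoint buffers of one output port can be sketched as follows. The paper specifies only that the selection is round-robin; the pointer-update rule used here (resume the search after the buffer that was just served) and the class name `RoundRobinArbiter` are assumptions of this illustration.

```python
from collections import deque


class RoundRobinArbiter:
    """Round-robin selection among the m crosspoint buffers of one output port."""

    def __init__(self, m):
        self.buffers = [deque() for _ in range(m)]  # buffers[r] models CB(r, j, d)
        self.pointer = 0                            # buffer examined first

    def enqueue(self, r, cell):
        self.buffers[r].append(cell)

    def select(self):
        """Return the HoL cell of the next non-empty buffer, or None if all are empty."""
        m = len(self.buffers)
        for offset in range(m):
            r = (self.pointer + offset) % m
            if self.buffers[r]:
                self.pointer = (r + 1) % m  # assumed rule: continue after the served buffer
                return self.buffers[r].popleft()
        return None


arbiter = RoundRobinArbiter(m=3)
arbiter.enqueue(0, "a0")
arbiter.enqueue(0, "a1")
arbiter.enqueue(2, "c0")
print([arbiter.select() for _ in range(4)])
```

With this rule a backlogged buffer is served once every $m$ selections in the worst case, which corresponds to the lower bound on the CB service rate used in Section II-F.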

II-C In-sequence Cell Forwarding Mechanism

The proposed in-sequence forwarding mechanism for the LBC switch is based on holding cells of a flow at the VOQs so that no younger cell is forwarded from VOMQs to OPs before any given cell of the same flow. The policy used for holding cells at an IP is as follows: No cell of flow $y$ at the IP is forwarded to a VOMQ for $\delta k$ time slots after cell $\tau$ of the same flow has been forwarded to a VOMQ, whose occupancy is $\delta$ cells at the time of arrival in the VOMQ. For a cell that arrives at an empty VOMQ, $\delta=0$ . The flow control mechanism keeps IPs informed about VOMQ occupancy as discussed in Section II-E.

Figure 4 shows an example of this forwarding mechanism for flow $A$. Cells from flow $A$ are denoted as $A_{t}$, where $t$ is the cell arrival time. In this example, cells arrive at time slots 1, 2, 4, and 5, and they are denoted as $A_{1}$, $A_{2}$, $A_{4}$, and $A_{5}$, respectively. VOMQ$(k)$ denotes the $k$th VOMQ to where cells are forwarded. Here, the “X” mark indicates that the buffer at VOMQ$(k)$ is occupied by cells from other flows. Assuming $k=3$ and no other cell arrival or departure during this time period, $A_{1}$ is the first cell of the flow with arrival time $t=1$ and is sent to VOMQ$(1)$ at time slot $t=2$. Because VOMQ$(1)$ has no backlogged cells before $A_{1}$, there is no waiting time for $A_{2}$. Therefore, $A_{2}$ is sent to VOMQ$(2)$ at $t=3$. $A_{2}$ finds three cells already queued, so no cell from this flow is forwarded in $3\times 3=9$ time slots, or from time slots $t=4$ to $t=12$. After that, $A_{4}$ is sent to VOMQ$(3)$ at $t=13$. This cell finds no other cell, so $A_{5}$ is sent to VOMQ$(1)$ at $t=14$.
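The timing of this example can be reproduced with a short sketch of the hold-down rule. It assumes that a cell becomes eligible one slot after its arrival, that a flow forwards at most one cell per slot, and that the occupancy $\delta$ found by each cell is supplied as an input; VOQ arbitration among flows is not modeled. The function name `forward_slots` is illustrative.

```python
def forward_slots(arrivals, occupancy_found, k):
    """Slots in which the cells of one flow are forwarded to VOMQs.

    arrivals: arrival slots of the cells, in order.
    occupancy_found: delta for each cell, the VOMQ occupancy it finds on arrival there.
    """
    slots = []
    next_allowed = 0
    for arrival, delta in zip(arrivals, occupancy_found):
        slot = max(arrival + 1, next_allowed)  # assumed: eligible one slot after arrival
        slots.append(slot)
        # Hold the flow for delta * k slots after this cell is forwarded.
        next_allowed = slot + 1 + delta * k
    return slots


# Flow A: arrivals at slots 1, 2, 4, 5; only A_2 finds a backlog (three cells).
print(forward_slots([1, 2, 4, 5], [0, 3, 0, 0], k=3))  # [2, 3, 13, 14]
```

The result matches the slots stated in the example: $A_{1}$ at $t=2$, $A_{2}$ at $t=3$, $A_{4}$ at $t=13$, and $A_{5}$ at $t=14$.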

II-D Implementation of In-sequence Mechanism

Each IP has an input port counter (IPC) for each VOMQ to which it connects. IPCs keep track of the number of cells at these VOMQs. Each IP also has a hold-down timer for each VOQ. The timer is used by the in-sequence forwarding mechanism. The timer is triggered by the IPC count of the VOMQ where the last cell was forwarded. When a cell is forwarded from a VOQ to VOMQ, and the IPC is updated to $\sigma$ , this update sets the hold-down timer for that VOQ for $(\sigma-1)k$ time slots, where $\delta=\sigma-1$ .
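A per-VOQ hold-down timer driven by the IPC count might look like the sketch below. The class `HoldDownTimer` and its methods are hypothetical; only the rule of setting the timer to $(\sigma-1)k$ slots comes from the text.

```python
class HoldDownTimer:
    """Hold-down timer of one VOQ, armed from the IPC count of the target VOMQ."""

    def __init__(self, k):
        self.k = k
        self.remaining = 0  # time slots during which the VOQ may not forward

    def on_forward(self, sigma):
        """Arm the timer after a cell is forwarded and the IPC is updated to sigma."""
        delta = sigma - 1  # cells that were already in the VOMQ
        self.remaining = delta * self.k

    def tick(self):
        """Advance one time slot."""
        if self.remaining > 0:
            self.remaining -= 1

    @property
    def eligible(self):
        return self.remaining == 0


timer = HoldDownTimer(k=3)
timer.on_forward(sigma=4)  # the forwarded cell found three cells queued
print(timer.remaining)     # 9
```

A cell that lands in an empty VOMQ yields $\sigma=1$, so the timer stays at zero and the next cell of the flow is not delayed.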

II-E Flow Control

There is a flow control mechanism between VOMQs and IPs and another between CBs and VOMQs that extends to IPs. There are fixed connections between each VOMQ and its $k$ corresponding IPs and between each CB and its corresponding $k$ $I_{C}$s. Each IP has $mk=N$ occupancy counters, IPCs, one per VOMQ. Each VOMQ updates the corresponding $k$ IPCs about its occupancy. A VOMQ uses two thresholds for flow control: pause ($T_{pv}$) and resume ($T_{rv}$), where $T_{pv}>T_{rv}$, in number of cells. When the occupancy of a VOMQ, $|VOMQ|$, is larger than $T_{pv}$, the VOMQ signals the corresponding IPs to pause sending cells to it. When $|VOMQ|<T_{rv}$, the VOMQ signals the corresponding IPs to resume sending cells to it. Here, $T_{pv}$ is such that $C_{VOMQ}-T_{pv}\geq D_{v}$, where $C_{VOMQ}$ is the size of the VOMQ and $D_{v}$ is the flow-control information delay.

Similar to VOMQs, CBs use two thresholds; pause ( $T_{pc}$ ) and resume ( $T_{rc}$ ), where $T_{pc}>T_{rc}$ , and $T_{pc}$ is such that $C_{CB}-T_{pc}\geq D_{c}$ , for a CB size, $C_{CB}$ , and flow-control information delay between a CB and corresponding IPs, $D_{c}$ . These CB thresholds work in a similar way as for VOMQs. Different from IPs, VOMQs have a binary flag to pause/resume forwarding of cells to CBs. When the occupancy of a CB, $|CB|$ , becomes larger than $T_{pc}$ , the CB informs the corresponding VOMQs, and in turn VOMQs inform corresponding IPs to pause forwarding cells to the VOMQ for the congested OP. With IPs paused for traffic to a CB, traffic already at VOMQs can still be forwarded to CBs as long as $|CB|$ is such that $T_{pc}<|CB|<C_{CB}$ . When $|CB|<T_{rc}$ , the CB signals the corresponding VOMQs to resume forwarding, and in turn, VOMQs signal source IPs to resume forwarding cells for that destination OP.
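The pause/resume thresholds behave as a hysteresis loop, which the following sketch models for a single queue (a VOMQ or a CB). The class name `ThresholdFlowControl`, its parameter names, and the numeric values in the usage lines are illustrative assumptions; the conditions checked are those stated above, namely that the pause threshold exceeds the resume threshold and that the capacity minus the pause threshold covers the flow-control information delay.

```python
class ThresholdFlowControl:
    """Pause/resume signaling for one queue using two occupancy thresholds."""

    def __init__(self, capacity, pause_threshold, resume_threshold, info_delay):
        if pause_threshold <= resume_threshold:
            raise ValueError("pause threshold must exceed resume threshold")
        if capacity - pause_threshold < info_delay:
            raise ValueError("headroom must cover the flow-control information delay")
        self.pause_threshold = pause_threshold
        self.resume_threshold = resume_threshold
        self.paused = False

    def update(self, occupancy):
        """Update the pause state from the current occupancy and return it."""
        if occupancy > self.pause_threshold:
            self.paused = True
        elif occupancy < self.resume_threshold:
            self.paused = False
        # Between the two thresholds the previous state is kept.
        return self.paused


fc = ThresholdFlowControl(capacity=16, pause_threshold=12, resume_threshold=4, info_delay=4)
print([fc.update(x) for x in (12, 13, 8, 3)])
```

Keeping the state unchanged between the thresholds avoids rapid toggling of pause and resume signals when the occupancy hovers around a single value.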

II-F Avoiding HoL Blocking in LBC with VOMQs

Concerns of HoL blocking, owing to the aggregation of traffic going to different OPs at the same OM at a VOMQ, may arise. However, one must note that this HoL blocking may occur if and only if a CB gets congested. Here, we argue that the efficient load-balancing mechanism and the use of one CB for each COM at an OP avoid congestion of CBs even in the presence of heavy (but admissible) traffic. We also show that CB occupancy does not build up. Let us consider the input traffic matrix, $\mathbf{R_{1}}$ , with input load, $\lambda_{i,s,j,d}$ , which gets load-balanced to CIMs at a rate of $\frac{1}{m}$ . The aggregate traffic arrival rate at an $L_{CIM}$ from all IMs, $R_{LCIM}$ , is:

[TABLE]

Therefore, the traffic arrival rate to a CB from COMs, $R_{CB}$ , is:

[TABLE]

To test the growth of CBs, we consider three stressing traffic scenarios: a) All IPs in the switch have traffic only for OPs in an OM; b) all IPs in an IM forward traffic to all OPs in an OM; and c) a single flow, with a large rate, going from an IP to a single OP.

Then, for a) the largest arrival rate at IPs while being admissible is:

[TABLE]

Substituting (6) into (5) and $m=n=k$ yields:

[TABLE]

Because round-robin is used as selection policy at an OP, the service rate, ${S_{CB}}$ , of a CB would be:

[TABLE]

Yet, considering the worst-case scenario:

[TABLE]

Therefore, CB occupancy does not grow because ${S_{CB}>R_{CB}}$ .

For b), the arrival rate at IPs for admissibility is:

[TABLE]

Substituting (9) into (5) yields:

[TABLE]

The service rate would be the same as in (8). Therefore, the CB would not become congested as ${R_{CB}}={S_{CB}}$ .

For c), the arrival rate at the IP:

[TABLE]

The traffic arrival rate to an $L_{CIM}$ is:

[TABLE]

The traffic arrival rate to a CB from COMs is:

[TABLE]

Therefore, the CB would not become congested since ${R_{CB}}\leq{S_{CB}}$ for admissible traffic.

III Throughput Analysis

In this section, we analyze the performance of the proposed LBC switch. Let us denote the traffic coming to the IM-CIM stage, the COM stage, the OMs, OPs, and the traffic leaving LBC as $\mathbf{R_{1}}$ , $\mathbf{R_{2}}$ , $\mathbf{R_{3}}$ , $\mathbf{R_{4}}$ , and ${R_{5}}$ , respectively. Figure 1 shows these analysis points. Here, $\mathbf{R_{1}}$ , $\mathbf{R_{2}}$ , and $\mathbf{R_{3}}$ are $N\times N$ matrices, $\mathbf{R_{4}}$ comprises $N$ $m\times 1$ column vectors, and ${R_{5}}$ comprises $N$ scalars.

The traffic from input ports to the IM-CIM stage, $\mathbf{R_{1}}$ , is defined as:

[TABLE]

Here, $\lambda_{u,v}$ is the arrival rate of traffic from input $u$ to output $v$ , where

[TABLE]

and $0\leq u,v\leq N-1$ .

In the following analysis, we consider admissible traffic, which is defined as:

[TABLE]

under i.i.d. traffic conditions.
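As a concrete check, the sketch below tests a rate matrix for admissibility. It assumes the usual definition, in which no input port and no output port is oversubscribed (every row sum and every column sum of $\mathbf{R_{1}}$ is at most 1); the function name `is_admissible` and the tolerance parameter are illustrative only.

```python
def is_admissible(rates, tol=1e-9):
    """Return True if no row sum or column sum of the N x N rate matrix exceeds 1.

    rates[u][v] is the arrival rate from input u to output v.
    """
    n = len(rates)
    if any(len(row) != n for row in rates):
        raise ValueError("rate matrix must be square")
    rows_ok = all(sum(row) <= 1 + tol for row in rates)
    cols_ok = all(sum(rates[u][v] for u in range(n)) <= 1 + tol for v in range(n))
    return rows_ok and cols_ok
```

A uniform $4\times 4$ matrix with every entry equal to $0.25$ is admissible, whereas a matrix in which two inputs each send at rate $0.6$ to the same output is not.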

The IM-CIM stage of the LBC switch balances the traffic load coming from the input ports to the VOMQs. Specifically, the permutations used to configure the IMs and CIMs interconnect the traffic from an input to $k$ different CIMs, and then to the VOMQs connected to these CIMs.

$\mathbf{R_{2}}$ is the traffic directed towards COMs and it is derived from $\mathbf{R_{1}}$ and the permutations of IMs and CIMs. The configuration of the combined IM-CIM stage at time slot $t$ that connects $IP(i,s)$ to $L_{CIM}(r,p)$ is represented as an $N\times N$ permutation matrix, $\mathbf{\Pi}(t)=[\pi_{u,\upsilon}]$ , where $r$ and $p$ are determined from (1) and (2) and the matrix element is:

[TABLE]

The configuration of the compound IM-CIM stage can be represented as a compound permutation matrix, $\mathbf{P_{1}}$ , which is the sum of the IM-CIM permutations over $k$ time slots as follows,

[TABLE]

Because the configuration is repeated every $k$ time slots, the traffic load from the same input going to each VOMQ is $\frac{1}{k}$ of the traffic load of $\mathbf{R_{1}}$ . Therefore, a row of $\mathbf{R_{2}}$ is the sum of the row elements of $\mathbf{R_{1}}$ at the non zero positions of $\mathbf{P_{1}}$ , normalized by $k$ . This is:

[TABLE]

where $\mathbb{1}$ denotes an $N\times N$ unit matrix and $\circ$ denotes element/position wise multiplication.

There are $k$ non-zero elements in each row of $\mathbf{R_{2}}$ . Here, $\mathbf{R_{2}}$ is the aggregate traffic in all the VOMQs destined to all OPs. This matrix can be further decomposed into $k$ $N\times N$ submatrices, $\mathbf{R_{2}}(j)$ , each of which is the aggregate traffic at VOMQs designated for $OM(j)$ .

[TABLE]

where $j$ is obtained from (16) $\forall\;d$ . The configuration of the COM stage at time slot $t$ that connects $I_{c}(r,p)$ to $L_{COM(r,j)}$ can be represented as an $N\times N$ permutation matrix, $\mathbf{\Phi}(t)=[\phi_{u,v}]$ , with the matrix element:

[TABLE]

Similarly, the switching at the COM stage is represented by a compound permutation matrix $\mathbf{P_{2}}$ , which is the sum of $k$ permutations of the COM stage over $k$ time slots. Here

[TABLE]

The output traffic of COMs going to different OMs is described by matrix $\mathbf{R_{3}}(j)$ , which is defined as

[TABLE]

where $j$ is obtained from (16) $\forall\;d$ . The traffic destined to $OP(j,d)$ at $OM(j)$ , $\mathbf{R_{3}}(j,d)$ , is obtained by extracting the traffic elements from $\mathbf{R_{3}}(j)$ , or:

[TABLE]

where $d$ is obtained from (16) for the different $j$ .

$\mathbf{D_{s}}$ is an $m\times N$ matrix, built by concatenating $N$ $k\times 1$ vectors of all ones, $\vec{1}$ , as:

[TABLE]

$\vec{A}$ is a $1\times k$ row vector, built by setting the first element to $1$ and every other element to [math], or:

[TABLE]

$\vec{A_{s}}$ is an $N\times 1$ column vector, built by concatenating $k$ $\vec{A}$ and taking the transpose, or:

[TABLE]

where $\vec{A_{s_{1}}}=\vec{A_{s_{k}}}=\vec{A}$ , such that

[TABLE]

The traffic queued at the CB of an OP, $\mathbf{R_{4}}(v)$ , is the multiplication of $\mathbf{D_{s}}$ , $\mathbf{R_{3}}(j,d)$ , and $\vec{A_{s}}$ , or:

[TABLE]

The traffic leaving an OP, ${R_{5}}(v)$ , is:

[TABLE]

Therefore, ${R_{5}}(v)$ is the sum of the traffic leaving $OP(v)$ .

Equations (19), (29), and (30) show that the admissibility conditions in (17) are satisfied by the traffic at the VOMQ, CBs, and OP. Since $\mathbf{R_{2}}$ , $\mathbf{R_{4}}(v)$ , and ${R_{5}}(v)$ meet the admissibility conditions in (17), this implies that the sum of the traffic load at each $VOMQ$ , $CB$ , and $OP$ does not exceed their respective capacities. From (29), we can deduce that $\mathbf{R_{4}}$ is equal to the input traffic $\mathbf{R_{1}}$ , or:

[TABLE]

From the admissibility of $\mathbf{R_{2}}$ , $\mathbf{R_{4}}(v)$ , ${R_{5}}(v)$ and (31), we can infer that the input traffic is successfully forwarded to the output ports.

As discussed in Section II-B, the output arbiter selects a flow in a round-robin fashion and if no cell of a flow is selected, the OP arbiter moves to the next flow. This implies the queues are work conserving which ensures fairness and that cells forwarded to OPs are successfully forwarded out of OPs. Hence, from (30), we can infer that ${R_{5}}(v)$ is equal to $\mathbf{R_{4}}(v)$ , or:

[TABLE]

From (31) and (32), we can conclude that LBC successfully forwards all traffic at IPs out of OPs.
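The work-conserving round-robin selection at an OP, on which the last step of this argument relies, can be sketched as follows. This is an illustrative model only: the class name `RoundRobinArbiter`, the use of one FIFO per CB, and the pointer-update rule (advance to the queue after the one just served) are assumptions, not details taken from Section II-B.

```python
from collections import deque


class RoundRobinArbiter:
    """Round-robin arbiter over a fixed set of FIFO queues (e.g., the CBs of one OP)."""

    def __init__(self, num_queues: int):
        self.queues = [deque() for _ in range(num_queues)]
        self.pointer = 0  # queue that has the highest priority in the next slot

    def enqueue(self, queue_index: int, cell) -> None:
        self.queues[queue_index].append(cell)

    def select(self):
        """Return the next cell to send, or None if every queue is empty."""
        n = len(self.queues)
        for offset in range(n):
            idx = (self.pointer + offset) % n
            if self.queues[idx]:
                # Skip empty queues, so a slot is never wasted while a cell waits.
                self.pointer = (idx + 1) % n
                return self.queues[idx].popleft()
        return None
```

Because empty queues are skipped within the same time slot, the OP sends a cell whenever at least one CB is non-empty, which is the work-conserving property used above.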

The following example shows the different traffic matrices for a 4 $\times$ 4 ( $k=2$ ) LBC switch. Let the input traffic matrix be

[TABLE]

First, $\mathbf{R_{1}}$ is decomposed into $\mathbf{R_{2}}$ at the IM-CIM stage. From (18), the compound permutation matrix for the IM-CIM stage for this switch is:

[TABLE]

Using (19), we get:

[TABLE]

From (20), the traffic matrices at VOMQs destined for the different OMs are:

[TABLE]

The rows of $\mathbf{R_{2}}(v)$ represent the traffic from IPs, and the columns represent $VOMQ(r,p,j)$ at $I_{C}(r,p)$ . From (22), the compound permutation matrix for the COM stage for this switch is:

[TABLE]

From (23) and (24), the traffic forwarded to an OP is:

[TABLE]

The rows of $\mathbf{R_{3}}(j,d)$ represent the traffic from $VOMQ(r,p,j)$ at $I_{C}(r,p)$ and the columns represent $L_{COM}(r,j)$ . $\mathbf{D_{s}}$ and $\vec{A_{s}}$ are obtained from (25) and (28), respectively, as:

[TABLE]

The traffic forwarded from $CB$ s to the corresponding $OP$ is obtained from (29):

[TABLE]

The rows of $\mathbf{R_{4}}(v)$ represent the traffic from $COM(r)$ . Using (30), we obtain the sum of the traffic leaving the OP, or:

[TABLE]

We use the traffic analysis of this section to demonstrate that the LBC switch achieves 100% throughput under admissible traffic. This demonstration is provided in Appendix B.

IV Analysis of In-Sequence Service

In this section, we demonstrate that the LBC switch forwards cells in sequence through the proposed in-sequence forwarding mechanism.

Table III lists the definition of terms used in the discussion of the properties of the proposed LBC switch. Here, $c_{y,\tau}(i,s,j,d)$ denotes the $\tau$ th cell of traffic flow $y$ , which comprises cells going from $IP(i,s)$ to $OP(j,d)$ with arrival time $t_{x}$ . In addition, $t_{a_{y,\tau}}$ denotes the arrival time of $c_{y,\tau}$ , and $q_{1_{y,\tau}}$ , $q_{2_{y,\tau}}$ , and $q_{3_{y,\tau}}$ denote the queuing delays experienced by $c_{y,\tau}$ at $VOQ(i,s,j,d)$ , $VOMQ(r,p,j)$ , and $CB(r,j,d)$ , respectively. The departure times of $c_{y,\tau}$ from these queues are denoted as $d_{1_{y,\tau}}$ , $d_{2_{y,\tau}}$ , and $d_{3_{y,\tau}}$ , respectively. In this paper, we consider admissible traffic as defined in (17).

Here, we claim that the LBC switch forwards cells in sequence to the output ports, through the following theorem.

Theorem 1.

For any two cells $c_{y,\tau}(i,s,j,d)$ and $c_{y,\tau^{\prime}}(i,s,j,d)$ , where $\tau<\tau^{\prime}$ , $c_{y,\tau}(i,s,j,d)$ departs the destination output port before $c_{y,\tau^{\prime}}(i,s,j,d)$ .

This theorem is sectioned into the following lemmas.

Lemma 1.

For a single flow traversing the LBC switch, any cell of the flow experiences the same delay. That is, let $t_{d}$ be the delay experienced by a cell. Then, $t_{d_{y,\tau}}=\gamma~{}~{}\forall~{}\tau$ , where $\gamma$ is a positive constant.

A constant delay for each cell implies that cells depart the switch in the same order they arrived under the conditions of this lemma.

Lemma 2.

For any number of flows traversing the LBC switch, cells from the same flow arrive at the OM in sequence.

Lemma 3.

For any number of flows traversing the LBC switch, the cells of each flow arrive and are cleared at the output port (OP) in the same order the cells arrived at the input port (IP).

Appendix A presents the proofs of these lemmas.

V Performance Analysis

We evaluated the performance of the LBC switch through computer simulation under both uniform and nonuniform traffic models. We also compared the performance of the proposed switch with that of an output-queued (OQ) switch, a high-performing Memory-Memory-Memory Clos-network (MMM) switch, and an MMM switch with extended memory (MMeM). The MMM switch uses forwarding arbitration schemes to select cells from the buffers in the previous stage modules and is agnostic to cell sequence, therefore delivering high switching performance. We considered switches with sizes $N=\{64,256\}$ .

V-A Uniform Traffic

We evaluated the LBC, OQ, MMM, and MMeM switches under uniform traffic with Bernoulli and bursty arrivals. Figures 5(a) and 5(b) show the average delay under uniform Bernoulli traffic arrivals for $N=64$ and $N=256$ , respectively. The results in the figures show that the LBC switch achieves 100% throughput under uniform traffic with Bernoulli arrivals, indicated by the finite and moderate average queuing delay. The high throughput performance of the proposed switch is the result of using an efficient load-balancing process in the IM-CIM stage. However, this high performance is expected under this traffic pattern as the incoming traffic is already uniformly distributed.

Figure 5(a) shows that the LBC switch experiences a similar delay as the MMeM switch at high input load. Figure 5(b) shows that the LBC switch experiences a slightly higher average delay than the OQ switch. This additional delay in the LBC switch is caused by having cells wait in the VOMQs until a configuration that allows forwarding the cells to their destination output modules takes place. Because MMeM requires an excessive amount of memory to implement the extended set of queues, its average cell delay could not be measured for $N$ =256 by our simulators. This figure also shows that the LBC switch achieves a lower average delay than the MMM switch for input loads of $0.95$ and larger.

Uniform bursty traffic is modeled as an ON-OFF Markov modulated process, with the average duration of the ON period set as the average burst length, $l$ , with $l=\{10,30\}$ cells. Figures 5(c) and 5(d) show the average delay under uniform traffic with bursty arrivals for average burst lengths of 10 and 30 cells, respectively, for switches with $N$ =256. The results show that the LBC switch achieves 100% throughput under bursty uniform traffic. In contrast, the MMM switch has a throughput of 0.8 and 0.75 for an average burst length of 10 and 30 cells, respectively. Therefore, the LBC switch achieves a performance closer to that of the OQ switch, with only a very small difference in delay. The uniformly distributed nature of the traffic and the load-balancing stages help to achieve this high throughput and low queueing delay. The slightly larger average queueing delay of the LBC switch for very small input loads is caused by the predetermined and cyclic configuration of the bufferless modules, as some cells wait for a few time slots to be forwarded, irrespective of the switch size. Nevertheless, these two figures show that the queueing delay difference between the LBC and the OQ switch is not significant for large input loads.
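A minimal sketch of an ON-OFF source for one input port follows. It is not the simulator used in this paper. It assumes geometrically distributed ON periods with mean $l$ (one cell per ON slot) and geometrically distributed OFF periods, possibly of zero length, with mean $l(1-\rho)/\rho$ so that the long-run load is $\rho$ ; the function names and the seed parameter are hypothetical, and destination selection is not modeled.

```python
import random


def geometric(rng, p, start):
    """Sample a geometric variate with success probability p, counting up from start."""
    n = start
    while rng.random() >= p:
        n += 1
    return n


def on_off_arrivals(load, mean_burst, num_slots, seed=0):
    """Return a list of 0/1 arrival indicators, one per time slot, for one input."""
    if not 0 < load <= 1 or mean_burst < 1:
        raise ValueError("need 0 < load <= 1 and mean_burst >= 1")
    rng = random.Random(seed)
    p_on = 1.0 / mean_burst                    # ON length >= 1, mean = mean_burst
    mean_off = mean_burst * (1 - load) / load  # chosen so that the ON fraction equals load
    p_off = 1.0 / (1.0 + mean_off)             # OFF length >= 0, mean = mean_off
    slots = []
    while len(slots) < num_slots:
        slots.extend([1] * geometric(rng, p_on, 1))
        slots.extend([0] * geometric(rng, p_off, 0))
    return slots[:num_slots]
```

Allowing zero-length OFF periods keeps the model valid for loads close to 1, where consecutive bursts arrive back to back.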

V-B Nonuniform Traffic

We also compared the performance of the proposed LBC switch with the MMM, MMeM, and OQ switches under unbalanced [27, 28] and hot-spot patterns as nonuniform traffic. The unbalanced traffic can be modeled using an unbalanced probability $\omega$ to indicate the load variances for different flows. Consider input port $IP(i,s)$ and output port $OP(j,d)$ of the LBC switch; the traffic load is determined by

[TABLE]

where $\rho$ is the traffic load for input $IP(i,s)$ and $\omega$ is the unbalanced probability. When $\omega$ =0, the input traffic is uniformly distributed and when $\omega$ =1, the input traffic is completely directional; traffic from $IP(i,s)$ is destined for $OP(j,d)$ .
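For illustration, the sketch below builds a rate matrix under the form of the unbalanced model commonly used in the cited literature, which is assumed here rather than copied from the equation above: each input sends a fraction $\omega$ of its load to one favored output (taken, as an assumption, to be the output with the same index) and spreads the remaining $1-\omega$ uniformly over all $N$ outputs. The function name `unbalanced_matrix` is hypothetical.

```python
def unbalanced_matrix(n_ports, rho, omega):
    """Rate matrix for unbalanced traffic; entry [u][v] is the load from input u to output v.

    Assumed model: rho * (omega + (1 - omega) / N) on the favored output (v == u),
    and rho * (1 - omega) / N on every other output.
    """
    if not 0 <= omega <= 1 or not 0 <= rho <= 1:
        raise ValueError("rho and omega must lie in [0, 1]")
    base = rho * (1 - omega) / n_ports
    return [
        [base + (rho * omega if u == v else 0.0) for v in range(n_ports)]
        for u in range(n_ports)
    ]
```

Under this model every row and every column sums to $\rho$ , so the traffic is admissible for any $\omega$ as long as $\rho\leq 1$ .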

The simulation results show that the throughput of the LBC switch is 100% under this traffic pattern for all values of $\omega$ , matching those of MMM and MMeM switches, which are also known to achieve high throughput but neglect in-sequence forwarding. It has been shown that many switches do not achieve high throughput when $\omega$ is around 0.6 [28]. Therefore, we measured the average delay of the LBC switch under this traffic pattern for $\omega$ =0.6, as shown in Figure 5(e), and compared it with that of the OQ switch, as this switch is well known to achieve 100% throughput. As the figure shows, the average delay of the LBC switch is comparable to that of an OQ switch. The load-balancing stage of the LBC switch distributes the traffic uniformly throughout the switch.

We compared the performance of the proposed LBC switch with the MMM, MMeM, and OQ switches under hot-spot traffic [24]. Hot-spot traffic occurs when all IPs send most or all traffic to one OP. Consider input port $IP(i,s)$ and output port $OP(j,d)$ of the LBC switch, the traffic load is determined by

[TABLE]

where $h$ is the hot-spot OP and $1\leq h\leq N$ .

Our simulation shows that the LBC switch as well as the MMM and MMeM switches achieve 100% throughput under admissible hot-spot traffic.

Figure 5(f) shows the measured average delay of the LBC switch under this traffic pattern and that of an OQ switch. The figure shows that the average delay of the LBC switch is comparable to that of an OQ switch. This is a result of effective load balancing at the IMs, CIMs, and COMs of the multiple flows coming from different inputs.

In addition to the analysis presented in Section II-F, we also simulated the LBC switch under two new traffic patterns, which we believe may stress the occupancy of CBs and therefore increase the likelihood of occurrence of HoL blocking conditions. The traffic patterns are: a) $k$ flows from IPs at different IMs, each arriving at a rate of $\frac{1}{k}$ for admissibility, are forwarded to all OPs at one OM. The source IPs of the flows are selected such that they share VOMQs; $i=s$ or $IP(0,0),IP(1,1),\cdots,IP(k-1,n-1)$ . b) Each IP at an IM forwards cells at rate $\frac{1}{k}$ to each OP at an OM (e.g., $i=j$ ). Each OP in the destination OM receives traffic from all IPs of one IM. VOMQs are also shared by different flows. Figures 6(a) and 6(b) show the average delay under the first and second traffic patterns presented above, respectively. The results in the figures show that LBC experiences a finite and moderate average queuing delay, which implies that LBC achieves 100% throughput under both traffic patterns. We also measured the average CB length and this length does not grow more than one cell, indicating that no CB gets congested. This result is obtained because the load-balancing mechanism spreads a flow to different VOMQs.

VI Conclusions

We have introduced a configuration scheme for a split-central-buffered load-balancing Clos-network switch and a mechanism that forwards cells in sequence for this switch. To effectively perform load balancing, the switch has virtual output module queues between the two central stages. With the split central module, the switch comprises four stages, named IM, CIM, COM, and OM. The IM, CIM, and COM stages are bufferless crossbars, while the OMs are buffered. All the bufferless modules follow a predetermined configuration while the OM follows a round-robin sequence to forward cells from the CBs to the output ports. Therefore, the switch does not have to perform matching in any stage despite having bufferless modules, and the configuration complexity of the switch is minimal, making it comparable to that of MMM switches. We introduced an in-sequence mechanism that operates at the inputs of the LBC switch to avoid out-of-sequence forwarding caused by the central buffers. We modeled and analyzed the operations that each of the stages effects on the incoming traffic to obtain the loads seen by the output ports. We showed that for admissible independent and identically distributed traffic, the switch achieves 100% throughput. Unlike the existing switching architectures discussed in Section I, LBC achieves high performance, configuration simplicity, and in-sequence service without memory speedup or central module expansion. In addition, we analyzed the operation of the forwarding mechanism and demonstrated that cells are forwarded in sequence. We showed, through computer simulation, that the switch achieves 100% throughput for all tested uniform and nonuniform traffic patterns.

Appendix A Analysis of In-Sequence Service

In this section, we demonstrate the lemmas that support the theorem where we claim that the LBC switch forwards cells in sequence through the proposed in-sequence forwarding mechanism.

**Lemma 1.** For a single flow traversing the LBC switch, any cell of the flow experiences the same delay. That is, let $t_{d}$ be the delay experienced by a cell. Then, for any cell traversing the LBC switch, $t_{d_{y,\tau}}=\gamma$ , where $\gamma$ is a positive constant.

We analyze first the scenario of a single flow, i.e., $y$ , traversing the switch, whose cells arrive back to back, one each time slot. For simplicity but without losing generality, let us also consider empty queues as an initial condition.

Proof:

For any $c_{y,\tau}$ , the total delay time is defined as:

[TABLE]

in number of time slots. Here we consider fixed arbitration time at each queue and this delay is included in the queuing delay. We are then interested in finding $q_{1_{y,\tau}}$ , $q_{2_{y,\tau}}$ , and $q_{3_{y,\tau}}$ .

For $q_{1_{y,\tau}}$ , under a single-flow scenario, consider any two cells of the flow with arrival times $k$ time slots apart, such as $c_{y,\tau-2k}$ and $c_{y,\tau-k}$ ; such cells are forwarded to the same VOMQ. Cell $c_{y,\tau}$ is held at the VOQ (owing to the mechanism that keeps cells in sequence at the VOQ) if $c_{y,\tau-k}$ finds one or more cells in the VOMQ, in which case $q_{1_{y,\tau}}$ increases. Here, the empty-queue initial condition makes the waiting factor $\delta=0$ .

On the other hand, an OM is connected to a VOMQ every $k$ time slots as per the configuration scheme of COM. Therefore,

[TABLE]

This queuing delay is smaller than the arrival gap between these two cells as:

[TABLE]

Therefore, $c_{y,\tau}$ is not backlogged further in VOMQ and there is no impact on the time the cell is held in a VOQ, such that:

[TABLE]

For $q_{2_{y,\tau}}$ , let us now assume that $c_{y,\tau-k}$ arrives at a time that it has to wait $\gamma$ time slots, where $1\leq\gamma\leq k$ , to be forwarded to the destination OM, or

[TABLE]

Then when $c_{y,\tau}$ arrives, $k$ time slots later, it finds exactly the same configuration in the COM as found by $c_{y,\tau-k}$ . Because cells arrive consecutively,

[TABLE]
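The periodicity argument used for $q_{2_{y,\tau}}$ can be illustrated with a short sketch. It assumes, hypothetically, that a VOMQ is connected to its destination OM in every time slot $t$ with $t \bmod k$ equal to a fixed `phase`, and that a cell cannot leave in the slot in which it arrives, so the wait lies between $1$ and $k$ slots as in the proof; the function name and the phase parameter are not part of the original scheme.

```python
def vomq_wait(arrival_slot: int, k: int, phase: int) -> int:
    """Slots a head-of-line cell waits for the next slot t > arrival_slot with t % k == phase."""
    if k < 1:
        raise ValueError("k must be positive")
    # Python's % always returns a non-negative result, so this lies in 1..k.
    return (phase - arrival_slot - 1) % k + 1
```

Since the result depends on `arrival_slot` only through its value modulo $k$ , two cells that arrive $k$ slots apart wait the same number of slots, which is the property the proof relies on.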

For $q_{3_{y,\tau}}$ , because there is a single flow traversing the switch and the configuration scheme followed by COM, one cell arrives in the CB each time slot and one cell departs OP at the same time slot. Therefore, no cell is backlogged in this case and

[TABLE]

From (35):

[TABLE]

for empty queues as initial condition.

It is then easy to see that for any queued cells, $q_{1_{y,\tau}}$ would be increased by $\delta k$ time slots, and $q_{2_{y,\tau}}$ as well as $q_{3_{y,\tau}}$ remain unchanged.

Therefore, all cells of the flow experience the same delay and are forwarded in sequence.

$\blacksquare$

**Lemma 2.** For any number of flows traversing the LBC switch, cells from the same flow arrive at the OM in sequence.

Proof: Here, we consider the following traffic scenario: There are $k$ flows coming from different IPs, each from a different IM. In each of the flows, cells arrive back to back and are destined to the same OP. Furthermore, the flows have one time slot difference in their arrival times such that the cells with the same sequence number of each different flow are stored in the same VOMQs. Here, each flow consists of $k$ cells. Table IV shows an example of the arrival pattern of this traffic scenario for three flows. The table shows the arrival of $k$ cells from $k$ flows at different IPs and IMs that arrive at one time slot apart to enable these flows to be forwarded to the same VOMQ, otherwise the flows would be forwarded to different VOMQs.

Table V shows that cells $c_{1,1}$ , $c_{1,2}$ , $c_{1,3}$ , $c_{2,1}$ , and $c_{3,1}$ were successfully forwarded to the VOMQ without any blocking, while the in-sequence mechanism holds back cells $c_{2,2}$ , $c_{2,3}$ , $c_{3,2}$ , and $c_{3,3}$ to prevent out-of-sequence forwarding, because cells $c_{2,1}$ and $c_{3,1}$ were forwarded to a non-empty VOMQ.

The configuration pattern used in the IMs and CIMs, and the in-sequence mechanism determine the order in which cells arrive to the VOMQs. Table V shows this order in our example.

In such arrival pattern, the departures from VOMQs follow the deterministic configuration of the COMs. Table VI shows the corresponding departures of the cells from VOMQs of these three flows.

Table VI shows that all the cells were forwarded out of the VOMQ in the same pattern they arrived, one cell every $k$ time slots, because the COM connects to the OM once every $k$ time slots.

Also, let us assume that the first cell of a flow at the $L_{CIM}$ arrives one or more time slots before the configuration of the COM allows forwarding the cell to its destination OM. Thus, a cell may depart in the time slot following its arrival or a few time slots later. A cell may then wait up to $k-1$ time slots for the designated interconnection to take place before being forwarded to the OM.

Given $k$ flows, with their $\tau$ th cells being $c_{1,\tau}$ to $c_{k,\tau}$ , the arrival time of the first arriving cell $c_{1,\tau}$ is:

[TABLE]

The number of cells at the VOQ, $N_{1}(c_{y,\tau})$ , upon the arrival of $c_{1,\tau}$ is:

[TABLE]

This condition holds because there is no cell at the VOQ when $c_{1,\tau}$ arrives. Because of (38), the queuing delay at the VOQ of $c_{1,\tau}$ is:

[TABLE]

The departure time of a cell $c_{y,\tau}$ from the VOQ is:

[TABLE]

Using (37) to (40), the departure time of $c_{1,\tau}$ from the VOQ is:

[TABLE]

Upon arriving at the VOMQ, $c_{1,\tau}$ finds no cell ahead of it. Thus, the number of cells at the VOMQ, $N_{2}(c_{1,\tau})$ , upon the arrival of $c_{1,\tau}$ is:

[TABLE]

Based on the considered traffic pattern, $c_{1,\tau}$ is stored in the VOMQ for additional $k-1$ time slots. Therefore,

[TABLE]

The departure time of a cell $c_{y,\tau}$ from the VOMQ is:

[TABLE]

Using (41), (43), and (44), the departure time of $c_{1,\tau}$ from the VOMQ is:

[TABLE]

Let us consider now another cell from the same flow, $c_{1,\tau+\theta}$ , where $0<\theta<k$ , with

[TABLE]

Upon the arrival of $c_{1,\tau+\theta}$ , there is no cell at the VOQ, or:

[TABLE]

Because of (42) and (47), the queuing delay at the $VOQ$ for $c_{1,\tau+\theta}$ is:

[TABLE]

Using (40), (46), and (48), the departure time of $c_{1,\tau+\theta}$ from the VOQ is:

[TABLE]

Upon arriving at the VOMQ, $c_{1,\tau+\theta}$ finds no cell ahead of it, or:

[TABLE]

Because of the considered traffic, $c_{1,\tau+\theta}$ is queued extra $k-1$ time slots at the VOMQ, hence:

[TABLE]

Using (44), and (49) to (51),

[TABLE]

Using (45), therefore,

[TABLE]

In general, for $c_{z,\tau}$ , where $1<z\leq k$ , the arrival time is

[TABLE]

and upon the arrival of $c_{z,\tau}$ in the VOQ, there is no cell:

[TABLE]

With (55),

[TABLE]

Using (40), (54) , and (56),

[TABLE]

However, upon arriving in the VOMQ, $c_{z,\tau}$ finds $\delta$ cells ahead of it, or:

[TABLE]

where $0<\delta<k$

[TABLE]

$q_{H_{z,\tau}}$ is the delay imposed on $c_{z,\tau}$ by the HoL cell in the VOMQ, and $(\delta-1)k$ is the delay generated by the other $(\delta-1)$ cells ahead of $c_{z,\tau}$ in the VOMQ. The extra $k$ time slots are the delay $c_{z,\tau}$ experiences as it waits for the configuration pattern to repeat after the last cell ahead of it is forwarded to the OM, where

[TABLE]

Using (44), (60), and (61), the departure time of $c_{z,\tau}$ from the VOMQ is:

[TABLE]

Using (45) and (59), then:

[TABLE]

Let us now consider any other cell from flow $z$ , $c_{z,\tau+\theta}$ , where $0<\theta<k$ . The time of arrival of the cell $c_{z,\tau+\theta}$ is:

[TABLE]

Upon the arrival of $c_{z,\tau+\theta}$ , there could be zero or more cells at the VOQ, hence:

[TABLE]

where $\gamma$ is the number of cells at the VOQ upon the arrival of $c_{z,\tau+\theta}$ and $0\leq\gamma<k$ . Using (58) and (65), then:

[TABLE]

where

[TABLE]

is the delay generated from the $\gamma$ cells ahead of $c_{z,\tau+\theta}$ at the VOQ. Let

[TABLE]

Using (40), (64), (66), and (67), then:

[TABLE]

The queuing delay of $c_{z,\tau+\theta}$ at the VOMQ is equal to (60). Therefore, using (44), (60), and (68), the departure time of $c_{z,\tau+\theta}$ from the VOMQ is:

[TABLE]

Using (53) and (59), then:

[TABLE]

Using (45), then:

[TABLE]

From (53),

[TABLE]

Using (63), gives:

[TABLE]

The difference between the departure times of any two cells of a flow from VOMQ is a function of $\theta$ , which is the arrival time difference of the two cells. Therefore, cells of a flow are forwarded to the OM in the same order they arrived.

$\blacksquare$

**Lemma 3.** For any number of flows traversing the LBC switch, the cells of each flow arrive and are cleared at the output port (OP) in the same order the cells arrived at the input port (IP).

In our discussion of this lemma, let us consider the following traffic scenario: The switch has cells from only two flows, each arriving in a different IM (and therefore IP) and both of them are destined to the same OP. In each flow, cells arrive back-to-back, one at each time slot, and the first cell of both flows arrive at a time slot such that the configuration pattern of IM-CIM stage would not enable forwarding them to the COM immediately. With this condition, we analyze how these two flows are kept from affecting each other, and therefore, the sequence in which cells may depart the OP. This traffic scenario may present the greatest opportunity of experiencing out-of-sequence forwarding by any two cells of a flow as cells from these two flows interact at the CBs of the destination OP. Let us also consider empty queues as an initial condition.

Given flows $y$ and $z$ , where the first cells of $y$ and $z$ , $c_{y,\tau}$ and $c_{z,\tau}$ , respectively, arrive at their respective VOQs at time slot $t_{x}$ and the $\theta$ th cells, $c_{y,\tau+\theta}$ and $c_{z,\tau+\theta}$ $\forall$ $\theta$ $\geq$ 1, arrive at time slot $t_{x}+\theta$ . Therefore, according to this lemma $c_{y,\tau}$ and $c_{z,\tau}$ must be forwarded and cleared from the output port $OP(j,d)$ before $c_{y,\tau+\theta}$ and $c_{z,\tau+\theta}$ , respectively.

Proof:

We analyze the departure times of the cells $c_{y,\tau}$ and $c_{z,\tau}$ from the CBs. The arrival times of cells $c_{y,\tau}$ and $c_{z,\tau}$ are:

[TABLE]

Upon arriving in the VOQ, $c_{y,\tau}$ and $c_{z,\tau}$ are placed as HoL cells because there are no backlogged cells. Hence:

[TABLE]

and

[TABLE]

Using (75) and (76), the queuing delays of $c_{y,\tau}$ and $c_{z,\tau}$ at the VOQ are:

[TABLE]

and

[TABLE]

Using (40), (74), and (77) the departure time for $c_{y,\tau}$ from the VOQ is:

[TABLE]

Using (40), (74), and (78) the departure time for $c_{z,\tau}$ from the VOQ is:

[TABLE]

Thus, $c_{y,\tau}$ and $c_{z,\tau}$ are forwarded to the same CIM (so that these two cells would share the same CB) and stored in their respective VOMQs. Because the VOMQs are empty at the time the two cells arrive:

[TABLE]

and

[TABLE]

Based on the adopted traffic scenario, $c_{y,\tau}$ and $c_{z,\tau}$ are held at the VOMQ for $\beta_{1}$ and $\beta_{2}$ time slots, respectively, before the configuration pattern enables forwarding them to their destination OM. Here, $1\leq\beta_{1}<k$ and $1\leq\beta_{2}<k$ . Hence, the queuing delay of $c_{y,\tau}$ at the VOMQ is:

[TABLE]

The queuing delay of $c_{z,\tau}$ at the VOMQ is:

[TABLE]

Assuming $\beta_{1}<\beta_{2}$ , $c_{y,\tau}$ would be forwarded to the destination OM before $c_{z,\tau}$ . From (44), (79), and (83), the departure time of $c_{y,\tau}$ from the VOMQs is:

[TABLE]

From (44), (80), and (84), the departure time of $c_{z,\tau}$ from the VOMQs is:

[TABLE]

When $c_{y,\tau}$ and $c_{z,\tau}$ arrive at the OM, they are stored at CBs before being forwarded to the output port.

Let us now consider $c_{y,\tau+1}$ and $c_{z,\tau+1}$ , which arrive at time slot $t_{x}$ + 1, hence:

[TABLE]

Because there are no cells at the VOQ upon the arrival of $c_{y,\tau+1}$ and $c_{z,\tau+1}$ , then:

[TABLE]

and

[TABLE]

With (81) and (88), the queuing delay of $c_{y,\tau+1}$ at the VOQ is:

[TABLE]

With (82) and (89), the queuing delay of $c_{z,\tau+1}$ at the VOQ is:

[TABLE]

Using (40), (87), and (90), the departure time of $c_{y,\tau+1}$ from the VOQ is:

[TABLE]

Using (40), (87), and (91), the departure time of $c_{z,\tau+1}$ from the VOQ is:

[TABLE]

$c_{y,\tau+1}$ and $c_{z,\tau+1}$ are forwarded to the same CIM and stored in their respective VOMQs. Based on the traffic scenario, $c_{y,\tau+1}$ and $c_{z,\tau+1}$ are also held for $\beta_{1}$ and $\beta_{2}$ time slots, respectively, at the VOMQs before the configuration pattern of the COM enables forwarding them to the destination OM. Hence, the queuing delays of $c_{y,\tau+1}$ and $c_{z,\tau+1}$ at the VOMQ are equal to (83) and (84), respectively. From (44), (83), and (92), the departure time of $c_{y,\tau+1}$ from the VOMQ is:

[TABLE]

From (44), (84), and (93), the departure time of $c_{z,\tau+1}$ from the VOMQ is:

[TABLE]

Next, we analyze the departure time of the cells from the output port. Because $d_{2{y,\tau+1}}>d_{2{y,\tau}}$ and $d_{2{z,\tau+1}}>d_{2{z,\tau}}$, $c_{y,\tau}$ and $c_{z,\tau}$ arrive at the output module before $c_{y,\tau+1}$ and $c_{z,\tau+1}$, respectively. Because the CB is empty under the initial condition:

[TABLE]

Because $d_{2{z,\tau}}>d_{2{y,\tau}}$:

[TABLE]

With (96) and (97), the queuing delays of $c_{y,\tau}$ and $c_{z,\tau}$ at the CB are:

[TABLE]

and

[TABLE]

The queuing delays of $c_{y,\tau+1}$ and $c_{z,\tau+1}$ at the CB are equal to (98) and (99), respectively. The departure time of a cell $c_{c,\tau}$ from the CB is:

[TABLE]

Therefore, using (85), (98), and (100), the departure time of $c_{y,\tau}$ from the output port is:

[TABLE]

Using (94), (98), and (100), the departure time of $c_{y,\tau+1}$ from the output port is:

[TABLE]

Using (86), (99), and (100), the departure time of $c_{z,\tau}$ from the output port is:

[TABLE]

Using (95), (99), and (100), the departure time of $c_{z,\tau+1}$ from the output port is:

[TABLE]

Therefore, with $d_{3{y,\tau+1}}>d_{3{y,\tau}}$ and $d_{3{z,\tau+1}}>d_{3{z,\tau}}$, $c_{y,\tau}$ and $c_{z,\tau}$ depart the output port before $c_{y,\tau+1}$ and $c_{z,\tau+1}$, respectively. Note that for $N_{1}(c_{y,\tau})>0$, $\delta>0$, so cells of the same flow are forwarded with a larger time separation from each other and are less likely to be at the CBs in the same time slot. Therefore, the property described by this lemma applies to any two cells of a flow.

$\blacksquare$
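The ordering argument of this lemma can be illustrated with a small timeline model. The following is a minimal sketch under stated assumptions, not the switch's actual scheduler: the function names, the fixed VOQ delay `voq_delay`, the value `k = 4`, and the first-in first-out CB that releases one cell per time slot are all hypothetical choices made for illustration. It only shows that, when both cells of a flow experience the same VOMQ holding time ($\beta_{1}$ or $\beta_{2}$), the cell that arrived first also leaves the output port first.

```python
def output_departures(t_x, beta1, beta2, voq_delay=1):
    """Toy timeline: output-port departure slot of each cell of flows y and z."""
    # (flow, index, arrival slot); index 0 stands for tau and 1 for tau + 1.
    cells = [("y", 0, t_x), ("z", 0, t_x), ("y", 1, t_x + 1), ("z", 1, t_x + 1)]
    vomq_hold = {"y": beta1, "z": beta2}  # holding times beta_1 and beta_2
    # Slot at which each cell reaches the CB, after the VOQ and VOMQ delays.
    cb_arrival = {(f, i): t + voq_delay + vomq_hold[f] for f, i, t in cells}
    # Assumed CB discipline: first in, first out, one departure per time slot.
    departures, next_free = {}, 0
    for cell in sorted(cb_arrival, key=lambda c: (cb_arrival[c], c)):
        slot = max(cb_arrival[cell] + 1, next_free)
        departures[cell] = slot
        next_free = slot + 1
    return departures


def in_order(departures):
    """True if, for each flow, cell tau leaves before cell tau + 1."""
    return all(departures[(f, 0)] < departures[(f, 1)] for f in ("y", "z"))


k = 4  # hypothetical module count; holding times satisfy 1 <= beta < k
assert all(
    in_order(output_departures(10, b1, b2))
    for b1 in range(1, k)
    for b2 in range(1, k)
)
```

The check sweeps every pair of holding times in the allowed range, including the case $\beta_{1}<\beta_{2}$ used in the proof, and confirms that per-flow order is preserved in this simplified model.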

This completes the proof of Theorem 1.

$\blacksquare$

Appendix B 100% Throughput

In this section, we prove that LBC achieves 100% throughput by using the analysis presented in Section III.A and the concept of queue stability. A switch is defined as stable for a traffic pattern if the queue length is bounded, and a switch achieves 100% throughput if it is stable for admissible i.i.d. traffic [29]. With this, we state the following theorem:

Theorem 2.

LBC achieves 100% throughput under admissible i.i.d. traffic.

Proof: Here, we consider a queue to be weakly stable if the drift of the queue occupancy from the initial state is a finite integer $\epsilon$ for all $t$ as $t\to\infty$. Using this definition, we show that the queue lengths of VOQs, VOMQs, and CBs are weakly stable under i.i.d. traffic, and hence that the switch achieves 100% throughput under that traffic pattern.

Let us represent the queue occupancy of VOQs at time slot $t$, $\mathbf{N_{1}}(t)$, as:

[TABLE]

where $\mathbf{A_{1}}(t)$ is the packet arrival matrix to VOQs at time slot $t$ and $\mathbf{D_{1}}(t)$ is the service rate matrix of VOQs at time slot $t$. Solving (101) recursively with the initial condition $\mathbf{N_{1}}(0)$ yields:

[TABLE]

Let us consider $s_{1_{u,v}}(t)$ as the service rate received by the VOQ at $IP(u)$ for $OP(v)$ at time slot $t$, that is:

[TABLE]

Another way to express $\mathbf{D_{1}}(t)$ is:

[TABLE]

and recalling that $\mathbf{R_{1}}$ is the aggregate traffic arrival to VOQs:

[TABLE]

Let us assume the worst-case scenario in (103). Substituting (103) into (104), and (104) and (105) into (102), yields:

[TABLE]

From (106), we obtain:

[TABLE]

From the admissibility condition of $\mathbf{R_{1}}$, it is easy to see that (107) is finite for any value of $t$. Hence, from the admissibility of $\mathbf{R_{1}}$, (106), and (107), we conclude that the occupancy of the VOQs is weakly stable.

$\blacksquare$
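The recursion and its unrolled form can be checked numerically for a single queue. The sketch below is illustrative only and rests on assumptions that are not taken from the paper: it treats one scalar queue rather than the matrices $\mathbf{N_{1}}(t)$, $\mathbf{A_{1}}(t)$, and $\mathbf{D_{1}}(t)$, it assumes a cell arriving in slot $t$ can leave no earlier than slot $t+1$, and the function name `voq_trace` and the 8-slot arrival pattern are hypothetical.

```python
def voq_trace(arrivals, service_slots, n0=0):
    """Return (occupancy after each slot, cells served per slot) for one queue.

    arrivals[t] is the number of cells arriving in slot t; service_slots[t]
    is 1 when the configuration offers a service opportunity in slot t.
    """
    occupancy, served = [n0], []
    for a, s in zip(arrivals, service_slots):
        n = occupancy[-1]
        d = min(s, n)  # only cells already queued can be served (assumption)
        served.append(d)
        occupancy.append(n + a - d)
    return occupancy, served


# Hypothetical pattern: one arrival every other slot, service offered every slot.
arrivals = [1, 0, 1, 0, 1, 0, 1, 0]
service_slots = [1] * 8
occ, served = voq_trace(arrivals, service_slots)

# Unrolled recursion: N(T) = N(0) + total arrivals - total departures.
assert occ[-1] == occ[0] + sum(arrivals) - sum(served)
print(max(occ))
```

Because the offered load stays below the offered service, the occupancy in this toy trace never exceeds one cell, which is the bounded-drift behavior that the weak-stability argument relies on.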

Now we prove the stability of VOMQs. As before, the queue occupancy matrix of VOMQs at time slot $t$ can be represented as:

[TABLE]

where $\mathbf{A_{2}}(t)$ is the arrival matrix to VOMQs at time slot $t$ and $\mathbf{D_{2}}(t)$ is the service rate matrix of VOMQs at time slot $t$. Solving (108) recursively with the initial condition $\mathbf{N_{2}}(0)$ yields:

[TABLE]

Because a VOMQ is serviced at least once every $k$ time slots, the service rate of the VOMQ at $I_{C}(r,p)$ for $OP(v)$ at time slot $t$, $d_{2_{\mu,v}}(t)$, is:

[TABLE]

Then, the service matrix of VOMQs is:

[TABLE]

and, representing the aggregate traffic arrival to VOMQs as $\mathbf{R_{2}}$:

[TABLE]

Substituting (110) and (111) into (109) gives:

[TABLE]

Recalling that $\mathbf{R_{2}}$ is admissible, per the discussion in Section III.A, and by substituting $\mathbf{P_{1}}$ and $\mathbf{R_{2}}$ into (113), it is easy to see that $\epsilon$ is finite. Hence, from (112) and (113), we conclude that the occupancy of VOMQ is weakly stable.

$\blacksquare$

Now we prove the stability of CBs. The queue occupancy matrix of CBs at time slot $t$ can be represented as:

[TABLE]

where $\mathbf{A_{3}}(t)$ is the packet arrival matrix to CBs at time slot $t$, and $\mathbf{D_{3}}(t)$ is the service rate matrix of CBs at time slot $t$. Solving (114) recursively as before yields:

[TABLE]

Because a CB is serviced at least once every $k$ time slots, the service rate of the CB at $OP(v)$ at time slot $t$, $d_{3_{v}}(t)$, is:

[TABLE]

and the service matrix of CBs is:

[TABLE]

Similarly, the aggregate traffic arrival to the CBs is:

[TABLE]

Let us assume $d_{3_{v}}(t)=\frac{1}{k}~\forall~v$ in (116), which is the worst-case scenario, in which a CB is served once every $k$ time slots. Substituting (116) and (117) into (115) gives:

[TABLE]

where

[TABLE]

With $\mathbf{R_{4}}$ being admissible, as discussed in Section III.A, and by substituting $\mathbf{R_{4}}$ into (119), it is easy to see that $\epsilon$ is finite. Hence, from (118) and (119), we conclude that the occupancy of the CBs is also weakly stable.

$\blacksquare$
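The worst-case service rate of $\frac{1}{k}$ can be made concrete with a deterministic toy queue. This is a hedged sketch rather than the paper's model: the function `max_backlog`, the periodic arrival pattern, the choice of service slot within each period, and the value $k=4$ are assumptions introduced only to contrast a load that matches the service rate with one that exceeds it.

```python
def max_backlog(k, arrival_period, horizon):
    """Peak occupancy of a queue that is served once every k time slots.

    One cell arrives every `arrival_period` slots; service is offered only
    in slots where t % k == k - 1, that is, at the worst-case rate 1/k.
    """
    backlog = peak = 0
    for t in range(horizon):
        if t % arrival_period == 0:
            backlog += 1
        peak = max(peak, backlog)
        if t % k == k - 1 and backlog > 0:
            backlog -= 1
    return peak


k = 4  # hypothetical value chosen for illustration

# Load equal to the service rate 1/k: the peak does not grow with the horizon.
assert max_backlog(k, k, 40) == max_backlog(k, k, 4000)

# Load 2/k exceeds the service rate: the backlog grows with the horizon.
assert max_backlog(k, 2, 400) > max_backlog(k, 2, 40)
```

In the first case the occupancy stays bounded no matter how long the queue runs, which corresponds to a finite $\epsilon$; in the second case the load is above what the once-in-$k$ service can carry, and the backlog grows without bound.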

This completes the proof of Theorem 2.

$\blacksquare$

Bibliography


[1] C. Clos, “A study of non-blocking switching networks,” Bell System Technical Journal, vol. 32, no. 2, pp. 406–424, 1953.
[2] N. A. Al-Saber, S. Oberoi, T. Pedasanaganti, R. Rojas-Cessa, and S. G. Ziavras, “Concatenating packets in variable-length input-queued packet switches with cell-based and packet-based scheduling,” in Sarnoff Symposium, 2008 IEEE. IEEE, 2008, pp. 1–5.
[3] T. T. Lee and C. H. Lam, “Path switching-a quasi-static routing scheme for large-scale ATM packet switches,” IEEE Journal on Selected Areas in Communications, vol. 15, no. 5, pp. 914–924, 1997.
[4] H. J. Chao, Z. Jing, and S. Y. Liew, “Matching algorithms for three-stage bufferless Clos network switches,” Communications Magazine, IEEE, vol. 41, no. 10, pp. 46–54, 2003.
[5] F. M. Chiussi, J. G. Kneuer, and V. P. Kumar, “Low-cost scalable switching solutions for broadband networking: the ATLANTA architecture and chipset,” IEEE Communications Magazine, vol. 35, no. 12, pp. 44–53, 1997.
[6] J. Kleban and U. Suszynska, “Static dispatching with internal backpressure scheme for SMM Clos-network switches,” in Computers and Communications (ISCC), 2013 IEEE Symposium on. IEEE, 2013, pp. 654–658.
[7] J. Kleban, M. Sobieraj, and S. Weclewski, “The modified MSM Clos switching fabric with efficient packet dispatching scheme,” in High Performance Switching and Routing, 2007. HPSR’07. Workshop on. IEEE, 2007, pp. 1–6.
[8] R. Rojas-Cessa, E. Oki, and H. J. Chao, “Maximum weight matching dispatching scheme in buffered Clos-network packet switches,” in Communications, 2004 IEEE International Conference on, vol. 2. IEEE, 2004, pp. 1075–1079.
