by Talari Blogger-in-Chief Andy Gottlieb...
As I note whenever I’m asked
what it is that Talari is bringing to Enterprise networks, Adaptive Private
Networking is doing for WANs what RAID did for storage.
In a previous
post, we looked at the analogy to RAID at the business benefits level. Here, we examine the parallels on the
technical level.
Wrapping hardware and intelligent software around core enabling technology
Seagate hard disk technology was originally designed for and targeted at the nascent but fast-growing market for personal computers. Each of these disks was far less reliable than the existing mainframe/minicomputer disks, with nowhere near their MTBF or seek times. But RAID - Redundant Array of Inexpensive Disks - took advantage of the massively better price/bit of the Seagate hard disk to revolutionize the storage market. By combining a layer of hardware and intelligent software with several of these inexpensive disks, RAID delivered a system with higher capacity, lower cost, competitive – and ultimately superior – access times, and greater reliability than the older generation storage solutions.
For APN, the analogous enabling technology is the public Internet. By wrapping a two-ended system of appliance-based hardware running intelligent software around multiple WAN connections – most or all of which are Internet connections – APN creates an enterprise WAN which is lower cost, with massively better cost/bit and higher capacity, and – critically – more reliable than the best single-vendor MPLS WAN.
Redundancy to deliver application continuity
The basic idea behind RAID – and behind APN as well – is that while two devices, each 99% reliable, operating in series will deliver a system with only 0.99 × 0.99 ≈ 98% reliability, a properly designed system with the same two devices operating in parallel will deliver 1 – (1 – 0.99) × (1 – 0.99) = 99.99% reliability, assuming the two devices fail independently.
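To make the arithmetic concrete, here is a minimal Python sketch of the series and parallel calculations. The function names are mine, chosen for illustration, and the model assumes the devices fail independently of one another, which real disks and real WAN links only approximate.

```python
from math import prod

def series_availability(parts):
    """Every device must work: multiply the individual availabilities."""
    return prod(parts)

def parallel_availability(parts):
    """Any one device suffices: 1 minus the chance that all fail at once.

    Assumes independent failures (an idealization).
    """
    return 1 - prod(1 - a for a in parts)

two_devices = [0.99, 0.99]
print(f"series:   {series_availability(two_devices):.2%}")
print(f"parallel: {parallel_availability(two_devices):.2%}")
```

The same two 99% devices give roughly 98% in series and 99.99% in parallel; the only thing that changed is the design around them.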
The key
phrase, of course, is proper design.
The first premise of RAID is that the loss of any single disk not only causes no data loss, but also leaves the application – data reads and writes – functioning normally, without meaningful performance degradation. This is essentially what RAID Level 1 delivered.
Existing traditional routed networks with appropriate link and device redundancy at each location, of course, provide network availability in the face of any single hard link failure or router failure. But this "no loss of connectivity" is merely the equivalent of "no data loss" in the storage world. Even a no-single-point-of-failure routed network does not provide for application
continuity in all cases, as a routed network can take upwards of 30 seconds at
times for router convergence in the face of a given failure. More importantly, routing does not handle the
case where packet loss or excessive latency causes significant problems with
application performance. Yet these “soft
failures”, due to congestion on shared IP networks, especially Wide Area Networks, occur with far greater
frequency than hard link or device failures.
In fact, it is precisely because of the congestion-based packet loss and jitter that occur at Internet peering points between ISPs that the public Internet has earned its "works pretty well most of the time" reputation. Of course, "pretty well most of the time" just ain't good enough for your Enterprise WAN.
Handling "soft failures" - and doing it quickly!
APN, by leveraging multiple paths across
the network between locations, ensures that no single hard or soft failure of a network link, device or peering
point in the middle of the Internet will cause a loss of connectivity or
application predictability.
APN does its equivalent of RAID Level 1 by
continuous measurement of network path performance (loss, jitter, latency,
bandwidth) and sub-second response to problems with any network path. In
typically less than 3 round-trip times (RTTs), APN will move traffic off of a
path experiencing high loss or excessive jitter.
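As a rough illustration of measurement-driven path selection, the sketch below picks a path from the latest measurements and steers away from one showing high loss or jitter. This is not Talari's actual algorithm: the `PathStats` type, the threshold values and the fallback rule are all hypothetical, invented here purely to show the idea.

```python
from dataclasses import dataclass

@dataclass
class PathStats:
    name: str
    loss: float        # fraction of packets lost, 0.0 to 1.0
    jitter_ms: float
    latency_ms: float

# Hypothetical thresholds for this sketch, not product defaults.
MAX_LOSS = 0.02
MAX_JITTER_MS = 30.0

def is_healthy(path):
    """A path is usable while both loss and jitter stay under the thresholds."""
    return path.loss <= MAX_LOSS and path.jitter_ms <= MAX_JITTER_MS

def choose_path(paths):
    """Prefer the lowest-latency healthy path; if none is healthy,
    fall back to the path with the least loss."""
    healthy = [p for p in paths if is_healthy(p)]
    if healthy:
        return min(healthy, key=lambda p: p.latency_ms)
    return min(paths, key=lambda p: p.loss)

paths = [
    PathStats("cable", loss=0.08, jitter_ms=12.0, latency_ms=25.0),  # soft failure
    PathStats("dsl",   loss=0.00, jitter_ms=5.0,  latency_ms=40.0),
]
print(choose_path(paths).name)
```

Here the cable path has the better latency but is dropping packets, so traffic goes to DSL. Re-running the choice every time fresh measurements arrive is what would make the response sub-second rather than waiting on router convergence.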
For TCP applications, APN also delivers this
predictability by buffering packets from flows and retransmitting them in the
face of loss.
For real-time application flows, APN can replicate the traffic on multiple paths and suppress the duplicate packets at the receiving appliance, a mechanism almost exactly equivalent to what RAID 1 does. Application
predictability, by avoiding loss and minimizing jitter, is much more important
than efficient capacity utilization for real-time apps like VoIP. By using Internet connections, the cost of
that extra bandwidth is minimal. For
VoIP, it’s actually trivial. Even for
videoconferencing, it’s fairly small where the capacity exists. [Note that APN also allows “conditional”
replication of such flows, based on availability of bandwidth at the time.]
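Duplicate suppression at the receiving end can be sketched in a few lines. The version below assumes each replicated packet carries a sequence number (the article does not say how APN identifies duplicates, so that is my assumption), and the class name and window size are hypothetical.

```python
class DuplicateSuppressor:
    """Deliver the first copy of each sequence number and drop later copies.

    Assumes the sender stamps every packet with a sequence number
    before replicating it across paths.
    """

    def __init__(self, window=1024):
        self.window = window      # how far back sequence numbers are remembered
        self.seen = set()
        self.highest = -1

    def accept(self, seq):
        """Return True for the first copy of seq, False for a duplicate
        or for a packet older than the remembered window."""
        if seq in self.seen or seq <= self.highest - self.window:
            return False
        self.seen.add(seq)
        if seq > self.highest:
            self.highest = seq
            floor = self.highest - self.window
            # Forget sequence numbers that have fallen out of the window.
            self.seen = {s for s in self.seen if s > floor}
        return True

rx = DuplicateSuppressor()
# Copies arrive interleaved from two paths; packet 2 was lost on one of them.
arrivals = [0, 0, 1, 2, 1, 3, 3]
delivered = [seq for seq in arrivals if rx.accept(seq)]
print(delivered)  # [0, 1, 2, 3]
```

The application sees each packet exactly once, and a loss on one path is invisible as long as the copy on the other path arrives.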
Striping = greater throughput
RAID Levels 2 – 5
(and up) allow improved data access performance by doing bit, byte and/or block
level striping across multiple disks, allowing simultaneous disk seeks. With APN, an individual
TCP flow can be striped across multiple paths/links to deliver greater
aggregate throughput. The buffering and
retransmission of packets, in conjunction with buffering and packet reordering at the receiving APN
appliance and packet forwarding logic armed with the relative latencies
of each of the network paths involved, enables high throughput even in the face of packet loss.
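The striping-plus-reordering idea can be sketched as follows. To keep it short, the sender here uses plain round-robin rather than the latency-aware forwarding described above, and retransmission is left out entirely; the function and class names are hypothetical.

```python
import heapq

def stripe(packets, paths):
    """Tag each packet with a sequence number and assign it to a path.

    Round-robin is a simplification for this sketch.
    """
    return [(seq, paths[seq % len(paths)], payload)
            for seq, payload in enumerate(packets)]

class ReorderBuffer:
    """Hold out-of-order arrivals and release them in sequence order."""

    def __init__(self):
        self.next_seq = 0
        self.heap = []

    def push(self, seq, payload):
        """Buffer one arrival; return the payloads now deliverable in order."""
        heapq.heappush(self.heap, (seq, payload))
        out = []
        while self.heap and self.heap[0][0] <= self.next_seq:
            s, p = heapq.heappop(self.heap)
            if s == self.next_seq:   # anything older is a stale copy, so drop it
                out.append(p)
                self.next_seq += 1
        return out

tagged = stripe(["a", "b", "c", "d"], ["dsl", "cable"])
rb = ReorderBuffer()
received = []
# Suppose the cable path is faster, so packets 1 and 3 arrive first.
for seq in (1, 3, 0, 2):
    received.extend(rb.push(seq, tagged[seq][2]))
print(received)  # ['a', 'b', 'c', 'd']
```

Knowing the relative latency of each path lets the sender schedule packets so they arrive nearly in order, which keeps the receiver's buffer, and the delay it adds, small.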
Automating the solution to MTBF and MTTR issues
Network availability and predictability are quite naturally the number one concern of any enterprise WAN manager. Just as RAID changed how storage systems were designed, an APN-based WAN turns the historic importance of, and emphasis on, certain metrics on its head.
Before RAID, the MTBF of a disk
subsystem was a critical factor in designing a highly reliable and available IT
system. With RAID, this concern pretty
much vanished. Before RAID, MTTR
of the storage system was a big deal as well. RAID solutions allowed much
faster system MTTR (just swap out the hard disk and insert a relatively cheap
replacement disk). The defective disk itself is almost never actually
“repaired” any longer. And with proper
redundant system-wide design and automatic synchronization and backup processes in place, if in fact the MTTR of even making that disk swap ends up being
24 or 48 hours, no one is particularly bothered.
Similarly, prior to APN, the combination of network availability and predictability of packet delivery of the private WAN, usually somewhere in the 99.95% - 99.99% range, has quite correctly been of huge importance to the enterprise network manager. And the private WAN provider's SLA for, say, a 4 hour MTTR when a WAN problem does occur has also been extremely important.
With APN, however, the need for each WAN
“subsystem” to be 99.9%+ predictable and reliable is greatly reduced. Given multiple diverse WAN connections combined with sub-second switchover in the case of packet delivery problems with any of
them, the need for a 4 hour MTTR guarantee goes away as well. Who cares if the MTTR is 24 or even 48 hours
for a broadband connection at a branch office, say, if the network
continues to work, and users notice no loss in connectivity and minimal loss in
application performance and predictability?
And if application predictability matters for your application or location even during those rare times when a given link is down completely, then having 3 diverse connections at that location, rather than 2, will give you superior application performance as well as network and application availability versus a private WAN actually delivering 99.99% uptime with an SLA committing to a 1 hour MTTR!
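A back-of-the-envelope calculation shows why a slow MTTR stops mattering once there are enough diverse links. The sketch uses the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The MTBF figure is hypothetical, picked only for illustration, and the links are assumed to fail independently.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of a single link."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def combined(link_availability, n_links):
    """Availability of n independent, identical links where any one suffices."""
    return 1 - (1 - link_availability) ** n_links

# Hypothetical broadband link: fails about every 2000 hours, takes 48 hours to fix.
link = availability(mtbf_hours=2000, mttr_hours=48)

for n in (1, 2, 3):
    print(f"{n} link(s): {combined(link, n):.4%}")
```

With these made-up inputs, a single link is under 98% available and two links fall a little short of 99.99%, while three links exceed it, even though each individual repair takes two full days.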
RAID revolutionized
storage economics, improving storage reliability and capacity while radically
reducing costs. In a very similar way,
APN is revolutionizing enterprise WAN economics, improving WAN reliability and
enabling substantial capacity increases while simultaneously enabling network
managers for the first time in more than a decade to not just have control over
their WAN costs and their WAN providers, but to radically reduce that huge budget
item as well.