by Talari Blogger-in-Chief Andy Gottlieb...
As I note whenever I’m asked what it is that Talari is bringing to Enterprise networks, Adaptive Private Networking (APN) is doing for WANs what RAID did for storage. In a previous post, we looked at the analogy to RAID at the business benefits level. Here, we examine the parallels on the technical level.
Wrapping hardware and intelligent software around core enabling technology
Seagate hard disk technology was originally designed for and targeted at the nascent but fast-growing market for personal computers. Each of these hard disks was nowhere near as reliable as the existing mainframe/minicomputer disks, with a much lower MTBF and slower seek times. But RAID - Redundant Array of Inexpensive Disks - took advantage of the massively better price/bit of the Seagate hard disk to revolutionize the storage market. By combining a layer of hardware and intelligent software with several of these inexpensive disks, RAID delivered a system with higher capacity, lower cost, competitive – and ultimately superior – access times, and greater reliability than the older generation storage solutions.
For APN, the analogous enabling technology is the public Internet. By wrapping a two-ended system of appliance-based hardware running intelligent software around multiple WAN connections – most or all of which are Internet connections – APN creates an enterprise WAN which is lower cost, massively better cost/bit, higher capacity, and – critically – more reliable than the best single vendor MPLS WAN.
Redundancy to deliver application continuity
The basic idea behind RAID – and behind APN as well – is that while two devices operating in series, each with 99% reliability, will deliver a system with only 0.99 × 0.99 ≈ 98% reliability, a properly designed system with the same two devices operating in parallel will deliver 1 – (1 – 0.99) × (1 – 0.99) = 99.99% reliability, assuming the two devices fail independently.
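The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes the devices fail independently of one another, which real disks and real WAN links only approximate.

```python
def series_reliability(*parts):
    """Every part must work, so the individual reliabilities multiply."""
    result = 1.0
    for p in parts:
        result *= p
    return result


def parallel_reliability(*parts):
    """Any one part is enough, so the system fails only if all parts fail."""
    all_fail = 1.0
    for p in parts:
        all_fail *= (1.0 - p)
    return 1.0 - all_fail


# Two devices, each 99% reliable (assumed independent).
series = series_reliability(0.99, 0.99)
parallel = parallel_reliability(0.99, 0.99)
print(f"series:   {series:.2%}")
print(f"parallel: {parallel:.2%}")
```

The series case comes out at 98.01% and the parallel case at 99.99%, matching the figures in the text.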
The key phrase, of course, is proper design.
The first premise of RAID is that when any single disk is lost, not only is no data lost, but the application – data reads and writes – continues to function normally, without meaningful performance degradation. This is essentially what RAID Level 1 delivered.
Existing traditional routed networks with appropriate link and device redundancy at each location, of course, provide network availability in the face of any hard single link failure or router failure. But this "no loss of connectivity" is merely the equivalent of "no data loss" in the storage world. Even a no-single-point-of-failure routed network does not provide for application continuity in all cases, as a routed network can take upwards of 30 seconds at times for router convergence in the face of a given failure. More importantly, routing does not handle the case where packet loss or excessive latency causes significant problems with application performance. Yet these “soft failures”, due to congestion on shared IP networks, especially Wide Area Networks, occur with far greater frequency than hard link or device failures.
In fact, it is precisely because of the congestion-based packet loss and jitter which occurs at Internet peering points between ISPs that the public Internet has earned its "works pretty well most of the time" reputation. Of course, "pretty well most of the time" just ain't good enough for your Enterprise WAN.
Handling "soft failures" - and doing it quickly!
APN, by leveraging multiple paths across the network between locations, ensures that no single hard or soft failure of a network link, device or peering point in the middle of the Internet will cause a loss of connectivity or application predictability.
APN does its equivalent of RAID Level 1 by continuous measurement of network path performance (loss, jitter, latency, bandwidth) and sub-second response to problems with any network path. Typically in fewer than 3 round-trip times (RTTs), APN will move traffic off of a path experiencing high loss or excessive jitter.
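To make the idea of measurement-driven path selection concrete, here is a minimal Python sketch. It is not Talari's algorithm: the thresholds, the `PathStats` fields, the path names and the `pick_path` policy are all assumptions made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for this example.
MAX_LOSS = 0.01          # 1% packet loss
MAX_JITTER_MS = 30.0


@dataclass
class PathStats:
    name: str
    loss: float          # fraction of packets lost in the last measurement window
    jitter_ms: float
    latency_ms: float


def is_healthy(path: PathStats) -> bool:
    """A path is usable while both loss and jitter stay under the thresholds."""
    return path.loss <= MAX_LOSS and path.jitter_ms <= MAX_JITTER_MS


def pick_path(paths, current):
    """Stay on the current path while it is healthy; otherwise move to the
    lowest-latency healthy path. If nothing is healthy, take the least lossy."""
    by_name = {p.name: p for p in paths}
    if current in by_name and is_healthy(by_name[current]):
        return current
    healthy = [p for p in paths if is_healthy(p)]
    if healthy:
        return min(healthy, key=lambda p: p.latency_ms).name
    return min(paths, key=lambda p: p.loss).name


paths = [
    PathStats("fiber", loss=0.000, jitter_ms=5.0, latency_ms=40.0),
    PathStats("dsl", loss=0.080, jitter_ms=12.0, latency_ms=35.0),   # congested
    PathStats("cable", loss=0.002, jitter_ms=8.0, latency_ms=30.0),
]
chosen = pick_path(paths, current="dsl")
print(chosen)
```

In this example the "dsl" path is showing 8% loss, so traffic moves to "cable", the healthy path with the lowest latency. A real appliance would re-run a decision like this continuously as fresh measurements arrive.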
For TCP applications, APN also delivers this predictability by buffering packets from flows and retransmitting them in the face of loss.
For real-time application flows, APN can replicate the traffic on multiple paths, suppressing the duplicate packets at the receiving appliance, which is almost an exact equivalent of what RAID 1 does. For real-time apps like VoIP, application predictability, achieved by avoiding loss and minimizing jitter, is much more important than efficient capacity utilization. By using Internet connections, the cost of that extra bandwidth is minimal. For VoIP, it’s actually trivial. Even for videoconferencing, it’s fairly small where the capacity exists. [Note that APN also allows “conditional” replication of such flows, based on availability of bandwidth at the time.]
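Replication with duplicate suppression can be sketched in Python as follows. The class and function names are hypothetical, and the sketch assumes each packet carries a sequence number; how an APN appliance actually tags and tracks packets is not described here.

```python
class DuplicateSuppressor:
    """Receiver side: deliver the first copy of each sequence number, drop the rest."""

    def __init__(self):
        # Unbounded set for simplicity; a real receiver would use a sliding window.
        self.seen = set()

    def receive(self, seq, payload):
        if seq in self.seen:
            return None          # duplicate from another path
        self.seen.add(seq)
        return payload


def replicate(packets, path_names):
    """Sender side: yield one (path, seq, payload) copy per path for each packet."""
    for seq, payload in enumerate(packets):
        for path in path_names:
            yield path, seq, payload


# Simulated loss: path "a" drops packet 1, path "b" drops packet 2.
dropped = {("a", 1), ("b", 2)}

receiver = DuplicateSuppressor()
delivered = []
for path, seq, payload in replicate(["v0", "v1", "v2"], ["a", "b"]):
    if (path, seq) in dropped:
        continue
    out = receiver.receive(seq, payload)
    if out is not None:
        delivered.append(out)

print(delivered)
```

Each path loses a different packet, yet the receiver still delivers all three exactly once, which is the property that matters for a voice or video flow.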
Striping = greater throughput
RAID Levels 2 – 5 (and up) allow improved data access performance by doing bit, byte and/or block level striping across multiple disks, allowing simultaneous disk seeks. With APN, an individual TCP flow can be striped across multiple paths/links to deliver greater aggregate throughput. The buffering and retransmission of packets, in conjunction with buffering and packet reordering at the receiving APN appliance and packet forwarding logic armed with the relative latencies of each of the network paths involved, enables high throughput even in the face of packet loss.
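A toy Python version of striping one flow across two paths, with in-order release at the receiver, might look like this. It is a simplification under stated assumptions: packets are assigned round-robin rather than by latency-aware forwarding logic, there is no loss or retransmission, and the path names and latencies are invented.

```python
import heapq


def stripe(packets, path_names):
    """Assign sequence-numbered packets to paths round-robin."""
    return [(path_names[seq % len(path_names)], seq, payload)
            for seq, payload in enumerate(packets)]


class ReorderBuffer:
    """Hold out-of-order arrivals and release packets strictly in sequence."""

    def __init__(self):
        self.next_seq = 0
        self.heap = []

    def receive(self, seq, payload):
        heapq.heappush(self.heap, (seq, payload))
        released = []
        while self.heap and self.heap[0][0] == self.next_seq:
            released.append(heapq.heappop(self.heap)[1])
            self.next_seq += 1
        return released


# Hypothetical one-way latencies in milliseconds.
latency_ms = {"fast": 10, "slow": 50}

sent = stripe(["p0", "p1", "p2", "p3"], ["fast", "slow"])
# Packet with sequence number seq is sent at time seq ms and arrives one path latency later.
arrivals = sorted((seq + latency_ms[path], seq, payload) for path, seq, payload in sent)

buffer = ReorderBuffer()
in_order = []
for _, seq, payload in arrivals:
    in_order.extend(buffer.receive(seq, payload))

print([seq for _, seq, _ in arrivals])
print(in_order)
```

Packets arrive in the order 0, 2, 1, 3 because the slow path lags, but the application still sees p0 through p3 in sequence. Knowing the relative latencies in advance, as the text describes, lets the sender schedule packets so that the receiver has less reordering to do.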
Automating the solution to MTBF and MTTR issues
Network availability and predictability are quite naturally the number one concern of any enterprise WAN manager. Just as RAID changed how storage systems were designed, an APN-based WAN turns the historic importance of, and emphasis on, certain metrics on its head.
Before RAID, the MTBF of a disk subsystem was a critical factor in designing a highly reliable and available IT system. With RAID, this concern pretty much vanished. Before RAID, MTTR of the storage system was a big deal as well. RAID solutions allowed much faster system MTTR (just swap out the hard disk and insert a relatively cheap replacement disk). The defective disk itself is almost never actually “repaired” any longer. And with proper redundant system-wide design and automatic synchronization and backup processes in place, if the MTTR of making that disk swap ends up being 24 or 48 hours, no one is particularly bothered.
Similarly, prior to APN, the combination of the network availability and the predictability of packet delivery of the private WAN, usually somewhere in the 99.95% - 99.99% range, has quite correctly been of huge importance to the enterprise network manager. And the private WAN provider's SLA for, say, a 4 hour MTTR when a WAN problem does occur has also been extremely important.
With APN, however, the need for each WAN “subsystem” to be 99.9%+ predictable and reliable is greatly reduced. Given multiple diverse WAN connections combined with sub-second switchover in the case of packet delivery problems with any of them, the need for a 4 hour MTTR guarantee goes away as well. Who cares if the MTTR is 24 or even 48 hours for a broadband connection at a branch office, say, if the network continues to work, and users notice no loss in connectivity and minimal loss in application performance and predictability? And if delivering application predictability even during those rare times when a given link is down completely is important for your application or location, by having 3 diverse connections at that location, rather than 2, you will have superior application performance as well as network and application availability versus a private WAN actually delivering 99.99% uptime with an SLA committing to a 1 hour MTTR!
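A rough way to see why a slow repair on one link matters less with diverse connections is to estimate how often all links would be down at once. The sketch below is illustrative only: the 1% per-link unavailability is an assumed figure, not a measured one, and the calculation treats link outages as independent and covers hard outages only, not soft failures.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600


def downtime_minutes_per_year(unavailability):
    """Expected minutes per year with no working link."""
    return unavailability * MINUTES_PER_YEAR


def combined_unavailability(per_link_unavailability, links):
    """All links must be down at once, assuming independent outages."""
    return per_link_unavailability ** links


# Assumption for illustration: each broadband link is down 1% of the time.
per_link = 0.01

two_links = downtime_minutes_per_year(combined_unavailability(per_link, 2))
three_links = downtime_minutes_per_year(combined_unavailability(per_link, 3))
private_wan = downtime_minutes_per_year(0.0001)   # a WAN delivering 99.99% uptime

print(f"2 links:     {two_links:.2f} min/year")
print(f"3 links:     {three_links:.2f} min/year")
print(f"private WAN: {private_wan:.2f} min/year")
```

Under these assumptions, two links roughly match the 99.99% private WAN for total outage time, and a third link cuts the expected outage time by about another factor of 100, even though each individual link may take a day or two to repair.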
RAID revolutionized storage economics, improving storage reliability and capacity while radically reducing costs. In a very similar way, APN is revolutionizing enterprise WAN economics, improving WAN reliability and enabling substantial capacity increases while simultaneously enabling network managers for the first time in more than a decade to not just have control over their WAN costs and their WAN providers, but to radically reduce that huge budget item as well.