My ultimate SD-WAN solution
I wrote this post at the end of July 2020 as a brain dump of my thoughts about the SD-WAN market and never reviewed nor published it until now... Being on a break between two jobs, I had plenty of time to let my mind wander and reflect upon different topics, one of them being SD-WAN. There are plenty of articles, documents and videos about this technology and it is still hot on the market. If you are new to SD-WAN, in a nutshell, it means Software-Defined Wide Area Network, or how to leverage automation to build a more resilient and application-aware network with multiple uplinks, the whole infrastructure being managed by a central controller.
So why talk about SD-WAN?
I worked for a vendor — Riverbed Technology — in this space for several years in different capacities (pre-sales/sales engineer, technical evangelist…) and during that time, I studied the market and the different vendors.
More importantly, I have had the opportunity to engage with many Enterprise customers of different sizes, from different regions, discussing their challenges and the barriers to adoption.
This article is a summary of my thoughts and will hopefully inspire people to build better products. There is a lot of blue-sky thinking but that’s the exercise after all: think outside the box. I’d love to get feedback and other points of view as well.
Among the 60+ players on the market, there are really good solutions, each vendor having its own definition/philosophy and strengths, and each having its flaws too, some more than others as always… I have not seen the perfect solution that answers all the challenges that Enterprise customers are facing, so this is an exercise in designing that ultimate solution.
Guiding principles
If I were to develop an SD-WAN solution from scratch, I would follow those specific tenets:
- Simplicity is king. The real opportunity of Software-Defined is to focus on business outcomes first: drive and operate a network that supports the business by accelerating processes, making operations error-free and guaranteeing the best resiliency for applications. Automation hides all the complexity.
- Design it to be cloud-native and support Cloud-First use-cases.
- Don’t reinvent the wheel: there are so many great building blocks out there so it would be a shame not to leverage them.
- Make it multitenant, not only from a management point of view but also at the data plane level.
- Make it natively interoperable: document APIs, offer SDKs, use open standards (MP-BGP, IPsec, NETCONF…) and open-source technologies
- Build it to be highly scalable and portable (i.e. multiple form factors)
- Implement observability and open it up to third parties (ingest data or feed other tools)
- Security is fully baked into the product, not an afterthought…
Let’s explore the different layers of such a software-defined solution.
Data plane
Appliances and form factor
The data plane in an SD-WAN solution is formed by the branch and datacenter appliances. Here, people are interested in throughput, number of WAN ports, embedded 4G or even 5G modems, number of tunnels, path-steering capabilities, security features…
I would take an “SD-Branch” approach with an all-in-one appliance offering Wi-Fi and switch ports. Ideally, the solution should be complemented with standalone switches and Wi-Fi access points. On that note, I believe that the recent move of HPE-Aruba Networks acquiring Silver Peak makes a lot of sense. Other vendors can do the same: Versa, Fortinet, Meraki, Riverbed…
What I believe is missing today for many vendors is the variety of form factors. Hardware appliances are primarily requested by customers and telco providers. It is a given. They are not going to disappear any time soon, as long as companies have factories, campuses, headquarters and other offices with tens, hundreds or more users.
Offering virtual appliances running on VMware vSphere, Microsoft Hyper-V or KVM is also required nowadays. The industry needs a lightweight VM that can be deployed and spun up very quickly, which is perfect for the NFV use case. I am not a big fan of NFV but there are interesting cases for it.
Cloud gateways that run on AWS EC2, Microsoft Azure, GCP and other IaaS platforms, available from the marketplaces, are also required. Note that even in 2020, only a few vendors offer their SD-WAN appliances directly from the IaaS marketplaces, much less with automation to deploy them in your VPCs/VNETs! There is something wrong here. Having a virtual form factor of your software does not make it Cloud-ready! I’ll leave this rant for another time…
I would highly consider ARM-based solutions. The price/performance ratio is currently more compelling than standard x86 CPUs, and performance is good. Telco providers are contemplating the use of ARM-based hardware for their networks, and Cloud Providers are already offering ARM-based instances. Would I drop x86? Maybe not for high-end appliances, but certainly for small/medium branches.
Another form factor to build is an endpoint agent, i.e. a piece of software that would be deployed on a laptop/tablet or even a smartphone. With the current global pandemic and the surge of remote work, being able to connect to the Enterprise network seamlessly, reliably and in the most secure way is paramount. I believe in Zero Trust; I don’t believe in SSL VPN for that use case anymore.
The Cloud’s ubiquity and scalability shall be leveraged to build the SD-WAN fabric and network. Having POPs in the Cloud à la VeloCloud, Cato Networks and Azure vWAN, so that your branches or endpoints can connect via those cloud gateways, would be part of the solution. We would equip those POPs (identified as Cloud Gateways in the diagram) with advanced security threat management, leveraging the power of the Cloud to protect your network. Zscaler and other vendors have proven it is a valid approach.
Finally, I am a big believer in cloud-native applications. The industry needs an SD-WAN appliance with Ingress Controller capabilities à la Traefik so it can integrate seamlessly with the new applications the industry is building to run in the Cloud or at the Edge. Why not extend the concept of Service Mesh to the WAN🤭
Routing
This could be a big point of contention. How deep into routing shall we go? Do we still need “full-stack routing” now that we have moved to an overlay network?
There are vendors with basic capabilities on one side, the likes of CloudGenix and Meraki for example. On the other side, there are real routers like Cisco Viptela, Versa and Riverbed SteelConnect-EX.
In the middle, many vendors claim support for OSPF and BGP. Sure, they can participate in the routing with those protocols (some less than others) to learn underlay routes and inject overlay routes into the underlay.
That sounds like enough. Guess what, it is not!
Unless you are in a greenfield situation and lucky enough to build the network from scratch, you will most likely face limitations and have to design your SD-WAN network around them. Not ideal, as eventually it will bite you! I have been there, trust me on this one…
Ok, so a full routing stack? Isn’t that going to defeat the purpose of SD-WAN and our key tenet, simplicity? Yes, it might, and that’s the case for the router-like solutions I mentioned earlier. But depending on how it is implemented and exposed to the administrator, we could have a great solution. There is hope my friends! I’ll discuss it in the Management plane section, stay tuned…
We need a full BGP (MP-BGP in fact) and full OSPF (including OSPFv3, of course, for IPv6) implementation to be able to integrate with the rest of the ecosystem (remember the interoperability tenet…) and have full control over traffic flows. There are great stacks out there to be reused, as sketched below.
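To make this concrete, here is a minimal sketch of overlay route injection using ExaBGP, one of those reusable open-source stacks. It assumes ExaBGP is configured to run the script as an API process; the prefixes and next-hop are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Minimal ExaBGP helper: inject hypothetical overlay prefixes into the underlay."""
import sys
import time

OVERLAY_PREFIXES = ["10.10.0.0/24", "10.10.1.0/24"]  # hypothetical overlay subnets
NEXT_HOP = "192.0.2.1"  # hypothetical tunnel endpoint on this appliance

def announce(prefix: str) -> None:
    # ExaBGP reads route commands from this process's stdout.
    sys.stdout.write(f"announce route {prefix} next-hop {NEXT_HOP}\n")
    sys.stdout.flush()

if __name__ == "__main__":
    for prefix in OVERLAY_PREFIXES:
        announce(prefix)
    # Stay alive: ExaBGP withdraws our routes if the helper process exits.
    while True:
        time.sleep(60)
```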
The ability to create VRs (Virtual Routers; I know it is confusing, but those are not VMs like you would deploy on VMware, rather separate routing instances) and VRFs is really powerful too. I have experienced it with Versa and Riverbed SteelConnect-EX. This is also how you can achieve multi-tenancy, for example, but also control routes. Powerful but complex, something to simplify again… This would be on my list too!
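As a rough illustration of the VRF concept, here is a sketch using Linux VRF devices, which a Linux-based appliance could rely on for per-tenant routing tables. The tenant names, table numbers and interfaces are hypothetical, and this requires root privileges.

```python
#!/usr/bin/env python3
"""Sketch: per-tenant isolation with Linux VRFs (hypothetical names)."""
import subprocess

def run(cmd: str) -> None:
    print(f"+ {cmd}")
    subprocess.run(cmd.split(), check=True)

def create_tenant_vrf(name: str, table: int, member_if: str) -> None:
    run(f"ip link add {name} type vrf table {table}")  # VRF device bound to its own table
    run(f"ip link set {name} up")
    run(f"ip link set {member_if} master {name}")      # enslave the tenant-facing interface

if __name__ == "__main__":
    # Hypothetical tenants and interfaces.
    create_tenant_vrf("vrf-tenant-a", 100, "eth1")
    create_tenant_vrf("vrf-tenant-b", 200, "eth2")
```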
Overlay
When we talk about SDN in general, we talk about overlay networks. IPsec, VXLAN and GRE are the most used tunneling technologies to build overlay networks.
IPsec has been around for a while, works well and is secure, but becomes resource-intensive with a high number of peers. This is the main challenge that SD-WAN vendors are facing. Throughput is one thing; the number of tunnels that a box can handle is another.
I have tested other solutions like ZeroTier and WireGuard and found them flexible and performant. Look at [this article](https://paulierco.ro/wireguard-vs-zerotier-throughput-performance.html).
Again, if I were to build my own SD-WAN product, I would probably spend some time investigating technologies other than the traditional ones to achieve the best scalability and performance.
Don’t get me wrong, IPsec and GRE would be available in the product as well. Remember, we want the solution to be interoperable. We need it to connect with third-party firewalls, security routers and other concentrators (Azure Virtual WAN Virtual Gateways, Zscaler and other CASBs).
Ideally, we should follow the work done by the MEF and also ONUG to make sure our overlay (and the route exchange) would work with other SD-WAN vendors. Hey, why not? We are dreaming about the ultimate solution.
One critical point to achieve scalability would be the ability to dynamically create and terminate tunnels without compromising performance; applications could otherwise be negatively impacted during connection setup.
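To give an idea of how lightweight dynamic tunnel setup could be with a WireGuard-style overlay, here is a sketch using the standard iproute2 and wg tools. The peer key, endpoint and addressing are hypothetical placeholders; this requires root, the wireguard kernel module and the wg CLI.

```python
#!/usr/bin/env python3
"""Sketch: on-demand WireGuard tunnel setup (hypothetical peer details)."""
import subprocess

def run(cmd: str) -> None:
    print(f"+ {cmd}")
    subprocess.run(cmd.split(), check=True)

def bring_up_tunnel(ifname: str, peer_pubkey: str, endpoint: str,
                    local_ip: str, allowed_ips: str) -> None:
    run(f"ip link add {ifname} type wireguard")
    run(f"wg set {ifname} private-key /etc/wireguard/{ifname}.key "
        f"peer {peer_pubkey} endpoint {endpoint} allowed-ips {allowed_ips}")
    run(f"ip addr add {local_ip} dev {ifname}")
    run(f"ip link set {ifname} up")

if __name__ == "__main__":
    # Hypothetical branch-to-POP tunnel.
    bring_up_tunnel("wg-pop1", "PEER_PUBLIC_KEY_BASE64", "198.51.100.10:51820",
                    "10.255.0.2/31", "10.0.0.0/8")
```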
Path Resiliency features
On this one, we go all-in. Path conditioning is table stakes in the SD-WAN world. Network architects expect to be able to aggregate multiple WANs (MPLS, broadband Internet, 4G…) not only to increase the bandwidth in a given site but to guarantee the best SLAs to the business. We want to make the best use of all the bandwidth, hence load-balancing techniques will be required.
We need mechanisms to actively measure the availability and the performance of uplinks, or more precisely to guarantee application SLAs over the WAN. Therefore, the solution should be able to track performance at Layer 7. How would that work? Different probing techniques exist (HTTPS GET, Selenium scripts…) or, best of all, tracking end-user experience. The latter would be resource-intensive on an appliance but could be coupled with the endpoint agents. What if, indeed, you had agents on the endpoints that would measure and report their performance locally to the appliance to help the gateway take decisions…
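A minimal sketch of such a Layer 7 probe, the kind an endpoint agent could run and report locally; the health URL is a hypothetical placeholder, and a real agent would sample continuously and stream the results to the local gateway.

```python
#!/usr/bin/env python3
"""Sketch: a Layer 7 probe measuring HTTPS responsiveness (hypothetical target)."""
import time
import urllib.request

def probe(url: str, timeout: float = 5.0) -> dict:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read(1024)  # the first KB is enough to gauge responsiveness
            status = resp.status
    except Exception as exc:
        return {"url": url, "ok": False, "error": str(exc)}
    return {"url": url, "ok": True, "status": status,
            "latency_ms": round((time.monotonic() - start) * 1000, 1)}

if __name__ == "__main__":
    print(probe("https://example.com/health"))  # hypothetical health endpoint
```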
In case of performance degradation on the links, mitigation techniques should automatically kick in: switching over to a better link, packet racing, TCP optimization, Forward Error Correction…
I did not mention WAN acceleration techniques (caching, application-specific optimizations…). As long as latency impacts user experience, there will be a need for that technology. Quick reminder: latency greatly impacts application performance when applications are chatty (a lot of messages going back and forth on the network between the client and the server). For example, an operation requiring 200 round trips over a 50 ms RTT link spends 10 seconds in the network alone.
Control plane
The control plane determines how sites are connected to each other and how packets should be forwarded. This is a very critical piece of the architecture.
Running in the cloud, controllers will be fully redundant with maximum resiliency. They will be hosted in different regions as per best practices.
By default, it should be a fully managed service unless customers want to have it running in their own VPCs.
The control plane is responsible for the keying (for tunnel encryption) so it will highly depend on the overlay solution that is used (see the previous chapter).
A routing protocol like MP-BGP for exchanging routes and for the controller to program routes seems to be the best approach. Versa’s solution (and Riverbed SteelConnect-EX) is known and well appreciated by large organizations for this.
Among the many customers I talked to, a great number were confused about the way sites should be interconnected: full mesh or hub & spoke. They understand that full mesh has a cost (each appliance must maintain n-1 tunnels in a mesh of n sites, with the associated resource and performance impact…) but hub & spoke would not satisfy all their requirements in terms of performance.
In most cases, I believe traffic engineering should be automated and customers should rely on the product.
By default, all sites will be connected to the Cloud POPs and traffic would transit via those special (managed) sites. For certain sites like data centers, I would add the concept of a service site so branch offices can connect to them directly for better performance. More importantly, I would leverage machine learning to analyze traffic patterns (and if possible application performance) to let the system learn and adapt tunnel creation between sites accordingly.
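Where I imagine machine learning, even a trivial threshold heuristic illustrates the idea of traffic-driven tunnel creation; the site names and threshold below are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: naive dynamic-mesh heuristic (hypothetical sites and threshold)."""
from collections import defaultdict

DIRECT_TUNNEL_THRESHOLD_MBPS = 5.0  # hypothetical cut-off

observed_mbps: dict = defaultdict(float)

def record_flow(src_site: str, dst_site: str, mbps: float) -> None:
    # Aggregate spoke-to-spoke traffic currently hairpinning through the POP.
    observed_mbps[tuple(sorted((src_site, dst_site)))] += mbps

def sites_needing_direct_tunnels() -> list:
    # Pairs the controller should mesh directly.
    return [pair for pair, rate in observed_mbps.items()
            if rate >= DIRECT_TUNNEL_THRESHOLD_MBPS]

record_flow("paris", "lyon", 3.2)
record_flow("lyon", "paris", 4.1)
print(sites_needing_direct_tunnels())  # [('lyon', 'paris')]
```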
Management plane
I want this solution to be Cloud-Native, so I won’t offer an on-premises option. I know that on-premises is still a prerequisite for many large enterprise companies or government agencies. That’s fine, the market is huge and, “in the fullness of time”, those corporations will eventually migrate to the cloud…
Going back again to our tenets (don’t reinvent the wheel, hyper-scale), I would certainly leverage an IoT framework for the management plane and for collecting telemetry. Those solutions were designed from the ground up to manage and monitor thousands of devices and more. Cloud providers offer such solutions; look at AWS IoT Device Management. Secure on-boarding, secure access, health monitoring, troubleshooting, software/firmware management: all those key capabilities are native, so I would go there for sure.
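As a sketch of what device telemetry through such a framework could look like, here is a publish over MQTT (the protocol AWS IoT speaks) using the paho-mqtt library (1.x API). The broker, topic and metrics are hypothetical, and a real integration would use mutual-TLS device certificates for the secure on-boarding mentioned above.

```python
#!/usr/bin/env python3
"""Sketch: appliance health telemetry over MQTT (hypothetical broker and topic)."""
import json
import time

import paho.mqtt.publish as publish  # paho-mqtt 1.x convenience helper

payload = json.dumps({
    "device": "branch-042",
    "ts": int(time.time()),
    "wan1_loss_pct": 0.1,
    "wan2_loss_pct": 2.4,  # degraded uplink the controller could act on
})

# Fire a single QoS 1 message; the helper handles the network loop.
publish.single("sdwan/branch-042/health", payload, qos=1,
               hostname="iot.example.com")  # hypothetical endpoint
```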
I want to design a solution that is easy to use, with simple workflows, but that still offers full control when necessary (there will always be use cases that are not covered…). The initial promise of Ocedo/Riverbed SteelConnect-CX was great but not complete. I would go several steps further. Architects draw the IT systems they want to build; it is a natural way of brainstorming and “designing” a solution. I am convinced that if network architects could build their network templates with visual building blocks (like in Microsoft Visio, Lucidchart and others…), this would be a great experience. In the background, configurations would be generated to configure the system. Then you just have to rely on automation to have your network up and running. We did a proof of concept with a friend by creating a Visio plugin that called APIs in the background to create the configuration on the SD-WAN system. It was promising, although not really beautiful; having a dedicated interface would be better.
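A sketch of the idea: the visual tool only has to emit a simple topology structure and push it to the controller. The schema and the /sites API endpoint below are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: turning a drawn topology into configuration calls (hypothetical API)."""
import json
import urllib.request

# What a visual design tool could emit behind the scenes.
topology = {
    "sites": [
        {"name": "paris-hq", "role": "hub", "uplinks": ["mpls", "inet"]},
        {"name": "lyon-branch", "role": "spoke", "uplinks": ["inet", "lte"]},
    ],
    "policies": [{"app": "voip", "sla": {"latency_ms": 150, "loss_pct": 1}}],
}

def push_site(site: dict) -> None:
    req = urllib.request.Request(
        "https://controller.example.com/api/v1/sites",  # hypothetical endpoint
        data=json.dumps(site).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(site["name"], resp.status)

for site in topology["sites"]:
    push_site(site)
```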
In addition, I believe that troubleshooting would be greatly enhanced as well if you could visualize what your network actually looks like and where to dive deeper…
An open-source library of network topologies could be offered for architects to share and download validated designs. Which leads me to third-party management: I believe that interoperability is key in networking and more often than not overlooked by vendors (normal for those in a power position…). Users will be able to configure third-party devices with automatically generated configuration. Ansible, NETCONF modules or even Terraform/CloudFormation (for your network in the clouds…) will be implemented to automate the configuration of routers, firewalls and other devices.

Let’s talk about the Cloud for a moment. I still see people struggling to get their remote sites connected in a secure way to their new “datacenter” running in the Cloud. There are many solutions available for a site-to-site connection with a VPC in Amazon, but they require heavy manual configuration. Imagine you design your solution graphically with an AWS Transit Gateway so that all your remote sites create VPN tunnels to it, and behind the Transit Gateway, 4 VPCs are connected. With automation, this can be easy to implement, as the sketch below suggests. Let’s make that happen! Let the machine convert our “ideas” into config files and design workflows that are natural for humans. Learning CLI is not natural!
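A sketch of that automation with boto3, assuming hypothetical VPC and subnet IDs; real code would also create the site VPN attachments and wait for the gateway to become available before attaching.

```python
#!/usr/bin/env python3
"""Sketch: automating the Transit Gateway hub described above (hypothetical IDs)."""
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

tgw = ec2.create_transit_gateway(Description="sd-wan hub")
tgw_id = tgw["TransitGateway"]["TransitGatewayId"]

# Attach the four application VPCs behind the hub.
# (In real code, wait until the TGW state is "available" first.)
for vpc_id, subnet_id in [
    ("vpc-aaaa1111", "subnet-aaaa1111"),  # hypothetical IDs
    ("vpc-bbbb2222", "subnet-bbbb2222"),
    ("vpc-cccc3333", "subnet-cccc3333"),
    ("vpc-dddd4444", "subnet-dddd4444"),
]:
    ec2.create_transit_gateway_vpc_attachment(
        TransitGatewayId=tgw_id, VpcId=vpc_id, SubnetIds=[subnet_id]
    )
```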
For the management of the solution, I also want to leverage and apply the concepts of GitOps. Managing a large network at scale with a GUI is not ideal, nor sometimes possible or practical. I know, using Git repos is unnatural for most network engineers, not something they are used to or even experienced with. In addition, network engineers are familiar with imperative approaches, not declarative ones. Still, there is a lot of value, so let’s explore the concept for a minute.

Imagine you could define the end-state of your network, and the network would automatically “heal” itself. So what is the end-state for a network? I would argue it is the ability to transport applications with the right performance and maximum resiliency. And yes, that’s the promise of SD-WAN: if an uplink is degraded, switch the traffic to another WAN; there are probes measuring the performance and so on… Still, I think we could go several steps further. Today, SD-WAN solutions are still imperative and network architects are expected to define everything, including the traffic engineering. With a declarative approach, this would change. I will go deeper into that concept in another blog post. Having a declarative way of defining your network is a first step towards GitOps; I wanted to put it out there.
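A toy sketch of the declarative idea: the desired state (which would live in a Git repo) is continuously reconciled against the observed state. Everything here, from the schema to the telemetry values, is hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: declarative intent and a naive reconcile loop (all hypothetical)."""

# Desired state: what the architect commits to the repo.
desired = {
    "site": "lyon-branch",
    "apps": {"voip": {"max_latency_ms": 150}, "erp": {"min_mbps": 10}},
}

def observe() -> dict:
    # Placeholder for real telemetry; pretend voip latency is degraded.
    return {"site": "lyon-branch",
            "apps": {"voip": {"latency_ms": 210}, "erp": {"mbps": 22}}}

def reconcile(desired_state: dict, observed: dict) -> list:
    # Diff desired vs. observed and emit corrective actions.
    actions = []
    voip_latency = observed["apps"]["voip"]["latency_ms"]
    if voip_latency > desired_state["apps"]["voip"]["max_latency_ms"]:
        actions.append("steer voip to lowest-latency path")  # e.g. switch uplink
    return actions

print(reconcile(desired, observe()))
```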
Finally, you can’t really manage a network at scale without visibility/observability capabilities. The IoT framework offers a way to report telemetry. It would send everything (system logs, packet logs, flows, application performance statistics, security threats…) to a solution like ELK in the cloud. I am impressed with Zeek and how it helps streamline packet analysis. Look at what former colleagues are implementing.
Conclusion
SD-WAN is hot on the market and there is no perfect solution yet. I wanted to share my thoughts after several years of talking to customers and helping them implement solutions. Of course, we would need to dive deeper into all the components, but that would probably be the beginning of a new company…