Five Design Principles for the Network Architect - Simplicity

(#6 of 7)

So, across the articles in this series, we have covered most of the basic tenets of network design to ensure we deliver an available, supportable, secure network for our customer.  Inevitably though, any network needs to implement a wide range of capabilities: to interoperate between different vendors' kit, perhaps; or with a specific network operator for WAN connectivity; and of course we are highly unlikely to have an entirely greenfield operation, so the chances are we need to interact with an existing environment.

Simplicity vs Complexity

All of these bring a degree of complexity to the network design, but also to the implementation and ongoing support of the network.  But when we say "complexity", what do we really mean?  Better minds than mine have considered this at length, and while there is no absolute consensus, a widely accepted definition is a network with a large number of interaction surfaces involving a significant amount of state transfer between elements.  In other words, lots of different moving parts that require masses of intercommunication in order to maintain a working network.  (For a more rigorous and detailed treatment, try Navigating Network Complexity by White & Tantsura.)

So, what do we do to combat the complexity then?  Simple.  Reduce the number and variety of moving parts!  But there are a couple of questions that crop up there:
  • What are those moving parts?  In this case, the many different elements of the network that process or handle any computer-to-computer interaction: switches, routers, firewalls, gateways, load balancers, DNS, DHCP (all of them physical or virtual) - the list goes on;
  • What form does the intercommunication take?  All of our standard (and not so standard) network protocols: ARP, DHCP, IP, ICMP, routing protocols, encapsulation and tunnelling, encryption - again not exhaustive, but you get the picture.
From one perspective then, keeping the number of different device types, and the amount of information they need to exchange about endpoints, to a minimum gives us the simplest network.  The problem is that most networks need a level of complexity to meet the requirements they are being built to service.  Looking back over previous posts, we can see that maintaining availability and security, for example, could be considered competing objectives - and a degree of complexity in the use of the features and protocols available to us is necessary to strike the balance.

Let the network do its thing

One approach to keeping the network implementation simple is to minimise the use of nerd knobs and esoteric features that are difficult to understand and hard to configure.  When designing a network, wherever possible, make good use of default behaviours and standard configurations.  For example, rather than rely on routing protocols with complex path selection algorithms that need to transfer state around the network in case of failure, look to use simple longest prefix matching to maintain your failover routing.  You minimise failover time because you are not waiting for routing updates to propagate around the network: your backup routes are already in the routing table when the primary routes are withdrawn.
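To see why this keeps failover simple, here is a minimal sketch of longest-prefix-match route selection in Python, using the standard library only.  The prefixes and next-hop names ("primary-wan", "backup-wan") are illustrative assumptions, not taken from any real design: a more specific primary route wins while it exists, and the less specific backup catches traffic the moment the primary is withdrawn, with no protocol convergence needed.

```python
import ipaddress

# Hypothetical routing table: the /8 backup is already installed alongside
# the more specific /16 primary, so failover needs no routing update.
routes = [
    (ipaddress.ip_network("10.0.0.0/8"), "backup-wan"),
    (ipaddress.ip_network("10.1.0.0/16"), "primary-wan"),
]

def best_route(dst, table):
    """Longest-prefix match: pick the most specific route containing dst."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in table if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(best_route("10.1.2.3", routes))   # primary-wan: the /16 is more specific

# "Withdraw" the primary route; the /8 backup takes over immediately.
routes = [r for r in routes if r[1] != "primary-wan"]
print(best_route("10.1.2.3", routes))   # backup-wan
```

The point is that the failover behaviour falls out of a single, well-understood selection rule, rather than from state exchanged between devices at failure time.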

In another example, my team will tell you that I harbour major distrust of any "virtual stacking" technology for LAN switches (I'd include Catalyst VSS and Nexus vPC in that), because the complexity of the control planes has gone through the roof in implementing these capabilities.  From personal experience, notwithstanding the known limitations of the technologies (think about dynamic routing over a vPC), issues with them are more often than not of the "grey failure" variety, where the hardware continues to function but the software ties itself in knots.  So the failures we are supposed to be protecting against don't materialise, but other more esoteric (and generally catastrophic) ones do.

On the whole, it's just better to use standard mechanisms and allow failures to be routed around "normally".  A known good, tried and tested failover mechanism is better than one that might not work the way you'd expect.  What do I mean by that?  Well, consider ASA or Cisco WLC SSO failover.  Typically, if one of the appliances or its network connections goes down, state has been sync'd to the standby box, which simply picks up where the previously active appliance left off.  The temptation might be to cross-connect these appliances to multiple upstream switches using some form of MLAG facility.  But in the grey failure case described above, the aggregated ports stop behaving as predicted and at least some traffic disappears.  Here it's probably better to connect a single appliance to a single switch and tie the fates of those devices together: if a single switch fails, the failure will not propagate to both instances in the HA pair.

Hide the complexity

An alternative view might be to keep the complexity but automate it and "hide" it from the view of the implementer, operator or end user.  Abstraction effectively means that features and protocols are bundled up into templated, standardised packages which are configured across the interacting devices.  This can be carried out through individual device automation - Python scripts or Ansible playbooks, say - or through an orchestrated workflow using an all-seeing controller, such as Cisco's DNA Centre, or indeed an orchestration platform which attempts to translate business intent into individual device configurations, such as NSO.

Individual device automation requires a significant amount of work templating configurations and developing the standardised approaches required.  Essentially, the scripts replace the engineer typing CLI into multiple boxes, but the configurations still have to be developed to make the approach work.  The orchestration approach typically makes available a number of standardised pre-built workflows to remove the need for individual development, but also allows for custom workflows should a specific application require them.
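As a minimal sketch of what that templating work looks like, the snippet below renders a standardised access-port configuration with Python's built-in string templating.  The template text, interface names and VLAN numbers are hypothetical examples, not a real standard: the point is that the engineering effort moves from typing CLI into boxes to defining and maintaining the template itself.

```python
from string import Template

# Hypothetical standardised access-port template; the commands and field
# names are illustrative only, not taken from any particular design.
ACCESS_PORT = Template(
    "interface $interface\n"
    " description $desc\n"
    " switchport access vlan $vlan\n"
    " spanning-tree portfast\n"
)

def render_access_port(interface, desc, vlan):
    """Render one standardised access-port configuration block."""
    return ACCESS_PORT.substitute(interface=interface, desc=desc, vlan=vlan)

print(render_access_port("GigabitEthernet1/0/1", "Desk port", 10))
```

In practice a real deployment would use a fuller templating engine (Jinja2 under Ansible, for instance), but the principle is the same: one template, validated once, pushed many times.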

By hiding the way the network is configured in this way, the end user can simply get on with running their services, right?

Well, almost.  Fundamentally, the devices in the network are still being individually configured, they are still processing packets the same way they would in a traditional network, and the network engineer still needs to be able to understand and troubleshoot network operations.  Because software can go wrong.  And if that software is a network controller for the entire network, then when it does go wrong it has the ability to take everything else with it.  The network is no simpler; the complexity is just abstracted away behind a GUI and orchestrated workflows.  Devices are still being configured with the same features as before, but the operator's view is simpler.  If that means the consumer gets his or her service delivered more quickly, then the world is a brighter place.


Another way of operationally simplifying the environment is to define a set of templated change requests which achieve standardised actions.  The end user or customer simply requests the changes, and the operator or administrator of the network executes them.  The actual processes can be hugely complex, with many elements, but what is reported back to the end user is a simple RAG status for the overall change - so to them it looks simple.  Set appropriate SLAs for the changes, and you have simplified the consumption of the network service.
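The roll-up from many internal steps to one customer-facing status can be sketched in a few lines.  The step names and the aggregation rules below are illustrative assumptions rather than a real change process, but they show the idea: however many moving parts the change has, the consumer only ever sees Red, Amber or Green.

```python
# Hypothetical roll-up of a multi-step change into a single RAG status.
def rag_status(steps):
    """Map individual step outcomes ('ok', 'pending', 'failed') to one
    Red/Amber/Green status for the overall change."""
    if any(s == "failed" for s in steps.values()):
        return "Red"
    if any(s == "pending" for s in steps.values()):
        return "Amber"
    return "Green"

# Example change with illustrative step names.
change = {"pre-checks": "ok", "deploy": "ok", "post-checks": "pending"}
print(rag_status(change))   # Amber until every step completes
```

The complexity of the individual steps is untouched; only the reporting surface presented to the consumer has been simplified.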


Accept that the network itself, its traffic flows and its applications are never going to be simple.  Instead, look for ways to simplify the particular view that a user or operator has of the network - that is how you hide the complexity away!

As usual, really interested to hear your thoughts on this!

Previous> Supportability
Next> Conclusion

