Five Design Principles for the Network Architect - Availability


(#2 of 7)

In the first post in this series, I shared a summary of the fundamental design principles I try to apply to every network design I am involved in.  The follow-ups to that summary post will discuss each principle at greater length - this one addresses Network Availability.

The network exists to provide the transport that lets endpoints consume services in a "remote" location.  Whether the endpoints are application servers in DC racks querying a database, wireless clients accessing an application in the data centre, or sensors collecting data and dropping it into storage - and regardless of where the services themselves sit (public cloud, private cloud, co-lo DC) - the fundamental measure of the network's success is the availability of the service to the endpoint, and thus to the user.

Clearly then, availability can't be considered a single measure of the network as a whole - a number of capabilities and properties of the environment contribute to it.  The intention of a good network design has to be to put all of those elements in place to maximise service availability within the constraints of the environment.  The first stage is being able to measure that service performance is within acceptable bounds in the physical places where the service is consumed.  For example, in a manufacturing organisation, the stock-picking application needs to be available in the company warehouses on hand-held terminals; determining that this service is available from the company's Head Office is not necessarily useful.  This calls for some form of distributed, application-aware measurement capability, either using intelligence within the network devices themselves or by collecting data about traffic flows and exporting it to a separate analysis platform using IPFIX or NetFlow, for example.  The advantages of the centralised approach include the ability to correlate performance data from across the network in one place and so identify where performance bottlenecks might be occurring.
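To make that more concrete, here is a minimal sketch (in Python) of the kind of synthetic probe that could run in each location to measure a service from the user's side of the network.  The service URL, location name and latency threshold are all hypothetical placeholders, not a definitive implementation:

```python
import time
import urllib.request

# Hypothetical values for illustration only - substitute your own.
SERVICE_URL = "http://picking-app.example.internal/health"  # placeholder URL
LOCATION = "warehouse-01"                                   # placeholder site name
THRESHOLD_SECONDS = 2.0                                     # acceptable response time

def probe_service(url: str, timeout: float = 5.0) -> dict:
    """Fetch the service health endpoint and record latency and outcome."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            ok = response.status == 200
    except OSError:
        ok = False
    latency = time.monotonic() - start
    return {"location": LOCATION, "ok": ok, "latency_s": latency}

if __name__ == "__main__":
    result = probe_service(SERVICE_URL)
    healthy = result["ok"] and result["latency_s"] < THRESHOLD_SECONDS
    print(f"{result['location']}: {'GREEN' if healthy else 'RED'} "
          f"({result['latency_s']:.2f}s)")
```

Results from probes like this, gathered centrally from every site, give exactly the per-location view of service experience that device-level monitoring alone cannot.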

The key point here is that each of the organisation's services may require a different subset of the network infrastructure to deliver it, so a genuine network problem may affect some services while leaving others untouched - hence this sort of monitoring can indicate when real-life user experience is affected by network conditions.  There is also a desirable side effect to this service monitoring: it effectively introduces the ability to detect "grey failures" - issues where performance is degraded but the degradation is not reflected in a definitive status of a device or interface (as it would be in a "fail-stop" scenario).  There is an interesting Microsoft Research paper on this topic in which the authors argue that monitoring of the kind described here helps to protect services from grey failures through "differential observability".
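As a rough illustration of what "differential observability" might look like in practice, the sketch below compares each location's median latency against the fleet-wide median to flag a site that is degraded even though no device or interface reports a failure.  The numbers and location names are made up:

```python
from statistics import median

# Hypothetical recent latency samples (seconds) per location for one service.
samples = {
    "warehouse-01": [0.20, 0.22, 0.21],
    "warehouse-02": [0.19, 0.23, 0.20],
    "warehouse-03": [1.40, 1.55, 1.48],  # degraded, yet nothing reports "down"
}

def grey_failure_suspects(samples: dict[str, list[float]],
                          factor: float = 3.0) -> list[str]:
    """Flag locations whose median latency is far above the fleet median."""
    site_medians = {site: median(vals) for site, vals in samples.items()}
    fleet_median = median(site_medians.values())
    return [site for site, m in site_medians.items() if m > factor * fleet_median]

print(grey_failure_suspects(samples))  # ['warehouse-03']
```

The comparison is deliberately relative: a site is suspect because it differs from its peers, not because it crossed an absolute alarm threshold.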

Services need to be mapped out to show the interdependencies of the network elements required to deliver them, which is of course useful both for analysing network events and for demonstrating their impact.  In our warehouse example above, by measuring the performance of a series of key services - say, the picking application from hand-held and truck-mounted terminals, and the back-office applications from PCs in the admin office - in each of the company's warehouse locations, issues can be scoped correctly: as affecting only the wireless in one location, or all networks in that location, or all locations (and therefore more likely to be WAN- or DC-related), and so on.
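A minimal sketch of that scoping logic, assuming the per-location probes above and entirely hypothetical service and location names, might look like this:

```python
# Hypothetical probe results: (location, access method) -> service reachable?
results = {
    ("warehouse-01", "wireless"): False,
    ("warehouse-01", "wired"):    True,
    ("warehouse-02", "wireless"): True,
    ("warehouse-02", "wired"):    True,
}

def scope_issue(results: dict[tuple[str, str], bool]) -> str:
    """Infer the likely fault domain from the pattern of failed probes."""
    failed = {key for key, ok in results.items() if not ok}
    if not failed:
        return "no issue detected"
    failed_locations = {loc for loc, _ in failed}
    all_locations = {loc for loc, _ in results}
    if failed_locations == all_locations:
        return "all locations affected - likely WAN or DC network"
    loc = next(iter(failed_locations))
    methods_failed = {m for l, m in failed if l == loc}
    methods_at_loc = {m for l, m in results if l == loc}
    if methods_failed == methods_at_loc:
        return f"all networks at {loc} affected - likely local site issue"
    return f"only {', '.join(sorted(methods_failed))} at {loc} affected"

print(scope_issue(results))  # "only wireless at warehouse-01 affected"
```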

From that example, we can see that there are a number of interrelated elements, traditionally broken out in the design process, which might contribute to that availability, including in no particular order:
  • Stability - ensuring the need for change in the environment configuration and topology is minimised through simplicity of design, equipment choice and so on;
  • Redundancy - the use of multiple paths or devices between two points in the network which may not necessarily carry traffic under normal circumstances but are available to take over should the active devices or paths fail;
  • Fast convergence - in conjunction with redundancy, mechanisms that are put in place in the network to detect failure and route traffic around it;
  • Performance - provisioning devices and links with more capacity than is required, measuring and monitoring the capacity usage, and use of Quality of Service techniques to ensure best usage of the available capacity;
  • Scalability - being able to manage the capacity of the network through addition or removal of devices or links;
  • Lifecycle Management - being able to manage device upgrades without impacting the network in an uncontrolled manner, through use of redundancy, staggered maintenance windows, device groupings etc;
  • Monitoring - having a good method of detecting the status of devices, links and services to determine status at any given time in any given location;
  • Supportability - strong support processes which kick in when monitoring indicates that an issue has occurred.  These might include manual failover or DR invocation, which - so long as they are robust and can be executed in a timely fashion - may be completely acceptable, rather than relying on complex automatic processes which can be brittle.

Traditionally, each of these comes to be discussed as the solution to availability issues (for example, listen to the Network Collective podcast episode on resilience).  In reality, all of these elements need to work in concert to ensure that there is always sufficient visible, active capacity in the network to deliver services to the necessary endpoints during the operational windows for those services.

So how do we demonstrate service availability?  There are two levels of visibility required:
  • The first is for the user of the service, the end customer.  They need to be able to determine that the service is available when it needs to be, from the places it is needed.  This might be collated from the service monitoring described above, and is best shown very simply, perhaps as a RAG status per location on a dashboard - see the sketch after this list.
  • The second is for the engineering team responsible for supporting the environment; this would take the form of more traditional network monitoring, looking at the uptime of devices, the utilisation of links and so on.
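As a rough sketch of that first, user-facing view, the snippet below rolls hypothetical per-location probe results up into the kind of RAG status a dashboard might display:

```python
# Hypothetical per-location probe outcomes for each key service:
# True = probe succeeded within its threshold, False = failed or too slow.
probe_results = {
    "warehouse-01": {"picking-app": True,  "back-office": True},
    "warehouse-02": {"picking-app": False, "back-office": True},
    "warehouse-03": {"picking-app": False, "back-office": False},
}

def rag_status(services: dict[str, bool]) -> str:
    """Collapse per-service results for a location into a RAG status."""
    failures = sum(1 for ok in services.values() if not ok)
    if failures == 0:
        return "GREEN"
    if failures < len(services):
        return "AMBER"   # partial degradation at this location
    return "RED"         # no key service working at this location

for location, services in sorted(probe_results.items()):
    print(f"{location}: {rag_status(services)}")
# warehouse-01: GREEN
# warehouse-02: AMBER
# warehouse-03: RED
```

The point of keeping this view so coarse is that the end customer only needs to know whether their service works where they are; the detail lives in the engineering view.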

Ultimately, the user of a network service is not concerned with the mechanics of how the service is delivered; they simply need to consume it.  So long as the service is available in the locations where it is required, at the times it is needed, things such as network device uptime or HA failover do not concern the end user.  SLAs and reporting should be built to reflect the needs of each audience.

As usual, really interested to hear your thoughts on this!

Previous> Introduction
Next> Scalability
