STP and Graph Theory

Spanning Tree Protocol (STP) is one of several methods of avoiding loops in bridged local area networks. For most of us, working with the typical corporate LAN, STP is the default configuration of switches and it just works. But it may not always be the most efficient way to connect switches. This post looks at exactly what a spanning tree is, why it is used to solve the problem of loops in a network, and why it may not always be the most efficient way to do that.

Spanning Tree Protocol

When we build a typical bridged local area network, we add multiple physical paths between different LAN’s. If we did not, we would have a single point of failure. But IEEE 802 frame-based networks (like 802.3 Ethernet) cannot tolerate a switching loop. With a loop, a single frame would be retransmitted endlessly. And if a frame had a broadcast destination, the number of copies in circulation would rise exponentially, since each received instance of the frame would be flooded again to every other port.

So we need a dynamic method or protocol to break any logical loops and ensure that there is only a single logical path between LAN’s at any one moment. If a path breaks, then the protocol needs to reconfigure the logical network to open a previously closed alternate path.

The original Spanning Tree protocol was published in 1984 by Radia Perlman, working at the time for Digital Equipment Corporation (DEC). It was adopted and absorbed into IEEE 802.1D in 1990. It was extended with Rapid Spanning Tree Protocol (RSTP) in 802.1D in 2004, and with Multiple Spanning Tree Protocol (MSTP) in IEEE 802.1Q in 2005. Meanwhile vendors implemented proprietary extensions, such as Cisco Rapid Per-VLAN Spanning Tree Plus (Rapid PVST+). Or they adopted proprietary alternatives.

The clause covering STP (as well as Rapid STP and Multiple STP) in 802.1Q – 2014 is 86 pages long. You will find plenty of sites on the internet telling you how STP works, although these are inevitably a summary or an interpretation. To know how it really works, you need to read the standard, which is available from the IEEE GET Program.

But what is less clear is why it works this way. For that, we need to look at Graph Theory.

Graph Theory

Graph Theory is the branch of mathematics that studies the abstract relationships between objects and the connections between pairs of them. In Graph terminology, the objects are vertices, and the things connecting them are edges.

Illustration: a Petersen graph

The best description, for our purpose, is in the MIT course Mathematics for Computer Science, Chapter 5. "Graphs are ubiquitous in computer science because they provide a handy way to represent a relationship between pairs of objects. The objects represent items of interest such as programs, people, cities, or web pages, and we place an edge between a pair of nodes if they are related in a certain way. For example, an edge between a pair of people might indicate that they like (or, in alternate scenarios, that they don’t like) each other. An edge between a pair of courses might indicate that one needs to be taken before the other."

Graphs have no shape. They are a representation of objects and their connections. A simple graph just has vertices and edges. A weighted graph attaches a value to each edge (a cost, a distance, a mass: anything with quantity). A directed graph gives each edge a direction. Some further definitions:

  • two vertices are adjacent if they are connected by an edge
  • a path is a non-repeating sequence of adjacent vertices and edges
  • two vertices are connected if there is a path from one to the other
  • a cycle is a path that takes you back where you started (a closed path) with no repeated edge
  • a tree is a set of vertices and edges such that there is only a single path connecting any two. If you add an edge, you create a cycle. If you remove an edge, you disconnect one or more of the vertices.
  • a spanning tree is a tree that spans all the vertices in a graph: every pair of vertices is connected by exactly one path, but not every edge is included. A graph can have more than one spanning tree.

The Petersen graph (illustrated above) has 2,000 possible spanning trees.
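
As a quick check of that number, Kirchhoff's Matrix-Tree theorem says the number of spanning trees of a graph equals any cofactor of the graph's Laplacian matrix. Here is a minimal sketch in Python (assuming the networkx and numpy packages are available):

import networkx as nx
import numpy as np

# Matrix-Tree theorem: the number of spanning trees equals any cofactor
# of the Laplacian matrix L = D - A (degree matrix minus adjacency matrix).
G = nx.petersen_graph()
A = nx.to_numpy_array(G)
L = np.diag(A.sum(axis=1)) - A
cofactor = np.linalg.det(L[1:, 1:])   # delete one row and column, take the determinant
print(round(cofactor))                # 2000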

Spanning Trees

Spanning trees connect all the vertices in a graph with the fewest possible edges. There are no loops and no repeats. But a graph can have more than one spanning tree, and any given spanning tree is not necessarily the most efficient way of connecting the vertices. If all edges have the same weight (as in a simple graph), then all spanning trees have the same total weight. But if the edges have different weights, such as a cost or a distance, then some spanning trees are a more efficient way to connect the vertices than others (the sketch after the list below contrasts the two main kinds):

  • A minimum spanning tree connects all the vertices with the minimum total weight of the edges. So, if you were laying a fiber optic cable between towns, this would be an efficient way to do it with the least amount of digging. But it does not give us the shortest path from one specific town to another.
  • A shortest path spanning tree obtains the lowest weight path from a given vertex (the root of the tree) to every other vertex. If you start a journey from one town, this gives the shortest route to every other town. But the same tree does not give the shortest routes if you start from a different town.
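
To make the difference concrete, here is a minimal sketch in Python (using the networkx package and a made-up three-town graph) showing that the minimum spanning tree and the shortest path tree rooted at one town can choose different edges:

import networkx as nx

# Three towns: A-B and B-C cost 1 each; a direct A-C link costs 1.9
G = nx.Graph()
G.add_weighted_edges_from([("A", "B", 1.0), ("B", "C", 1.0), ("A", "C", 1.9)])

# Minimum spanning tree: least total weight of edges
mst = nx.minimum_spanning_tree(G)
print(sorted(mst.edges()))                      # [('A', 'B'), ('B', 'C')], total weight 2.0

# Shortest path tree rooted at A: least cost from A to every other vertex
dist, paths = nx.single_source_dijkstra(G, "A")
spt_edges = {tuple(sorted(p[-2:])) for p in paths.values() if len(p) > 1}
print(sorted(spt_edges))                        # [('A', 'B'), ('A', 'C')]
print(dist)                                     # A: 0, B: 1.0, C: 1.9

The minimum spanning tree uses A-B and B-C for a total cost of 2.0, while the shortest path tree rooted at A prefers the direct A-C edge at 1.9, because that is the cheapest route from A to C.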

IEEE 802.1 Spanning Tree Protocols

With bridged local area networks, the bridges are the vertices and the LAN’s are the edges connecting them. The bridges are usually connected by more than one physical path, for redundancy; or they may be connected in such a way by accident. To avoid a loop we need to disable ports so that there is only one active path between any pair of bridges at any one time. A connected graph with exactly one path between any pair of vertices is a tree, and one that reaches every vertex is a spanning tree. So we need to implement a spanning tree across all the bridges.

The title of Perlman’s paper in 1984 was "An Algorithm for Distributed Computation of a Spanning Tree in an Extended LAN". From the paper:

"This approach [transparent bridging] assumes that the topology is a tree (loop-free). However, requiring a topology to be loop-free means there are no backup paths in the case of bridge or LAN failures. Also, because the technology allows network growth so easily, it might be difficult to prevent someone from adding a bridge and creating a loop. A loop in the topology might cause severe performance degradation in the entire extended network due to congestion caused by infinitely circulating packets. It is undesirable to have a network that can be brought down so easily, merely by plugging a cable into the wrong place."

"Thus we have designed an algorithm that allows the extended network to consist of an arbitrary topology. The algorithm is run by the bridges, and computes a subset of the topology that connects all LANs yet is loop-free (a spanning tree). The algorithm is self-configuring. The only a priori information necessary in a bridge is its own unique ID (MAC address), which we are assuming can be attained in some manner, for instance with a hardware ROM containing the value."

The algorithm also meets several other design goals, including low memory and bandwidth requirements regardless of the number of bridges, stabilisation within a small multiple of the round-trip delay across the network, and completely deterministic behaviour.

This design is carried through to the IEEE 802.1Q standard today, over thirty years later.

Spanning Tree Type

I am interested in what type of spanning tree we obtain with the current IEEE 802.1Q algorithm. The textbook and online guide Algorithms, by Robert Sedgewick and Kevin Wayne, has a good description and illustration of the standard algorithms for deriving a spanning tree of a graph.

The selection of a starting vertex is common to any spanning tree algorithm. After picking a starting point, the next step is to evaluate the edges, to see which to add to the developing tree. The minimum spanning tree algorithms add the next lowest weight edge. The shortest path algorithms add the edge that gives the next lowest total path cost from the root. The IEEE 802.1 STP algorithm uses the root path cost (together with user configurable parameters) to evaluate the vectors it receives, which gives us a shortest path spanning tree algorithm.

However, the standard spanning tree algorithms require somewhere to store the data structures while the tree is constructed. A network of bridges, in an arbitrary topology, in which bridges may be added or removed without pre-configuration, does not have this. Instead, the bridges exchange vectors with their adjacent bridges, discarding those that contain worse information, and replacing their own with any that is better.

This is somewhat similar to the Bellman-Ford algorithm (published in about 1958). This algorithm uses repeated passes, discarding paths that are worse, and ends up with the least cost (shortest) path. The classic formulation is not distributed, but it lends itself to a distributed implementation: it is the basis of distance-vector routing protocols. Each bridge only needs to know which adjacent bridge provides the best path to the root. If it receives a lower cost path, it can discard the one it had previously.
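
Purely as an illustration of that idea (a toy model, not the 802.1Q state machines; the bridge names and port costs are made up), here is a Python sketch of the distributed, Bellman-Ford-like exchange: each bridge repeatedly offers its neighbours the best (root, cost) vector it knows, and replaces its own vector whenever it receives a better one.

# Toy model of distributed root election and root path cost calculation.
# links: pairs of adjacent bridges with a port cost; the lowest bridge ID wins the root election.
links = {("B1", "B2"): 4, ("B2", "B3"): 4, ("B1", "B3"): 19, ("B3", "B4"): 4}

bridges = {b for pair in links for b in pair}
cost = {frozenset(pair): c for pair, c in links.items()}
neighbours = {b: [n for pair in links for n in pair if b in pair and n != b] for b in bridges}

# Each bridge initially believes that it is the root, at cost 0.
best = {b: (b, 0) for b in bridges}

changed = True
while changed:                                   # repeat until no vector improves
    changed = False
    for b in bridges:
        for n in neighbours[b]:
            root, c = best[n]
            candidate = (root, c + cost[frozenset((b, n))])
            if candidate < best[b]:              # lower root ID wins, then lower path cost
                best[b] = candidate
                changed = True

print(best)   # every bridge converges on root "B1" and its own root path cost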

In summary, STP (as well as RSTP and MSTP) has the effect of creating a shortest path spanning tree, although the protocol itself does not claim to do so. The tree is rooted on the bridge with the lowest bridge ID, which is effectively arbitrary unless a root is deliberately configured. The spanning tree is guaranteed to be acyclic (loop free), but it is most definitely not the shortest path from any given bridge to another, or to a gateway, or to a core switch (unless you configure the core switch to be the root bridge).

In a network of arbitrary topology, or a simple hierarchy with an obvious root, STP (and RSTP and MSTP) may be the best way to calculate the spanning tree. But if you have a campus, sites across a metropolitan area, a factory floor or a mesh of sensors, then it may not be.

Two alternatives to STP are:

  • IEEE 802.1Q – 2014 Shortest Path Bridging (SPB): one spanning tree for each entry point to the network
  • IEC 62439-2 – 2010 Media Redundancy Protocol (MRP): a ring protocol with a topology manager and clients

These have to be the subject of a different post.

The Troubleshooting Process

I do a lot of troubleshooting of problems in corporate infrastructure. By that, I mean investigating difficult technical problems in a structured way to find either a solution or a workaround. This post is a few thoughts on the process.

Here is a caricature of the process that often happens. It is imaginary, but I think most people will easily identify real cases that are similar.

  1. A user (or a group of users, for example at one site or using one application) experiences a problem. The help desk tries a few things, not very systematically. Sometimes different people from the help desk try the same thing again. Various updates and changes are made to see if they fix the problem. Sometimes the problem is resolved. Sometimes it goes away. And sometimes it is ignored as "just the way it is until we upgrade xyz".
  2. A user experiences a problem. The user is senior, or the problem affects something that senior people care about. A problem manager takes over the case. The problem manager is someone who co-ordinates actions but does not, themselves, possess the technical skills to resolve the problem. The problem is assigned to whichever team seems most likely e.g. networks, server, storage, application. The team works hard to demonstrate that the fault cannot be with them. The problem continues, with increasing levels of polite acrimony. Eventually a significant change is made which may, or may not, be the cause e.g. a platform component is changed. The problem goes away, but no-one knows what actually caused it.

What is the perfect way? A user experiences a problem. A technical wizard understands every aspect of the problem without needing to be told anything about it, or to do any investigation. The technical wizard knows the solution, whether it is in the network, or server, or storage, or application layer. After some discussion, the fix is implemented and the problem is solved.

This perfect way is absurd, obviously. No-one can know what the problem is until they have seen it, thought about it, asked a few questions, gathered some data. No-one can be expert in enough technologies to know exactly what the fix is, without running some tests, trying a few things.

So we need a process that is not like 1) and 2), but also does not assume there is a perfect way.

First, some context. We already have an awareness of what is normal behaviour and what is not. If a user says that logon is slow, we assess it against a typical logon time. If the behaviour is not normal, then we can assume there is a fault. For the most part (not always!) vendor products do not show obvious faults in isolation. So, if the vendors do not know about this fault, then there must be a fault in our configuration, or an unknown fault in the product. There must be something we can change so that the problem does not occur. Our job is to find what to change.

The way I go about it is divided into five parts. These are not all sequential, but they involve different types of activity:

  1. Incident Handling
  2. Problem Definition
  3. Investigation
  4. Analysis
  5. Resolution.

I am not going to describe each part in detail: just give an idea.

Incident Handling

Being practical, I am going to assume that the great majority of day-to-day help desk calls do not require much rigour. The most important thing is to avoid doing things twice. As soon as we (on the help desk) realise that a problem is not simple, we need to follow a systematic process to gather the incident data in a way that avoids having to do it again.

Problem Definition

As soon as we realise that a fault may be more difficult to resolve than by providing a simple change, we need to prepare a Problem Definition.

Very often I find that, when starting out, it is more difficult to write down an accurate statement of the problem than you might expect. The report might be: "The application crashes". But is that a Not Responding window or an application error? If there is an error, what is the exact message, and what happens when you click on OK? Is there an error shown in the Application Event Log? Does the application log show an error? How often does this occur? How many people are affected? What is the user doing in the moments leading up to the problem? Writing down what is and is not known goes a long way to defining what problem we are trying to solve.

It is not always easy. I have worked on a fault causing Windows to crash with a bug check, where it was difficult to know even whether there was a single problem, or more than one.

Investigation

This is not a distinct step. But I identify it separately because we attempt to gather data systematically to understand the problem better. The more we understand, the more specific the tools we might use to investigate further.

Initially we might configure application debug level logging. Or we might run a network capture. As we narrow it down we might get very specific.

For example, I recently had a problem where Outlook seemed to hang for a few seconds when typing. I first ran a network capture of all traffic to see if, perhaps, a peak was causing a delay in response. There was no significant level of traffic. There was no significant latency. No retransmission. No problems with the network connection in general.

Then I ran a capture only between the Outlook client and the Exchange server (using the relevant ports). I noticed a spike in traffic every 15 minutes, coinciding with the problem. But what was the spike? Was it normal, or a fault?

So then, on the server, I configured Remote Operation (ROP) logging to capture the exact MAPI operations being performed by the client. The problem was caused by a MAPI action to re-read the mail database and refresh the view in Outlook. This occurred exactly when the user experienced the problem.

Analysis

I will mention just two aspects of the analysis.

One is that we must have a mental model of the part of the system we are examining, and all its components. We then need to rule in or out the different components we think could be involved. Often I am told the problem is in Citrix. When I investigate, it is not in Citrix at all. It is in, for example, the configuration of the application on Citrix. Or, in one very difficult case, it was port exhaustion in an unpatched Windows server. Or it might be a configuration error in Citrix after all.

The second is conjecture. I use the term to mean that we need to develop an idea of a single cause of all the different symptoms we have found in the investigation. I don’t mean a guess. For example, sometimes people will say "I think it must be the anti-virus". This is a guess. But a conjecture would be: "A Windows driver fault is occurring in the I/O stack. It is a type of fault that does not appear in testing using Driver Verifier". This is exactly what the problem was, but it was extremely difficult to find.

Resolution

An interesting aspect of troubleshooting is that the final part is often really easy. Often, I don’t need to be involved any further, once the exact cause is identified.

You might think that, after investigation and analysis, we may have a good idea but we cannot be certain. Of course, nothing is ever entirely certain. But going back to what I said about the context, the problem must be caused by a specific fault; otherwise the behaviour would be normal. When you find the cause, you just know this is it.

IEEE 802.1Q Bridges and Bridged Networks

I have been working recently on the design for an industrial network. The network runs through a physical tunnel, and so requires a ring topology to provide resilience against failure. Because the network may incorporate switches from different vendors I decided to read the current standard for this, the IEEE standard 802.1Q – 2014. This blog post aims to break down the idea of a bridged network to the lowest level, to reflect accurately what is in the standard.

IEEE 802.1Q – 2014

IEEE 802.1Q – 2014 is the current industry standard for Bridges and Bridged Networks. It includes the standards for bridge (or switch) operation, spanning tree protocols to prevent loops, and other aspects of bridging. The standard is a dense document. It is 1768 pages long. It defines 266 terms, not including terms defined by other standards. And it employs 247 acronyms or abbreviations, from ACK (acknowledgement) to VTID (VSI Type Identifier), where VSI is the Virtual Station Interface.

The standard specifies the operation of bridges in Clause 3: Introduction "for the purpose of compatible interconnection of information technology equipment using the IEEE 802 MAC Service supported by interconnected IEEE 802 standard LAN’s using different or identical media access control methods".

It does this by defining the protocols for different aspects of operation, so that vendors can implement them and obtain conformance. It is not intended to provide an explanation or to give examples.

So, by my estimation, it would take at least 5,000 pages to provide an accurate and full description of the standard. It would require far more to provide the specifics in terms of linked standards, like 802.3 Ethernet. This is why, when you read vendor documentation, you are reading a summary or a description, rather than an accurate definition of the protocols. You will find a great many inaccurate, misleading or incomplete descriptions in various documentation. What follows is a breakdown of the core components of the standard that, I hope, makes it easier to understand other documentation accurately.

IEEE 802 Networks

IEEE 802 is a family of standards for frame-based networks. The family is described in IEEE 802 – 2014 Overview and Architecture. From Clause 4.1 Key Concepts:

"IEEE 802 networks use frame-based communications over a variety of media to connect various digital apparatus regardless of computer technology and data type."

"The basic communications capabilities provided by all IEEE 802 standards are frame based with source and destination addressing and asynchronous timing. In a frame-based system, the format is a variable-length sequence of data octets. By contrast, cell-based communication transmits data in fixed-length units in specified time intervals while isochronous communication transmits data as a steady stream of octets, or groups of octets, at equal time intervals."

"An IEEE 802 LAN is a peer-to-peer communication network that enables stations to communicate directly on a point-to-point, or point-to-multipoint, basis without requiring them to communicate with any intermediate stations that perform forwarding or filtering above the PHY [physical layer]."

It is really quite remarkable that the same family of coherent standards has governed local area networking from the early 10 Mbps Ethernet over coax (802.3 – 1983), to 100 Gbps Ethernet over optical fiber (802.3 – 2015), to Wireless (802.11) and Bluetooth (part of Personal Area Networks in 802.15) today. A key point is that the standard for Bridges and Bridged Network operates with all of them.

Media Access Control Service

IEEE local area networks use a shared medium, like a cable, to transmit and receive frames of data. The Media Access Control (MAC) Service is the service that controls access to the medium, so that signals do not collide. If they were to collide, the resulting signal would be garbage.

Clause 6.2 Provision of the MAC Service "The MAC Service provided in end stations attached to MAC Bridged Networks and Virtual Bridged Networks is the (unconfirmed) connectionless mode MAC Service defined in IEEE Std 802.1AC. The MAC Service is defined as an abstraction of the features common to a number of specific MAC Services".

The definition of the MAC Service comes from IEEE 802.1AC – 2016: Media Access Control (MAC) Service Definition.

Clause 7.2 "The primitives of the MAC Service comprise a data request and a corresponding data indication, each with MAC destination address, MAC source address, a MAC Service Data Unit (MSDU) comprising one or more octets of data, and priority parameters. Taken together these parameters are conveniently referred to as a frame".

A MAC Service User makes a connection to the medium via a Service Access Point (SAP). The implementation of a SAP is what we usually call a port.

Clause 7.4 "The term port is used to refer to the interface stack for a given SAP. Often the interface stack comprises a single protocol entity attached to a single Local Area Network (LAN), and port can be conveniently used to refer to several aspects of the interface stack, including the physical interface connector for example."

The protocol requires no negotiation or set-up between endpoints. The MAC Service User simply transmits the frame over the medium via the port.

Clause 7.8 "The MAC Service supported by an IEEE 802 LAN provides connectionless connectivity, i.e., communication between attached stations occurs without explicit prior agreement between service users."

Clause 14. "An MSDU transmitted using MAC connectionless-mode transmission is not considered by the MAC Service provider to be related in any way to any previously transmitted MSDU…The MAC Service provider is not required to maintain state information for flow control between specific combinations of MSAPs."

Here is an illustration of the MAC service, from 802.1AC – 2016:

Figure 7.1 MAC entities, the MAC Service, and MAC Service users (clients).

Local Area Network

Colloquially, the Local Area Network (LAN) usually refers to the whole collection of cables and switches on a site (i.e. a local as distinct from a wide area network). But the exact meaning of LAN in the IEEE 802.1Q standard is a single segment of a shared medium (for example a cable). The definition of a LAN is given here:

Clause 3.94: " The term “Local Area Network” and the abbreviation LAN are used exclusively to refer to an individual LAN specified by a MAC technology, without the inclusion of Bridges. This precise use of terminology within this specification allows a Bridged Network to be distinguished from an individual LAN that has been bridged to other LANs in the network (a bridged LAN). In more general usage, such precise terminology is not required, as it is an explicit goal of this standard that Bridges are transparent to the users of the MAC Service".

The correct term for the collection of cables and switches on a site is a Bridged Network or a Bridged LAN.

This definition also highlights that different LAN’s can use different media access technologies, but only one technology on one LAN. Ethernet has become so dominant that it is easy to forget the other standards, like Token Ring. But the standards for Bridges and Bridged Networks do not depend on using Ethernet, and do not require the same MAC technology on different LAN’s.

A single LAN, in principle, originally supported multiple stations (PC’s, printers, servers) communicating directly with each other. But in practice we now usually have only one end station on a dedicated LAN, together with its switch port. We also have a LAN connecting two ports on separate switches. This is the accurate definition, although we commonly use the term differently.

Bridge

A common question seems to be: "what is the difference between a bridge and a switch?". The answer is sometimes given that, while bridge is the technical term, vendors call them switches for marketing purposes. This is not exactly correct.

The definition of the term in IEEE 802.1Q – 2014 is the following:

Clause 3.22 Bridge: "A system that includes Media Access Control (MAC) Bridge or Virtual Local Area Network (VLAN) Bridge component functionality and that supports a claim of conformance to Clause 5 of IEEE Std 802.1Q-2014 for system behavior."

This is a tautology, so not much help. However, the main function of a bridge is to relay or filter frames between two ports.

Clause 8.1 Bridge operation: "The principal elements of Bridge operation are:

  a) Relay and filtering of frames (8.1.1).
  b) Maintenance of the information required to make frame filtering and relaying decisions (8.1.2).
  c) Management of the above (Clause 12)."

So we could say that the best description of a bridge is a network component that relays or filters frames, among other related functions. But relay between what?

Clause 8.1.1 Relay: "A Bridge relays individual MAC user data frames between the separate MACs of the individual LANs connected to its Ports."

So a bridge is the connecting function between ports. Each port receives all the signals on its LAN, and so the bridge connects the LAN’s attached to those ports. The term "switch" does not occur in 802.1Q. A Layer 2 switch is simply a vendor implementation of multiple bridged ports.

If you have a PC attached to a port on an access switch, and a server attached to a port on a core switch in the datacentre, then:

  • a frame is transmitted by the PC over its LAN
  • received by the port on the access switch and relayed to another port
  • re-transmitted over the LAN between the bridge ports on different switches
  • then, after a few more relays and re-transmissions, re-transmitted over the LAN of the server where it is received.

VLAN

We think of a Virtual LAN (VLAN) as a subset of the whole network, identified by the same VLAN ID (e.g. all the ports allocated to VLAN 20). But, if the LAN is one segment of a MAC medium only, then the Virtual LAN sounds as though it should be a subset of one segment. As there is normally only one end station and one switch port on a LAN, what does that mean?

Clause 3.258 Virtual Local Area Network (VLAN): "The closure of a set of Media Access Control (MAC) Service Access Points (MSAPs) such that a data request in one MSAP in the set is expected to result in a data indication in another MSAP in the set."

This is a strange definition. The VLAN is a closed subset of MAC Service Access Points. A MAC Service Access Point (MSAP) is referred to as a port. But there is no entity defined as the set of all such access points.

A VLAN cannot be a Virtual Bridged Network, because that refers to a network with multiple VLAN’s.

3.254 Virtual Bridged Network: "A concatenation of individual IEEE 802 Local Area Networks (LANs) interconnected by Bridges, including Virtual Local Area Network (VLAN) Bridges."

This means that a VLAN is not a virtual instance of a LAN, as you might expect from the name. Instead it is a subset of ports in a Virtual Bridged Network. This is the way we use the term, but it is not the literal meaning of Virtual LAN.

Relay Function

The principal element of Bridge operation is to relay or to filter (that is, not relay) frames. A Bridge relays frames between the individual LAN’s connected to its Ports.

Here is an illustration of the relay function, showing the MAC Service, LAN, Bridge and VLAN components that I described above.

IEEE 802.1Q – 2014 Figure 8.2 VLAN Bridge architecture.

In the figure above:

  • Two separate LAN’s are connected by a bridge component
  • The relay function occurs between two ports
  • The MAC Service (shown as MS) is the tower of functions on a single port
  • At the bottom of the MAC Service the functions are specific to the media access method being employed, like Ethernet or Token Ring
  • At the top of the MAC Service the functions are independent of the specific access method. This is the Internal Sublayer Service (ISS)
  • The Extended ISS (EISS) is where adding and removing VLAN tags takes place. It is an extension because some bridges are VLAN-aware and others not.
  • The MAC Relay Entity is what does the forwarding or filtering of frames between ports
  • Higher layer entities manage functions across more than two ports. Monitoring port status is an obvious example.

Filtering Database

The basic architecture of IEEE 802 networks is a distributed one. There is no overall controller. No configuration of a bridge is required to enable it to operate on the network. The bridge is transparent to the end stations. A bridge operates successfully without configuration because it follows a set of protocols that are automatically compatible with other bridges doing the same.

So how does a switch know where to send the frame? There is no route, or addressing scheme, to use. The unique ID of a MAC Service entity, the MAC address, does not contain addressing information. It is not really an address! It is like having a unique social security number, but no address.

When an end station transmits a frame, it includes in the frame the source MAC address (its own) and the destination MAC address (which it discovered by another process). It has no idea of where the destination is, or how to reach it. It simply drops the frame onto the LAN.

A bridge port attached to the LAN listens to all the frames. The port learns the MAC address(es) of the device(s) on its LAN, and records them in the bridge’s Filtering Database (FDB). As each port does this, the FDB grows to contain the MAC addresses of all the devices on all the LAN’s attached to the Bridge, and which port they are attached to.

This is illustrated here:

IEEE 802.1Q – 2014 Figure 8.5 Observation of network traffic

When a station transmits a frame to another station on the same LAN, the bridge port does nothing except to record the source addresses. When a station transmits a frame to a station that is not on the same LAN, then the bridge port will relay the frame. The bridge looks for the destination MAC address in the FDB, sees which port it is associated with, and relays the frame to that port. The port then drops the frame onto the LAN attached to that port. The destination device recognises its own MAC address in the frame, and receives it.

Clause 8.7 The Learning Process "The Learning Process receives the source MAC addresses and VIDs, or only the source MAC addresses in the case of VLAN-unaware MAC Relays, of received frames from the Forwarding Process, subject to active topology enforcement (8.6.1) and the application of ingress filtering (8.6.2)." "When invoked, the Learning Process shall create or update a Dynamic Filtering Entry (8.8.3) that specifies the reception Port for the frame’s source address and, in the case of VLAN Bridge components, the frame’s VID".

Entries in the FDB are aged out, with a default time of 5 minutes.

Clause 8.7.3 Ageing of Dynamic Filtering Entries "Dynamic Filtering Entries shall be automatically removed after a specified time, the Ageing Time, has elapsed since the entry was created or last updated by the Learning Process. The ageing out of Dynamic Filtering Entries ensures that end stations that have been moved to a different part of the network will not be permanently prevented from receiving frames. It also takes account of changes in the active topology of the network that can cause end stations to appear to move from the point of view of the Bridge; i.e., the path to those end stations subsequently lies through a different Bridge Port."

The FDB is then used to determine to which port frames are relayed.

Clause 8.8 The Filtering Database (FDB) "The FDB supports queries by the Forwarding Process to determine whether received frames, with given values of, destination MAC address, and for VLAN Bridge components, VID, are to be forwarded through a given potential transmission Port". "The FDB contains filtering information in the form of filtering entries that are either a) Static, and explicitly configured by management action; or b) Dynamic, and automatically entered into the FDB by the normal operation of the Bridge and the protocols it supports."

Flooding

Since a MAC address, and its association with a port, is only known in the FDB when that station has transmitted a frame, we need a mechanism for finding a station when it has not yet transmitted. This is done by flooding the frame to every port, except the port the frame came from. Flooding is not a specific operation of a bridge. The terms "flood" or "flooding" are used to describe the result of not filtering.

When the destination MAC address is not held in the FDB, the frame is not filtered and so every other port drops the frame onto its attached LAN. When the frame eventually reaches the end station with the destination address of the frame, it recognises the address and responds. When it responds, its address is captured by the bridge port to which it is attached, and stored in the FDB.

If one of the bridge ports is connected to another bridge port (for example connecting two switches), then each of those bridge ports will see all the traffic coming from the other. The FDB of each switch will contain all the source MAC addresses for the entire network arriving by that path.
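
To pull together the learning, ageing, relaying and flooding behaviour described above, here is a minimal sketch in Python of a VLAN-unaware MAC relay. It is my own toy illustration, not the standard's specification, and the class and method names are made up.

import time

AGEING_TIME = 300           # seconds: the default ageing time in 802.1Q

class LearningBridge:
    def __init__(self, ports):
        self.ports = ports
        self.fdb = {}       # source MAC address -> (port it was learned on, time last seen)

    def receive(self, in_port, src_mac, dst_mac):
        now = time.time()
        # Learning: create or refresh the dynamic entry for the source address.
        self.fdb[src_mac] = (in_port, now)
        # Ageing: discard dynamic entries that have not been refreshed within the ageing time.
        self.fdb = {m: (p, t) for m, (p, t) in self.fdb.items() if now - t < AGEING_TIME}
        # Forwarding: relay to the learned port, filter if it is on the same LAN,
        # or flood to every other port if the destination is unknown.
        entry = self.fdb.get(dst_mac)
        if entry is None:
            return [p for p in self.ports if p != in_port]     # flood
        learned_port, _ = entry
        return [learned_port] if learned_port != in_port else []

bridge = LearningBridge(ports=[1, 2, 3, 4])
print(bridge.receive(1, "aa:aa", "bb:bb"))   # destination unknown: flooded to ports 2, 3, 4
print(bridge.receive(2, "bb:bb", "aa:aa"))   # bb:bb learned on port 2; aa:aa is known: [1]
print(bridge.receive(1, "aa:aa", "bb:bb"))   # now relayed only to port 2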

Loop Prevention

We all know that it is undesirable to have an active loop in a local area network. It is not easy to describe exactly why it is undesirable. The term "broadcast storm" is sometimes used, but that is not an accurate description of the problem. In fact, the standard does not define the problem, although it spends a great deal of time solving it.

The standard aims to maintain the Quality of Service (QoS) by preventing frame duplication.

Clause 6.5.4 Frame duplication "The MAC Service (IEEE Std 802.1AC) permits a negligible rate of duplication of frames. The operation of Bridges introduces a negligible rate of duplication of user data frames. The potential for frame duplication in a bridged network arises through the possibility of the following:

  a) Repeated transmission, through a given Bridge Port, of a frame received on another Bridge Port;
  b) Multiple paths between source and destination end stations;
  c) A loop in a path between source and destination stations."

"When Bridges in a network connect individual LANs in such a way that physical topology is capable of providing multiple paths between any source and destination, a protocol is required to ensure that the active topology comprises a single path."

Frame duplication would happen if a frame were sent out from one port, and relayed back to another port. In this case, the frame would again be sent out. If the frame had more than a single destination (such as a multicast or broadcast frame), then it would be relayed back by multiple ports. Each returning frame would be transmitted, causing an exponential increase in traffic.
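
As a toy illustration of that exponential increase (my own simulation, not anything from the standard), take four bridges connected in a full mesh with no spanning tree, and let each bridge flood every broadcast copy it receives out of all its other ports:

# Four bridges in a full mesh; each floods a broadcast out of every port
# except the one it arrived on. Count the copies in flight each round.
bridges = ["B1", "B2", "B3", "B4"]

# A copy in flight is (sending bridge, receiving bridge).
in_flight = [("B1", "B2"), ("B1", "B3"), ("B1", "B4")]   # B1 floods one broadcast frame

for round_no in range(1, 6):
    in_flight = [(receiver, neighbour)
                 for sender, receiver in in_flight
                 for neighbour in bridges
                 if neighbour not in (sender, receiver)]
    print(round_no, len(in_flight))   # 6, 12, 24, 48, 96: doubling every round

Every received copy spawns two more, so the number of frames in flight doubles every round; in a simple ring the copies do not multiply in this way, but they still circulate forever.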

A large part of the 802.1Q standard is devoted to preventing this from happening, through Spanning Tree Protocols (STP) and Shortest Path Bridging (SPB).

Containers in the Enterprise

This post is a few thoughts about Windows Containers and the impact they are likely to have on enterprise IT infrastructure.

Containers are very new to Windows. Although they have been around for a while in Linux, it is still not that long in terms of infrastructure technologies. In Windows, they were introduced in Windows Server 2016; they are still at a very early stage of maturity. I expect them to make a big impact.

Here are the best two resources that I know of for a general discussion around containers:

This post is specifically around Windows containers, and their possible impact on enterprise IT infrastructure. I am not a great believer in trying to predict technology. I am just interested to consider how much effort we should put into understanding containers and perhaps starting to use them for IT services in the enterprise.

Containers

If you are already working with containers you can skip this. If you are not working with containers, and you want to know roughly what they are so that you can appreciate the impact, then this is all you really need to know.

Containers are just another form of task virtualisation.

  • A virtual machine runs an OS kernel on virtualised hardware
  • A virtual application runs a process on a virtualised file system
  • A container runs a session on a partitioned kernel.

From the point of view of enterprise infrastructure, the key attributes of a container are:

  1. No boot time. The kernel has already booted up on the hardware. The container starts and stops as fast as a session logon would.
  2. No persistence. There is no dedicated storage, so nothing persists outside the container.
  3. Because of 1 and 2, a different model for scalability.
  4. Efficiency, because the same OS kernel is shared between containers.
  5. Isolation between different containers, but not between the container and the OS.

One of the most interesting things about containers on Windows is how they use or extend existing technologies:

Docker provides the commands to operate containers, but the underlying technology that creates and runs containers on Windows is Windows. This makes containers on Windows a very large scale robust solution for compute, storage and networking infrastructure.

There are also some limitations or features of containers specifically on Windows:

  • no Active Directory domain membership, or Group Policy
  • no user interface processes: no Explorer; no desktop; no GUI
  • no kernel mode device drivers (like AV or encryption)
  • two options for the base image: Windows Server Core; or Nano Server.

Using Containers

For a developer, on Windows or Linux, containers are a wonderful thing. You can create a complex infrastructure very quickly. You can create different versions of the infrastructure. You can add or remove infrastructure components as needed. You can do it on a micro scale; then scale it up to production size with exactly the same images and deployment tools. You can easily do it with both Windows and Linux together. There is no dependency on the infrastructure team. They will only need to provide hosts, not install and configure applications. If you design with a new micro-service architecture, you can scale up and out just by adding replicas.

But I think there are a number of problems with implementing current general purpose enterprise applications in containers. To summarise, most enterprise applications have evolved over ten or twenty years in parallel with the three tier application model running on Windows servers using Windows services. There is little benefit to implementing the same model in containers:

  1. Many services use Active Directory as the AAA provider. They assume the servers are part of a domain, both for administrators and users.
  2. Most services already have a scalability and availability model, for example based on Windows clusters, or load balanced web front ends.
  3. Most services can already be partitioned. For example, IIS already runs distinct sites with their own application pool. SQL Server already runs distinct instances with their own security context.
  4. Services are often closely coupled with data. For example it would make no sense to run a DNS or DHCP service as an ephemeral instance with non-persistent data.
  5. Virtual machines already pool the resources of the hardware. There is certainly a benefit in reducing the number of instances of the kernel, but I don’t know if the benefit would be sufficient to merit the change.

I see containers more as part of a trend away from enterprise infrastructure altogether. In many medium sized enterprises, at least, the foundation of enterprise infrastructure was Active Directory, Exchange, and SharePoint. When you already run Exchange in-house, then it makes sense to prefer other applications to run on Windows Server, with IIS for the front end, SQL Server for the back end, and Active Directory for AAA. Now this is moving to Office 365, and not for economic reasons. The primary reason in my experience, and often unstated, is an organisational desire to move away from the complexity of running IT services.

Once you have Office 365 instead of on-premise Exchange, then it makes sense increasingly to use SaaS services. It is all about marginal costs. If you already have a Windows infrastructure, then the marginal cost of running something like SAP on-premise is lower than it would be if you had no existing Windows infrastructure. The more services move to SaaS, the higher the marginal cost of running on-premise Windows infrastructure.

For an enterprise of any scale, some part of the enterprise is already going to be remote from the datacentre. As long as the SaaS service is located less than about 50 milliseconds away from the user, and provided the application is designed for thin client, then there is no difference in the intrinsic cost or performance.

Once the balance of enterprise software moves to SaaS, then the dominant three tier architecture for enterprise applications is no longer necessary, or even suitable. SaaS applications are by definition multi-tenant, multi-site, parallel, and continuous. Interestingly, Microsoft has moved first to what it calls Service Fabric, and only secondly to Containers in Service Fabric. The real architectural change is in Service Fabric.

On an even larger scale, if you move away from the traditional (only twenty years old!) enterprise infrastructure, you also move away from the divide between Windows and Unix/Linux OS. As an enterprise, you don’t know or care about the OS that a SaaS application runs on. As a developer you can use any language on any OS for any component of the architecture. Microsoft becomes a cloud infrastructure services vendor, and no longer a "Windows" vendor. We can see this already with Linux containers on Hyper-V and Linux hosts in Service Fabric.

Docker Overlay Network Details

Docker Swarm uses an overlay network for communication between containers on different hosts, and for load balancing incoming traffic to a service. On Windows Server 2016 before Windows Update KB4015217 this overlay network is not supported. After KB4015217 the communication between containers works, but the routing mesh that load balances incoming traffic is not supported. Now with Windows Server 2016 version 1709 the routing mesh works as well. The purpose of this post is to take an in-depth look at how the overlay network and the routing mesh work in practice.

Testing environment

This is my environment for testing:

  1. Two hosts with Windows Server 2016 version 1709 on the same vnet in Azure
  2. Both hosts with the Hyper-V role and the Windows Containers feature
  3. Both hosts running experimental Docker 17.10
  4. A Docker Swarm service with three containers, running the image microsoft/iis:windowsservercore-1709, with a published port 80
  5. A third host running Portainer and the new Project Honolulu server management gateway.

I tested beforehand that I can reach any container on any host, on port 80, from an external client. I also tested that I can ping and telnet between containers.

Theory

The Docker documentation describes how this works on Linux: Designing Scalable, Portable Docker Container Networks. Containers are assigned to a Virtual Extensible LAN (VXLAN) and traffic between containers on different hosts is encapsulated in UDP packets on port 4789. The routing mesh is implemented by Linux IP Virtual Server (IPVS) layer 4 switching.

On Windows, it is a bit more difficult to piece together the documentation. This is because containers on Windows are just part of a swathe of Azure, Hyper-V and Windows technologies.

Software Defined Networking (SDN) on Windows comes from implementing multi-tenant architectures in Azure, where VM’s on different hosts, in different datacentres, need to communicate securely and in isolation from other tenants. This is not very different from containers in different Swarm services communicating with each other but not with other services.

VXLAN is a generic standard documented in RFC 7348. There are a lot of different diagrams of VXLAN, but basically a Layer 2 frame between containers on different hosts is encapsulated in a UDP packet and sent across the host network.

Implementation

When we initialise the Docker Swarm, a default overlay network is created, called “ingress”. We can see this with docker network ls.

NETWORK ID     NAME      DRIVER    SCOPE
xio0654aj01a   ingress   overlay   swarm
5bcf2a6fe500   nat       nat       local
cef0ceb618b6   none      null      local

This is in addition to the default NAT network created when we add the Containers feature. With docker network inspect ingress we can see the details of this network:

  • It has an ID of xio0654aj01a6x60kfnoe4r12 and a subnet of 10.255.0.0/16
  • Each container on the network has: an endpoint ID; an IP address on the subnet, and a unique MAC address
  • Each node has one ingress-endpoint, again with: an endpoint ID; an address and a MAC address.
"ConfigOnly": false,
"Containers": {
"206fe3c22aa9682f6db7c0ff2d2665ea647d2d2825218a9a1a6ee6bda4c80de7": {
"Name": "web.2.03uu9bab6n416jqi0reg59ohh",
"EndpointID": "136a5e8a952b7bc3da6b395e9ff3fb138cd93c97e3fafda1299f804f9cbe2bf1",
"MacAddress": "00:15:5d:71:af:d8",
"IPv4Address": "10.255.0.6/16",
"IPv6Address": ""
},
"92d6b5d2c353d43dad6e072e25865bdf91003b069fd3a527d953b9a62384f0a0": {
"Name": "web.3.nzxp6uhcvxhejp2iodd29l3gu",
"EndpointID": "b1937b9d22d2aa9881d0e45b16bc7031b2d4d07d4d0059531d64a6ade5a5242e",
"MacAddress": "00:15:5d:71:a4:c5",
"IPv4Address": "10.255.0.7/16",
"IPv6Address": ""
},
"ingress-sbox": {
"Name": "ingress-endpoint",
"EndpointID": "7037a8b3628c9d5d49730472c37a800e4d1882f0cb125ec75e75477c02104526",
"MacAddress": "00:15:5d:71:a7:dd",
"IPv4Address": "10.255.0.2/16",
"IPv6Address": ""
}
},

In this case there are two containers on the host. If we look on the other host, we see the third container (of three replicas in the service) and a different endpoint.

We can also see the ingress network, the web service and the containers in Portainer, a simple management GUI for containers:

Docker Network Ingress

If we look inside a container, with docker exec -it web.2.03uu9bab6n416jqi0reg59ohh powershell and ipconfig /all, we can see that the endpoint ID appears as the name of the container’s NIC, and the IP address and MAC address also belong to this NIC:

Ethernet adapter vEthernet (136a5e8a952b7bc3da6b395e9ff3fb138cd93c97e3fafda1299f804f9cbe2bf1):
Connection-specific DNS Suffix . : nehng5n4bb2ejkdqdqbqdv4dxe.zx.internal.cloudapp.net
Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #5
Physical Address. . . . . . . . . : 00-15-5D-71-AF-D8
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
Link-local IPv6 Address . . . . . : fe80::7dfd:d3f7:6350:759d%32(Preferred)
IPv4 Address. . . . . . . . . . . : 10.255.0.6(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.0.0
Default Gateway . . . . . . . . . : 10.255.0.1
DNS Servers . . . . . . . . . . . : 10.255.0.1
168.63.129.16
NetBIOS over Tcpip. . . . . . . . : Disabled

To see how the ingress network is implemented, we need to look at the host networking configuration. With Get-VMSwitch we can see that there is a Hyper-V virtual switch with the same name as the Docker ingress network ID:

Name                        SwitchType   NetAdapterInterfaceDescription
----                        ----------   ------------------------------
nat                         Internal
xio0654aj01a6x60kfnoe4r12   External     Microsoft Hyper-V Network Adapter #5

With Get-VMSwitchExtension -VMSwitchName xio0654aj01a6x60kfnoe4r12 we can see that the switch has a Microsoft Azure VFP Switch Extension:

Id : E9B59CFA-2BE1-4B21-828F-B6FBDBDDC017
Name : Microsoft Azure VFP Switch Extension

If we do ipconfig /all on the host we see two network adapters. The primary host network adapter:

Ethernet adapter vEthernet (Ethernet 5)

and an adapter attached to the Docker NAT network:

Ethernet adapter vEthernet (nat)

But if we run Get-NetAdapter we see three:

Name                     InterfaceDescription                   ifIndex   Status   MacAddress          LinkSpeed
----                     --------------------                   -------   ------   ----------          ---------
vEthernet (Ethernet 5)   Hyper-V Virtual Ethernet Adapter #2    16        Up       00-22-48-01-00-03   40 Gbps
vEthernet (nat)          Hyper-V Virtual Ethernet Adapter       3         Up       00-15-5D-6A-D6-E2   10 Gbps
Ethernet 5               Microsoft Hyper-V Network Adapter #5   11        Up       00-22-48-01-00-03   40 Gbps

The extra one, named “Ethernet 5” with Interface Description “Microsoft Hyper-V Network Adapter #5”, with the same MAC address as the primary host adapter, and with no IP address, is the ingress endpoint on the overlay network.

We can see this in the Project Honolulu browser-based server manager.

The adapters:

Honolulu Docker1 Adapters

The Hyper-V ingress network switch:

Honolulu Docker1 Ingress Switch

Trace: incoming

I previously did a trace of the traffic, first into a container from a remote client and second, between containers. With Microsoft Message Analyzer we can see what happens.

Here is the flow of an HTTP request on port 80 from a remote client to one of the swarm nodes, and load balanced to a container on the same host.

In the first message a TCP packet arrives at the IP address of the host adapter:

MessageNumber | DiagnosisTypes | Timestamp | TimeDelta | EventRecord.Header.ProcessId | EventRecord.Header.ThreadId | Module | Summary
3526 | None | 2017-11-08T17:02:15.9863839 | 0.1872225 | 0 | 0 | TCP | Flags: ......S., SrcPort: 53711, DstPort: HTTP(80), Length: 0, Seq Range: 1862583515 - 1862583516, Ack: 0, Win: 65535(negotiating scale factor: 3)

In the second message, the packet is received by the Hyper-V switch for the overlay network:

MessageNumber | DiagnosisTypes | Timestamp | TimeDelta | EventRecord.Header.ProcessId | EventRecord.Header.ThreadId | Module | Summary
3527 | None | 2017-11-08T17:02:15.9863920 | 0.0000081 | 0 | 0 | Microsoft_Windows_Hyper_V_VmSwitch | NBL 0xFFFF880A4600B370 received from Nic /DEVICE/{DAB8937D-9AD5-460E-8652-C2E152CCE573} (Friendly Name: Microsoft Hyper-V Network Adapter #5) in switch A404BC57-741B-4C79-8BA5-1D7D3FDA92C1 (Friendly Name: xio0654aj01a6x60kfnoe4r12)

In the third message the packet is routed to the container adapter:

MessageNumber | DiagnosisTypes | Timestamp | TimeDelta | EventRecord.Header.ProcessId | EventRecord.Header.ThreadId | Module | Summary
3591 | None | 2017-11-08T17:02:15.9865906 | 0.0000022 | 0 | 0 | Microsoft_Windows_Hyper_V_VmSwitch | NBL 0xFFFF880A492B1030 routed from Nic 533EF66B-A5F3-4926-A1EE-79AF499F85C7 (Friendly Name: Ethernet 5) to Nic F3EA5A0C-2253-472F-8FFA-3467568C6D00 (Friendly Name: 136a5e8a952b7bc3da6b395e9ff3fb138cd93c97e3fafda1299f804f9cbe2bf1) on switch A404BC57-741B-4C79-8BA5-1D7D3FDA92C1 (Friendly Name: xio0654aj01a6x60kfnoe4r12)

In the fourth message, the packet is received by the container adapter:

MessageNumber | DiagnosisTypes | Timestamp | TimeDelta | EventRecord.Header.ProcessId | EventRecord.Header.ThreadId | Module | Summary
3592 | None | 2017-11-08T17:02:15.9865932 | 0.0000026 | 0 | 0 | Microsoft_Windows_Hyper_V_VmSwitch | NBL 0xFFFF880A492B1030 delivered to Nic F3EA5A0C-2253-472F-8FFA-3467568C6D00 (Friendly Name: 136a5e8a952b7bc3da6b395e9ff3fb138cd93c97e3fafda1299f804f9cbe2bf1) in switch A404BC57-741B-4C79-8BA5-1D7D3FDA92C1 (Friendly Name: xio0654aj01a6x60kfnoe4r12)

And in the fifth message the first packet is delivered:

MessageNumber | DiagnosisTypes | Timestamp | TimeDelta | EventRecord.Header.ProcessId | EventRecord.Header.ThreadId | Module | Summary
3593 | None | 2017-11-08T17:02:15.9866168 | 0.0000236 | 4 | 288 | TCP | Flags: ......S., SrcPort: 65408, DstPort: HTTP(80), Length: 0, Seq Range: 1862583515 - 1862583516, Ack: 0, Win: 65535(negotiating scale factor: 3)

You will notice that the sent packet is from port 53711 to port 80. But the arrived packet is from port 65408 to port 80. You can’t see it in this summary of the message, but the sent packet is from the client IP address 92.234.68.72 to the host IP address 10.0.0.4 while the arrived packet is from the ingress-endpoint IP address 10.255.0.2 to the container IP address 10.255.0.6. The virtual switch has re-written the source port and address of the packet. The container sends a reply packet to the ingress-endpoint, where the switch again re-writes the source and destination addresses to send the reply back to the client.

From the point of view of the host, there is:

  • no route to the ingress network 10.255.0.0/16
  • no ARP cache addresses for endpoints on the ingress network
  • no host process listening on port 80
  • a virtual adapter (Friendly Name: Microsoft Hyper-V Network Adapter #5), with the same MAC address as the primary adapter (00-22-48-01-00-03), but with no IP address, attached to a virtual switch (Friendly Name: xio0654aj01a6x60kfnoe4r12), which is the switch for the ingress network.

The virtual switch intercepts the request on the published port 80 (using the Azure Virtual Filtering Platform switch extension?) and forwards it to one of the containers.

From the point of view of the container, there is:

  • no route to the host network 10.0.0.0/24
  • no ARP cache address for endpoints on the host network
  • an ARP cache address for the ingress-endpoint 10.255.0.2, with the same MAC address as the primary host network adapter (00-22-48-01-00-03)
  • a process (web server) listening on port 80
  • a virtual adapter (Friendly Name: 136a5e8a952b7bc3da6b395e9ff3fb138cd93c97e3fafda1299f804f9cbe2bf1) attached to the same virtual switch (Friendly Name: xio0654aj01a6x60kfnoe4r12) as the phantom adapter on the host.

The virtual switch receives the reply from the container and forwards it to the MAC address of the ingress-endpoint, which is the same as the MAC address of the primary network adapter of the host. The host network adapter sends the reply to the remote client.

This trace has been for incoming traffic from an external client. The next trace is for inter-container traffic across hosts.

Trace: inter-container

Here is the flow of a ping from a container on one host to a container on the other. The trace is being performed on the receiving host. We need to dissect each packet to see what happens.

The first packet arrives, an echo (ping) request. This is the content of the packet:

MessageNumber | DiagnosisTypes | Timestamp | TimeDelta | EventRecord.Header.ProcessId | EventRecord.Header.ThreadId | Module | Summary
8852 | None | 2017-11-08T18:34:34.8066887 | 0.0336755 | 0 | 0 | ICMP | Echo Operation
8852 | None | 2017-11-08T18:34:34.8066887 | 0.0000000 | 0 | 0 | ICMP | Echo Request
8852 | None | 2017-11-08T18:34:34.8066887 | 0.0000000 | 0 | 0 | IPv4 | Next Protocol: ICMP, Packet ID: 29796, Total Length: 60
8852 | None | 2017-11-08T18:34:34.8066887 | 0.0000000 | 0 | 0 | Ethernet | Type: Internet IP (IPv4)
8852 | None | 2017-11-08T18:34:34.8066887 | 0.0000000 | 0 | 0 | VXLAN | VXLAN Frame
8852 | None | 2017-11-08T18:34:34.8066887 | 0.0000000 | 0 | 0 | UDP | SrcPort: 1085, DstPort: VXLAN(4789), Length: 90
8852 | None | 2017-11-08T18:34:34.8066887 | 0.0000000 | 0 | 0 | IPv4 | Next Protocol: UDP, Packet ID: 30052, Total Length: 110
8852 | None | 2017-11-08T18:34:34.8066887 | 0.0000000 | 0 | 0 | Ethernet | Type: Internet IP (IPv4)

From inside to outside, the packet is structured as follows (a sketch reconstructing the same encapsulation appears after the list):

  • ICMP Echo Request
  • IPv4 protocol ICMP, from source address 10.255.0.5 (the remote container) to destination address 10.255.0.7 (the local container)
  • Ethernet from source MAC address 00-15-5D-BC-F9-AA (the remote container) to destination MAC address 00-15-5D-71-A4-C5 (the local container). These are Hyper-V MAC addresses on the ingress network. The host network does not know anything about these IP or MAC addresses.
  • ———– so far, this is the original packet sent by the remote container————
  • VXLAN header with network identifier 4096. This is the VXLAN ID shown by docker network inspect ingress
  • Outer UDP header, from source port 1085 to destination port 4789 (the standard port for VXLAN traffic)
  • Outer IPv4 header, protocol UDP, from source address 10.0.0.5 (the remote host) to destination address 10.0.0.4 (the local host)
  • Outer Ethernet header, from source MAC address 00-22-48-01-9E-11 (the primary adapter of the remote host) to destination MAC address 00-22-48-01-00-03 (the primary adapter of the local host)
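
The Length and Total Length fields in the trace tie up with the standard VXLAN overhead. A quick check of the arithmetic, using the values copied from the packet above:

# Sizes taken from the captured packet above
$innerIPv4 = 60    # inner IPv4 Total Length (the ICMP echo request)
$innerEth  = 14    # inner Ethernet header carried as VXLAN payload
$vxlan     = 8     # VXLAN header
$udpHdr    = 8     # outer UDP header
$outerIPv4 = 20    # outer IPv4 header (no options)

$udpLength = $innerIPv4 + $innerEth + $vxlan + $udpHdr   # 90  = the outer UDP Length field
$totalLen  = $udpLength + $outerIPv4                     # 110 = the outer IPv4 Total Length
"$udpLength, $totalLen"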

Following the flow of messages, the packet is received by the Hyper-V switch for the overlay network:

MessageNumberDiagnosisTypesTimestampTimeDeltaEventRecord.Header.ProcessIdEventRecord.Header.ThreadIdModuleSummary
8853None2017-11-08T18:34:34.80669300.000004300Microsoft_Windows_Hyper_V_VmSwitchNBL 0xFFFF880A4626D6A0 received from Nic /DEVICE/{DAB8937D-9AD5-460E-8652-C2E152CCE573} (Friendly Name: Microsoft Hyper-V Network Adapter #5) in switch A404BC57-741B-4C79-8BA5-1D7D3FDA92C1 (Friendly Name: xio0654aj01a6x60kfnoe4r12)

The packet is routed to the container adapter:

MessageNumberDiagnosisTypesTimestampTimeDeltaEventRecord.Header.ProcessIdEventRecord.Header.ThreadIdModuleSummary
8867None2017-11-08T18:34:34.80672460.000001300Microsoft_Windows_Hyper_V_VmSwitchNBL 0xFFFF880A4626D6A0 routed from Nic /DEVICE/{DAB8937D-9AD5-460E-8652-C2E152CCE573} (Friendly Name: Microsoft Hyper-V Network Adapter #5) to Nic 0330EF2B-74AB-4E06-A32D-86DA92145374 (Friendly Name: b1937b9d22d2aa9881d0e45b16bc7031b2d4d07d4d0059531d64a6ade5a5242e) on switch A404BC57-741B-4C79-8BA5-1D7D3FDA92C1 (Friendly Name: xio0654aj01a6x60kfnoe4r12)

The packet is received by the container adapter:

MessageNumberDiagnosisTypesTimestampTimeDeltaEventRecord.Header.ProcessIdEventRecord.Header.ThreadIdModuleSummary
8868None2017-11-08T18:34:34.80672690.000002300Microsoft_Windows_Hyper_V_VmSwitchNBL 0xFFFF880A4626D6A0 delivered to Nic 0330EF2B-74AB-4E06-A32D-86DA92145374 (Friendly Name: b1937b9d22d2aa9881d0e45b16bc7031b2d4d07d4d0059531d64a6ade5a5242e) in switch A404BC57-741B-4C79-8BA5-1D7D3FDA92C1 (Friendly Name: xio0654aj01a6x60kfnoe4r12)

The original packet is delivered, minus the VXLAN header and UDP wrapper:

MessageNumberDiagnosisTypesTimestampTimeDeltaEventRecord.Header.ProcessIdEventRecord.Header.ThreadIdModuleSummary
8869None2017-11-08T18:34:34.80672960.000002700ICMPEcho Operation
8869None2017-11-08T18:34:34.80672960.000000000ICMPEcho Request
8869None2017-11-08T18:34:34.80672960.000000000IPv4Next Protocol: ICMP, Packet ID: 29796, Total Length: 60
8869None2017-11-08T18:34:34.80672960.000000000EthernetType: Internet IP (IPv4)

You can see from the timestamps that it has taken about 0.04 milliseconds (roughly 40 microseconds) to process the packet in the switch.

Traffic: incoming across hosts

With the routing mesh, incoming traffic from a remote client to any node in the swarm can be load balanced and routed to a container on a different node. The routing mesh handles the incoming and outgoing client traffic, and the overlay network carries the traffic between the receiving node and the container on the other node.

In this example the incoming packet arrives at host Docker2. It is load balanced to a container running on host Docker1. The trace is running on Docker1, receiving the packet from Docker2.

This time the incoming TCP packet has the same VXLAN and UDP headers as inter-container traffic (when it is across hosts):

MessageNumberDiagnosisTypesTimestampTimeDeltaEventRecord.Header.ProcessIdEventRecord.Header.ThreadIdModuleSummary
11165None2017-11-08T17:02:50.33488900.048658400TCPFlags: ......S., SrcPort: 65408, DstPort: HTTP(80), Length: 0, Seq Range: 4237068666 - 4237068667, Ack: 0, Win: 29200(negotiating scale factor: 7)
11165None2017-11-08T17:02:50.33488900.000000000IPv4Next Protocol: TCP, Packet ID: 41609, Total Length: 60
11165None2017-11-08T17:02:50.33488900.000000000EthernetType: Internet IP (IPv4)
11165None2017-11-08T17:02:50.33488900.000000000VXLANVXLAN Frame
11165None2017-11-08T17:02:50.33488900.000000000UDPSrcPort: 40558, DstPort: VXLAN(4789), Length: 90
11165None2017-11-08T17:02:50.33488900.000000000IPv4Next Protocol: UDP, Packet ID: 41865, Total Length: 110
11165None2017-11-08T17:02:50.33488900.000000000EthernetType: Internet IP (IPv4)

The UDP and VXLAN headers are stripped off by the switch, and the packet is routed and presented to the container as standard TCP, coming from the ingress-endpoint on the other host with address 10.255.0.3:

MessageNumberDiagnosisTypesTimestampTimeDeltaEventRecord.Header.ProcessIdEventRecord.Header.ThreadIdModuleSummary
11186None2017-11-08T17:02:50.33495200.000004000TCPFlags: ......S., SrcPort: 65408, DstPort: HTTP(80), Length: 0, Seq Range: 4237068666 - 4237068667, Ack: 0, Win: 29200(negotiating scale factor: 7)
11186None2017-11-08T17:02:50.33495200.000000000IPv4Next Protocol: TCP, Packet ID: 41609, Total Length: 60
11186None2017-11-08T17:02:50.33495200.000000000EthernetType: Internet IP (IPv4)

This time the container makes an ARP request to find the MAC address of the ingress-endpoint on the other host that sent it the packet:

MessageNumberDiagnosisTypesTimestampTimeDeltaEventRecord.Header.ProcessIdEventRecord.Header.ThreadIdModuleSummary
11187None2017-11-08T17:02:50.33503730.000085345944ARPREQUEST, SenderIP: 10.255.0.7, TargetIP: 10.255.0.3
11187None2017-11-08T17:02:50.33503730.000000045944EthernetType: ARP

The ARP request is intercepted by the VFP extension in the switch and dropped:

MessageNumberDiagnosisTypesTimestampTimeDeltaEventRecord.Header.ProcessIdEventRecord.Header.ThreadIdModuleSummary
11192None2017-11-08T17:02:50.33505780.000001745944Microsoft_Windows_Hyper_V_VmSwitchNBLs were dropped by extension {24C70E26-D4C4-42B9-854A-0A4B9BA2C286}-{E9B59CFA-2BE1-4B21-828F-B6FBDBDDC017}-0000 (Friendly Name: Virtual Filtering Platform VMSwitch Extension) in switch A404BC57-741B-4C79-8BA5-1D7D3FDA92C1 (Friendly Name: xio0654aj01a6x60kfnoe4r12). Source Nic 0330EF2B-74AB-4E06-A32D-86DA92145374 (Friendly Name: b1937b9d22d2aa9881d0e45b16bc7031b2d4d07d4d0059531d64a6ade5a5242e), Reason Outgoing packet dropped by VFP

The switch fabricates an ARP reply:

MessageNumberDiagnosisTypesTimestampTimeDeltaEventRecord.Header.ProcessIdEventRecord.Header.ThreadIdModuleSummary
11200None2017-11-08T17:02:50.33522190.000004839363284ARPREPLY, SenderIP: 10.255.0.3, TargetIP: 10.255.0.7
11200None2017-11-08T17:02:50.33522190.000000039363284EthernetType: ARP

The container replies to the SYN with a SYN-ACK:

MessageNumberDiagnosisTypesTimestampTimeDeltaEventRecord.Header.ProcessIdEventRecord.Header.ThreadIdModuleSummary
11201None2017-11-08T17:02:50.33522900.000007139363284TCPFlags: ...A..S., SrcPort: HTTP(80), DstPort: 65408, Length: 0, Seq Range: 3626128581 - 3626128582, Ack: 4237068667, Win: 65535(negotiating scale factor: 8)
11201None2017-11-08T17:02:50.33522900.000000039363284IPv4Next Protocol: TCP, Packet ID: 17960, Total Length: 52
11201None2017-11-08T17:02:50.33522900.000000039363284EthernetType: Internet IP (IPv4)

This is routed by the virtual switch and emerges at the host adapter as a reply, wrapped in the VXLAN and UDP headers:

MessageNumberDiagnosisTypesTimestampTimeDeltaEventRecord.Header.ProcessIdEventRecord.Header.ThreadIdModuleSummary
11217None2017-11-08T17:02:50.33528510.000001039363284TCPFlags: ...A..S., SrcPort: HTTP(80), DstPort: 65408, Length: 0, Seq Range: 3626128581 - 3626128582, Ack: 4237068667, Win: 65535(negotiating scale factor: 8)
11217None2017-11-08T17:02:50.33528510.000000039363284IPv4Next Protocol: TCP, Packet ID: 17960, Total Length: 52
11217None2017-11-08T17:02:50.33528510.000000039363284EthernetType: Internet IP (IPv4)
11217None2017-11-08T17:02:50.33528510.000000039363284VXLANVXLAN Frame
11217None2017-11-08T17:02:50.33528510.000000039363284UDPSrcPort: 37734, DstPort: VXLAN(4789), Length: 82
11217None2017-11-08T17:02:50.33528510.000000039363284IPv4Next Protocol: UDP, Packet ID: 18216, Total Length: 102
11217None2017-11-08T17:02:50.33528510.000000039363284EthernetType: Internet IP (IPv4)

This reply is forwarded across the host network to the other host. The virtual switch on that host re-writes the addresses and sends the reply on to the remote client. This is not shown here, but it is the same as the reply in the first trace above.

So there we have it: Windows Server 2016 version 1709 with the Docker overlay network and routing mesh, using Software Defined Networking, Hyper-V switches and the Azure Virtual Filtering Platform virtual switch extension.

Docker Swarm Networking

Docker Swarm enables containers to operate together to provide a service, across different nodes in a cluster. It uses an overlay network for communication between containers on different hosts. It also supports a routing mesh, which load-balances and routes incoming connections to the containers. On Windows Server 2016 this routing mesh was not supported before the latest release. Now it is, with version 1709, so we can see how it all works.

Docker Swarm enables containers to operate together to provide a service, across different nodes in a cluster.

It uses an overlay network for communication between containers providing the same service. You can read an excellent description of it here, in the Docker Reference Architecture: Designing Scalable, Portable Docker Container Networks. The overlay network is implemented as a Virtual Extensible LAN (VXLAN) stretched in software across the underlying network connecting the hosts.

The network has a built-in routing mesh that directs incoming traffic on a published port, on any node, to any container running the service on any node. This diagram illustrates the routing mesh on Linux, where it is implemented in the kernel by the IP Virtual Server (IPVS) component:

Illustration: the routing mesh (from the Docker Reference Architecture: Designing Scalable, Portable Docker Container Networks)

On Windows Server 2016 version 1607 the routing mesh does not work. Now, with the new Windows Server 2016 version 1709, it does.

Microsoft introduced support for Docker Swarm with overlay networks in April 2017, with KB4015217. This document Getting Started with Swarm Mode describes it, but down at the bottom it says that the routing mesh is not supported. Although you can still publish a port, that limits your options to either one container per host on a fixed port, or a dynamic port on each host together with a separate load balancer.

To get the terms straight:

  • Overlay network: a VXLAN shared by containers on different hosts, transported by the underlying host network
  • Routing mesh: load balanced routing of incoming traffic on published ports to the destination port on one of the containers in the service
  • Ingress mode: the port publishing mode that uses the routing mesh, instead of direct connection to ports on the container host (host mode or global mode)
  • "Ingress": the name of the default overlay-type network created by Docker, just as "nat" is the name of the default NAT-type network; but you can create your own overlay network.

Support for the routing mesh and ingress mode has arrived in Windows Server 2016 version 1709 and is now available in Azure too. It is still at an early stage. It requires:

  • A new installation of Windows Server 2016 version 1709
  • Docker EE version 17.10, still in Preview.

To install Docker EE Preview, run:

Install-Module DockerProvider
Install-Package Docker -ProviderName DockerProvider -RequiredVersion Preview -Force
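
After the packages install, I restart the Docker service and check the engine version to confirm the Preview build is running (the service name docker is the default for Docker EE on Windows):

Restart-Service docker
docker version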

To test this, I created a Docker Swarm service with three replicas on two nodes. I am using the microsoft/iis:windowsservercore-1709 image to have something to connect to:

docker service create --name web --replicas 3 --publish mode=ingress,target=80,published=80 microsoft/iis:windowsservercore-1709

The service is created by default on the "ingress" overlay network, because it has a published port.

With three containers on two nodes, I should be able to see:

  • Both nodes responding to a connection on port 80
  • Two containers servicing the same published port, on one node
  • One container servicing port 80 on the other node
  • Traffic arriving at a node, and going to a container either on the same node, or crossing to a container on the other node
  • All containers able to communicate with each other, on the same Layer 2 switched network.
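
These expectations can be checked from the command line as well as from a browser; a sketch, assuming the node names Docker1 and Docker2 resolve (otherwise use the node IP addresses):

# Where the three replicas are running
docker service ps web

# The routing mesh should answer on port 80 of both nodes,
# whichever node the serving container actually lives on
Invoke-WebRequest -UseBasicParsing http://docker1
Invoke-WebRequest -UseBasicParsing http://docker2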

I am using Portainer as a simple GUI to view the Docker Swarm service. Here is the web service:

Portainer Service List

and the service details:

Portainer Service Details

with the service overlay network:

Portainer Service Network

Using Portainer or the Docker command line (docker service inspect web and docker network inspect ingress), I can see that the containers are on a subnet of 10.255.0.0/16. The network also has one "ingress-endpoint" for each node, with addresses of 10.255.0.2 and .3.
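
The same details can be pulled out with Go templates instead of reading the full inspect output; a sketch using the standard inspect field paths (the container ID is a placeholder):

# Virtual IP of the service on the ingress network
docker service inspect web --format "{{ json .Endpoint.VirtualIPs }}"

# Subnet and gateway of the ingress network
docker network inspect ingress --format "{{ json .IPAM.Config }}"

# Address of an individual container
docker container inspect <container id> --format "{{ json .NetworkSettings.Networks }}"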

First let’s check that the routing mesh works. Here you can see four different connections (click to see details):

Docker 1 to web.2 – container on same host;

Docker 1 Container 2 crop

Docker 1 to web.3 – different container on same host;

Docker 1 Container 3 crop

Docker 2 to web.1 – container on the other host;

Docker 2 Container 1 crop

Docker 2 to web.3 – container on different host;

Docker 2 Container 3 crop

If I run a network trace I can see how it works. Below is the conversation between client and container, where the incoming request is routed to a container on the same node:

Connection to Container on Same Host

It consists of exact pairs of packets. If we take a look at one pair:

Source IP       Source MAC          Destination IP  Destination MAC     Content (TCP)
92.234.68.72    12:34:56:78:9a:bc   10.0.0.4        00:22:48:01:00:03   53711 → 80 [SYN]
10.255.0.2      00:22:48:01:00:03   10.255.0.6      00:15:5d:71:af:d8   65408 → 80 [SYN]

00:22:48 is the Vendor ID of adapters in the Azure VMs. 00:15:5d is the Vendor ID of Hyper-V adapters created by the Host Network Service for containers.

The packet has come from the external client on 92.234.68.72. The host adapter has received the packet from the client on its external IP address of 10.0.0.4, on port 80; and sent it with the same MAC address, but with the IP address of the ingress-endpoint 10.255.0.2, to port 80 on one of the containers. The same process happens in reverse with the reply.

Below is the conversation between client and container when the incoming request is routed to a container on a different node:

Connection to Container on Different Host

In this case we don’t see the translation between node and ingress-endpoint, because it happens on the other node. Instead we see that the request comes from the ingress-endpoint of the sending node, using the MAC address of that host’s adapter. The reply is sent to the ingress-endpoint using the MAC address of the overlay network adapter.

Source IP       Source MAC          Destination IP  Destination MAC     Content (TCP)
10.255.0.3      00:22:48:01:9e:11   10.255.0.7      00:15:5d:71:a4:c5   65408 → 80 [SYN]
10.255.0.7      00:15:5d:71:a4:c5   10.255.0.3      00:15:5d:bc:f5:40   80 → 65408 [SYN, ACK]

In between the two packets, we see the container broadcast to find the MAC address of the ingress-endpoint. All communication between entities in the overlay network is by Layer 2 switching.

Below is the conversation between two containers on different nodes:

Ping Container to Container on Different Host

The containers are on the same Layer 2 broadcast domain. There is no firewall between them, even though the two nodes both operate the Windows Firewall and do not communicate openly with each other. The containers can ping each other and connect on any listening port.

We will have to dig a bit deeper to find out what makes this work, but for the moment we can see that:

  • The overlay network is a switched LAN segment stretched across the hosts
  • The ingress-endpoints act as load-balancing and routing gateways between the nodes and the container network.

Docker Swarm on Windows

Docker Swarm enables containers to be managed across different hosts. It works on Windows Server 2016 hosts, but the built-in routing mesh is not supported until the newest Windows Server version 1709, released in October 2017.

Docker Swarm is the tool for managing containers across separate docker machines. It defines machines as managers or workers. They communicate with each other to implement docker services. A service is a collection of containers running with the same configuration, and following a set of rules to define the service.

Just to complete the picture, Docker Compose is the tool that creates an application from a set of services. The Containers feature in Windows Server 2016 by default includes Docker Swarm but not Docker Compose.

To set up the Swarm cluster we need more than one machine, obviously. Azure Container Service (ACS) does not currently include Windows hosts, although it is changing so fast that this may be out of date soon. Instead we can create a cluster of Windows hosts using an Azure virtual machine scale set with Windows Server 2016 Datacenter – with Containers.

We need to open ports on the Windows firewall on each host to allow communication between the docker machines:

  • TCP port 2377 is for Docker communication between manager and worker.
  • TCP and UDP port 7946 is for the “control plane” communication between hosts (worker to worker). This traffic synchronises the state of a service between hosts.
  • UDP port 4789 is for the “data plane” VXLAN encapsulated traffic between applications in containers.
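
A minimal sketch of opening these ports with the built-in firewall cmdlets; the rule names are my own, and the profiles and scoping should be adjusted to suit your environment:

New-NetFirewallRule -DisplayName "Docker Swarm management (TCP 2377)" -Direction Inbound -Protocol TCP -LocalPort 2377 -Action Allow
New-NetFirewallRule -DisplayName "Docker Swarm control plane (TCP 7946)" -Direction Inbound -Protocol TCP -LocalPort 7946 -Action Allow
New-NetFirewallRule -DisplayName "Docker Swarm control plane (UDP 7946)" -Direction Inbound -Protocol UDP -LocalPort 7946 -Action Allow
New-NetFirewallRule -DisplayName "Docker Swarm VXLAN data plane (UDP 4789)" -Direction Inbound -Protocol UDP -LocalPort 4789 -Action Allow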

To create the swarm, run:

docker swarm init --advertise-addr [IP address of manager]

The default is to listen on all addresses on port 2377 (0.0.0.0:2377), so there is no need to specify it. The command returns a join token.

To join a host as a worker, run:

docker swarm join --token [the token number returned when creating the swarm] [the listening address of the manager]

We can add or remove nodes later, as workers or managers. The documentation for setting up and managing the swarm is here: Docker Swarm.
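
Back on the manager, a couple of commands confirm the cluster membership and recover the join token if it is mislaid:

# List the nodes and their roles (run on a manager)
docker node ls

# Show the worker join command and token again
docker swarm join-token worker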

If we want to use a GUI to see what is going on, we can use Portainer. I have described setting it up here: Windows Containers: Portainer GUI. This is what we see in the dashboard after creating the swarm:

Docker Swarm Portainer Dashboard

In the Swarm section, we can see an overview of the cluster:

Docker Swarm Portainer Swarm Cluster

And the default overlay network:

Docker Swarm Portainer Swarm Network

Before we create a service, we need to decide how external clients will connect to containers, and how containers will connect to each other. The default network type in Docker is nat. A port on the host is translated to a port on the container so, for example, we use --publish 80:80. But this limits us to one container only, on that port. If we do not define the host port (by using --publish 80), then one is created dynamically on the host, and so we can have more than one container listening on the same port. But then the client does not know what port on the host to connect to. We would need to discover the dynamic ports and put them into an external load balancer. In the case of a docker service, we would need to do this whenever a new replica is created or removed.
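
To make the two nat options concrete, here is a sketch of the two publish forms and of discovering a dynamic port; the container names are illustrative, and the image is the one used later in this post:

# Fixed host port: only one container per host can use it
docker run -d --name web1 --publish 80:80 microsoft/iis

# Dynamic host port: several containers can listen on 80 internally...
docker run -d --name web2 --publish 80 microsoft/iis

# ...but the externally reachable port has to be discovered afterwards
docker port web2 80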

Alternatively we can set up a transparent network, where the container has an externally reachable IP address. This way we can have more than one container listening on the same port. But we would still need to manage the addresses in a load balancer whenever a replica is created or removed.

This is a general problem with service scaling across hosts. The Docker solution is to use an overlay network for swarm traffic. Connections from external clients arriving at any host are routed to any replica in the service (a “routing mesh”). Connections from one container to another are on a private subnet shared across containers in the swarm, rather than on the subnet shared with the host.

Windows Server before version 1709 supports the overlay network for communication between containers, but not the routing mesh for communication between external clients and containers. This leads to some confusing documentation.

For version 1709 and beyond, the command to create a service using the overlay network and routing mesh is built from:

  • docker service create to create a new service
  • --name to give the service a friendly name
  • --replicas to specify the number of replicas to run at any one time
  • --publish if any ports are to be published externally
  • [image name] for the name of the image to run.

We can include other options, both for the configuration of the service, and the configuration of the containers. The full command for an IIS web server would be:

docker service create --name web --replicas 2 --publish 80:80 microsoft/iis

By default the containers are attached to the swarm overlay network (called “ingress”). The publishing mode is also “ingress”. Any client connection to any host on port 80 is routed in a round robin to one of the containers on any host participating in the service. The containers can reach each other on their internal network on any port.

Here is the service in Portainer:

Docker Swarm Portainer Service 2

A wide range of parameters is shown in the Service Details:

Docker Swarm Portainer Service Details 2

Portainer shows the published port, in ingress mode:

Docker Swarm Portainer Service Publish Mode Ingress

We can see all the parameters of the service with docker service inspect [service name]. The overlay network has a subnet of 10.255.0.0/16. The service has created a Virtual IP of 10.255.0.4. With docker container inspect [container name] we can see the IP addresses of the containers are 10.255.0.6 and 10.255.0.7.

For version 1607 the routing mesh does not work. The approach that works on the earlier build is to publish the ports in host mode. Each host publishes the port directly, and maps it to the container. If we use a defined port on the host, then we can only have one container per host. Instead of defining the number of replicas we need to specify --mode global, so that one container is created on each node. The command to create the service this way is:

docker service create --name web --mode global --publish mode=host,published=80,target=80 microsoft/iis

If we use a dynamic port on the host, then we can have more than one, but we have to discover the port to connect to. The command to create the service this way is:

docker service create --name web --replicas 2 --publish mode=host,target=80 microsoft/iis

Doing it this way, the container is created on the “nat” network. Portainer shows the published port, in host mode:

Docker Swarm Portainer Service Publish Mode Host
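
With the dynamic variant, the port each task actually received has to be discovered on every node before it can be fed to an external load balancer; a sketch:

# On each node, list the local service containers and their host port mappings
docker ps --filter "name=web" --format "{{.Names}}: {{.Ports}}"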

Now we have containers running as a service. If a container fails, another is created. If a node fails or is shut down, any containers running on it are replaced by new containers on other nodes.

Windows Containers: Hyper-V

An option with Windows Containers is to run a container in Hyper-V Isolation Mode. This blog shows what happens when we do this.

When we run a container normally, the processes running in the container are running on the kernel of the host. The Process ID and the Session ID of the container process are the same as on the host.

When we run a container in Hyper-V Isolation Mode, a utility VM is created and the container runs within that. We need to have the Hyper-V role installed on the host. Then we need to add --isolation hyperv to the docker run command.
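
For example, running the same IIS image used earlier in this post under Hyper-V isolation, and then confirming the isolation mode that was applied (the container name is illustrative):

# Run the container inside a utility VM instead of directly on the host kernel
docker run -d --name iis-hv --isolation hyperv microsoft/iis

# Confirm the isolation mode
docker inspect iis-hv --format "{{ .HostConfig.Isolation }}"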

Here are some of the main differences.

The processes in the container are isolated from the host OS kernel. The Session 0 processes do not appear on the host. Session 1 in the container is not Session 1 on the host, and the Session 1 processes of the container do not appear on the host.

Container:

Get Process Hyper-V Container

Host:

Get Process Hyper-V Host Same SI

There is no mounted Virtual Hard Disk (VHD):

Disk Management Hyper-V

Instead we have a set of processes for the Hyper-V virtual machine:

Hyper-V Processes on Host

A set of inbound rules is not automatically created on the host Windows firewall. There are no rules for ICC, RDP, DNS, DHCP as there are when we create a standard container:

Firewall Rules Hyper-V Host

But the container is listening on port 135, and we can connect from the host to the container on that port, as we can with a standard container:

Netstat Hyper-V Container Established
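
One way to repeat this check from the host is Test-NetConnection; the container name follows the earlier example, and the IP address is a placeholder to be taken from the inspect output:

# Find the container's IP address, then test the RPC endpoint mapper port
docker inspect iis-hv --format "{{ range .NetworkSettings.Networks }}{{ .IPAddress }}{{ end }}"
Test-NetConnection -ComputerName <container IP> -Port 135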

And if we create another, standard, container, they each respond to a ping from the other.

Hyper-V does not add to the manageability of containers. The Hyper-V containers do not appear in the Hyper-V management console.

Hyper-V Manager

So in summary: in Hyper-V Isolation Mode the container processes are fully isolated; but the container is not on an isolated network, and is still open to connections from the host and from other containers by default.