April 04, 2008

Part I: Quick Migration vs VMware VMotion and Live Migration - Why Things Fail with Quick Migration

There has been a lot of talk recently about why Quick Migration with Microsoft’s Hyper-V drops connections when it simply uses Microsoft Clustering Services (MSCS) to do the migration, and MSCS by itself does not drop network connections. In a nutshell, it comes down to where the network stack is running and what happens during a cluster failover versus a Quick Migration. This two-part post first covers why Quick Migration drops connections and then looks at the impact reported by actual customers in the field. As always, your constructive comments are appreciated.

A Look at TCP

First, a quick walk-through of what happens in a stateful network connection using TCP. When you transmit data over TCP you care whether packets reach the other end, so a great deal of the protocol stack deals with connection state and acknowledgements.

In Windows Server 2003 there is a registry value called TcpMaxDataRetransmissions that controls the number of times TCP will retransmit a data segment when it does not receive an acknowledgement (ACK). The default in Windows is 5 tries. The time between retries is derived from the Smoothed Round Trip Time (SRTT). The SRTT varies with network speed and latency, so TCP tunes its retransmission timing to each connection. Since MSCS and VMotion both use a private local LAN connection for data, the SRTT is very small (2 ms is common). For each retry the retransmission timeout is doubled, until the TcpMaxDataRetransmissions value is reached (5 attempts total by default). If no ACK arrives after all 5 retries, the TCP connection is aborted, and that’s when clients get kicked off the network. Below is a network trace showing what this looks like on a Fast Ethernet (100 Mb) network.


[Figure: Network Trace.png]

As you can see, the TCP connection is aborted after 8 seconds. That’s over Fast Ethernet. Over Gigabit Ethernet (recommended for nearly every implementation of live migration, VMotion, and Quick Migration) the timeout occurs even faster, generally around 5 seconds.
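The doubling behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `retransmit_offsets` and the 0.25 second starting timeout are hypothetical and are not taken from the trace. A real stack derives its starting timeout from the SRTT and applies its own limits.

```python
def retransmit_offsets(initial_timeout_s, max_retransmissions=5):
    """Seconds after the original send at which each retransmission goes out.

    Assumes the timeout simply doubles after every unanswered attempt.
    """
    offsets = []
    elapsed = 0.0
    timeout = initial_timeout_s
    for _ in range(max_retransmissions):
        elapsed += timeout       # wait out the current timeout with no ACK
        offsets.append(elapsed)  # send the retransmission
        timeout *= 2             # double the timeout for the next attempt
    return offsets


# Hypothetical starting timeout of 0.25 s, default of 5 retransmissions
print(retransmit_offsets(0.25))  # [0.25, 0.75, 1.75, 3.75, 7.75]
```

The point of the sketch is that the waits grow geometrically, so the whole retry budget is spent within a few seconds when the starting timeout is small.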

For further reading, see TCP Retransmission Behavior in the Core Protocol Stack Components and the TDI Interface in this Microsoft TechNet article.

The Network and MSCS

Now that we understand a little about TCP timeouts, let’s look at how this applies to MSCS. In a typical 2 node MSCS setup with one application (let’s just say a static web page to keep it simple) you have at least 3 actual public IP addresses: one for each physical host, plus a common cluster IP for the service you’re running. Microsoft refers to the cluster IP address for the service as the virtual host IP. Without network load balancing (NLB), only one host controls and responds to the virtual host IP at a time. When the primary node fails, the virtual host IP and name are the first resources to fail over to the secondary node and come online. The virtual host IP switch happens relatively quickly - usually in less than 2 seconds - so no clients are impacted by this transition, because the TCP connection has not yet been aborted, as we learned in the previous section.

With NLB in place all active nodes in the cluster can service the virtual host IP, and Windows uses a simplified version of Internet Group Management Protocol (IGMP) to control which node responds to the request. Again, in an NLB environment the network transitions seamlessly from one node to another and clients are not impacted.

The networking techniques used with MSCS have been around for over 10 years and are very reliable. I deployed them myself at several former employers when I worked on the corporate side of the world.

For more information on NLB see this Microsoft TechNet article. For more information on MSCS see this Microsoft TechNet article.

MSCS and Quick Migration

Quick Migration uses MSCS to coordinate the failover of a VM from one node to another. However, there is a BIG difference between the network failover that occurs with a normal cluster failover inside the OS and a cluster failover that occurs during Quick Migration. With a Quick Migration setup there is no virtual host IP serviced by MSCS. The IP address and all communication to the guest OS in the VM are controlled by the network protocol stack running inside the VM. If the VM is not running, nothing responds to the VM’s IP address. During a Quick Migration you actually suspend the VM to disk, fail over the disk resource, and then resume the VM on the second host. During this transition there is no network stack up and running to reply to requests for the VM’s IP address. It’s true that a Quick Migration can be pretty quick. I’ve seen it complete in as little as 8 seconds for a low-memory, idle VM, which is in line with Microsoft’s stated expectations for a migration.


[Figure: Quick Migration Times.png]

The only problem with 8 seconds is that it’s longer than the 5 seconds we saw in the first section for TCP connection aborts. This means that any TCP connections to the VM will be aborted, and clients will need to reconnect once the VM becomes available again. The result of this dropped connection is apparent when you watch the Microsoft Quick Migration video here.
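To make the comparison concrete, here is a minimal sketch that checks each outage figure quoted in this post against the roughly 5 second abort time. The helper `connection_survives` is a hypothetical name, and the model is deliberately simplified: it assumes a connection survives whenever the address is unreachable for less time than the abort window. The "less than" figures from the post are entered as upper bounds.

```python
def connection_survives(outage_s, abort_after_s):
    """True if the address comes back before TCP gives up on the connection."""
    return outage_s < abort_after_s


ABORT_AFTER_S = 5.0  # approximate Gigabit Ethernet abort time quoted in this post

# Outage figures quoted in this post, in seconds
outages = {
    "MSCS virtual host IP failover": 2.0,  # "usually less than 2 seconds"
    "VMotion / live migration": 1.0,       # "generally less than 1 second"
    "Quick Migration (best case)": 8.0,    # fastest time observed by the author
}

results = {name: connection_survives(s, ABORT_AFTER_S) for name, s in outages.items()}
for name, survives in results.items():
    print(f"{name}: {'connection kept' if survives else 'connection aborted'}")
```

Only the Quick Migration entry falls outside the window, which is the whole argument of this section in one line.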

How VMotion Keeps the Network Alive

During a VMotion, or even a live migration from other Xen-based vendors, the network failover happens relatively quickly (generally in less than 1 second). The VM is never suspended to disk and resumed during the process. Here’s a simplified flow of what happens during a VMotion.

  1. The VM configuration is created on the destination node.
  2. A memory map of the active pages of memory is created on the source node and sent to the destination node.
  3. Upon receiving the memory map, the destination node demand pages the memory in use from the source node and places it in the new VM’s memory pages on the destination node.
  4. Clients are still connected to the source node and changing memory, so a new memory map is created of just the changed pages. This list is shorter, and it is sent to the destination node.
  5. The destination node again demand pages the changed memory over from the source node. This pass takes less time than the first, so less memory changes on the source node while it runs.
  6. This back-and-forth, iterative memory copy continues until we reach a point where we think we can grab the last bit of memory with 1 second or less of downtime.
  7. At this point we actually stop processing on the source node, grab the last little bit of memory that has changed, and start processing on the destination node.
  8. As soon as the destination node takes over, the VM sends a reverse ARP out to the network to help with network convergence. This directs all new frames to the correct ESX host and the VM’s new location.
  9. All of this happens well under the TCP session abort time, so no client connections are interrupted.
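The convergence in steps 2 through 7 can be modeled roughly as follows. This is a back-of-the-envelope sketch, not VMware’s actual algorithm: the function `precopy_rounds`, its parameters, and the sample numbers (4096 MB of memory, a 100 MB/s link, 10 MB/s of page changes) are all assumptions made for illustration. It assumes memory changes at a constant rate while each copy pass runs.

```python
def precopy_rounds(memory_mb, bandwidth_mb_s, dirty_rate_mb_s,
                   max_downtime_s=1.0, max_rounds=30):
    """Return (copy passes made while the VM runs, estimated final pause in seconds)."""
    if dirty_rate_mb_s >= bandwidth_mb_s:
        raise ValueError("memory changes faster than it can be copied")
    remaining_mb = float(memory_mb)
    rounds = 0
    # Keep copying while the VM runs until what is left fits in the allowed pause.
    while remaining_mb / bandwidth_mb_s > max_downtime_s and rounds < max_rounds:
        copy_time_s = remaining_mb / bandwidth_mb_s
        # Pages changed during this pass become the work for the next pass.
        remaining_mb = dirty_rate_mb_s * copy_time_s
        rounds += 1
    return rounds, remaining_mb / bandwidth_mb_s


# Hypothetical VM: 4096 MB of memory, 100 MB/s link, 10 MB/s of page changes
rounds, pause_s = precopy_rounds(4096, 100, 10)
print(rounds, round(pause_s, 4))  # 2 0.4096
```

Because each pass is shorter than the one before, the amount left to copy shrinks geometrically, and the final pause lands under the 1 second target after only a couple of passes in this example.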

So you can see that by doing an iterative memory copy we can greatly reduce the amount of time the VM’s IP address is off the network, and in turn we do not drop any network connections. This has been so reliable that customers trust it and use it every day. Below is a screenshot I just got from a customer that has VMware’s Distributed Resource Scheduler (DRS) turned on to help balance virtual machines during peak loads. As you can see, over the past year DRS has initiated 150,000 VMotions in a production environment with no downtime for clients or the VMs themselves.


[Figure: screenshot 150.000 V-Motions-mk.jpg]

Just to drive this home, let’s say we had this same picture and Quick Migration was involved. Let’s also assume that we get the absolute best Quick Migration speed of 8 seconds. First of all, that would mean your clients got disconnected from their applications 150,000 times throughout the year. Second, that means you had 1,200,000 seconds of downtime (nearly 14 days)!
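For anyone who wants to check the arithmetic, here it is spelled out. The constant names are mine; the numbers are the 150,000 migrations and the 8 second best case from above.

```python
MIGRATIONS_PER_YEAR = 150_000   # VMotions initiated by DRS in the screenshot
SECONDS_PER_MIGRATION = 8       # best-case Quick Migration time
SECONDS_PER_DAY = 24 * 60 * 60

total_seconds = MIGRATIONS_PER_YEAR * SECONDS_PER_MIGRATION
total_days = total_seconds / SECONDS_PER_DAY

print(total_seconds, round(total_days, 1))  # 1200000 13.9
```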

Conclusion

All of this was a very long explanation of the major difference between MSCS when used inside the OS and MSCS when used as part of Quick Migration. The key is where the network stack is running. I hope this clears up why every other virtualization provider out there besides Microsoft has chosen to implement a true live migration solution that transfers the VM with less than 1 second of downtime for the network stack. Live migration or VMware VMotion is the foundation for building a truly dynamic datacenter - something Microsoft has talked about for years while missing the first building block needed to get there. So the next time Microsoft comes in and tells you Quick Migration is good enough, you will know why it really isn’t, and why customers, partners, and every other virtualization vendor in the marketplace say so.

To see the financial impact of the downtime introduced by Quick Migration, continue reading Part II of this blog.
