Virtual Machine Mobility Protocol for L2 and L3 Overlay Networks

Data center networks are increasingly used by telecom operators as well as by enterprises. This document is concerned with overlay-based data center networks that support multitenancy. These networks are organized as one large Layer 2 network geographically distributed across several buildings. In some cases the geographical distribution spans Layer 2 boundaries. The need then arises for connectivity between Layer 2 domains, which can be achieved by the network virtualization edge (NVE) functioning as a Layer 3 gateway that routes across bridging domains, as in Warehouse Scale Computers (WSC).

Virtualization, which is used in almost all of today's data centers, enables many virtual machines (VMs) to run on a single physical computer or compute server. Virtual machines need a hypervisor running on the physical compute server to provide them with shared processor, memory, and storage. Network connectivity is provided by the NVE. Being able to move VMs dynamically from one server to another (live migration) allows for dynamic load balancing or work distribution, and is thus a highly desirable feature.

There are many challenges and requirements related to migration, mobility, and interconnection of VMs and Virtual Network Elements (VNEs). Retaining IP addresses after a move is a key requirement, because it is needed to maintain existing transport connections. In L3-based data center networks, retaining IP addresses after a move is simply not possible. This introduces complexity in IP address management, and as a result transport connections need to be reestablished.

In view of the many virtual machine mobility schemes that exist today, there is a desire to define a standard control plane protocol for virtual machine mobility. The protocol should be based on IPv4 or IPv6. This document specifies such a protocol for Layer 2 and Layer 3 data center networks.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 and . This document uses the terminology defined in . In addition, we make the following definitions:

Tasks. Tasks are the generalization of virtual machines. Tasks in containers that can be migrated correspond to virtual machines that can be migrated. We use task and virtual machine interchangeably in this document.

Hot VM Mobility. A VM is moved from one server to another while it is in the running state.

Warm VM Mobility. The VM state is mirrored to the secondary server (or domain) at predefined (configurable) regular intervals. This reduces overhead and complexity, but it may also lead to a situation in which the two servers do not contain exactly the same data (state information).

Cold VM Mobility. A VM is moved from one server to another while it is in the stopped or suspended state.

Source NVE. The old NVE, to which packets were forwarded before migration.

Destination NVE. The new NVE, after migration.

This section states the requirements on virtual machine mobility in data center networks. The data center network SHOULD support virtual machine mobility in IPv6. IPv4 SHOULD also be supported in virtual machine mobility. The virtual machine mobility protocol MAY support host routes to accomplish virtualization. The virtual machine mobility protocol SHOULD NOT support triangular routing except for handling packets in flight. The virtual machine mobility protocol SHOULD NOT need to use tunneling except for handling packets in flight.

Layer 2 and Layer 3 protocols are described next. In the following sections, we examine more advanced features.

Being able to move virtual machines dynamically from one server to another allows for dynamic load balancing or work distribution, and is thus a highly desirable feature. In a Layer 2 based data center approach, a virtual machine that moves to another server does not change its IP address. Because of this, an IP-based virtual machine mobility protocol is not needed. However, when a virtual machine moves, NVEs need to update the caches that associate the VM's Layer 2 or Medium Access Control (MAC) address with the IP address of its NVE. Such an update enables an NVE to send outgoing MAC frames addressed to the virtual machine. VM movement across Layer 3 boundaries is not typical, but the same solution applies if the VM moves within the same link, as in WSCs.

A virtual machine moves from its source NVE to a new, destination NVE. The move is initiated by the source NVE and stays within the same L2 link, so the virtual machine's IP address(es) do not change, but the virtual machine is now under a new NVE. Previously communicating NVEs will continue to send their packets to the source NVE. The Address Resolution Protocol (ARP) cache in IPv4, or the neighbor cache in IPv6, in those NVEs needs to be updated.

It takes a few seconds for a VM to move from its source NVE to the new destination NVE. During this period, a tunnel is needed so that the source NVE can forward packets to the destination NVE.

In IPv4, immediately after the move the virtual machine sends a gratuitous ARP request message containing its IPv4 address and its Layer 2 (MAC) address at its new NVE, the destination NVE. The destination address of this message is the broadcast address. The destination NVE receives this message and should update the VM's ARP entry in the central directory at the Network Virtualization Authority (NVA): the NVE asks the NVA to update its mappings to record the IPv4 address of the VM along with the MAC address of the VM and the IPv4 address of the NVE. An NVE-to-NVA protocol is used for this purpose.
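As an illustration of the gratuitous ARP request described above, the following minimal sketch builds the Ethernet frame a VM would broadcast after a move. The field layout follows the standard ARP packet format for Ethernet and IPv4; the function name, the example MAC address, and the example IPv4 address are assumptions made for this sketch and are not defined by this document.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_ARP = 0x0806


def build_gratuitous_arp(vm_mac: bytes, vm_ipv4: str) -> bytes:
    """Return an Ethernet frame carrying a gratuitous ARP request.

    In a gratuitous ARP request the sender and target protocol
    addresses are both the VM's own IPv4 address.
    """
    ip = socket.inet_aton(vm_ipv4)
    eth_header = BROADCAST_MAC + vm_mac + struct.pack("!H", ETHERTYPE_ARP)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,              # hardware type: Ethernet
        0x0800,         # protocol type: IPv4
        6,              # hardware address length
        4,              # protocol address length
        1,              # opcode: request
        vm_mac,         # sender hardware address
        ip,             # sender protocol address
        b"\x00" * 6,    # target hardware address (unset)
        ip,             # target protocol address = sender address
    )
    return eth_header + arp


# Hypothetical VM addresses, used only for this example.
frame = build_gratuitous_arp(bytes.fromhex("020000000001"), "192.0.2.10")
```

The destination NVE would learn the (IPv4, MAC) pair from such a frame and report it, together with its own address, to the NVA.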
Reverse ARP (RARP), which enables a host to discover its IPv4 address from a local server when it boots, is not used by VMs because the VM already knows its IPv4 address. An IPv4/IPv6 address is assigned to a newly created VM, possibly using the Dynamic Host Configuration Protocol (DHCP). There are some vendor deployments (diskless systems or systems without configuration files) in which VM users, i.e., end-user clients, ask for the same MAC address upon migration. This can be achieved by the client sending a RARP request reverse message, which carries the old MAC address and asks for an IP address allocation. The server, in this case the new NVE, needs to communicate with the NVA, just as in the gratuitous ARP case, to ensure that the same IPv4 address is assigned to the VM. The NVA uses the MAC address as the key to search the ARP cache for the IP address and passes it to the new NVE, which in turn sends a RARP reply reverse message. This completes IP address assignment to the migrating VM.

All NVEs communicating with this virtual machine use the old ARP entry. If any VM in those NVEs needs to talk to the VM that is now at the destination NVE, it uses the old ARP entry. Thus the packets are delivered to the source NVE. The source NVE MUST tunnel these in-flight packets to the destination NVE. When an ARP entry in those VMs times out, their corresponding NVEs should access the NVA for an update.

IPv6 operation is slightly different. In IPv6, immediately after the move the virtual machine sends an unsolicited neighbor advertisement message containing its IPv6 address and its Layer 2 (MAC) address at its new NVE, the destination NVE. This message is sent to the IPv6 Solicited Node Multicast Address corresponding to the target address, which is the VM's IPv6 address. The destination NVE receives this message and should update the VM's neighbor cache entry in the central directory at the NVA. The IPv6 address of the VM, the MAC address of the VM, and the IPv6 address of the NVE are recorded in the entry.
An NVE-to-NVA protocol is used for this purpose. All NVEs communicating with this virtual machine use the old neighbor cache entry. If any VM in those NVEs needs to talk to the VM that is now at the destination NVE, it uses the old neighbor cache entry. Thus the packets are delivered to the source NVE. The source NVE MUST tunnel these in-flight packets to the destination NVE. When a neighbor cache entry in those VMs times out, their corresponding NVEs should access the NVA for an update.
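The interaction between NVE caches and the central directory can be sketched as follows. This is a toy model under stated assumptions, not the NVE-to-NVA protocol itself: the class names, method names, and addresses are hypothetical, and cache timeout is modeled by an explicit expire() call rather than a timer.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Mapping:
    vm_ip: str
    vm_mac: str
    nve_ip: str


class NVA:
    """Central directory keyed by VM IP address (hypothetical model)."""

    def __init__(self) -> None:
        self._by_ip: Dict[str, Mapping] = {}

    def update(self, vm_ip: str, vm_mac: str, nve_ip: str) -> None:
        self._by_ip[vm_ip] = Mapping(vm_ip, vm_mac, nve_ip)

    def lookup(self, vm_ip: str) -> Optional[Mapping]:
        return self._by_ip.get(vm_ip)


class NVE:
    def __init__(self, ip: str, nva: NVA) -> None:
        self.ip = ip
        self.nva = nva
        self.cache: Dict[str, Mapping] = {}

    def on_vm_announcement(self, vm_ip: str, vm_mac: str) -> None:
        """Called on a gratuitous ARP or unsolicited NA from a local VM."""
        self.nva.update(vm_ip, vm_mac, self.ip)

    def next_hop(self, vm_ip: str) -> Optional[str]:
        """Return the NVE address for vm_ip, consulting the NVA on a miss."""
        if vm_ip not in self.cache:
            mapping = self.nva.lookup(vm_ip)
            if mapping is None:
                return None
            self.cache[vm_ip] = mapping
        return self.cache[vm_ip].nve_ip

    def expire(self, vm_ip: str) -> None:
        """Model a cache entry timing out."""
        self.cache.pop(vm_ip, None)


nva = NVA()
source, destination, peer = (NVE(ip, nva) for ip in
                             ("198.51.100.1", "198.51.100.2", "198.51.100.3"))
source.on_vm_announcement("192.0.2.10", "02:00:00:00:00:01")
before_move = peer.next_hop("192.0.2.10")
destination.on_vm_announcement("192.0.2.10", "02:00:00:00:00:01")  # VM moved
stale = peer.next_hop("192.0.2.10")   # still the source NVE
peer.expire("192.0.2.10")
fresh = peer.next_hop("192.0.2.10")   # now the destination NVE
```

While the peer's entry is stale, its packets arrive at the source NVE, which is why the source NVE has to tunnel in-flight packets until the entry times out.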

Virtualization in L2-based data center networks quickly becomes prohibitive because ARP/neighbor caches do not scale. Scaling can be accomplished seamlessly in L3 data center networks by giving each virtual network an IP subnet and a default route that points to the NVE. This means no explosion of the ARP/neighbor cache in guests (just one ARP/neighbor cache entry for the default route), and no Ethernet header is needed in the encapsulation, which saves at least 16 bytes.

In L3-based data center networks, since the IP address of the task has to change after a move, an IP-based task migration protocol is needed. The protocol mostly used is identifier locator addressing (ILA). Address and connection migration introduce complications in the task migration protocol, as discussed below. In particular, informing the communicating hosts of the migration becomes a major issue. Also, because broadcasting is not available in L3-based networks, multicast of neighbor solicitations in IPv6 would need to be emulated.

Task migration involves the following steps:

1. Stop running the task.
2. Package the runtime state of the task.
3. Send the runtime state of the task to the destination NVE where the task is to run.
4. Instantiate the task's state on the new machine.
5. Start the task, continuing from the point at which it was stopped.

Address migration and connection migration in moving tasks are addressed next.
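The five task migration steps can be illustrated with a deliberately small sketch. It assumes a toy task whose entire runtime state is a counter and uses JSON as the packaging format; the Task class, the migrate() function, and the send callback are hypothetical names introduced only for this example.

```python
import json
from typing import Callable


class Task:
    """A toy task whose whole runtime state is a name and a counter."""

    def __init__(self, name: str, counter: int = 0) -> None:
        self.name = name
        self.counter = counter
        self.running = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False

    def step(self) -> None:
        if not self.running:
            raise RuntimeError("task is stopped")
        self.counter += 1

    def package_state(self) -> bytes:
        return json.dumps({"name": self.name, "counter": self.counter}).encode()

    @classmethod
    def from_state(cls, blob: bytes) -> "Task":
        state = json.loads(blob.decode())
        return cls(state["name"], state["counter"])


def migrate(task: Task, send: Callable[[bytes], bytes]) -> Task:
    task.stop()                           # 1. stop running the task
    blob = task.package_state()           # 2. package the runtime state
    received = send(blob)                 # 3. send it to the destination
    new_task = Task.from_state(received)  # 4. instantiate the state there
    new_task.start()                      # 5. continue where it stopped
    return new_task


old = Task("worker")
old.start()
for _ in range(3):
    old.step()
# The identity function stands in for the network transfer.
new = migrate(old, send=lambda blob: blob)
new.step()
```

After the call, the old task is stopped and the new one resumes from the counter value it inherited.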

Address migration is achieved as follows:

1. Configure the IPv4/IPv6 address on the target host.
2. Suspend use of the address on the old host. This includes handling established connections. State may be established to drop packets, or to send an ICMPv4 or ICMPv6 destination unreachable message, when packets to the migrated address are received.
3. Push the new mapping to hosts. Communicating hosts will learn of the new mapping via a control plane, either by participating in a protocol for mapping propagation or by getting the new mapping from a central database such as the Domain Name System (DNS).

Connection migration involves reestablishing the existing TCP connections of the task in the new place. The simplest course of action is to drop TCP connections across a migration. Since migrations should be relatively rare events, it is conceivable that TCP connections could be automatically closed in the network stack during a migration event. If the applications running are known to handle this gracefully (i.e., they reopen dropped connections), then this may be viable.

A more involved approach to connection migration entails pausing the connection, packaging the connection state and sending it to the target, instantiating the connection state in the peer stack, and restarting the connection. From the time the connection is paused to the time it is running again in the new stack, packets received for the connection should be silently dropped. For some period of time, the old stack will need to keep a record of the migrated connection. If it receives a packet, it should either silently drop the packet or forward it to the new location, similarly as in .
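The record that the old stack keeps for a migrated connection might look like the following sketch. The class name, the string return values, the hold period, and the choice between dropping and forwarding are all assumptions made for illustration; time is passed in explicitly so that the behavior is deterministic.

```python
from typing import Dict, Tuple

# (source address, source port, destination address, destination port)
ConnKey = Tuple[str, int, str, int]


class OldStack:
    """Keeps a record of migrated connections for a limited time."""

    def __init__(self, hold_seconds: float, forward: bool) -> None:
        self.hold_seconds = hold_seconds
        self.forward = forward
        self._migrated: Dict[ConnKey, Tuple[str, float]] = {}

    def record_migration(self, conn: ConnKey, new_location: str,
                         now: float) -> None:
        self._migrated[conn] = (new_location, now + self.hold_seconds)

    def handle_packet(self, conn: ConnKey, now: float) -> str:
        entry = self._migrated.get(conn)
        if entry is None:
            return "no-record"
        new_location, expires = entry
        if now >= expires:
            # The record has aged out; forget the connection.
            del self._migrated[conn]
            return "no-record"
        if self.forward:
            return "forward:" + new_location
        return "drop"


conn = ("203.0.113.5", 40000, "192.0.2.10", 443)
forwarding = OldStack(hold_seconds=30.0, forward=True)
forwarding.record_migration(conn, "198.51.100.2", now=0.0)
during_hold = forwarding.handle_packet(conn, now=10.0)
after_hold = forwarding.handle_packet(conn, now=31.0)

dropping = OldStack(hold_seconds=30.0, forward=False)
dropping.record_migration(conn, "198.51.100.2", now=0.0)
dropped = dropping.handle_packet(conn, now=10.0)
```

What happens to a packet once no record exists (for example, an ICMP destination unreachable response) is left to the caller in this sketch.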

The source hypervisor may receive packets from the virtual machine's ongoing communications. These packets should not be lost; they should be sent to the destination hypervisor to be delivered to the virtual machine. The steps involved in handling packets in flight are as follows:

1. It takes some time, possibly a few seconds, for a VM to move from its source hypervisor to a new destination hypervisor. During this period, a tunnel needs to be established so that the source NVE can forward packets to the destination NVE.
2. In-flight packets are tunneled to the destination NVE using an encapsulation protocol such as VXLAN, over IPv6 or over IPv4. The source NVE gets the destination NVE address from the NVA when the NVA requests the NVE to move the virtual machine.
3. IPv6 packets for the migrating virtual machine are encapsulated in an IPv6 header at the source NVE; the destination NVE decapsulates each packet and sends the IPv6 packet to the migrating VM. Likewise, IPv4 packets for the migrating virtual machine are encapsulated in an IPv4 header at the source NVE; the destination NVE decapsulates each packet and sends the IPv4 packet to the migrating VM.
4. Tunneling stops when the source NVE stops receiving packets destined to the virtual machine that has just moved to the destination NVE.
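Since VXLAN is named above as a possible encapsulation, the following sketch shows only the 8-octet VXLAN header that the source NVE would prepend and the destination NVE would strip. It assumes the standard VXLAN header layout (flags octet with the VNI-valid bit, 24-bit VNI, reserved fields set to zero); the outer UDP and IPv4/IPv6 headers are left to the socket layer, and the function names and the VNI value are chosen for this example.

```python
import struct
from typing import Tuple

VXLAN_FLAG_VNI_VALID = 0x08
VXLAN_HEADER_LEN = 8


def vxlan_encap(vni: int, inner_frame: bytes) -> bytes:
    """Prepend a VXLAN header carrying the given 24-bit VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 0: flags (8 bits) + reserved (24 bits).
    # Word 1: VNI (24 bits) + reserved (8 bits).
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    return header + inner_frame


def vxlan_decap(payload: bytes) -> Tuple[int, bytes]:
    """Return (VNI, inner frame) from a VXLAN payload."""
    if len(payload) < VXLAN_HEADER_LEN:
        raise ValueError("payload shorter than VXLAN header")
    word0, word1 = struct.unpack("!II", payload[:VXLAN_HEADER_LEN])
    if not (word0 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return word1 >> 8, payload[VXLAN_HEADER_LEN:]


# Source NVE side: wrap an in-flight frame for the tenant with VNI 5000.
tunneled = vxlan_encap(5000, b"inner-ethernet-frame")
# Destination NVE side: recover the VNI and the original frame.
vni, inner = vxlan_decap(tunneled)
```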

After the VM mobility related signaling (VM Mobility Registration Request/Reply), the virtual machine state needs to be transferred to the destination hypervisor. The state includes the VM's memory and file system. The source NVE opens a TCP connection with the destination NVE, over which the VM's memory state is transferred.

The file system, or local storage, is more complicated to transfer. The transfer should ensure consistency, i.e., the VM at the destination should find the same file system it had at the source. Precopying is a commonly used technique for transferring the file system. First, the whole disk image is transferred while the VM continues to run. After the VM is moved, any changes made to the file system in the meantime are packaged together and sent to the destination hypervisor, which applies these changes to the file system locally at the destination.
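The precopy technique can be sketched with a block-level model. This is a simplification under stated assumptions: the disk is a list of blocks, writes are tracked in a dirty set, and the network transfer is replaced by passing Python objects; the class and function names are hypothetical.

```python
from typing import Dict, List, Set


class SourceDisk:
    """Disk image at the source that tracks blocks written since last copy."""

    def __init__(self, blocks: List[bytes]) -> None:
        self.blocks = list(blocks)
        self.dirty: Set[int] = set()

    def write(self, index: int, data: bytes) -> None:
        self.blocks[index] = data
        self.dirty.add(index)

    def full_image(self) -> List[bytes]:
        """Phase 1: copy the whole image while the VM keeps running."""
        self.dirty.clear()
        return list(self.blocks)

    def delta(self) -> Dict[int, bytes]:
        """Phase 2: package the blocks changed since the last copy."""
        changes = {i: self.blocks[i] for i in sorted(self.dirty)}
        self.dirty.clear()
        return changes


def apply_delta(image: List[bytes], changes: Dict[int, bytes]) -> None:
    """Destination side: reflect the packaged changes in the local copy."""
    for index, data in changes.items():
        image[index] = data


src_disk = SourceDisk([b"a", b"b", b"c"])
dst_image = src_disk.full_image()        # bulk transfer, VM still running
src_disk.write(1, b"B")                  # VM writes during the transfer
changes = src_disk.delta()               # sent after the VM has moved
apply_delta(dst_image, changes)
```

Only the changed block crosses the network in the second phase, which keeps the final synchronization short.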

In cold virtual machine mobility, the VM initially sends an ARP or Neighbor Discovery message at the destination NVE, but the source NVE does not receive any packets in flight. Cold VM mobility also allows all previous source NVEs and all communicating NVEs to time out the ARP/neighbor cache entries of the VM, and then either the NVA pushes the updated ARP/neighbor cache entry to the NVEs or the NVEs pull it from the NVA.

The VMs that are used for cold standby receive scheduled backup information, but less frequently than they would for the warm standby option. Therefore, the cold mobility option can be used for non-critical applications and services.

In the warm standby option, the backup VMs receive backup information at regular intervals. The duration of the interval determines the warmth of the standby option: the longer the interval, the less warm (and hence the colder) the standby option becomes.

In the hot standby option, the VMs in both the primary and secondary domains have identical information and can provide services simultaneously, as in a load-sharing mode of operation. If the VMs in the primary domain fail, there is no need to actively move them to the secondary domain, because the VMs in the secondary domain already contain identical information. The hot standby option is the most costly mechanism for providing redundancy, and hence it is used only for mission-critical applications and services.

Virtual machines are not involved in any mobility signaling. Once a VM moves to the destination NVE, its IP address does not change, and the VM should be able to continue to receive packets sent to its address(es). This happens in hot VM mobility scenarios. The virtual machine sends a gratuitous Address Resolution Protocol message or an unsolicited Neighbor Advertisement message upstream after each move.

Managing the lifecycle of a VM includes creating the VM with all of the required resources and managing them seamlessly as the VM migrates from one service to another during its lifetime. The on-boarding process includes the following steps:

1. Sending an allowed (authorized/authenticated) request to the Network Virtualization Authority (NVA), in an acceptable format, with the mandatory and optional virtualized resources (CPU, memory, storage, process/thread support, etc.) and interface information.
2. Receiving an acknowledgement from the NVA regarding the availability and usability of the virtualized resources and interface package.
3. Sending a confirmation message to the NVA with a request for approval to adapt, adjust, or modify the virtualized resources and interface package for utilization in a service.
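The three on-boarding steps can be pictured as a simple request, acknowledgement, and confirmation exchange. The sketch below is an assumption-laden model: this document does not define the message format, so the class names, the resource keys, and the rule that confirmation reserves capacity are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class ResourceRequest:
    requester: str
    resources: Dict[str, int]   # e.g. {"cpu": 2, "memory_gb": 4}


class OnboardingNVA:
    """Toy NVA that checks authorization and resource availability."""

    def __init__(self, authorized: Set[str], capacity: Dict[str, int]) -> None:
        self.authorized = set(authorized)
        self.capacity = dict(capacity)
        self.pending: Dict[str, ResourceRequest] = {}

    def request(self, req: ResourceRequest) -> bool:
        """Steps 1 and 2: accept an authorized request, acknowledge availability."""
        if req.requester not in self.authorized:
            return False
        available = all(self.capacity.get(name, 0) >= amount
                        for name, amount in req.resources.items())
        if available:
            self.pending[req.requester] = req
        return available

    def confirm(self, requester: str) -> bool:
        """Step 3: on confirmation, commit the resources to the requester."""
        req = self.pending.pop(requester, None)
        if req is None:
            return False
        for name, amount in req.resources.items():
            self.capacity[name] -= amount
        return True


onboarding = OnboardingNVA(authorized={"tenant-a"},
                           capacity={"cpu": 4, "memory_gb": 8})
acked = onboarding.request(ResourceRequest("tenant-a", {"cpu": 2, "memory_gb": 4}))
confirmed = onboarding.confirm("tenant-a")
rejected = onboarding.request(ResourceRequest("tenant-b", {"cpu": 1}))
```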

Security threats for the data and control plane are discussed in . There are several issues in a multi-tenant environment that create problems. In L2-based data center networks, the lack of security in VXLAN means that corruption of the VNI can lead to delivery to the wrong tenant. Also, ARP in IPv4 and ND in IPv6 are not secure, especially if gratuitous versions are accepted. When these are carried over a UDP encapsulation such as VXLAN, the problem is worse, since it is trivial for an untrusted application to spoof UDP packets.

In L3-based data center networks, the problem of address spoofing may arise. As a result, the destinations may contain untrusted hosts. This usually happens in cases such as virtual machines running third-party applications, and it requires the use of stronger security mechanisms.

This document makes no request to IANA.

The authors are grateful to Qiang Zu and Andrew Malis for helpful comments.

submitted version -00 as a working group draft after adoption