Difference between revisions of "RFC7342"

10.  References

10.1.  Normative References

[GratuitousARP]
          Cheshire, S., "IPv4 Address Conflict Detection", RFC 5227,
          July 2008.

[RFC826]  Plummer, D., "Ethernet Address Resolution Protocol: Or
          Converting Network Protocol Addresses to 48.bit Ethernet
          Address for Transmission on Ethernet Hardware", STD 37,
          RFC 826, November 1982.

[RFC1027] Carl-Mitchell, S. and J. Quarterman, "Using ARP to
          implement transparent subnet gateways", RFC 1027,
          October 1987.

[RFC3971] Arkko, J., Ed., Kempf, J., Zill, B., and P. Nikander,
          "SEcure Neighbor Discovery (SEND)", RFC 3971, March 2005.

[RFC4389] Thaler, D., Talwar, M., and C. Patel, "Neighbor Discovery
          Proxies (ND Proxy)", RFC 4389, April 2006.

[RFC4541] Christensen, M., Kimball, K., and F. Solensky,
          "Considerations for Internet Group Management Protocol
          (IGMP) and Multicast Listener Discovery (MLD) Snooping
          Switches", RFC 4541, May 2006.

[RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
          "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,
          September 2007.

[RFC4903] Thaler, D., "Multi-Link Subnet Issues", RFC 4903,
          June 2007.

[RFC6820] Narten, T., Karir, M., and I. Foo, "Address Resolution
          Problems in Large Data Center Networks", RFC 6820,
          January 2013.

10.2.  Informative References

[ARMD-Statistics]
          Karir, M. and J. Rees, "Address Resolution Statistics",
          Work in Progress, July 2011.

[ARP_Reduction]
          Shah, H., Ghanwani, A., and N. Bitar, "ARP Broadcast
          Reduction for Large Data Centers", Work in Progress,
          October 2011.

[IGMP-MLD-Tracking]
          Asaeda, H., "IGMP/MLD-Based Explicit Membership Tracking
          Function for Multicast Routers", Work in Progress,
          December 2013.

[L3-VM-Mobility]
          Kumari, W. and J. Halpern, "Virtual Machine mobility in L3
          Networks", Work in Progress, August 2011.

[Multi-Link]
          Thaler, D. and C. Huitema, "Multi-link Subnet Support in
          IPv6", Work in Progress, June 2002.

[RFC1076] Trewitt, G. and C. Partridge, "HEMS Monitoring and Control
          Language", RFC 1076, November 1988.

[RFC7048] Nordmark, E. and I. Gashinsky, "Neighbor Unreachability
          Detection Is Too Impatient", RFC 7048, January 2014.

[VXLAN]   Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
          L., Sridhar, T., Bursell, M., and C. Wright, "VXLAN: A
          Framework for Overlaying Virtualized Layer 2 Networks over
          Layer 3 Networks", Work in Progress, April 2014.
  
 

Latest revision as of 05:46, 2 October 2020

 Independent Submission                                        L. Dunbar
 Request for Comments: 7342                                       Huawei
 Category: Informational                                       W. Kumari
 ISSN: 2070-1721                                                  Google
                                                            I. Gashinsky
                                                                   Yahoo
                                                             August 2014

         Practices for Scaling ARP and Neighbor Discovery (ND)
                         in Large Data Centers

Abstract

This memo documents some operational practices that allow ARP and Neighbor Discovery (ND) to scale in data center environments.

Status of This Memo

This document is not an Internet Standards Track specification; it is published for informational purposes.

This is a contribution to the RFC Series, independently of any other RFC stream. The RFC Editor has chosen to publish this document at its discretion and makes no statement about its value for implementation or deployment. Documents approved for publication by the RFC Editor are not a candidate for any level of Internet Standard; see Section 2 of RFC 5741.

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc7342.

Copyright Notice

Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.


Introduction

This memo documents some operational practices that allow ARP/ND to scale in data center environments.

As described in RFC6820, the increasing trend of rapid workload shifting and server virtualization in modern data centers requires servers to be loaded (or reloaded) with different Virtual Machines (VMs) or applications at different times. Different VMs residing on one physical server may have different IP addresses or may even be in different IP subnets.

In order to allow a physical server to be loaded with VMs in different subnets or allow VMs to be moved to different server racks without IP address reconfiguration, the networks need to enable multiple broadcast domains (many VLANs) on the interfaces of L2/L3 boundary routers and Top-of-Rack (ToR) switches and allow some subnets to span multiple router ports.

Note: L2/L3 boundary routers as discussed in this document are capable of forwarding IEEE 802.1 Ethernet frames (Layer 2) without a Media Access Control (MAC) header change. When subnets span multiple ports of those routers, they still fall under the category of "single-link" subnets, specifically the multi-access link model recommended by RFC 4903. They are different from the "multi-link" subnets described in [Multi-Link] and RFC 4903, which refer to different physical media with the same prefix connected to one router. Within the "multi-link" subnet described in RFC 4903, Layer 2 frames from one port cannot be natively forwarded to another port without a header change.

Unfortunately, when the combined number of VMs (or hosts) in all those subnets is large, this can lead to address resolution (i.e., IPv4 ARP and IPv6 ND) scaling issues. There are three major issues associated with ARP/ND address resolution protocols when subnets span multiple L2/L3 boundary router ports:

1) ARP/ND messages are flooded to many physical link segments, which can reduce the bandwidth available for user traffic.

2) The ARP/ND processing load falls on the L2/L3 boundary routers.

3) In IPv4, every end station in a subnet receives ARP broadcast messages from all other end stations in the subnet. IPv6 ND has eliminated this issue by using multicast.
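The multicast behavior mentioned in item 3 can be illustrated with a short sketch. It assumes the standard solicited-node mapping, in which a Neighbor Solicitation is sent to ff02::1:ffXX:XXXX (the low 24 bits of the target address) and the group maps to an Ethernet address beginning with 33:33. The function names are hypothetical and chosen only for this illustration.

```python
import ipaddress


def solicited_node_multicast(addr: str) -> ipaddress.IPv6Address:
    """Return the solicited-node multicast group for an IPv6 unicast address."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)


def multicast_mac(group: ipaddress.IPv6Address) -> str:
    """Map an IPv6 multicast group to its Ethernet multicast MAC (33:33 + low 32 bits)."""
    low32 = int(group) & 0xFFFFFFFF
    octets = [0x33, 0x33] + list(low32.to_bytes(4, "big"))
    return ":".join(f"{o:02x}" for o in octets)


group = solicited_node_multicast("2001:db8::1:800:200e:8c6c")
print(group, multicast_mac(group))
```

Because only stations that share the same low 24 address bits join a given group, a solicitation is normally delivered to far fewer stations than an ARP broadcast would reach.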

Since the majority of data center servers are moving towards 1G or 10G ports, the bandwidth taken by ARP/ND messages, even when flooded to all physical links, becomes negligible compared to the link bandwidth. In addition, IGMP/MLD (Internet Group Management Protocol and Multicast Listener Discovery) snooping [RFC4541] can further reduce the ND multicast traffic to some physical link segments.

As modern servers' computing power increases, the processing required by a large number of ARP broadcast messages becomes less significant to servers. For example, lab testing shows that 2000 ARP requests per second take only 2% of the CPU on a single-core server. Therefore, the impact of ARP broadcasts on end stations is not significant on today's servers.
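The two claims above (negligible bandwidth, small CPU cost) can be checked with rough arithmetic. The sketch below assumes untagged, minimum-size 64-byte Ethernet frames plus 20 bytes of preamble and inter-frame gap per frame; those frame sizes and the helper names are assumptions for illustration, while the 2000 requests per second and 2% figures come from the text.

```python
ARP_FRAME_BYTES = 64      # assumed: minimum Ethernet frame (ARP payload is padded)
WIRE_OVERHEAD_BYTES = 20  # assumed: preamble/SFD (8 bytes) + inter-frame gap (12 bytes)


def arp_link_share(requests_per_s: float, link_bps: float) -> float:
    """Fraction of link bandwidth consumed by the given ARP request rate."""
    bits_per_s = requests_per_s * (ARP_FRAME_BYTES + WIRE_OVERHEAD_BYTES) * 8
    return bits_per_s / link_bps


def cpu_seconds_per_request(requests_per_s: float, cpu_fraction: float) -> float:
    """Average CPU time per ARP request implied by a measured CPU fraction."""
    return cpu_fraction / requests_per_s


print(f"share of a 1G link: {arp_link_share(2000, 1e9):.4%}")
print(f"CPU time per request: {cpu_seconds_per_request(2000, 0.02) * 1e6:.1f} microseconds")
```

Under these assumptions, 2000 ARP requests per second use roughly 1.3 Mbit/s, or about 0.13% of a 1G link, and the quoted 2% CPU load corresponds to about 10 microseconds of processing per request.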

Statistics provided by Merit Network [ARMD-Statistics] have shown that the major impact of a large number of mobile VMs in a data center is on the L2/L3 boundary routers, i.e., issue 2 above.

This memo documents some simple practices that can scale ARP/ND in a data center environment, especially in reducing processing loads to L2/L3 boundary routers.

Terminology

This document reuses much of the terminology from RFC6820. Many of the definitions are presented here to aid the reader.

ARP: IPv4 Address Resolution Protocol [RFC826]

Aggregation Switch: A Layer 2 switch interconnecting ToR switches

Bridge: IEEE 802.1Q-compliant device. In this document, the term "Bridge" is used interchangeably with "Layer 2 switch"

DC: Data Center

DA: Destination Address

End Station: VM or physical server, whose address is either the destination or the source of a data frame

EoR: End-of-Row switches in a data center

NA: IPv6 Neighbor Advertisement

ND: IPv6 Neighbor Discovery [RFC4861]

NS: IPv6 Neighbor Solicitation

SA: Source Address

ToR: Top-of-Rack Switch (also known as access switch)

UNA: IPv6 Unsolicited Neighbor Advertisement

VM: Virtual Machine

Subnet: Refers to the multi-access link subnet referenced by RFC 4903

Common DC Network Designs

Some common network designs for a data center include:

1) Layer 3 connectivity to the access switch,

2) Large Layer 2, and

3) Overlay models.

There is no single network design that fits all cases. The following sections document some of the common practices to scale address resolution under each network design.

Layer 3 to Access Switches

This network design configures Layer 3 to the access switches, effectively making the access switches the L2/L3 boundary routers for the attached VMs.

As described in RFC6820, many data centers are architected so that ARP/ND broadcast/multicast messages are confined to a few ports (interfaces) of the access switches (i.e., ToR switches).

Another variant of the Layer 3 solution is a Layer 3 infrastructure configured all the way to servers (or even to the VMs), which confines the ARP/ND broadcast/multicast messages to the small number of VMs within the server.

Advantage: Both ARP and ND scale well. There is no address resolution issue in this design.

Disadvantage: The main disadvantage of this network design occurs during VM movement. During VM movement, either VMs need an address change or switches/routers need a configuration change when the VMs are moved to different locations.

Summary: This solution is more suitable to data centers that have a static workload and/or network operators who can reconfigure IP addresses/subnets on switches before any workload change. No protocol changes are suggested.

Layer 2 Practices to Scale ARP/ND

Practices to Alleviate ARP/ND Burden on L2/L3 Boundary Routers

The ARP/ND broadcast/multicast messages in a Layer 2 domain can negatively affect the L2/L3 boundary routers, especially with a large number of VMs and subnets. This section describes some commonly used practices for reducing the ARP/ND processing required on L2/L3 boundary routers.

Communicating with a Peer in a Different Subnet

Scenario: When the originating end station doesn't have its default gateway MAC address in its ARP/ND cache and needs to communicate with a peer in a different subnet, it needs to send ARP/ND requests to its default gateway router to resolve the router's MAC address. If there are many subnets on the gateway router and a large number of end stations in those subnets that don't have the gateway MAC address in their ARP/ND caches, the gateway router has to process a very large number of ARP/ND requests. This is often CPU intensive, as ARP/ND messages are usually processed by the CPU (and not in hardware).

Note: Any centralized configuration that preloads the default MAC addresses is not included in this scenario.

Solution: For IPv4 networks, a practice to alleviate this problem is to have the L2/L3 boundary router send periodic gratuitous ARP [GratuitousARP] messages, so that all the connected end stations can refresh their ARP caches. As a result, most (if not all) end stations will not need to send ARP requests for the gateway routers when they need to communicate with external peers.
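As an illustration of what such a gratuitous ARP message looks like on the wire, the sketch below builds a broadcast ARP Request whose sender and target protocol addresses are both the gateway's own address, which is the announcement form described in [GratuitousARP]. It only constructs the 42-byte frame and does not transmit it; the function names, the example MAC, and the example IP address are hypothetical.

```python
import ipaddress
import struct

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_ARP = 0x0806


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(octet, 16) for octet in mac.split(":"))


def build_gratuitous_arp(gateway_mac: str, gateway_ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP for the gateway."""
    sha = mac_to_bytes(gateway_mac)                   # sender hardware address
    spa = ipaddress.IPv4Address(gateway_ip).packed    # sender protocol address
    ethernet = BROADCAST_MAC + sha + struct.pack("!H", ETHERTYPE_ARP)
    # htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4, oper=1 (request)
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Target hardware address is zero; target protocol address equals the sender's.
    return ethernet + arp_header + sha + spa + bytes(6) + spa


frame = build_gratuitous_arp("00:00:5e:00:53:01", "192.0.2.1")
print(len(frame), frame.hex())
```

A router would emit such a frame on each VLAN at an interval shorter than the end stations' ARP cache timeout; the interval itself is a deployment choice that the text does not specify.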

For the above scenario, IPv6 end stations are still required to send unicast ND messages to their default gateway router (even with those routers periodically sending Unsolicited Neighbor Advertisements) because IPv6 requires bidirectional path validation.

Advantage: This practice results in a reduction of ARP requests to be processed by the L2/L3 boundary router for IPv4.

Disadvantage: This practice doesn't reduce ND processing on the L2/L3 boundary router for IPv6 traffic.

Recommendation: If the network is an IPv4-only network, then this approach can be used. For an IPv6 network, one needs to consider the work described in [RFC7048]. Note: ND and Secure Neighbor Discovery (SEND) [RFC3971] use the bidirectional nature of queries to detect and prevent security attacks.

L2/L3 Boundary Router Processing of Inbound Traffic

Scenario: When an L2/L3 boundary router receives a data frame destined for a local subnet and the destination is not in the router's ARP/ND cache, some routers hold the packet and trigger an ARP/ND request to resolve the L2 address. The router may need to send multiple ARP/ND requests until either a timeout is reached or an ARP/ND reply is received before forwarding the data packets towards the target's MAC address. This process is not only CPU intensive but also buffer intensive.

Solution: To protect a router from being overburdened by resolving target MAC addresses, one solution is for the router to limit the rate of resolving target MAC addresses for inbound traffic whose target is not in the router's ARP/ND cache. When the rate is exceeded, the incoming traffic whose target is not in the ARP/ND cache is dropped.
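One way to picture this rate limit is as a token bucket in front of the resolution path. The sketch below is a minimal model under that assumption, not a description of any particular router: the class name, the `handle_inbound` helper, and the rate and burst values are all hypothetical, and time is passed in explicitly so the behavior is deterministic.

```python
class ResolutionRateLimiter:
    """Token bucket limiting how many ARP/ND resolutions may start per second."""

    def __init__(self, rate_per_s: float, burst: int, now: float = 0.0):
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill tokens for the elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle_inbound(dst_ip: str, cache: dict, limiter: ResolutionRateLimiter, now: float) -> str:
    """Decide what to do with an inbound packet for a local subnet."""
    if dst_ip in cache:
        return "forward"   # target already resolved
    if limiter.allow(now):
        return "resolve"   # hold the packet and send an ARP/ND request
    return "drop"          # resolution rate exceeded


limiter = ResolutionRateLimiter(rate_per_s=2, burst=2)
cache = {"10.0.0.1": "00:00:5e:00:53:10"}
print([handle_inbound(ip, cache, limiter, 0.0)
       for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4")])
```

Packets for cached targets are never affected; only cache misses compete for the limited resolution budget, which bounds both the CPU work and the number of packets held in buffers.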

For an IPv4 network, another common practice to alleviate pain caused by this problem is for the router to snoop ARP messages between other hosts, so that its ARP cache can be refreshed with active addresses in the L2 domain. As a result, there is an increased likelihood of the router's ARP cache having the IP-MAC entry when it receives data frames from external peers. RFC6820 Section 7.1 provides a full description of this problem.
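The snooping practice can be sketched as a cache that learns the sender IP-to-MAC binding from every ARP message the router observes, not only from replies to its own requests. The class below is a simplified model with hypothetical names and a hypothetical ageing policy; real routers apply their own cache timers and validation.

```python
class SnoopingArpCache:
    """ARP cache refreshed from ARP messages seen between other hosts."""

    def __init__(self, max_age_s: float):
        self.max_age_s = max_age_s
        self._entries = {}  # ip -> (mac, time last seen)

    def snoop(self, sender_ip: str, sender_mac: str, now: float) -> None:
        # Any observed ARP message carries a usable sender IP/MAC binding.
        self._entries[sender_ip] = (sender_mac, now)

    def lookup(self, ip: str, now: float):
        entry = self._entries.get(ip)
        if entry is None:
            return None
        mac, seen = entry
        if now - seen > self.max_age_s:
            del self._entries[ip]  # stale entry: the router must resolve again
            return None
        return mac


cache = SnoopingArpCache(max_age_s=300)
cache.snoop("10.0.1.20", "00:00:5e:00:53:20", now=0.0)
print(cache.lookup("10.0.1.20", now=10.0), cache.lookup("10.0.1.20", now=400.0))
```

The second lookup misses because the station has been silent longer than the ageing limit, which matches the disadvantage noted for targets that are not very active.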

For IPv6 end stations, routers are supposed to send Router Advertisements (RAs) unicast even if they have snooped UNAs/NSs/NAs from those stations. Therefore, this practice allows an L2/L3 boundary router to send unicast RAs to the target instead of multicast RAs. RFC6820 Section 7.2 has a full description of this problem.

Advantage: This practice results in a reduction of the number of ARP requests that routers have to send upon receiving IPv4 packets and the number of IPv4 data frames from external peers that routers have to hold due to targets not being in the ARP cache.

Disadvantage: The amount of ND processing on routers for IPv6 traffic is not reduced. IPv4 routers still need to hold data packets from external peers and trigger ARP requests if the targets of the data packets either don't exist or are not very active. In this case, neither IPv4 processing nor IPv4 buffer usage is reduced.

Recommendation: If there is a high chance of routers receiving data packets that are destined for nonexistent or inactive targets, alternative approaches should be considered.

Inter-Subnet Communications

The router could be hit with ARP/ND requests twice when the originating and destination stations are in different subnets attached to the same router and those hosts don't communicate with external peers often enough. The first hit is when the originating station in subnet-A initiates an ARP/ND request to the L2/L3 boundary router if the router's MAC is not in the host's cache (Section 5.1.1 above), and the second hit is when the L2/L3 boundary router initiates ARP/ND requests to the target in subnet-B if the target is not in the router's ARP/ND cache (Section 5.1.2 above).

Again, practices described in Sections 5.1.1 and 5.1.2 can alleviate some problems in some IPv4 networks.

For IPv6 traffic, the practices described above don't reduce the ND processing on L2/L3 boundary routers.

Recommendation: Consider the recommended approaches described in Sections 5.1.1 and 5.1.2. However, any solutions that relax the bidirectional requirement of IPv6 ND disable the security that the two-way ND communication exchange provides.

Static ARP/ND Entries on Switches

In a data center environment, the placement of L2 and L3 addressing may be orchestrated by Server (or VM) Management System(s). Therefore, it may be possible for static ARP/ND entries to be configured on routers and/or servers.

Advantage: This methodology has been used to reduce ARP/ND fluctuations in large-scale data center networks.

Disadvantage: When some VMs are added, deleted, or moved, many switches' static entries need to be updated. In a data center with virtualized servers, those events can happen frequently. For example, for an event of one VM being added to one server, if the subnet of this VM spans 15 access switches, all of them need to be updated. Network management mechanisms (SNMP, the Network Configuration Protocol (NETCONF), or proprietary mechanisms) are available to provide updates or incremental updates. However, there is no well-defined approach for switches to synchronize their content with the management system for efficient incremental updates.
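To make the idea of an incremental update concrete, the following sketch computes the difference between the static entries installed on one switch and the entries the management system wants. It is not a standardized mechanism (the memo notes that none exists); the function names `diff_static_entries` and `apply_diff` and the dictionary representation are assumptions for illustration only.

```python
def diff_static_entries(installed, desired):
    """Compute the changes that turn `installed` into `desired`.

    Both arguments map an IP address to a MAC address.  Returns a pair
    (to_add, to_remove): entries to install or overwrite, and the IP
    addresses whose entries should be deleted.
    """
    to_add = {ip: mac for ip, mac in desired.items() if installed.get(ip) != mac}
    to_remove = sorted(ip for ip in installed if ip not in desired)
    return to_add, to_remove


def apply_diff(installed, to_add, to_remove):
    """Return a new table with the incremental changes applied."""
    updated = dict(installed)
    for ip in to_remove:
        updated.pop(ip, None)
    updated.update(to_add)
    return updated
```

When one VM is added, `to_add` holds a single entry and `to_remove` is empty, so each of the affected switches receives one change rather than a full table.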

Recommendation: Additional work may be needed within IETF working groups (e.g., NETCONF, NVO3, I2RS) to get prompt incremental updates of static ARP/ND entries when changes occur.

ARP/ND Proxy Approaches

RFC1027 specifies one ARP Proxy approach referred to as "Proxy ARP". However, RFC 1027 does not discuss a scaling mechanism. Since the publication of RFC 1027 in 1987, many variants of Proxy ARP have been deployed. RFC 1027's Proxy ARP technique allows a gateway to return its own MAC address on behalf of the target station.

[ARP_Reduction] describes a type of "ARP Proxy" that allows a ToR switch to snoop ARP requests and return the target station's MAC if the ToR has the information in its cache. However, RFC4903 doesn't recommend the caching approach described in [ARP_Reduction] because such a cache prevents any type of fast mobility between Layer 2 ports and breaks Secure Neighbor Discovery RFC3971.

IPv6 ND Proxy RFC4389 specifies a proxy used between an Ethernet segment and other segments, such as wireless or PPP segments. To ensure that the proxy does not interfere with hosts moving from one segment to another, ND Proxy RFC4389 doesn't allow a proxy to send NA messages on behalf of the target. Therefore, the ND Proxy RFC4389 doesn't reduce the number of ND messages to an L2/L3 boundary router.

In short, the term "ARP/ND Proxy" has different interpretations, depending on vendors and/or environments.
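The difference between the two IPv4 proxy behaviors discussed in this section can be illustrated with a deliberately simplified sketch. The function names, the `proxied_ips` set, and the `tor_cache` dictionary are hypothetical; neither RFC 1027 nor [ARP_Reduction] defines these interfaces, and real implementations make these decisions from routing and forwarding state.

```python
def rfc1027_style_reply(target_ip, gateway_mac, proxied_ips):
    """Gateway-style Proxy ARP: answer with the gateway's OWN MAC address.

    proxied_ips is a hypothetical set of addresses the gateway is
    willing to answer for.  Returns None when the gateway stays silent.
    """
    return gateway_mac if target_ip in proxied_ips else None


def caching_tor_reply(target_ip, tor_cache):
    """Caching ToR proxy: answer with the TARGET's MAC address if cached.

    tor_cache is a hypothetical mapping of IP address to MAC address
    learned by snooping.  Returns None when the ToR has no entry and
    the request would be handled as usual.
    """
    return tor_cache.get(target_ip)
```

In the first variant the requester learns the gateway's address; in the second it learns the target's address from a cache that can go stale when the target moves, which is the concern raised above about caching.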

Recommendation: For IPv4, even though those Proxy ARP variants (not RFC 1027) have been used to reduce ARP traffic in various environments, there are many issues with caching.

The IETF should consider making proxy recommendations for data center environments as a transition issue to help DC operators transition to IPv6. Section 7 of RFC4389 ("Guidelines to Proxy Developers") should be considered when developing any new proxy protocols to scale ARP.

Multicast Scaling Issues

Multicast snooping (IGMP/MLD) has different implementations and scaling issues. RFC4541 notes that multicast IGMPv2/v3 snooping has trouble with subnets that include IGMPv2 and IGMPv3. RFC4541 also notes that MLDv2 snooping requires the use of either destination MAC (DMAC) address filtering or deeper inspection of frames/packets to allow for scaling.

MLDv2 snooping needs to be re-examined for scaling within the DC. Efforts such as IGMP/MLD explicit tracking [IGMP-MLD-Tracking] for downstream hosts need to provide better scaling than IGMP/MLDv2 snooping.

Practices to Scale ARP/ND in Overlay Models

There are several documents on using overlay networks to scale large Layer 2 networks (or avoid the need for large L2 networks) and enable mobility (e.g., [L3-VM-Mobility], [VXLAN]). Transparent Interconnection of Lots of Links (TRILL) and IEEE 802.1ah (Mac-in-Mac) are other types of overlay networks that can scale Layer 2.

Overlay networks hide the VMs' addresses from the interior switches and routers, thereby greatly reducing the number of addresses exposed to the interior switches and routers. The overlay edge nodes that perform the network address encapsulation/decapsulation still handle all remote stations' addresses that communicate with the locally attached end stations.

For a large data center with many applications, these applications' IP addresses need to be reachable by external peers. Therefore, the overlay network may have a bottleneck at the gateway node(s), which must resolve both the target stations' physical addresses (MAC or IP) and the overlay edge addresses within the data center.

Here are two approaches that can be used to minimize this problem:

1. Use static mapping as described in Section 5.2.

2. Have multiple L2/L3 boundary nodes (i.e., routers), with each handling a subset of stations' addresses that are visible to external peers (e.g., Gateway #1 handles a set of prefixes, Gateway #2 handles another subset of prefixes, etc.).
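The second approach amounts to partitioning externally visible prefixes among gateways. A minimal sketch, assuming a static table and using documentation prefixes, is shown below; the gateway names, the `GATEWAY_PREFIXES` table, and the `gateway_for` function are hypothetical and only illustrate the partitioning idea.

```python
import ipaddress

# Hypothetical assignment of externally visible prefixes to gateways.
GATEWAY_PREFIXES = {
    "gateway-1": ["192.0.2.0/25"],
    "gateway-2": ["192.0.2.128/25", "2001:db8::/48"],
}


def gateway_for(address, table=GATEWAY_PREFIXES):
    """Return the gateway responsible for `address`, or None if unassigned.

    If several prefixes match, the longest one wins.
    """
    addr = ipaddress.ip_address(address)
    best = None  # (gateway name, prefix length) of the best match so far
    for gateway, prefixes in table.items():
        for prefix in prefixes:
            net = ipaddress.ip_network(prefix)
            # Only compare addresses and networks of the same IP version.
            if addr.version == net.version and addr in net:
                if best is None or net.prefixlen > best[1]:
                    best = (gateway, net.prefixlen)
    return best[0] if best else None
```

Each gateway then only needs to resolve addresses within its own prefixes, which spreads the address resolution load across the boundary nodes.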

Summary and Recommendations

This memo describes some common practices that can alleviate the impact of address resolution on L2/L3 gateway routers.

In data centers, no single solution fits all deployments. This memo has summarized some practices in various scenarios and the advantages and disadvantages of all of these practices.

In some of these scenarios, the common practices could be improved by creating new IETF protocols and/or extending existing ones. The recommended protocol changes are:

o Relax the bidirectional requirement of IPv6 ND in some environments. However, relaxing this requirement will introduce other issues, so a comprehensive study of those issues is needed before making such changes.

o Create an incremental "update" scheme for efficient static ARP/ND entries.

o Develop IPv4 ARP/IPv6 ND Proxy standards for use in the data center. Section 7 of RFC4389 ("Guidelines to Proxy Developers") should be considered when developing any new proxy protocols to scale ARP/ND.

o Consider scaling issues with IGMP/MLD snooping to determine whether or not new alternatives can provide better scaling.

Security Considerations

This memo documents existing solutions and proposes additional work that could be initiated to extend various IETF protocols to better scale ARP/ND for the data center environment.

Security is a major issue for data center environments. Therefore, security should be seriously considered when developing any future protocol extensions.

Acknowledgements

We want to acknowledge the ARMD WG and the following people for their valuable inputs to this document: Joel Jaeggli, Dave Thaler, Susan Hares, Benson Schliesser, T. Sridhar, Ron Bonica, Kireeti Kompella, and K.K. Ramakrishnan.

10. References

10.1. Normative References

[GratuitousARP] Cheshire, S., "IPv4 Address Conflict Detection", RFC 5227, July 2008.

RFC826 Plummer, D., "Ethernet Address Resolution Protocol: Or Converting Network Protocol Addresses to 48.bit Ethernet Address for Transmission on Ethernet Hardware", STD 37, RFC 826, November 1982.

RFC1027 Carl-Mitchell, S. and J. Quarterman, "Using ARP to implement transparent subnet gateways", RFC 1027, October 1987.

RFC3971 Arkko, J., Ed., Kempf, J., Zill, B., and P. Nikander, "SEcure Neighbor Discovery (SEND)", RFC 3971, March 2005.

RFC4389 Thaler, D., Talwar, M., and C. Patel, "Neighbor Discovery Proxies (ND Proxy)", RFC 4389, April 2006.

RFC4541 Christensen, M., Kimball, K., and F. Solensky, "Considerations for Internet Group Management Protocol (IGMP) and Multicast Listener Discovery (MLD) Snooping Switches", RFC 4541, May 2006.

RFC4861 Narten, T., Nordmark, E., Simpson, W., and H. Soliman, "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, September 2007.

RFC4903 Thaler, D., "Multi-Link Subnet Issues", RFC 4903, June 2007.

RFC6820 Narten, T., Karir, M., and I. Foo, "Address Resolution Problems in Large Data Center Networks", RFC 6820, January 2013.

10.2. Informative References

[ARMD-Statistics] Karir, M. and J. Rees, "Address Resolution Statistics", Work in Progress, July 2011.

[ARP_Reduction] Shah, H., Ghanwani, A., and N. Bitar, "ARP Broadcast Reduction for Large Data Centers", Work in Progress, October 2011.

[IGMP-MLD-Tracking] Asaeda, H., "IGMP/MLD-Based Explicit Membership Tracking Function for Multicast Routers", Work in Progress, December 2013.

[L3-VM-Mobility] Kumari, W. and J. Halpern, "Virtual Machine mobility in L3 Networks", Work in Progress, August 2011.

[Multi-Link] Thaler, D. and C. Huitema, "Multi-link Subnet Support in IPv6", Work in Progress, June 2002.

RFC1076 Trewitt, G. and C. Partridge, "HEMS Monitoring and Control Language", RFC 1076, November 1988.

RFC7048 Nordmark, E. and I. Gashinsky, "Neighbor Unreachability Detection Is Too Impatient", RFC 7048, January 2014.

[VXLAN] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", Work in Progress, April 2014.

Authors' Addresses

Linda Dunbar
Huawei Technologies
5340 Legacy Drive, Suite 175
Plano, TX 75024
USA

Phone: (469) 277 5840
EMail: [email protected]

Warren Kumari
Google
1600 Amphitheatre Parkway
Mountain View, CA 94043
USA

EMail: [email protected]

Igor Gashinsky
Yahoo
45 West 18th Street 6th floor
New York, NY 10011
USA

EMail: [email protected]