Routing Deep, NATing Late: A Network Edge Philosophy
My IT career started back in 1993, and I've seen the network edge evolve dramatically. I remember the early days of access lists that blocked a few ports, which eventually gave way to stateful firewalls meticulously allowing only specific traffic. The default posture flipped entirely, from mostly open to deny-by-default. Recently, I've been reflecting on how we handle edge security today, especially with firewalls managing complex NAT configurations, DMZs, and intricate routing.
Over the years, I've worn many hats: router guy, WAN guy, LAN guy, firewall guy, load balancer guy – often several at once. One constant in technology is surprise; solving the unexpected problems is probably what keeps many of us engaged.
Through this journey, I've developed a personal set of guiding principles for designing the network edge – that critical point where the internet, untrusted networks, and our securely delivered services converge. These are my internal compass; reality, of course, always finds ways to surprise us. Still, I find these principles valuable:
My Core Principles for the Network Edge:
- Leverage Natural Routing as Deep as Possible.
- Perform NAT Only at the Last Possible Moment.
- Keep NAT Simple ("Lazy NAT").
(A quick note: My focus here is primarily on traffic originating from untrusted networks like the internet towards internal services. Internal NAT scenarios might benefit from these ideas, but I haven't applied as much thought there, so it's outside the scope of this post.)
1. Deep Routing: Let Routers Route
What do I mean by "deep routing"? Simply put: allow the original destination IP address to persist as far into your network infrastructure as logically possible.
I know some network veterans might balk at the idea of public IP addresses appearing in internal routing tables or DMZs. Perhaps there's wisdom there, or maybe it's a holdover from times when organizations had vast public /16s assigned to all internal systems, NAT wasn't as seamless, and stateful firewalls were less mature. My take is that we shouldn't automatically dismiss routing public IPs past the edge firewall if it simplifies the overall design, provided robust security policies are enforced regardless.
Consider a typical DMZ setup with firewalls protecting load balancers which front your application servers. Deep routing means your edge routers and firewalls route the incoming traffic using its original public destination IP all the way to the load balancer. The load balancer, sitting in the DMZ (protected by firewalls), becomes the natural point to handle the Destination NAT (DNAT), translating the public IP to the private IP of the selected backend server. You've used routing effectively right up to the point where translation is unavoidable.
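To make the idea concrete, here's a minimal Python sketch of that packet path. All addresses and hop names are invented for illustration; the point is simply that the public destination IP survives the router and the firewall untouched, and only the load balancer rewrites it:

```python
# A toy model of deep routing: each hop is a function on a packet.
# Only the final hop (the load balancer) performs DNAT.
from dataclasses import dataclass, replace
from itertools import cycle

@dataclass(frozen=True)
class Packet:
    src: str
    dst: str

BACKENDS = cycle(["10.0.20.11", "10.0.20.12"])  # private pool behind the LB

def edge_router(pkt: Packet) -> Packet:
    # Routes on the original public destination; no translation here.
    return pkt

def firewall(pkt: Packet) -> Packet:
    # Enforces policy against the original public IP; still no translation.
    assert pkt.dst == "203.0.113.10", "policy: only the published VIP is allowed"
    return pkt

def load_balancer(pkt: Packet) -> Packet:
    # The "last possible moment": DNAT the public VIP to a backend.
    return replace(pkt, dst=next(BACKENDS))

pkt = Packet(src="198.51.100.7", dst="203.0.113.10")
for hop in (edge_router, firewall, load_balancer):
    pkt = hop(pkt)
print(pkt)  # Packet(src='198.51.100.7', dst='10.0.20.11')
```

Note that the client's source IP never changes on the way in; that detail matters again later when we talk about load balancer deployment modes.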
2. Late NAT: Translate Only When Necessary
This principle flows directly from Deep Routing. If you're routing deep, you are inherently delaying NAT. In our DMZ example, the load balancer is the "last possible moment" in the path where you must perform DNAT because the backend servers use private addresses.
Why delay? It keeps the configuration on upstream devices (routers, edge firewalls) simpler. Their job is routing and security policy enforcement; they shouldn't take on complex address translation when it can be handled more appropriately downstream.
(Could you route public IPs even further, maybe directly to servers? Yes, in some specific scenarios – like one case I recall where public IPs landed directly on backend servers to satisfy a tricky vendor requirement. I'd call that "uncommon depth," facilitated by robust surrounding security. But for most typical web service deployments, the load balancer is the practical and logical place to draw the line for DNAT.)
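As a toy illustration of how little the upstream devices need to carry when NAT is deferred, here's a longest-prefix-match lookup over an invented two-entry routing table. The edge router holds a route for the published prefix and no translation state at all:

```python
# Assumed addresses throughout: 203.0.113.0/28 stands in for a published
# public service block, and "to-LB" / "to-upstream" are illustrative next hops.
import ipaddress

ROUTES = {
    ipaddress.ip_network("203.0.113.0/28"): "to-LB",   # published service block
    ipaddress.ip_network("0.0.0.0/0"): "to-upstream",  # default route
}

def next_hop(dst: str) -> str:
    # Longest-prefix match over the tiny table above; no NAT entries anywhere.
    addr = ipaddress.ip_address(dst)
    matches = [net for net in ROUTES if addr in net]
    return ROUTES[max(matches, key=lambda net: net.prefixlen)]

print(next_hop("203.0.113.10"))  # 'to-LB' -- the original public IP, routed deep
```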
3. Lazy NAT: Keep Translation Simple
Okay, so Deep Routing and Late NAT help avoid unnecessary NAT. But what about when NAT is required? This is where "Lazy NAT" comes in. It's about the how, not just the when and where.
Modern firewalls are incredibly capable NAT devices. I know this well. My point isn't that they can't do complex NAT; it's that they usually don't need to, and adding that complexity at the edge rarely pays off.
- Don't Conflate NAT with Security: A NAT policy isn't your primary security enforcement tool. Your firewall rules (or other security features) are responsible for stopping threats. Adding complex NAT rules doesn't inherently improve security; it just complicates the configuration. Route traffic through the firewall, apply security policy, and perform NAT elsewhere if routing allows. Don't NAT on the firewall just because you can if a downstream device (like the load balancer) is a more natural fit for DNAT.
- Simplify the NAT Rule Itself: When you do need NAT, keep the NAT rule as simple as the situation allows. Avoid PAT (Port Address Translation) unless conserving IP addresses is an absolute requirement. If a 1:1 NAT or a simple dynamic pool NAT works, use it. Why? Because it keeps the translation straightforward; enforcement still lies with the security policy. Operationally, if adding a new allowed service only requires a firewall rule change and doesn't force you to rework complex PAT rules, everyone wins (see the sketch after this list).
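Here's a hypothetical sketch (names and addresses invented) of what that separation buys you: the 1:1 NAT rule is a single line, the security policy does the enforcing, and opening a new service touches only the policy:

```python
# Lazy NAT as two independent lookups: a one-line 1:1 translation and a
# security policy that owns all enforcement decisions.

ONE_TO_ONE_NAT = {"203.0.113.10": "10.0.20.10"}  # every port, one entry

FIREWALL_POLICY = {("203.0.113.10", 443), ("203.0.113.10", 80)}  # allowed (dst, port)

def permitted(dst: str, port: int) -> bool:
    # Security decision lives here, independent of translation.
    return (dst, port) in FIREWALL_POLICY

def translate(dst: str) -> str:
    # Translation decision lives here, independent of security.
    return ONE_TO_ONE_NAT.get(dst, dst)

# Opening a new service (say TCP/8443) is one policy change; NAT is untouched.
FIREWALL_POLICY.add(("203.0.113.10", 8443))
assert permitted("203.0.113.10", 8443)
assert translate("203.0.113.10") == "10.0.20.10"
```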
Handling Outbound Traffic (Source NAT)
While the focus so far has been on inbound traffic and DNAT, what about outbound connections initiated from your internal networks? My approach here complements the inbound strategy: let the firewalls handle outbound Source NAT (SNAT).
The edge firewall is the natural choke point for outbound traffic leaving your trusted environment. Applying SNAT here (often PAT to conserve public IPs) provides policy control, hides internal addressing, and aligns with the firewall's role as the security boundary. The "Lazy NAT" principle still applies – keep the SNAT policy as simple as needed, relying on the main firewall rules for actual traffic enforcement.
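For completeness, here's a deliberately naive sketch of outbound PAT, with invented addresses and a simplistic port allocator. Many private sources share one public IP and are told apart by the source port the firewall assigns per connection:

```python
# Toy PAT table: (private source IP, source port) -> allocated public port.
# Real firewalls track full 5-tuples, timeouts, and port reuse; this only
# illustrates the many-to-one mapping.
import itertools

PUBLIC_IP = "203.0.113.1"
_ports = itertools.count(1024)   # naive ephemeral-port allocator
_xlate = {}                      # (private_src, src_port) -> public port

def snat(private_src: str, src_port: int) -> tuple[str, int]:
    key = (private_src, src_port)
    if key not in _xlate:
        _xlate[key] = next(_ports)  # new translation entry
    return PUBLIC_IP, _xlate[key]

print(snat("10.0.5.21", 51000))  # ('203.0.113.1', 1024)
print(snat("10.0.5.99", 51000))  # ('203.0.113.1', 1025) -- same port, new mapping
```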
A Note on Load Balancer Deployment: Routed vs. Proxy Mode
How you deploy your load balancers significantly impacts your ability to follow the Deep Routing principle effectively.
- Inline/Routed Mode: My strong preference is to deploy load balancers as true layer 3 hops (inline/routed mode). This means backend servers use the load balancer's internal interface IP as their default gateway. Traffic flows through the load balancer naturally. In this mode, the load balancer doesn't need to perform SNAT for return traffic or for health checks initiated from its own IP. The original client source IP is preserved all the way to the backend server, simplifying logging and application-level controls. This mode fully enables the Deep Routing philosophy.
- One-Arm/Proxy Mode: Many load balancers are deployed "on a stick" or in a proxy mode where they are not the default gateway for the servers. In this common setup, the load balancer must SNAT traffic destined for the backend servers. Otherwise, the servers would reply directly to the client, bypassing the load balancer on the return path (asymmetric routing). While common, and sometimes promoted by vendors (perhaps due to the routing demands on the appliance), this mandatory SNAT adds complexity and obscures the original client IP from the server's perspective.
When possible, design for inline/routed load balancer deployments. It aligns best with leveraging natural routing and minimizing unnecessary NAT complexity.
- The Exception: Conditional SNAT (Hairpinning): Even in a clean routed mode deployment, there's one scenario where SNAT on the load balancer is often necessary: when a client in the same network/subnet as the Virtual IP (VIP) needs to access that VIP. Without SNAT in this specific "hairpin" case, the server's return traffic would go directly back to the client within the same L2 domain, again bypassing the load balancer. So, configure SNAT on the LB, but only for traffic originating from the server-side network(s) destined for a local VIP, as the decision sketch below illustrates.
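To pin down that condition, here's a toy decision function, with the subnet and VIP assumed purely for illustration:

```python
# SNAT on the LB only when the client sits in the server-side network
# and is talking to a local VIP (the hairpin case).
import ipaddress

SERVER_NET = ipaddress.ip_network("10.0.20.0/24")
LOCAL_VIPS = {ipaddress.ip_address("10.0.20.100")}

def needs_lb_snat(client_ip: str, dst_ip: str) -> bool:
    client = ipaddress.ip_address(client_ip)
    dst = ipaddress.ip_address(dst_ip)
    # Hairpin: a same-subnet client hitting a local VIP would otherwise get
    # the server's reply directly, bypassing the LB on the return path.
    return client in SERVER_NET and dst in LOCAL_VIPS

print(needs_lb_snat("10.0.20.55", "10.0.20.100"))   # True  -- hairpin, SNAT
print(needs_lb_snat("198.51.100.7", "10.0.20.100")) # False -- normal routed flow
```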
Why This Philosophy?
Why advocate for this approach? Three main reasons:
- I Abhor Technical Debt: Unnecessary complexity introduced today, like mandatory LB SNAT in proxy mode or convoluted bi-directional NAT rules (often a sign of deeper design issues), becomes a maintenance headache tomorrow.
- Challenge Legacy Assumptions: Just because something was considered wise or necessary in the past doesn't mean it holds true today. Technology evolves, and our designs should reflect current capabilities.
- Operational Simplicity: Aligning function with device (routers/firewalls route and enforce policy, LBs balance and optionally NAT when required) makes troubleshooting and changes more straightforward.
By letting routers route, deploying load balancers intelligently, and performing NAT only when and as simply as needed, we can build more manageable, resilient, and operationally efficient network edges. Security policy remains paramount, but it doesn't need to be entangled with unnecessarily complex address translation schemes.
Glossary of Terms
For readers who might appreciate a quick definition of some terms used above:
- Network Edge: The boundary point where your internal, controlled network connects to an external, untrusted network like the internet.
- NAT (Network Address Translation): The process of rewriting IP addresses (and sometimes port numbers) in network packets as they pass through a routing device. This is commonly used to map private internal IP addresses to public internet addresses.
- DNAT (Destination NAT): A type of NAT that changes the destination IP address of incoming packets. Often used to direct traffic arriving at a public IP address to an internal server with a private IP address (e.g., directing web traffic to a load balancer or web server).
- SNAT (Source NAT): A type of NAT that changes the source IP address of outgoing packets. Often used for traffic leaving a private network to hide internal IPs or allow multiple internal devices to share a single public IP.
- PAT (Port Address Translation): Also known as NAT Overload. A common form of SNAT where many internal private IP addresses are mapped to a single public IP address. Different outgoing connections are distinguished by unique source port numbers assigned by the NAT device.
- DMZ (Demilitarized Zone): A perimeter network segment isolated between an internal private network and the external untrusted network. It typically hosts services (like web servers or load balancers) that need to be accessible from the outside but shouldn't have direct access to the internal network.
- Load Balancer: A device or service that distributes incoming client requests across a group (pool) of backend servers. This improves application availability, reliability, and scalability.
- VIP (Virtual IP Address): An IP address configured on a load balancer or firewall that clients connect to. The device then directs traffic destined for the VIP to one of the backend servers it manages.
- Public IP Address: An IP address that is globally unique and routable on the public internet.
- Private IP Address: An IP address from the ranges reserved for internal network use by RFC 1918 (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). These are not routable on the public internet.
- Inline/Routed Mode (Load Balancer): A deployment method where the load balancer acts as a Layer 3 routing hop (gateway) for the backend servers. Traffic naturally flows through the load balancer in both directions. This typically preserves the original client source IP address seen by the server.
- One-Arm/Proxy Mode (Load Balancer): A deployment method where the load balancer is not the gateway for the backend servers (it sits "on a stick" logically connected to the same network). It usually must perform SNAT on traffic going to the servers to ensure return traffic comes back through it.
- Hairpinning (or Hairpin NAT): A scenario where a client inside a network needs to access a service hosted inside the same network using its external-facing address (like a VIP). The traffic essentially goes out to the edge device (firewall or load balancer) and is routed back into the network. Requires careful NAT/routing configuration.
- Technical Debt: A concept borrowed from software development, referring to the long-term cost of choosing an easy or quick solution now instead of using a better approach that would take longer to implement initially. In networking, it means configurations that are overly complex, non-standard, or difficult to maintain and modify later.
- Stateful Firewall: A firewall that tracks the state of network connections (e.g., TCP connection status) and makes filtering decisions based on this state in addition to static rules.