AT9424S L3 Switch does not route traffic
Last week I ran into a tricky problem: an Allied Telesis layer 3 switch apparently refused to route traffic correctly.
I have to admit I had never run into this kind of problem before, but it was pretty fun (and relatively quick) to track down. We recently introduced an L3 switch at an SMB to separate traffic between the server, client, corporate WiFi and guest WiFi VLANs. Whereas the guest WiFi uses the firewall as its default gateway, the corporate WiFi uses the L3 switch as its DG. And apparently that didn't work. What complicated things was that this configuration sits in the middle of a migration from a flat network to a segmented one, which means we also had some mess with static routes.
The behavior
Clients behind the new DG were able to associate with the hidden SSID and ping their default gateway, as well as the switch's IP address on the other VLAN. But when they tried to ping a remote host on that VLAN, it didn't respond. Except that sometimes it did!
This was the interesting part. The first thought was that someone had forgotten to add the routes on the Windows servers, but that turned out not to be the case: all the servers were able to ping all the gateways. And soon after, the client was able to ping the server too... but not for long.
After checking the ARP table on the L3 switch, it turned out that this switch does not seem to perform ARP resolution for routed traffic. When a host in one VLAN tried to send traffic to the servers VLAN, the L3 switch didn't trigger an ARP request, so the ARP table stayed empty. On the other hand, if the server pinged the gateway on its own VLAN, the ARP table was populated and communication was restored... until the ARP cache entry expired (600 seconds). For the record, the switch is an Allied Telesis AT9424S running firmware revision 4.1.0, and the issue is fixed in Patch 2 of this firmware.
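To make the "sometimes it works" pattern concrete, here is a toy model of it in Python. It is only a sketch under my own assumptions: the function and variable names are made up, and the only fact taken from the switch is the 600-second ARP timeout. An entry is considered present if a server pinged the gateway less than 600 seconds ago.

```python
# Toy model of the intermittent behavior (not the switch's real logic).
ARP_TIMEOUT_S = 600  # ARP cache lifetime observed on the switch


def entry_present(now, server_ping_times, timeout=ARP_TIMEOUT_S):
    """Return True if a server ping at time t satisfies t <= now < t + timeout."""
    return any(t <= now < t + timeout for t in server_ping_times)


# Hypothetical scenarios, times in seconds:
one_off = [0]                          # a server pinged the gateway once
periodic = list(range(0, 3600, 300))   # a server pings at an interval below the timeout

for now in (100, 599, 600, 1800):
    print(now, entry_present(now, one_off), entry_present(now, periodic))
```

With a single server ping the client gets a 600-second window of connectivity and then loses it again, which matches what we saw. Any periodic ping with an interval shorter than the timeout keeps the entry alive.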
The tricky part
I didn't really want to roll back the configuration I had built with my colleagues. Holding up the customer and reverting to a flat network because of this issue wasn't an option: we are halfway to our goal, and the patch installation can be scheduled shortly. Reconfiguring all the APs would also annoy the users, which could cause more political problems with the PMs than technical ones (and no, we don't have a centralized AP management system).
Since the patch can be installed next week, I came up with a trick to work around the issue until then. I wrote a script that simply pings the L3 switch and added it to the scheduled tasks of the Windows servers, running every 5 minutes. This at least keeps the L3 switch's ARP table populated at all times, acting as a sort of "gratuitous ARP".
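The script itself is trivial. A minimal sketch of the idea in Python could look like the following; this is not the exact script I deployed, the function names are my own, and the gateway address is passed on the command line (the address in the usage example is a placeholder, not the customer's).

```python
import platform
import subprocess
import sys


def build_ping_command(gateway_ip, count=2, system=None):
    """Build the ping command line; Windows uses -n for the count, most others use -c."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), gateway_ip]


def refresh_arp_entry(gateway_ip, count=2):
    """Ping the L3 switch so that it learns (or refreshes) this host's ARP entry."""
    result = subprocess.run(
        build_ping_command(gateway_ip, count),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


# Only runs when a gateway address is given, e.g.: python keepalive.py 192.0.2.1
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(0 if refresh_arp_entry(sys.argv[1]) else 1)
```

Run from the Windows Task Scheduler every 5 minutes, this refreshes the entry well before the 600-second timeout. The exit code is non-zero when the switch doesn't answer, so a failing task is at least visible in the task history.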
Conclusion
Technically I wouldn't say this was a huge problem: in the end, the issue was tracked down and worked around within an hour. The real goal of this post is to show how a network analysis approach can save time (and possibly money) when troubleshooting "strange behaviors". Nothing should be left out of consideration; everything should be checked, just to make sure. Most of the time a strange behavior is caused by human error, but sometimes it isn't. Knowing the ISO/OSI model shouldn't be considered an "academic topic" to be left behind because it's useless at work. It's the exact opposite. Knowing the basics of all the technologies we work with (see my previous "Back to basics" posts) helps us better understand the problems we encounter today.
27/07/2013 08:00:00