Wednesday, 14 November 2007

Linux: plain weird network behaviour; Windows is OK

Update: problem fixed, thanks for the comments; it was an MTU-related issue.

Note: This is a long post, but I expect it to bring in some questions for many Linux people; I advise you to read this when you have enough time.

After the recent connectivity problems at home, which have lasted since Thursday evening, and the weird fact that NAT doesn't seem to work at home but does at work with the same setup, I called the ISP's support and wasted 15 minutes trying to convince them to at least send a technical team with another modem, just to test whether the modem is the reason for the breakage.

Of course, talking to the ISP's support and trying to convince them that there might be a problem on their side, since NAT works with another provider but not with them, was fruitless.

Tonight I was stuck and tried another approach, convinced I would confirm my suspicion that there is something wrong on the ISP's side. Still, I am not yet sure what to conclude from what happened.

So, to give a better view of what is going on, I'll describe the setup I have, along with its limitations and characteristics. People in a hurry can skip to the paragraph starting with "I tested" and stare in wonder at will.

So the connection I have is done through a DSL modem which gets the MAC of the network card connected to it and exposes that as its own to the ISP's network. This MAC seems to be quite persistent and special measures must be taken in order to be able to use another NIC to connect. The modem (or probably some machine in the ISP's network; the IP is …) offers a DHCP address and everything should work fine.

Because of this "your MAC is my MAC" issue, when I connected the first time I used a USB NIC, so that in case of a broken router or some temporary failure I could use the Internet connection directly from my laptop. I can say this decision has proven in time to be wise.

The router I used until now is an NSLU2 with Debian installed on it. The built-in network interface always faced the internal network.

The router (which I call ritter) served as a NAT router for two machines inside the network, my laptop and my apartment mate's laptop. All until last Thursday, after which it never got back properly.

I tested (doing NAT on my laptop; the laptop shouldn't have been affected by the power problems):
  • with ritter behind the laptop NAT, at home
  • with two different virtual machines as NAT "clients", at home
  • with ritter behind the laptop NAT, at work
  • with a virtual machine as NAT client, at home
  • directly from the laptop
  • NAT made through a SNAT rule
  • NAT made through a MASQUERADING rule
  • with TTL mangled (increased by one, although it was never in the ballpark of a low TTL)

Of course, I have n-checked that /proc/sys/net/ipv4/ip_forward was set to 1, that the tables had policy ACCEPT with no extra rules except the basic NAT-ting stuff, and that the routes were correctly set on both the clients and the machine doing the NAT.
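For reference, checks and rules of this kind can be sketched in Python. This is only a minimal sketch, not the exact commands used here: the interface name "eth1" and the address 192.0.2.10 are placeholders, and the functions just read the kernel flag and build iptables argument lists without running anything.

```python
from pathlib import Path


def forwarding_enabled(path="/proc/sys/net/ipv4/ip_forward"):
    """Return True if the kernel flag file at `path` contains 1."""
    return Path(path).read_text().strip() == "1"


def nat_rule(kind, out_iface, public_ip=None):
    """Build the argument list for a basic source-NAT rule (not executed)."""
    base = ["iptables", "-t", "nat", "-A", "POSTROUTING", "-o", out_iface, "-j"]
    if kind == "MASQUERADE":
        # MASQUERADE picks up whatever address the outgoing interface has.
        return base + ["MASQUERADE"]
    if kind == "SNAT":
        if public_ip is None:
            raise ValueError("SNAT needs the public address to rewrite to")
        return base + ["SNAT", "--to-source", public_ip]
    raise ValueError(f"unknown NAT kind: {kind}")


if __name__ == "__main__":
    # "eth1" and 192.0.2.10 are hypothetical values for illustration only.
    print(" ".join(nat_rule("MASQUERADE", "eth1")))
    print(" ".join(nat_rule("SNAT", "eth1", "192.0.2.10")))
```

MASQUERADE suits an address that changes with every DHCP lease, while SNAT needs the public address spelled out.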

All I could see is that:
  1. the machine doing the NAT was always working fine
  2. at work all NAT clients worked fine
  3. at home all of the NAT "clients" were
    1. able to resolve addresses, even though the DNS server was in ISP territory, on another LAN
    2. able to ping the outside world (if ping was available; in D-I it is not)
    3. hanging when trying to get an HTTP page
    4. able to telnet directly to port 80 (but I didn't try to "GET / HTTP/1.0")

So after all of this, I was thinking of trying to see if a new client (the laptop of my apartment mate, a Windows XP machine) would work using the connection at home. It didn't work.

Then I thought of trying "Internet Connection Sharing", as it is called in Windows. Of course, there was some pain in finding Windows XP drivers for the ASIX AX88172 network card (remember, the modem needed to see the MAC of that NIC), but I managed to find the proper one.

And, almost sure the NAT wouldn't work in this case either, I configured the new connection as a shared one. I didn't even disable the firewall, as I was thinking I could take those down gradually.

I wasn't expecting this, not even by pure chance, but my laptop, which was now a NAT "client", was able to browse, ping and do whatever was normal through NAT, while the Windows machine was doing the "Connection sharing".

I was utterly flabbergasted. And that was just the beginning.

I assumed the problem had coincidentally gone away, but a minute later I was proven wrong: it still didn't work with Linux as the NAT-ting machine. I connected the USB NIC back to the Windows machine and saw the same thing as before: NAT was just working.

At that point I was to observe an even more shocking fact: the IP that the Windows machine got was different from the one the Linux machine received, in spite of the fact that the network card was the same, so it would have made sense to get the same one. More than that, the IP that the Windows machine got was from an entirely different network, although it was a valid IP belonging to my ISP.

I was thinking that one reason why it works with Windows might be that there could be some TCP protocol twist that is differently implemented in Windows and the equipment from my ISP gets along better with the Windows network stack.

As a way to test that, I am thinking of somehow forcing the IP on the Linux machine to see if anything changes. But before doing that, I felt the urge to post this; maybe some kind soul will shed some light on this issue for me or drop a hint.

Another reason might be different DHCP servers answering, but I don't know how I can see in Windows who offered the lease.

If anyone has any clue why these weird things happen, please drop a line. I would greatly appreciate it. TIA.


Anonymous said...

As far as I remember "ipconfig /all" on windows will let you see the dhcp server which provided the lease.

Jason D. Clinton said...

Time to break out Wireshark, which runs on both platforms in question. It's really easy to use.

Justin said...

Strange problems+DSL generally == MTU issues.

Did you check your MTU settings?

and yeah, a capture would help :)

Anonymous said...

I remember having a similar problem with a Linux computer on my university network. Connected to one and the same port, a Fedora laptop and a Windows laptop got different IPs each time (actually, the GNU/Linux laptop always got the same IP assigned, while the Windows laptop would get a different IP every single time), and of course, the Fedora laptop was having trouble connecting to the Internet.

For me, the problem turned out to be ... the IP that the Fedora laptop got assigned happened to be a very old IP that was banned years ago for being hacked into (that was before they had DHCP server in the building, so everything was static and banning IP made sense, when a machine got compromised). The network administrators unbanning the IP, as you would've guessed it, fixed the problem right up. ;)

It's a long shot, but your situation might just be the same---i.e. the problem is on your ISP's end; either they did something bad with the particular IP your linux laptop gets, or the whole subnet is having routing problems?

Daniel said...

The Linux machine may be asking for a "bad" ip -- can you reset your DHCP client so it asks for a new lease?

Steinar H. Gunderson said...

I have to agree with Justin, this smells MTU issues even though it might be a long shot. The HTTP page hanging is one of those typical issues, where the first packet of the page never comes down. Usually such a too large packet should either be fragmented, or there should be an ICMP Fragmentation Needed, but I've seen ISPs who in their wisdom block that ICMP type (!!).
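The arithmetic behind that symptom can be sketched as follows. This is only an illustration under stated assumptions: the 1492-byte path MTU is a hypothetical value (typical of PPPoE links, which may or may not apply here), and the headers are assumed to carry no options.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options

ETHERNET_MTU = 1500
ASSUMED_PATH_MTU = 1492  # hypothetical: some link on the path is smaller


def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one packet of size `mtu`."""
    return mtu - IPV4_HEADER - TCP_HEADER


def fits(payload_bytes, path_mtu):
    """True if a TCP segment with this payload needs no fragmentation."""
    return payload_bytes + IPV4_HEADER + TCP_HEADER <= path_mtu


if __name__ == "__main__":
    print(mss_for_mtu(ETHERNET_MTU))      # 1460
    print(mss_for_mtu(ASSUMED_PATH_MTU))  # 1452
    # A small request or handshake packet gets through...
    print(fits(60, ASSUMED_PATH_MTU))     # True
    # ...but a full-sized segment from the web server does not.
    print(fits(mss_for_mtu(ETHERNET_MTU), ASSUMED_PATH_MTU))  # False
```

Under these assumptions DNS lookups, pings and a bare telnet to port 80 all stay well below the limit, while the first full-sized segment of a web page does not.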

If it works from the router but not from the clients, try the tcpmss target to iptables. This page has a brief description, which also mentions a few symptoms that probably mirror yours.
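A commonly used form of such a rule can be built like this. It is a sketch only: the thread does not say which exact rule ended up being used, and the function just assembles the iptables arguments without executing them.

```python
def clamp_mss_rule(set_mss=None):
    """Build an iptables TCPMSS rule for forwarded TCP SYN packets.

    With no argument the MSS is clamped to the discovered path MTU;
    otherwise it is set to the given fixed value.
    """
    rule = [
        "iptables", "-t", "mangle", "-A", "FORWARD",
        "-p", "tcp", "--tcp-flags", "SYN,RST", "SYN",
        "-j", "TCPMSS",
    ]
    if set_mss is None:
        rule.append("--clamp-mss-to-pmtu")
    else:
        rule += ["--set-mss", str(set_mss)]
    return rule


if __name__ == "__main__":
    print(" ".join(clamp_mss_rule()))
    # 1452 is a hypothetical value, matching an assumed 1492-byte MTU.
    print(" ".join(clamp_mss_rule(1452)))
```

The rule only touches SYN packets, because the MSS is negotiated once, during the TCP handshake.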

BTW, the other IP address over DHCP is a red herring; this is perfectly normal, as what identifies you is a combination of MAC address and something called a client ID. Your DHCP server has to bend the standard quite a bit to give the same address to both installations.
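A simplified model of that identification, following the RFC 2131 rule that a supplied client identifier takes precedence over the hardware address, might look like the sketch below. The MAC address is made up, and the assumption that one OS sends a client identifier while the other does not is only an illustration of how two installations can end up with different leases.

```python
def lease_key(mac, client_id=None):
    """Key a simplified DHCP server could use to look up a lease.

    If the client sent a client identifier, that is used; otherwise
    the server falls back to the hardware (MAC) address.
    """
    if client_id is not None:
        return ("client-id", client_id.lower())
    return ("hw-address", mac.lower())


if __name__ == "__main__":
    mac = "00:11:22:33:44:55"  # hypothetical NIC address
    # Assumption: one installation sends a client identifier made of the
    # hardware type (01) followed by the MAC; the other sends none.
    with_id = lease_key(mac, client_id="01:" + mac)
    without_id = lease_key(mac)
    print(with_id)
    print(without_id)
    print(with_id == without_id)  # False
```

Same NIC, different keys, so the server is free to hand out two different addresses.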

/* Steinar */

Olaf said...

It's easy to change the MAC of a NIC in Windows, so there's little need to use that USB NIC.

> as what identifies you is a combination of MAC address and something called a client ID.

What is it called exactly?
I really thought it was just MAC.

eddyp said...

> As far as I remember "ipconfig /all" on windows will let you see the dhcp server which provided the lease.

No, I tried this before posting. It only shows when the lease was obtained and until when it is valid.

Olaf said...

It does for me:
Ethernet adapter Local Area Connection:

Connection-specific DNS Suffix  . : lokaal
Description . . . . . . . . . . . : Realtek RTL8168/8111 PCI-E Gigabit Ethernet NIC
Physical Address. . . . . . . . . : 00-17-31-64-23-C5
Dhcp Enabled. . . . . . . . . . . : Yes
Autoconfiguration Enabled . . . . : Yes
IP Address. . . . . . . . . . . . :
Subnet Mask . . . . . . . . . . . :
Default Gateway . . . . . . . . . :
DHCP Server . . . . . . . . . . . :
DNS Servers . . . . . . . . . . . :
Lease Obtained. . . . . . . . . . : Wednesday, 14 November 2007 17:05:00
Lease Expires . . . . . . . . . . : Thursday, 15 November 2007 1:05:00
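Pulling the DHCP server field out of saved "ipconfig /all" output can be scripted. The sketch below is a hypothetical helper, not part of any Windows tool; it accepts both the English label and the Dutch "DHCP-server" spelling, and the sample address 192.0.2.1 is a placeholder.

```python
import re


def dhcp_server_from_ipconfig(text):
    """Return the DHCP server value from ipconfig-style text, or None."""
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        # Labels are padded with dots and spaces, e.g. "DHCP Server . . . ".
        name = label.strip(" .")
        if sep and re.fullmatch(r"dhcp[ -]server", name, re.IGNORECASE):
            return value.strip() or None
    return None


if __name__ == "__main__":
    sample = (
        "IP Address. . . . . . . . . . . . : 192.0.2.50\n"
        "DHCP Server . . . . . . . . . . . : 192.0.2.1\n"
        "Lease Obtained. . . . . . . . . . : Wednesday 17:05:00\n"
    )
    print(dhcp_server_from_ipconfig(sample))
```

An empty field, as in the listing above where the values were left out, comes back as None.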

eddyp said...

> Justin said...
>
> Strange problems+DSL generally == MTU issues.
>
> Did you check your MTU settings?
>
> and yeah, a capture would help :)

> Steinar H. Gunderson said...
>
> I have to agree with Justin, this smells MTU issues even though it might be a long shot.

That was it, thanks. There is some weirdness going on, I suppose somewhere in the ISP's setup.

Olaf said...

I would guess it's this line:
DHCP Server . . . . . . . . . . . :
