Today I experienced a server freeze on an Ubuntu 12.04 LTS system running the quantal kernel (3.5.0-48-generic #72~precise1-Ubuntu).
The last entries on the console were the following:
tg3 0000:03:00.2: eth2: 0: Host status block [00000005:00000003:(0000:0000:0000):(0000:0000)]
tg3 0000:03:00.2: eth2: 0: NAPI info [00000003:00000003:(0000:0000:01ff):0000:(02e6:0000:0000:0000)]
tg3 0000:03:00.2: eth2: 1: Host status block [00000001:000000c2:(0000:0000:0000):(0f22:0150)]
tg3 0000:03:00.2: eth2: 1: NAPI info [000000c2:000000c2:(00bf:0150:01ff):0f22:(0722:0722:0000:0000)]
tg3 0000:03:00.2: eth2: 2: Host status block [00000001:00000064:(0b3f:0000:0000):(0000:0049)]
tg3 0000:03:00.2: eth2: 2: NAPI info [00000064:00000064:(0049:0049:01ff):0b3f:(03ff:03ff:0000:0000)]
tg3 0000:03:00.2: eth2: 3: Host status block [00000001:00000024:(0000:0000:0000):(0000:012b)]
tg3 0000:03:00.2: eth2: 3: NAPI info [00000024:00000024:(012b:012b:01ff):0a8f:(028f:028f:0000:0000)]
tg3 0000:03:00.2: eth2: 4: Host status block [00000001:000000c7:(0000:0000:0d2e):(0000:010d)]
tg3 0000:03:00.2: eth2: 4: NAPI info [000000c7:000000c7:(010d:010d:01ff):0d2e:(052e:052e:0000:0000)]
tg3 0000:03:00.2: tg3_stop_block timed out, ofs=1400 enable_bit=2
tg3 0000:03:00.2: tg3_stop_block timed out, ofs=c00 enable_bit=2
tg3 0000:03:00.2: eth2: Link is down
tg3 0000:03:00.1: eth1: Link is down
tg3 0000:03:00.0: eth0: Link is down
br1: port 1(eth1) entered disabled state
After these entries, the system completely froze. Not even the console was working anymore.
Here is some additional information about the system:
ethtool -k eth2
Offload parameters for eth2:
lspci | grep 03:00
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.3 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
Linux myserver.local 3.5.0-48-generic #72~precise1-Ubuntu SMP Tue Mar 11 20:09:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
I've tried to pin the freeze on a specific bug, but couldn't find a report that EXACTLY describes this issue. I did, however, find some clues/possibilities:
Deadlock bug in tg3 driver (tg3_change_mtu)?
It's possible that this freeze was triggered by a bug in the tg3 driver in the tg3_change_mtu function. A bug fix was released just recently on March 4th 2014 (see https://lkml.org/lkml/2014/3/4/568).
According to the Ubuntu changelog for the linux-lts-quantal package, Ubuntu (Canonical) added this kernel fix in 3.5.0-49~precise1, released on May 5th 2014 (one week ago).
I will definitely give it a try with the new kernel.
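To check whether a machine is already running a kernel at or above the fixed release, something like the following could be used. This is only a minimal sketch: the helper names version_at_least and kernel_abi are made up for illustration, it assumes GNU sort with -V support, and the comparison is only meaningful within the linux-lts-quantal 3.5.0-NN series.

```shell
#!/bin/sh
# Hypothetical helpers (not part of any package), assuming GNU `sort -V`.

# Succeeds if version $1 is greater than or equal to version $2.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strips the flavour suffix from a `uname -r` style string,
# e.g. "3.5.0-48-generic" becomes "3.5.0-48".
kernel_abi() {
    printf '%s\n' "${1%-[a-z]*}"
}

# 3.5.0-49 is the release named in the Ubuntu changelog mentioned above.
fixed="3.5.0-49"
running="$(kernel_abi "$(uname -r)")"

if version_at_least "$running" "$fixed"; then
    echo "Running kernel $running is at or above $fixed"
else
    echo "Running kernel $running is older than $fixed"
fi
```

On the frozen server (3.5.0-48-generic) this would report the kernel as older than 3.5.0-49.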
Broken TSO (TCP Segmentation Offload) handling in tg3 driver?
I found another bug report which shows very similar kernel output (see http://hotpotato.tistory.com/361). This bug report seems to be a copy of https://access.redhat.com/site/solutions/69382, but unfortunately the solution on the Red Hat site can only be seen with a valid subscription. ARGH. According to the first page, the root cause for the issue is:
Certain Broadcom devices, mostly the BCM5704 controllers, failed to work due to incorrect TSO (TCP Segmentation Offload) handling in the tg3 driver. The TSO handling code has been revised so that the devices now work as expected.
But as this bug has been known on the Red Hat site since August 30th 2013, I still lean towards the first possibility (the deadlock bug).
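To test the TSO theory anyway, one could first check whether TSO is enabled on the interface at all. The following is a rough sketch, not something I ran on the affected server: tso_state is a made-up helper name, and it assumes that the output of ethtool -k contains a "tcp-segmentation-offload: on|off" line.

```shell
#!/bin/sh
# Hypothetical helper: reads `ethtool -k <iface>` output on stdin and prints
# "on", "off", or "unknown" if no tcp-segmentation-offload line is found.
tso_state() {
    awk -F': *' '
        $1 ~ /^[[:space:]]*tcp-segmentation-offload$/ {
            split($2, words, " ")   # drop trailers such as "[fixed]"
            print words[1]
            found = 1
            exit
        }
        END { if (!found) print "unknown" }
    '
}

# Usage (needs ethtool and the real interface, so left as comments):
#   ethtool -k eth2 | tso_state
# To switch TSO off for a test (as root, not persistent across reboots):
#   ethtool -K eth2 tso off
```

If the freeze came back with TSO switched off, that would speak against this second possibility.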
General tg3 issue with Broadcom BCM5719?
According to the VMware Knowledge Base entry #2035701, last updated on December 11th 2013, there is a general issue in the tg3 driver specific to BCM5719 and BCM5720 NIC controllers. The issue can be resolved by updating the Broadcom driver (tg3). As a workaround, the "NetQueue feature" can be disabled. As this is a VMware feature, it doesn't seem to be the cause for my freeze.
By the way, there is a video on YouTube (https://www.youtube.com/watch?v=6jRho13n-k4) from Önder Yilmaz, published on April 28th 2014, which seems to describe the same issue.
Update May 19th, 2014:
After an uptime of 5 days with the new kernel (3.5.0-49-generic), the entries have disappeared from /var/log/kern.log and dmesg.
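A check like this can be repeated with a small helper that counts the tg3 state-dump and timeout messages. This is just a sketch: count_tg3_errors is a made-up name, and the pattern only covers the three message types seen in the console output above.

```shell
#!/bin/sh
# Hypothetical helper: counts lines that look like the tg3 messages seen
# before the freeze. Reads the given file(s), or stdin if none is given.
count_tg3_errors() {
    grep -c -E 'tg3 [0-9a-f:.]+: .*(tg3_stop_block timed out|Host status block|NAPI info)' "$@"
}

# Usage (left as comments because the log file may not be readable or present):
#   count_tg3_errors /var/log/kern.log
#   dmesg | count_tg3_errors
```

A count of 0 means none of these messages were found. Note that grep then exits with status 1, which matters if the helper is used in a script running with set -e.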