
Ubuntu freeze due to bug in tg3 driver for Broadcom NIC (?)
Monday - May 12th 2014

Today I experienced a server freeze on an Ubuntu 12.04 LTS machine running the quantal kernel (3.5.0-48-generic #72~precise1-Ubuntu).

The last entries on the console were the following:

tg3 0000:03:00.2: eth2: 0: Host status block [00000005:00000003:(0000:0000:0000):(0000:0000)]
tg3 0000:03:00.2: eth2: 0: NAPI info [00000003:00000003:(0000:0000:01ff):0000:(02e6:0000:0000:0000)]
tg3 0000:03:00.2: eth2: 1: Host status block [00000001:000000c2:(0000:0000:0000):(0f22:0150)]
tg3 0000:03:00.2: eth2: 1: NAPI info [000000c2:000000c2:(00bf:0150:01ff):0f22:(0722:0722:0000:0000)]
tg3 0000:03:00.2: eth2: 2: Host status block [00000001:00000064:(0b3f:0000:0000):(0000:0049)]
tg3 0000:03:00.2: eth2: 2: NAPI info [00000064:00000064:(0049:0049:01ff):0b3f:(03ff:03ff:0000:0000)]
tg3 0000:03:00.2: eth2: 3: Host status block [00000001:00000024:(0000:0000:0000):(0000:012b)]
tg3 0000:03:00.2: eth2: 3: NAPI info [00000024:00000024:(012b:012b:01ff):0a8f:(028f:028f:0000:0000)]
tg3 0000:03:00.2: eth2: 4: Host status block [00000001:000000c7:(0000:0000:0d2e):(0000:010d)]
tg3 0000:03:00.2: eth2: 4: NAPI info [000000c7:000000c7:(010d:010d:01ff):0d2e:(052e:052e:0000:0000)]
tg3 0000:03:00.2: tg3_stop_block timed out, ofs=1400 enable_bit=2
tg3 0000:03:00.2: tg3_stop_block timed out, ofs=c00 enable_bit=2
tg3 0000:03:00.2: eth2: Link is down
tg3 0000:03:00.1: eth1: Link is down
tg3 0000:03:00.0: eth0: Link is down
br1: port 1(eth1) entered disabled state

After these entries, the system froze completely. Not even the console responded anymore.

Here is some additional information about the system:

ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: off

lspci  | grep 03:00
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.3 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

uname -a
Linux myserver.local 3.5.0-48-generic #72~precise1-Ubuntu SMP Tue Mar 11 20:09:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

I tried to pin the freeze on a specific bug, but couldn't find a report which describes EXACTLY this issue. I did, however, find some clues/possibilities:

Deadlock bug in tg3 driver (tg3_change_mtu)?
It's possible that this freeze was triggered by a bug in the tg3_change_mtu function of the tg3 driver. A fix was released just recently, on March 4th 2014 (see https://lkml.org/lkml/2014/3/4/568).
According to the Ubuntu changelog for the linux-lts-quantal package, Ubuntu (Canonical) added this kernel fix in 3.5.0-49~precise1, released on May 5th 2014 (one week ago).
I will definitely give the new kernel a try.
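
To see whether a machine already runs a kernel with this fix, the running release can be compared against 3.5.0-49. The following is only a rough sketch: the function name kernel_has_fix is made up for this example, it relies on sort -V from GNU coreutils, and the comparison is only meaningful on the 3.5.0 (linux-lts-quantal) kernel series.

```shell
#!/bin/sh
# First linux-lts-quantal version carrying the tg3 fix, per the Ubuntu changelog.
FIXED="3.5.0-49"

# Hypothetical helper: succeeds if the kernel release in $1 is >= $FIXED.
kernel_has_fix() {
    # Strip the flavour suffix, e.g. "3.5.0-48-generic" -> "3.5.0-48"
    release=$(printf '%s\n' "$1" | sed 's/-[a-z][a-z0-9]*$//')
    # sort -V orders version strings; the fix is present when
    # $FIXED sorts first (or both are equal).
    lowest=$(printf '%s\n%s\n' "$FIXED" "$release" | sort -V | head -n 1)
    [ "$lowest" = "$FIXED" ]
}

if kernel_has_fix "$(uname -r)"; then
    echo "running kernel is at or above $FIXED"
else
    echo "running kernel predates $FIXED, consider upgrading"
fi
```

On the server from this article (3.5.0-48-generic) the check fails, which matches the changelog: the fix only arrived with the following package version.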

Broken TSO (TCP Segmentation Offload) handling in tg3 driver?
I found another bug report which shows very similar kernel output (see http://hotpotato.tistory.com/361). This report seems to be a copy of https://access.redhat.com/site/solutions/69382, but unfortunately the solution on the Red Hat site can only be viewed with a valid subscription. ARGH. According to the first page, the root cause of the issue is:

Certain Broadcom devices, mostly the BCM5704 controllers, failed to work due to incorrect TSO (TCP Segmentation Offload) handling in the tg3 driver. The TSO handling code has been revised so that the devices now work as expected.

But as this bug has been known on the Red Hat site since August 30th 2013, I still lean towards the first possibility (the deadlock bug).
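
If someone wants to test the TSO theory anyway, the current TSO state can be read from the ethtool -k output shown above. A small sketch, assuming the output format from this article; the function name tso_state is hypothetical. Turning TSO off is my own idea for testing and is not taken from the (hidden) Red Hat solution.

```shell
#!/bin/sh
# Hypothetical helper: reads `ethtool -k <iface>` output on stdin and
# prints the state of tcp-segmentation-offload ("on" or "off").
tso_state() {
    awk -F': *' '$1 == "tcp-segmentation-offload" {
        split($2, a, " ")   # drop a trailing "[fixed]" marker, if any
        print a[1]
        exit
    }'
}

# Demonstration with the line from the ethtool output in this article:
printf 'tcp-segmentation-offload: on\n' | tso_state

# On a real system (eth2 is the interface from this article):
#   ethtool -k eth2 | tso_state
# To switch TSO off for a test (as root, not persistent across reboots):
#   ethtool -K eth2 tso off
```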

General tg3 issue with Broadcom BCM5719?
According to VMware Knowledge Base entry #2035701, last updated on December 11th 2013, there is a general issue in the tg3 driver specific to the BCM5719 and BCM5720 NIC controllers. The issue can be resolved by updating the Broadcom driver (tg3). As a workaround, the "NetQueue feature" can be disabled. As this is a VMware feature, it doesn't seem to be the cause of my freeze.

By the way, there is a video on YouTube (https://www.youtube.com/watch?v=6jRho13n-k4) from Índer Yilmaz, published on April 28th 2014, which seems to describe the same issue.

Update May 19th, 2014:
After an uptime of 5 days with the new kernel (3.5.0-49-generic), the entries have disappeared from /var/log/kern.log and dmesg.
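
A check like this can be repeated quickly by counting the tg3_stop_block timeouts in the kernel log. This is just a minimal sketch: the function name tg3_timeouts is made up, and the pattern assumes the message format from the console output above.

```shell
#!/bin/sh
# Hypothetical helper: prints the number of tg3_stop_block timeout
# lines found in the log file given as $1.
tg3_timeouts() {
    # grep -c prints the count; it exits non-zero when the count is 0,
    # so "|| true" keeps the function from reporting a failure.
    grep -c 'tg3 .*tg3_stop_block timed out' "$1" || true
}

log=/var/log/kern.log
if [ -r "$log" ]; then
    echo "$log: $(tg3_timeouts "$log") tg3_stop_block timeouts"
fi
```

For the ring buffer, the same pattern can be applied with: dmesg | grep -c 'tg3_stop_block timed out'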

 
