Documenting Problems That Were Difficult To Find The Answer To

Lenovo TS-140 Ethernet Card Halt

My Ethernet interface effectively died after 33 days of uptime under Linux, on a server that had otherwise been running more or less continuously for many months.

What was particularly bizarre was that I had an identical Lenovo TS-140 running beside it, attached to the same Ethernet switch. That one was running a GUI, and it froze completely at the same moment. At least the first (console-only) server was still accessible, so I could make a copy of the logs for later analysis after rebooting.

From /var/log/dmesg I had the following:

[2931914.307645] ------------[ cut here ]------------
[2931914.307663] WARNING: CPU: 0 PID: 0 at /build/linux-Mxzr_W/linux-3.13.0/net/sched/sch_generic.c:264 dev_watchdog+0x276/0x280()
[2931914.307668] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
[2931914.307671] Modules linked in: btrfs raid6_pq xor ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c nf_conntrack_netlink nfnetlink_queue nfnetlink_log nfnetlink bluetooth xt_LOG xt_limit ts_bm xt_comment xt_string xt_conntrack xt_HL xt_nat veth xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec kvm snd_hwdep snd_pcm mei_me snd_page_alloc serio_raw snd_timer mei snd shpchp lpc_ich soundcore mac_hid nf_nat_sip nf_conntrack_sip nf_nat nf_conntrack zfs(POX) zunicode(POX) zcommon(POX) znvpair(POX) spl(OX) zavl(POX) hid_generic usbhid hid dm_crypt usb_storage crct10dif_pclmul crc32_pclmul i915 aesni_intel aes_x86_64 lrw e1000e gf128mul psmouse glue_helper ablk_helper i2c_algo_bit cryptd ptp drm_kms_helper pps_core drm ahci libahci video wmi
[2931914.307793] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           OX 3.13.0-85-generic #129-Ubuntu
[2931914.307797] Hardware name: LENOVO ThinkServer TS140/ThinkServer TS140, BIOS FBKT82AUS 04/02/2014
[2931914.307800]  0000000000000000 ffff88051ea03d98 ffffffff8172b6a7 ffff88051ea03de0
[2931914.307808]  0000000000000009 ffff88051ea03dd0 ffffffff810699cd 0000000000000000
[2931914.307814]  ffff8800361a0000 ffff8804fc73e880 0000000000000001 0000000000000000
[2931914.307820] Call Trace:
[2931914.307824]  <IRQ>  [<ffffffff8172b6a7>] dump_stack+0x64/0x82
[2931914.307845]  [<ffffffff810699cd>] warn_slowpath_common+0x7d/0xa0
[2931914.307851]  [<ffffffff81069a3c>] warn_slowpath_fmt+0x4c/0x50
[2931914.307863]  [<ffffffff8164ef86>] dev_watchdog+0x276/0x280
[2931914.307870]  [<ffffffff8164ed10>] ? dev_graft_qdisc+0x80/0x80
[2931914.307878]  [<ffffffff81076956>] call_timer_fn+0x36/0x150
[2931914.307884]  [<ffffffff8164ed10>] ? dev_graft_qdisc+0x80/0x80
[2931914.307892]  [<ffffffff8107798f>] run_timer_softirq+0x21f/0x310
[2931914.307900]  [<ffffffff8106f00c>] __do_softirq+0xfc/0x310
[2931914.307908]  [<ffffffff8106f595>] irq_exit+0x105/0x110
[2931914.307919]  [<ffffffff8173e755>] smp_apic_timer_interrupt+0x45/0x60
[2931914.307926]  [<ffffffff8173d0dd>] apic_timer_interrupt+0x6d/0x80
[2931914.307929]  <EOI>  [<ffffffff815dc5e2>] ? cpuidle_enter_state+0x52/0xc0
[2931914.307946]  [<ffffffff815dc5d8>] ? cpuidle_enter_state+0x48/0xc0
[2931914.307954]  [<ffffffff815dc72c>] cpuidle_idle_call+0xdc/0x220
[2931914.307963]  [<ffffffff8101e4de>] arch_cpu_idle+0xe/0x30
[2931914.307971]  [<ffffffff810c1eb5>] cpu_startup_entry+0xc5/0x2b0
[2931914.307980]  [<ffffffff81719777>] rest_init+0x77/0x80
[2931914.307990]  [<ffffffff81d34f70>] start_kernel+0x438/0x443
[2931914.307998]  [<ffffffff81d34941>] ? repair_env_string+0x5c/0x5c
[2931914.308006]  [<ffffffff81d34120>] ? early_idt_handler_array+0x120/0x120
[2931914.308014]  [<ffffffff81d345ee>] x86_64_start_reservations+0x2a/0x2c
[2931914.308021]  [<ffffffff81d34733>] x86_64_start_kernel+0x143/0x152
[2931914.308026] ---[ end trace 7c85c7d5a955f5e4 ]---
[2931914.308063] e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly
[2931918.468625] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[2931938.327046] e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly
[2931942.327873] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

This is almost exactly the same set of messages described in this bug thread (which had no solution at the time of writing).

A few solutions were proposed. One was to disable TSO, GSO and GRO using ethtool:

ethtool -K eth0 gso off gro off tso off
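If you go this route, it may be worth recording the offload state before and after, so you can confirm the change took effect. The following is only a sketch: offload_state is a helper name I made up, and eth0 is assumed to be the interface, as in the log above. Lowercase -k only reads the settings, while uppercase -K changes them, and changes made with -K do not survive a reboot unless you reapply them at boot.

```shell
# offload_state (hypothetical helper): reads `ethtool -k` output on stdin
# and keeps only the three top-level offload lines discussed above.
offload_state() {
    grep -E '^(tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload):'
}

# Usage on the real machine (requires ethtool; eth0 is an assumption):
#   ethtool -k eth0 | offload_state
```

After running the ethtool -K command, all three lines should read "off".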

But I decided to try turning off PCIe Active State Power Management (ASPM) in the kernel after seeing the following in /var/log/dmesg:

[    0.114082] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
[    0.147241] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    0.147621] acpi PNP0A08:00: FADT indicates ASPM is unsupported, using BIOS configuration

So I followed the recommendation in this post by adding pcie_aspm=off to /etc/default/grub as follows:

GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off nosplash"

… and then ran sudo update-grub. The new parameter only takes effect on the next boot.
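After rebooting, you can check that the parameter actually reached the kernel by looking at /proc/cmdline. This is a minimal sketch rather than anything from the post I followed; has_aspm_off is a hypothetical helper name.

```shell
# has_aspm_off (hypothetical helper): succeeds if the kernel command line
# passed as $1 contains pcie_aspm=off as a whole word.
has_aspm_off() {
    case " $1 " in
        *" pcie_aspm=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the running kernel, if /proc/cmdline is available.
if [ -r /proc/cmdline ]; then
    if has_aspm_off "$(cat /proc/cmdline)"; then
        echo "pcie_aspm=off is on the kernel command line"
    else
        echo "pcie_aspm=off is NOT on the kernel command line"
    fi
fi
```

If the second message appears, the usual suspects are a typo in /etc/default/grub or forgetting to run update-grub before rebooting.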

Note that I cannot tell you whether this definitively works. The Ethernet crash has only happened once in the 14 months I’ve had the server. Hopefully it won’t happen again.
