
runtime/netpoll: heavy epoll_wait activity on AF_LOCAL sockets we're not really waiting for #17249

Closed
tpotega opened this issue Sep 27, 2016 · 14 comments
Labels
FrozenDueToAge; NeedsDecision (feedback is required from experts, contributors, and/or the community before a change can be made)

Comments

@tpotega

tpotega commented Sep 27, 2016

What version of Go are you using (go version)?

go version go1.7.1 linux/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GORACE=""
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build017533435=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"

Description

Golang's epoll-based poller registers every socket it manages (including Unix domain sockets) with EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, even sockets that have not yet been used and therefore have not yet had a chance to return EAGAIN on read/write. This differs a bit from the usage scenario suggested by the epoll man page.

This can result in very high netpoll activity - thousands of epoll_wait() calls constantly returning a "ready-to-write" AF_LOCAL socket (a syslog-ng socket in my example).
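For reference, the registration that later shows up in the strace output is equivalent to roughly the following - a minimal sketch using golang.org/x/sys/unix rather than the runtime's internal netpoll code:

package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// Create an epoll instance, as the runtime does when the netpoller is
	// first used.
	epfd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Close(epfd)

	// Open a non-blocking AF_LOCAL datagram socket, standing in for the
	// net.Dial("unixgram", ...) connection in the repro program below.
	fd, err := unix.Socket(unix.AF_UNIX, unix.SOCK_DGRAM|unix.SOCK_NONBLOCK|unix.SOCK_CLOEXEC, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Close(fd)

	// Register the fd the way the poller does: read and write readiness,
	// edge-triggered, even though nothing has been written to (or read
	// from) the socket yet.
	ev := unix.EpollEvent{
		Events: unix.EPOLLIN | unix.EPOLLOUT | unix.EPOLLRDHUP | unix.EPOLLET,
		Fd:     int32(fd),
	}
	if err := unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, fd, &ev); err != nil {
		log.Fatal(err)
	}
}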

How to reproduce

Connect to a busy syslog-ng unixgram socket, then make netpoll wait on some other network operation - a GET on localhost:12345 in this example:

package main

import (
	"fmt"
	"net"
	"net/http"
)

func main() {
	// Dial the busy syslog-ng datagram socket. We never read or write it,
	// but the fd is still registered with the netpoller.
	a, _ := net.Dial("unixgram", "/var/run/syslog")
	_ = a

	// Block in the netpoller waiting for the HTTP response.
	b, _ := http.Get("http://localhost:12345/")
	fmt.Println(b)
}

Redacted strace output

fd 3 - /var/run/syslog, the syslog-ng unix datagram socket opened by net.Dial("unixgram", "/var/run/syslog"):

[pid 27747] socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0 <unfinished ...>
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 27747] <... socket resumed> )      = 3
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27747] setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4 <unfinished ...>
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 27747] <... setsockopt resumed> )  = 0
[pid 27747] connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/syslog"}, 16 <unfinished ...>
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 27747] <... connect resumed> )     = 0
[pid 27747] epoll_create1(EPOLL_CLOEXEC <unfinished ...>
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27747] <... epoll_create1 resumed> ) = 4
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>

fd 3 gets added to the epoll instance at fd 4, with event data set to {u32=4082519848, u64=140161750290216}:

[pid 27747] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=4082519848, u64=140161750290216}}) = 0
[pid 27747] getsockname(3, {sa_family=AF_LOCAL, NULL}, [2]) = 0
[pid 27747] getpeername(3, {sa_family=AF_LOCAL, sun_path="/var/run/syslog"}, [20]) = 0
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, 0) = 1
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 27818] select(0, NULL, NULL, NULL, {0, 3} <unfinished ...>

First sign of "available for write()" for fd 3:

[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27818] <... select resumed> )      = 0 (Timeout)
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>

epoll_wait() calls continue to return EPOLLOUT for fd 3:

[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1

fd 5 will be used for the GET request:

[pid 27747] socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP <unfinished ...>
[pid 27817] <... select resumed> )      = 0 (Timeout)
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27813] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 27747] <... socket resumed> )      = 5
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 27747] setsockopt(5, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
[pid 27747] connect(5, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("127.0.0.1")}, 16 <unfinished ...>
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 27747] <... connect resumed> )     = -1 EINPROGRESS (Operation now in progress)

fd 5 is added to the epoll instance with {u32=4082519656, u64=140161750290024}:

[pid 27747] epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=4082519656, u64=140161750290024}} <unfinished ...>
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519656, u64=140161750290024}}}, 128, -1) = 1
[pid 27747] <... epoll_ctl resumed> )   = 0
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27747] getsockopt(5, SOL_SOCKET, SO_ERROR,  <unfinished ...>
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27747] <... getsockopt resumed> [0], [4]) = 0
[pid 27747] getsockname(5, {sa_family=AF_INET, sin_port=htons(24603), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
[pid 27817] <... select resumed> )      = 0 (Timeout)
[pid 27747] getpeername(5, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
[pid 27747] setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27747] setsockopt(5, SOL_SOCKET, SO_KEEPALIVE, [1], 4 <unfinished ...>
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27747] <... setsockopt resumed> )  = 0
[pid 27747] setsockopt(5, SOL_TCP, TCP_KEEPINTVL, [30], 4) = 0
[pid 27747] setsockopt(5, SOL_TCP, TCP_KEEPIDLE, [30], 4) = 0
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4,  <unfinished ...>

The GET request itself:

[pid 27747] write(5, "GET / HTTP/1.1\r\nHost: localhost:"..., 96 <unfinished ...>
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27817] mprotect(0x7f79f2d5f000, 4096, PROT_NONE <unfinished ...>
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27747] <... write resumed> )       = 96
Process 27846 attached
[pid 27813] select(0, NULL, NULL, NULL, {0, 320} <unfinished ...>
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27813] select(0, NULL, NULL, NULL, {0, 640} <unfinished ...>
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27817] read(5,  <unfinished ...>
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27817] <... read resumed> 0xc4200fd000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1

Waiting for the GET reply, we get tens of thousands of epoll_wait() EPOLLOUT events for fd 3:

[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
...
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
...

Expected behavior

It seems the poller should not add the fds we're not really waiting for with the EPOLLIN|EPOLLOUT flags.

@bradfitz
Contributor

/cc @dvyukov @ianlancetaylor

@dvyukov
Member

dvyukov commented Sep 27, 2016

It seems the poller should not add the fds we're not really waiting for with the EPOLLIN|EPOLLOUT flags.

This should be fine per se, because we use EPOLLET.
What is the kernel version?

@tpotega
Author

tpotega commented Sep 27, 2016

CentOS 7 - 3.10.0-327.36.1.el7.x86_64

EPOLLET does not guarantee that you will get exactly one event in this case; see http://cmeerw.org/blog/753.html#753 (or https://github.com/bjaspan/epoll-test for a slightly different case).

@dvyukov
Member

dvyukov commented Sep 27, 2016

EPOLLET does not guarantee that you will get exactly one event in this case; see http://cmeerw.org/blog/753.html#753 (or https://github.com/bjaspan/epoll-test for a slightly different case).

Netpoller does not rely on the fact that there will be only one notification. Any finite number of notifications should be OK.

@tpotega
Author

tpotega commented Sep 27, 2016

Yes, functionally everything should be correct, but the thousands of syscalls we are seeing every second could be avoided. That's more of a performance issue, for sure.

@dvyukov
Member

dvyukov commented Sep 27, 2016

@tpotega what exactly did you want to point out with these links?

You said that exactly one event is not guaranteed, so I thought you were pointing to:

For UDP sockets you get an EPOLLOUT event when the send operation completes, whereas with UNIX domain sockets you get an EPOLLOUT event when the data has been read by the peer. And finally, with UNIX pipes you get an event when the send operation completes and another event when the data is read by the peer.

Hence my reply: getting one or two notifications is irrelevant for the netpoller.

Is there a description as to why/when zillions of notifications happen?

@tpotega
Author

tpotega commented Sep 29, 2016

The links were meant just as an example of (possibly) unexpected EPOLLET behavior.

As to why/when the notifications happen - that's a longer story.

TL;DR: zillions of notifications don't appear on 3.10.0-327.28.2.el7.x86_64 and up.
(and yes, the machines affected were running 3.10.0-327.18.2.el7.x86_64, not -36.1)

As for the 3.10.0-327.18.2.el7 code path: unix_dgram_recvmsg() (which handles AF_LOCAL datagram input, as the name suggests) ends its processing by freeing the datagram (skb_free_datagram()). This passes control to the appropriate destructor - unix_destruct_scm() in the case of AF_LOCAL sockets - which in turn calls the generic sock_wfree():

/*
 * Write buffer destructor automatically called from kfree_skb.
 */
void sock_wfree(struct sk_buff *skb)
{
        struct sock *sk = skb->sk;
        unsigned int len = skb->truesize;

        if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) {
                /*
                 * Keep a reference on sk_wmem_alloc, this will be released
                 * after sk_write_space() call
                 */
                atomic_sub(len - 1, &sk->sk_wmem_alloc);
                sk->sk_write_space(sk);
                len = 1;
        }
        /*
         * if sk_wmem_alloc reaches 0, we must finish what sk_free()
         * could not do because of in-flight packets
         */
        if (atomic_sub_and_test(len, &sk->sk_wmem_alloc))
                __sk_free(sk);
}

sk_write_space() for AF_LOCAL sockets is set to unix_write_space():

static inline int unix_writable(struct sock *sk)
{
        return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
}

static void unix_write_space(struct sock *sk)
{
        struct socket_wq *wq;

        rcu_read_lock();
        if (unix_writable(sk)) {
                wq = rcu_dereference(sk->sk_wq);
                if (wq_has_sleeper(wq))
                        wake_up_interruptible_sync_poll(&wq->wait,
                                POLLOUT | POLLWRNORM | POLLWRBAND);
                sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
        }
        rcu_read_unlock();
}

A notification will be generated whenever there is buffer space available - per unix_writable() above, that means sk_wmem_alloc accounts for no more than a quarter of sk_sndbuf - and it is a new event every time, even with EPOLLET. With many packets being received on that socket, expect even more than zillions of notifications.

See also: http://vger.kernel.org/~davem/skb_sk.html

But now, the bomb: some recent AF_LOCAL changes backported to 3.10.0-327.28.2.el7 (and up) hide the problem.

@quentinmit
Contributor

If this is fixed in newer kernels (where "newer" means after 2013), I don't think there's much we can do in Go.

Though I guess 2013 is still rather recent.

/cc @rsc @aclements Should we work around this kernel bug?

@quentinmit added the NeedsDecision label Oct 4, 2016
@quentinmit added this to the Go1.8Maybe milestone Oct 4, 2016
@ianlancetaylor
Contributor

I don't see any way to fix this without noticeably hurting performance for the normal case, because we would have to shuffle descriptors in and out of the epoll set. Since it apparently is not a problem in newer kernels, I think we should do nothing.
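To make the cost concrete: "shuffling descriptors" would amount to arming EPOLLOUT only while a write is actually blocked, roughly as in the sketch below. This is a sketch only, using golang.org/x/sys/unix rather than the runtime's internals, and it assumes the fd is already registered for EPOLLIN in epfd:

package sketch

import "golang.org/x/sys/unix"

// waitWritable arms write readiness on demand and disarms it afterwards.
// This is what an "on demand" poller would have to do per blocked write;
// the real runtime keeps EPOLLIN|EPOLLOUT armed permanently instead.
func waitWritable(epfd, fd int) error {
	// Arm EPOLLOUT only now that a write has actually returned EAGAIN.
	ev := unix.EpollEvent{
		Events: unix.EPOLLIN | unix.EPOLLOUT | unix.EPOLLRDHUP | unix.EPOLLET,
		Fd:     int32(fd),
	}
	if err := unix.EpollCtl(epfd, unix.EPOLL_CTL_MOD, fd, &ev); err != nil {
		return err
	}

	// Wait for the readiness notification (a real poller would dispatch
	// on the event data here).
	events := make([]unix.EpollEvent, 1)
	if _, err := unix.EpollWait(epfd, events, -1); err != nil {
		return err
	}

	// Disarm EPOLLOUT again so an always-writable but otherwise idle
	// socket (like the syslog one above) stops generating events.
	ev.Events = unix.EPOLLIN | unix.EPOLLRDHUP | unix.EPOLLET
	return unix.EpollCtl(epfd, unix.EPOLL_CTL_MOD, fd, &ev)
}

That is two additional epoll_ctl() syscalls for every write that blocks, which is the performance cost being weighed against the kernel-side fix.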

@aclements
Member

I agree with @ianlancetaylor. I don't see how to handle this without having to constantly change the state of the epoll descriptor.

However, I'm a bit confused about when this was actually fixed upstream. @tpotega, you mentioned these changes were backported to 3.10.0. Do you know what upstream change they were backported from? Based on the version numbers you gave, I assume this is a RedHat kernel, but I can't figure out how to get any useful log of their backporting activity.

@tpotega
Author

tpotega commented Oct 5, 2016

Yep, those were RHEL/CentOS kernel versions.

This was a side effect of the upstream fix for CVE-2013-7446 (torvalds/linux@7d26727).

@dvyukov
Member

dvyukov commented Oct 5, 2016

Amusingly, it was me who found and reported the kernel bug...
https://groups.google.com/forum/#!topic/syzkaller/3twDUI4Cpm8

@tpotega
Author

tpotega commented Oct 5, 2016

😄

@quentinmit
Contributor

@aclements and I were able to reproduce this on a machine with http://kernel.ubuntu.com/git/ubuntu/ubuntu-trusty.git/tag/?h=Ubuntu-lts-4.2.0-22.27_14.04.1

We had to change the example to use /dev/log instead of /var/run/syslog, since our system has its syslog socket at the former location. With that change, we were able to observe poll returning every time another process on the system called read() or write() on a socket connected to /dev/log. Effectively, the poll loop is waking up twice every time another process writes to syslog.

This obviously isn't ideal, but it's also a small and bounded amount of work. We think that it is not worth putting a workaround into Go for this. If the slight extra CPU is causing problems for your application, we recommend upgrading to a newer kernel.
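For anyone retracing this, a second program that writes datagrams to the same socket in a loop will drive those wakeups on demand rather than waiting for other processes to log - a sketch, assuming the system's syslog socket is the /dev/log mentioned above:

package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Plays the role of "another process writing to syslog": on affected
	// kernels, each datagram written here (and later read by the syslog
	// daemon) wakes up the netpoller in the repro program above.
	c, err := net.Dial("unixgram", "/dev/log")
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	msg := []byte("<14>netpoll-repro: ping") // hypothetical test message
	for {
		if _, err := c.Write(msg); err != nil {
			log.Fatal(err)
		}
		time.Sleep(time.Millisecond)
	}
}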

@golang locked and limited conversation to collaborators Oct 5, 2017