
runtime/netpoll: heavy epoll_wait activity on AF_LOCAL sockets we're not really waiting for #17249

Closed
tpotega opened this issue Sep 27, 2016 · 14 comments
Labels
FrozenDueToAge; NeedsDecision (feedback is required from experts, contributors, and/or the community before a change can be made)

Comments

@tpotega

tpotega commented Sep 27, 2016

What version of Go are you using (go version)?

go version go1.7.1 linux/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GORACE=""
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build017533435=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"

Description

Golang's epoll-based poller registers every socket it manages (including Unix domain sockets) with EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, even sockets that have not yet been used and therefore have not yet had a chance to return EAGAIN on read/write. This differs a bit from the usage scenario suggested by the epoll man page.

This can result in very high netpoll activity - thousands of epoll_wait() calls constantly returning a "ready-to-write" AF_LOCAL socket (a syslog-ng socket in my example).
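For reference, the registration that later shows up in the strace output is equivalent to roughly the following - a minimal sketch using golang.org/x/sys/unix rather than the runtime's internal netpoll code:

package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// Create an epoll instance, as the runtime does when the netpoller is
	// first used.
	epfd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Close(epfd)

	// Open a non-blocking AF_LOCAL datagram socket, standing in for the
	// net.Dial("unixgram", ...) connection in the repro program below.
	fd, err := unix.Socket(unix.AF_UNIX, unix.SOCK_DGRAM|unix.SOCK_NONBLOCK|unix.SOCK_CLOEXEC, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Close(fd)

	// Register the fd the way the poller does: read and write readiness,
	// edge-triggered, even though nothing has been written to (or read
	// from) the socket yet.
	ev := unix.EpollEvent{
		Events: unix.EPOLLIN | unix.EPOLLOUT | unix.EPOLLRDHUP | unix.EPOLLET,
		Fd:     int32(fd),
	}
	if err := unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, fd, &ev); err != nil {
		log.Fatal(err)
	}
}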

How to reproduce

Connect to a busy syslog-ng unixgram socket, then make netpoll wait on some other network operation - a GET on localhost:12345 in this example:

package main

import (
	"fmt"
	"net"
	"net/http"
)

func main() {
	// Dial the busy syslog-ng datagram socket. We never read or write it,
	// but the fd is still registered with the netpoller.
	a, _ := net.Dial("unixgram", "/var/run/syslog")
	_ = a

	// Block in the netpoller waiting for the HTTP response.
	b, _ := http.Get("http://localhost:12345/")
	fmt.Println(b)
}

Redacted strace output

fd 3 - /var/run/syslog, the syslog-ng unix datagram socket opened by net.Dial("unixgram", "/var/run/syslog"):

[pid 27747] socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0 <unfinished ...>
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 27747] <... socket resumed> )      = 3
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27747] setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4 <unfinished ...>
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 27747] <... setsockopt resumed> )  = 0
[pid 27747] connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/syslog"}, 16 <unfinished ...>
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 27747] <... connect resumed> )     = 0
[pid 27747] epoll_create1(EPOLL_CLOEXEC <unfinished ...>
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27747] <... epoll_create1 resumed> ) = 4
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>

fd 3 gets added to the epoll instance at fd 4, with event data set to {u32=4082519848, u64=140161750290216}:

[pid 27747] epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=4082519848, u64=140161750290216}}) = 0
[pid 27747] getsockname(3, {sa_family=AF_LOCAL, NULL}, [2]) = 0
[pid 27747] getpeername(3, {sa_family=AF_LOCAL, sun_path="/var/run/syslog"}, [20]) = 0
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, 0) = 1
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 27818] select(0, NULL, NULL, NULL, {0, 3} <unfinished ...>

First sign of "available for write()" for fd 3:

[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27818] <... select resumed> )      = 0 (Timeout)
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>

epoll_wait() calls continue to return EPOLLOUT for fd 3:

[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1

fd 5 will be used for the GET request:

[pid 27747] socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP <unfinished ...>
[pid 27817] <... select resumed> )      = 0 (Timeout)
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27813] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 27747] <... socket resumed> )      = 5
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 27747] setsockopt(5, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
[pid 27747] connect(5, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("127.0.0.1")}, 16 <unfinished ...>
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27813] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 27747] <... connect resumed> )     = -1 EINPROGRESS (Operation now in progress)

fd 5 is added to the epoll instance with {u32=4082519656, u64=140161750290024}:

[pid 27747] epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=4082519656, u64=140161750290024}} <unfinished ...>
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519656, u64=140161750290024}}}, 128, -1) = 1
[pid 27747] <... epoll_ctl resumed> )   = 0
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27747] getsockopt(5, SOL_SOCKET, SO_ERROR,  <unfinished ...>
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27747] <... getsockopt resumed> [0], [4]) = 0
[pid 27747] getsockname(5, {sa_family=AF_INET, sin_port=htons(24603), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
[pid 27817] <... select resumed> )      = 0 (Timeout)
[pid 27747] getpeername(5, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
[pid 27747] setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27747] setsockopt(5, SOL_SOCKET, SO_KEEPALIVE, [1], 4 <unfinished ...>
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27747] <... setsockopt resumed> )  = 0
[pid 27747] setsockopt(5, SOL_TCP, TCP_KEEPINTVL, [30], 4) = 0
[pid 27747] setsockopt(5, SOL_TCP, TCP_KEEPIDLE, [30], 4) = 0
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4,  <unfinished ...>

The GET request itself:

[pid 27747] write(5, "GET / HTTP/1.1\r\nHost: localhost:"..., 96 <unfinished ...>
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27817] mprotect(0x7f79f2d5f000, 4096, PROT_NONE <unfinished ...>
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27747] <... write resumed> )       = 96
Process 27846 attached
[pid 27813] select(0, NULL, NULL, NULL, {0, 320} <unfinished ...>
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27813] select(0, NULL, NULL, NULL, {0, 640} <unfinished ...>
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27817] read(5,  <unfinished ...>
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27817] <... read resumed> 0xc4200fd000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27813] <... select resumed> )      = 0 (Timeout)
[pid 27816] epoll_wait(4,  <unfinished ...>
[pid 27816] <... epoll_wait resumed> {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1

Waiting for the GET reply, we get tens of thousands of epoll_wait() EPOLLOUT events for fd 3:

[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
...
[pid 27816] epoll_wait(4, {{EPOLLOUT, {u32=4082519848, u64=140161750290216}}}, 128, -1) = 1
...

Expected behavior

It seems the poller should not add the fds we're not really waiting for with the EPOLLIN|EPOLLOUT flags.

@bradfitz
Contributor

/cc @dvyukov @ianlancetaylor

@dvyukov
Member

dvyukov commented Sep 27, 2016

It seems the poller should not add the fds we're not really waiting for with the EPOLLIN|EPOLLOUT flags.

This should be fine per se, because we use EPOLLET.
What is the kernel version?

@tpotega
Author

tpotega commented Sep 27, 2016

CentOS 7 - 3.10.0-327.36.1.el7.x86_64

EPOLLET does not guarantee that you will get exactly one event in this case; see http://cmeerw.org/blog/753.html#753 (or https://github.com/bjaspan/epoll-test for a slightly different case).

@dvyukov
Member

dvyukov commented Sep 27, 2016

EPOLLET does not guarantee that you will get exactly one event in this case; see http://cmeerw.org/blog/753.html#753 (or https://github.com/bjaspan/epoll-test for a slightly different case).

Netpoller does not rely on the fact that there will be only one notification. Any finite number of notifications should be OK.

@tpotega
Author

tpotega commented Sep 27, 2016

Yes, functionally everything should be correct, but the thousands of syscalls we are seeing every second could be avoided. That's more of a performance issue, for sure.

@dvyukov
Member

dvyukov commented Sep 27, 2016

@tpotega what exactly did you want to point out with these links?

You said that exactly one event is not guaranteed, so I thought you were pointing to:

For UDP sockets you get an EPOLLOUT event when the send operation completes, whereas with UNIX domain sockets you get an EPOLLOUT event when the data has been read by the peer. And finally, with UNIX pipes you get an event when the send operation completes and another event when the data is read by the peer.

Hence my reply: getting one or two notifications is irrelevant for the netpoller.

Is there a description as to why/when zillions of notifications happen?

@tpotega
Author

tpotega commented Sep 29, 2016

The links were meant just as an example of (possibly) unexpected EPOLLET behavior.

As to why/when the notifications happen - that's a longer story.

TL;DR: zillions of notifications don't appear on 3.10.0-327.28.2.el7.x86_64 and up.
(and yes, the machines affected were running 3.10.0-327.18.2.el7.x86_64, not -36.1)

As for the 3.10.0-327.18.2.el7 code path: unix_dgram_recvmsg() (which handles AF_LOCAL datagram input, as the name suggests) ends its processing by freeing the datagram (skb_free_datagram()). This passes control to the appropriate destructor - unix_destruct_scm() in the case of AF_LOCAL sockets - which in turn calls the generic sock_wfree():

/*
 * Write buffer destructor automatically called from kfree_skb.
 */
void sock_wfree(struct sk_buff *skb)
{
        struct sock *sk = skb->sk;
        unsigned int len = skb->truesize;

        if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) {
                /*
                 * Keep a reference on sk_wmem_alloc, this will be released
                 * after sk_write_space() call
                 */
                atomic_sub(len - 1, &sk->sk_wmem_alloc);
                sk->sk_write_space(sk);
                len = 1;
        }
        /*
         * if sk_wmem_alloc reaches 0, we must finish what sk_free()
         * could not do because of in-flight packets
         */
        if (atomic_sub_and_test(len, &sk->sk_wmem_alloc))
                __sk_free(sk);
}

sk_write_space() for AF_LOCAL sockets is set to unix_write_space():

static inline int unix_writable(struct sock *sk)
{
        return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
}

static void unix_write_space(struct sock *sk)
{
        struct socket_wq *wq;

        rcu_read_lock();
        if (unix_writable(sk)) {
                wq = rcu_dereference(sk->sk_wq);
                if (wq_has_sleeper(wq))
                        wake_up_interruptible_sync_poll(&wq->wait,
                                POLLOUT | POLLWRNORM | POLLWRBAND);
                sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
        }
        rcu_read_unlock();
}

A notification will be generated whenever there is buffer space available - per unix_writable() above, that means sk_wmem_alloc accounts for no more than a quarter of sk_sndbuf - and it is a new event every time, even with EPOLLET. With many packets being received on that socket, expect even more than zillions of notifications.

See also: http://vger.kernel.org/~davem/skb_sk.html

But now, the bomb: some recent AF_LOCAL changes backported to 3.10.0-327.28.2.el7 (and up) hide the problem.

@quentinmit
Contributor

If this is fixed in newer kernels (where "newer" means after 2013), I don't think there's much we can do in Go.

Though I guess 2013 is still rather recent.

/cc @rsc @aclements Should we work around this kernel bug?

@quentinmit added the NeedsDecision label Oct 4, 2016
@quentinmit added this to the Go1.8Maybe milestone Oct 4, 2016
@ianlancetaylor
Contributor

I don't see any way to fix this without noticeably hurting performance for the normal case, because we would have to shuffle descriptors in and out of the epoll set. Since it apparently is not a problem in newer kernels, I think we should do nothing.
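To make the cost concrete: "shuffling descriptors" would amount to arming EPOLLOUT only while a write is actually blocked, roughly as in the sketch below. This is a sketch only, using golang.org/x/sys/unix rather than the runtime's internals, and it assumes the fd is already registered for EPOLLIN in epfd:

package sketch

import "golang.org/x/sys/unix"

// waitWritable arms write readiness on demand and disarms it afterwards.
// This is what an "on demand" poller would have to do per blocked write;
// the real runtime keeps EPOLLIN|EPOLLOUT armed permanently instead.
func waitWritable(epfd, fd int) error {
	// Arm EPOLLOUT only now that a write has actually returned EAGAIN.
	ev := unix.EpollEvent{
		Events: unix.EPOLLIN | unix.EPOLLOUT | unix.EPOLLRDHUP | unix.EPOLLET,
		Fd:     int32(fd),
	}
	if err := unix.EpollCtl(epfd, unix.EPOLL_CTL_MOD, fd, &ev); err != nil {
		return err
	}

	// Wait for the readiness notification (a real poller would dispatch
	// on the event data here).
	events := make([]unix.EpollEvent, 1)
	if _, err := unix.EpollWait(epfd, events, -1); err != nil {
		return err
	}

	// Disarm EPOLLOUT again so an always-writable but otherwise idle
	// socket (like the syslog one above) stops generating events.
	ev.Events = unix.EPOLLIN | unix.EPOLLRDHUP | unix.EPOLLET
	return unix.EpollCtl(epfd, unix.EPOLL_CTL_MOD, fd, &ev)
}

That is two additional epoll_ctl() syscalls for every write that blocks, which is the performance cost being weighed against the kernel-side fix.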

@aclements
Member

I agree with @ianlancetaylor. I don't see how to handle this without having to constantly change the state of the epoll descriptor.

However, I'm a bit confused about when this was actually fixed upstream. @tpotega, you mentioned these changes were backported to 3.10.0. Do you know what upstream change they were backported from? Based on the version numbers you gave, I assume this is a RedHat kernel, but I can't figure out how to get any useful log of their backporting activity.

@tpotega
Author

tpotega commented Oct 5, 2016

Yep, those were RHEL/CentOS kernel versions.

This was a side effect of the upstream fix for CVE-2013-7446 (torvalds/linux@7d26727).

@dvyukov
Member

dvyukov commented Oct 5, 2016

Amusingly, it was me who found and reported the kernel bug...
https://groups.google.com/forum/#!topic/syzkaller/3twDUI4Cpm8

@tpotega
Author

tpotega commented Oct 5, 2016

😄

@quentinmit
Contributor

@aclements and I were able to reproduce this on a machine with http://kernel.ubuntu.com/git/ubuntu/ubuntu-trusty.git/tag/?h=Ubuntu-lts-4.2.0-22.27_14.04.1

We had to change the example to use /dev/log instead of /var/run/syslog, since our system has its syslog socket at the former location. With that change, we were able to observe poll returning every time another process on the system called read() or write() on a socket connected to /dev/log. Effectively, the poll loop is waking up twice every time another process writes to syslog.

This obviously isn't ideal, but it's also a small and bounded amount of work. We think that it is not worth putting a workaround into Go for this. If the slight extra CPU is causing problems for your application, we recommend upgrading to a newer kernel.
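For anyone retracing this, a second program that writes datagrams to the same socket in a loop will drive those wakeups on demand rather than waiting for other processes to log - a sketch, assuming the system's syslog socket is the /dev/log mentioned above:

package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Plays the role of "another process writing to syslog": on affected
	// kernels, each datagram written here (and later read by the syslog
	// daemon) wakes up the netpoller in the repro program above.
	c, err := net.Dial("unixgram", "/dev/log")
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	msg := []byte("<14>netpoll-repro: ping") // hypothetical test message
	for {
		if _, err := c.Write(msg); err != nil {
			log.Fatal(err)
		}
		time.Sleep(time.Millisecond)
	}
}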

@golang locked and limited conversation to collaborators Oct 5, 2017