joold 4.0.9 and 4.1.2 reliably segfaults #340
Comments
Sigh
Hypothesis: Open
.hdrsize = sizeof(struct joolnlhdr),
Change that into
.hdrsize = NLA_ALIGN(sizeof(struct joolnlhdr)),
Recompile, reinstall, retry. |
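For readers unfamiliar with NLA_ALIGN, here is a minimal sketch of what the suggested change does to a header size. The macro below restates the 4-byte rounding rule from <linux/netlink.h> under a different name so it builds without kernel headers. `struct demo_hdr` and `demo_hdr_padding()` are hypothetical and are not Jool's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Same rounding rule as NLA_ALIGN in <linux/netlink.h>: round up to the
 * next multiple of 4. Renamed so the sketch does not need kernel headers. */
#define DEMO_NLA_ALIGNTO ((size_t)4)
#define DEMO_NLA_ALIGN(len) \
	(((len) + DEMO_NLA_ALIGNTO - 1) & ~(DEMO_NLA_ALIGNTO - 1))

/* Hypothetical header made only of bytes, so its size need not be a
 * multiple of 4. This is NOT the real struct joolnlhdr. */
struct demo_hdr {
	unsigned char magic[4];
	unsigned char type;
	unsigned char flags;
};

/* Number of padding bytes needed after a header of `len` bytes so that
 * whatever follows starts on a 4-byte boundary. */
size_t demo_hdr_padding(size_t len)
{
	size_t aligned = DEMO_NLA_ALIGN(len);

	assert(aligned >= len); /* can only fail if len is near SIZE_MAX */
	return aligned - len;
}
```

If a header's size is already a multiple of 4, the padding is zero and the change is a no-op. That would be consistent with the patch not curing the segfault, although the thread does not state the real header's size.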
BTW: Did you really mean this? From "ss-flush-deadline": 2, This value is measured in milliseconds. I suspect you might have meant "2000." |
Also: Can you please confirm that the output of |
Patched, recompiled, reinstalled and retried... joold still segfaults. Output from
Output from
Output from
I have also changed the value of
If you are unable to reproduce locally, I am quite happy to build another identical virtual machine and provide remote access to it via SSH with full superuser privileges (be sure to put your public key in an update to this ticket). If not, I will continue to test whatever further patches/suggestions you are willing to throw at me. |
I think this is a great idea. The segfault is going to be easy to find and patch, but the communication problem debugging is bound to be trial-and-error-y. |
Is this still a problem? |
Hi there! I'm running Jool 4.1.4 on a couple of Debian Buster machines, and I'm running into this |
I've done some digging to try and get some more information on where and why the
My test environment is a pair of Debian Buster routers, both running a 4.19-series kernel, with Jool compiled against
I've compiled two different copies of
My test methodology is as follows:
What I observed is that the ICMP state entry for the ping process was successfully replicated from the first router (where the packets leave my network) to the second router (where the packets enter my network). However, if I stopped and restarted the ping process, no new state entries were synchronised. In some cases, a few seconds after starting the first ping, the process reading from the kernel and sending to the network on the first router would crash with a segfault; in other cases, this process kept running without doing anything apart from printing log messages.
I ran the kernel-to-network
I also managed to do a
I've dug around in the code in
However, callbacks which are associated with a
In the first case, what I think has happened is that a netlink message has been received which should be forwarded to the network, however the callback handler from a previous call to
In the second case, it appears that when the
I think the workaround for this is to reset the
However
While this seems to be the source of the
even with the above patch applied to |
Hey. Sorry for taking so long to respond. And thank you for your hard work. Indeed, I seem to have made a callback mess in the joold code during the last refactor. Here's my take on it: There is absolutely no reason to ACK the ACK. Therefore, there is no reason to juggle the callbacks. Patch uploaded. While I am still unable to replicate the segfault (that undefined behavior must be going funny places), the session replication quirk was easy enough to pull off (Ugh. Need to improve those unit tests), and my patch seems to have also amended it. I'm also not getting your syslog errors. Please confirm whether everything works well on your end. The patch can be found in the issue340 branch. |
The patch in the
I'm still seeing intermittent error messages in syslog, though I've narrowed them down to the kernel-to-network path. I haven't looked that deeply, but it appears that occasionally the kernel is sending netlink messages which |
What's the syslog error message? |
The error message is the same as before, i.e. |
Well, if it's sent by the kernel... The only place where the module writes flag
Try adding
WARN(1, "Printing stack trace:");
In line 123. (You'll have to reinstall the module,
Let's see what it prints. |
Patched, recompiled, reinstalled. The stack trace printing logic works, confirmed with running
Running (the kernel-to-network parts of)
There is no stack trace in dmesg when these messages are printed! I ran the
As the I then worked out how to set a breakpoint in
That last line looks suspiciously like an SSH handshake. Is it possible that Jool is leaking packet contents from the network into userland through the netlink socket with |
I've run |
I've instrumented
So it appears that something either isn't setting or isn't reading the |
This is actually fixing two bugs:
1. The kernel module was not initializing the Jool header on joold packets. Ever. At all.
2. joold wasn't validating the Jool header.
The two bugs were working in beautiful concert, cancelling each other in the unit tests. FML
Further progress on #340.
Well, this is embarrassing. New patch. Let's hope it works properly this time. |
I've rebuilt with that new patch, and I can confirm that this appears to solve the issue! I'm not seeing any further netlink error messages in syslog. I've left some comments on the commit, but they're mostly just small cosmetic things. I would note that as this introduces changes to |
Right. Another solution would be to move the
(Either way, once Jool 4.1.5 is released, it will
For what it's worth, you don't need to remove the instances manually. You can just do |
Ah, fair enough. In that case, given that the version checks are quite strict, I don't think worrying about the struct layout is that important.
I'll bear that in mind in the future! |
Small quirk, though: If you have iptables rules referencing instances, those rules will prevent the
Update: Hmmm... I should include this in the documentation. |
Ah, that would make sense, I'm using the |
I have a quick follow-up question about 58bf14e, actually: as far as I can tell, the only place in the kernel module where a
Please credit me as Molly Miller, and link to either my GitHub profile or to my personal website. |
It's a dumb reason: The code is super lazy. Because the request header has been validated, and can simply be the same as the response header, Jool simply copies it from the request to the response. |
Ah, that works :-). |
The magic string was introduced to the netlink header struct in 58bf14e as part of the fix for NICMx#340, initially as a hard-coded byte sequence. This commit moves the magic string and its length into a preprocessor definition, and reads and writes this field using memcmp() and memmove() -- if the string ever needs to be changed in the future, then the change will be automatically picked up by all the code which reads or writes this header field.
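The pattern this commit message describes can be sketched roughly as follows. All names here (`DEMO_MAGIC`, `struct demo_magic_hdr`, `demo_hdr_init()`, `demo_hdr_is_valid()`) and the magic value itself are hypothetical stand-ins for illustration. Jool's real header layout and identifiers may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical magic value; defined once so readers and writers agree. */
#define DEMO_MAGIC "demo"
#define DEMO_MAGIC_LEN (sizeof(DEMO_MAGIC) - 1) /* exclude the trailing NUL */

/* Hypothetical header layout, not the real struct joolnlhdr. */
struct demo_magic_hdr {
	unsigned char magic[DEMO_MAGIC_LEN];
	unsigned char version;
	unsigned char type;
};

/* Writer side: stamp the magic into every outgoing header. */
void demo_hdr_init(struct demo_magic_hdr *hdr, unsigned char version,
		unsigned char type)
{
	assert(hdr != NULL);
	memset(hdr, 0, sizeof(*hdr));
	memmove(hdr->magic, DEMO_MAGIC, DEMO_MAGIC_LEN);
	hdr->version = version;
	hdr->type = type;
}

/* Reader side: reject buffers that are too short or lack the magic. */
bool demo_hdr_is_valid(const void *buf, size_t len)
{
	const struct demo_magic_hdr *hdr = buf;

	if (buf == NULL || len < sizeof(*hdr))
		return false;
	return memcmp(hdr->magic, DEMO_MAGIC, DEMO_MAGIC_LEN) == 0;
}
```

With this shape, a header that was never initialized (for example, all zeroes) fails `demo_hdr_is_valid()`, and changing the magic only requires editing the `DEMO_MAGIC` definition.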
I am currently in the process of migrating my CentOS 7-based Jool 3.5.7 configuration to Jool 4.
My kernel release is 3.10.0-1127.el7.x86_64.
I am configuring Jool using atomic configuration; my configuration is attached to this report as jool.conf
The instance is running:
... and Jool is successfully performing NAT64 as I would expect.
As I have two hosts configured with session synchronization, I am configuring joold as per the two .json files attached to this report as netsocket.json and modsocket.json. (Also, the documentation states that joold requires only one parameter, but joold immediately terminates if a second parameter is not supplied along with the first.)
I have only built one of the new Jool 4 hosts on an isolated network so there is no attempt to synchronize sessions with Jool 3.5.8 and more importantly, there is nothing else to talk to on the dedicated interface (which is 'eth2' in my config and is configured with only an IPv6 link-local address).
The problem is as follows:
Output from dmesg:
Logs from /var/log/messages:
I have also managed to reproduce this problem on a CentOS 8.2 host running Jool 4.1.2 (as 4.0.9 is unsupported on RHEL8) with kernel 4.18.0-193.6.3.el8_2.x86_64 so I don't think this is a bug specific to the Linux kernel.
I am certainly willing to consider that my configuration may be incorrect but a segmentation fault is most certainly an issue which needs fixing.