Memory corruption issue using C++ binding #624
Comments
@rbdm-qnt This issue should only arise when the Please be aware that:
Could you provide us with a code snippet of the subscriber's receive path that covers the whole lifetime of a received sample (from receive until it goes out of scope) and give me a bit more context, so that I can try to reproduce it locally? Also, could you please attach the logfile to the issue as a file, not as copy and paste (GitHub seems to cut out parts otherwise). If it's too large, maybe the last 100 lines.
This is the method we use to receive the data in the Sub. The sub is single-threaded and has only one instance of the iceoryx2 Bus. The callback reads the data and appends it to a file to save it to disk. The sub never gets destroyed; the application runs 24/7 with no interruption, and the process or thread is never restarted. The processing of the sample is completely synchronous, with no other threads active.
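For context, a minimal single-threaded receive path modelled on the iceoryx2 C++ publish-subscribe example could look like the sketch below. The `Payload` struct, the service name, and the `append_to_file` helper are placeholders, and the builder, `receive()`, and `payload()` calls are assumed to match the v0.5 C++ binding:

```cpp
#include "iox2/node.hpp"

#include <chrono>
#include <fstream>
#include <thread>

// Hypothetical 256-byte payload type used by the application.
struct Payload {
    char data[256];
};

// Placeholder for the callback that appends each payload to a file on disk.
void append_to_file(const Payload& payload) {
    static std::ofstream out("payloads.bin", std::ios::binary | std::ios::app);
    out.write(payload.data, sizeof(payload.data));
}

int main() {
    using namespace iox2;

    auto node = NodeBuilder().create<ServiceType::Ipc>().expect("node creation");
    auto service = node.service_builder(ServiceName::create("my/service").expect("valid name"))
                       .publish_subscribe<Payload>()
                       .open_or_create()
                       .expect("service creation");
    auto subscriber = service.subscriber_builder().create().expect("subscriber creation");

    while (true) {
        // Drain everything that is currently available; the loaned sample is
        // released as soon as `sample` is overwritten or goes out of scope.
        auto sample = subscriber.receive().expect("receive succeeds");
        while (sample.has_value()) {
            append_to_file(sample->payload());
            sample = subscriber.receive().expect("receive succeeds");
        }
        // A production loop would block on the node/waitset instead; a plain
        // sleep keeps this sketch version-agnostic.
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}
```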
@rbdm-qnt Just to be clear: you do not experience (visible) communication issues, it's just the log that indicates that there might be issues, right?
Not that I'm aware of. From the error message it doesn't seem like a sample is lost (although this is almost impossible to verify on our side), just that one of the preallocated chunks got corrupted, so I don't know if this will cause a crash down the line after the application runs for many days. I've noticed that this issue happens about twice a day, each time writing the 2 lines (that I posted at the start of this issue), which weigh about 50MB in the log files: it basically prints my entire preallocated memory chunk as all zeros. This seems to happen at a similar frequency to that of the crashes we used to get with Iceoryx 1. The error messages were similar, I can't find one now, but it was something along the lines of POPO_CHUNK_INVALID_CHUNK, and it was fatal, so it's a big upgrade either way.

Would it improve things if I changed my design to something like this, to immediately free the sample before calling my callback? This would minimise the time we "hold on" to the sample as much as possible, and make it constant. We'd rather not copy those 256 bytes one extra time, but if it guarantees stability we can temporarily keep it this way.
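Under the same assumptions about the C++ binding as in the earlier sketch, a copy-then-release variant of the loop could look roughly like this; releasing the result (assumed to be an `iox::optional`) via `reset()` before the callback runs is the whole point of the change:

```cpp
// Sketch: copy the 256 bytes out of shared memory and hand the chunk back to
// iceoryx2 before any further processing happens.
auto sample = subscriber.receive().expect("receive succeeds");
while (sample.has_value()) {
    Payload copy = sample->payload(); // copy out of shared memory (assumes payload() returns a const reference)
    sample.reset();                   // release the loaned sample immediately
    append_to_file(copy);             // work only on the local copy
    sample = subscriber.receive().expect("receive succeeds");
}
```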
@rbdm-qnt I think we are one step closer to a possible solution. I could reproduce your bug, but only in three misuse scenarios.
Could you take a look at the publisher side to see if there is any place where you might accidentally access the publisher from multiple threads? Or maybe you can share a piece of code here. Also, to point it out: in classic iceoryx and iceoryx2 you are not allowed to loan a sample in one thread and send or access it in another thread.
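To illustrate the restriction, a sketch (not an excerpt from the project's code): `loan_uninit`, `write_payload`, and the free `send()` function are taken from the iceoryx2 C++ examples, while `Payload` and `service` are the same placeholders as in the subscriber sketch above.

```cpp
#include <thread>

// NOT allowed: loaning a sample in one thread and writing/sending it in another.
// auto sample = publisher.loan_uninit().expect("loan");
// std::thread([s = std::move(sample)]() mutable { /* write + send here */ }).detach();

// Allowed: keep the whole loan -> write -> send cycle inside a single thread,
// e.g. by giving every worker thread its own publisher created from the shared service.
std::thread worker([&service] {
    auto publisher = service.publisher_builder().create().expect("publisher creation");
    auto sample = publisher.loan_uninit().expect("loan");
    auto initialized = sample.write_payload(Payload{});
    send(std::move(initialized)).expect("send successful");
});
worker.join();
```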
Ok, then we'll do an in-depth code review of how we use our publishers, do some tests, and report back in a week or so. So you confirm that the change I proposed to my "receive" method in my previous comment is unnecessary, right?
@rbdm-qnt You are already using the subscriber correctly, therefore there is no need for this change.
If we can support you, please let us know! I suspect there is a concurrency issue with the publisher usage, since you seem to have the same issue with classic iceoryx, which has the same restriction.
@rbdm-qnt You mentioned that you have the issue twice a day, each time with two log entries. Does it happen more or less periodically? I just want to rule out that there is an overflow or something similar somewhere; e.g. for twice a day, a 32-bit integer would overflow if the publishing rate were around 100 kHz.
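For reference, the back-of-the-envelope arithmetic behind that estimate, assuming a free-running 32-bit counter incremented once per published message:

$$\frac{2^{32}}{10^{5}\ \text{Hz}} \approx 42\,950\ \text{s} \approx 11.9\ \text{h},$$

i.e. at roughly 100 kHz a 32-bit counter wraps about twice a day.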
Thank you for your availability! We'll keep you posted after we review everything. It doesn't happen exactly periodically: I've seen it happen once every 20-28 hours, and I've seen it happen every 4-10 hours. But the field of application is finance, so the message rate has a ton of variance, and I wouldn't rule this theory out completely. For reference, we are dealing with 2-5 billion messages per day, on average.
Update: it looks like we had an edge case where we did use a publisher in 2 threads; we fixed it, and the issue seems to have disappeared. Thanks! On the flip side, this happened today on one of the servers:
So, the message about "Config::global_config()" always appears on startup; I haven't figured out how to make it load a config, but I think the default is fine, as we put our desired settings as compile flags. The problem is that the program crashed due to the second part of the message, any attempt to restart the process failed, and it was only solved by a reboot of the server. I have no idea what it could be related to; nothing out of the ordinary happened, and the RAM, CPU and disk space were all fine.
Pinging @elfenpiff @elBoberido
Getting "Bus error." again every few days. The program crashes with this error, every restart fails with the same error, and it can only be solved by a system reboot. @elBoberido @elfenpiff |
We fixed this in main and in the upcoming v0.6 release. Up to v0.5, iceoryx2 expects the config file to be found under a fixed path; with v0.6 it looks it up under the locations listed in the documentation linked below.
See the documentation in: https://github.com/eclipse-iceoryx/iceoryx2/tree/main/config
The error looks familiar. I think in classic iceoryx we had this problem when the user tried to acquire more memory than the system provided. Based on this, we could find the exact location and provide you with a more helpful error message. Also take a look at this: https://github.com/eclipse-iceoryx/iceoryx2/blob/main/FAQ.md#run-out-of-memory-when-creating-publisher-with-a-large-service-payload
Required information
Linux 6.5.0-1018-aws #18~22.04.1-Ubuntu SMP Fri Apr 5 17:44:33 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
rustc 1.75.0 (82e1608df 2023-12-21) (built from a source tarball)
cargo 1.75.0
iceoryx2 version:
main branch, ICEORYX2_VERSION_STRING="0.5.0", commit hash: 5b45d39
Detailed log output:
Attached below
Observed result or behaviour:
This happened while using Pub/Sub mode with the C++ bindings. It happened in the publisher application. We have multiple pub instances in the same application, in different threads, publishing data on the same bus. Logging level is TRACE. We have had no crashes so far, and about 400MB worth of this error in the log files per day. We publish 500GB+ of data per day via the bus. No errors and no logs for the subscriber.