Scrub reference counting for possible task deletion issues #473
Update: it turns out this issue can actually be reproduced, so it is not just theoretical. A recently added test case creates and immediately deletes a timebase here: osal/src/tests/time-base-api-test/time-base-api-test.c, lines 87 to 90 in 9407cdf.
Due to some other changes, the timing of the task startup changed, and this test case was actually deleting the timebase internal task while that task was holding the mutex, creating a deadlock for the next operation. That specific issue can be avoided by preventing cancellation while the lock is held, but the more general issue remains: if the task was doing a read/write operation, where the table lock is NOT held during the actual blocking operation, deleting it leaves a nonzero refcount that is never released, rendering the object undeletable.
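As a rough illustration of "preventing cancellation while the lock is held", here is a minimal POSIX sketch. The lock, the `worker` task, and `cancel_while_locked_demo()` are hypothetical stand-ins, not the actual OSAL timebase code. The worker disables cancellation before taking the mutex, so a `pthread_cancel()` that arrives mid-section stays pending until the mutex has been released and the previous cancel state is restored.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <time.h>

/* Hypothetical stand-in for an OSAL table lock (not the real OSAL API). */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t in_critical_section;

static void *worker(void *arg)
{
    int oldstate;
    int ignored;
    struct timespec hold_time = {0, 50 * 1000 * 1000}; /* 50 ms */
    (void)arg;

    /* Disable cancellation BEFORE locking, so no cancellation point inside
     * the critical section can end the thread with the mutex still held. */
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate);
    pthread_mutex_lock(&table_lock);
    sem_post(&in_critical_section);   /* tell the deleter we hold the lock */
    nanosleep(&hold_time, NULL);      /* a cancellation point, but disabled here */
    pthread_mutex_unlock(&table_lock);
    pthread_setcancelstate(oldstate, &ignored);

    /* A cancel request that arrived while disabled is acted on here. */
    pthread_testcancel();
    return NULL;
}

/* Cancel the worker while it holds table_lock; return 1 if the lock is
 * free afterwards and the worker really did exit via cancellation. */
int cancel_while_locked_demo(void)
{
    pthread_t tid;
    void *result = NULL;
    int lock_free;

    sem_init(&in_critical_section, 0, 0);
    if (pthread_create(&tid, NULL, worker, NULL) != 0) {
        return 0;
    }
    sem_wait(&in_critical_section);   /* worker now holds table_lock */
    pthread_cancel(tid);              /* "task A deletes task B" */
    pthread_join(tid, &result);

    lock_free = (pthread_mutex_trylock(&table_lock) == 0);
    if (lock_free) {
        pthread_mutex_unlock(&table_lock);
    }
    sem_destroy(&in_critical_section);
    return lock_free && result == PTHREAD_CANCELED;
}
```

Calling `cancel_while_locked_demo()` should return 1: the cancelled worker no longer leaves the mutex locked. This only protects the lock itself; a reference count taken before a blocking call made outside the lock is a separate problem.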
Some specific issues/examples of where there could be problems: If there are two tasks A and B, and task A deletes task B, a dangling/unreleased reference might be left behind if B was:
Note that these cases are specific to POSIX/pthreads, which implements thread cancellation at defined cancellation points. In many cases these issues could be avoided by more aggressively tracking the resource(s) held by each task. So while we could make the POSIX OSAL more robust here, as far as I know the only way to delete tasks on VxWorks (taskDelete) and RTEMS (rtems_task_delete) is equivalent to pthread asynchronous cancellation, which offers no control over exactly where the deletion happens. It is therefore unlikely that VxWorks or RTEMS could ever make arbitrary task deletion safe in OSAL. The only way to make it safe is to code the target task (i.e. task B in the example above) to perform its own orderly shutdown and ensure it is NOT in the middle of any operation when task A deletes it (or to have it exit on its own when done).
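The orderly-shutdown alternative can be sketched as follows, assuming POSIX threads and C11 atomics. `stop_requested`, `object_refcount`, `task_b`, and `orderly_shutdown_demo()` are hypothetical names for illustration only. Task B checks for a stop request only between operations, so it never exits while holding a reference, and task A waits for it with `pthread_join()` instead of deleting it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical shared state, for illustration only (not OSAL API). */
static atomic_bool stop_requested;
static atomic_int object_refcount;
static atomic_int ops_completed;

/* Task B: performs operations until asked to stop, then exits by itself. */
static void *task_b(void *arg)
{
    struct timespec op_time = {0, 1000 * 1000}; /* 1 ms simulated operation */
    (void)arg;

    do {
        atomic_fetch_add(&object_refcount, 1);  /* begin op: take reference */
        nanosleep(&op_time, NULL);              /* stand-in for blocking I/O */
        atomic_fetch_sub(&object_refcount, 1);  /* end op: release reference */
        atomic_fetch_add(&ops_completed, 1);
    } while (!atomic_load(&stop_requested));    /* checked only between ops */

    return NULL;                                /* orderly self-exit */
}

/* Task A: requests a stop and joins task B instead of deleting it.
 * Returns the reference count left behind. */
int orderly_shutdown_demo(void)
{
    pthread_t b;
    struct timespec run_time = {0, 20 * 1000 * 1000}; /* let B run ~20 ms */

    atomic_store(&stop_requested, false);
    atomic_store(&object_refcount, 0);
    atomic_store(&ops_completed, 0);

    if (pthread_create(&b, NULL, task_b, NULL) != 0) {
        return -1;
    }
    nanosleep(&run_time, NULL);
    atomic_store(&stop_requested, true);
    pthread_join(b, NULL);
    return atomic_load(&object_refcount);
}
```

`orderly_shutdown_demo()` should return 0, because task B releases its reference before every check of the stop flag. The trade-off is latency: task B notices the request only after its current operation completes, so a call that can block indefinitely would still need something like a timeout for this to work.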
The best solution here might be to just document that the
Is your feature request related to a problem? Please describe.
The OSAL shared layer employs a reference counting scheme for long-running/blocking operations, such as file read/write and socket operations. This reference count prevents deletion while these operations are still in progress.
However, the task may be deleted while such an operation is in progress, in which case the reference count is never decremented.
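To make the failure mode concrete, here is a minimal sketch assuming POSIX threads. The `object_t` record, `object_delete()`, and `leak_demo()` are hypothetical and only loosely model the idea of an OSAL table entry. A reader task takes a reference, drops the lock, and blocks in `read()` on an empty pipe; `read()` is a cancellation point, so when another task cancels it there, the decrement never runs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical object record; loosely models an OSAL table entry. */
typedef struct {
    pthread_mutex_t lock;
    int refcount;
    int deleted;
} object_t;

static object_t obj = { PTHREAD_MUTEX_INITIALIZER, 0, 0 };
static int pipe_fds[2];

/* Deletion is refused (returns 0) while any operation holds a reference. */
int object_delete(object_t *o)
{
    int ok;
    pthread_mutex_lock(&o->lock);
    ok = (o->refcount == 0);
    if (ok) {
        o->deleted = 1;
    }
    pthread_mutex_unlock(&o->lock);
    return ok;
}

static void *reader_task(void *arg)
{
    object_t *o = arg;
    char c;
    ssize_t n;

    pthread_mutex_lock(&o->lock);
    o->refcount++;                    /* take a reference */
    pthread_mutex_unlock(&o->lock);   /* lock NOT held while blocking */

    n = read(pipe_fds[0], &c, 1);     /* blocks; read() is a cancellation point */
    (void)n;

    pthread_mutex_lock(&o->lock);
    o->refcount--;                    /* skipped if cancelled inside read() */
    pthread_mutex_unlock(&o->lock);
    return NULL;
}

/* Cancel the reader while it is blocked; return the refcount left behind. */
int leak_demo(void)
{
    pthread_t t;
    int held = 0;
    struct timespec poll_interval = {0, 1000 * 1000}; /* 1 ms */

    if (pipe(pipe_fds) != 0 ||
        pthread_create(&t, NULL, reader_task, &obj) != 0) {
        return -1;
    }
    while (held == 0) {               /* wait until the reader holds its reference */
        nanosleep(&poll_interval, NULL);
        pthread_mutex_lock(&obj.lock);
        held = obj.refcount;
        pthread_mutex_unlock(&obj.lock);
    }
    pthread_cancel(t);                /* "task A deletes task B" */
    pthread_join(t, NULL);
    close(pipe_fds[0]);
    close(pipe_fds[1]);
    return obj.refcount;
}
```

`leak_demo()` should return 1, and a subsequent `object_delete(&obj)` returns 0: nothing is left that will ever release the reference, so the object can no longer be deleted.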
Describe the solution you'd like
Whenever possible/relevant, the OSAL should "wrap" the long-running operation in a cancellation cleanup handler, as was done for binary semaphores in #470. For POSIX, this may be needed for anything that invokes a cancellation point:
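A minimal sketch of such a wrapper, assuming POSIX threads with the default deferred cancellation. The names (`object_t`, `release_ref`, `cleanup_demo()`) are hypothetical, not OSAL functions, and this is not the code from #470. The reference is released by a handler registered with `pthread_cleanup_push()`, which runs if the thread is cancelled inside `read()` and, because of `pthread_cleanup_pop(1)`, also on the normal return path.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical object record; loosely models an OSAL table entry. */
typedef struct {
    pthread_mutex_t lock;
    int refcount;
    int deleted;
} object_t;

static object_t obj = { PTHREAD_MUTEX_INITIALIZER, 0, 0 };
static int pipe_fds[2];

int object_delete(object_t *o)
{
    int ok;
    pthread_mutex_lock(&o->lock);
    ok = (o->refcount == 0);          /* refuse while references are held */
    if (ok) {
        o->deleted = 1;
    }
    pthread_mutex_unlock(&o->lock);
    return ok;
}

/* Cleanup handler: drops the reference taken before the blocking call. */
static void release_ref(void *arg)
{
    object_t *o = arg;
    pthread_mutex_lock(&o->lock);
    o->refcount--;
    pthread_mutex_unlock(&o->lock);
}

static void *reader_task(void *arg)
{
    object_t *o = arg;
    char c;
    ssize_t n;

    pthread_mutex_lock(&o->lock);
    o->refcount++;                    /* take a reference */
    pthread_mutex_unlock(&o->lock);

    pthread_cleanup_push(release_ref, o);
    n = read(pipe_fds[0], &c, 1);     /* cancellation point */
    (void)n;
    pthread_cleanup_pop(1);           /* 1: run release_ref on the normal path too */
    return NULL;
}

/* Cancel the reader while it is blocked; return the refcount left behind. */
int cleanup_demo(void)
{
    pthread_t t;
    int held = 0;
    struct timespec poll_interval = {0, 1000 * 1000}; /* 1 ms */

    if (pipe(pipe_fds) != 0 ||
        pthread_create(&t, NULL, reader_task, &obj) != 0) {
        return -1;
    }
    while (held == 0) {               /* wait until the reader holds its reference */
        nanosleep(&poll_interval, NULL);
        pthread_mutex_lock(&obj.lock);
        held = obj.refcount;
        pthread_mutex_unlock(&obj.lock);
    }
    pthread_cancel(t);                /* "task A deletes task B" */
    pthread_join(t, NULL);
    close(pipe_fds[0]);
    close(pipe_fds[1]);
    return obj.refcount;
}
```

Here `cleanup_demo()` cancels the reader while it is blocked and should return 0, after which `object_delete(&obj)` succeeds. Note that `pthread_cleanup_push()` and `pthread_cleanup_pop()` may be implemented as macros that open and close a block, so they must appear as a pair in the same lexical scope.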
Describe alternatives you've considered
Leave as-is and accept a risk that there may be dangling references when tasks are deleted.
Additional context
There is no way for the OSAL to know about and handle inter-relationships between resources that the application may impose (e.g. using a mutex to control access to a shared memory area, or a reference count of its own), and therefore this cleanup/recovery can never be bulletproof.
While OSAL could potentially do better at handling its own reference counts in the context of task deletion, there will still be remaining risks of unreleased resources after tasks are deleted, for things it cannot track.
Requester Info
Joseph Hickey, Vantage Systems, Inc.