Scrub reference counting for possible task deletion issues #473

jphickey · 2020-05-19T15:54:33Z

Is your feature request related to a problem? Please describe.
The OSAL shared layer employs a reference counting scheme for long-running/blocking operations, such as file read/write and socket operations. This reference count prevents deletion of the object while these operations are still in progress.

However, the possibility exists that the task performing the operation is deleted while the operation is in progress, which means the reference count may never get decremented.

Describe the solution you'd like
Whenever possible/relevant, the OSAL should "wrap" the long running operation in a cancellation cleanup handler as was done for binary sems in #470. For POSIX, this may be needed for anything that invokes a cancellation point:

read/write/open/close (files)
send/recv/connect/accept (sockets)
select
mq_receive/mq_send (queues)
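
As a rough illustration of the wrapping described above, the following is a minimal pthreads sketch, not OSAL code. The names `demo_record_t`, `demo_release_ref`, `demo_reader_task` and `demo_cancel_during_read` are hypothetical, and the record type only stands in for an OSAL object table entry. It takes a reference, blocks in read() on a pipe that is never written, and relies on a cleanup handler to drop the reference when the thread is cancelled.

```c
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

/* Hypothetical stand-in for an OSAL object record; not the real OSAL type */
typedef struct
{
    pthread_mutex_t lock;
    int             refcount;
} demo_record_t;

static demo_record_t demo_record = {PTHREAD_MUTEX_INITIALIZER, 0};
static int           demo_pipe[2];

static int demo_get_refcount(demo_record_t *rec)
{
    int value;

    pthread_mutex_lock(&rec->lock);
    value = rec->refcount;
    pthread_mutex_unlock(&rec->lock);
    return value;
}

/* Cleanup handler: runs if the thread is cancelled inside the blocking call */
static void demo_release_ref(void *arg)
{
    demo_record_t *rec = arg;

    pthread_mutex_lock(&rec->lock);
    --rec->refcount;
    pthread_mutex_unlock(&rec->lock);
}

static void *demo_reader_task(void *arg)
{
    demo_record_t *rec = arg;
    char           buf[16];

    /* Take the reference; the lock is NOT held during the blocking call */
    pthread_mutex_lock(&rec->lock);
    ++rec->refcount;
    pthread_mutex_unlock(&rec->lock);

    pthread_cleanup_push(demo_release_ref, rec);
    (void)read(demo_pipe[0], buf, sizeof(buf)); /* cancellation point; blocks forever here */
    pthread_cleanup_pop(1);                     /* normal return path also releases the reference */

    return NULL;
}

/* Cancels the reader while it is blocked and returns the resulting refcount */
int demo_cancel_during_read(void)
{
    pthread_t tid;
    int       result;

    if (pipe(demo_pipe) != 0)
    {
        return -1;
    }
    if (pthread_create(&tid, NULL, demo_reader_task, &demo_record) != 0)
    {
        return -1;
    }

    /* Wait until the reader has taken its reference */
    while (demo_get_refcount(&demo_record) == 0)
    {
        sched_yield();
    }

    pthread_cancel(tid); /* deferred cancellation: acted on at read() */
    pthread_join(tid, NULL);

    result = demo_get_refcount(&demo_record);
    close(demo_pipe[0]);
    close(demo_pipe[1]);
    return result;
}
```

With the handler in place, `demo_cancel_during_read()` returns 0. Without the `pthread_cleanup_push`/`pthread_cleanup_pop` pair, the same sequence would leave the count at 1, which is the dangling reference this issue is about.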

Describe alternatives you've considered
Leave as-is and accept a risk that there may be dangling references when tasks are deleted.

Additional context
There is no way for the OSAL to know about and handle inter-relationships between resources that the application may impose (e.g. using a mutex to control access to a shared memory, or a reference count of its own), and therefore this cleanup/recovery can never be bulletproof.

While OSAL could potentially do better at handling its own reference counts in the context of task deletion, there will still be a remaining risk of unreleased resources after tasks are deleted, for things it cannot track.

Requester Info
Joseph Hickey, Vantage Systems, Inc.

jphickey · 2020-12-10T14:46:54Z

Update: it turns out this issue can actually be reproduced, so it is not just theoretical. We have a recently added test case which creates and immediately deletes a timebase here:

osal/src/tests/time-base-api-test/time-base-api-test.c

Lines 87 to 90 in 9407cdf

    
    tbc_results[i] = OS_TimeBaseCreate(&tb_id[i], TimeBaseIter[i], 0);
    UtAssert_True(tbc_results[i] == expected, "OS_TimeBaseCreate() (%ld) == OS_SUCCESS", (long)actual);
    OS_TimeBaseDelete(tb_id[i]);

Due to some other changes, the timing of the task startup changed, and this test case was actually deleting the timebase internal task while it was holding the mutex, thus creating a deadlock for the next operation.

That specific issue can be avoided by preventing cancellation while actively holding the lock, but the more general issue is still there: if the task is doing a read/write operation, where the table lock is NOT held during the actual blocking operation, then deleting it will leave a nonzero refcount that will never be released, rendering the object undeletable.
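
For the first half of that, preventing cancellation while the lock is held, here is a minimal pthreads sketch. It is an assumption about how such a guard could look, not the actual OSAL change, and `demo_lock_table`, `demo_unlock_table`, `demo_worker` and `demo_cancel_between_locks` are hypothetical names. Cancellation is disabled before the mutex is taken and restored after it is released, so a pending cancel request can only be acted on while the lock is free.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static pthread_mutex_t demo_table_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int      demo_iterations;

/* Disable cancellation first, then take the lock */
static void demo_lock_table(int *old_state)
{
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, old_state);
    pthread_mutex_lock(&demo_table_lock);
}

/* Release the lock first, then restore the caller's cancel state */
static void demo_unlock_table(int old_state)
{
    int ignored;

    pthread_mutex_unlock(&demo_table_lock);
    pthread_setcancelstate(old_state, &ignored);
}

static void *demo_worker(void *arg)
{
    int old_state;

    (void)arg;
    for (;;)
    {
        demo_lock_table(&old_state);
        atomic_fetch_add(&demo_iterations, 1);
        pthread_testcancel(); /* no effect: cancellation is disabled while locked */
        demo_unlock_table(old_state);
        pthread_testcancel(); /* a pending request is honored here, lock already released */
    }
    return NULL; /* not reached */
}

/* Cancels the worker, then reports whether the lock is still usable.
 * Returns 0 if the mutex could be acquired afterwards. */
int demo_cancel_between_locks(void)
{
    pthread_t tid;
    int       rc;

    if (pthread_create(&tid, NULL, demo_worker, NULL) != 0)
    {
        return -1;
    }

    /* Wait until the worker has been through the locked region at least once */
    while (atomic_load(&demo_iterations) == 0)
    {
        sched_yield();
    }

    pthread_cancel(tid);
    pthread_join(tid, NULL);

    rc = pthread_mutex_trylock(&demo_table_lock);
    if (rc == 0)
    {
        pthread_mutex_unlock(&demo_table_lock);
    }
    return rc;
}
```

This only protects the lock itself. It does nothing for a reference count taken under the lock and held across a later blocking call, which is the more general case described above.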

jphickey · 2020-12-10T15:26:24Z

Some specific issues/examples of where there could be problems:

If there are two tasks A and B, and task A deletes task B, a dangling/unreleased reference might be left behind if B was:

Reading directory entries in OS_DirectoryRead - due to readdir()
In pretty much ANY file operation - due to stat/lseek/read/write etc. all being cancellation points
Actively creating or deleting a resource of another type - semaphore, mutex, loading/unloading a module, etc.
Any C library call that fails and therefore causes the implementation to invoke the strerror() function - which may be a cancellation point.

Note these cases are specific to POSIX/pthreads which implements thread cancellation at defined cancellation points. In many cases these issues could be avoided by more aggressively tracking the resource(s) held by tasks.

So while we could make the POSIX OSAL more robust here, as far as I know the only method of deleting tasks on VxWorks (taskDelete) and RTEMS (rtems_task_delete) is equivalent to the pthread async cancellation -- which offers no control over where exactly it happens. So it is unlikely that VxWorks or RTEMS could ever make arbitrary task deletion safe in OSAL. The only way to make it safe is to actually code the target task (i.e. task B in the example above) to perform its own orderly shutdown and ensure it is NOT doing any ops when it is deleted by task A (or it could self-exit when done).

jphickey · 2020-12-10T15:31:42Z

The best solution here might be to just document that the OS_TaskDelete() function is potentially unsafe due to its inherent nature. Creating tasks that run forever is one way to avoid any problems. But if deletion is desired, then developers must explicitly code tasks to support this and include some sort of safe-state or shutdown logic within the task itself to allow for its safe deletion.
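
A minimal sketch of what such shutdown logic could look like, written against plain pthreads and C11 atomics so that it is self-contained. The names `demo_task_state_t`, `demo_task_b` and `demo_orderly_shutdown` are hypothetical, and `refs_held` only stands in for whatever OSAL-internal references the task holds during an operation. Task A requests shutdown and waits for task B to finish, instead of deleting it.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state between the controlling task (A) and the worker (B) */
typedef struct
{
    atomic_bool shutdown_requested;
    atomic_int  refs_held;  /* stand-in for references held during an operation */
    atomic_int  iterations; /* completed operations */
} demo_task_state_t;

/* Task B: checks the flag only between operations, never in the middle of one */
static void *demo_task_b(void *arg)
{
    demo_task_state_t *st = arg;

    while (!atomic_load(&st->shutdown_requested))
    {
        atomic_fetch_add(&st->refs_held, 1); /* operation begins */
        sched_yield();                       /* stand-in for the blocking call */
        atomic_fetch_sub(&st->refs_held, 1); /* operation ends */
        atomic_fetch_add(&st->iterations, 1);
    }

    return NULL; /* self-exit at a safe point */
}

/* Task A: asks B to stop and waits for it.
 * Returns the number of references still held after B has exited. */
int demo_orderly_shutdown(void)
{
    demo_task_state_t st;
    pthread_t         tid;

    atomic_init(&st.shutdown_requested, false);
    atomic_init(&st.refs_held, 0);
    atomic_init(&st.iterations, 0);

    if (pthread_create(&tid, NULL, demo_task_b, &st) != 0)
    {
        return -1;
    }

    /* Let B complete at least one operation */
    while (atomic_load(&st.iterations) == 0)
    {
        sched_yield();
    }

    atomic_store(&st.shutdown_requested, true);
    pthread_join(tid, NULL);

    return atomic_load(&st.refs_held);
}
```

Here `demo_orderly_shutdown()` returns 0 because B only exits between operations. In a real task the blocking call would need a timeout (or some other wakeup) so that the flag is actually polled, and the self-exit would be a return from the task entry point or a call to OS_TaskExit() rather than a pthread join.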

jphickey self-assigned this May 19, 2020

jphickey added the enhancement label May 19, 2020

jphickey mentioned this issue May 19, 2020

Global lock for "timecb" objects missing #474

Closed

jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022

Fix nasa#473 Adding coverage tests to cFE TIME

eec8ec9

jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022

Merge pull request #1836 from pepepr08/fix473-time-coverage

54612f1

Fix nasa#473 Adding coverage tests to cFE TIME
