Threading problem

This is going to be an analysis in order to fix the bug from https://codepad.co/snippet/EP7uOrlh There are two mutex protected variables involved: 1) AIStatefulTask::mState 2) AIEngine::mEngineState mState contains the base state of the task and a pointer to it's current engine (a AIEngine*) called current_engine. mEngineState contains a boolean 'waiting' and a std::list with pointers to AIStatefulTask's that are added to this engine. Obviously mState needs to be locked while reading/writing current_engine, and mEngineState needs to be locked while manipulating the std::list. Moreover, if ever the two are simulateneously locked then that has to always happen in the same order to avoid a deadlock. Most of the time we expect a correspondence between current_engine and the linked list's of engines: the task should be added to the list of the engine current_engine (and only that one). If current_engine is nullptr then it should not be in the list of any engine. However, a task can (want to) switch engines or go idle and (want to) be removed from it's current list. The way things currently work is that current_engine signals the wish of the task; this value can be changed only while executing, but since a task can also execute without being added to an engine (yet), for example at the moment run(), cont() or advance_state() is called, we have the situation that current_engine can change from nullptr to engine1, from engine1 to engine2 and from engine2 to nullptr while being added to any engine or none at all. Engines simply check, shortly before running a task, what the value of current_engine is and when that isn't themselves they don't run the task but remove them from their list. In other words, a task adds himself to an engine when it wants to run in it, but can rapidly change its mind; while an engine only removes a task when it gets to that. So, as a result, a task can be added to multiple lists and only the value of current_engine is the canonical engine that may run it. However, a task is being run by an engine while mState is unlocked (because we want it unlocked most of the time), so theoretically an engine can start to run a task, which is then run from elsewhere before that happens, switches state/engine, and then continues to run by the (now wrong) engine. The only way to prevent this is by doing the canonical check whether we're being run by the requested engine AFTER mRunMutex is already locked (which shuts out any runs from elsewhere that might change engine again). Things we definitely want to prevent are the following: * current_engine is set, but the task is not added to the list of that engine. This is easily avoided by setting current_engine before adding the task to that engine. However, this leaves one problem: a task can still be added more than once to an engine. If this is an engine that it shouldn't run in than all instances are simply (eventually) removed, but it can happen that having two or more instances of that same task in the list of an engine poses a problem: The engine namely first run a task, and when it returns checks the value of current_engine and removes the task from its list when current_engine became the nullptr. But since that might not remove all instances of that task it may happen that the same task is still run again. Removing not one but ALL instances from a list has it's problems too, and here it becomes tricky: Because of the (current) order in which mState and mEngineState may be locked (mState first), we cannot first lock mEngineState and then read the value of current_engine from the task. Instead, the code looks like this: do { AIStatefulTask& stateful_task(queued_element->stateful_task()); stateful_task.multiplex(AIStatefulTask::normal_run); // Run the task. // This locks mState shortly, so it must be called before locking mEngineState because // add() locks mEngineState while holding mState. bool active = stateful_task.active(this); // Is current_engine equal this engine? engine_state_type::wat engine_state_w(mEngineState); // This locks mEngineState. if (!active) engine_state_w->list.erase(queued_element++); else ++queued_element; } while (queued_element != end); This code guarantees that the element that is removed is the once that just ran: queued_element is an iterator into the list and it just ran. If current_engine is nullptr after returning from multiplex then we can safe remove this instance (arguably we want to remove all instances even) because if stays nullptr then we can not remove once but all instances even, and if directly after setting the active boolean current_engine is set to this engine again then there was also another add() and we will end up with (at least) one instance in the list anyway after the erase. However, if we'd erase all elements we could end up with a list that doesn't contain this task while it's current_engine points to us. One solution could be to keep the lock on mState while locking mEngineState, just like we do when we're adding a task to an engine, and removing all instances of the task from the list. In that case the task is removed while current_engine is nullptr and nobody can interfere. That is, if some other thread tries to run the task it will block until we're done with removing all instances and when a subsequent run of the task by another thread would readd us to this list then that's ok because it will be in sync with setting current_engine. I don't really like this solution because it keeps mState locked a lot longer and we have to traverse the whole list checking for other instances instead of just erasing a single iterator. This solution would be viable if keeping the lock on mState would also guarantee that a task is only ever added once to a list: in that case we wouldn't have to traverse the whole list and the lock on mState would be kept shorter. But currently it is possible that other threads run the task at will, ie with a series of idle(), cont(), idle(), cont(), ... where each idle() sets current_engine to nullptr and each cont() sets it back to the engine (assuming that is -say- its default engine) and adds it to its list. So, for this to work we then also need to change the code that adds a task to a list and make sure it is never added more than once which then again means keeping the lock on mState while traversing the whole list. A second solution is to accept that the engine lists can be temporarily out of sync with the state of a task and thus run it while it should be idle. While the only problem seems to be that an engine should only run a task when it is allowed to; that is, we can pass a pointer to the engine that runs multiplex along with multiplex and check if that is equal to current_engine before running the task. The only demand here, I think, is that this decision is made at a point where it is guaranteed that we are also the first thread that will run that task, otherwise the decision can get out of sync again. And that means a considerable amount of initialization code has already run; making aborting multiplex() because we do NOT want to run not trivial.

Fixed a million grammar and typo errors.

Threading problem

Be the first to comment