[Glass] Recursion deadlock with TransientRecursionLock

Dale Henrichs via Glass glass at lists.gemtalksystems.com
Mon Apr 13 12:53:48 PDT 2015


Johan,

I'm glad that you "talked this out" ... I'm still trying to wrap my 
brain around the "combination of conditions". I think that this problem 
might be partially related to Issue #64 [1] as well ... it also involves 
commit errors while logging commit conflicts, which is a common theme.

The part that is mind-boggling to me is how the activeProcess is 
changed at this point in the stack ... I seem to recall that there 
should be a process fork at the session level

... hold on a second ...

I just looked, and there used to be (in 2.8?) a fork at the session 
level; that fork insulated the rest of the handling stack (which is 
pretty much unchanged since 2.8) from process perturbations caused by 
continuations ... with the introduction of partial continuations in 
3.0(?) the fork was eliminated ... so these days the process can (and 
apparently does) change all of the time during `(resultBlock value: 
aNativeRequest)`, and as you point out, this doesn't matter unless a 
commit conflict occurs during the handling of a continuation ....
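
For reference, the shape of that old insulation was roughly the sketch 
below (from memory, not the actual 2.8 code): the request handling ran 
in a freshly forked GsProcess while the calling process simply waited 
on a semaphore, so continuation games inside the handler could never 
change the identity of the process that had entered the transactionMutex:

    | doneSemaphore resultHolder |
    doneSemaphore := Semaphore new.
    resultHolder := Array new: 1.
    "evaluate the Seaside handler in its own process ..."
    [ resultHolder at: 1 put: (resultBlock value: aNativeRequest).
      doneSemaphore signal ] fork.
    "... while the process that owns the mutex just waits for it"
    doneSemaphore wait.
    result := resultHolder at: 1.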

... and your proposed fix looks like a perfectly good solution ...

So I will take a look at your bug report where you propose alternate 
solutions and comment there ....

Dale

[1] https://github.com/GsDevKit/Seaside31/issues/64

On 4/13/15 7:25 AM, Johan Brichau via Glass wrote:
> Sorry for talking in partial mails like this :)
> But yes, it’s obvious that the continuation logic is creating new 
> processes and that this is the cause.
>
> I fixed it by setting the #lockingProcess in the mutex back to 
> ‘Processor activeProcess’ when we return from the Seaside request 
> handling, right before we try committing and possibly need to commit 
> object log entries when the commit fails. So, in 
> #seasideProcessRequestWithRetry:resultBlock:
>
> ...
> [ result := (resultBlock value: aNativeRequest) ]
> 	on: (WARetryHttpRequest, SafelyPerformBlockRequiringAbort,
> 			DALGsSemanticConflictException)
> 	do: [:ex |
> 		retryRequest := true.
> 		retryException := ex ].
> self transactionMutex lockingProcess: Processor activeProcess.	"<-- the added line"
> ...
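>
> (The #lockingProcess: send assumes a plain setter on the recursion 
> lock; if your TransientRecursionLock does not already have one, it is 
> nothing more than this sketch, where the instance variable name is 
> whatever the lock uses to remember its owner:)
>
> lockingProcess: aGsProcess
> 	"Hand ownership of the lock to aGsProcess, so that the recursive
> 	 #critical: call made while logging a commit failure recognizes the
> 	 current Processor activeProcess as the owner."
> 	lockingProcess := aGsProcess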
>
> The reason this does not happen too often is the combination of 
> conditions… and the result is nasty… but the fix seems easy ;-)
>
> Johan
>
>> On 13 Apr 2015, at 15:58, Johan Brichau <johan at yesplan.be 
>> <mailto:johan at yesplan.be>> wrote:
>>
>> A bit of follow-up: I suspect this is because of the #call:/#answer 
>> logic of Seaside.
>> When the request processing of Seaside returns, it might have 
>> processed a call or answer, which manipulates the call stack and 
>> seems to induce a change of the process.
>>
>> At least, when I process requests without call/answer, no such thing 
>> happens.
>>
>> Dale, you can probably tell what goes on here :)
>> Working my way through this… I can imagine a couple of 
>> workarounds/fixes, but I would like to pinpoint the exact issue first.
>>
>> Johan
>>
>>
>>> On 13 Apr 2015, at 11:23, Johan Brichau <johan at yesplan.be 
>>> <mailto:johan at yesplan.be>> wrote:
>>>
>>> Hi,
>>>
>>> I’m getting deadlocks in recursive calls of #critical: on the 
>>> Seaside transactionMutex (which is a TransientRecursionLock). The 
>>> #lockingProcess stored by the mutex is no longer the same object as 
>>> ‘Processor activeProcess’, though it should be (see below). Hence, 
>>> the mutex blocks access to the critical section and deadlocks, 
>>> because the #lockingProcess object no longer exists. It actually 
>>> seems as if the object returned by ‘Processor activeProcess’ changes 
>>> even though we are still in the same thread.
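>>>
>>> To make the failure mode concrete: a recursion lock of this kind 
>>> guards re-entry with an identity test against the stored owner, 
>>> roughly like this (a simplified sketch, not the actual GemStone 
>>> source):
>>>
>>> critical: aBlock
>>> 	"Re-enter immediately when the active process already owns the
>>> 	 lock; otherwise wait (the semaphore starts out with one signal
>>> 	 available), take ownership, run the block and release."
>>> 	Processor activeProcess == lockingProcess
>>> 		ifTrue: [ ^ aBlock value ].
>>> 	semaphore wait.
>>> 	lockingProcess := Processor activeProcess.
>>> 	^ aBlock ensure: [
>>> 		lockingProcess := nil.
>>> 		semaphore signal ]
>>>
>>> When the object answered by ‘Processor activeProcess’ changes between 
>>> the outer and the inner call, that identity test fails and the inner 
>>> call ends up waiting on a semaphore that its own outer invocation 
>>> will never signal.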
>>>
>>> I found this by finally getting a good trace of a long-standing bug 
>>> that sometimes blocks a Seaside gem outside of a transaction, 
>>> eventually having the gem killed off by the stone due to a 
>>> lostOTView. The typical log sequence found in the Seaside gem log is 
>>> quoted below. The gem log would contain a log entry for a commit 
>>> failure but the object log would not, which means that the commit to 
>>> the object log never happened. Furthermore, the gem is left outside 
>>> a transaction when that happens, eventually getting the #lostOT 
>>> signal from the stone. This log is from a 2.4.4.1 stone, but I get 
>>> the same results in a 3.1.0.6 stone (I have not tried with 3.2 at 
>>> this time).
>>>
>>>> failure
>>>> Read-Write Conflicts...
>>>> Write-Write Conflicts...
>>>>     3856175617
>>>> Write-Dependency Conflicts...
>>>> Write-ReadLock Conflicts...
>>>> Write-WriteLock Conflicts...
>>>> Rc-Write-Write Conflicts...
>>>> Synchronized-Commit Conflicts...
>>>> ----------- Commit failure - retrying LOG ENTRY: 
>>>> aSymbolDictionary-----------
>>>> -----------------------------------------------------
>>>> GemStone: Error   Nonfatal
>>>> When running outside of a transaction the session failed to respond
>>>> quickly enough to a request from Stone to abort.  All object caches
>>>> have been cleared.  ---
>>>> Error Category: 231169 [GemStone] Number: 3031 Arg Count: 0 Context 
>>>> : 7964271617
>>>>
>>>> Now executing the following command saved from "iferr 1":
>>>>    where
>>>> ==> 1 ProcessorScheduler >> _reapEvents: @63 line 65 [GsMethod 
>>>> 10209537]
>>>> 2 ProcessorScheduler >> _findReadyProcess @2 line 7   [GsMethod 
>>>> 10204417]
>>>> 3 ProcessorScheduler >> _reschedule @2 line 8 [GsMethod 10207489]
>>>> 4 GsProcess >> _wait @2 line 3   [GsMethod 4503553]
>>>> 5 Delay >> wait @23 line 23   [GsMethod 11755009]
>>>> 6 SimpleBlock in Executed Code @19 line 24 [GsMethod 3879968001]
>>>> 7 GsProcess >> _startPart2 @15 line 17 [GsMethod 4501249]
>>>> 8 GsProcess >> _start @1 line 9   [GsMethod 4501761]
>>>>   [GsProcess 7964271617]
>>>> topaz 1> [268 sz:0 cls: 68097 Boolean] true
>>>> topaz 1>
>>>> [Info]: Logging out at 03/27/15 14:47:52 CET
>>>
>>> The exact point where the process gets blocked is the call to 
>>> #doTransaction: in the #saveLogEntry: method that logs the commit 
>>> failure. So, the gem log got the log entry on file, but the 
>>> transaction that commits to the object log is blocked and never 
>>> proceeds.
>>> Inside #doTransaction:, we first make a recursive call to #critical: 
>>> on the transactionMutex (‘recursive’ because it happens inside the 
>>> Seaside request cycle, which is protected by the same mutex). It 
>>> should never block, because it is a RecursionLock that allows 
>>> recursive entry; yet it blocks *because the #lockingProcess object 
>>> and the ‘Processor activeProcess’ object are no longer the same.*
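>>>
>>> Schematically, the nesting looks like this (a rough sketch of the 
>>> shape only; #commitLogEntryToObjectLog stands in for whatever 
>>> #saveLogEntry:/#doTransaction: actually do):
>>>
>>> transactionMutex critical: [
>>> 	"outer entry: the lock records the process that started the request"
>>> 	result := resultBlock value: aNativeRequest.	"may swap Processor activeProcess"
>>> 	"on a commit failure, #saveLogEntry: goes through #doTransaction:, i.e."
>>> 	transactionMutex critical: [ self commitLogEntryToObjectLog ] ]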
>>>
>>> I added some transcript logging at several points in the request 
>>> handling and in the TransientRecursionLock>>#critical: method that 
>>> demonstrates what happens (a sketch of the instrumentation follows 
>>> the list below).
>>>
>>> - the ‘START’ entry prints the label of the activeProcess at the 
>>> beginning of the request handling inside the 
>>> #seasideProcessRequestWithRetry:resultBlock: method.
>>> - the ‘CONTINUE’ entry prints the label of the activeProcess right 
>>> after the Seaside processing (i.e. right after result := (resultBlock 
>>> value: aNativeRequest))
>>> - the indented parts print the labels of all processes (i.e. 
>>> Processor allProcesses) that exist at the same time as the log entry 
>>> just before them.
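>>>
>>> (For reference, the instrumentation is essentially of this form; it 
>>> is only a sketch, where the label text in the log comes from 
>>> something like the process’ printString and the oops from #asOop:)
>>>
>>> | active |
>>> active := Processor activeProcess.
>>> Transcript show: 'START oop=', active asOop printString, ' ', active printString.
>>> Processor allProcesses do: [:each | Transcript show: each printString ].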
>>>
>>> What we see is that the oop of the activeProcess at START and at 
>>> CONTINUE is not the same.
>>> However, right after the ‘Commit failure’, this process wants to 
>>> commit to the object log but is blocked by the semaphore, because 
>>> its activeProcess now has oop 1022696705 while it was 1020803073 
>>> when it first entered the critical section. So, this recursive call 
>>> is being blocked by itself, because the semaphore thinks it is 
>>> another process.
>>>
>>>> --transcript--'yield for process 1020803073'
>>>> --transcript--'(priority=15) active [oop=1020803073] START 
>>>>  /Yesplan?_s=ChMaN_wjn2Vs6KYg&_k=D5lOZWUbexZzdMjo&g=aZVW4S9mhfdwKZH5&277='
>>>> --transcript--'(priority=1) waiting on a Delay [oop=1007921665]'
>>>> --transcript--'(priority=15) waiting on a Delay [oop=1007922177]'
>>>> --transcript--'(priority=15) active [oop=1020803073]'
>>>> --transcript--'(priority=15) active [oop=1022696705] CONTINUING 
>>>>  /Yesplan?_s=ChMaN_wjn2Vs6KYg&_k=D5lOZWUbexZzdMjo&g=aZVW4S9mhfdwKZH5&277='
>>>> --transcript--'(priority=1) waiting on a Delay [oop=1007921665]'
>>>> --transcript--'(priority=15) waiting on a Delay [oop=1007922177]'
>>>> --transcript--'(priority=15) active [oop=1022696705]'
>>>> ----------- Commit failure - retrying LOG ENTRY: 
>>>> aSymbolDictionary-----------
>>>> --transcript--'want to commit to object log by 1022696705'
>>>> --transcript--'block for process 1022696705 by 1020803073'
>>>> --transcript--'(priority=1) waiting on a Delay [oop=1007921665]'
>>>> --transcript--'(priority=15) waiting on a Delay [oop=1007922177]'
>>>> --transcript--'(priority=15) active [oop=1022696705]'
>>>
>>> Mind that this ‘switching’ of ‘Processor activeProcess’ objects does 
>>> not always happen, and it only becomes a problem when there is a 
>>> recursive call on the transactionMutex. However, these conditions 
>>> pop up from time to time in production. I just expose them a bit 
>>> faster by randomly returning false from the doCommitTransaction 
>>> method, which forces the Seaside code into a few more retries per 
>>> request, so I can reproduce the problem easily.
>>>
>>> The problem is, of course: why is this happening? The implementation 
>>> of TransientRecursionLock>>critical: relies on the fact that the 
>>> same thread returns the same object for ‘Processor activeProcess’, 
>>> while it seems it does not.
>>> Hope someone can shed some light on this issue...
>>>
>>> greetings,
>>> Johan
>>
>
>
>
