[Glass] Recursion deadlock with TransientRecursionLock

Mon Apr 13 08:01:59 PDT 2015

Created an issue for better tracking: https://github.com/GsDevKit/Seaside31/issues/71 <https://github.com/GsDevKit/Seaside31/issues/71>

> On 13 Apr 2015, at 16:25, Johan Brichau <johan at yesplan.be> wrote:
> 
> Sorry for talking in partial mails like this :)
> But yes, it’s obvious that the continuation logic is creating new processes and that this is the cause.
> 
> I fixed it by setting the #lockingProcess in the mutex back to ‘Processor activeProcess’ when we return from the seaside request handling, right before we try committing and possibly need to commit object log entries when the commit fails. So, in #seasideProcessRequestWithRetry:resultBlock:
> 
> 		...
> 		[result := (resultBlock value: aNativeRequest)]
> 				on: (WARetryHttpRequest, SafelyPerformBlockRequiringAbort, DALGsSemanticConflictException)
> 				do: [:ex |
> 					
> 					retryRequest := true.
> 					retryException := ex ].
> 		self transactionMutex lockingProcess: Processor activeProcess.
> 		...
> 
> The reason this does not happen too much is the combination of conditions… and the result is nasty… but the fix seems easy ;-)
> 
> Johan
> 
>> On 13 Apr 2015, at 15:58, Johan Brichau <johan at yesplan.be <mailto:johan at yesplan.be>> wrote:
>> 
>> A bit of follow-up: I am suspecting this is because of the #call:/#answer logic of Seaside.
>> When the request processing of Seaside returns, it might have processed a call or answer, which is manipulating the call stack, which seems to induce a change of the process.
>> 
>> At least, when I process requests with call/answer, there is no such thing happening.
>> 
>> Dale, you probably can tell what goes on here :)
>> Working my way through this… I can imagine a couple of workarounds/fixes but I would like to pinpoint the exact issue first
>> 
>> Johan
>> 
>> 
>>> On 13 Apr 2015, at 11:23, Johan Brichau <johan at yesplan.be <mailto:johan at yesplan.be>> wrote:
>>> 
>>> Hi,
>>> 
>>> I’m getting deadlocks in recursive calls of #critical: on the Seaside transactionMutex (which is a TransientRecursionLock). The #lockingProcess stored by the mutex is no longer the same object as the ‘Processor activeProcess’, though it should be (see below).
>>> Hence, the mutex blocks access to the critical section and deadlocks because the #lockingProcess object no longer exists. It actually seems as if the result object of ‘Processor activeProcess’ changes even though we are still in the same thread.
>>> 
>>> I found this by finally getting a good tracing on a long-lasting bug that sometimes blocks a seaside gem outside of a transaction, eventually having the gem killed off by the stone due to a lostOTView. The typical log sequence found in the seaside gem log is quoted below. The gem log would contain a log entry for a commit failure but the object log would not. This means that the commit on the object log never happened. Furthermore, the gem is left outside a transaction when that happens, eventually getting the #lostOT signal from the stone. This log is from a 2.4.4.1 stone but I get the same results in a 3.1.0.6 stone (have not tried with 3.2 at this time).
>>> 
>>>> failure
>>>> Read-Write Conflicts...
>>>> Write-Write Conflicts...
>>>>     3856175617
>>>> Write-Dependency Conflicts...
>>>> Write-ReadLock Conflicts...
>>>> Write-WriteLock Conflicts...
>>>> Rc-Write-Write Conflicts...
>>>> Synchronized-Commit Conflicts...
>>>> ----------- Commit failure - retrying LOG ENTRY: aSymbolDictionary-----------
>>>> -----------------------------------------------------
>>>> GemStone: Error         Nonfatal
>>>> When running outside of a transaction the session failed to respond
>>>> quickly enough to a request from Stone to abort.  All object caches
>>>> have been cleared.  --- 
>>>> Error Category: 231169 [GemStone] Number: 3031 Arg Count: 0 Context : 7964271617
>>>> 
>>>> Now executing the following command saved from "iferr 1":
>>>>    where
>>>> ==> 1 ProcessorScheduler >> _reapEvents: @63 line 65   [GsMethod 10209537]
>>>> 2 ProcessorScheduler >> _findReadyProcess @2 line 7   [GsMethod 10204417]
>>>> 3 ProcessorScheduler >> _reschedule @2 line 8   [GsMethod 10207489]
>>>> 4 GsProcess >> _wait @2 line 3   [GsMethod 4503553]
>>>> 5 Delay >> wait @23 line 23   [GsMethod 11755009]
>>>> 6 SimpleBlock in Executed Code @19 line 24   [GsMethod 3879968001]
>>>> 7 GsProcess >> _startPart2 @15 line 17   [GsMethod 4501249]
>>>> 8 GsProcess >> _start @1 line 9   [GsMethod 4501761]
>>>>   [GsProcess 7964271617]
>>>> topaz 1> [268 sz:0 cls: 68097 Boolean] true
>>>> topaz 1> 
>>>> [Info]: Logging out at 03/27/15 14:47:52 CET
>>> 
>>> The exact point where the process gets blocked is the call to the #doTransaction: in the #saveLogEntry: method that logs the commit failure. So, the gem log got the log entry on file but the transaction to commit to the object log is blocked and never proceeds. 
>>> Inside #doTransaction, we first make a recursive call to #critical: on the transactionMutex (‘recursive’ because it happens inside the Seaside request cycle, which is protected by the same mutex). However, it should never block because it is a RecursionLock that allows recursive entry; it blocks however because the #lockingProcess object and the ‘Processor activeProcess’ object are no longer the same.
>>> 
>>> I added some transcript logging at several points in the request handling and in the TransientRecursionSemaphore>>#critical: method that demonstrates what happens.
>>> 
>>> - the ‘START’ entry prints label of the activeProcess at the beginning of the request handling inside the #seasideProcessRequestWithRetry:resultBlock method.
>>> - the ‘CONTINUE’ entry prints label of the activeProcess right after the Seaside processing (i.e. right after result := (resultBlock value: aNativeRequest) )
>>> - The indented parts are a printing of the labels of all processes (i.e. Processor allProcesses) to see what processes exist at the same time as the log entry just before it.
>>> 
>>> What we see that happens is that the oop of the activeProcess at the START and at CONTINUE is not the same. 
>>> However, right after the ‘Commit Failure’, this process wants to commit to the object log but is being blocked by semaphore because it’s activeProcess now has oop 1022696705 and it was 1020803073 when it first entered the critical section. So, this recursive call is being blocked by itself because the semaphore thinks it’s another process. 
>>> 
>>>> --transcript--'yield for process 1020803073’
>>>> --transcript--'(priority=15) active [oop=1020803073] START  /Yesplan?_s=ChMaN_wjn2Vs6KYg&_k=D5lOZWUbexZzdMjo&g=aZVW4S9mhfdwKZH5&277='
>>>> 	--transcript--'(priority=1) waiting on a Delay [oop=1007921665]'
>>>> 	--transcript--'(priority=15) waiting on a Delay [oop=1007922177]'
>>>> 	--transcript--'(priority=15) active [oop=1020803073]’
>>>> --transcript--'(priority=15) active [oop=1022696705] CONTINUING  /Yesplan?_s=ChMaN_wjn2Vs6KYg&_k=D5lOZWUbexZzdMjo&g=aZVW4S9mhfdwKZH5&277='
>>>> 	--transcript--'(priority=1) waiting on a Delay [oop=1007921665]'
>>>> 	--transcript--'(priority=15) waiting on a Delay [oop=1007922177]'
>>>> 	--transcript--'(priority=15) active [oop=1022696705]'
>>>> ----------- Commit failure - retrying LOG ENTRY: aSymbolDictionary-----------
>>>> --transcript--'want to commit to object log by 1022696705'
>>>> --transcript--'block for process 1022696705 by 1020803073'
>>>> --transcript--'(priority=1) waiting on a Delay [oop=1007921665]'
>>>> --transcript--'(priority=15) waiting on a Delay [oop=1007922177]'
>>>> --transcript--'(priority=15) active [oop=1022696705]’
>>> 
>>> Mind that this ‘switching’ of Process activeProcess objects does not always happens and it only becomes a problem when there is a recursive call on the transactionMutex. However, these conditions pop up from time to time in production. I just expose them a bit faster by randomly returning false from the doCommitTransaction method, which forces the Seaside code in a bit more retries per request, so I can reproduce it easily. 
>>> 
>>> The problem is, of course, why is this happening? The implementation of TransientRecursionLock>>critical: relies on the fact that same thread returns the same object for ‘Processor activeProcess’, while it seems it does not.
>>> Hope someone can shed some light on this issue...
>>> 
>>> greetings,
>>> Johan
>> 
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gemtalksystems.com/mailman/private/glass/attachments/20150413/6a596557/attachment-0001.html>