[Glass] Swazoo server hangs

Wed Nov 6 07:27:31 PST 2013

----- Original Message -----
| From: "Otto Behrens" <otto at finworks.biz>
| To: "Johan Brichau" <johan at yesplan.be>
| Cc: "Dawie Strauss" <dawie at finworks.biz>, glass at lists.gemtalksystems.com
| Sent: Wednesday, November 6, 2013 4:49:25 AM
| Subject: Re: [Glass] Swazoo server hangs
| 
| Thanks for the input Johan.
| 
| > First off: given the number of problems you have using Swazoo and
| > that Zinc server has not been battle tested in GemStone (and there
| > are open issues nobody really looked at), I definitely recommend
| > switching (back) to FastCGI. It is stable and fast. But, of
| > course, it would be great if you can flesh out the remaining
| > issues with Zinc server on GemStone ;-)
| 
| Thanks, really can't work on Zinc now, pressure => battle tested
| FastCGI.
| 
| > Second, are you seeing the lock ups occurring frequently? Are they
| > irregular or is there a pattern?
| 
| Yes, there's a pattern.
| 
| Someone else looked at the problem, but here's my layman's
| interpretation. We picked up the pattern when we had the same Ajax
| call on the onblur and onchange events of the same HTML element.
| This caused virtually simultaneous calls to 2 different Swazoo
| servers with the same session (and action) id. This caused a
| conflict and one process retries. When it retries, it tries to read
| from the socket again, which has already been read on the first try
| (hey, there's no 2 phase commit on reading from sockets?). So,
| something like that. In principle, when we read from / write to
| external systems and a commit fails in GS, we generally have a
| problem.
| 
| Does this make sense? I can get more details if you need.

This makes a lot of sense ... I have never really hammered Swazoo under load the way I have FastCGI, so this pattern of retrying after a failed commit has probably never been tested ... The Zinc code will have to undergo similar load testing before it's ready ...
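
To make the retry problem concrete, here is a minimal sketch in Python rather than GemStone Smalltalk. It is not Swazoo's code: `CommitConflict`, `handle_request`, `process` and `commit` are hypothetical names standing in for the server's request handling and the GemStone commit. The idea is simply that the socket is read once, before the retry loop, so a retry after a failed commit works from the buffered bytes instead of reading the socket a second time.

```python
import io


class CommitConflict(Exception):
    """Hypothetical stand-in for a failed commit."""


def handle_request(stream, process, commit, max_retries=3):
    # Read the external input exactly once, before the retry loop.
    raw = stream.read()
    for attempt in range(1, max_retries + 1):
        # Recompute the response from the buffered bytes on every attempt.
        response = process(raw)
        try:
            commit()
            return response, attempt
        except CommitConflict:
            # Retry without touching the stream again.
            continue
    raise CommitConflict("gave up after %d attempts" % max_retries)
```

With a commit that fails once and then succeeds, the second attempt still sees the full request, because nothing inside the loop consumes the stream.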

| 
| > I am asking this because we do have a similar problem that occurs
| > (rather infrequently) with FastCGI adaptors for Seaside [1]:
| > A Seaside gem will become unresponsive after some time. I already
| > managed to find out that the gateSemaphore of a quit system could
| > still be less than 10 (i.e. some processes got locked and never
| > signaled the semaphore) and that it might have something to do
| > with the front-end server dropping connections. I'm not sure if
| > these problems are related though.
| 
| Does not sound as if they are related, but I suppose it could be.
|

I think it is something different as well, but I would like to get this problem under a microscope some day... 

According to Google Code Issue #341 [1], there might be a correlation with commit conflicts, and I mention a suspicion about ensure blocks ... the issue with ensure blocks is that when an error occurs while they are being executed, the remaining ensure blocks might not get evaluated ... so this vulnerability may be causing Swazoo to misbehave as well ...

Johan, it might be worth adding some logging in the ensure blocks associated with the gateSemaphore to rule this out as a possible cause.
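
As a rough illustration of the kind of logging meant here, the sketch below uses Python's try/finally as an analogy for Smalltalk ensure blocks. It is not the FastCGI adaptor's code: `run_guarded`, `work` and `cleanup` are hypothetical names, and the semaphore merely plays the role of the gateSemaphore. If the cleanup step raises, the failure is logged and the semaphore is still signaled, so the log shows whether a failing cleanup is what leaves the count below 10.

```python
import logging
import threading

log = logging.getLogger("gate")


def run_guarded(gate, work, cleanup):
    """Run work() while holding the gate and always give the gate back."""
    gate.acquire()
    try:
        return work()
    finally:
        try:
            cleanup()  # may itself raise
        except Exception:
            # Record the failure instead of letting it skip the release.
            log.exception("cleanup failed while holding the gate")
        finally:
            gate.release()  # runs even if cleanup() raised
            log.info("gate released")
```

Note that this sketch swallows the cleanup error after logging it. That is a design choice for the illustration, not a recommendation for the real adaptor.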

Dale

[1] https://code.google.com/p/glassdb/issues/detail?id=341&q=fastCGI&colspec=ID%20Type%20Status%20Priority%20GLASS%20Version%20Milestone%20Owner%20Summary%20bugid%20Fixed