[Glass] High Availability in GemStone

Thu Feb 20 10:26:37 PST 2014

Hi Bruno,

There are so many failure scenarios that it is difficult to automate the responses. 

The “hot standby” is a situation where transactions are sent to a backup system that tries to keep up. If the backup system has not received the most recent transaction (network delays, slow processing, etc.), then bringing it on-line will involve lost transactions. The decision to fail over to the backup should consider the trade-off of transferring the latest transaction logs (which might be on a disk that is still good) vs. restarting quickly. Is it more important to have the system up quickly or to avoid lost transactions? How quickly? How many lost transactions?
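
To make that trade-off concrete, here is a minimal Python sketch of how a fail-over policy might be written down. Everything in it is an assumption for illustration: the names (FailoverPolicy, choose_action), the inputs, and the three outcomes are hypothetical and not part of GemStone.

```python
from dataclasses import dataclass


@dataclass
class FailoverPolicy:
    # Both limits are business decisions, not values GemStone provides.
    max_lost_transactions: int    # how many lost transactions are tolerable
    max_downtime_seconds: float   # how long the system may stay down


def choose_action(policy, standby_lag_transactions,
                  log_transfer_seconds, primary_logs_readable):
    """Pick a response to a failed primary (hypothetical decision rule).

    standby_lag_transactions: transactions the standby has not yet received.
    log_transfer_seconds: estimated time to copy the latest transaction logs.
    primary_logs_readable: True if the primary's log disk is still good.
    """
    # The standby is close enough: restart quickly and accept the loss.
    if standby_lag_transactions <= policy.max_lost_transactions:
        return "promote_now"
    # The logs can be fetched within the allowed downtime: avoid the loss.
    if primary_logs_readable and log_transfer_seconds <= policy.max_downtime_seconds:
        return "transfer_logs_then_promote"
    # Neither goal can be met, so a person has to decide.
    return "escalate_to_human"
```

Note that the sketch still ends with a human in the loop when neither limit can be satisfied, which is the case that is hardest to automate.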

If the primary system suffered a crash due to corruption in the Shared Page Cache but the host OS is fine, then it might be best to just restart the primary system and let it automatically replay the existing transactions. An automated system to do this could have the stone back in a minute or so. This is what we do for our internal bug tracking GemStone database.

With any fail-over the existing sessions will be lost. With a web application where every request is a new connection that can be handled by a newly-started gem, this can be transparent (the replacement system can even take over the IP address of the failed system). With a rich-client application that has a long-running connection (or with some interface-type background processes), the client will have to reconnect.

It might seem fairly easy to confirm that a system is dead in an automated way, but even then one has the challenge of distinguishing between slow responses and a hung system. Typically a stone will respond to a request from a gem in less than a millisecond, but a network “glitch” with retry might get unstuck in a few seconds. What if just after you decide that a remote host is not responding it comes back to life? What if some clients think it is gone but others never noticed the absence? 
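
One common way to separate "slow" from "hung" is to count consecutive missed health checks and only declare the system dead after a larger threshold. The Python sketch below illustrates that idea only; the class name, the thresholds, and the three states are assumptions, and how a check actually probes the stone is left out.

```python
class FailureDetector:
    """Classify a host from consecutive missed health checks (illustrative)."""

    def __init__(self, suspect_after=3, dead_after=10):
        if not 0 < suspect_after < dead_after:
            raise ValueError("need 0 < suspect_after < dead_after")
        self.suspect_after = suspect_after
        self.dead_after = dead_after
        self.missed = 0  # consecutive checks with no response

    def record(self, responded):
        """Record one health-check result and return the current state."""
        if responded:
            # Any response resets the count: the host came back to life.
            self.missed = 0
        else:
            self.missed += 1
        return self.state()

    def state(self):
        if self.missed >= self.dead_after:
            return "dead"
        if self.missed >= self.suspect_after:
            return "suspect"
        return "healthy"
```

The gap between the two thresholds is where a network glitch with retry can recover without triggering a fail-over. It does not help with the case where different clients disagree about whether the host is gone.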

All of these “smarts” will need to be external to GemStone, probably on a separate host. But will that monitoring system itself have redundancy or is it okay if it is a single point of failure? Will the backup monitor be on a separate host? What if it loses connection to the primary monitor, but the primary monitor has not failed? We are back to the split network problem.
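
The usual answer to the split network problem is to run an odd number of monitors and act only when a strict majority of all of them agree. A small Python sketch of that rule follows; the function name and the vote format are assumptions for illustration, and collecting the votes over the network is not shown.

```python
def may_fail_over(votes, total_monitors):
    """Return True only if a strict majority of ALL monitors see the primary down.

    votes: dict mapping monitor name -> True (primary looks down) or
           False (primary looks up). Monitors we cannot reach are absent.
    total_monitors: number of monitors deployed, reachable or not.
    """
    down_votes = sum(1 for sees_down in votes.values() if sees_down)
    # Counting against the full set means an isolated monitor, which can
    # only hear itself, never reaches a majority on its own.
    return down_votes > total_monitors // 2
```

With three monitors, a single monitor that has lost its own network connection cannot trigger a fail-over, but two that agree can.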

Again, I’m sure someone has thought about this and is selling a solution (Oracle?). But it isn’t an easy problem to solve and I’d be reluctant to let some automated process decide to take over without being sure that there are no missing transactions. The most I’d consider as a general solution is a front-end that routes requests to a read-only system if the main system appears to be unavailable.
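
As a rough Python sketch of that front-end, assuming an HTTP application where GET and HEAD requests never modify the database (an assumption about the application, not something GemStone enforces), the routing rule could look like this. The names are hypothetical.

```python
# Assumption: only these HTTP methods are safe to serve from a read-only copy.
READ_ONLY_METHODS = frozenset({"GET", "HEAD"})


def route(method, primary_available, read_only_available):
    """Choose a backend for one request (illustrative routing rule)."""
    if primary_available:
        return "primary"
    # Primary appears down: serve reads from the read-only system, and
    # refuse writes rather than risk diverging from the primary.
    if method.upper() in READ_ONLY_METHODS and read_only_available:
        return "read_only"
    return "reject"
```

Because nothing is ever written to the read-only system, a wrong guess about the primary being unavailable costs some rejected writes but no lost or conflicting transactions.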

James

On Feb 20, 2014, at 9:59 AM, BrunoBB <smalltalk at adinet.com.uy> wrote:

> Hi James,
> 
> Thanks for the answer. 
> 
> The "official" requirement is: "High Availability" without human
> intervention.
> But after finding out more details, some applications have the database
> automatic take-over disabled. Why? I do not know.
> 
> The problem you describe is pretty interesting, we are talking about that
> now.
> 
> I will collect more data about this and then come back to this issue.
> 
> Are daemontools meant to handle more than one GemStone installation?
> 
> After reading this link, it is not clear to me:
> https://github.com/glassdb/GemStone_daemontools_setup
> 
> Regards,
> Bruno