[Glass] [3.1.0.4] Tranlog Restore Seems to Hang

Fri Aug 23 09:21:03 PDT 2013

Ken, 

In earlier versions we've seen hangs, but they were always accompanied by "Recovery waiting for gcGems to reclaim dead" in the stone log, so this may be a different variant of that problem... 

First off, it would be easier if we could get copies of your extents and tranlogs so that we could debug the hang here ... 

If that's not feasible, then we'd like to gather stats, stacks and logs: 

1. start the stone with -N -R 
2. fire up statmon 
3. do your restore from current logs 
4. after you see the hang, use $GEMSTONE/bin/pstack to get stacks 
from the stoned process (take several samples, with a delay between 
each one) 
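If it helps, step 4 could be scripted roughly like this. It is a minimal sketch, assuming that `$GEMSTONE/bin/pstack` takes a single PID argument and that `pgrep -x stoned` finds exactly one stone process on the box. The `sample_stacks` function and its `PSTACK` override are hypothetical helpers for illustration, not part of GemStone.

```sh
# Hypothetical helper (not part of GemStone): take several stack samples of
# one process, with a delay between samples, appending them to a single file.
# Assumes $GEMSTONE/bin/pstack accepts a PID as its only argument.
sample_stacks() {
  pid="$1"                      # PID of the stoned process
  count="${2:-5}"               # number of samples to take
  delay="${3:-10}"              # seconds to wait between samples
  out="${4:-stone_stacks.txt}"  # file that collects all samples
  # PSTACK lets you substitute another command (for example, when testing).
  pstack_cmd="${PSTACK:-$GEMSTONE/bin/pstack}"

  : > "$out"                    # start with an empty output file
  i=1
  while [ "$i" -le "$count" ]; do
    echo "=== sample $i at $(date) ===" >> "$out"
    "$pstack_cmd" "$pid" >> "$out" 2>&1
    i=$((i + 1))
    # Sleep only between samples, not after the last one.
    if [ "$i" -le "$count" ]; then
      sleep "$delay"
    fi
  done
}

# Example: five samples, thirty seconds apart.
#   sample_stacks "$(pgrep -x stoned)" 5 30 stone_stacks.txt
```

Keeping every sample in one file, each with a timestamped header, makes it easy to send along with the stats and stone.log, and makes it easy to see whether the stone's stacks change between samples.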

Then wrap up the stats, stacks and stone.log and send them to me. If you can share extents and tranlogs, let me know and I'll give you the ftp details... 

Dale 

----- Original Message -----

| From: "Ken Treis" <ken at miriamtech.com>
| To: glass at lists.gemtalksystems.com
| Sent: Thursday, August 22, 2013 5:30:12 PM
| Subject: [Glass] [3.1.0.4] Tranlog Restore Seems to Hang

| We had a cloud server crash (problem on the underlying hardware), and
| after things were fixed I find myself unable to restore certain
| transaction logs. It goes something like this:

| 1. startstone -N -R seaside.
| 2. topaz -l

| | successful login
|
| | topaz 1> run
| | SystemRepository restoreStatus
| | %
| | Restoring from transaction log files, restored to 20/08/13 18:26:21
| | UTC, file 926 record 1596051, nextFileId 926, oldest fileId 926
|
| | topaz 1> run
| | SystemRepository restoreFromCurrentLogs
| | %
| 
| 3. In the stone log, I see:

| --- 08/22/13 23:56:27 UTC ---
| Opened a transaction log file for read_nolocks.
| filename =
| /opt/gemstone/GemStone64Bit3.1.0.4-x86_64.Linux//seaside/data/tranlog926.dbf
| Restoring from current log directory to end of logs

| and then nothing else happens. After an hour, I get impatient. Load
| average is low, and the only process using any appreciable CPU is
| pgsvrmain. So I kill pgsvrmain and let the stone crash.

| 4. I repeat 1 and 2, and restoreStatus shows me that no progress has
| been made. I'm still at the same record (1596051).

| I've seen this happen in 3 different scenarios:

| (A) After restoring a Smalltalk full backup.
| (B) After restoring a filesystem backup.

| (C) During normal startup after the crash, when there are just a
| couple of records to replay since the pre-crash checkpoint. I didn't
| confirm that the record ID wasn't changing at this point, but the
| behavior was identical to the other two.

| All three hang at different record IDs, so I can't blame a single bad
| record. Also, in (A) both page and object audits came back OK. I
| haven't done audits on the other scenarios.

| Is this a known issue in 3.1.0.4? Anything else I should be
| checking/trying?

| --
| Ken Treis
| Miriam Technologies, Inc.
| (866) 652-2040 x221

| _______________________________________________
| Glass mailing list
| Glass at lists.gemtalksystems.com
| http://lists.gemtalksystems.com/mailman/listinfo/glass