[Glass] Experiences migrating a Gemstone/S database

Fri Dec 20 03:56:40 PST 2019

Migration und Updating Databases

I would like to post some experiences I had while migrating a
Gemstone database. 

The customer database - my tests are based on - has a size
of about 420GB. The database has been copied to our reference
system - an old Thinkpad W520 (I7 based) with 16GB of RAM 
and ONE SSD and the tests were done on this machine. The stone 
is working with 8GM shared cache page.

Between v70 and v71 of our product there were several 
changes to the domain model we were developing. The model 
is defined by 197 domain classes. 

In v71 39 of these classes have been changed and theses changes
are the reason for 119.000.000 objects to be migrated. One class
had 66.000.000 instances, another one 49.000.000 instances and
the other classes have around 4.000.000 instances.

*** The origial way ***

The old traditional way had been written in the early state
of this product, where databases were not that big and migration
speed was not that critical.

It worked more or less the following way (shame on me):

a) Scan the repository for ONE (!) changed class

b) For each instance do a migration and on demand (no memory)
   make a commit.

This was ok in the past. I could start update process on saturday
and finish the update on sunday remote.

Now the database became too large and this way of updating the
database would take from Thursday, 11:00 to Monday afternoon (so 
more or less 4 days !)

*** Repository Scanning ***

The next evolution in this topic had been done:

a) now ONE repository scan (FOR ALL changed classes) is done - using 
fastAllInstances and GsBitmap instances.

b) For each instance do a migration and on demand make a commit.

With this step the multiple scanning of the repository has been removed
and the largest time is now the base migration code execution. But
for 119 millions objects this still takes much time. I did not make a full 
test but an initial test over some hours suggested, that this would take
around 2 days.

*** Indices ***

More than satisfied with the benefits of ONE scan, I had to look to the migration
code. The base migration code was generated by our code generator and I did
not want to change that (because it is general and would cover all model versions),
but actually knowing the specific model I want to migrate from, would cut the
to be executed code to 1/4 of the originial code. So here would be possibilities
for enhancements.

So, what about starting multiple processes, do the step (b) in parallel ? I stored 
the GsBitmap in page order on the disc and that file became around 600 MByte of data.

And wrote processes to do the migration in parallel based on that GsBitmap file 
... and it did not work. 
Over and over commit conflicts. No way to go ... speed was pretty bad. 

Actually only one process was running more or less without problems - the other 
processes sometime did a little work, but most of the time they did an abort transaction.

So, somehow these conflicts were based on. As a first step I decided to remove
ALL indices used in the database. I had luck, that this application had an execution
path to find all used indices to remove them, to build them etc. 

That script to remove all indices were started before migration (and it took at 
least 1-2 hours).

Then I started the parallel migration code and now the stuff was working. The I7 
had 8 execution threads and I started 8 of these processes and they work without 
problems. The topaz script were started with "-t 500000" and that fit very well to
the machine above. 100% usage of the available RAM und minimal swapping.

The code itself had a sliding transaction size (from 1 to max. of 20000 objects between 
each commit. This limit is adapted according to conflicts/successes) - but the logs showed, 
that the processes are working with the upper value of 20000 each commit.

So to summarize:

a) Scanning the objects with fastAllInstances in ONE scan (1-2 hours)
b) Removing the indices (1-2) hours
c) Starting the migration code in 8 tasks (8 hours)
d) Scanning the objects with fastAllInstances in ONE scan (1-2 hours) - to reassure
e) Clean the history
f) Building the indices (3) hours

So, now I am at 17 hours and that is ok. I think, that (b) and (f) could also be 
done in parallel execution mode.

*** Workload ***

So removing indices in concurrent tasks leads to very strange exception errors, so 
I gave that up. 

Creating indices in concurrent tasks work - so the 3h above can be reduced to 40 
minutes and the overall time is now 15 hours.

*** Equal Workload up to the end ***

The next point shown up in this work was, that the work of creating indices vary 
very much and so some task have much more to do than others ... and the
parallel workidea is not done up to the end. (creating indices task: 37 minutes (longest) 
against 11 minutes (fastest)). So rearranging this work could still improve the
time needed to create the indices.

Marten