[Glass] How to manage large amounts of data ...

Thu Aug 14 11:09:55 PDT 2014

If you are using uuid as unique keys you should think a bit on this due to
storage and indexing costs/issues.  Background information here:
http://en.wikipedia.org/wiki/Universally_unique_identifier#Version_4_.28random.29

Although microsoft used version 1 they and others were pillaged by the
press as it exposed the MAC address, so generally people migrated to the
random or hash based versions. Generally here you have at least 122 bits to
work with so you play the math and say based on a probability of 2^122
there likely won't be a collision within any storage space I can think of.

However in retrospect we converted an app that stored about 10k worth of
elements from uuid key based to use a unsigned 32bit integer, via
arc4random(). Which gives a 2.3e-6 probability of collision, which we check
for at generation time anyway. This led to much better performance and as a
IOS app much improved storage costs as we just handle a 32bit integer, not
a very long string.

For your case of millions of records if you use a 64bit integer as the key
that might be worth while exploring.

I"m sure someone (on a trip somewhere?) could explore the costs of
inserting 10M uuids, versus 10M arc4random (or other generator) with or
without checking, into some type of dictionary, then retrieve a random
million entities...

On Thu, Aug 14, 2014 at 10:12 AM, Dale Henrichs <
dale.henrichs at gemtalksystems.com> wrote:

> Marten,
>
> I inadvertently sent you private email, correcting my previous post... I
> shouldn't hit send on an empty stomach while getting ready for a plane trip
> in a couple of hours:), but even that mail was not quite correct
>
> Obviously I missed the bit about having string keys for the UUIDs and I
> don't know what I was thinking about with the IdentitySet....
>
> The fastest String indexes (using "basic" classes) use the first 12
> characters of the string ... the first 12 characters are encoded as an
> integer, so objects are not faulted in while scanning the btrees during the
> #= test. If the Strings are longer than 12 characters the first 12 is used
> as a locator and then message sends (and object faulting) is used (beyond
> 12 characters performance falls off especially if the Strings are common in
> the first 12 characters)... The effective "collision bucket" size for a
> basic index is 500 so a String index with keys less than or equal to 12
> characters in length can outperform a StringKeyValueDictionary with
> equivalent-sized collision buckets on the basis of avoiding object faults
> ... No rebuilding is required with btrees (they are balanced on the fly)
> while the dictionary will require full rebuilds to manage the size of
> collision buckets. Beyond 12 character keys, the StringKeyValueDictionary
> takes the lead and wins running away:)
>
> If you can convert your key to a unique SmallInteger, you would compare
> the performance of a SmallInteger based index and an IdentityKeyValue
> dictionary ... you would not have a range limit like the String-based index
> and you would not have the faulting penalty in the IdentityKeyValue
> dictionary. so the difference will come down to collision bucket management
> and for the IdentityKeyValue case if you can predict your maximum
> SmallInteger range of values, you should be able to decide how big you want
> your collision buckets to be and I would think that the
> IdentityKeyValueDictionary would have the advantage.
>
> Dale
>
>
> On Thu, Aug 14, 2014 at 8:55 AM, Dale Henrichs <
> dale.henrichs at gemtalksystems.com> wrote:
>
>> Marten,
>>
>> I would say that Jame's suggestion is correct ... a dictionary is used
>> for identity-based indexes .
>>
>> You probably should use an IdentityKeyValueDictionary for quick access,
>> but you will want to keep an eye on the size of the collision buckets in
>> the dictionary ... when you hit the collision bucket limit on an at:put:
>> the dictionary is immediately rebuilt, which can cause an unreasonable
>> delay ... with a large dictionary you would probably want to rebuild the
>> dictionary during off hours ...
>>
>> IdentitySet would be an even better bet (no collision buckets combined
>> with quick identity based lookups) but to be able to use the UUID for
>> lookup you'd need to store the uuid in the identitySet ... but if you could
>> arrange for the uuid to reference the object directly an identity set would
>> work ...
>>
>> Dale
>>
>>
>> On Thu, Aug 14, 2014 at 7:01 AM, itlists at schrievkrom.de <
>> itlists at schrievkrom.de> wrote:
>>
>>> Assuming I have a large number of objects with a unique key attribute
>>> (uuid based).
>>>
>>> Normally I use an instance of class Dictionary (or perhaps better:
>>> StringKeyValueDictionary) to store instances of these objects. The
>>> initial finding access to these objects are mostly done via its unique
>>> key attribute.
>>>
>>> So the access to an instance was pretty fast:
>>>
>>>  ^aDictionary at: aKey ifAbsent: [ nil ]
>>>
>>> But what happens, if this dictionary is getting very large (say: a
>>> million of entries) ?
>>>
>>> Is is better to have a different approach then: e.g. a collection with a
>>> defined index on that unique attribute ?
>>>
>>>
>>> Thanks for answering ...
>>>
>>> Marten
>>>
>>>
>>> --
>>> Marten Feldtmann
>>> _______________________________________________
>>> Glass mailing list
>>> Glass at lists.gemtalksystems.com
>>> http://lists.gemtalksystems.com/mailman/listinfo/glass
>>>
>>
>>
>
> _______________________________________________
> Glass mailing list
> Glass at lists.gemtalksystems.com
> http://lists.gemtalksystems.com/mailman/listinfo/glass
>
>

-- 
===========================================================================
John M. McIntosh <johnmci at smalltalkconsulting.com>
https://www.linkedin.com/in/smalltalk
===========================================================================
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gemtalksystems.com/mailman/private/glass/attachments/20140814/df4ab833/attachment.html>