[Glass] case insensitive search broken for Unicode7

Dale Henrichs dale.henrichs at gemtalksystems.com
Fri Mar 28 10:55:54 PDT 2014


Pieter,

The engineer responsible for the ICU implementation has been on vacation
all week, so I haven't had a chance to discuss the
#_findString:startingAt:ignoreCase: issue with him ...

With that said we have been "discovering" things about mixed
Unicode[7|16|32] (where a collator is always used) and
[DoubleByte|QuadByte]String classes and the basic conclusion that we've
come to is that:

   "it does not make sense to attempt to perform mixed comparisons
    between Unicode* and *String instances"

For 3.2 the default mechanism ("Legacy comparison mode") will be that it
is an error to attempt to compare a Unicode* instance with a *String
instance, unless you explicitly specify a collator to be used for the
comparison. We will also support "Unicode comparison mode", in which all
comparisons for both Unicode* instances and *String instances will use the
default collator.
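
To make the two modes concrete, here is a rough sketch in Python rather
than GemStone Smalltalk. Everything in it is hypothetical: the names
(less_than, MixedComparisonError, default_collator) are made up for
illustration, the casefold-based key is only a crude stand-in for an ICU
collator, and plain code point ordering for *String pairs in legacy mode is
an assumption, not a description of the 3.2 implementation.

```python
LEGACY = "legacy"    # hypothetical name for "Legacy comparison mode"
UNICODE = "unicode"  # hypothetical name for "Unicode comparison mode"


class MixedComparisonError(TypeError):
    """Hypothetical error for a Unicode*/*String comparison without a collator."""


def default_collator(text):
    # Crude stand-in for an ICU collator: order case-insensitively,
    # then break ties by code point. Real collators are locale-aware.
    return (text.casefold(), text)


def less_than(left, right, mode, collator=None):
    """Compare two (kind, text) pairs, where kind is "unicode" or "string"."""
    left_kind, left_text = left
    right_kind, right_text = right
    if collator is not None:
        # An explicit collator is allowed in either mode.
        return collator(left_text) < collator(right_text)
    if mode == UNICODE:
        # Unicode comparison mode: everything goes through the default collator.
        return default_collator(left_text) < default_collator(right_text)
    if left_kind != right_kind:
        # Legacy comparison mode: mixed comparisons are an error.
        raise MixedComparisonError("mixed comparison needs an explicit collator")
    if left_kind == "unicode":
        # Unicode* instances always use a collator.
        return default_collator(left_text) < default_collator(right_text)
    # Assumption: *String pairs keep plain code point ordering in legacy mode.
    return left_text < right_text
```

In this sketch ("string", "Z") sorts before ("string", "a") in legacy mode
but after it in Unicode mode, which is the kind of sort-order change a
switch of modes can expose.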

I am planning to make "Unicode comparison mode" the default for GLASS.

There is an implication for upgrading a repository from 3.1 to 3.2, from
the perspective that if you have "sorted" collections with mixed instances,
it is possible that the sort order will change in "Unicode comparison mode"
and you will start getting errors in "Legacy comparison mode" ...

I think that if you do have mixed "sorted" collections today, the sort
order is probably not deterministic.  In 3.1 a comparison with a *String
receiver and a Unicode* argument may give a different answer if you reverse
the comparison and have a Unicode* receiver and a *String argument ...
both of which are common occurrences in sorted collections :)
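
A small Python sketch of why a receiver-dependent comparison breaks sorting.
The two classes are hypothetical stand-ins, not the GemStone classes: one
compares by code point, the other through a casefold-based key that only
approximates a collator. Both behaviors are assumptions chosen to show the
asymmetry.

```python
def default_collator(text):
    # Crude stand-in for an ICU collator (assumption): case-insensitive first.
    return (text.casefold(), text)


class ByteString:
    """Hypothetical stand-in for a *String: compares by code point."""

    def __init__(self, text):
        self.text = text

    def __lt__(self, other):
        return self.text < other.text


class UnicodeString:
    """Hypothetical stand-in for a Unicode*: compares via the collator."""

    def __init__(self, text):
        self.text = text

    def __lt__(self, other):
        return default_collator(self.text) < default_collator(other.text)


upper_z = ByteString("Z")
lower_a = UnicodeString("a")

# Each receiver claims to sort before the other, so no consistent order exists.
print(upper_z < lower_a)  # True: "Z" (U+005A) precedes "a" (U+0061) by code point
print(lower_a < upper_z)  # True: "a" precedes "Z" under the collation key
```

Once both answers can be true, the result of a sort depends on which
element happens to be the receiver at each step.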

If it isn't already obvious, 3.1 was our first attempt at introducing
Unicode into the system and being a largely US-ASCII shop, we did not
recognize the perils of mixing encoding schemes until we started trying to
support Unicode indexing ...

I think the "comparison modes" are the right answer for allowing our
existing customers' applications to continue to function correctly within
the *String universe while providing a pathway towards a purely
Unicode-based system.

Even with the 3.2 release, we will leave a couple of loose ends to be
cleaned up in 3.2.1, mainly in the area of fileout ... we'd like to
standardize on UTF8 filein/fileout (while still supporting
"gemstone-encoding" filein/fileout for existing customers), but we're
deferring that work until after 3.2 ...

So for 3.2 I would expect that in "Legacy comparison mode" it will be an
error to do a mixed #_findString:startingAt:ignoreCase:. For a Unicode*
#_find... (or in "Unicode comparison mode") the default collator will be
used...

Dale

On Thu, Mar 27, 2014 at 5:19 AM, Pieter Nagel <pieter at nagel.co.za> wrote:

> The currently buggy UnicodeX >> #_findString:startingAt:ignoreCase:
> delegates to the ICU collator only in the case where ignoreCase is true.
> Is this correct?


> The reasoning seems to be that a case-sensitive match in Unicode can be
> done by just comparing the byte values of the two strings for identity,
> as the super implementation presumably does. But since some letters can
> be decomposed into multiple codepoints in canonical and non-canonical
> ways[1], that's not true. And I suppose surrogate escapes factor in here
> as well.
>
> To be honest, I'm not familiar enough with ICU to know whether it
> (optionally?) takes character decomposition into account on comparison,
> but I would guess that issues like these are precisely why an
> industrial-strength Unicode handling library was needed in the first
> place.
>
> What does this look like in 3.2? Does your testing include comparisons
> where the two strings have characters decomposed in different ways?
>
> [1] I.e. U+00E9 LATIN SMALL LETTER E WITH ACUTE can also be decomposed as
> U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT
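
The footnote's example can be checked with a short Python sketch. It uses
Python's standard unicodedata module, not ICU or GemStone, and
find_normalized is a hypothetical helper. It only shows why comparing raw
code points misses canonically equivalent strings; it says nothing about
what the ICU collator or #_findString:startingAt:ignoreCase: actually does.

```python
import unicodedata

composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# Code point comparison treats the two spellings as different strings.
print(composed == decomposed)  # False

# After canonical normalization they are identical.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True


def find_normalized(haystack, needle, start=0):
    """Hypothetical helper: substring search on NFC-normalized text.

    The returned index refers to the normalized haystack, not the original.
    """
    normalized_haystack = unicodedata.normalize("NFC", haystack)
    normalized_needle = unicodedata.normalize("NFC", needle)
    return normalized_haystack.find(normalized_needle, start)


print("caf\u00e9 au lait".find("cafe\u0301"))              # -1
print(find_normalized("caf\u00e9 au lait", "cafe\u0301"))  # 0
```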