[Glass] case insensitive search broken for Unicode7

Mariano Martinez Peck marianopeck at gmail.com
Wed Apr 16 13:24:58 PDT 2014

On Fri, Mar 28, 2014 at 2:55 PM, Dale Henrichs <
dale.henrichs at gemtalksystems.com> wrote:

> Pieter,
> The engineer responsible for the ICU implementation has been on vacation
> all week, so I haven't had a chance to discuss the
> #_findString:startingAt:ignoreCase: issue with him ...
> With that said we have been "discovering" things about mixed
> Unicode[7|16|32] (where a collator is always used) and
> [DoubleByte|QuadByte]String classes and the basic conclusion that we've
> come to is that:
>    "it does not make sense to attempt to perform mixed comparisons
>     between Unicode* and *String instances"

While I understand that conclusion, I think this is a different issue from
the one I originally reported, isn't it?

('Newmont' asUnicodeString _findString: 'newm' asUnicodeString startingAt:
1 ignoreCase: true) > 0

answers false, and in this case I am not mixing anything: both are
Unicode7. So my point is that this is broken even without mixing.
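
For readers less familiar with the Smalltalk convention, here is a minimal
Python sketch of the semantics the expression above is expected to have:
1-based indexes, with 0 meaning "not found". The helper name
find_ignore_case is hypothetical, and this is not GemStone's implementation,
only an illustration of the expected result.

```python
def find_ignore_case(haystack: str, needle: str, start: int = 1) -> int:
    """Return the 1-based index of needle in haystack, or 0 if absent.

    Hypothetical helper mimicking the Smalltalk convention (1-based
    indexes, 0 for "not found"). It is not GemStone code.
    """
    # casefold() is Python's locale-independent case folding.
    position = haystack.casefold().find(needle.casefold(), start - 1)
    # str.find() returns -1 when the needle is absent, so adding 1 yields 0.
    return position + 1


print(find_ignore_case("Newmont", "newm") > 0)  # prints True
```

Note that case folding can change a string's length for some non-ASCII
characters, so the returned index is only reliable for text like the ASCII
example here.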

Also, note that users may be comparing or mixing Unicode* and *String
instances WITHOUT knowing it. In my case, for example, I somehow get a
Unicode7 from a combo list in a Magritte form (in Pharo, of course, I get a
String), and then I search over that result. So if you were to choose
"Legacy comparison mode" for GLASS, then we should at least avoid using
Unicode classes in Seaside/Magritte. Otherwise it would be a pain to
maintain a working system for both Pharo and GemStone. So this is just to
ask: do we agree that "Legacy comparison mode" would be better for GLASS?


> Dale
> On Thu, Mar 27, 2014 at 5:19 AM, Pieter Nagel <pieter at nagel.co.za> wrote:
>> The currently buggy UnicodeX >> #_findString:startingAt:ignoreCase:
>> delegates
>> to the ICU collator only in the case where ignoreCase is true. Is this
>> correct?
>> The reasoning seems to be that a case-sensitive match in Unicode can be
>> done
>> by just comparing the byte values of the two strings for identity, as the
>> super
>> implementation presumably does. But since some letters can be decomposed
>> into
>> multiple codepoints in canonical and non-canonical ways[1], that's not
>> true. And
>> I suppose surrogate escapes factor in here too.
>> To be honest, I'm not familiar enough with ICU to know whether it
>> (optionally?)
>> takes character decomposition into account on comparison, but I would
>> guess that
>> issues like these are precisely why an industrial-strength Unicode
>> handling
>> library was needed in the first place.
>> What does this look like in 3.2? Does your testing include comparisons
>> where the
>> two strings have characters decomposed in different ways?
>> [1] I.e. U+00E9 LATIN SMALL LETTER E WITH ACUTE can also be decomposed as
>> _______________________________________________
>> Glass mailing list
>> Glass at lists.gemtalksystems.com
>> http://lists.gemtalksystems.com/mailman/listinfo/glass
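
Pieter's point in the quoted message about decomposed characters can be
illustrated outside GemStone. The Python sketch below uses the standard
unicodedata module and says nothing about how ICU or GemStone behave. The
sample strings and the helper name canonically_equal are made up for
illustration. The decomposed form used is the standard canonical
decomposition of U+00E9, i.e. "e" followed by U+0301 COMBINING ACUTE ACCENT.

```python
import unicodedata

precomposed = "caf\u00e9"   # U+00E9 as a single code point
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A plain code point comparison treats the two spellings as different.
print(precomposed == decomposed)                    # prints False
# After canonical normalization they compare equal.
print(canonically_equal(precomposed, decomposed))   # prints True
```

This is why a case-sensitive match that only compares raw code values can
miss strings that a reader would consider identical.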
