[Glass] case insensitive search broken for Unicode7

Pieter Nagel pieter at nagel.co.za
Thu Mar 27 05:19:39 PDT 2014


The currently buggy UnicodeX >> #_findString:startingAt:ignoreCase: delegates
to the ICU collator only in the case where ignoreCase is true. Is this
correct?

The reasoning seems to be that a case-sensitive match in Unicode can be done
by just comparing the byte values of the two strings for identity, as the
super implementation presumably does. But since some letters can be
decomposed into multiple codepoints in canonical and non-canonical ways[1],
that's not true. And I suppose surrogate escapes factor in here as well.
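
To make the decomposition point concrete, here is a rough sketch in Python
(not Smalltalk, and not how ICU or GemStone implement anything). The helper
name canonical_equal is made up for illustration, and NFD normalization from
the standard unicodedata module stands in for whatever the collator does:

```python
import unicodedata

# Two spellings of the same visible text.
composed = "caf\u00e9"      # ends in U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # ends in U+0065 + U+0301 COMBINING ACUTE ACCENT


def canonical_equal(a, b):
    """Hypothetical helper: compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# A codepoint-by-codepoint (or byte-by-byte) comparison treats them as different.
naive_equal = composed == decomposed
bytes_equal = composed.encode("utf-8") == decomposed.encode("utf-8")

# Comparing after normalization treats them as the same text.
normalized_equal = canonical_equal(composed, decomposed)

print(naive_equal, bytes_equal, normalized_equal)
```

This only covers whole-string equality under canonical equivalence. A
substring search has further questions to answer (for instance, whether a
bare "e" should match inside a decomposed "e" + accent), which the sketch
does not attempt.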

To be honest, I'm not familiar enough with ICU to know whether it
(optionally?) takes character decomposition into account on comparison, but
I would guess that issues like these are precisely why an industrial-strength
Unicode handling library was needed in the first place.

What does this look like in 3.2? Does your testing include comparisons where
the two strings have characters decomposed in different ways?

[1] E.g. U+00E9 LATIN SMALL LETTER E WITH ACUTE can also be decomposed as
U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT


