[Glass] Possible Bug: String>>#= treats nulls as a terminator

monty via Glass glass at lists.gemtalksystems.com
Tue Jan 30 21:50:14 PST 2018


You misunderstood the issue. If you choose one string representation (like an indexed collection of code points) but use another (like normalized EGSs) when doing basic comparisons, you get these inconsistencies that arguably violate the underlying indexable collection interface contract (like #= being true while #beginsWith: and #endsWith: are false).

Perl 6, which your article mentions, models strings as indexed, _pre-normalized_ collections of EGSs[0]:
	"Köln" eq "Ko\x308ln" && "Köln".chars == "Ko\x308ln".chars && "Köln".codes == "Ko\x308ln".codes && "Köln".starts-with("Ko\x308ln") && "Köln".ends-with("Ko\x308ln")

('chars' is the length in EGSs, while 'codes' is the length in code points.) The Java/C# approach is more basic, but it's still consistent, forcing you to manually normalize strings before comparing them by code unit, if you want a normalized comparison.
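
To make the "normalize manually, then compare" approach concrete, here is a minimal sketch in Python rather than Java or C# (my choice for illustration only). One caveat: Python strings are indexed by code point, not by UTF-16 code unit, so this only approximates the Java/C# model. The helper name nfc is made up for the example; unicodedata.normalize is the standard library call.

```python
import unicodedata

composed = "K\u00f6ln"     # 'ö' as the single precomposed code point U+00F6
decomposed = "Ko\u0308ln"  # 'o' followed by U+0308 COMBINING DIAERESIS

# Plain comparison works code point by code point and does no normalization,
# so equality, size, and prefix tests all give the same (consistent) answer.
plain_equal = composed == decomposed
plain_sizes = (len(composed), len(decomposed))
plain_prefix = composed.startswith(decomposed)


def nfc(text: str) -> str:
    """Hypothetical helper: return the NFC-normalized form of text."""
    return unicodedata.normalize("NFC", text)


# After explicit normalization the two strings are the same sequence of
# code points, so equality, size, and prefix tests agree again.
normalized_equal = nfc(composed) == nfc(decomposed)
normalized_sizes = (len(nfc(composed)), len(nfc(decomposed)))
normalized_prefix = nfc(composed).startswith(nfc(decomposed))
```

The point is only that both sets of answers are internally consistent: whichever representation you pick, #=-style equality, size, and prefix tests never disagree with each other.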

Anyway, I would recommend adding character (code point)-based comparison messages to String, and a #byteContents/#binaryContents message to GsFile, or even better, #ascii/#binary toggles like Pharo/Squeak have so you can set GsFile to #binary and use #next (instead of #nextByte) and #contents normally.
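
As a rough sketch of what I mean, again in Python and not GemStone code: code_points_equal and write_then_read are hypothetical names, and the file handling is only an analogy for a GsFile switched to a binary mode, not a description of any existing GsFile API.

```python
import os
import tempfile


def code_points_equal(one: str, two: str) -> bool:
    """Strict comparison: same size and the same code point at every index."""
    return len(one) == len(two) and all(a == b for a, b in zip(one, two))


# The strings from earlier in the thread: 'a', code point 25, 'b' versus 'ab'.
one = "a" + chr(25) + "b"
two = "ab"
strict_equal = code_points_equal(one, two)


def write_then_read(data: bytes) -> bytes:
    """Write data to a temporary file in binary mode and read it back as bytes."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "bin.dat")
        with open(path, "wb") as stream:
            stream.write(data)
        with open(path, "rb") as stream:
            return stream.read()


# Binary contents stay byte sequences end to end, so comparing them never
# involves character collation.
first = write_then_read(bytes([100, 25, 200]))
second = write_then_read(bytes([100, 200]))
binary_equal = first == second
```

With a strict comparison the two strings differ because their sizes differ, and the two files differ because their bytes differ, which is the behavior I would expect when storing binary data.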

0: https://github.com/MoarVM/MoarVM/blob/master/docs/strings.asciidoc#normalization

> Sent: Tuesday, January 30, 2018 at 3:48 AM
> From: "Tobias Pape" <Das.Linux at gmx.de>
> To: monty <monty2 at programmer.net>
> Cc: glass at lists.gemtalksystems.com
> Subject: Re: [Glass] Possible Bug: String>>#= treats nulls as a terminator
>
> Hi Monty,
> 
> 
> > On 30.01.2018, at 08:34, monty via Glass <glass at lists.gemtalksystems.com> wrote:
> > 
> > The real problem is String>>#=. It's bizarre that two SequenceableCollections can be #= yet have different #sizes and that for every shared index i, it's not necessarily true that "(one at: i) = (two at: i)":
> 
> This is, however, in line with Unicode …
> See this very on-point discussion of the matter:
> 
> https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
> 
> Best regards
> 	-Tobias
> 
> > | one two |
> > 
> > one := String with: $a with: 25 asCharacter with: $b.
> > two := one copyWithout: one second. 
> > one = two
> > 	and: [one asArray ~= two asArray
> > 		and: [
> > 			(1 to: (one size min: two size)) anySatisfy: [:i |
> > 				(one at: i) ~= (two at: i)]]].
> > 
> > Java and C# model strings as immutable indexed collections of UTF-16 16-bit code units (meaning surrogate pair-encoded code points require two units), and no normalization is done during comparisons. Instead there are special methods, like Normalize(), that convert a string into a chosen normalized form, and normalized comparisons can then be done on the converted strings. Ignoring the choice of UTF-16, this seems like a better, safer approach if you're still committed to treating strings as indexable character collections.
> > 
> > But I'm not sure how you can fix String or GsFile without breaking backwards compatibility.
> > 
> >> Sent: Monday, January 29, 2018 at 11:44 AM
> >> From: "Dale Henrichs via Glass" <glass at lists.gemtalksystems.com>
> >> To: glass at lists.gemtalksystems.com
> >> Subject: Re: [Glass] Possible Bug: String>>#= treats nulls as a terminator
> >> 
> >> 
> >> 
> >> On 01/29/2018 01:16 AM, monty via Glass wrote:
> >>> I was writing tests for stream converter classes that do encoding/decoding from various encodings. But any use of Strings to store binary data is a use case. ByteArray is more appropriate, but GsFile is still byte-character based by default, even when you open files in binary mode (which I assume just disables line ending normalization on Windows).
> >> This seems like a GemStone bug at the end of the day ... ByteArray and 
> >> Utf8 are the two classes that _should_ be used, but if GsFile is not 
> >> handling them well, then that is an issue for us ... I will check this 
> >> out ...
> >> 
> >> Thanks,
> >> 
> >> Dale
> >> 
> >>> 
> >>>> Sent: Saturday, January 27, 2018 at 12:18 PM
> >>>> From: "Dale Henrichs via Glass" <glass at lists.gemtalksystems.com>
> >>>> To: glass at lists.gemtalksystems.com
> >>>> Subject: Re: [Glass] Possible Bug: String>>#= treats nulls as a terminator
> >>>> 
> >>>> Monty,
> >>>> 
> >>>> Good points ... this "unexpected" behavior of Unicode strings with
> >>>> respect to control characters has been hard for us to grapple with
> >>>> internally as well, but this is unicode being unicode. I did notice that
> >>>> with the exception of code point 173, all of the code points you list
> >>>> are indeed control characters according to the Unicode character table[1].
> >>>> 
> >>>> Code point 173 is a "Soft Hyphen"[2] and doesn't really seem to fit the
> >>>> description of a control character, so I'm now curious if we might have
> >>>> a bug here, either in our implementation, the implementation of libICU
> >>>> or my understanding:)
> >>>> 
> >>>> I'm curious how you ran across this behavior? The control characters
> >>>> wouldn't seem to be a normal part of strings intended for display ...
> >>>> 
> >>>> I'm asking because if there is a use case for providing the old literal
> >>>> byte comparison operators we can make them available.
> >>>> 
> >>>> Dale
> >>>> 
> >>>> [1] https://unicode-table.com/en/#control-character
> >>>> [2] https://unicode-table.com/en/00AD/
> >>>> 
> >>>> On 01/27/2018 01:57 AM, monty via Glass wrote:
> >>>>> My example and thread title were wrong. It skips null *and* various control chars entirely when comparing:
> >>>>> (0 to: 255) select: [:each |
> >>>>> 	(String with: $a with: $b) =
> >>>>> 		(String with: $a with: each asCharacter with: $b)]
> >>>>> 
> >>>>> which yields:
> >>>>> anArray( 0, 1, 2, 3, 4, 5, 6, 7, 8, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 127, 128, 129, 130, 131, 132, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 173)
> >>>>> 
> >>>>> The GS Prog Guide (p. 77) says the ICU lib handles string comparisons internally, and it seems to ignore these characters for the sake of normalization.
> >>>>> 
> >>>>> But that means it's possible for two Strings to be #= while having different #sizes and indexable characters, and that comparisons between Strings containing binary data aren't reliable, and that other String methods aren't consistent with #=:
> >>>>> | one two |
> >>>>> one := String with: $a with: 0 asCharacter with: $b.
> >>>>> two := String with: $a with: $b.
> >>>>> one = two
> >>>>> 	and: [(one at: 1 equals: two) not
> >>>>> 		and: [(two at: 1 equals: one) not]]
> >>>>> 
> >>>>> And since GsFile #next and #contents are character based:
> >>>>> (GsFile open: 'bin.one' mode: 'wb' onClient: false)
> >>>>> 	nextPutAll: #[100 25 200];
> >>>>> 	close.
> >>>>> (GsFile open: 'bin.two' mode: 'wb' onClient: false)
> >>>>> 	nextPutAll: #[100 200];
> >>>>> 	close.
> >>>>> (GsFile open: 'bin.one' mode: 'rb' onClient: false) contents =
> >>>>> 	(GsFile open: 'bin.two' mode: 'rb' onClient: false) contents.
> >>>>> 
> >>>>> Consider this more as a "heads-up" for users than a bug report, since this is apparently the intended, documented behavior.
> >>>>> 
> >>>>>> Sent: Friday, January 26, 2018 at 2:20 AM
> >>>>>> From: "monty via Glass" <glass at lists.gemtalksystems.com>
> >>>>>> To: glass at lists.gemtalksystems.com
> >>>>>> Subject: [Glass] Possible Bug: String>>#= treats nulls as a terminator
> >>>>>> 
> >>>>>> Is this correct?
> >>>>>> 
> >>>>>> (String with: 12 asCharacter with: 0 asCharacter) =
> >>>>>>      (String with: 12 asCharacter with: 0 asCharacter with: 32 asCharacter)
> >>>>>> 
> >>>>>> Other string methods, like #copyAfter:, don't treat null the same way.
> >>>>>> _______________________________________________
> >>>>>> Glass mailing list
> >>>>>> Glass at lists.gemtalksystems.com
> >>>>>> http://lists.gemtalksystems.com/mailman/listinfo/glass
> 
>

