[Glass] Possible Bug: String>>#= treats nulls as a terminator

monty via Glass glass at lists.gemtalksystems.com
Tue Jan 30 00:00:35 PST 2018


Another one:
| one two |
 
 one := 'Köln'.
 two :=  String with: $K with: $o with: 16r308 asCharacter with: $l with: $n.
 one = two
 	and: [one size ~= two size
		and: [(one endsWith: two) not
			and: [(one beginsWith: two) not
				and: [(two endsWith: one) not
					and: [(two beginsWith: one) not]]]]].
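
For comparison, here is a minimal Java sketch of the explicit-normalization approach described in the quoted message below. It assumes the literal 'Köln' above holds the precomposed character U+00F6, while the second string holds $o followed by the combining diaeresis U+0308. The class name NormalizedCompare and the helper equalsNfc are hypothetical names chosen for illustration; java.text.Normalizer is the standard JDK class.

```java
import java.text.Normalizer;

public class NormalizedCompare {
    // Hypothetical helper: converts both strings to NFC, then compares
    // them code unit by code unit with the ordinary equals().
    static boolean equalsNfc(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String one = "K\u00F6ln";   // assumed precomposed o-with-diaeresis (U+00F6)
        String two = "Ko\u0308ln";  // 'o' followed by combining diaeresis (U+0308)

        // Plain equals() does no normalization, so equality stays
        // consistent with length and with indexed access.
        System.out.println(one.equals(two));
        System.out.println(one.length() + " " + two.length());

        // Canonical equivalence is only reported when asked for explicitly.
        System.out.println(equalsNfc(one, two));
    }
}
```

With this split, the plain comparison never reports two strings of different sizes as equal, and the canonically equivalent spellings only compare equal once both have been normalized on request.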

> Sent: Tuesday, January 30, 2018 at 2:34 AM
> From: "monty via Glass" <glass at lists.gemtalksystems.com>
> To: glass at lists.gemtalksystems.com
> Subject: Re: [Glass] Possible Bug: String>>#= treats nulls as a terminator
>
> The real problem is String>>#=. It's bizarre that two SequenceableCollections can be #= yet have different #sizes, and that "(one at: i) = (two at: i)" is not guaranteed to hold for every shared index i:
> | one two |
> 
> one := String with: $a with: 25 asCharacter with: $b.
> two := one copyWithout: one second. 
> one = two
> 	and: [one asArray ~= two asArray
> 		and: [
> 			(1 to: (one size min: two size)) anySatisfy: [:i |
> 				(one at: i) ~= (two at: i)]]].
> 
> Java and C# model strings as immutable indexed collections of 16-bit UTF-16 code units (so a code point encoded as a surrogate pair takes two units), and no normalization is done during comparisons. Instead there are special methods, like Normalize() in C# or java.text.Normalizer.normalize() in Java, that convert a string into a chosen normalization form, and normalized comparisons can then be done on the converted strings. Ignoring the choice of UTF-16, this seems like a better, safer approach if you're still committed to treating strings as indexable character collections.
> 
> But I'm not sure how you can fix String or GsFile without breaking backwards compatibility.
> 
> > Sent: Monday, January 29, 2018 at 11:44 AM
> > From: "Dale Henrichs via Glass" <glass at lists.gemtalksystems.com>
> > To: glass at lists.gemtalksystems.com
> > Subject: Re: [Glass] Possible Bug: String>>#= treats nulls as a terminator
> >
> > 
> > 
> > On 01/29/2018 01:16 AM, monty via Glass wrote:
> > > I was writing tests for stream converter classes that do encoding/decoding from various encodings. But any use of Strings to store binary data is a use case. ByteArray is more appropriate, but GsFile is still byte-character based by default, even when you open files in binary mode (which I assume just disables line ending normalization on Windows).
> > This seems like a GemStone bug at the end of the day ... ByteArray and 
> > Utf8 are the two classes that _should_ be used, but if GsFile is not 
> > handling them well, then that is an issue for us ... I will check this 
> > out ...
> > 
> > Thanks,
> > 
> > Dale
> > 
> > >
> > >> Sent: Saturday, January 27, 2018 at 12:18 PM
> > >> From: "Dale Henrichs via Glass" <glass at lists.gemtalksystems.com>
> > >> To: glass at lists.gemtalksystems.com
> > >> Subject: Re: [Glass] Possible Bug: String>>#= treats nulls as a terminator
> > >>
> > >> Monty,
> > >>
> > >> Good points ... this "unexpected" behavior of Unicode strings with
> > >> respect to control characters has been hard for us to grapple with
> > >> internally as well, but this is Unicode being Unicode. I did notice that
> > >> with the exception of code point 173, all of the code points you list
> > >> are indeed control characters according to the Unicode character table[1].
> > >>
> > >> Code point 173 is a "Soft Hyphen"[2] and doesn't really seem to fit the
> > >> description of a control character, so I'm now curious if we might have
> > >> a bug here, either in our implementation, the implementation of libICU
> > >> or my understanding:)
> > >>
> > >> I'm curious how you ran across this behavior? The control characters
> > >> wouldn't seem to be a normal part of strings intended for display ...
> > >>
> > >> I'm asking because if there is a use case for providing the old literal
> > >> byte comparison operators we can make them available.
> > >>
> > >> Dale
> > >>
> > >> [1] https://unicode-table.com/en/#control-character
> > >> [2] https://unicode-table.com/en/00AD/
> > >>
> > >> On 01/27/2018 01:57 AM, monty via Glass wrote:
> > >>> My example and thread title were wrong. It skips null *and* various control chars entirely when comparing:
> > >>> (0 to: 255) select: [:each |
> > >>> 	(String with: $a with: $b) =
> > >>> 		(String with: $a with: each asCharacter with: $b)]
> > >>>
> > >>> which yields:
> > >>> anArray( 0, 1, 2, 3, 4, 5, 6, 7, 8, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 127, 128, 129, 130, 131, 132, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 173)
> > >>>
> > >>> The GS Prog Guide (p. 77) says the ICU lib handles string comparisons internally, and it seems to ignore these characters for the sake of normalization.
> > >>>
> > >>> But that means it's possible for two Strings to be #= while having different #sizes and indexable characters, and that comparisons between Strings containing binary data aren't reliable, and that other String methods aren't consistent with #=:
> > >>> | one two |
> > >>> one := String with: $a with: 0 asCharacter with: $b.
> > >>> two := String with: $a with: $b.
> > >>> one = two
> > >>> 	and: [(one at: 1 equals: two) not
> > >>> 		and: [(two at: 1 equals: one) not]]
> > >>>
> > >>> And since GsFile #next and #contents are character based:
> > >>> (GsFile open: 'bin.one' mode: 'wb' onClient: false)
> > >>> 	nextPutAll: #[100 25 200];
> > >>> 	close.
> > >>> (GsFile open: 'bin.two' mode: 'wb' onClient: false)
> > >>> 	nextPutAll: #[100 200];
> > >>> 	close.
> > >>> (GsFile open: 'bin.one' mode: 'rb' onClient: false) contents =
> > >>> 	(GsFile open: 'bin.two' mode: 'rb' onClient: false) contents.
> > >>>
> > >>> Consider this more as a "heads-up" for users than a bug report, since this is apparently the intended, documented behavior.
> > >>>
> > >>>> Sent: Friday, January 26, 2018 at 2:20 AM
> > >>>> From: "monty via Glass" <glass at lists.gemtalksystems.com>
> > >>>> To: glass at lists.gemtalksystems.com
> > >>>> Subject: [Glass] Possible Bug: String>>#= treats nulls as a terminator
> > >>>>
> > >>>> Is this correct?
> > >>>>
> > >>>> (String with: 12 asCharacter with: 0 asCharacter) =
> > >>>>       (String with: 12 asCharacter with: 0 asCharacter with: 32 asCharacter)
> > >>>>
> > >>>> Other string methods, like #copyAfter:, don't treat null the same way.
> > >>>> _______________________________________________
> > >>>> Glass mailing list
> > >>>> Glass at lists.gemtalksystems.com
> > >>>> http://lists.gemtalksystems.com/mailman/listinfo/glass
> > >>>>

