[m-dev.] a problem with trie string switches and string well-formedness

Sat Mar 23 13:31:37 AEDT 2024

On 2024-03-23 13:21 +11:00 AEDT, "Peter Wang" <novalazy at gmail.com> wrote:
> Yes, if the compiler uses Mercury strings to represent potentially
> non-well-formed sequences then we have to be a little bit careful
> (in case other parts of the compiler expect only well formed strings),
> or else use a different type, such as list(int), to represent such
> sequences.

The places where we want to put these not-necessarily-well-formed strings,
the arguments of the offset_str_eq operator, are processed by *very* few
parts of the compiler. I think documenting this property of those arguments,
and looking over those compiler components, should be enough.

>> However, I don't know as much about i18n as you guys.
>> 
>> My opinion is that the best way forward is this:
>> 
>> - stop insisting that the operands we give to strcmp now, and strncmp soon,
>>   are well formed. This requires a version of from_code_unit_list_in_encoding
>>   that does not check well-formedness.
> 
> Where do we insist that of strcmp/strncmp?

We don't insist on it in general. We insist on it in this specific
use case, with the code starting on line 503 in ml_string_switch.m.

I will add a version of from_code_unit_list_in_encoding that does not
insist on well-formedness, and switch to using that in that code.
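
As a rough illustration of the difference between the two versions, here
is a sketch in Python rather than Mercury. The function names are
hypothetical and are not the Mercury library's API, and the sketch assumes
UTF-8 as the encoding. It also assumes the problematic inputs are fragments
that end in the middle of a multi-byte code point.

```python
from typing import List, Optional


def from_code_units_checked(units: List[int]) -> Optional[bytes]:
    """Hypothetical checking version: fail (return None) when the
    code units do not form well-formed UTF-8."""
    data = bytes(units)
    try:
        data.decode("utf-8")  # strict decoding rejects ill-formed input
    except UnicodeDecodeError:
        return None
    return data


def from_code_units_unchecked(units: List[int]) -> bytes:
    """Hypothetical non-checking version: accept any code unit sequence,
    well formed or not."""
    return bytes(units)


# "cé" in UTF-8 is 0x63 0xC3 0xA9; dropping the last code unit leaves
# a fragment that stops halfway through the two-byte code point.
whole = [0x63, 0xC3, 0xA9]
fragment = [0x63, 0xC3]
```

With these definitions the checking version accepts `whole` but rejects
`fragment`, while the non-checking version accepts both.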

>> - rename the offset_str_eq operation to something that reflects the fact that
>>   it operates on code unit sequences, i.e. on strings that may not be well formed.
>>   offset_nwf_str_eq? offset_cus_eq? Maybe something else?
> 
> The other _str operations compare arbitrary code unit sequences
> already, so it just needs to be documented as such. I don't see any need
> to rename the operations.

Ok, will do. Thanks for the info.
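
For the record, what "compares arbitrary code unit sequences" means for an
offset comparison can be sketched as follows. This is Python, not the
compiler's code. The function name, argument order and optional size
argument are assumptions made for illustration, not the actual definition
of offset_str_eq.

```python
from typing import Optional


def offset_code_units_eq(a: bytes, b: bytes, offset: int,
                         size: Optional[int] = None) -> bool:
    """Compare a and b code unit by code unit, starting at `offset`.

    With `size` given, compare at most that many code units (in the
    spirit of strncmp); otherwise compare to the end (as with strcmp).
    Nothing here requires either operand to be well-formed UTF-8.
    """
    if size is None:
        return a[offset:] == b[offset:]
    return a[offset:offset + size] == b[offset:offset + size]


# Two UTF-8 strings that differ only in their third code unit.
s1 = "café".encode("utf-8")  # 63 61 66 C3 A9
s2 = "caté".encode("utf-8")  # 63 61 74 C3 A9
```

Note that an offset of 4 starts the comparison in the middle of the
two-byte encoding of "é", so the sequences being compared are not
well-formed strings, yet the comparison itself is still meaningful.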

Zoltan.