[m-dev.] a problem with trie string switches and string well-formedness

Sat Mar 23 13:21:05 AEDT 2024

On Fri, 22 Mar 2024 22:10:34 +1100 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> I am working on implementing trie switches for the LLDS backend.
> I am also updating the implementation of trie switches for the MLDS
> backend, because
> 
> - I want to the implementation of the same switch method
>   in the two backends to be as similar as possible, but
> 
> - I don't want to implement something in the LLDS backend
>   if I know that it should then updated in both backends.
> 
> In the process of doing this, I wanted to generate better code
> for "sticks", which are parts of the trie where we have a sequence
> of trie nodes that all have just one descendant, but the bottom node
> in the stick has two or more matching strings. Instead of generating
> code such as
> 
>  str[5] == 'e' && str[6] == 'f' && str[7] == 'g'
> 
> I wanted to generate code such as
> 
>   strncmp(str + 5, "abcdefghij" + 5, 3) == 0
> 
> We don't yet do this. However, in the special case where
> a trie node matches just one string, we have long generated
> code such as
> 
>   strcmp(str + 5, "abcdefgh" + 5) == 0
> 
> While working on this, I realized that this has a semantic problem.
> The problem is that
> 
> - each trie level matches one code unit,
> - the code unit sequence that a trie node corresponds to
>   may not be a valid string, because it may stop in the *middle*
>   of the code unit sequence of a code point, and that therefore
> - the code unit sequence we want to match with strcmp
>   may also start in the middle of a code point, meaning that
>   it is not a valid string.
> 

That's fine.

> At the moment, there is no risk of any compiled Mercury program
> going wrong, because
> 
> - we currently implement trie string switches only for the MLDS backend,
>   and only when targeting C, and
> 
> - the code in ml_string_switch that generates the binop that
>   eventually gets turned into the call to strcmp checks that the
>   string it constructs as the operand to compare against
>   is well formed. (The other operand comes from a string-valued
>   Mercury variable, for whose well-formedness the rest of the system
>   is responsible.
>  
> The symptom of the problem is therefore is not a crash when
> the user runs the program mmc generates, but an abort when
> invoking mmc. I have never come across such aborts, but then
> pretty much all my work is with the ASCII subset of UTF-8.
> 

Yes, if the compiler uses Mercury strings to represent potentially
non-well-formed sequences then we have to be a little bit careful
(in case other parts of the compiler expect only well formed strings),
or else use a different type, such as list(int), to represent such
sequences.

> The operation I am in the process of adding to the compiler,
> a size-limited variant of offset_str_eq, gets a double dose of this
> problem, since it gets it at the end of the string being matched
> as well as at the start.
> 
> Despite this semantic problem, I don't think there would be
> any *operational* problem if we simply did an end-run around
> the semantic problem, by recognizing that what we need are not
> operations that compare parts of strings, but operations that compare
> parts of code unit sequences. This could and maybe sould include
> invoking not strncmp  but memcmp for my new use case. However,
> for the old use case, we still want the comparison to stop at the
> first null char, which no memcmp variant does.
> 
> As far as I know, strcmp/strncmp do not care about well-formedness
> of their inputs, and operate on them as if they were code unit sequences
> anyway, with caring about code points delegated to other functions such as strcoll.

That is right.

> However, I don't know as much about i18n as you guys.
> 
> My opinion is that the best way forward is this:
> 
> - stop insisting that the operand we give to strcmp now, and strncmp soon,
>   are well formed. This requires a version of from_code_unit_list_in_encoding
>   that does not check well-formedness.

Where do we insist that of strcmp/strncmp?

> 
> - renaming the offset_str_eq operation to something that reflects the fact that
>   it operates on code unit sequences, i.e. on strings that may not be well formed.
>   offset_nwf_str_eq? offset_cus_eq? Maybe something else?

The other _str operations compare arbitrary code units sequences
already, so it just needs to be documented as such. I don't see any need
to rename the operations.

Peter