[m-dev.] a problem with trie string switches and string well-formedness

Zoltan Somogyi zoltan.somogyi at runbox.com
Fri Mar 22 22:10:34 AEDT 2024


I am working on implementing trie switches for the LLDS backend.
I am also updating the implementation of trie switches for the MLDS
backend, because

- I want to the implementation of the same switch method
  in the two backends to be as similar as possible, but

- I don't want to implement something in the LLDS backend
  if I know that it should then updated in both backends.

In the process of doing this, I wanted to generate better code
for "sticks", which are parts of the trie where we have a sequence
of trie nodes that all have just one descendant, but the bottom node
in the stick has two or more matching strings. Instead of generating
code such as

 str[5] == 'e' && str[6] == 'f' && str[7] == 'g'

I wanted to generate code such as

  strncmp(str + 5, "abcdefghij" + 5, 3) == 0

We don't yet do this. However, in the special case where
a trie node matches just one string, we have long generated
code such as

  strcmp(str + 5, "abcdefgh" + 5) == 0

While working on this, I realized that this has a semantic problem.
The problem is that

- each trie level matches one code unit,
- the code unit sequence that a trie node corresponds to
  may not be a valid string, because it may stop in the *middle*
  of the code unit sequence of a code point, and that therefore
- the code unit sequence we want to match with strcmp
  may also start in the middle of a code point, meaning that
  it is not a valid string.

At the moment, there is no risk of any compiled Mercury program
going wrong, because

- we currently implement trie string switches only for the MLDS backend,
  and only when targeting C, and

- the code in ml_string_switch that generates the binop that
  eventually gets turned into the call to strcmp checks that the
  string it constructs as the operand to compare against
  is well formed. (The other operand comes from a string-valued
  Mercury variable, for whose well-formedness the rest of the system
  is responsible.
 
The symptom of the problem is therefore is not a crash when
the user runs the program mmc generates, but an abort when
invoking mmc. I have never come across such aborts, but then
pretty much all my work is with the ASCII subset of UTF-8.

The operation I am in the process of adding to the compiler,
a size-limited variant of offset_str_eq, gets a double dose of this
problem, since it gets it at the end of the string being matched
as well as at the start.

Despite this semantic problem, I don't think there would be
any *operational* problem if we simply did an end-run around
the semantic problem, by recognizing that what we need are not
operations that compare parts of strings, but operations that compare
parts of code unit sequences. This could and maybe sould include
invoking not strncmp  but memcmp for my new use case. However,
for the old use case, we still want the comparison to stop at the
first null char, which no memcmp variant does.

As far as I know, strcmp/strncmp do not care about well-formedness
of their inputs, and operate on them as if they were code unit sequences
anyway, with caring about code points delegated to other functions such as strcoll.
However, I don't know as much about i18n as you guys.

My opinion is that the best way forward is this:

- stop insisting that the operand we give to strcmp now, and strncmp soon,
  are well formed. This requires a version of from_code_unit_list_in_encoding
  that does not check well-formedness.

- renaming the offset_str_eq operation to something that reflects the fact that
  it operates on code unit sequences, i.e. on strings that may not be well formed.
  offset_nwf_str_eq? offset_cus_eq? Maybe something else?

- I don't know whether you'd want to add a new builtin type, a variant of string
  that does not insist on well-formedness, as the operand type of the update op(s),
  but to me, that would be overkill.

Opinions?

Zoltan.


More information about the developers mailing list