[m-users.] Confused by action of string.prefix_length

Zoltan Somogyi zoltan.somogyi at runbox.com
Wed Jun 8 11:25:46 AEST 2022


2022-06-08 11:13 GMT+10:00 "Peter Wang" <novalazy at gmail.com>:
> The tildes in your email are U+02DC SMALL TILDE (˜).
> Each U+02DC takes two UTF-8 code units (i.e. bytes) to encode, and
> string.prefix_length returns the length of the prefix it finds
> in terms of code units.

I suppose the obvious next question is: *why* does it return the length
in terms of code units, or rather, *only* in terms of code units?

The documentation of prefix_length is:

% The length (in code units) of the maximal prefix of String consisting
% entirely of code points satisfying Pred.

which gives no justification for this choice.

Obviously, stepping past a given prefix is easier if you know its length
in code units, but sometimes, you may want to know how many code
points Pred has succeeded for. We could add a version of prefix_length
(and suffix_length) that either computes the length just in code points,
or in both code units and code points. The latter would have to be a predicate,
with a name such as prefix_lengths (note the plural). Since I rarely work
with Unicode, I don't know which would be more useful. Opinions?
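
To make the distinction concrete, here is a minimal sketch of what such a
two-result version could compute. It is written in Python rather than Mercury,
purely as an illustration, and it is not how the Mercury library implements
anything. The name prefix_lengths is the hypothetical one suggested above, and
code units are assumed to be UTF-8 bytes, as in Peter's explanation.

```python
from typing import Callable, Tuple


def prefix_lengths(pred: Callable[[str], bool], s: str) -> Tuple[int, int]:
    """Hypothetical helper: return (code_units, code_points) for the
    maximal prefix of s whose code points all satisfy pred.
    Code units are counted as UTF-8 bytes (an assumption of this sketch)."""
    code_units = 0
    code_points = 0
    for ch in s:  # Python iterates over a str one code point at a time
        if not pred(ch):
            break
        code_units += len(ch.encode("utf-8"))
        code_points += 1
    return code_units, code_points


small_tilde = "\u02dc"  # U+02DC SMALL TILDE, two bytes in UTF-8


def is_small_tilde(ch: str) -> bool:
    return ch == small_tilde


# Three small tildes: 6 code units, but only 3 code points.
print(prefix_lengths(is_small_tilde, small_tilde * 3 + "abc"))  # (6, 3)
# ASCII tildes are a different code point, so the prefix is empty.
print(prefix_lengths(is_small_tilde, "~~~abc"))  # (0, 0)
```

For pure ASCII input the two counts coincide, which is presumably why the
difference is easy to overlook.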

Zoltan.

