[m-users.] Confused by action of string.prefix_length

Peter Wang novalazy at gmail.com
Wed Jun 8 12:09:51 AEST 2022


On Wed, 08 Jun 2022 11:25:46 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> 
> 2022-06-08 11:13 GMT+10:00 "Peter Wang" <novalazy at gmail.com>:
> > The tildes in your email are U+02DC SMALL TILDE (˜).
> > Each U+02DC takes two UTF-8 code units (i.e. bytes) to encode, and
> > string.prefix_length returns the length of the prefix it finds
> > in terms of code units.
> 
> I suppose the obvious next question is: *why* does it return the length
> in terms of code units, or rather, *only* in terms of code units?
> 
> The documentation of prefix_length is:
> 
> % The length (in code units) of the maximal prefix of String consisting
> % entirely of code points satisfying Pred.
> 
> which gives no justification for this choice.

The string module works in terms of code units for the most part.
Apart from that, I expected (then and now) that the most common use
for the result of prefix_length would be to skip past the prefix.
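
As a rough illustration of why a code-unit length is convenient for
skipping a prefix, here is a small sketch in Python rather than Mercury.
It assumes UTF-8 encoding, as in the example from this thread. The
helper name prefix_length_code_units is hypothetical and only mimics
the documented behaviour; it is not the Mercury library's implementation.

```python
def prefix_length_code_units(pred, s):
    """Length, in UTF-8 code units (bytes), of the maximal prefix of s
    whose code points all satisfy pred. Hypothetical helper."""
    units = 0
    for ch in s:
        if not pred(ch):
            break
        # Each code point contributes as many code units as its UTF-8 encoding has bytes.
        units += len(ch.encode("utf-8"))
    return units


# Three U+02DC SMALL TILDE characters followed by ASCII text.
text = "\u02dc\u02dc\u02dcabc"
encoded = text.encode("utf-8")

n = prefix_length_code_units(lambda c: c == "\u02dc", text)

# The code-unit length is directly usable as an offset into the encoded string.
rest = encoded[n:].decode("utf-8")
```

Here n counts two code units per tilde, and the same number serves as
the offset at which the remainder of the string starts.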

> Obviously, stepping past a given prefix is easier if you know its length
> in code units, but sometimes, you may want to know how many code
> points Pred has succeeded for. We could add a version of prefix_length
> (and suffix_length) that either computes the length just in code points,
> or in both code units and points. The latter would have to be a predicate,
> with a name such as prefix_lengths (note the plural). Since I rarely work
> with Unicode, I don't know which would be more useful. Opinions?

I think counting code points is rarely useful, but if you wanted to add
something, a predicate that returns both would make sense,
as the implementation would be keeping track of both anyway.
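
A minimal sketch of the "returns both" idea, again in Python and
assuming UTF-8. The name prefix_lengths is borrowed from the suggestion
quoted above; the signature and return shape are assumptions made for
illustration, not an existing Mercury predicate.

```python
def prefix_lengths(pred, s):
    """Return (code_units, code_points) for the maximal prefix of s whose
    code points all satisfy pred. Hypothetical; assumes UTF-8 code units."""
    code_units = 0
    code_points = 0
    for ch in s:
        if not pred(ch):
            break
        # One loop step per code point, so both counts are updated together.
        code_units += len(ch.encode("utf-8"))
        code_points += 1
    return code_units, code_points


units, points = prefix_lengths(lambda c: c == "\u02dc", "\u02dc\u02dcxy")
```

The loop already visits one code point per step while advancing by its
encoded width, so the second count costs almost nothing extra.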

Peter

