[m-users.] Confused by action of string.prefix_length

Volker Wysk post at volker-wysk.de
Wed Jun 8 17:51:14 AEST 2022


Hi

I don't find it obvious what a "code unit" is. Possibly I would confuse it
with "code point". Please replace "code unit" with "byte", unless there's
something to be said against it.
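
To make the distinction concrete, here is a small sketch in Python (not Mercury, purely for illustration). The helper name unit_counts is hypothetical and not part of any Mercury library. It measures the same string in code points, UTF-8 code units and UTF-16 code units. As far as I know, Mercury strings are UTF-8 when compiling to C but UTF-16 on the Java and C# backends, where a code unit is two bytes wide, so "byte" may not be an exact substitute. Please treat that as an assumption to check against the string module documentation.

```python
def unit_counts(s: str) -> tuple[int, int, int]:
    """Return (code points, UTF-8 code units, UTF-16 code units) for s."""
    code_points = len(s)                           # Python str is indexed by code point
    utf8_units = len(s.encode("utf-8"))            # one UTF-8 code unit is one byte
    utf16_units = len(s.encode("utf-16-le")) // 2  # one UTF-16 code unit is two bytes
    return code_points, utf8_units, utf16_units


# Three copies of U+02DC SMALL TILDE, the character from the original question.
tildes = "\u02dc\u02dc\u02dc"
print(unit_counts(tildes))
```

For the three small tildes this gives 3 code points, 6 UTF-8 code units and 3 UTF-16 code units, so the "6" seen earlier in the thread is specific to the UTF-8 representation.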

Regards,
Volker


On Wednesday, 08.06.2022 at 08:32 +0100, Sean Charles
(emacstheviking) wrote:
> OK, well, the explanation makes sense. I guess I had not fully comprehended what a code point truly is, not being that accustomed to the terminology myself.
> 
> At first (naive) reading I ASSUMED it would just return the -visible character count- i.e. the, dare I say it, intuitive value of 3, rather than the surgically precise value of 6!
> 
> I don’t think my code will be wrong, as I am using the other string predicates to slice and dice and, as Peter said, they too deal in code units, so everything should just work out as expected.
> 
> Thanks again, that was a most interesting moment, to have one's concrete expectation blow up in one's face like that!
> A true Zen-like learning moment!
> 
> Mercury continues to delight!
> 
> :)
> 
> Thanks again everybody,
> Sean.
> 
> 
> > On 8 Jun 2022, at 03:09, Peter Wang <novalazy at gmail.com> wrote:
> > 
> > On Wed, 08 Jun 2022 11:25:46 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> > > 2022-06-08 11:13 GMT+10:00 "Peter Wang" <novalazy at gmail.com>:
> > > > The tildes in your email are U+02DC SMALL TILDE (˜).
> > > > Each U+02DC takes two UTF-8 code units (i.e. bytes) to encode, and
> > > > string.prefix_length returns the length of the prefix it finds
> > > > in terms of code units.
> > > 
> > > I suppose the obvious next question is: *why* does it return the length
> > > in terms of code units, or rather, *only* in terms of code units?
> > > 
> > > The documentation of prefix_length is:
> > > 
> > > % The length (in code units) of the maximal prefix of String consisting
> > > % entirely of code points satisfying Pred.
> > > 
> > > which gives no justification for this choice.
> > 
> > The string module works in terms of code units for the most part.
> > Apart from that, I expected (then and now) that the most common use
> > for the result of prefix_length would be to skip past the prefix.
> > 
> > > Obviously, stepping past a given prefix is easier if you know its length
> > > in code units, but sometimes, you may want to know how many code
> > > points Pred has succeeded for. We could add a version of prefix_length
> > > (and suffix_length) that either computes the length just in code points,
> > > or in both code units and points. The latter would have to be a predicate,
> > > with a name such as prefix_lengths (note the plural). Since I rarely work
> > > with Unicode, I don't know which would be more useful. Opinions?
> > 
> > I think counting code points is rarely useful but if you wanted to add
> > something, a predicate that returns both would make sense
> > as the implementation would be keeping track of both anyway.
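
A rough sketch of what a predicate returning both lengths might compute, written in Python for illustration only. The name prefix_lengths comes from Zoltan's suggestion above. The signature, the argument order and the choice of UTF-8 code units are assumptions, not the Mercury library's API.

```python
from typing import Callable, Tuple


def prefix_lengths(pred: Callable[[str], bool], s: str) -> Tuple[int, int]:
    """Return (code units, code points) of the maximal prefix of s
    whose code points all satisfy pred. Code units are counted as
    UTF-8 bytes here, which is an assumption about the encoding."""
    units = 0
    points = 0
    for ch in s:                          # iterating a str yields code points
        if not pred(ch):
            break
        units += len(ch.encode("utf-8"))  # width of this code point in UTF-8
        points += 1
    return units, points


def is_small_tilde(ch: str) -> bool:
    return ch == "\u02dc"


text = "\u02dc\u02dc\u02dcabc"
units, points = prefix_lengths(is_small_tilde, text)

# Skipping past the prefix uses the code unit count on the encoded form.
rest = text.encode("utf-8")[units:].decode("utf-8")
print(units, points, rest)
```

A single pass tracks both counters, which matches the remark that the implementation would be keeping track of both anyway. The code unit count is the one needed to step past the prefix, and the code point count is the "intuitive" number.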
> > 
> > Peter
> > _______________________________________________
> > users mailing list
> > users at lists.mercurylang.org
> > https://lists.mercurylang.org/listinfo/users
> 