[m-users.] Confused by action of string.prefix_length
Volker Wysk
post at volker-wysk.de
Wed Jun 8 17:51:14 AEST 2022
Hi
I don't find it obvious what a "code unit" is. Possibly I would confuse it
with "code point". Please replace "code unit" with "byte", unless there's
something to be said against it.
Regards,
Volker
On Wednesday, 08.06.2022 at 08:32 +0100, Sean Charles
(emacstheviking) wrote:
> OK, well the explanation makes sense, I guess in my head I had not fully comprehended what a code point truly is, not being that accustomed to that terminology myself.
>
> At first (naive) reading I ASSUMED it would just return the -visible character count- i.e. the, dare I say it, intuitive value of 3, rather than the surgically precise value of 6!
>
> I don’t think my code will be wrong as I am using the other string predicates to slice and dice and as Peter said, they too deal in code units so everything should just work out as expected.
>
> Thanks again, that was a most interesting moment, to have one's concrete expectation blowing up in one's face like that!
> A true Zen-like learning moment!
>
> Mercury continues to delight!
>
> :)
>
> Thanks again everybody,
> Sean.
>
>
> > On 8 Jun 2022, at 03:09, Peter Wang <novalazy at gmail.com> wrote:
> >
> > On Wed, 08 Jun 2022 11:25:46 +1000 "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> > > 2022-06-08 11:13 GMT+10:00 "Peter Wang" <novalazy at gmail.com>:
> > > > The tildes in your email are U+02DC SMALL TILDE (˜).
> > > > Each U+02DC takes two UTF-8 code units (i.e. bytes) to encode, and
> > > > string.prefix_length returns the length of the prefix it finds
> > > > in terms of code units.
> > >
> > > I suppose the obvious next question is: *why* does it return the length
> > > in terms of code units, or rather, *only* in terms of code units?
> > >
> > > The documentation of prefix_length is:
> > >
> > > % The length (in code units) of the maximal prefix of String consisting
> > > % entirely of code points satisfying Pred.
> > >
> > > which gives no justification for this choice.
> >
> > The string module works in terms of code units for the most part.
> > Apart from that, I expected (then and now) that the most common use
> > for the result of prefix_length would be to skip past the prefix.
> >
> > > Obviously, stepping past a given prefix is easier if you know its length
> > > in code units, but sometimes, you may want to know how many code
> > > points Pred has succeeded for. We could add a version of prefix_length
> > > (and suffix_length) that either computes the length just in code points,
> > > or in both code units and points. The latter would have to be a predicate,
> > > with a name such as prefix_lengths (note the plural). Since I rarely work
> > > with Unicode, I don't know which would be more useful. Opinions?
> >
> > I think counting code points is rarely useful but if you wanted to add
> > something, a predicate that returns both would make sense
> > as the implementation would be keeping track of both anyway.
> >
> > Peter
> > _______________________________________________
> > users mailing list
> > users at lists.mercurylang.org
> > https://lists.mercurylang.org/listinfo/users
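
For anyone reading the archive later, the difference between the two counts
discussed above can be sketched outside Mercury. The following Python is only
an illustration, not the Mercury library's implementation: the name
prefix_lengths is borrowed from Zoltan's suggestion, and the encoding
parameter is an assumption added here to show that a code unit is a byte only
under UTF-8. (As far as I know, Mercury strings are UTF-8 in the C grades and
UTF-16 in the Java and C# grades, which is one argument against replacing
"code unit" with "byte" in the documentation.)

```python
# Size in bytes of one code unit for each encoding handled by this sketch.
UNIT_SIZE = {"utf-8": 1, "utf-16-le": 2}


def prefix_lengths(s, pred, encoding="utf-8"):
    """Return (code_units, code_points) for the longest prefix of s
    whose code points all satisfy pred.

    Hypothetical helper for illustration; not a Mercury library predicate.
    """
    units = 0
    points = 0
    for ch in s:  # Python iterates over code points
        if not pred(ch):
            break
        # Number of code units this one code point needs in the encoding.
        units += len(ch.encode(encoding)) // UNIT_SIZE[encoding]
        points += 1
    return units, points


tilde = "\u02dc"  # U+02DC SMALL TILDE, two bytes in UTF-8
text = tilde * 3 + "abc"


def is_tilde(c):
    return c == tilde


print(prefix_lengths(text, is_tilde))               # (6, 3)
print(prefix_lengths(text, is_tilde, "utf-16-le"))  # (3, 3)

# Skipping past the prefix needs the code unit count, not the code point count.
units, _ = prefix_lengths(text, is_tilde)
print(text.encode("utf-8")[units:].decode("utf-8"))  # abc
```

The first result matches the 6 versus 3 in Sean's example. The last line shows
why the code unit count is the convenient one for stepping past the prefix, as
Peter describes: it is directly an offset into the encoded string.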