[m-rev.] for review: speed up uint32 to string conversion

Zoltan Somogyi zoltan.somogyi at runbox.com
Wed Oct 27 18:28:47 AEDT 2021



On Wed, 27 Oct 2021 16:10:18 +1100 (AEDT), Julien Fischer <jfischer at opturion.com> wrote:
> If anyone would care to try the benchmark on other system, I would be curious
> to know the result.

I cannot help with that: my machine cannot reliably measure small timing differences.

> +:- func alt1_uint32_to_string(uint32::in) = (string::uo) is det.
> +:- pragma foreign_proc("C",
> +    alt1_uint32_to_string(U::in) = (S::uo),
> +    [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail],
> +"
> +    int num_digits;
> +    if (U < 10) {
> +        num_digits = 1;
> +    } else if (U < 100) {
> +        num_digits = 2;
> +    } else if (U < 1000) {
> +        num_digits = 3;
> +    } else if (U < 10000) {
> +        num_digits = 4;
> +    } else if (U < 100000) {
> +        num_digits = 5;
> +    } else if (U < 1000000) {
> +        num_digits = 6;
> +    } else if (U < 10000000) {
> +        num_digits = 7;
> +    } else if (U < 100000000) {
> +        num_digits = 8;
> +    } else if (U < 1000000000) {
> +        num_digits = 9;
> +    } else {
> +        num_digits = 10;
> +    }
> +
> +    MR_allocate_aligned_string_msg(S, num_digits, MR_ALLOC_ID);
> +    S[num_digits] = '\\0';
> +    int i = num_digits - 1;
> +    do {
> +        S[i] = \"0123456789\"[U % 10];
> +        i--;
> +    } while ( U /= 10 );
> +").

Please fix the line with "while" in three separate ways:

Move the division by 10 out of the condition.

Make the comparison with zero explicit.

To conform to our usual style in C, avoid the space after '(' and before ')'.

> +%---------------------------------------------------------------------------%
> +
> +% Same as alt1 except it uses Andrei Alexandrescu's digit counting method.
> +% See: <https://www.facebook.com/notes/10158791579037200/>
> +% This is biased in favour of small numbers, it doesn't really help for
> +% this benchmark.
> +%
> +:- func alt2_uint32_to_string(uint32::in) = (string::uo) is det.
> +:- pragma foreign_proc("C",
> +    alt2_uint32_to_string(U::in) = (S::uo),
> +    [will_not_call_mercury, promise_pure, thread_safe, will_not_modify_trail],
> +"
> +    uint32_t v = U;
> +    int num_digits = 1;
> +    for (;;) {
> +        if (v < 10) { break; }
> +        if (v < 100) { num_digits += 1; break; }
> +        if (v < 1000) { num_digits +=2; break; }
> +        if (v < 10000) { num_digits += 3; break; }
> +        v /= UINT32_C(10000);
> +        num_digits += 4;
> +    }
> +
> +    if (U < 10) {
> +        num_digits = 1;
> +    } else if (U < 100) {
> +        num_digits = 2;
> +    } else if (U < 1000) {
> +        num_digits = 3;
> +    } else if (U < 10000) {
> +        num_digits = 4;
> +    } else if (U < 100000) {
> +        num_digits = 5;
> +    } else if (U < 1000000) {
> +        num_digits = 6;
> +    } else if (U < 10000000) {
> +        num_digits = 7;
> +    } else if (U < 100000000) {
> +        num_digits = 8;
> +    } else if (U < 1000000000) {
> +        num_digits = 9;
> +    } else {
> +        num_digits = 10;
> +    }

Why is the second block of code there? The first has already
computed num_digits; why compute it again? This voids the results
of your benchmarking wrt alt2. In fact, I wouldn't be surprised
if the C compiler *deleted* the first block here, which means
that instead of measuring alt1 vs alt2, you measured alt1 twice :-(

> +% XXX TODO: try table driven approaches.

Alternatives 1 and 2 *are* table driven approaches,
though the table has only ten entries :-( But I presume
you mean approaches in which you index U into a table,
with or without having reduced its range. I don't think
doing this to find num_digits is likely to be faster than
the existing alternatives; quite the opposite. CPUs
can speculate about branch outcomes, and compilers
can arrange for both branches to be followed with a
late choice between them, but these techniques
are not applicable to value lookups in tables.

----------------------------------------------

I think there are two respects in which your alternatives
are different:

- how they compute num_digits; and
- how they convert each digit from an int in [0-9] to a char.

Call these a and b. Lets say a1 is the sequence of "if (U < 10^N)"
tests, and a2 is the small for loop at the top of alt2. And lets say
b1 is the index into "0123456789", and b2 is '0' + (U % 10).
Then your alt1 is a1b1, alt2 is a2b1 and alt3 is a1b2. That means
there should be an alternative with a2b2 as well.

I would also propose a3, which would be not a *sequence* of
if-then-elses, but a *tree* of if-then-else, that performs a
binary search on num_digits. There could be several versions
of this: one whose top level test is for num_digits being
above 5, one for below 4, one for below 3 etc. The latter
should perform better on workloads dominated by small
numbers.

We can also extend a1 by indexing not into a ten-element
table of chars, but into a 100-element table of char pairs,
1000-element table of char triples, etc. I don't think these
are likely to have any advantage over existing alternatives,
but if you want to be sure, you could try them.

Zoltan.




More information about the reviews mailing list