[m-rev.] for post-commit review: fix mantis bug #367

Zoltan Somogyi zoltan.somogyi at runbox.com
Fri Dec 5 17:43:38 AEDT 2014



On Fri, 5 Dec 2014 16:11:09 +1100, Peter Wang <novalazy at gmail.com> wrote:

> On Wed, 03 Dec 2014 18:20:15 +1100 (EST), "Zoltan Somogyi" <zoltan.somogyi at runbox.com> wrote:
> > Fix Mantis bug #367.
> 
> Please avoid using only a bug number for commit summary.

It is the git limit on the commit description. It is hard enough to describe
the change itself in 50 characters; including the bug number makes it
almost impossible. So I usually pick one or the other, not both.
The meat of the diff explains the diff anyway, and is not squeezed.

> > +    O = Cmp(MidX, X),
> > +    (
> > +        O = (>),
> > +        binary_search_loop(Cmp, A, X, Lo, Mid - 1, I)
> > +    ;
> > +        O = (=),
> > +        I = Mid         % XXX repetitions
> 
> Did you mean to leave these in?

No, I didn't. Thanks for catching that, I will fix it.

It was meant as a reminder that in the presence of repetitions,
the I we return isn't guaranteed to be either the first or the last
occurrence of the value being searched for. I thought we may
want to provide some such guarantee, but then thought better
of it, since (a) in many applications, repeats are not possible,
and (b) even if repeats are possible, some use cases will want
the first, some will want the last, and some won't care. If users want
a specific match, they can do a linear search after the binary search;
that will be the simplest and best solution in almost all cases.
The only case where array.m can do a better job is in cases
where there are not just repeats in the array, but there are
LOTS of repeats in the array. In such cases, knowing the
immediately previous Lo and Hi can allow this code
to do a BINARY search for the first or last occurrence
of the searched-for value. However, I concluded that this use case
is probably very rare, and that it is not worth complicating
and slowing down the operation of the standard binary search
predicate for.
 
> > -approx_binary_search_2(Cmp, A, X, Lo, Hi, I) :-
> > +approx_binary_search_loop(Cmp, A, X, Lo, Hi, I) :-
> > +    % loop invariant: if X is anywhere in A[0] .. A[array.size(A)-1],
> > +    % then it is in A[Lo] .. A[Hi].
> >      Lo =< Hi,
> >      Mid = (Lo + Hi) / 2,
> > -    O = Cmp(A ^ elem(Mid), X),
> > +    array.unsafe_lookup(A, Mid, MidX),
> > +    O = Cmp(MidX, X),
> >      (
> >          O = (>),
> > -        approx_binary_search_2(Cmp, A, X, Lo, Mid - 1, I)
> > +        approx_binary_search_loop(Cmp, A, X, Lo, Mid - 1, I)
> >      ;
> >          O = (=),
> > -        I = Mid
> > +        I = Mid         % XXX repetitions
> >      ;
> >          O = (<),
> > -        ( if ( Mid < Hi, X @< A ^ elem(Mid + 1) ; Mid = Hi ) then
> > +        ( if
> > +            (
> > +                Mid < Hi,
> > +                % Mid + 1 cannot exceed Hi, so the array access is safe.
> > +                array.unsafe_lookup(A, Mid + 1, MidP1X),
> > +                (<) = Cmp(X, MidP1X)
> > +            ;
> > +                Mid = Hi
> > +            )
> > +        then
> >              I = Mid
> > -          else
> > -            approx_binary_search_2(Cmp, A, X, Mid + 1, Hi, I)
> > +        else
> > +            approx_binary_search_loop(Cmp, A, X, Mid + 1, Hi, I)
> >          )
> >      ).
> 
> The unsafe_lookup goals could potentially be ordered before the bounds
> checks.  Do you want to want to fix it?

I don't see that there is anything to fix.

The loop invariant guarantees that the first unsafe_lookup is safe:
Lo and Hi are both always in [0, size-1].

Mid is always in [Lo, Hi]. If Mid < Hi, then obviously Mid+1 =< Hi,
so the second unsafe_lookup is also safe.

Both these facts are documented. Is anything wrong with my
reasoning, or with its documentation?

Or are you worried about the compiler moving Mid < Hi to AFTER
the lookup? I am pretty sure the compiler is not allowed to do that,
but I guess we could turn the disjunction into an if-then-else,
whose semantics guarantees that condition (Mid < Hi here)
is executed before the then-part (including the second lookup).
I don't see any way that reordering could affect the first first lookup.
If you are worried about integer overflow when calculating Mid,
the standard fix for that is to compute it as Lo + (Hi - Lo) / 2.
However, since all our ints are signed, there is no need for that;
even if Lo + Hi doesn't fit into a signed int, it will fit into an unsigned
int, and the /2 will get it back into the signed int range.

Zoltan.





More information about the reviews mailing list