[m-rev.] RNGs

Julien Fischer jfischer at opturion.com
Mon Aug 19 13:48:36 AEST 2019


Hi Mark,

On Mon, 19 Aug 2019, Mark Brown wrote:

> On Mon, Aug 19, 2019 at 1:45 AM Julien Fischer <jfischer at opturion.com> wrote:
>>
>>
>> Hi Mark,
>>
>> On Tue, 13 Aug 2019, Mark Brown wrote:
>>
>>> I've just made a pull request to add some RNGs to the standard library, as
>>> well as a typeclass interface to them. Pull request is here:
>>>
>>> https://github.com/Mercury-Language/mercury/pull/70
>>>
>>> The diff is also attached, for anyone who would prefer to comment on
>>> it via email.
>>
>> Thanks for looking into this, it's part of the stdlib that has long needed
>> fixing.  Here's a partial review that mainly addresses aspects of the interface
>> design.  I haven't reviewed the actual RNG implementations as yet.
>
> Thanks. You can leave reviewing the RNG implementations until after
> we've decided which ones to keep.

We can keep all of them; those not in the standard library can live
in extras or on github somewhere.

>>> commit f102ddaf4c128137b27ac6e85ac881faaf435c9b
>>> Author: Mark Brown <mark at mercurylang.org>
>>> Date:   Tue Aug 13 16:42:38 2019 +1000
>>>
>>>     Some RNGs for the standard library, plus typeclass interfaces.
>>>
>>>     library/rng.m:
>>>         Top-level module containing typeclasses and generic routines,
>>>         as well as overall documentation.
>>
>> The top-level module for random number generator facilities should really be
>> random.  I realise that's taken, however I propose that the new stuff be added
>> to it and the existing contents of that file deprecated.  (They ought to be
>> able to co-exist in the interim until the old stuff is removed.)
>
> Ok, but see below.
>
>>
>>>     library/rng.binfile.m:
>>>         A generator that reads from a file.
>>>
>>>     library/rng.marsaglia.m:
>>>         Fast, simple generator. Diehard results for default seed:
>>>             PASSED: 109
>>>             WEAK:   2
>>>             FAILED: 3
>>>
>>>     library/rng.tausworthe.m:
>>>         Combined Tausworthe generators. Diehard results for default seed:
>>>
>>>         generator T3
>>>             PASSED: 113
>>>             WEAK:   1
>>>             FAILED: 0
>>>
>>>         generator T4
>>>             PASSED: 112
>>>             WEAK:   2
>>>             FAILED: 0
>>
>> There's a bit of question as to what RNGs should be included in the
>> stdlib by default.
>
> Yes, I've been evaluating more of these, and I'll be happy to leave
> out any that we don't in the end have a use case to justify. At the
> moment, now that I've compared them to sfc, I'm thinking that both
> marsaglia and tausworthe could be removed:
>
> - The justification for marsaglia is that it is very fast. Sfc16 is
> also fast, but marsaglia is about 7 or 8 times faster, so it would
> seem to be worthwhile including it. However, the absolute times are so
> small that in order to notice a difference in overall time a lot of
> numbers will need to be generated, and by this time biases in
> marsaglia will have had a chance to show up. So whether you are
> generating a small or large set of numbers, it would seem that sfc16
> is at least as good as marsaglia.
>
> - The combined Tausworthe generators were meant to be something
> familiar and of better quality than marsaglia, but sfc has turned out
> to be faster and of much higher quality than tausworthe.
>
> If anyone thinks we should keep the above, please say so.
>
> I still have a few more generators I want to look at. We don't have
> any with arbitrary jumps, or that can accept added entropy during
> operation, but those features may be desirable.

As I mentioned above, extra RNGs can live in extras.  The standard
library itself only needs to provide a few generators: one that is fast
and one that is "good" (in the statistically random sense), assuming
that there is not one that is both.

> Sfc is not cryptographically secure, although I'm not sure we want to
> offer such a thing.

I don't think providing cryptographic services is the business of the
standard library.

>>> diff --git a/library/rng.binfile.m b/library/rng.binfile.m
>>> new file mode 100644
>>> index 0000000..36c1966
>>> --- /dev/null
>>> +++ b/library/rng.binfile.m
>>> @@ -0,0 +1,96 @@
>>> +%---------------------------------------------------------------------------%
>>> +% vim: ft=mercury ts=4 sts=4 sw=4 et
>>> +%---------------------------------------------------------------------------%
>>> +% Copyright (C) 2019 The Mercury team.
>>> +% This file is distributed under the terms specified in COPYING.LIB.
>>> +%---------------------------------------------------------------------------%
>>> +%
>>> +% File: rng.binfile.m
>>> +% Main author: Mark Brown
>>> +%
>>> +% "Random" number generator that reads numbers from a binary file.
>>> +%
>>> +%---------------------------------------------------------------------------%
>>> +%---------------------------------------------------------------------------%
>>> +
>>> +:- module rng.binfile.
>>> +:- interface.
>>> +
>>> +:- import_module io.
>>> +
>>> +%---------------------------------------------------------------------------%
>>> +
>>> +:- type binfile.
>>> +:- instance urng(binfile, io).
>>> +
>>> +    % Open a binfile generator from a filename. This should be closed
>>> +    % when no longer needed.
>>> +    %
>>> +:- pred open(string, io.res(binfile), io, io).
>>> +:- mode open(in, out, di, uo) is det.
>>
>> Is there a reason not to use predmode declarations here?
>
> Not really. Should I? (And should the coding standard be updated?)

Yes, generally single mode predicates and functions in the standard
library use predmode decls.  The coding standard for the stdlib should
be adjusted to mention that.

>>> +
>>> +    % Close a binfile generator.
>>> +    %
>>> +:- pred close(binfile, io, io).
>>> +:- mode close(in, di, uo) is det.
>>> +
>>> +%---------------------------------------------------------------------------%
>>> +
>>> +    % Generate a number between 0 and max_uint64. This reads 8 bytes
>>> +    % at a time from the binfile and interprets them as an unsigned,
>>> +    % big-endian integer.
>>> +    %
>>> +    % Throws an exception if the end-of-file is reached.
>>> +    %
>>> +:- pred rand(binfile, uint64, io, io).
>>> +:- mode rand(in, out, di, uo) is det.
>>> +
>>> +%---------------------------------------------------------------------------%
>>> +
>>> +:- implementation.
>>> +
>>> +:- import_module require.
>>> +:- import_module uint64.
>>> +
>>> +%---------------------------------------------------------------------------%
>>> +
>>> +:- type binfile
>>> +    --->    binfile(binary_input_stream).
>>> +
>>> +:- instance urng(binfile, io) where [
>>> +    pred(urandom/4) is rand,
>>> +    ( urandom_max(_) = uint64.max_uint64 )
>>> +].
>>
>> It would be preferable if each generator also exported the predicates that
>> implement these methods, so they can be used without going via the type class
>> interface.
>>
>> Ditto for the other generators.
>
> Ok, I've exported a "rand_max" for each generator. The exported
> predicates don't always have the same signatures as the typeclass
> methods, though, since they are specialized to the needs of the
> particular generator. I figure that the main reason to use a specific
> generator directly is speed, and as such it is better to leave off
> unneeded arguments (as in the sfc32 and sfc generators, which don't
> need the params argument). Similarly it is better to use uint16 or
> uint32 rather than uint64, if that is what the generator actually
> produces.

In the case, for example, where the generator only outputs a uint16,
I think you would better exporting two predicates: one which output the
"raw" output and another which provides the output as required by
the type class (e.g. whichever of uint64 or uint32 it turns out to be).

>>> diff --git a/library/rng.m b/library/rng.m
>>> new file mode 100644
>>> index 0000000..24e8086
>>> --- /dev/null
>>> +++ b/library/rng.m
>>> @@ -0,0 +1,345 @@
>>> +%---------------------------------------------------------------------------%
>>> +% vim: ft=mercury ts=4 sts=4 sw=4 et
>>> +%---------------------------------------------------------------------------%
>>> +% Copyright (C) 2019 The Mercury team.
>>> +% This file is distributed under the terms specified in COPYING.LIB.
>>> +%---------------------------------------------------------------------------%
>>> +%
>>> +% File: rng.m
>>> +% Main author: Mark Brown
>>> +%
>>> +% This module provides an interface to several random number generators,
>>> +% which can be found in the submodules.
>>
>> It also provides one instance of a distribution function (random_gauss);
>> such distributions would be better off placed in their own sub-modules.
>> (If you look at say the random library in Boost or in more recent versions of
>> C++, for example, there a bunch of others that we may well end up adding here.)
>
> As you point out below, all of the predicates that return random
> numbers have a distribution, so I'm not sure why the dividing line
> should be between random_gauss and random_{int,float}.

The dividing line I had in mind, was uniform distribution over the
ranges (or sub ranges) of Mercury primitive types vs. some other
distribution (I realise float_01 doesn't quite fit within that, but it's
a fairly fundamental building block for other stuff, so I think it does
belong in the top-level module.)

> Is "a bunch" too many for the top level module? It only has these,
> plus the typeclasses themselves.

I suggest we postpone deciding where randome distribution functions live
until we have more to add (i.e. just stick random_gauss in the
top-level module for now!)

> One thing I was wondering was whether to split it into two top-level
> modules, one with the shared interface and one with the unique
> interface. Any thoughts on that?

I think they should all be part of random.

>>> +:- module rng.
>>> +:- interface.
>>> +
>>> +:- include_module binfile.
>>> +:- include_module marsaglia.
>>> +:- include_module tausworthe.
>> `> +
>>> +%---------------------------------------------------------------------------%
>>> +
>>> +    % random_int(Start, Range, N, !RNG)
>>> +    %
>>> +    % Generate a random integer between Start and Start+Range-1 inclusive.
>>> +    % Throws an exception if Range < 1 or Range > random_max.
>>> +    %
>>> +:- pred random_int(int, int, int, RNG, RNG) <= rng(RNG).
>>> +:- mode random_int(in, in, out, in, out) is det.
>>
>> I would hope the intention here is that each call to random_int/5 returns a
>> random int uniformly distributed in the given range.  If so, the documentation
>> should say that, if not it should say what the user can assume (or not) about
>> the distribution of the ints returned.
>>
>> (I feel a reference to the XKCD random number generator is order here:
>> <https://xkcd.com/221/>.)
>
> Heh.  Technically, with the binfile generator, you are at the mercy of
> the file you decide to use (it's handy for reading /dev/urandom,
> though).

Unrelated to this change: /dev/unrandom is not portable.  We ought to
provide a portable mechanism for accessing the OS RNG / entropy pool.
Something like:

    :- type system_rng,
    :- instance urng(system_rng, io).

That would map on to /dev/random or whatever the appropriate platform
specific thing is.   (For most operating systems, there is typically
a prng seeded from an OS updated entropy pool).  I'll implement this
once the intitial version of these interfaces is committed.

If / when that is done, is there any point to the binfile generator.
I suppose it's useful for testing purposes?

>> All of the above generate random values uniformly distributed in the
>> given range.  The following predicate ...
>>
>>> +    % Generate two random floats from a normal distribution with
>>> +    % mean 0 and standard deviation 1, using the Box-Muller method.
>>> +    %
>>> +    % We generate two at a time for efficiency; they are independent of
>>> +    % each other.
>>> +    %
>>> +:- pred random_gauss(float, float, RNG, RNG) <= rng(RNG).
>>> +:- mode random_gauss(out, out, in, out) is det.
>>
>> ... should be moved to a separate submodule.  Ditto for the unique variants.
>
> See above reply. If you still want this I don't really object, but
> you'll have to think of a module name.

See above -- I'm fine with leaving this where it is for now.

>>> +%---------------------------------------------------------------------------%
>>> +
>>> +    % Interface to random number generators.
>>> +    %
>>> +:- typeclass rng(RNG) where [
>>> +
>>> +        % Generate a random integer between 0 and random_max, inclusive.
>>> +        %
>>> +    pred random(uint64, RNG, RNG),
>>> +    mode random(out, in, out) is det,
>>
>> I think the basic unit returned by an RNG ought to be a uint32 rather than a
>> uint64 -- I note most of the implementations you have here actually do that
>> anyway and just cast the result to a uint64.
>
> What's the advantage of uint32?

We still support 32-bit platforms; even if you say native 32-bit
platforms are unimportant now, that still leaves things like the JVM.
It's essentially the lowest common denominator type here.  What's the
advantage of uint64 for this?

>> I suggest we impose a stronger requirement on RNGs here, namely that each
>> call to random/3 returns a random 64-bit (or 32-bit) word.  It is the
>> responsbility of the RNG instance to fulfil that requirement.   Most modern
>> pseudo-random number generators should be fine with this requirement.
>
> I've documented that the numbers should be uniformly distributed,
> although this implies most instances will only be pseudo-correct ;-)

<https://dilbert.com/strip/2001-10-25>

> I've also added that random_max should be no less than 65535, as it
> doesn't seem too much to ask instances to be able to come up with at
> least 16 bits at a time.

Imposing the stronger requirement I gave above should simplify code that
uses the RNG because it won't have to concern itself with how much of
the output is actually random.  Under that requirement a 16-bit RNG
would just have to combine multiple outputs until it reaches the
required number of random bits.

Julien.


More information about the reviews mailing list