[m-rev.] RNGs
Mark Brown
mark at mercurylang.org
Mon Aug 19 16:25:01 AEST 2019
On Mon, Aug 19, 2019 at 1:48 PM Julien Fischer <jfischer at opturion.com> wrote:
>
>
> Hi Mark,
>
> On Mon, 19 Aug 2019, Mark Brown wrote:
>
> > On Mon, Aug 19, 2019 at 1:45 AM Julien Fischer <jfischer at opturion.com> wrote:
> >>
> >>
> >> Hi Mark,
> >>
> >> On Tue, 13 Aug 2019, Mark Brown wrote:
> >>
> >>> I've just made a pull request to add some RNGs to the standard library, as
> >>> well as a typeclass interface to them. Pull request is here:
> >>>
> >>> https://github.com/Mercury-Language/mercury/pull/70
> >>>
> >>> The diff is also attached, for anyone who would prefer to comment on
> >>> it via email.
> >>
> >> Thanks for looking into this, it's part of the stdlib that has long needed
> >> fixing. Here's a partial review that mainly addresses aspects of the interface
> >> design. I haven't reviewed the actual RNG implementations as yet.
> >
> > Thanks. You can leave reviewing the RNG implementations until after
> > we've decided which ones to keep.
>
> We can keep all of them; those not in the standard library can live
> in extras or on github somewhere.
>
> >>> commit f102ddaf4c128137b27ac6e85ac881faaf435c9b
> >>> Author: Mark Brown <mark at mercurylang.org>
> >>> Date: Tue Aug 13 16:42:38 2019 +1000
> >>>
> >>> Some RNGs for the standard library, plus typeclass interfaces.
> >>>
> >>> library/rng.m:
> >>> Top-level module containing typeclasses and generic routines,
> >>> as well as overall documentation.
> >>
> >> The top-level module for random number generator facilities should really be
> >> random. I realise that's taken, however I propose that the new stuff be added
> >> to it and the existing contents of that file deprecated. (They ought to be
> >> able to co-exist in the interim until the old stuff is removed.)
> >
> > Ok, but see below.
> >
> >>
> >>> library/rng.binfile.m:
> >>> A generator that reads from a file.
> >>>
> >>> library/rng.marsaglia.m:
> >>> Fast, simple generator. Diehard results for default seed:
> >>> PASSED: 109
> >>> WEAK: 2
> >>> FAILED: 3
> >>>
> >>> library/rng.tausworthe.m:
> >>> Combined Tausworthe generators. Diehard results for default seed:
> >>>
> >>> generator T3
> >>> PASSED: 113
> >>> WEAK: 1
> >>> FAILED: 0
> >>>
> >>> generator T4
> >>> PASSED: 112
> >>> WEAK: 2
> >>> FAILED: 0
> >>
> >> There's a bit of question as to what RNGs should be included in the
> >> stdlib by default.
> >
> > Yes, I've been evaluating more of these, and I'll be happy to leave
> > out any that we don't in the end have a use case to justify. At the
> > moment, now that I've compared them to sfc, I'm thinking that both
> > marsaglia and tausworthe could be removed:
> >
> > - The justification for marsaglia is that it is very fast. Sfc16 is
> > also fast, but marsaglia is about 7 or 8 times faster, so it would
> > seem to be worthwhile including it. However, the absolute times are so
> > small that in order to notice a difference in overall time a lot of
> > numbers will need to be generated, and by this time biases in
> > marsaglia will have had a chance to show up. So whether you are
> > generating a small or large set of numbers, it would seem that sfc16
> > is at least as good as marsaglia.
> >
> > - The combined Tausworthe generators were meant to be something
> > familiar and of better quality than marsaglia, but sfc has turned out
> > to be faster and of much higher quality than tausworthe.
> >
> > If anyone thinks we should keep the above, please say so.
> >
> > I still have a few more generators I want to look at. We don't have
> > any with arbitrary jumps, or that can accept added entropy during
> > operation, but those features may be desirable.
>
> As I mentioned above, extra RNGs can live in extras. The standard
> library itself only needs to provide a few generators: one that is fast
> and one that is "good" (in the statistically random sense), assuming
> that there is not one that is both.
Ok. I'll take that to mean that marsaglia and tausworthe can be put in extras.
>
> > Sfc is not cryptographically secure, although I'm not sure we want to
> > offer such a thing.
>
> I don't think providing cryptographic services is the business of the
> standard library.
I was hoping you would say that :-)
>
> >>> diff --git a/library/rng.binfile.m b/library/rng.binfile.m
> >>> new file mode 100644
> >>> index 0000000..36c1966
> >>> --- /dev/null
> >>> +++ b/library/rng.binfile.m
> >>> @@ -0,0 +1,96 @@
> >>> +%---------------------------------------------------------------------------%
> >>> +% vim: ft=mercury ts=4 sts=4 sw=4 et
> >>> +%---------------------------------------------------------------------------%
> >>> +% Copyright (C) 2019 The Mercury team.
> >>> +% This file is distributed under the terms specified in COPYING.LIB.
> >>> +%---------------------------------------------------------------------------%
> >>> +%
> >>> +% File: rng.binfile.m
> >>> +% Main author: Mark Brown
> >>> +%
> >>> +% "Random" number generator that reads numbers from a binary file.
> >>> +%
> >>> +%---------------------------------------------------------------------------%
> >>> +%---------------------------------------------------------------------------%
> >>> +
> >>> +:- module rng.binfile.
> >>> +:- interface.
> >>> +
> >>> +:- import_module io.
> >>> +
> >>> +%---------------------------------------------------------------------------%
> >>> +
> >>> +:- type binfile.
> >>> +:- instance urng(binfile, io).
> >>> +
> >>> + % Open a binfile generator from a filename. This should be closed
> >>> + % when no longer needed.
> >>> + %
> >>> +:- pred open(string, io.res(binfile), io, io).
> >>> +:- mode open(in, out, di, uo) is det.
> >>
> >> Is there a reason not to use predmode declarations here?
> >
> > Not really. Should I? (And should the coding standard be updated?)
>
> Yes, generally single mode predicates and functions in the standard
> library use predmode decls. The coding standard for the stdlib should
> be adjusted to mention that.
Will do.
>
> >>> +
> >>> + % Close a binfile generator.
> >>> + %
> >>> +:- pred close(binfile, io, io).
> >>> +:- mode close(in, di, uo) is det.
> >>> +
> >>> +%---------------------------------------------------------------------------%
> >>> +
> >>> + % Generate a number between 0 and max_uint64. This reads 8 bytes
> >>> + % at a time from the binfile and interprets them as an unsigned,
> >>> + % big-endian integer.
> >>> + %
> >>> + % Throws an exception if the end-of-file is reached.
> >>> + %
> >>> +:- pred rand(binfile, uint64, io, io).
> >>> +:- mode rand(in, out, di, uo) is det.
> >>> +
> >>> +%---------------------------------------------------------------------------%
> >>> +
> >>> +:- implementation.
> >>> +
> >>> +:- import_module require.
> >>> +:- import_module uint64.
> >>> +
> >>> +%---------------------------------------------------------------------------%
> >>> +
> >>> +:- type binfile
> >>> + ---> binfile(binary_input_stream).
> >>> +
> >>> +:- instance urng(binfile, io) where [
> >>> + pred(urandom/4) is rand,
> >>> + ( urandom_max(_) = uint64.max_uint64 )
> >>> +].
> >>
> >> It would be preferable if each generator also exported the predicates that
> >> implement these methods, so they can be used without going via the type class
> >> interface.
> >>
> >> Ditto for the other generators.
> >
> > Ok, I've exported a "rand_max" for each generator. The exported
> > predicates don't always have the same signatures as the typeclass
> > methods, though, since they are specialized to the needs of the
> > particular generator. I figure that the main reason to use a specific
> > generator directly is speed, and as such it is better to leave off
> > unneeded arguments (as in the sfc32 and sfc generators, which don't
> > need the params argument). Similarly it is better to use uint16 or
> > uint32 rather than uint64, if that is what the generator actually
> > produces.
>
> In the case, for example, where the generator only outputs a uint16,
> I think you would better exporting two predicates: one which output the
> "raw" output and another which provides the output as required by
> the type class (e.g. whichever of uint64 or uint32 it turns out to be).
Ok.
>
> >>> diff --git a/library/rng.m b/library/rng.m
> >>> new file mode 100644
> >>> index 0000000..24e8086
> >>> --- /dev/null
> >>> +++ b/library/rng.m
> >>> @@ -0,0 +1,345 @@
> >>> +%---------------------------------------------------------------------------%
> >>> +% vim: ft=mercury ts=4 sts=4 sw=4 et
> >>> +%---------------------------------------------------------------------------%
> >>> +% Copyright (C) 2019 The Mercury team.
> >>> +% This file is distributed under the terms specified in COPYING.LIB.
> >>> +%---------------------------------------------------------------------------%
> >>> +%
> >>> +% File: rng.m
> >>> +% Main author: Mark Brown
> >>> +%
> >>> +% This module provides an interface to several random number generators,
> >>> +% which can be found in the submodules.
> >>
> >> It also provides one instance of a distribution function (random_gauss);
> >> such distributions would be better off placed in their own sub-modules.
> >> (If you look at say the random library in Boost or in more recent versions of
> >> C++, for example, there a bunch of others that we may well end up adding here.)
> >
> > As you point out below, all of the predicates that return random
> > numbers have a distribution, so I'm not sure why the dividing line
> > should be between random_gauss and random_{int,float}.
>
> The dividing line I had in mind, was uniform distribution over the
> ranges (or sub ranges) of Mercury primitive types vs. some other
> distribution (I realise float_01 doesn't quite fit within that, but it's
> a fairly fundamental building block for other stuff, so I think it does
> belong in the top-level module.)
>
> > Is "a bunch" too many for the top level module? It only has these,
> > plus the typeclasses themselves.
>
> I suggest we postpone deciding where randome distribution functions live
> until we have more to add (i.e. just stick random_gauss in the
> top-level module for now!)
>
> > One thing I was wondering was whether to split it into two top-level
> > modules, one with the shared interface and one with the unique
> > interface. Any thoughts on that?
>
> I think they should all be part of random.
>
> >>> +:- module rng.
> >>> +:- interface.
> >>> +
> >>> +:- include_module binfile.
> >>> +:- include_module marsaglia.
> >>> +:- include_module tausworthe.
> >> `> +
> >>> +%---------------------------------------------------------------------------%
> >>> +
> >>> + % random_int(Start, Range, N, !RNG)
> >>> + %
> >>> + % Generate a random integer between Start and Start+Range-1 inclusive.
> >>> + % Throws an exception if Range < 1 or Range > random_max.
> >>> + %
> >>> +:- pred random_int(int, int, int, RNG, RNG) <= rng(RNG).
> >>> +:- mode random_int(in, in, out, in, out) is det.
> >>
> >> I would hope the intention here is that each call to random_int/5 returns a
> >> random int uniformly distributed in the given range. If so, the documentation
> >> should say that, if not it should say what the user can assume (or not) about
> >> the distribution of the ints returned.
> >>
> >> (I feel a reference to the XKCD random number generator is order here:
> >> <https://xkcd.com/221/>.)
> >
> > Heh. Technically, with the binfile generator, you are at the mercy of
> > the file you decide to use (it's handy for reading /dev/urandom,
> > though).
>
> Unrelated to this change: /dev/unrandom is not portable. We ought to
> provide a portable mechanism for accessing the OS RNG / entropy pool.
> Something like:
>
> :- type system_rng,
> :- instance urng(system_rng, io).
>
> That would map on to /dev/random or whatever the appropriate platform
> specific thing is. (For most operating systems, there is typically
> a prng seeded from an OS updated entropy pool). I'll implement this
> once the intitial version of these interfaces is committed.
>
> If / when that is done, is there any point to the binfile generator.
> I suppose it's useful for testing purposes?
I think so, but I'll try to come up with a specific case.
>
> >> All of the above generate random values uniformly distributed in the
> >> given range. The following predicate ...
> >>
> >>> + % Generate two random floats from a normal distribution with
> >>> + % mean 0 and standard deviation 1, using the Box-Muller method.
> >>> + %
> >>> + % We generate two at a time for efficiency; they are independent of
> >>> + % each other.
> >>> + %
> >>> +:- pred random_gauss(float, float, RNG, RNG) <= rng(RNG).
> >>> +:- mode random_gauss(out, out, in, out) is det.
> >>
> >> ... should be moved to a separate submodule. Ditto for the unique variants.
> >
> > See above reply. If you still want this I don't really object, but
> > you'll have to think of a module name.
>
> See above -- I'm fine with leaving this where it is for now.
>
> >>> +%---------------------------------------------------------------------------%
> >>> +
> >>> + % Interface to random number generators.
> >>> + %
> >>> +:- typeclass rng(RNG) where [
> >>> +
> >>> + % Generate a random integer between 0 and random_max, inclusive.
> >>> + %
> >>> + pred random(uint64, RNG, RNG),
> >>> + mode random(out, in, out) is det,
> >>
> >> I think the basic unit returned by an RNG ought to be a uint32 rather than a
> >> uint64 -- I note most of the implementations you have here actually do that
> >> anyway and just cast the result to a uint64.
> >
> > What's the advantage of uint32?
>
> We still support 32-bit platforms; even if you say native 32-bit
> platforms are unimportant now, that still leaves things like the JVM.
> It's essentially the lowest common denominator type here. What's the
> advantage of uint64 for this?
I'm not sure I understand the objection. Uint64 still works on 32-bit
platforms, doesn't it? it is just less efficient? The advantage of
uint64 is that it only takes one method call to get 64-bits of
information out of a generator, instead of two. At the speed these
things work that is noticeably faster, I think, and I'd rather
sacrifice performance on JVM than here.
>
> >> I suggest we impose a stronger requirement on RNGs here, namely that each
> >> call to random/3 returns a random 64-bit (or 32-bit) word. It is the
> >> responsbility of the RNG instance to fulfil that requirement. Most modern
> >> pseudo-random number generators should be fine with this requirement.
> >
> > I've documented that the numbers should be uniformly distributed,
> > although this implies most instances will only be pseudo-correct ;-)
>
> <https://dilbert.com/strip/2001-10-25>
I picked "default" seeds for sfc using "hexdump -Cv /dev/urandom |
head". Go figure.
>
> > I've also added that random_max should be no less than 65535, as it
> > doesn't seem too much to ask instances to be able to come up with at
> > least 16 bits at a time.
>
> Imposing the stronger requirement I gave above should simplify code that
> uses the RNG because it won't have to concern itself with how much of
> the output is actually random. Under that requirement a 16-bit RNG
> would just have to combine multiple outputs until it reaches the
> required number of random bits.
Oh, I see what you mean. That would mean there would be no need for
the random_max methods, right? That would cut down on method calls,
and I agree that most modern generators will be fine with this.
Mark
More information about the reviews
mailing list