[m-users.] Log parsing benchmark

Peter Wang novalazy at gmail.com
Sun Jun 23 12:24:50 AEST 2019


On Thu, 20 Jun 2019 03:17:57 -0500, Julian Fondren <jfondren at hostgator.com> wrote:
> Howdy,
> 
> I've added Mercury to a benchmark at
> https://github.com/jrfondren/topsender-bench , the task: run a regex over a
> 400 MB logfile, count email senders (as matched by the regex) and report
> the top 5 senders with their counts. I don't have a good sanitized logfile
> to use so this test isn't reproducible at the moment unless you have your
> own exim logs, sorry.
> 
> The Mercury code I've written is using PCRE for the actual regex matching
> (by linking pcre_d_shim.o, the same code I wrote to speed up the D
> candidates for this benchmark, earlier), the io module to read lines from
> the log, and the hash_table module to count senders.
> 
> On this benchmark, Mercury's about 5x slower than the reference C version,
> or only 1.4x as slow as the Python3 version. Mercury's memory usage is
> pretty disappointing; I experimented with adding some manual calls to the
> GC on every Nth match, and that could shave off dozens of MB, but at a
> significant runtime cost.

Hi,

I created a dumb program (attached) to generate a sample exim_mainlog
file, but I wasn't able to reproduce your result.

% du -h exim_mainlog
452M    exim_mainlog
% wc -l exim_mainlog
15457585 exim_mainlog

./topsender_c         38.25s user 0.06s system 99% cpu 38.366 total
./topsender_pcre_c     5.52s user 0.07s system 99% cpu  5.598 total
./topsender_pcre_dmd   8.04s user 0.09s system 99% cpu  8.145 total
./topsender_m.af       9.23s user 0.08s system 99% cpu  9.329 total
./topsender_m.hlc      9.25s user 0.07s system 99% cpu  9.335 total
./topsender.pl        13.07s user 0.06s system 99% cpu 13.153 total
./topsender.py        15.07s user 0.06s system 99% cpu 15.159 total

That seems about right to me. The Mercury hash_table module is rather
basic (it uses chaining instead of open addressing, and incurs at least
two allocations per insertion even without collisions). Converting the
hash table to a list and then sorting that list should also be slower,
and use more memory, than sorting an array.

I didn't check memory usage. It is possible to tune the GC to collect
more often, trading run time for a smaller heap, by setting the
GC_FREE_SPACE_DIVISOR environment variable to a larger value.

> 
> The need for `with_type` in the following code really surprised me, but...
> as of writing this email, I finally get it: ordering() is so generic that
> it could be asked to compare two curried predicate snd()s, rather than two
> ints that the func snd() returns.
> 
>   :- func cmp(pair(string, int), pair(string, int)) = comparison_result.
>   cmp(A, B) = ordering(snd(B) `with_type` int, snd(A) `with_type` int).

You can write : instead of `with_type`. I would probably write:

    :- pred cmp(pair(string, int), pair(string, int), comparison_result).
    :- mode cmp(in, in, out) is det.

    cmp(_-A, _-B, R) :-
        compare(R, B, A).
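
As a rough sketch of how such a comparison predicate could be used to
pick the top senders: the module name top5, the sample data and the
print_entry helper below are made up for illustration and are not taken
from the benchmark code. It sorts a list of Sender - Count pairs in
descending order of count with list.sort/3 and keeps at most five.

```mercury
:- module top5.
:- interface.

:- import_module io.

:- pred main(io::di, io::uo) is det.

:- implementation.

:- import_module list.
:- import_module pair.
:- import_module string.

    % Compare two Sender - Count pairs by count only. The arguments to
    % compare/3 are swapped so that larger counts sort first.
:- pred cmp(pair(string, int)::in, pair(string, int)::in,
    comparison_result::out) is det.

cmp(_ - A, _ - B, R) :-
    compare(R, B, A).

main(!IO) :-
    % Hypothetical sample data standing in for the real sender counts.
    Counts = ["alice" - 3, "bob" - 7, "carol" - 5],
    list.sort(cmp, Counts, Sorted),
    % take_upto does not fail if there are fewer than five entries.
    list.take_upto(5, Sorted, Top),
    list.foldl(print_entry, Top, !IO).

:- pred print_entry(pair(string, int)::in, io::di, io::uo) is det.

print_entry(Sender - Count, !IO) :-
    io.format("%5d %s\n", [i(Count), s(Sender)], !IO).
```

With the sample data this lists bob, carol and alice, in that order.
Because both counts are known to be ints from the declared argument
types, no type qualification is needed on the call to compare/3.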

> 
> This is all with asm_fast.gc. I get a segfault when I compile with hlc.gc.
> ltrace (library trace, not strace) shows that the SIGSEGV happens at
> 
> mercury__hash_table_search_3_p_0(0x608ce0, 0x608b40, 0x440000002d,
> 0x7f638d2c1fe0 <no return ....>
> --- SIGSEGV (Segmentation fault)
> 
> *** Mercury runtime: caught segmentation violation ***
> cause: address not mapped to object
> address involved: 0x4400000045

Weird, it works for me (rotd-2019-06-13, but any version should work).

I think it doesn't make any difference in this test, but I suggest
normally compiling with -O5 --intermodule-optimisation.
I don't know if -O6 ever gets tested.

> 
> which is the very first call to hash_table.search in the program. I'd like
> to troubleshoot this further, but I don't have any debugging hlc grades.
> Right now I only have the default libgrades installed, but I still have the
> compilation environment from the install hanging around. Is there an easy
> way to go back there and say "compile and install this one additional
> libgrade" without rebuilding everything?

make install LIBGRADES=...

Peter
-------------- next part --------------
% Usage:
% mmc generate.m && ./generate </usr/share/dict/words | shuf >exim_mainlog

:- module generate.
:- interface.

:- import_module io.

:- pred main(io::di, io::uo) is det.

:- implementation.

:- import_module list.
:- import_module string.
:- import_module int.

main(!IO) :-
    io.read_line_as_string(ReadRes, !IO),
    (
        ReadRes = ok(Line0),
        Line = chomp(Line0),
        rand(R, !IO),
        repeat(format("a b c <= %s@example.org\n", [s(Line)]), R, !IO),
        main(!IO)
    ;
        ReadRes = eof
    ;
        ReadRes = error(_)
    ).

:- pred rand(int::out, io::di, io::uo) is det.

:- pragma foreign_proc("C",
    rand(R::out, _IO0::di, _IO::uo),
    [will_not_call_mercury, promise_pure, thread_safe, tabled_for_io],
"
    R = 1 + (rand() % 250);
").

:- pred repeat(pred(io, io), int, io, io).
:- mode repeat(in(pred(di, uo) is det), in, di, uo) is det.

repeat(P, N, !IO) :-
    ( if N =< 0 then
        true
    else
        P(!IO),
        repeat(P, N - 1, !IO)
    ).

