[m-users.] Log parsing benchmark

Richard O'Keefe raoknz at gmail.com
Sun Jun 23 22:40:22 AEST 2019


I do not understand what the purpose of this benchmark is.
For example, I get the following times with data generated
by generate.m:
Python3  : 122 seconds
C        : 226 seconds (topsender.c, gcc cranked up high)
C        : 218 seconds (topsender.c, PGI cc cranked up high)
AWK      :  46 seconds (mawk 1.4, NOT using a regexp for parsing)
Smalltalk: 113 seconds (astc -O2, hash tables implemented in Smalltalk)

Picking the top k by sorting all n items is not a terribly efficient method:
it costs O(n log n), where a heap of k items needs only O(n log k).
The Smalltalk version uses a heap of k items and the AWK version does
something similar.
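
As a rough illustration of the heap approach, here is a Python sketch. It is
not the Smalltalk or AWK code timed above; the name top_k and the sample
counts are made up for the example. It keeps a min-heap of at most k
(count, key) pairs and replaces the smallest entry only when a larger one
turns up, so the full table is never sorted.

```python
import heapq


def top_k(counts, k):
    """Return the k (key, count) pairs with the largest counts, largest first.

    counts is any mapping from key to count (hypothetical sample data below).
    Ties are broken by key, purely so the result is deterministic.
    """
    if k <= 0:
        return []
    heap = []  # min-heap of (count, key); never holds more than k entries
    for key, count in counts.items():
        entry = (count, key)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # The new entry beats the smallest of the current top k.
            heapq.heapreplace(heap, entry)
    # Only the k survivors are sorted, which costs O(k log k).
    return [(key, count) for count, key in sorted(heap, reverse=True)]


if __name__ == "__main__":
    sample = {"a": 3, "b": 10, "c": 1, "d": 7, "e": 7, "f": 2}
    print(top_k(sample, 3))
```

Python's standard library also provides heapq.nlargest(k, iterable, key=...),
which would do the same job in one call; the loop is spelled out here only to
show the idea.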


On Thu, 20 Jun 2019 at 20:18, Julian Fondren <jfondren at hostgator.com> wrote:

> Howdy,
>
> I've added Mercury to a benchmark at
> https://github.com/jrfondren/topsender-bench , the task: run a regex over
> a 400 MB logfile, count email senders (as matched by the regex) and report
> the top 5 senders with their counts. I don't have a good sanitized logfile
> to use so this test isn't reproducible at the moment unless you have your
> own exim logs, sorry.
>
> The Mercury code I've written is using PCRE for the actual regex matching
> (by linking pcre_d_shim.o, the same code I wrote to speed up the D
> candidates for this benchmark, earlier), the io module to read lines from
> the log, and the hash_table module to count senders.
>
> On this benchmark, Mercury's about 5x slower than the reference C version,
> or only 1.4x as slow as the Python3 version. Mercury's memory usage is
> pretty disappointing; I experimented with adding some manual calls to the
> GC on every Nth match, and that could shave off dozens of MB, but at a
> significant runtime cost.
>
> The need for `with_type` in the following code really surprised me, but...
> as of writing this email, I finally get it: ordering() is so generic that
> it could be asked to compare two curried predicate snd()s, rather than two
> ints that the func snd() returns.
>
>   :- func cmp(pair(string, int), pair(string, int)) = comparison_result.
>   cmp(A, B) = ordering(snd(B) `with_type` int, snd(A) `with_type` int).
>
> This is all with asm_fast.gc. I get a segfault when I compile with hlc.gc.
> ltrace (library trace, not strace) shows that the SIGSEGV happens at
>
> mercury__hash_table_search_3_p_0(0x608ce0, 0x608b40, 0x440000002d,
> 0x7f638d2c1fe0 <no return ....>
> --- SIGSEGV (Segmentation fault)
>
> *** Mercury runtime: caught segmentation violation ***
> cause: address not mapped to object
> address involved: 0x4400000045
>
> which is the very first call to hash_table.search in the program. I'd like
> to troubleshoot this further, but I don't have any debugging hlc grades.
> Right now I only have the default libgrades installed, but I still have the
> compilation environment from the install hanging around. Is there an easy
> way to go back there and say "compile and install this one additional
> libgrade" without rebuilding everything?
>
>
> If I were to iterate further on the Mercury candidate, I'd start with
> more efficient file handling, say with an "Applies the given closure to each
> regex match against lines from the input file" routine, that could delay
> allocating memory for strings until the regex succeeds.
>
> Cheers.