[m-users.] Log parsing benchmark
jfondren at minimaltype.com
Mon Jun 24 08:47:24 AEST 2019
On 2019-06-22 21:24, Peter Wang wrote:
> On Thu, 20 Jun 2019 03:17:57 -0500, Julian Fondren
> <jfondren at hostgator.com> wrote:
>> Howdy,
>>
>> I've added Mercury to a benchmark at
>> https://github.com/jrfondren/topsender-bench , the task: run a regex
>> over a 400 MB logfile, count email senders (as matched by the regex)
>> and report the top 5 senders with their counts. I don't have a good
>> sanitized logfile to use so this test isn't reproducible at the moment
>> unless you have your own exim logs, sorry.
>>
>> The Mercury code I've written is using PCRE for the actual regex
>> matching (by linking pcre_d_shim.o, the same code I wrote to speed up
>> the D candidates for this benchmark, earlier), the io module to read
>> lines from the log, and the hash_table module to count senders.
>>
>> On this benchmark, Mercury's about 5x slower than the reference C
>> version, or only 1.4x as slow as the Python3 version. Mercury's memory
>> usage is pretty disappointing; I experimented with adding some manual
>> calls to the GC on every Nth match, and that could shave off dozens of
>> MB, but at a significant runtime cost.
>
> Hi,
>
> I created a dumb program (attached) to generate a sample exim_mainlog
> file, but I wasn't able to reproduce your result.
This generates a log where 100% of lines match the regex, vs. 5% in
my log. When I take the 5% of lines that match in my log and repeat
that whole set enough times to get a similarly sized log, I still don't
get your results. When I instead repeat *each line* 9 times, I get a
similarly sized log and these results:
| topsender_c                | 3.069x | 12 s 838 ms          | 25.1 MB  |
| topsender_pcre_c           | 1x     | 4 s 834 ms           | 25.2 MB  |
| topsender_nim              | 0.375x | 1448.018 ms          | 16.0 MB  |
| topsender_altsort_nim      | 0.400x | 1541.838333333333 ms | 26.3 MB  |
| topsender_npeg_nim         | 0.603x | 2326.258 ms          | 21.6 MB  |
| topsender_regex_nim        | 4.277x | 16 s 495 ms          | 26.0 MB  |
| topsender_dmd              | 1.292x | 5 s 982 ms           | 55.8 MB  |
| topsender_pcre_dmd         | 0.295x | 1137.986666666667 ms | 38.8 MB  |
| topsender_pcre_getline_dmd | 0.191x | 737.6916666666666 ms | 55.3 MB  |
| topsender_ldc              | 0.502x | 1937.030333333333 ms | 59.8 MB  |
| topsender_pcre_ldc         | 0.260x | 1004.470333333333 ms | 43.0 MB  |
| topsender_pcre_getline_ldc | 0.165x | 638.2376666666667 ms | 59.5 MB  |
| topsender_m                | 0.850x | 3 s 277 ms           | 116.8 MB |
| topsender_getline_m        | 0.220x | 848.4716666666667 ms | 96.8 MB  |
I reckon the repeated lines give you more predictable branches, and so
unrealistic performance, on modern CPUs.
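
To make the two ways of inflating a log concrete, here is a minimal Python sketch (not code from the benchmark repository). The function names are made up for illustration, and the pattern in the usage note is only a stand-in: it assumes exim's `<=` marker for message arrival and is not the regex the benchmark uses.

```python
import re

def repeat_each(lines, times):
    """Repeat every line `times` times in a row: a, a, ..., b, b, ..."""
    return [line for line in lines for _ in range(times)]

def repeat_block(lines, times):
    """Repeat the whole sequence `times` times: a, b, ..., a, b, ..."""
    return list(lines) * times

def match_fraction(lines, pattern):
    """Fraction of lines in which `pattern` matches somewhere."""
    lines = list(lines)
    if not lines:
        return 0.0
    rx = re.compile(pattern)
    return sum(1 for line in lines if rx.search(line)) / len(lines)
```

For example, `repeat_each(matching_lines, 9)` gives the "each line 9 times" layout described above, and `match_fraction(lines, r"<= \S+")` gives a rough idea of how far a generated log is from the 5% match rate of a real one.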
>>
>> The need for `with_type` in the following code really surprised me,
>> but... as of writing this email, I finally get it: ordering() is so
>> generic that it could be asked to compare two curried predicate
>> snd()s, rather than two ints that the func snd() returns.
>>
>> :- func cmp(pair(string, int), pair(string, int)) = comparison_result.
>> cmp(A, B) = ordering(snd(B) `with_type` int, snd(A) `with_type` int).
>
> You can write : instead of `with_type`. I would probably write:
>
> :- pred cmp(pair(string, int), pair(string, int), comparison_result).
> :- mode cmp(in, in, out) is det.
>
> cmp(_-A, _-B, R) :-
>     compare(R, B, A).
>
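
Both versions of cmp above compare the second pair's count against the first's, so a sort using them puts the largest counts first. A rough Python analogue of that swapped-argument comparison, for illustration only (`cmp_desc` and `top_n` are hypothetical names, not part of the benchmark):

```python
from functools import cmp_to_key

def cmp_desc(a, b):
    """Compare (sender, count) pairs so that larger counts sort first."""
    _, count_a = a
    _, count_b = b
    # The arguments are swapped (b's count first), as in compare(R, B, A).
    return (count_b > count_a) - (count_b < count_a)

def top_n(counts, n=5):
    """Return the n (sender, count) pairs with the highest counts."""
    return sorted(counts.items(), key=cmp_to_key(cmp_desc))[:n]
```

Calling `top_n(counts)` on a dict that maps senders to counts returns up to five pairs, highest count first.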
>>
>> This is all with asm_fast.gc. I get a segfault when I compile with hlc.gc.
>> ltrace (library trace, not strace) shows that the SIGSEGV happens at
>>
>> mercury__hash_table_search_3_p_0(0x608ce0, 0x608b40, 0x440000002d, 0x7f638d2c1fe0 <no return ....>
>> --- SIGSEGV (Segmentation fault)
>>
>> *** Mercury runtime: caught segmentation violation ***
>> cause: address not mapped to object
>> address involved: 0x4400000045
>
> Weird, it works for me (rotd-2019-06-13, but any version should work).
>
> I think it doesn't make any difference in this test, but I suggest
> normally compiling with -O5 --intermodule-optimisation.
> I don't know if -O6 ever gets tested.
It's perhaps an error in my C code: hlc.gc works fine for
topsender_getline, which doesn't use pcre_d_shim.o.
>>
>> which is the very first call to hash_table.search in the program. I'd
>> like to troubleshoot this further, but I don't have any debugging hlc
>> grades. Right now I only have the default libgrades installed, but I
>> still have the compilation environment from the install hanging
>> around. Is there an easy way to go back there and say "compile and
>> install this one additional libgrade" without rebuilding everything?
>
> make install LIBGRADES=...
>
> Peter