[m-users.] Log parsing benchmark

jfondren at minimaltype.com
Mon Jun 24 08:48:38 AEST 2019


On 2019-06-23 07:40, Richard O'Keefe wrote:
> I do not understand what the purpose of this benchmark is.

It's similar to a task I've had, and it's a task where my solution was
slow enough that it ultimately couldn't be used, and my work was
discarded.

Anecdote: I added the 'regex' Nim version to the table on the
recommendation of someone on the Nim IRC channel. That recommendation
had the character of "You're using stdlib re? Heh. It's garbage. You
should use regex instead." I would've been pretty unhappy if I'd just
accepted this advice and then, two weeks later in some real project,
had to start asking how I could speed my code up.

Without this benchmark, I would've told you that the main benefit of
PCRE is all the extra regex features it gives you. Surely, libc regex is
competitive enough in the features it does implement...

The main benefit that I've gotten from this benchmark is from seeing how
I could answer the question "OK, how can I speed this up?" in different
languages. Mercury solution #1 is one of the slower solutions, and
Mercury solution #2 is already one of the very best. (I'm divided on
how cheaty that solution is though. Just look at getline.m)

I find that when people complain about data, it's usually because they
imagine that someone else is taking the data and running right off a
pier with it. Should benchmarks only be documented in Latin, so that
common people won't be driven to confusion?

> For example, I get the following times with data generated
> by generate.m:
> Python3  : 122 seconds
> C        : 226 seconds (topsender.c, gcc cranked up high)
> C        : 218 seconds (topsender.c, PGI cc cranked up high)
> AWK      :  46 seconds (mawk 1.4, NOT using a regexp for parsing)
> Smalltalk: 113 seconds (astc -O2, hash tables implemented in Smalltalk)
> 
> Picking the top k by sorting n items is not a terribly efficient method.
> The Smalltalk version uses a heap of k items and the AWK version does
> something similar.
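
For anyone following along, here is a minimal sketch of the heap-of-k
idea in Python. It is not the Smalltalk or AWK code from the quoted
message, which isn't shown in this thread; the function name top_k and
the sample addresses are made up for illustration. The point is only
that keeping a min-heap of at most k entries costs roughly O(n log k),
where sorting all n counts costs O(n log n).

```python
import heapq
from collections import Counter


def top_k(counts, k):
    """Return the k (sender, count) pairs with the largest counts, largest first.

    `counts` maps sender -> count. Hypothetical helper for illustration only.
    """
    if k <= 0:
        return []
    heap = []  # min-heap of (count, sender); never holds more than k entries
    for sender, count in counts.items():
        entry = (count, sender)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # The new entry beats the smallest of the current top k,
            # so pop that one and push the new entry in a single step.
            heapq.heapreplace(heap, entry)
    # Only the k survivors get sorted, not all n senders.
    return [(sender, count) for count, sender in sorted(heap, reverse=True)]


# Made-up sample data standing in for parsed log lines.
counts = Counter([
    "a@example.com", "b@example.com", "a@example.com",
    "c@example.com", "a@example.com", "b@example.com",
])
print(top_k(counts, 2))
```

Python's standard library also has heapq.nlargest, which does the same
job without the hand-written loop; the loop is spelled out here only to
show where the saving comes from.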

