[m-dev.] Ralph's SPEC benchmark & MLDS back-end

Fergus Henderson fjh at cs.mu.OZ.AU
Thu Nov 16 06:16:30 AEDT 2000
Previous message: [m-dev.] Re: [mercury-users] with_stream
Next message: [m-dev.] diff: move code into switch_util.m
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
The following is edited down from a discussion I had in email with
Ralph.  I thought it might be of wider interest.

The short summary is that I reran Ralph's `compress' SPEC benchmark,
and the results were

	Version			Time (s)
	C			 3.02
	Mercury hlc.gc		 9.31
	Mercury asm_fast.gc	10.9

but after fixing some problems with inlining I got some major
improvements, especially to the MLDS version:

	Version			Time (s)
	C			 3.02
	Mercury hlc.gc		 4.46
	Mercury asm_fast.gc	 7.73

For details, see below.

Ralph Becket <rbeck at microsoft.com> wrote:
> On 16-Nov-2000, Fergus Henderson <fjh at cs.mu.oz.au> wrote:
> > On 15-Nov-2000, Ralph Becket <rbeck at microsoft.com> wrote:
> > > Hi chaps,
> > > 
> > > I wonder if you could do me a big favour?  I'd like to see how well the
> > > SPEC benchmark stuff performs using the new MLDS C backend.  I've
> > > attached the .tgz file containing both Mercury and C versions
> > > (``mmake depend; mmake; run'' for the former and ``make; run'' for the
> > > latter).  I'm having real trouble getting hlc.gc grade stuff to compile.
> > > If you could do this in the next 24 hours or so I'd be very grateful as
> > > I'd like to be able to report the results at a talk I'm giving tomorrow
> > > afternoon.
> > 
> > I ran them on hg (egcs-1.1.2, 500MHz PIII, Linux)
> > For the C version, I get
> > 
> > 	fjh$ ./run
> > 	SPEC 129.compress harness
> > 	Enter count, start-char, random-seed: Initial File Size:1000000 Start character:t
> > 	[time +3.130]
> > 
> > With `-O3 -fomit-frame-pointer' instead of `-O', I get
> > 
> > 	fjh$ ./run
> > 	SPEC 129.compress harness
> > 	Enter count, start-char, random-seed: Initial File Size:1000000 Start character:t
> > 	[time +3.020]
> > 
> > However these results vary quite a bit from run to run.  E.g. for five runs I got
> >
> > 	[time +2.970]
> > 	[time +3.160]
> > 	[time +3.080]
> > 	[time +3.240]
> > 	[time +2.970]
> >
> > Mercury, hlc.gc:
> > 
> > 	******* compress5 x 10 *******
> > 	[Time: +0.000s, 233.849s, C Stack: 0.145k,
> > 	#GCs: 1782, Heap used since last GC: 580.258k, Total used: 8404.000k]
> > 	[Time: +9.750s, 243.599s, C Stack: 0.152k,
> > 	#GCs: 1822, Heap used since last GC: 1412.871k, Total used: 8404.000k]
> > 
> > Mercury, hlc.gc, with MGNUCFLAGS=--inline-alloc:
> > 
> > 	[Time: +0.000s, 227.309s, C Stack: 0.168k,
> > 	#GCs: 1844, Heap used since last GC: 247.219k, Total used: 8404.000k]
> > 	[Time: +9.310s, 236.619s, C Stack: 0.145k,
> > 	#GCs: 1884, Heap used since last GC: 256.336k, Total used: 8404.000k]
> > 
> > Mercury, asm_fast.gc:
> > 
> > 	[Time: +0.010s, 165.519s, D Stack: 0.035k, ND Stack: 0.039k, C Stack: 10.023k,
> > 	#GCs: 802, Heap used since last GC: 2348.629k, Total used: 13988.000k]
> > 	[Time: +10.910s, 176.429s, D Stack: 0.023k, ND Stack: 0.039k, C Stack: 10.023k,
> > 	#GCs: 818, Heap used since last GC: 9.914k, Total used: 13988.000k]
> > 
> > and asm_fast.gc again (so you get an idea of the measurement variation)
> > 
> > 	[Time: +0.000s, 165.509s, D Stack: 0.035k, ND Stack: 0.039k, C Stack: 10.023k,
> > 	#GCs: 802, Heap used since last GC: 2348.629k, Total used: 13988.000k]
> > 	[Time: +10.950s, 176.459s, D Stack: 0.023k, ND Stack: 0.039k, C Stack: 10.023k,
> > 	#GCs: 818, Heap used since last GC: 9.914k, Total used: 13988.000k]
> 
> Fergus, you're a star.  Many many thanks.  I owe you lots of
> beer.
> 
> hlc seems to make about 6% difference, going up to 12% with
> --inline-alloc on compress5.
> 
> The interesting thing is that your results seem to show
> the best mercury efforts not quite getting to within a
> factor of 3 of the C runtimes, whereas running asm_fast.gc
> tests on my Windows box (450MHz PIII, 128MB) I get to
> within a factor of 2.5
> 
> If you were to hazard a guess at where the loss of
> efficiency lies, I'd love to hear it.  Looking at the
> Mercury code, I can't see what it's doing that the C
> isn't (other than the C is liberally sprinkled with
> register vars and makes heavy use of globals - which 
> I would have thought would be a loser.)
> 
> The first version I wrote was an attempt at transliterating
> the C into Mercury (horrid experience).  That ran the same
> speed as the best clean version.
> 
> I have a colleague who transliterated the C into Java and,
> using state-of-the-art compilers & JVMs got to within a few
> tens of per cent of the C runtime.
> 
> Cheers,
> 
> Ralph

I had a look at the MLDS code, and the main problem turned out to be
poor inlining.  There were several causes of this.  One is that
inlining of `pragma c_code' is currently disabled in the MLDS
back-end.  Another is that the default inlining heuristics set the
inlining limits too low.  And the last is that we weren't doing
a good job of handling transitive intermodule inlining.
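
As a rough illustration of how those three causes interact, here is a
toy model of an inlining decision.  It is written in Python purely for
illustration and is not the Mercury compiler's actual heuristic: the
`Proc' fields, the `can_inline' helper, and all the sizes and limits
are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    name: str
    size: int              # rough cost of the body, in arbitrary units
    body_visible: bool     # False if the body is in a module we cannot see into
    foreign: bool = False  # True for a `pragma c_code'-style procedure


def can_inline(proc: Proc, size_limit: int, allow_foreign: bool) -> bool:
    """Toy rule: inline only if every one of the three conditions holds."""
    if not proc.body_visible:
        # Cause 3: the callee's body was never made available to the caller.
        return False
    if proc.foreign and not allow_foreign:
        # Cause 1: inlining of foreign-code procedures is switched off.
        return False
    # Cause 2: the body must fit under the size limit.
    return proc.size <= size_limit


# Hypothetical callee: a small foreign-code procedure in another module.
hidden = Proc("read_byte", size=12, body_visible=False, foreign=True)
visible = Proc("read_byte", size=12, body_visible=True, foreign=True)

print(can_inline(hidden, size_limit=100, allow_foreign=True))    # body not visible
print(can_inline(visible, size_limit=100, allow_foreign=False))  # foreign inlining off
print(can_inline(visible, size_limit=10, allow_foreign=True))    # limit too low
print(can_inline(visible, size_limit=100, allow_foreign=True))   # all three fixed
```

The point of the sketch is only that any one of the three conditions
is enough to block inlining of a call, so all three have to be
addressed before a hot call like this one gets inlined.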

To see what the impact of these was, I re-enabled `pragma c_code'
inlining (it works for this benchmark, but fails for others; fixing
things so that it works in all cases is not too hard, but just hasn't
been done), added some extra compilation options to increase the
inlining limits, and added an extra `import_module bmio' declaration
to compress5, to enable inlining of bmio__read_byte.
I got the following results:

asm_fast.gc:
	******* compress5 x 10 *******
	[Time: +0.000s, 16.229s, D Stack: 0.035k, ND Stack: 0.039k, C Stack: 10.023k,
	#GCs: 120, Heap used since last GC: 1449.684k, Total used: 9892.000k]
	[Time: +7.730s, 23.959s, D Stack: 0.023k, ND Stack: 0.039k, C Stack: 10.023k,
	#GCs: 126, Heap used since last GC: 539.074k, Total used: 11940.000k]

hlc.gc:
	******* compress5 x 10 *******
	[Time: +0.000s, 11.809s, C Stack: 0.129k,
	#GCs: 521, Heap used since last GC: 89.980k, Total used: 3244.000k]
	[Time: +4.460s, 16.269s, C Stack: 0.105k,
	#GCs: 540, Heap used since last GC: 1078.148k, Total used: 3244.000k]

Much better!  The hlc.gc time is now less than 1.5 times that of C
(4.46s versus 3.02s).
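
As a quick check of that arithmetic, the ratios can be recomputed from
the timings reported above.  The sketch below is in Python purely for
illustration (it is not part of the benchmark), and the helper name
`slowdown' is made up here.  The asm_fast.gc "before" figure uses the
10.91s reading from the first log.

```python
# Times in seconds, copied from the results reported in this message.
C_TIME = 3.02

BEFORE = {"hlc.gc": 9.31, "asm_fast.gc": 10.91}  # before the inlining fixes
AFTER = {"hlc.gc": 4.46, "asm_fast.gc": 7.73}    # after the inlining fixes


def slowdown(mercury_time: float, c_time: float = C_TIME) -> float:
    """How many times slower than the C version a given run is."""
    return mercury_time / c_time


for grade in BEFORE:
    before_ratio = slowdown(BEFORE[grade])
    after_ratio = slowdown(AFTER[grade])
    speedup = BEFORE[grade] / AFTER[grade]
    print(f"{grade:12s} before: {before_ratio:.2f}x C   "
          f"after: {after_ratio:.2f}x C   speedup: {speedup:.2f}x")
```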

You can also speed things up by a couple of per cent by disabling
array bounds checks, but it doesn't make very much difference.
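
For a sense of scale, the five C timings quoted earlier vary by more
than a couple of per cent from run to run, so an improvement of that
size is hard to distinguish from measurement noise on a single run.
A minimal sketch of that comparison, in Python for illustration only,
using the five readings from the quoted message:

```python
from statistics import mean

# The five C timings (seconds) from the quoted message.
c_runs = [2.97, 3.16, 3.08, 3.24, 2.97]

avg = mean(c_runs)
spread = max(c_runs) - min(c_runs)  # fastest run vs. slowest run
relative_spread = spread / avg      # spread as a fraction of the mean

print(f"mean {avg:.3f}s, spread {spread:.2f}s "
      f"({relative_spread:.1%} of the mean)")
```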

My guess is that most of the remaining overhead is due to GC.

-- 
Fergus Henderson <fjh at cs.mu.oz.au>  |  "I have always known that the pursuit
                                    |  of excellence is a lethal habit"
WWW: <http://www.cs.mu.oz.au/~fjh>  |     -- the last words of T. S. Garp.
--------------------------------------------------------------------------
mercury-developers mailing list
Post messages to:       mercury-developers at cs.mu.oz.au
Administrative Queries: owner-mercury-developers at cs.mu.oz.au
Subscriptions:          mercury-developers-request at cs.mu.oz.au
--------------------------------------------------------------------------