[mercury-users] Additional comments to my today's talk

Ondrej Bojar bojar at csse.unimelb.edu.au
Thu Nov 22 21:01:44 AEDT 2007

Hi, all.

In case anyone is interested, the slides of my talk on modules I created for my 
natural language processing projects are available at:

To answer Julien's question about data sizes, here is a sample experiment I ran.
It used all of the following tinycdb databases of pickled Mercury objects
simultaneously:

340M    models/t_pre001.557.multiprob_map
420M    models/t_pre001.55B.multiprob_map
240M    models/t_pre001.55F.multiprob_map
11M     models/t_wor001.563.gdbm
6.4M    models/t_wor001.569.gdbm

Translated 964 sentences (0 had no translation) in 42.41m (2.6+-1.7s/sent,
worst: 71 in 12.15s), stacksize 100.
Real time of Translating ../../corpora/wmtdevtest.en.in:        192.40s (3.21m,
13.9% Mem)

(Over the last two months I have averaged about 50 experiments/month. I'd guess
about half of them are sanity checks and the other half are probes towards better 

The training phase ran beforehand (its results are cached) and is not included
in the reported time. All the models are derived from a 90 MB text file that
compactly represents sentence structure and English-Czech alignment. As
mentioned, I'm planning to move to about 20 times as much data.
Real time (3.21 minutes) is shorter than total time (42.41m) because the run was
split into 20 parallel jobs. Some experiments also take a day of wallclock time
even with 20 jobs (but part of that is my laziness).
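
As a quick sanity check of the figures in the log above, here is a small Python
sketch of the arithmetic. It assumes that the 42.41m "total time" is the sum of
the per-job translation times and that the 20 jobs were evenly loaded; neither
is stated in the log, so treat the efficiency figure as a rough estimate only.

```python
# Figures copied from the log above.
total_time_min = 42.41   # summed translation time (assumed: over all jobs)
wallclock_s = 192.40     # real time reported for the whole run
sentences = 964
jobs = 20

avg_s_per_sentence = total_time_min * 60 / sentences
# Wallclock time if the work split perfectly across the jobs (an assumption).
ideal_wallclock_s = total_time_min * 60 / jobs
efficiency = ideal_wallclock_s / wallclock_s

print(f"{avg_s_per_sentence:.2f} s/sentence")           # 2.64 s/sentence
print(f"{ideal_wallclock_s:.1f} s ideal wallclock")     # 127.2 s ideal wallclock
print(f"{efficiency:.0%} of the ideal 20-way speedup")  # 66% of the ideal 20-way speedup
```

The average agrees with the 2.6 s/sent in the log. The gap between the ideal
127 s and the measured 192 s would be consistent with per-job startup costs
(such as opening the databases) and uneven job lengths, but the log does not
say which.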

Each of our ~40*4 CPUs is an Intel(R) Xeon(R) at 3.00GHz. Each machine has
16 GB of RAM, so with four jobs per machine no job should go above 4 GB. In
fact, I had to implement 'co-operative suicide' to prevent a complete node
freeze if some of the jobs allocate too much memory. SGE could probably be set
up to perform 'pre-emptive jobicide', but our admins don't like the idea.
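
To illustrate the 'co-operative suicide' idea, here is a minimal Python sketch
(not my actual implementation, which lives in the Mercury code). It assumes
that each job checks its own peak resident set size between work items and
exits voluntarily once it exceeds the 4 GB budget. The function names and the
place where the check is called are hypothetical.

```python
import resource
import sys

GIB = 1024 ** 3
LIMIT_BYTES = 4 * GIB  # per-job budget of 4 GB, as in the text above


def peak_rss_bytes():
    """Peak resident set size of this process, in bytes.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak if sys.platform == "darwin" else peak * 1024


def over_budget(rss_bytes, limit_bytes=LIMIT_BYTES):
    """Pure decision function, kept separate so it is easy to test."""
    return rss_bytes > limit_bytes


def check_memory_or_exit(limit_bytes=LIMIT_BYTES):
    """Hypothetical hook: call between work items (e.g. between sentences)."""
    rss = peak_rss_bytes()
    if over_budget(rss, limit_bytes):
        # Exit voluntarily instead of letting the whole node start swapping.
        sys.exit(f"giving up: peak RSS {rss / GIB:.1f} GiB "
                 f"exceeds {limit_bytes / GIB:.1f} GiB")
```

A job would call `check_memory_or_exit()` at a convenient point in its main
loop. An alternative with the same goal is to let the kernel enforce the limit
via `resource.setrlimit(resource.RLIMIT_AS, ...)`, in which case allocations
fail instead of the job deciding to quit.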

This reminds me of another reason to avoid interfacing with other languages via
forking: imagine 4 jobs running, each taking 20 % of RAM, and each suddenly
forking to perform a part of the computation. Even if the spawned process is
tiny, until the actual 'exec' the kernel may have to reserve up to twice the
current memory consumption of the job. (My current SGE interface has exactly
this issue; what I should be doing is an API call.)
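
The worst case in the example above can be written down as a small Python
sketch. It assumes the pessimistic accounting in which a forked child counts as
a full copy of its parent until it calls 'exec'; how much the kernel really
reserves depends on its copy-on-write and overcommit settings, so this is an
upper bound, not a measurement. The function name is hypothetical.

```python
def worst_case_commit_percent(jobs, percent_per_job, jobs_in_fork):
    """Upper bound on memory the kernel may have to account for, in % of RAM.

    Each running job counts once; each job currently between fork and exec
    counts a second time, because its child is (pessimistically) a full copy.
    """
    return (jobs + jobs_in_fork) * percent_per_job


# Four jobs at 20 % of RAM each, none forking: comfortably within RAM.
print(worst_case_commit_percent(4, 20, 0))  # 80
# The same four jobs all forking at once: well beyond physical RAM.
print(worst_case_commit_percent(4, 20, 4))  # 160
```

With the jobs at 20 % each, the node goes from 80 % to a potential 160 % of RAM
the moment all four fork, even though none of the children does any real work
before its 'exec'.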

Cheers, Ondrej.