[mercury-users] Additional comments to my today's talk
Ondrej Bojar
bojar at csse.unimelb.edu.au
Thu Nov 22 21:01:44 AEDT 2007
Hi, all.
In case anyone is interested, the slides of my talk on modules I created for my
natural language processing projects are available at:
http://ufal.mff.cuni.cz/~bojar/publications/2007-PRES-bojar_mercury_2007-talk.pdf
To answer Julien's question about data sizes, here is a sample experiment I ran:
All these tinycdb databases of pickled Mercury objects were used (simultaneously):
340M models/t_pre001.557.multiprob_map
420M models/t_pre001.55B.multiprob_map
240M models/t_pre001.55F.multiprob_map
11M models/t_wor001.563.gdbm
6.4M models/t_wor001.569.gdbm
Translated 964 sentences (0 had no translation) in 42.41m (2.6+-1.7s/sent,
worst: 71 in 12.15s), stacksize 100.
Real time of Translating ../../corpora/wmtdevtest.en.in: 192.40s (3.21m,
13.9% Mem)
(Over the last two months I have averaged about 50 experiments per month. I'd
guess about half of them are sanity checks and the other half are probes
towards better models.)
The training phase ran beforehand (its results are cached) and is not included
in the time reported. All the models are based on a 90 MB text file compactly
representing sentence structure and English-Czech alignment. As mentioned, I'm
planning to move to about 20 times as much data.
Real time (3.21 minutes) is shorter than total time (42.41m) because the run
was split into 20 parallel jobs. Some experiments also take a day of wallclock
time even with 20 jobs (but part of that is my laziness).
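
As a rough check of those two figures, here is a small Python sketch of the speedup they imply. It assumes the 42.41m is time summed over all jobs; the variable names are mine, and reading the shortfall from 20x as scheduling overhead or uneven job lengths is only a guess.

```python
# Figures copied from the log above; everything else is derived from them.
total_minutes = 42.41          # translation time summed over all jobs
wall_clock_seconds = 192.40    # real time of the whole run
parallel_jobs = 20

wall_clock_minutes = wall_clock_seconds / 60
speedup = total_minutes / wall_clock_minutes
efficiency = speedup / parallel_jobs

print(f"wall clock: {wall_clock_minutes:.2f} min")
print(f"speedup:    {speedup:.1f}x")
print(f"efficiency: {efficiency:.0%}")
```

So the 20 jobs bought a speedup of roughly 13x, i.e. about two thirds of the ideal.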
Each of our ~40*4 CPUs is an Intel(R) Xeon(R) CPU at 3.00GHz, and each machine
has 16 GB RAM, so with one job per CPU no job should go above 4 GB RAM. In
fact, I had to implement 'co-operative suicide' to prevent a complete node
freeze if some of the jobs allocate too much memory. SGE could probably be set
up to perform 'pre-emptive jobicide', but our admins don't like the idea.
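
To illustrate what I mean by 'co-operative suicide', here is a minimal Python sketch of the idea, not my actual Mercury code. The 4 GB cap, the function names, and the exit status are all assumptions made for the example; the only library calls used are `resource.getrusage` and `sys.exit`.

```python
import resource
import sys

# Hypothetical per-job cap: 16 GB per machine shared by 4 jobs.
LIMIT_BYTES = 4 * 1024**3


def peak_rss_bytes():
    """Peak resident set size of this process, in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


def over_limit(used_bytes, limit_bytes=LIMIT_BYTES):
    """True when the job should give up rather than starve the node."""
    return used_bytes > limit_bytes


def check_memory_or_exit():
    """Call between work items, e.g. after each translated sentence."""
    used = peak_rss_bytes()
    if over_limit(used):
        sys.stderr.write(f"memory limit exceeded ({used} bytes), exiting\n")
        sys.exit(1)  # assumed exit status; any non-zero value would do
```

The check is voluntary: it only helps if every job calls `check_memory_or_exit()` often enough, which is why a limit enforced by the scheduler would be more robust.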
This reminds me of another reason to avoid interfacing other languages via
forking: imagine 4 jobs running, each taking 20% of RAM, and each suddenly
forking to perform a part of the computation. Even if the spawned process is
tiny, until the actual 'exec' the kernel has to reserve twice the current
memory consumption of the forking job. (My current SGE interface has exactly
this issue; what I should be doing is an API call.)
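
The arithmetic behind that scenario, as a small Python sketch. It models only the pessimistic case described above, where each forked child is charged a full copy of its parent until 'exec'; what a real kernel does depends on its overcommit settings and on copy-on-write, and the function name is made up for the example.

```python
def worst_case_commit_percent(jobs, percent_per_job, forking_jobs):
    """Memory to be accounted for, as a percentage of RAM, if every
    forked child counts as a full copy of its parent until exec."""
    baseline = jobs * percent_per_job
    duplicated = forking_jobs * percent_per_job
    return baseline + duplicated


# 4 jobs at 20% of RAM each, with 0 to 4 of them forking at the same moment.
for forking in range(5):
    print(forking, worst_case_commit_percent(4, 20, forking))
```

With these numbers a single fork already brings the total to 100% of RAM, and all four forking at once would need 160%.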
Cheers, Ondrej.
--------------------------------------------------------------------------
mercury-users mailing list
Post messages to: mercury-users at csse.unimelb.edu.au
Administrative Queries: owner-mercury-users at csse.unimelb.edu.au
Subscriptions: mercury-users-request at csse.unimelb.edu.au
--------------------------------------------------------------------------