
Message boards : Number crunching : Need more space. Again

[VENETO] boboviz
Joined: 10 Sep 10
Posts: 158
Credit: 388,132
RAC: 0
Message 50332 - Posted: 29 Aug 2018 | 15:52:27 UTC

I had been crunching quietly on my Linux VM until today.
Now, suddenly, this message:

Quantum Chemistry needs 13195.37MB more disk space. You currently have 15414.86 MB available and it needs 28610.23 MB.


30 GB for a WU? Are you kidding?
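The three figures in the client message above can be cross-checked with a few lines of Python. This is only a sketch: the function name `parse_disk_message` is made up for illustration, and it assumes the message always lists shortfall, available, and required space in that order, as it does in the quoted text.

```python
import re

# The client message quoted in the post above, copied verbatim.
MESSAGE = (
    "Quantum Chemistry needs 13195.37MB more disk space. "
    "You currently have 15414.86 MB available and it needs 28610.23 MB."
)


def parse_disk_message(message):
    """Return (shortfall_mb, available_mb, required_mb) from the message text.

    Hypothetical helper: assumes the three figures appear in this order,
    each followed by "MB" (with or without a space).
    """
    values = [float(v) for v in re.findall(r"(\d+(?:\.\d+)?)\s*MB", message)]
    if len(values) != 3:
        raise ValueError("expected three MB figures in the message")
    shortfall_mb, available_mb, required_mb = values
    return shortfall_mb, available_mb, required_mb


shortfall_mb, available_mb, required_mb = parse_disk_message(MESSAGE)
# The shortfall should simply be required minus available.
print(f"required - available = {required_mb - available_mb:.2f} MB")
print(f"reported shortfall   = {shortfall_mb:.2f} MB")
```

The point is only that the "more disk space" figure is the difference between the task's declared requirement and what the client is allowed to use, so freeing roughly that much (or raising the client's disk limit) should clear the message.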

PappaLitto
Joined: 21 Mar 16
Posts: 511
Credit: 4,672,242,755
RAC: 0
Message 50333 - Posted: 29 Aug 2018 | 16:37:08 UTC - in response to Message 50332.

In a previous message, Stefan mentioned that these new WUs are much larger simulations that require much larger data sets. It's just the nature of the beast.

[VENETO] boboviz
Message 50334 - Posted: 30 Aug 2018 | 7:27:07 UTC - in response to Message 50333.

OK, I changed the VM disk and now have 40 GB free.
But the WUs are stuck at 10%: "Waiting to run"

PappaLitto
Message 50339 - Posted: 30 Aug 2018 | 10:49:18 UTC - in response to Message 50334.

OK, I changed the VM disk and now have 40 GB free.
But the WUs are stuck at 10%: "Waiting to run"

These WUs use an extremely large amount of memory. Most likely they have completely filled your RAM and are spilling into swap (hard drive), so they cannot run at full speed. I suggest everyone simply wait: unless you have more RAM, there is no way to speed this up.
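Whether a Linux host really is spilling into swap can be checked from `/proc/meminfo`. The sketch below is an illustration only: the helper names are made up, and the `SAMPLE` figures are invented for a hypothetical 6 GB VM, not taken from anyone's machine in this thread.

```python
# Invented /proc/meminfo excerpt for a hypothetical 6 GB VM (values in kB).
SAMPLE = """\
MemTotal:        6291456 kB
MemFree:          102400 kB
MemAvailable:     204800 kB
SwapTotal:       2097152 kB
SwapFree:        1048576 kB
"""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_kb(fields):
    """Swap currently in use, in kB (0 if the host has no swap)."""
    return fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)


def read_meminfo(path="/proc/meminfo"):
    """Read the live values on a Linux host (not called in this demo)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())


info = parse_meminfo(SAMPLE)
print(f"swap in use:      {swap_used_kb(info) / 1024:.0f} MiB")
print(f"memory available: {info['MemAvailable'] / 1024:.0f} MiB")
```

On a real host, `read_meminfo()` would replace the sample text. A large swap-in-use figure together with a small `MemAvailable` would be consistent with the explanation above; an empty swap would point elsewhere.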

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50340 - Posted: 30 Aug 2018 | 11:49:05 UTC

I have aborted one which seemed to go on forever. Now I have another running and one waiting to run. Let's hope they complete.
Tullio

[VENETO] boboviz
Message 50341 - Posted: 30 Aug 2018 | 12:12:57 UTC - in response to Message 50339.

These WUs use an extremely large amount of memory. Most likely they have completely filled your RAM and are spilling into swap (hard drive), so they cannot run at full speed.


6 GB of RAM is not enough for a single WU?
I think they have to work on the app code...

tullio
Message 50344 - Posted: 30 Aug 2018 | 12:58:43 UTC - in response to Message 50341.

I also have 8 GB of RAM on the laptop. Two tasks have completed so far, and another is running. I aborted one which was going on endlessly.
Tullio

mmonnin
Joined: 2 Jul 16
Posts: 337
Credit: 7,635,065,645
RAC: 10,764,900
Message 50351 - Posted: 30 Aug 2018 | 19:56:59 UTC

These need 30 GB of disk space and 6 GB of memory? Wow.

PappaLitto
Message 50352 - Posted: 30 Aug 2018 | 22:35:31 UTC

So my Ryzen 7 1700 system runs 4 of these WUs at once. I had 8 GB of RAM and ran the old WUs with no problem: 100% CPU usage and ~6 GB of RAM usage.

With these new WUs I was getting extremely low CPU usage with 4 running, and RAM was maxed out. So I switched to 16 GB of RAM. It now uses ~13 GB of RAM, but I still have very low CPU usage. Why is this? It's not maxing out RAM, so is it reading from the SSD constantly? Swap is completely empty as well. I would assume the application would keep all the vital data in RAM until it couldn't anymore.
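One way to test the "is it waiting on the SSD?" guess on Linux is to compare two samples of the aggregate `cpu` line in `/proc/stat` and see what share of the elapsed CPU time was spent in I/O wait. The following is a rough sketch under that assumption; the function names and the two sample lines are invented for illustration and are not measurements from this system.

```python
def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of jiffies."""
    parts = stat_line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in parts[1:]]


def iowait_fraction(before_line, after_line):
    """Share of CPU time spent in iowait between two samples (0.0 to 1.0)."""
    before = cpu_times(before_line)
    after = cpu_times(after_line)
    deltas = [a - b for a, b in zip(after, before)]
    # Field order: user nice system idle iowait irq softirq steal ...
    # Guest time is already counted inside user/nice, so sum only the first 8.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return deltas[4] / total


def read_cpu_line(path="/proc/stat"):
    """Read the live aggregate line on a Linux host (not called in this demo)."""
    with open(path) as handle:
        return handle.readline()


# Invented samples, taken "a few seconds apart" on a hypothetical host.
BEFORE = "cpu  1000 0 500 8000 500 0 0 0 0 0"
AFTER = "cpu  1100 0 550 8200 1150 0 0 0 0 0"

print(f"iowait share: {iowait_fraction(BEFORE, AFTER):.0%}")
```

A high iowait share alongside low CPU usage would suggest the tasks are mostly waiting on scratch-file reads and writes rather than on memory. The stack trace later in this thread does show the application inside a disk read (`PSIO::rw`, `DiskDFJK`), which fits that picture, but this sketch cannot confirm it for any particular machine.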

tullio
Message 50357 - Posted: 31 Aug 2018 | 13:28:33 UTC

I get 197% CPU usage if no other task is running, 173% if a GPU task is running alongside. I have only two-core CPUs, on both the Linux laptop and the Linux SUN workstation. The Windows 10 PC has 4 cores, but all GPU tasks overheat its GTX 1050 Ti, which is not overclocked. Einstein@home GPU tasks run fine on it.
Tullio

[VENETO] boboviz
Message 50358 - Posted: 31 Aug 2018 | 16:35:39 UTC
Last modified: 31 Aug 2018 | 16:37:13 UTC

OK. I went from 6 to 8 GB on the VM.
Now the WU doesn't "pause", but it still remains at 10% until the remaining time runs out. When the time reaches 0, the WU immediately jumps to 100% and continues to crunch... and CPU usage is 0%.

tullio
Message 50359 - Posted: 31 Aug 2018 | 19:10:12 UTC - in response to Message 50358.

Toni in a News post has explained why the task progress remains at 10%. But my tasks end and report when they finish.
Tullio

[VENETO] boboviz
Message 50379 - Posted: 3 Sep 2018 | 12:53:59 UTC - in response to Message 50359.

Toni in a News post has explained why the task progress remains at 10%. But my tasks end and report when they finish.
Tullio


OK, I've read the post, and I'm leaving the WU crunching.
But I get this error:
09:28:46 (2923): wrapper (7.7.26016): starting
09:28:46 (2923): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -n qmml2 --override-channels -c defaults -c gpugrid --file requirements.txt ")
Python 3.6.5 :: Anaconda, Inc.


==> WARNING: A newer version of conda exists. <==
current version: 4.5.4
latest version: 4.5.11

Please update conda by running

$ conda update -n base conda


09:29:31 (2923): /usr/bin/flock exited; CPU time 39.730884
09:29:31 (2923): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/envs/qmml2/bin/python (run.py)
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpcm.so.1 00007F1FC12EE24F Unknown Unknown Unknown
libpthread-2.19.s 00007F1FDC054330 Unknown Unknown Unknown
libpthread-2.19.s 00007F1FDC0533AD read Unknown Unknown
core.so 00007F1FCAA5F503 _ZN3psi4PSIO2rwEm Unknown Unknown
core.so 00007F1FCA561B28 _ZN3psi8DiskDFJK1 Unknown Unknown
core.so 00007F1FCA55E3DB _ZN3psi8DiskDFJK1 Unknown Unknown
core.so 00007F1FCA44FD23 _ZN3psi2JK7comput Unknown Unknown
core.so 00007F1FCA20A9FC Unknown Unknown Unknown
core.so 00007F1FCA1E5092 Unknown Unknown Unknown
core.so 00007F1FCA1EDC89 Unknown Unknown Unknown
core.so 00007F1FC8878881 Unknown Unknown Unknown
core.so 00007F1FC84A99B6 Unknown Unknown Unknown
python3.6 00007F1FDC595B94 _PyCFunction_Fast Unknown Unknown
python3.6 00007F1FDC6257CE Unknown Unknown Unknown
python3.6 00007F1FDC647CBA _PyEval_EvalFrame Unknown Unknown
python3.6 00007F1FDC620459 PyEval_EvalCodeEx Unknown Unknown
python3.6 00007F1FDC621376 Unknown Unknown Unknown
python3.6 00007F1FDC59599E PyObject_Call Unknown Unknown
python3.6 00007F1FDC649470 _PyEval_EvalFrame Unknown Unknown
python3.6 00007F1FDC620459 PyEval_EvalCodeEx Unknown Unknown
python3.6 00007F1FDC621376 Unknown Unknown Unknown
python3.6 00007F1FDC59599E PyObject_Call Unknown Unknown
python3.6 00007F1FDC649470 _PyEval_EvalFrame Unknown Unknown
python3.6 00007F1FDC620A9E PyEval_EvalCodeEx Unknown Unknown
python3.6 00007F1FDC621376 Unknown Unknown Unknown
python3.6 00007F1FDC59599E PyObject_Call Unknown Unknown
python3.6 00007F1FDC649470 _PyEval_EvalFrame Unknown Unknown
python3.6 00007F1FDC61EDAE Unknown Unknown Unknown
python3.6 00007F1FDC61F941 Unknown Unknown Unknown
python3.6 00007F1FDC625755 Unknown Unknown Unknown
python3.6 00007F1FDC648A7A _PyEval_EvalFrame Unknown Unknown
python3.6 00007F1FDC61EA94 Unknown Unknown Unknown
python3.6 00007F1FDC61F941 Unknown Unknown Unknown
python3.6 00007F1FDC625755 Unknown Unknown Unknown
python3.6 00007F1FDC648A7A _PyEval_EvalFrame Unknown Unknown
python3.6 00007F1FDC620459 PyEval_EvalCodeEx Unknown Unknown
python3.6 00007F1FDC6211EC PyEval_EvalCode Unknown Unknown
python3.6 00007F1FDC69B9A4 Unknown Unknown Unknown
python3.6 00007F1FDC69BDA1 PyRun_FileExFlags Unknown Unknown
python3.6 00007F1FDC69BFA4 PyRun_SimpleFileE Unknown Unknown
python3.6 00007F1FDC69FA9E Py_Main Unknown Unknown
python3.6 00007F1FDC5674BE main Unknown Unknown
libc-2.19.so 00007F1FDBC9CF45 __libc_start_main Unknown Unknown
python3.6 00007F1FDC64E773 Unknown Unknown Unknown
18:01:43 (1686): wrapper (7.7.26016): starting
18:01:43 (1686): wrapper (7.7.26016): starting
18:01:43 (1686): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/envs/qmml2/bin/python (run.py)
18:29:54 (1686): $PROJECT_DIR/miniconda/envs/qmml2/bin/python exited; CPU time 5398.666724
13:57:32 (1694): wrapper (7.7.26016): starting
13:57:32 (1694): wrapper (7.7.26016): starting
14:46:12 (1694): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>6952_16_18_20_23_47359a98_n00001-SDOERR_SELE2-0-1-RND9339_0_1</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50380 - Posted: 3 Sep 2018 | 13:03:41 UTC - in response to Message 50379.

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>6952_16_18_20_23_47359a98_n00001-SDOERR_SELE2-0-1-RND9339_0_1</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>

They have already been notified.
http://www.gpugrid.net/forum_thread.php?id=4785&nowrap=true#50349

[VENETO] boboviz
Message 50388 - Posted: 4 Sep 2018 | 6:20:02 UTC - in response to Message 50380.

They have already been notified.
http://www.gpugrid.net/forum_thread.php?id=4785&nowrap=true#50349


And no answer....

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50550 - Posted: 19 Sep 2018 | 0:55:10 UTC

So I just discovered something interesting.

I'm running 4 GPUs along with 2 CPU work units. 64 GB in the machine.

Virtual memory for the GPU tasks is 32 GB each.......

Not sure how much virtual memory the CPU tasks are using, but even if it's, say, 45 GB, I can see why I'm running out of space on my SSD.

4 x 32 = 128 GB. The SSD is only 120 GB total. So yeah, I think it's time to upgrade the SSD.....

tullio
Message 50551 - Posted: 19 Sep 2018 | 1:22:41 UTC - in response to Message 50550.

Virtual memory size is 4.15 GB on my 8 GB RAM on a Linux box. But more than 700 GB are available to BOINC of a 1 TB disk. I had SSDs on my HP laptop and they all failed. I installed a hybrid disk on it with 8 MB SSD out of 1 TB and it works.
Tullio

Zalster
Message 50552 - Posted: 19 Sep 2018 | 3:21:48 UTC - in response to Message 50551.

Virtual memory size is 4.15 GB on my 8 GB RAM on a Linux box. But more than 700 GB are available to BOINC of a 1 TB disk. I had SSDs on my HP laptop and they all failed. I installed a hybrid disk on it with 8 MB SSD out of 1 TB and it works.
Tullio


I'm getting old..haha... My memory is faulty. Turns out that I had already upgraded that SSD to a 500GB. So that leaves the question of why it's running out of space. I turned off 1 QC work unit, leaving only 1 running along with the 4 GPUs. Better for the temps on the CPU, they were spiking to 100C, now they are back down to 60C-70C

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 50554 - Posted: 19 Sep 2018 | 8:25:04 UTC

A single WU should only use 4 GB of RAM. If it uses significantly more than that (e.g. 6 GB), report it to me, because it might be a bug in the software.

Other than that, yes, I mentioned that some of the WUs I have now submitted might use up to 50 GB of scratch space on disk. There is not much I can do about it other than simply not simulating them.

[VENETO] boboviz
Message 50557 - Posted: 19 Sep 2018 | 9:31:40 UTC - in response to Message 50552.
Last modified: 19 Sep 2018 | 9:31:52 UTC

Turns out that I had already upgraded that SSD to a 500GB. So that leaves the question of why it's running out of space.


New beta WUs on my Linux VM:

<![CDATA[
<message>
Maximum disk usage exceeded
</message>

Stefan
Message 50558 - Posted: 19 Sep 2018 | 9:54:33 UTC - in response to Message 50557.

Right, we have now increased the limit. It should not happen much anymore.

tullio
Message 50560 - Posted: 19 Sep 2018 | 13:45:25 UTC

First error
Stderr output

<core_client_version>7.8.3</core_client_version>
<![CDATA[
<message>
Disk usage limit exceeded</message>
<stderr_txt>

Zalster
Message 50561 - Posted: 19 Sep 2018 | 13:52:03 UTC - in response to Message 50560.

Yes, I just checked my error count: 120 total, up from 55 before. So a lot of work units errored out before I started to get ones that complete. I'm guessing they errored before he made the change. I will keep an eye on results to see if the new parameters make a difference to completion and validation.

Thanks Stefan

Stefan
Message 50568 - Posted: 20 Sep 2018 | 8:13:44 UTC - in response to Message 50561.

Yes, please inform me if they still crash due to disk space limitations (not due to upload file size limit, that's a different issue).

[VENETO] boboviz
Message 50581 - Posted: 21 Sep 2018 | 10:47:09 UTC - in response to Message 50568.
Last modified: 21 Sep 2018 | 10:51:02 UTC

Yes, please inform me if they still crash due to disk space limitations (not due to upload file size limit, that's a different issue).


I have 45 GB of free space on my VM for BOINC, but I get this message:
Quantum chemistry needs 11087.61 MB more disk space

Stefan
Message 50585 - Posted: 21 Sep 2018 | 13:30:45 UTC - in response to Message 50581.

Yes, we now require 60 GB of disk space for one WU.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 50586 - Posted: 21 Sep 2018 | 13:37:05 UTC - in response to Message 50585.
Last modified: 21 Sep 2018 | 13:40:46 UTC

To clarify: only a few WUs will actually use that disk space. But we need to set a maximum.

In summary, resource occupation for each QC:

* 4 GB memory max
* 4 threads max (fewer if not available)
* Large-ish (up to 60 GB, likely much less) temporary disk space while running

Additionally:

* Moderate (~3 GB) disk space for downloading and storing the app (can be reclaimed by resetting the project)
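The limits in the summary above can be turned into a quick host check. This is a minimal sketch, not project code: the constant and function names are made up, it budgets for the worst case (the full 60 GB of scratch per task, which the post says only a few WUs will actually use), and it assumes the ~3 GB app is stored once and shared by all tasks.

```python
import shutil

# Worst-case figures from the summary above, per Quantum Chemistry task.
MAX_MEMORY_GB = 4
MAX_SCRATCH_GB = 60
APP_GB = 3  # approximate one-off download; assumed shared by all tasks


def disk_needed_gb(n_tasks, app_installed=False):
    """Worst-case disk budget in GB for n_tasks concurrent tasks."""
    scratch = n_tasks * MAX_SCRATCH_GB
    return scratch if app_installed else scratch + APP_GB


def can_run(n_tasks, free_disk_gb, free_mem_gb, app_installed=False):
    """True if the host covers the worst case for n_tasks concurrent tasks."""
    enough_disk = free_disk_gb >= disk_needed_gb(n_tasks, app_installed)
    enough_memory = free_mem_gb >= n_tasks * MAX_MEMORY_GB
    return enough_disk and enough_memory


def free_disk_gb(path="."):
    """Free space in GB on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


# The 45 GB / 8 GB VM mentioned earlier in the thread, one task at a time.
print("45 GB free, 8 GB RAM, 1 task:", can_run(1, 45, 8))
print("free disk here (GB):", round(free_disk_gb(), 1))
```

Because the client enforces the declared maximum rather than typical usage, a host can be refused work even when most tasks would have fit, which matches the 45 GB report above.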

[VENETO] boboviz
Message 50622 - Posted: 28 Sep 2018 | 15:41:52 UTC - in response to Message 50586.

OK, I'll wait for a Windows version, hoping it will require less space.
