
Message boards : Number crunching : App restarts after being suspended and restarted.

jm7
Joined: 2 Aug 22
Posts: 1
Credit: 6,825,500
RAC: 0
Message 59666 - Posted: 29 Dec 2022 | 1:58:27 UTC

I suspended a Python Apps for GPU hosts 4.04 (cuda1131) task for a few minutes to allow some other tasks to finish, and the Time counter started at 0 again. It was at 3 days and a couple of hours. You really need to checkpoint more often than once every 3 days.

It appears to have been stuck at 4% for over a day now.
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1341
Credit: 7,688,699,064
RAC: 13,300,537
Message 59667 - Posted: 29 Dec 2022 | 4:38:56 UTC - in response to Message 59666.

> I suspended a Python Apps for GPU hosts 4.04 (cuda1131) task for a few minutes to allow some other tasks to finish, and the Time counter started at 0 again. It was at 3 days and a couple of hours. You really need to checkpoint more often than once every 3 days.
>
> It appears to have been stuck at 4% for over a day now.

The tasks do in fact checkpoint. Depending on the speed of the system, it takes a few minutes to replay computations back to the last checkpoint.

Upon restart the task will display a low percentage and then jump forward to the last checkpoint percentage. You can check when the last checkpoint was written by viewing the task's properties in the Manager sidebar.
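
If you prefer to check from outside the Manager, another rough way is to look at the modification times of the files in the task's slot directory; whatever was written most recently shows roughly when the last checkpoint happened. A minimal Python sketch, with the slot path as a placeholder you would need to adjust to your own BOINC data directory and slot number:

# list_slot_files.py - list files in a BOINC slot directory, newest first,
# to estimate when the task last wrote a checkpoint.
import os
import time

SLOT_DIR = r"C:\ProgramData\BOINC\slots\0"   # placeholder path, adjust as needed

entries = []
for name in os.listdir(SLOT_DIR):
    path = os.path.join(SLOT_DIR, name)
    if os.path.isfile(path):
        entries.append((os.path.getmtime(path), name))

for mtime, name in sorted(entries, reverse=True)[:10]:
    print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime)), name)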

On Windows hosts I have heard that stopping a task midstream and restarting it can often hang the task. If that happens, you will see this verbiage repeating over and over

Starting!!
Define rollouts storage
Define scheme
Created CWorker with worker_index 0
Created GWorker with worker_index 0
Created UWorker with worker_index 0
Created training scheme.
Define learner
Created Learner.
Look for a progress_last_chk file - if exists, adjust target_env_steps
Define train loop
11:06:52 (6450): wrapper (7.7.26016): starting
11:06:54 (6450): wrapper (7.7.26016): starting
11:06:54 (6450): wrapper: running bin/python (run.py)

for every restart in the stderr.txt file in the slot that the running task occupies. If you do, the task is likely hung. You can either restart the host and BOINC to see if you can persuade it back into running, or abort it, get another task, and try not to interrupt the new one.
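
If you don't want to scroll through the file by hand, here is a rough Python sketch that counts how many times that wrapper start-up line appears in a slot's stderr.txt. The slot path is a placeholder and the marker string is taken from the log excerpt above; if the count keeps climbing while the checkpoint never advances, the task is most likely hung.

# count_restarts.py - count wrapper start-up lines in a slot's stderr.txt.
# The slot path is a placeholder; the marker comes from the log excerpt above.
import os

SLOT_DIR = r"C:\ProgramData\BOINC\slots\0"    # placeholder, adjust as needed
MARKER = "wrapper: running bin/python (run.py)"

stderr_path = os.path.join(SLOT_DIR, "stderr.txt")
with open(stderr_path, "r", errors="replace") as f:
    restarts = sum(1 for line in f if MARKER in line)

print(f"{stderr_path}: the wrapper has (re)started the task {restarts} time(s)")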

kotenok2000
Joined: 18 Jul 13
Posts: 78
Credit: 128,533,046
RAC: 1,324,206
Message 59669 - Posted: 29 Dec 2022 | 12:27:02 UTC - in response to Message 59667.

It also writes logs to wrapper_run.out.
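
If anyone wants to watch that log live while the task runs, here is a small tail-style Python sketch, assuming wrapper_run.out sits in the running task's slot directory (the path is a placeholder; stop it with Ctrl-C):

# tail_wrapper_log.py - follow wrapper_run.out, similar to "tail -f".
# Assumes the file lives in the task's slot directory; adjust the path as needed.
import os
import time

LOG_PATH = r"C:\ProgramData\BOINC\slots\0\wrapper_run.out"   # placeholder path

with open(LOG_PATH, "r", errors="replace") as f:
    f.seek(0, os.SEEK_END)      # start at the current end of the file
    while True:                 # stop with Ctrl-C
        line = f.readline()
        if line:
            print(line, end="")
        else:
            time.sleep(2)       # wait for new output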

jjch
Joined: 10 Nov 13
Posts: 101
Credit: 15,586,400,388
RAC: 4,234,677
Message 59674 - Posted: 3 Jan 2023 | 4:59:28 UTC - in response to Message 59669.

jm7 - It's not the checkpoints that are the problem. Your tasks are failing with multiple different errors. Your system needs a GPU driver update and a few tune-up adjustments.

Regarding the last task 33220676

http://www.gpugrid.net/result.php?resultid=33220676

This task failed with "RuntimeError: Unable to find a valid cuDNN algorithm to run convolution". That error is somewhat inconclusive; however, it seems to be related to a CUDA error. Your GPU driver is version 512.78 and the latest is version 527.56. Download the driver here: https://www.nvidia.com/download/driverResults.aspx/197460/en-us/

I would recommend fully uninstalling the driver with DDU before you reinstall it. DDU can be found here: https://www.guru3d.com/files-details/display-driver-uninstaller-download.html The download is a bit convoluted to find, but keep drilling down for it.
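
After reinstalling the driver it may be worth confirming that CUDA and cuDNN actually work before letting new tasks start. The GPUgrid tasks bring their own Python environment, so this is only a sanity check of the driver stack; it assumes you have a CUDA-enabled PyTorch installed in some test environment.

# cudnn_sanity_check.py - confirm the driver/CUDA/cuDNN stack works after a reinstall.
# Sanity check only; the GPUgrid tasks ship their own Python environment.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("cuDNN version:", torch.backends.cudnn.version())

    # Run a small convolution on the GPU; this is the operation that failed
    # with "Unable to find a valid cuDNN algorithm" in the task log.
    x = torch.randn(1, 3, 64, 64, device="cuda")
    conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda()
    y = conv(x)
    torch.cuda.synchronize()
    print("Convolution OK, output shape:", tuple(y.shape))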

Regarding the previous task 33221790

http://www.gpugrid.net/result.php?resultid=33221790

This one failed with a RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes.

It's telling you that the system ran out of memory: even a small additional allocation of about 3.5 MB failed, which means memory was already exhausted. There are two things that can cause this. One is the amount of physical memory installed in the system. Your laptop has 16GB, so it's on the low side. If you have the opportunity, I would recommend installing more memory.
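
To put that number in perspective: 3612672 bytes is only about 3.4 MiB, so the machine was already out of memory (or out of commit limit) when the task asked for even that small extra block. A quick way to see how much headroom the system has is psutil, assuming it is installed (pip install psutil):

# memory_headroom.py - show how small the failed allocation was and how much
# physical memory and swap headroom the system currently has. Requires psutil.
import psutil

failed_alloc = 3612672   # bytes, from the task's error message
print(f"Failed allocation: {failed_alloc / 2**20:.1f} MiB")

vm = psutil.virtual_memory()
print(f"Physical RAM: {vm.total / 2**30:.1f} GiB total, "
      f"{vm.available / 2**30:.1f} GiB available ({vm.percent}% used)")

sw = psutil.swap_memory()
print(f"Swap/page file: {sw.total / 2**30:.1f} GiB total, "
      f"{sw.free / 2**30:.1f} GiB free ({sw.percent}% used)")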

Other problems can be caused by the default BOINC settings. Check the Disk and memory tab under Options > Computing preferences. Make sure you are not restricting the disk space. You can set a limit if you have to, but leaving it unrestricted will allow BOINC to use all the available space. This also requires that the disk holding your BOINC data directory has enough free space. Make sure the Memory percentages are high enough for the GPUgrid python tasks; you can bump them up a bit if needed. Also, set the Page/swap file setting to 100%.
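
As a rough worked example of what those memory percentages mean (the numbers below are illustrative, not your actual settings): with 16 GB of RAM, the "use at most X% of memory" preferences cap how much all BOINC tasks together may use.

# boinc_memory_budget.py - illustrative arithmetic for the BOINC memory preferences.
# The percentages are example values, not a recommendation; use your own settings.
ram_gb = 16          # installed physical memory
pct_in_use = 0.90    # example: "use at most 90% of memory" while the computer is in use
pct_idle = 0.95      # example value for when the computer is not in use

print(f"While in use: BOINC tasks may use up to {ram_gb * pct_in_use:.1f} GB")
print(f"While idle:   BOINC tasks may use up to {ram_gb * pct_idle:.1f} GB")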

As for your first task 33222758 that failed

http://www.gpugrid.net/result.php?resultid=33222758

It first ran out of paging/swap space: "ImportError: DLL load failed while importing algos: The paging file is too small for this operation to complete." Then you restarted it and it ran out of memory:
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 6422528 bytes.

You will need to increase your paging file. It should be set to at least 50GB; you can monitor it after setting it and see if that is enough. I'm not quite sure if you can find it the same way in Windows 11, but on Windows 10 it's under Settings > System > About > Advanced system settings > Advanced tab > Performance - Settings... button > Advanced tab > Virtual memory - Change... button. Remove the check for Automatically..., then select the Custom size radio button. The Initial size doesn't matter too much; you can set it to whatever is currently allocated. The main thing is to set Maximum size to at least 51200 MB. It can be more, but you shouldn't need more than 60GB. Hit the Set button and back out with OK, etc.
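
Once you have made the change and rebooted, you can verify the new size from Python with psutil; on Windows, psutil.swap_memory().total should roughly correspond to the page file size. Task Manager or the same dialog will show it too, so this is just a convenience check.

# check_pagefile.py - verify the page file is at least ~50 GB after the change.
# Requires psutil; on Windows swap_memory() roughly reflects the page file.
import psutil

TARGET_GB = 50
total_gb = psutil.swap_memory().total / 2**30
status = "OK" if total_gb >= TARGET_GB else "still too small"
print(f"Page file: {total_gb:.1f} GiB ({status}, target >= {TARGET_GB} GiB)")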

A couple more pointers - Try not to stop and restart the GPUgrid tasks too much. They should replay from the last checkpoint, and that shouldn't take more than a few minutes. Normally they will complete in a bit more than 24 hours total. If a task stays at 4% or runs for more than a couple of days, it is stalled and should probably be aborted. You can check the stderr file in the slot the task is running in. Also, don't run a lot of other programs on your PC at the same time. GPUgrid needs a considerable amount of CPU resources, memory and swap space, and if it is competing with other applications it may run short. GLHF

JJCH
