Author |
Message |
|
I'm seeing a problem with tasks that have PABLO in their names. More than half complete without problems, but the remainder seem to go into an endless loop that stops them from writing any more checkpoints or making any more progress. Estimated remaining time eventually drops to zero without changing the progress percentage.
Workaround that lets progress resume - suspend the task for at least one minute. Then resume the task. Expect most of the elapsed time to be lost when this is done, but progress then resumes.
A task where this happened:
http://www.gpugrid.net/result.php?resultid=16421650
Computer where this happened:
http://www.gpugrid.net/show_host_detail.php?hostid=422382
A wingmate for this workunit got this error:
The simulation has become unstable. Terminating to avoid lock-up (1)
I've seen the problem on one or two tasks before, but did not save enough information about those tasks to tell you which ones. |
|
|
|
I think this is more common on Windows 10.
I haven't encountered it on Windows 7 yet (or on my single Windows 10 machine, come to think of it). |
|
|
|
Another PABLO task that seems to go into an endless loop, unless suspended, losing nearly a full day of compute time:
http://www.gpugrid.net/result.php?resultid=16435516
http://www.gpugrid.net/workunit.php?wuid=12651837
Do these tasks have enough debugging enabled to show the cause of the endless loop? The slot directory does not appear to contain a text file showing anything relevant.
Running under 64-bit Windows 10. Problem does not happen on all of the PABLO tasks. |
|
|
|
This may be a shot in the dark, but ...
Do you have anything BOINC-related, in the following folder:
C:\Users\{username}\AppData\Local\VirtualStore\
I have seen some GPUGrid strangeness at one time, where I was playing with compatibility modes, and Windows created files in that "VirtualStore" folder that ... get used, instead of the normal (C:\Program Files\BOINC\) files. Worse yet, BOINC-related "VirtualStore" folders won't get properly cleaned by BOINC!
In my case, at that time, my tasks were erroneously insta-completing.
Anyway .. So, do you have anything in that "VirtualStore" folder?
If you do have BOINC-related stuff in there, try closing BOINC then removing the BOINC-related stuff then restarting BOINC. |
|
|
|
This may be a shot in the dark, but ...
Do you have anything BOINC-related, in the following folder:
C:\Users\{username}\AppData\Local\VirtualStore\
[snip]
There are a number of files there, but none appear to be BOINC-related.
|
|
|
|
What setting are you using for:
"Use at most X% of CPU time"
If you're not using 100%, can you try it and see if it fixes the problem? |
|
|
|
What setting are you using for:
"Use at most X% of CPU time"
If you're not using 100%, can you try it and see if it fixes the problem?
100%
|
|
|
|
Another task with the problem:
http://www.gpugrid.net/result.php?resultid=16442636
http://www.gpugrid.net/workunit.php?wuid=12657605
I've bought a GTX 1080, and have told BOINC not to download any more GPU workunits for now so all of them can finish before I install the new graphics board tomorrow. |
|
|
|
Now using the GTX 1080, which appears to have stopped the problem.
Some PABLO tasks run for an unexpectedly long time now, but they finish and verify properly. |
|
|
|
Another PABLO task that apparently went into an endless loop:
http://www.gpugrid.net/workunit.php?wuid=12708240
http://www.gpugrid.net/result.php?resultid=16504434
It reached:
55.160% progress, 1d 10:29:13 elapsed, --- remaining
no change to these numbers other than elapsed for at least 12 hours
using GTX 1080 with 385.28 driver, and i7-5980X CPU, BOINC 7.6.33
I suspended it and installed the 385.41 driver (without the 3D sections).
It now indicates 55.160% progress, 03:38:48 elapsed, 04:03:15 remaining. Running has not resumed - BOINC appears to be catching up on other GPU work.
This suggests that the problem occurs soon after a checkpoint is written, but before whatever produces the next progress increase. Resuming from a checkpoint does not appear to trigger it.
NO error messages on the screen, or in any file in the slot that looked likely to be a non-empty text file. |
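A rough sketch of how a stall like this could be flagged automatically, assuming you record (elapsed time, progress) samples yourself, for example by noting the values shown in BOINC Manager. The `Sample` type and `stalled_since` function are hypothetical names for illustration, not part of BOINC.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    elapsed_s: float      # elapsed time of the task, in seconds
    fraction_done: float  # progress as a fraction between 0 and 1


def stalled_since(samples: List[Sample], min_stall_s: float) -> Optional[float]:
    """Return the elapsed time of the last observed progress change if the
    task has shown no progress for at least min_stall_s seconds, else None."""
    if not samples:
        return None
    last_change = samples[0].elapsed_s
    last_fraction = samples[0].fraction_done
    for s in samples[1:]:
        if s.fraction_done != last_fraction:
            last_fraction = s.fraction_done
            last_change = s.elapsed_s
    if samples[-1].elapsed_s - last_change >= min_stall_s:
        return last_change
    return None


# Illustrative samples: progress reaches 55.160% and then stays there
# until 1d 10:29:13 elapsed (124153 seconds). Earlier samples are made up.
samples = [
    Sample(0, 0.0),
    Sample(36000, 0.5516),
    Sample(80000, 0.5516),
    Sample(124153, 0.5516),
]
print(stalled_since(samples, min_stall_s=12 * 3600))
```

A non-None result only says that progress has not moved for the chosen interval. It cannot distinguish an endless loop from a task that the client has suspended.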
|
|
|
Another PABLO task that apparently went into an endless loop:
http://www.gpugrid.net/workunit.php?wuid=12708240
http://www.gpugrid.net/result.php?resultid=16504434
[snip]
It finally resumed from the checkpoint, and started updating the progress percentage about once a second. The task completed within a few hours, was reported, and has now validated.
This suggests that also enabling debug output between resuming from the checkpoint file and first update of the progress percentage would allow comparing the failed first try to the second try that worked better.
|
|
|
Variable
Joined: 20 Nov 13 Posts: 21 Credit: 440,148,605 RAC: 66,901
|
I have also had this problem using my 1070, on Win7. GPUgrid tasks will compute to halfway or so and then stop for hours. The card is dedicated to GPUgrid only so no other projects are competing with it for compute time. The core load of the card indicates it is just sitting idle. Exiting BOINC and restarting it will resume computation on the task, as will suspending it and then resuming. It's happening pretty frequently, every 1-2 days.
I have the most recent BOINC version, but I have not updated graphics drivers in a while so my next step is to try that. |
|
|
wiyosaya
Joined: 22 Nov 09 Posts: 114 Credit: 589,114,683 RAC: 0
|
What setting are you using for:
"Use at most X% of CPU time"
If you're not using 100%, can you try it and see if it fixes the problem?
100%
Not sure whether this will help, but I just got a 1060 up last night on Win 10 x64, and noted that while the elapsed time kept incrementing, the percent done stopped. This was after it had crunched about 2.5 percent of the WU. I looked into the BOINC client log and there was a message in there that said "CPU busy, suspending work" or something like that. I was using my computer at the time for non-CPU-intensive things like running my web browser.
I checked "Suspend work when non-BOINC CPU usage is above" in "When and how BOINC uses your computer" under "Preferences" on my account page and noted it was set to 80%. I then set it to 0 which means to run BOINC projects all the time regardless of host CPU usage.
I suggest checking that setting.
I then did an Update on GPUGrid, and it still did not restart. So I exited BOINC and restarted, and the problem seemed to disappear - that is, I was still using my computer, however, the task did not suspend and ran to completion in a timely manner. Perhaps this is coincidental, IDK. However, I have a 6-core processor and there is no way that non-BOINC total core/thread usage was 80% or above at the time, unless the browser briefly spun up 11 or 12 threads.
The next time I get a PABLO, I will watch for this again. If it is not coincidental and the job of checking that setting is in each individual client's code, then perhaps there is a bug in that code with PABLO units. If the code is in BOINC, then perhaps there is a bug there.
|
|
|
|
[snip]
I checked "Suspend work when non-BOINC CPU usage is above" in "When and how BOINC uses your computer" under "Preferences" on my account page and noted it was set to 80%. I then set it to 0 which means to run BOINC projects all the time regardless of host CPU usage.
I suggest checking that setting.
[snip]
I don't see a "Suspend work when non-BOINC CPU usage is above" setting, but I would have set it to off. My computer has 8 physical cores plus hyperthreading, which makes it behave like it has 16 cores, and I've found that limiting the number of cores BOINC can use gives better results than limiting the percentage of CPU time it can use on each core. 14 cores are allowed for CPU tasks, leaving one for GPU tasks and one for non-BOINC programs.
Using BOINC 7.6.33 under Windows 10. |
|
|
wiyosaya
|
Interesting to know of your experience, and it is also interesting that you do not see this setting.
For me, it is under -
1. My Account
2. Preferences section
3. "When and how BOINC uses your computer"
4. Click on "Computing Preferences" which is on the same line as "When and how BOINC uses your computer"
5. "Processor Usage"
6. In the "Processor Usage" section there is an entry "Suspend work when non-BOINC CPU usage is above"
If you also observe that this problem happens again, I suggest checking BOINC's Activity Log for a message similar to the one I found. To me, finding the same message would be a strong indicator that the same thing is happening on your system.
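If you prefer to check the setting on disk rather than through the website, here is a minimal sketch that reads the CPU-related preferences from the text of a BOINC preferences file. The element names `suspend_cpu_usage`, `cpu_usage_limit` and `max_ncpus_pct` are the ones I believe BOINC uses in `global_prefs.xml` / `global_prefs_override.xml`; please verify them against your own files. The function names and the example values (80%, 100%, and 14 of 16 logical CPUs = 87.5%) are only illustrations taken from this thread.

```python
import xml.etree.ElementTree as ET

# Assumed element names; check your own preferences file.
CPU_PREF_TAGS = ("suspend_cpu_usage", "cpu_usage_limit", "max_ncpus_pct")


def read_cpu_prefs(xml_text):
    """Return the CPU-related preferences found in xml_text as floats."""
    root = ET.fromstring(xml_text)
    prefs = {}
    for tag in CPU_PREF_TAGS:
        node = next(root.iter(tag), None)
        if node is not None and node.text:
            prefs[tag] = float(node.text)
    return prefs


def may_suspend_on_cpu_load(prefs):
    """True if a non-zero 'suspend when non-BOINC CPU usage is above' value
    is set, False if it is 0 (never suspend), None if this file does not
    set it and the effective value comes from elsewhere."""
    value = prefs.get("suspend_cpu_usage")
    if value is None:
        return None
    return value > 0.0


EXAMPLE = """<global_preferences>
  <suspend_cpu_usage>80.0</suspend_cpu_usage>
  <cpu_usage_limit>100.0</cpu_usage_limit>
  <max_ncpus_pct>87.5</max_ncpus_pct>
</global_preferences>"""

prefs = read_cpu_prefs(EXAMPLE)
print(prefs, may_suspend_on_cpu_load(prefs))
```

To use it on a real machine, read the file from your BOINC data directory and pass its contents to `read_cpu_prefs`.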
|
|
|
|
Interesting to know of your experience, and it is also interesting that you do not see this setting.
[snip]
I finally found it. It was turned on, so I turned it off.
|
|
|
wiyosaya
|
Interesting to know of your experience, and it is also interesting that you do not see this setting.
[snip]
I finally found it. It was turned on, so I turned it off.
Awesome! So we may have found the problem where these tasks seem to suspend and then not resume. I was reading another thread, and it seems like it may not be specific to PABLO tasks.
It will be interesting to know if you see it again. If I do, I will post to this thread.
|
|
|
|
I have had this problem many times. Win 10, latest Nvidia driver, etc.
The GPU runs out of work while the process is still active.
I ticked the task_debug option in the message options...
The output is:
12.11.2017 16:11:47 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:11:54 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:12:01 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:12:08 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:12:15 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:12:24 | | Suspending computation - CPU is busy
12.11.2017 16:12:24 | GPUGRID | [cpu_sched] Preempting e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 (left in memory)
12.11.2017 16:12:24 | GPUGRID | [task] task_state=SUSPENDED for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from suspend
12.11.2017 16:12:44 | | Resuming computation
12.11.2017 16:12:44 | GPUGRID | [cpu_sched] Resuming e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0
12.11.2017 16:12:44 | GPUGRID | [task] task_state=EXECUTING for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from unsuspend
12.11.2017 16:12:48 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:13:04 | | Suspending computation - CPU is busy
12.11.2017 16:13:04 | GPUGRID | [cpu_sched] Preempting e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 (left in memory)
12.11.2017 16:13:04 | GPUGRID | [task] task_state=SUSPENDED for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from suspend
12.11.2017 16:13:06 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:13:12 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:13:14 | | Resuming computation
12.11.2017 16:13:14 | GPUGRID | [cpu_sched] Resuming e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0
12.11.2017 16:13:14 | GPUGRID | [task] task_state=EXECUTING for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from unsuspend
12.11.2017 16:19:54 | | Suspending GPU computation - user request
12.11.2017 16:19:54 | GPUGRID | [cpu_sched] Preempting e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 (removed from memory)
12.11.2017 16:19:54 | GPUGRID | [task] task_state=QUIT_PENDING for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from request_exit()
12.11.2017 16:19:54 | | request_exit(): PID 13004 has 1 descendants
12.11.2017 16:19:54 | | PID 5096
12.11.2017 16:20:02 | | Resuming GPU computation
12.11.2017 16:20:55 | GPUGRID | [task] quit request timed out, killing task e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0
12.11.2017 16:20:56 | GPUGRID | [task] Process for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 exited, exit code 0, task state 8
12.11.2017 16:20:56 | GPUGRID | [task] task_state=UNINITIALIZED for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from handle_exited_app
12.11.2017 16:20:56 | GPUGRID | [task] task_state=EXECUTING for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from start
12.11.2017 16:20:56 | GPUGRID | [cpu_sched] Restarting task e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 using acemdlong version 918 (cuda80) in slot 0
12.11.2017 16:21:06 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:21:13 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:21:20 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:21:27 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:21:35 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:21:42 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
At the second resume at 16:13, the GPU performance was lost.
I stopped GPU usage at 16:19 and resumed it at 16:20.
The job then worked again.
The difference I found was that the task was removed from memory (16:19).
Too bad that there is no solution.
There is also a checkpoint every 15 seconds. I think it's usually 5-15 min.
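For anyone who wants to look for the same pattern in their own event log, here is a small sketch that scans lines in the format shown above. It reports checkpoints that were logged while the client said computation was suspended, and the longest gap between two checkpoints. The function names are my own, and the line format is assumed to match this log exactly (date as DD.MM.YYYY, fields separated by "|").

```python
from datetime import datetime

TASK = "e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0"


def parse_line(line):
    """Split 'DD.MM.YYYY HH:MM:SS | project | message' into its three parts."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    return datetime.strptime(stamp, "%d.%m.%Y %H:%M:%S"), project, message


def analyse(lines):
    """Return (checkpoints logged while suspended, longest checkpoint gap in s)."""
    suspended = False
    checkpoints = []
    during_suspend = []
    for line in lines:
        when, _project, message = parse_line(line)
        if message.startswith("Suspending"):
            suspended = True
        elif message.startswith("Resuming"):
            suspended = False
        elif message.endswith("checkpointed"):
            checkpoints.append(when)
            if suspended:
                during_suspend.append(when)
    gaps = [(b - a).total_seconds() for a, b in zip(checkpoints, checkpoints[1:])]
    return during_suspend, max(gaps, default=0.0)


# Excerpt of the log above, reduced to the lines the sketch looks at.
EXCERPT = [
    f"12.11.2017 16:12:15 | GPUGRID | [task] result {TASK} checkpointed",
    "12.11.2017 16:12:24 | | Suspending computation - CPU is busy",
    "12.11.2017 16:12:44 | | Resuming computation",
    f"12.11.2017 16:12:48 | GPUGRID | [task] result {TASK} checkpointed",
    "12.11.2017 16:13:04 | | Suspending computation - CPU is busy",
    f"12.11.2017 16:13:06 | GPUGRID | [task] result {TASK} checkpointed",
    f"12.11.2017 16:13:12 | GPUGRID | [task] result {TASK} checkpointed",
    "12.11.2017 16:13:14 | | Resuming computation",
    "12.11.2017 16:19:54 | | Suspending GPU computation - user request",
    "12.11.2017 16:20:02 | | Resuming GPU computation",
    f"12.11.2017 16:21:06 | GPUGRID | [task] result {TASK} checkpointed",
]

during_suspend, longest_gap = analyse(EXCERPT)
print(len(during_suspend), longest_gap)
```

On this excerpt the two checkpoints at 16:13:06 and 16:13:12 fall between the "Suspending" and "Resuming" messages, and the longest gap runs from 16:13:12 to the restart at 16:21:06. Whether checkpoints logged during a suspension are related to the stall, or are just reported with a delay, is not something the log alone can tell us.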
|
|
|
hsdecalc, I haven't seen the problem lately. My recent changes include installing BOINC 7.8.3, and setting the number of CPU cores BOINC is allowed to use to two less than the total number present. |
|
|
|
hsdecalc, another recent change, probably after the last time I saw the problem.
Note that you must be in the advanced view, not the simple view, to follow the directions below. Click on View to start changing which view you have.
Interesting to know of your experience, and it is also interesting that you do not see this setting.
For me, it is under -
1. My Account
2. Preferences section
3. "When and how BOINC uses your computer"
4. Click on "Computing Preferences" which is on the same line as "When and how BOINC uses your computer"
5. "Processor Usage"
6. In the "Processor Usage" section there is an entry "Suspend work when non-BOINC CPU usage is above"
If you also observe that this problem happens again, I suggest checking BOINC's Activity Log for a message similar to the one I found. To me, finding the same message would be a strong indicator that the same thing is happening on your system.
|
|
|
|
Thanks for the reply.
The above procedure is a workaround, not a solution, for me. I sometimes have high CPU usage from video playback, so I need the pause option. |
|
|
|
Thanks for the reply.
The above procedure is a workaround, not a solution, for me. I sometimes have high CPU usage from video playback, so I need the pause option.
Then you should add those games and applications to the settings -> exclusive applications list.
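As far as I know, the exclusive applications list is stored in the client's `cc_config.xml` as `<exclusive_app>` and `<exclusive_gpu_app>` entries under `<options>`. A minimal sketch that builds such a fragment is below. The executable names are made up, and the function name is hypothetical, so treat it as an illustration rather than a ready-made configuration.

```python
import xml.etree.ElementTree as ET


def exclusive_app_config(cpu_apps, gpu_apps=()):
    """Build a cc_config.xml fragment listing exclusive applications.

    cpu_apps: executables during which all BOINC computing should pause.
    gpu_apps: executables during which only GPU computing should pause.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for name in cpu_apps:
        ET.SubElement(options, "exclusive_app").text = name
    for name in gpu_apps:
        ET.SubElement(options, "exclusive_gpu_app").text = name
    return ET.tostring(root, encoding="unicode")


# Hypothetical executable names, for illustration only.
fragment = exclusive_app_config(["videoplayer.exe"], ["game.exe"])
print(fragment)
```

If you already have a `cc_config.xml` with other options, merge the entries by hand instead of overwriting the file, or simply use the dialog in BOINC Manager.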
|
|
|