Problem with PABLO tasks


Author	Message
robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,434,080 RAC: 186,180 Level Scientific publications	Message 47698 - Posted: 27 Jul 2017 \| 21:22:02 UTC Last modified: 27 Jul 2017 \| 21:23:12 UTC
	I'm seeing a problem with tasks that have PABLO in their names. More than half complete without problems, but the remainder seem to go into an endless loop that stops them from writing any more checkpoints or making any more progress. The estimated remaining time eventually drops to zero while the progress percentage stays unchanged.
	Workaround that lets progress resume: suspend the task for at least one minute, then resume it. Expect most of the elapsed time to be lost when this is done, but progress then resumes.
	A task where this happened: http://www.gpugrid.net/result.php?resultid=16421650
	Computer where this happened: http://www.gpugrid.net/show_host_detail.php?hostid=422382
	A wingmate for this workunit got this error: "The simulation has become unstable. Terminating to avoid lock-up (1)"
	I've seen the problem on one or two tasks before, but did not save enough information about those tasks to tell you which ones.

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,866,392,698 RAC: 20,073,067 Level Scientific publications	Message 47700 - Posted: 27 Jul 2017 \| 21:34:29 UTC - in response to Message 47698.
	I think this is more common on Windows 10. I haven't encountered it on Windows 7 yet (or on my single Windows 10 machine, come to think of it).

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,434,080 RAC: 186,180 Level Scientific publications	Message 47737 - Posted: 4 Aug 2017 \| 0:19:50 UTC
	Another PABLO task that seems to go into an endless loop unless suspended, losing nearly a full day of compute time:
	http://www.gpugrid.net/result.php?resultid=16435516
	http://www.gpugrid.net/workunit.php?wuid=12651837
	Do these tasks have enough debugging enabled to show the cause of the endless loop? The slot directory does not appear to contain a text file showing anything relevant.
	Running under 64-bit Windows 10. The problem does not happen on all of the PABLO tasks.

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 47745 - Posted: 6 Aug 2017 \| 1:24:15 UTC Last modified: 6 Aug 2017 \| 1:28:54 UTC
	This may be a shot in the dark, but do you have anything BOINC-related in the following folder?
	C:\Users\{username}\AppData\Local\VirtualStore\
	I once saw some GPUGrid strangeness while I was playing with compatibility modes: Windows created files in that "VirtualStore" folder that get used instead of the normal (C:\Program Files\BOINC\) files. Worse yet, BOINC-related "VirtualStore" folders won't get properly cleaned by BOINC! In my case, at that time, my tasks were erroneously insta-completing.
	So, do you have anything in that "VirtualStore" folder? If you do have BOINC-related stuff in there, try closing BOINC, removing the BOINC-related stuff, and then restarting BOINC.
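The VirtualStore check described in this post can be scripted. The following is only a minimal sketch: the function name `find_boinc_entries` and the idea of matching on the word "boinc" in file and folder names are assumptions made for illustration, and a name match is no proof that an entry is actually used by BOINC. It only lists candidates and deletes nothing.

```python
import os
from pathlib import Path


def find_boinc_entries(virtual_store, keyword="boinc"):
    """Return sorted paths under virtual_store whose name contains keyword.

    The match is case-insensitive and purely name-based (an assumption for
    this sketch); inspect each hit by hand before removing anything.
    """
    virtual_store = Path(virtual_store)
    if not virtual_store.is_dir():
        return []
    needle = keyword.lower()
    return sorted(p for p in virtual_store.rglob("*") if needle in p.name.lower())


if __name__ == "__main__":
    # On Windows, %LOCALAPPDATA% normally expands to
    # C:\Users\{username}\AppData\Local.
    store = Path(os.environ.get("LOCALAPPDATA", "")) / "VirtualStore"
    hits = find_boinc_entries(store)
    if hits:
        for path in hits:
            print(path)
    else:
        print(f"nothing BOINC-related found under {store}")
```

If the script prints any paths, the advice above applies: close BOINC first, then remove those entries by hand and restart BOINC.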

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,434,080 RAC: 186,180 Level Scientific publications	Message 47746 - Posted: 6 Aug 2017 \| 3:22:43 UTC - in response to Message 47745.
	[quote]This may be a shot in the dark, but ... Do you have anything BOINC-related, in the following folder: C:\Users\{username}\AppData\Local\VirtualStore\ [snip][/quote]
	There are a number of files there, but none appear to be BOINC-related.

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 47747 - Posted: 6 Aug 2017 \| 3:54:02 UTC - in response to Message 47746.
	What setting are you using for: "Use at most X% of CPU time" If you're not using 100%, can you try it and see if it fixes the problem?

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,434,080 RAC: 186,180 Level Scientific publications	Message 47748 - Posted: 6 Aug 2017 \| 4:39:06 UTC - in response to Message 47747.
	[quote]What setting are you using for: "Use at most X% of CPU time" If you're not using 100%, can you try it and see if it fixes the problem?[/quote]
	100%

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,434,080 RAC: 186,180 Level Scientific publications	Message 47750 - Posted: 6 Aug 2017 \| 22:13:48 UTC Last modified: 6 Aug 2017 \| 22:15:21 UTC
	Another task with the problem:
	http://www.gpugrid.net/result.php?resultid=16442636
	http://www.gpugrid.net/workunit.php?wuid=12657605
	I've bought a GTX 1080, and have told BOINC not to download any more GPU workunits for now so that the current ones can all finish before I install the new graphics board tomorrow.

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,434,080 RAC: 186,180 Level Scientific publications	Message 47797 - Posted: 21 Aug 2017 \| 14:42:31 UTC
	Now using the GTX 1080, which appears to have stopped the problem. Some PABLO tasks run for an unexpectedly long time now, but they finish and verify properly.

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,434,080 RAC: 186,180 Level Scientific publications	Message 47855 - Posted: 8 Sep 2017 \| 21:04:11 UTC Last modified: 8 Sep 2017 \| 21:08:52 UTC
	Another PABLO task that apparently went into an endless loop:
	http://www.gpugrid.net/workunit.php?wuid=12708240
	http://www.gpugrid.net/result.php?resultid=16504434
	It reached 55.160% progress, 1d 10:29:13 elapsed, --- remaining, with no change to these numbers other than elapsed for at least 12 hours.
	Using a GTX 1080 with the 385.28 driver, an i7-5980X CPU, and BOINC 7.6.33.
	I suspended it and installed the 385.41 driver (without the 3D sections). It now indicates 55.160% progress, 03:38:48 elapsed, 04:03:15 remaining. Running has not resumed; BOINC appears to be catching up on other GPU work.
	This suggests that the problem occurs soon after writing a checkpoint, but before whatever produces the next progress increase. Resuming from a checkpoint does not appear to trigger the problem.
	No error messages on the screen, or in any file in the slot directory that looked likely to be a non-empty text file.

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,434,080 RAC: 186,180 Level Scientific publications	Message 47859 - Posted: 9 Sep 2017 \| 14:53:44 UTC - in response to Message 47855.
	[quote]Another PABLO task that apparently went into an endless loop: http://www.gpugrid.net/workunit.php?wuid=12708240 http://www.gpugrid.net/result.php?resultid=16504434 [snip][/quote]
	It finally resumed from the checkpoint, and started updating the progress percentage about once a second. The task completed within a few hours, was reported, and has now validated.
	This suggests that also enabling debug output between resuming from the checkpoint file and the first update of the progress percentage would allow comparing the failed first try to the second try that worked better.

Variable Send message Joined: 20 Nov 13 Posts: 21 Credit: 440,148,605 RAC: 66,901 Level Scientific publications	Message 47860 - Posted: 11 Sep 2017 \| 15:36:38 UTC
	I have also had this problem using my 1070, on Win7. GPUgrid tasks will compute to halfway or so and then stop for hours. The card is dedicated to GPUgrid only so no other projects are competing with it for compute time. The core load of the card indicates it is just sitting idle. Exiting BOINC and restarting it will resume computation on the task, as will suspending it and then resuming. It's happening pretty frequently, every 1-2 days. I have the most recent BOINC version, but I have not updated graphics drivers in a while so my next step is to try that.

wiyosaya Send message Joined: 22 Nov 09 Posts: 114 Credit: 589,114,683 RAC: 0 Level Scientific publications	Message 47861 - Posted: 12 Sep 2017 \| 17:27:51 UTC - in response to Message 47748. Last modified: 12 Sep 2017 \| 17:33:45 UTC
	[quote]What setting are you using for: "Use at most X% of CPU time" If you're not using 100%, can you try it and see if it fixes the problem? 100%[/quote]
	Not sure whether this will help, but I just got a 1060 up last night on Win 10 x64, and noted that while the elapsed time kept incrementing, the percent done stopped. This was after it had crunched about 2.5 percent of the WU.
	I looked into the BOINC client log and there was a message in there that said "CPU busy, suspending work" or something like that. I was using my computer at the time for non-CPU-intensive stuff like running my web browser.
	I checked "Suspend work when non-BOINC CPU usage is above" in "When and how BOINC uses your computer" under "Preferences" on my account page and noted it was set to 80%. I then set it to 0, which means to run BOINC projects all the time regardless of host CPU usage. I suggest checking that setting.
	I then did an Update on GPUGrid, and it still did not restart. So I exited BOINC and restarted, and the problem seemed to disappear: I was still using my computer, but the task did not suspend and ran to completion in a timely manner.
	Perhaps this is coincidental, IDK. However, I have a 6-core processor and there is no way that non-BOINC total core/thread usage was 80% or above at the time, unless the browser briefly spun up 11 or 12 threads.
	The next time I get a PABLO, I will watch for this again. If it is not coincidental and the job of checking that setting is in each individual client's code, then perhaps there is a bug in that code with PABLO units. If the code is in BOINC, then perhaps there is a bug there.

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,434,080 RAC: 186,180 Level Scientific publications	Message 47862 - Posted: 12 Sep 2017 \| 21:07:06 UTC - in response to Message 47861.
	[quote][snip] I checked "Suspend work when non-BOINC CPU usage is above" in "When and how BOINC uses your computer" under "Preferences" on my account page and noted it was set to 80%. I then set it to 0 which means to run BOINC projects all the time regardless of host CPU usage. I suggest checking that setting. [snip][/quote]
	I don't see a "Suspend work when non-BOINC CPU usage is above" setting, but I would have set it to off.
	My computer has 8 physical cores plus hyperthreading, which makes it behave as if it had 16 cores, and I've found that limiting the number of cores BOINC can use gives better results than limiting the percentage of CPU time it can use on each core. 14 cores are allowed for CPU tasks, leaving one for GPU tasks and one for non-BOINC programs.
	Using BOINC 7.6.33 under Windows 10.
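As far as I know, BOINC's computing preferences express the core limit as a percentage of the CPUs rather than as a core count, so reserving cores means converting a count into a percentage. The helper below is a small illustrative sketch of that arithmetic using the numbers from this post (14 of 16 logical cores); the function name is made up for the example.

```python
def cpu_percent_for_cores(allowed, logical):
    """Percentage of logical CPUs that corresponds to `allowed` cores."""
    if not 0 < allowed <= logical:
        raise ValueError("allowed must be between 1 and the logical core count")
    return 100.0 * allowed / logical


# 8 physical cores with hyperthreading = 16 logical cores, 14 allowed for CPU tasks.
print(cpu_percent_for_cores(14, 16))  # prints 87.5
```

The remaining 12.5% corresponds to the two logical cores left free for the GPU task and for non-BOINC programs.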

wiyosaya Send message Joined: 22 Nov 09 Posts: 114 Credit: 589,114,683 RAC: 0 Level Scientific publications	Message 47863 - Posted: 12 Sep 2017 \| 21:24:01 UTC - in response to Message 47862.
	Interesting to know of your experience, and it is also interesting that you do not see this setting. For me, it is under:
	1. My Account
	2. Preferences section
	3. "When and how BOINC uses your computer"
	4. Click on "Computing Preferences", which is on the same line as "When and how BOINC uses your computer"
	5. "Processor Usage"
	6. In the "Processor Usage" section there is an entry "Suspend work when non-BOINC CPU usage is above"
	If you also observe that this problem happens again, I suggest checking BOINC's Activity Log for a message similar to the one I found. To me, finding the same message would be a strong indicator that the same thing is happening on your system.
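The same preference can, to my understanding, also be overridden locally on one machine through a `global_prefs_override.xml` file in the BOINC data directory. The sketch below only builds the XML text; the file name and the `suspend_cpu_usage` element are taken from my reading of the BOINC client documentation and should be checked against your client version before use. A value of 0 is the "never suspend for CPU usage" setting discussed in this thread.

```python
import xml.etree.ElementTree as ET


def build_override(suspend_cpu_usage):
    """Build a minimal global_prefs_override.xml body as a string.

    suspend_cpu_usage is the non-BOINC CPU usage threshold in percent;
    0 disables suspension on CPU usage. Element names are assumptions
    based on the BOINC preference file format.
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "suspend_cpu_usage").text = f"{suspend_cpu_usage:.6f}"
    return ET.tostring(root, encoding="unicode")


print(build_override(0))
```

After saving such a file, the client has to re-read its local preferences (or be restarted) for the change to take effect. The website route in the numbered steps above changes the setting for every host that uses those preferences, while an override file affects only the one machine.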

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,434,080 RAC: 186,180 Level Scientific publications	Message 47864 - Posted: 12 Sep 2017 \| 21:27:44 UTC - in response to Message 47863.
	[quote]Interesting to know of your experience, and it is also interesting that you do not see this setting. [snip][/quote]
	I finally found it. It was turned on, so I turned it off.

wiyosaya Send message Joined: 22 Nov 09 Posts: 114 Credit: 589,114,683 RAC: 0 Level Scientific publications	Message 47865 - Posted: 12 Sep 2017 \| 21:43:47 UTC - in response to Message 47864. Last modified: 12 Sep 2017 \| 21:44:26 UTC
	[quote]Interesting to know of your experience, and it is also interesting that you do not see this setting. [snip] I finally found it. It was turned on, so I turned it off.[/quote]
	Awesome! So we may have found the problem where these tasks seem to suspend and then not resume. I was reading another thread, and it seems like it may not be specific to PABLO tasks. It will be interesting to know if you see it again. If I do, I will post to this thread.

hsdecalc Send message Joined: 5 Jul 15 Posts: 2 Credit: 135,260,724 RAC: 0 Level Scientific publications	Message 48162 - Posted: 12 Nov 2017 \| 21:00:53 UTC Last modified: 12 Nov 2017 \| 21:05:05 UTC
	I have had this problem many times. Win 10, latest Nvidia driver, etc. The GPU runs out of work while the process is still active. I ticked the task_debug option in the message options... Output is:
	12.11.2017 16:11:47 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
	12.11.2017 16:11:54 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
	12.11.2017 16:12:01 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
	12.11.2017 16:12:08 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
	12.11.2017 16:12:15 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
	12.11.2017 16:12:24 | | Suspending computation - CPU is busy
	12.11.2017 16:12:24 | GPUGRID | [cpu_sched] Preempting e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 (left in memory)
	12.11.2017 16:12:24 | GPUGRID | [task] task_state=SUSPENDED for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from suspend
	12.11.2017 16:12:44 | | Resuming computation
	12.11.2017 16:12:44 | GPUGRID | [cpu_sched] Resuming e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0
	12.11.2017 16:12:44 | GPUGRID | [task] task_state=EXECUTING for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from unsuspend
	12.11.2017 16:12:48 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
	12.11.2017 16:13:04 | | Suspending computation - CPU is busy
	12.11.2017 16:13:04 | GPUGRID | [cpu_sched] Preempting e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 (left in memory)
	12.11.2017 16:13:04 | GPUGRID | [task] task_state=SUSPENDED for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from suspend
	12.11.2017 16:13:06 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
	12.11.2017 16:13:12 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
	12.11.2017 16:13:14 | | Resuming computation
	12.11.2017 16:13:14 | GPUGRID | [cpu_sched] Resuming e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0
	12.11.2017 16:13:14 | GPUGRID | [task] task_state=EXECUTING for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from unsuspend
	12.11.2017 16:19:54 | | Suspending GPU computation - user request
	12.11.2017 16:19:54 | GPUGRID | [cpu_sched] Preempting e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 (removed from memory)
	12.11.2017 16:19:54 | GPUGRID | [task] task_state=QUIT_PENDING for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from request_exit()
	12.11.2017 16:19:54 | | request_exit(): PID 13004 has 1 descendants
	12.11.2017 16:19:54 | | PID 5096
	12.11.2017 16:20:02 | | Resuming GPU computation
	12.11.2017 16:20:55 | GPUGRID | [task] quit request timed out, killing task e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0
	12.11.2017 16:20:56 | GPUGRID | [task] Process for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 exited, exit code 0, task state 8
	12.11.2017 16:20:56 | GPUGRID | [task] task_state=UNINITIALIZED for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from handle_exited_app
	12.11.2017 16:20:56 | GPUGRID | [task] task_state=EXECUTING for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from start
	12.11.2017 16:20:56 | GPUGRID | [cpu_sched] Restarting task e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 using acemdlong version 918 (cuda80) in slot 0
	12.11.2017 16:21:06 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
	12.11.2017 16:21:13 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
	12.11.2017 16:21:20 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
	12.11.2017 16:21:27 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
	12.11.2017 16:21:35 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
	12.11.2017 16:21:42 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
	At the second resume at 16:13 the GPU performance was lost. I stopped GPU usage at 16:19 and resumed at 16:20. The job worked again. The difference I found was that it was removed from memory (16:19). Too bad that there is no solution.
	There is also a checkpoint every 15 seconds. I think it's usually 5-15 min.
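The pattern in this log (a resume that is never followed by a checkpoint) can be searched for automatically in a saved event log. The following is a rough sketch, not a BOINC tool: the function name `find_stalls`, the 60-second threshold, and the assumption that every line has the form `date time | project | message` with a `DD.MM.YYYY` date (which depends on the Windows locale) are all choices made for this example. The sample is a five-line excerpt of the log above.

```python
from datetime import datetime

TASK = "e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0"

# Excerpt of the event log posted above.
SAMPLE = f"""\
12.11.2017 16:12:44 | | Resuming computation
12.11.2017 16:12:48 | GPUGRID | [task] result {TASK} checkpointed
12.11.2017 16:13:04 | | Suspending computation - CPU is busy
12.11.2017 16:13:14 | | Resuming computation
12.11.2017 16:19:54 | | Suspending GPU computation - user request
"""


def find_stalls(log_text, min_gap_seconds=60):
    """Return (resumed, suspended) time pairs with no checkpoint in between."""
    stalls = []
    resumed_at = None  # latest resume that no checkpoint has followed yet
    for line in log_text.splitlines():
        # Tolerate escaped pipes ("\|") left over from copy and paste.
        parts = [p.strip() for p in line.replace("\\|", "|").split("|", 2)]
        if len(parts) != 3:
            continue  # not a "time | project | message" line
        try:
            when = datetime.strptime(parts[0], "%d.%m.%Y %H:%M:%S")
        except ValueError:
            continue
        message = parts[2]
        if message.startswith("Resuming"):
            resumed_at = when
        elif message.endswith("checkpointed"):
            resumed_at = None
        elif message.startswith("Suspending") and resumed_at is not None:
            if (when - resumed_at).total_seconds() >= min_gap_seconds:
                stalls.append((resumed_at, when))
            resumed_at = None
    return stalls


for start, end in find_stalls(SAMPLE):
    print(f"no checkpoint between {start:%H:%M:%S} and {end:%H:%M:%S}")
```

On the excerpt this reports the window from 16:13:14 to 16:19:54, the period in which the post says GPU performance was lost. Short resume and suspend cycles without a checkpoint are ignored by the threshold, so it may need tuning for tasks that checkpoint less often.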

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,434,080 RAC: 186,180 Level Scientific publications	Message 48164 - Posted: 13 Nov 2017 \| 4:16:05 UTC
	hsdecalc, I haven't seen the problem lately. My recent changes include installing BOINC 7.8.3, and setting the number of CPU cores BOINC is allowed to use to two less than the total number present.

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 755,434,080 RAC: 186,180 Level Scientific publications	Message 48166 - Posted: 13 Nov 2017 \| 5:36:33 UTC - in response to Message 47863.
	hsdecalc, another recent change, probably after the last time I saw the problem. Note that you must be in the advanced view, not the simple view, to follow the directions below. Click on View to start changing which view you have.
	[quote]Interesting to know of your experience, and it is also interesting that you do not see this setting. For me, it is under:
	1. My Account
	2. Preferences section
	3. "When and how BOINC uses your computer"
	4. Click on "Computing Preferences", which is on the same line as "When and how BOINC uses your computer"
	5. "Processor Usage"
	6. In the "Processor Usage" section there is an entry "Suspend work when non-BOINC CPU usage is above"
	If you also observe that this problem happens again, I suggest checking BOINC's Activity Log for a message similar to the one I found. To me, finding the same message would be a strong indicator that the same thing is happening on your system.[/quote]

hsdecalc Send message Joined: 5 Jul 15 Posts: 2 Credit: 135,260,724 RAC: 0 Level Scientific publications	Message 48186 - Posted: 14 Nov 2017 \| 9:10:56 UTC - in response to Message 48166.
	Thanks for the reply. The above procedure is a workaround, not a solution, for me. I sometimes have high CPU usage from video playback, so I need the pause option.

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,217,465,968 RAC: 1,257,790 Level Scientific publications	Message 48224 - Posted: 21 Nov 2017 \| 15:04:21 UTC - in response to Message 48186.
	[quote]Thanks for the reply. The above procedure is a workaround, not a solution, for me. I sometimes have high CPU usage from video playback, so I need the pause option.[/quote]
	Then you should add those games and applications to the Settings -> Exclusive applications list.
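The exclusive applications list is, as far as I understand it, stored in the client's `cc_config.xml` as `<exclusive_app>` and `<exclusive_gpu_app>` entries inside `<options>`, so it can also be prepared outside BOINC Manager. The sketch below only builds that XML text. The executable name `vlc.exe` is a hypothetical stand-in for whatever video player is in use, and the element names should be verified against the BOINC client configuration documentation.

```python
import xml.etree.ElementTree as ET


def build_cc_config(exclusive_apps, exclusive_gpu_apps=()):
    """Build a minimal cc_config.xml body listing exclusive applications.

    exclusive_apps: executables that should pause all BOINC computing
    while they run; exclusive_gpu_apps: executables that should pause
    GPU computing only. Element names are assumptions based on the
    BOINC client configuration format.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for exe in exclusive_apps:
        ET.SubElement(options, "exclusive_app").text = exe
    for exe in exclusive_gpu_apps:
        ET.SubElement(options, "exclusive_gpu_app").text = exe
    return ET.tostring(root, encoding="unicode")


# "vlc.exe" is only an example name, not something mentioned in the thread.
print(build_cc_config(["vlc.exe"]))
```

With an exclusive application listed, BOINC suspends for as long as that program is running instead of reacting to momentary CPU load, which avoids the rapid suspend and resume cycles seen in the log earlier in the thread.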
