Problem jobs that never finish

Message boards : Number crunching : Problem jobs that never finish

Author	Message
tzpmrz Joined: 8 May 10 Posts: 5 Credit: 140,025,313 RAC: 0	Message 20725 - Posted: 20 Mar 2011 | 1:24:09 UTC
	I've got several of these in a row now. They are usually labelled "long runs (8-12 hours on fastest card)". They run for 55+ hours, longer than the report deadline. These jobs make my PC so slow that I can barely get the BOINC Manager up to stop the processing. If I restart processing they run fine for about 30 minutes and then the PC is in the same state again. It seems that when the PC is this sluggish the GPUGRID job makes no progress toward completion. How can I cancel one of these jobs myself?

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 20728 - Posted: 20 Mar 2011 | 9:18:46 UTC - in response to Message 20725.
	If you have to terminate a task: in BOINC Manager (Advanced View), under the Tasks tab, click on the task and then click Abort. I'm guessing this task was OK; it finished on a GTX 470 in about 10h. Your task ran for about 2.4 days, so your system was off for more than 2.6 days during the 5-day limit:
	Sent 14 Mar 2011 11:32:13 UTC
	Received 20 Mar 2011 0:20:53 UTC
	Report deadline 19 Mar 2011 11:32:13 UTC
	This task was started and stopped 18 times. That's far too many, and it no doubt interfered with checkpointing (after a restart you go back to the last checkpoint). The first checkpoint (at 5%, I think) might not be reached before the task is stopped. Do you have Leave Applications In Memory (LAIM) checked? All GPUGrid crunchers should. If not, I guess you are restarting, or shutting down and starting up, 3 or 4 times a day.
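For anyone who wants to check the deadline arithmetic, here is a minimal sketch using the three timestamps quoted above. The 2.4-day run time is the approximate figure from this post, not a measured value, and the variable names are only illustrative.

```python
from datetime import datetime, timedelta

# Timestamps (UTC) quoted in the post above.
sent = datetime(2011, 3, 14, 11, 32, 13)
received = datetime(2011, 3, 20, 0, 20, 53)
deadline = datetime(2011, 3, 19, 11, 32, 13)

# Approximate run time quoted above (assumption: "about 2.4 days").
run_time = timedelta(days=2.4)


def in_days(delta):
    """Convert a timedelta to fractional days."""
    return delta.total_seconds() / 86400


allowed = deadline - sent       # time the project allows for the task
late_by = received - deadline   # how far past the deadline it was reported
idle = allowed - run_time       # time inside the window when the task was not running

print(f"allowed: {in_days(allowed):.1f} days")
print(f"idle within the window: {in_days(idle):.1f} days")
print(f"reported late by: {late_by}")
```

The idle figure is the 2.6 days mentioned above; the task was also reported a little over half a day after the deadline.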

tzpmrz Joined: 8 May 10 Posts: 5 Credit: 140,025,313 RAC: 0	Message 20746 - Posted: 21 Mar 2011 | 12:35:20 UTC - in response to Message 20728.
	This is true, I do stop the jobs, since the PC is not dedicated to the project. It's my home PC. I won't stop processing for, say, web browsing, but I have to stop it to play video games; it is what it is. The problem with the jobs I'm complaining about is that I can't even browse the web: they are leaking video memory or something, because they cripple the machine. There are GPUGRID jobs running now and I can still browse; 99% of the jobs don't cripple the machine.

tzpmrz Joined: 8 May 10 Posts: 5 Credit: 140,025,313 RAC: 0	Message 20783 - Posted: 25 Mar 2011 | 13:48:53 UTC - in response to Message 20728.
	The change in my system is that I have enabled SLI. It turns out that if I only run one GPUGRID job at a time I don't get the crippling effect. I can suspend either one of the jobs and the PC becomes usable again. I think the system is thrashing video memory when just the right mix of GPUGRID jobs runs together.

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 20784 - Posted: 25 Mar 2011 | 16:47:23 UTC - in response to Message 20783.
	It's recommended that crunchers disable SLI when crunching.

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 20789 - Posted: 27 Mar 2011 | 10:33:27 UTC - in response to Message 20784.
	A while back I had two GT240s in one i7-860 system running Windows Server 2003 x64. I noticed a few tasks that caused an error message and then restarted (from the start). It only happened now and again, but over time there were quite a lot of such failures. Sometimes the tasks ran for many hours before resetting (5h, 12h, 15h). Each time a Windows error message popped up. In the past we could restart the system and these tasks would begin from the last checkpoint. This no longer works. When I only used one GPU the problem seemed to go away. I have since replaced the two GT240s with one GTX260-216, and moved to 6.13 to crunch the long tasks. Until now this had worked without any problem. This morning the problem reappeared with an IBUCH task.
	27/3/11 Faulting application acemdlong_6.13_windows_intelx86__cuda31.exe, version 0.0.0.0, faulting module acemdlong_6.13_windows_intelx86__cuda31.exe, version 0.0.0.0, fault address 0x000025e4.
	19/3/11 Faulting application acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, faulting module acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, fault address 0x000026e4.
	17/3/11 (twice) Faulting application acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, faulting module acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, fault address 0x000026e4.
	16/3/11 Faulting application acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, faulting module acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, fault address 0x000026e4.
	7/3/11 Faulting application acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, faulting module acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, fault address 0x000026e4.
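To see at a glance how the faults group, the entries above can be tallied by application and fault address. This is only a rough sketch: the regular expression is my own guess at the layout of these pasted lines, not an official Event Log format, and `tally_faults` is a made-up helper name.

```python
import re
from collections import Counter

# Fault entries copied from the post above; "(twice)" marks a repeated entry.
LOG = """
27/3/11 Faulting application acemdlong_6.13_windows_intelx86__cuda31.exe, version 0.0.0.0, faulting module acemdlong_6.13_windows_intelx86__cuda31.exe, version 0.0.0.0, fault address 0x000025e4.
19/3/11 Faulting application acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, faulting module acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, fault address 0x000026e4.
17/3/11 (twice) Faulting application acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, faulting module acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, fault address 0x000026e4.
16/3/11 Faulting application acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, faulting module acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, fault address 0x000026e4.
7/3/11 Faulting application acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, faulting module acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, fault address 0x000026e4.
"""

# Assumed layout of one pasted entry (date, optional "(twice)", app, address).
ENTRY = re.compile(
    r"(?P<date>\d{1,2}/\d{1,2}/\d{2})(?P<twice> \(twice\))? "
    r"Faulting application (?P<app>[^,\s]+), version [^,\s]+, "
    r"faulting module [^,\s]+, version [^,\s]+, "
    r"fault address (?P<addr>0x[0-9a-fA-F]+)"
)


def tally_faults(log_text):
    """Count faults per (application, fault address); "(twice)" counts as two."""
    counts = Counter()
    for match in ENTRY.finditer(log_text):
        counts[(match["app"], match["addr"])] += 2 if match["twice"] else 1
    return counts


fault_counts = tally_faults(LOG)
for (app, addr), n in fault_counts.most_common():
    print(f"{n} x {app} at {addr}")
```

Each application version always faults at the same address, which suggests the same code path is being hit every time rather than random corruption, though that is an inference and not something these logs prove.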

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 20796 - Posted: 27 Mar 2011 | 22:06:33 UTC - in response to Message 20789.
	Got another IBUCH failure, p5-IBUCH_1_wtEGFR_110325-1-10-RND6118_2. This one did not restart, so perhaps this is a different problem.

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 20831 - Posted: 31 Mar 2011 | 20:15:21 UTC - in response to Message 20796.
	Got another error message just after a reboot. This seems to be the common theme. I opened Task Manager and killed the process that was trying to report the error, and this prevented the task from failing. This is clearly an MS security handling issue and nothing to do with the CUDA app, again.

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 20882 - Posted: 8 Apr 2011 | 19:16:38 UTC - in response to Message 20831.
	... and again. This time on restarting after tasks were suspended. There was nothing I could close in Task Manager to stop the task error, and closing the error message [X] killed the task! This is down to the way Windows interprets the Application Popup.
	Event Type: Information
	Event Source: Application Popup
	Event Category: None
	Event ID: 26
	Date: 08/04/2011
	Time: 19:00:09
	User: N/A
	Computer: S
	Description: Application popup: acemdlong_6.13_windows_intelx86__cuda31.exe - Application Error : The exception unknown software exception (0x40000015) occurred in the application at location 0x004025e4. Click on OK to terminate the program. Click on CANCEL to debug the program.
	Event Type: Error
	Event Source: Application Error
	Event Category: (100)
	Event ID: 1000
	Date: 08/04/2011
	Time: 19:00:09
	User: N/A
	Computer: S
	Description: Faulting application acemdlong_6.13_windows_intelx86__cuda31.exe, version 0.0.0.0, faulting module acemdlong_6.13_windows_intelx86__cuda31.exe, version 0.0.0.0, fault address 0x000025e4.
	Failed after 5h for no reason other than the operating system decided to terminate it.
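The two numbers in these event log entries can be picked apart a little. The sketch below splits the exception code into the standard NTSTATUS bit fields and shows that the popup's "location" and the error entry's "fault address" point at the same place, assuming the executable is loaded at 0x00400000 (the usual default base for a 32-bit Windows EXE; I have not checked this binary). The helper name `decode_ntstatus` is made up for illustration. As far as I know, 0x40000015 is the value ntstatus.h lists as STATUS_FATAL_APP_EXIT.

```python
# Assumption: default image base of a 32-bit Windows executable.
IMAGE_BASE = 0x00400000

SEVERITY_NAMES = ["success", "informational", "warning", "error"]


def decode_ntstatus(code):
    """Split a 32-bit NTSTATUS value into its standard bit fields."""
    return {
        "severity": SEVERITY_NAMES[(code >> 30) & 0x3],  # bits 31-30
        "customer": bool((code >> 29) & 0x1),            # bit 29
        "facility": (code >> 16) & 0xFFF,                # bits 27-16
        "code": code & 0xFFFF,                           # bits 15-0
    }


# Exception code and location taken from the Application Popup entry above.
status = decode_ntstatus(0x40000015)
offset = 0x004025E4 - IMAGE_BASE

print(status)
print(f"offset into module: 0x{offset:08x}")
```

Under that assumption the offset comes out as 0x000025e4, the same value the Application Error entry reports as the fault address, so both events describe one fault and not two separate ones.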

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 20952 - Posted: 14 Apr 2011 | 9:08:32 UTC - in response to Message 20882.
	Windows Server 2003 x64 with a GTX260. I've been running a long IBUCH task for 8h 50min with 4h to go, but the percentage complete is zero and I get this error from acemdlong_6.13_windows_intelx86__cuda21.exe: The exception unknown software exception (0x40000005) occurred in the application at location 0x004025e4. Click on OK to terminate the program. Click on CANCEL to debug the program.
