
Message boards : Number crunching : Problem jobs that never finish

tzpmrz
Joined: 8 May 10
Posts: 5
Credit: 140,025,313
RAC: 0
Message 20725 - Posted: 20 Mar 2011 | 1:24:09 UTC

I've got several of these in a row now. They usually say "long runs (8-12 hours on fastest card)". They run for 55+ hours or so, longer than the report deadline. These jobs make my PC so slow that I can barely get the BOINC Manager up to stop the processing. If I restart processing they run fine for about 30 minutes and then the PC is in the same state again. It seems that when the PC is very sluggish the GPUGRID job makes no progress toward completion. How can I cancel one of these jobs myself?

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 20728 - Posted: 20 Mar 2011 | 9:18:46 UTC - in response to Message 20725.

If you have to terminate a task: in BOINC Manager (Advanced View), under the Tasks tab, click on the task and then click Abort.
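For anyone whose machine is too sluggish to get the Manager window up, the same abort can be issued from a command prompt with BOINC's `boinccmd` tool, using its `--task <project URL> <task name> abort` form. Below is a minimal Python sketch that only builds and (optionally) runs that command. The project URL and task name shown are placeholders, not values from this thread; the URL has to match what your own client reports (for example via `boinccmd --get_tasks`).

```python
import subprocess


def build_abort_command(project_url, task_name):
    """Return the boinccmd argument list that aborts a single task."""
    return ["boinccmd", "--task", project_url, task_name, "abort"]


def abort_task(project_url, task_name):
    """Run the abort command.

    Assumes boinccmd is on PATH and a BOINC client is running locally.
    Raises CalledProcessError if boinccmd exits with a non-zero status.
    """
    return subprocess.run(build_abort_command(project_url, task_name), check=True)


if __name__ == "__main__":
    # Placeholder values (assumptions): substitute the project URL and the
    # task name exactly as your own client lists them.
    cmd = build_abort_command("http://www.gpugrid.net/", "EXAMPLE_TASK_NAME")
    print(" ".join(cmd))
```

Replacing `abort` with `suspend` or `resume` in the argument list gives the other two task operations that `boinccmd --task` accepts.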

I'm guessing this task was OK; it finished on a GTX 470 in about 10 h.
Your task ran for about 2.4 days, so your system was off for more than 2.6 days during the 5-day limit:

Sent 14 Mar 2011 11:32:13 UTC
Received 20 Mar 2011 0:20:53 UTC
Report deadline 19 Mar 2011 11:32:13 UTC

This task was started and stopped 18 times. That's far too many, and it no doubt interfered with checkpointing (after a restart you go back to the last checkpoint). The first checkpoint (at about 5%, I think) might not be reached before the task is stopped.
Do you have Leave Applications In Memory (LAIM) checked? All GPUGrid crunchers should.
If not, I guess you are restarting, or shutting down and starting up, 3 or 4 times a day.
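The timing figures in this post can be checked with a few lines of Python. This is only a sketch of the arithmetic: the three timestamps are the ones quoted from the task record, the 2.4-day run time is the approximate figure given in the post, and the month parsing assumes an English locale.

```python
from datetime import datetime, timedelta

FMT = "%d %b %Y %H:%M:%S"  # e.g. "14 Mar 2011 11:32:13" (UTC)

sent = datetime.strptime("14 Mar 2011 11:32:13", FMT)
received = datetime.strptime("20 Mar 2011 0:20:53", FMT)
deadline = datetime.strptime("19 Mar 2011 11:32:13", FMT)

window = deadline - sent        # time allowed before the report deadline
turnaround = received - sent    # time the task was actually out
overdue = received - deadline   # how late the result came back

# Approximate run time quoted in the post (an estimate, not a logged value).
run_time = timedelta(days=2.4)
idle_in_window = window - run_time  # time not crunching within the deadline

print("window:        ", window)
print("turnaround:    ", turnaround)
print("overdue by:    ", overdue)
print("idle in window:", round(idle_in_window.total_seconds() / 86400, 2), "days")
```

The window comes out at exactly 5 days, and 5 days minus roughly 2.4 days of run time gives the "more than 2.6 days" of idle time mentioned above.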

tzpmrz
Joined: 8 May 10
Posts: 5
Credit: 140,025,313
RAC: 0
Message 20746 - Posted: 21 Mar 2011 | 12:35:20 UTC - in response to Message 20728.

This is true, I do stop the jobs, since the PC is not dedicated to the project. It's my home PC. I won't stop processing for, say, web browsing, but I have to stop it to play video games. It is what it is.

The problem with the jobs I'm complaining about is that I can't even browse the web; they seem to be leaking video memory or something, because they cripple the machine. There are GPUGRID jobs running now and I can still browse, and 99% of the jobs don't cripple the machine.

tzpmrz
Joined: 8 May 10
Posts: 5
Credit: 140,025,313
RAC: 0
Message 20783 - Posted: 25 Mar 2011 | 13:48:53 UTC - in response to Message 20728.

The change in my system is that I have enabled SLI.
It turns out that if I run only one GPUGRID job at a time I don't get the crippling effect. I can suspend either one of the jobs and the PC becomes usable again. I think the system is thrashing video memory when just the right mix of GPUGRID jobs run together.



Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 20784 - Posted: 25 Mar 2011 | 16:47:23 UTC - in response to Message 20783.

It's recommended that crunchers disable SLI when crunching.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 20789 - Posted: 27 Mar 2011 | 10:33:27 UTC - in response to Message 20784.

A while back I had two GT240s in one i7-860 system running Windows Server 2003 x64. I noticed a few tasks that caused an error message and then restarted (from the start). It only happened now and again, but over time there were quite a lot of such failures. Sometimes the tasks ran for many hours before resetting (5h, 12h, 15h). Each time a Windows error message popped up. In the past we could restart the system and these tasks would begin from the last checkpoint. This no longer works. When I only used one GPU the problem seemed to go away.

I have since replaced the two GT240s with one GTX260-216, and moved to 6.13 to crunch the long tasks. Until now this had worked without any problem. This morning the problem reappeared with an IBUCH task.

27/3/11
Faulting application acemdlong_6.13_windows_intelx86__cuda31.exe, version 0.0.0.0, faulting module acemdlong_6.13_windows_intelx86__cuda31.exe, version 0.0.0.0, fault address 0x000025e4.

19/3/11
Faulting application acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, faulting module acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, fault address 0x000026e4.

17/3/11 (twice)
Faulting application acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, faulting module acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, fault address 0x000026e4.

16/3/11
Faulting application acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, faulting module acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, fault address 0x000026e4.

7/3/11
Faulting application acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, faulting module acemd2_6.12_windows_intelx86__cuda, version 0.0.0.0, fault address 0x000026e4.
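To see whether the crashes cluster at the same spot, the "Faulting application" lines above can be tallied by application and fault address. The following is a rough Python sketch; the regular expression is written against the line format quoted in this post and is an assumption about how other Event Log entries look.

```python
import re
from collections import Counter

# Pattern matched against the "Faulting application ..." lines quoted above.
FAULT_RE = re.compile(
    r"Faulting application (?P<app>[^,\s]+), version [\d.]+, "
    r"faulting module (?P<module>[^,\s]+), version [\d.]+, "
    r"fault address (?P<addr>0x[0-9a-fA-F]+)\."
)


def parse_fault(line):
    """Return (application, fault_address) or None if the line does not match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    return match.group("app"), int(match.group("addr"), 16)


# The two distinct entries from the post; the 6.12 entry is reported for
# five crashes in total (7/3, 16/3, 17/3 twice, 19/3).
LINE_613 = (
    "Faulting application acemdlong_6.13_windows_intelx86__cuda31.exe, "
    "version 0.0.0.0, faulting module acemdlong_6.13_windows_intelx86__cuda31.exe, "
    "version 0.0.0.0, fault address 0x000025e4."
)
LINE_612 = (
    "Faulting application acemd2_6.12_windows_intelx86__cuda, "
    "version 0.0.0.0, faulting module acemd2_6.12_windows_intelx86__cuda, "
    "version 0.0.0.0, fault address 0x000026e4."
)

log_lines = [LINE_613] + [LINE_612] * 5
tally = Counter(parse_fault(line) for line in log_lines)

for (app, addr), count in tally.most_common():
    print(f"{count}x {app} at {addr:#010x}")
```

Within each application version the fault address is identical every time, which suggests the crashes go through one code path per build rather than landing at random locations.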

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 20796 - Posted: 27 Mar 2011 | 22:06:33 UTC - in response to Message 20789.

Got another IBUCH failure, p5-IBUCH_1_wtEGFR_110325-1-10-RND6118_2. This one did not restart, so perhaps this is a different problem.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 20831 - Posted: 31 Mar 2011 | 20:15:21 UTC - in response to Message 20796.

Got another error message just after a reboot. This seems to be the common theme.

Opened Task Manager and killed the process trying to report the error, and this prevented the task failing. This is clearly a MS security handling issue and nothing to do with the CUDA app, again.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 20882 - Posted: 8 Apr 2011 | 19:16:38 UTC - in response to Message 20831.

... and again. This time on restarting after tasks were suspended. No ability to close anything in Task Manager to stop the task error, and closing the error message [X] killed the task! This is down to the way Windows interprets the Application Popup.

Event Type: Information
Event Source: Application Popup
Event Category: None
Event ID: 26
Date: 08/04/2011
Time: 19:00:09
User: N/A
Computer: S
Description:
Application popup: acemdlong_6.13_windows_intelx86__cuda31.exe - Application Error : The exception unknown software exception (0x40000015) occurred in the application at location 0x004025e4.

Click on OK to terminate the program
Click on CANCEL to debug the program


Event Type: Error
Event Source: Application Error
Event Category: (100)
Event ID: 1000
Date: 08/04/2011
Time: 19:00:09
User: N/A
Computer: S
Description:
Faulting application acemdlong_6.13_windows_intelx86__cuda31.exe, version 0.0.0.0, faulting module acemdlong_6.13_windows_intelx86__cuda31.exe, version 0.0.0.0, fault address 0x000025e4.

Failed after 5h for no reason other than the operating system decided to terminate it.
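The numbers in the two events above can be picked apart a little further. The sketch below splits the exception code into the standard NTSTATUS bit fields (severity, customer flag, facility, code) and compares the popup's "location" with the logged fault address. The image base of 0x00400000 is an assumption: it is the conventional default for 32-bit Windows executables and has not been confirmed for this binary.

```python
SEVERITY_NAMES = ["success", "informational", "warning", "error"]

# Assumption: conventional default load address for a 32-bit Windows EXE.
ASSUMED_IMAGE_BASE = 0x00400000


def decode_ntstatus(code):
    """Split a 32-bit NTSTATUS value into its documented bit fields."""
    return {
        "severity": SEVERITY_NAMES[(code >> 30) & 0x3],  # bits 31-30
        "customer": bool((code >> 29) & 0x1),            # bit 29
        "facility": (code >> 16) & 0xFFF,                # bits 27-16
        "code": code & 0xFFFF,                           # bits 15-0
    }


def offset_in_image(location, image_base=ASSUMED_IMAGE_BASE):
    """Offset of an absolute address from the (assumed) image base."""
    return location - image_base


if __name__ == "__main__":
    fields = decode_ntstatus(0x40000015)
    print(fields)
    print(hex(offset_in_image(0x004025E4)))
```

Under that assumption the popup location 0x004025e4 sits at offset 0x25e4, the same value the Application Error event logs as the fault address, so both events appear to describe one crash site. The 0x4 prefix puts the code in the informational severity class; as far as I know the Windows headers name 0x40000015 STATUS_FATAL_APP_EXIT, but treat that name as unverified here.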

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 20952 - Posted: 14 Apr 2011 | 9:08:32 UTC - in response to Message 20882.

Windows Server 2003 x64 with a GTX260. A long IBUCH task has been running for 8h 50min with 4h to go, but the percentage complete is zero and I get this acemdlong_6.13_windows_intelx86__cuda21.exe error:
The exception unknown software exception (0x40000005) occurred in the application at location 0x004025e4.
Click on OK to terminate the program
Click on CANCEL to debug the program
