Author |
Message |
|
Every two or three days I catch a NATHAN_dhfr36 workunit on a random one of my hosts that is still at 0% progress after a very long time (while the GPU usage seems to be normal). I caught one after 86 hours. Some of them can be fixed by a system restart.
The latest one is this workunit (my host is the 4th). It had been hanging for 15 hours when I spotted it. A restart didn't help in this case, so I aborted it. It usually happens on my multi-GPU hosts. |
|
|
nate
Joined: 6 Jun 11 Posts: 124 Credit: 2,928,865 RAC: 0
|
Well, that's strange. Thanks for letting us know. For the most part my NATHAN workunits have been very stable with low error rates, though I'm not sure whether issues like the ones you have experienced would show up in the error statistics (I'll look at that). Since crunchers have been complaining about this type of problem (stuck workunits) more often recently, I wonder if there is some subtle issue with the app/client/driver, and not just a specific group of workunits... I'm noticing some very strange error messages in that group you linked. Hmmm... |
|
|
|
I've got another one of these. I spotted it after 74 hours...
After a system restart it ran into an application error. |
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
|
This is your task's stderr output:
<core_client_version>6.10.60</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
MDIO: cannot open file "output.restart.coor"
MDIO: cannot open file "output.restart.coor"
Kernel not foundAssertion failed: a, file swanlibnv2.cpp, line 59
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
</stderr_txt>
]]>
Earlier today I had a W7x64 system blue screen, and on restarting the app crashed. It was a NOELIA_PEPTGPRC WU; however, I also got the same kernel error message:
Kernel not foundAssertion failed: a, file swanlibnv2.cpp, line 59
There might be something in it.
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help |
|
|
Beyond
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
skgiven wrote:
Earlier today I had a W7x64 system blue screen, and on restarting the app crashed. It was a NOELIA_PEPTGPRC WU; however, I also got the same kernel error message:
Kernel not foundAssertion failed: a, file swanlibnv2.cpp, line 59
There might be something in it.
Any WU with the word NOELIA in it is a nightmare here and unfortunately forces me to move to a different project. Just wasted another several days of GPU time :-(
|
|
|
TJ
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0
|
Is it wise to abort this WU: I41R15-NATHAN_dhfr36_6-8-32-RND7935?
It errored out for five previous crunchers. On my GTX550Ti it will take another 34 hours to complete, with 1 hour and 20 minutes elapsed so far (9.5%).
Thanks for the input.
____________
Greetings from TJ |
|
|
|
I cannot see your setup, but if in doubt, abort. 5 resends do not inspire confidence! |
|
|
skgiven (Volunteer moderator, Volunteer tester)
|
9.5% in 80 min: at that rate it would take ~14 h in total (if it doesn't crash). Best to ignore some of BOINC's estimates.
If you post the result ID then we might be able to see the WU.
TJ wrote:
Is it wise to abort this WU: I41R15-NATHAN_dhfr36_6-8-32-RND7935?
It errored out for five previous crunchers. On my GTX550Ti it will take another 34 hours to complete, with 1 hour and 20 minutes elapsed so far (9.5%).
Thanks for the input.
|
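The ~14 h figure in the post above is a straight-line extrapolation from the progress so far. A minimal sketch of that arithmetic, assuming progress is roughly linear in time (the function name is hypothetical and not part of BOINC):

```python
def estimate_total_hours(elapsed_minutes: float, fraction_done: float) -> float:
    """Linear extrapolation: total runtime = elapsed time / fraction completed."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_minutes / fraction_done / 60.0


# Figures quoted in the thread: 9.5% done after 80 minutes.
total = estimate_total_hours(80, 0.095)
remaining = total - 80 / 60
print(f"total ~{total:.1f} h, remaining ~{remaining:.1f} h")
# prints: total ~14.0 h, remaining ~12.7 h
```

This only holds if the task keeps running at the same rate, which is why it can differ so much from the remaining-time estimate the client displays.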
|
|
TJ
|
This is the WU: http://www.gpugrid.net/workunit.php?wuid=4427345.
It is 51% done after 7 hours.
____________
Greetings from TJ |
|
|
skgiven (Volunteer moderator, Volunteer tester)
|
These are the recent results from the computers that your task has previously run on:
http://www.gpugrid.net/results.php?hostid=151016 - GTX570, 314.22, W7
http://www.gpugrid.net/results.php?hostid=150071 - GTX 470, ? , Linux
http://www.gpugrid.net/results.php?hostid=138340 - GTX560Ti, 306.97, W7
http://www.gpugrid.net/results.php?hostid=151494 - GTX 470, 310.90, W7
http://www.gpugrid.net/results.php?hostid=151214 - GTX 560, 314.22, W7
If you click the links you will see that most of these systems have had a lot of errors recently, suggesting that their BOINC settings are bad. For example, they might have configured their systems not to use the GPU while the computer is in use, and are getting lots of driver restarts and app crashes as a result. The systems differ in their drivers and GPUs, and there is even a Linux rig.
Their error messages were also different, which again indicates setup problems. If the errors were all the same, it would indicate a bug or WU problem.
It's a pity moderators/testers can't actually see the BOINC setups being used, as this prevents proactive advice and limits what we have to go on when advising those who ask for help. Also, still no Linux driver in the list!
|
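The reasoning in the post above (identical errors across hosts point at the workunit or app, differing errors point at the individual setups) can be sketched as a rough heuristic. This is only an illustration: the function name, the 0.8 threshold, and the placeholder error labels are assumptions, not anything GPUGRID or BOINC provides, and the labels are not the actual errors from the hosts listed.

```python
from collections import Counter


def classify_failures(signatures: list[str], threshold: float = 0.8) -> str:
    """Guess whether failed results point at the workunit or at the hosts.

    signatures: one error message (or a normalised label) per failed result.
    threshold:  assumed share of identical errors needed to suspect the WU/app.
    """
    if not signatures:
        return "no data"
    # Share of failures that carry the single most common error message.
    most_common_count = Counter(signatures).most_common(1)[0][1]
    share = most_common_count / len(signatures)
    if share >= threshold:
        return "likely WU/app bug"
    return "likely host setup problems"


# Placeholder labels for illustration only.
print(classify_failures(["error A"] * 5))
# prints: likely WU/app bug
print(classify_failures(["error A", "error B", "error C", "error D", "error E"]))
# prints: likely host setup problems
```

With only five failed results the verdict is weak evidence either way, so it is best treated as a hint about where to look first.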
|
|
TJ
|
Thanks for your explanation skgiven!
I will let the WU run, still 5 hours estimate, so I will let you know tomorrow how it ended.
Indeed, I have set BOINC to always use the GPU, and I have no driver or app problems.
The only error I have had in recent weeks was when Microsoft was updating the NVIDIA driver, which I terminated. I have since set it so that Microsoft is only allowed to update Windows itself, and I decide when to install updates and when to reboot. I usually set BOINC to no new work first, to be sure.
____________
Greetings from TJ |
|
|
TJ
|
TJ wrote:
Thanks for your explanation skgiven!
I will let the WU run, still 5 hours estimate, so I will let you know tomorrow how it ended.

Well, it finished in 14.3 hours without error and got 70,800 credits.
So I'm happy that I let it run.
____________
Greetings from TJ |
|
|
|
First NATHAN_dhfr36 task finished in 4.96 hours on 660TI running Linux Mint. No problems and 70,800 credits. Will be rebooting to XP soon to see how they run there. |
|
|
|
First NATHAN_dhfr36 task completed without errors on XP with 660TI. Took about 13 minutes longer than the same card on Linux Mint. |
|
|
Beyond
|
Thankfully the NATHANs are back so all 7 of my GPUs can again run GPUGrid. So far have run hundreds of the NATHAN_dhfr36 WUs without any errors that I know of (TONIs too). |
|
|