
Message boards : Number crunching : Lots of errors

nanoprobe
Message 41152 - Posted: 27 May 2015 | 0:00:10 UTC

I'm starting to get a high number of short tasks that error out. Can someone explain why this is happening and how I can fix it? I haven't changed any settings.
Here is the log from one of the failed tasks.
WinXP SP3, dual GTX 750 Ti



http://www.gpugrid.net/result.php?resultid=14202446

Bedrich Hajek
Message 41154 - Posted: 27 May 2015 | 0:59:50 UTC - in response to Message 41152.

I am getting errors too, but mine are with GERARD_EQUI WUs. Three had errors, two finished OK. It seems to be a bad batch.

https://www.gpugrid.net/result.php?resultid=14210451



895456x4-GERARD_EQUI_26Apr_CXCL-0-1-RND0321_4
Workunit 10949024
Created 26 May 2015 | 23:34:38 UTC
Sent 26 May 2015 | 23:34:54 UTC
Received 26 May 2015 | 23:50:42 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 30790
Report deadline 31 May 2015 | 23:34:54 UTC
Run time 87.09
CPU time 77.31
Validate state Invalid
Credit 0.00
Application version Short runs (2-3 hours on fastest card) v8.47 (cuda65)
Stderr output
<core_client_version>7.4.42</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 690] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 690
# ECC : Disabled
# Global mem : 2047MB
# Capability : 3.0
# PCI ID : 0000:05:00.0
# Device clock : 1019MHz
# Memory clock : 3004MHz
# Memory width : 256bit
# Driver version : r343_98 : 34411
# GPU [GeForce GTX 690] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 690
# ECC : Disabled
# Global mem : 2047MB
# Capability : 3.0
# PCI ID : 0000:05:00.0
# Device clock : 1019MHz
# Memory clock : 3004MHz
# Memory width : 256bit
# Driver version : r343_98 : 34411
# GPU [GeForce GTX 690] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 690
# ECC : Disabled
# Global mem : 2047MB
# Capability : 3.0
# PCI ID : 0000:05:00.0
# Device clock : 1019MHz
# Memory clock : 3004MHz
# Memory width : 256bit
# Driver version : r343_98 : 34411
# GPU [GeForce GTX 690] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 690
# ECC : Disabled
# Global mem : 2047MB
# Capability : 3.0
# PCI ID : 0000:04:00.0
# Device clock : 1019MHz
# Memory clock : 3004MHz
# Memory width : 256bit
# Driver version : r343_98 : 34411
# GPU 0 : 63C
# GPU 1 : 73C
# The simulation has become unstable. Terminating to avoid lock-up (1)



Jacob Klein
Message 41155 - Posted: 27 May 2015 | 1:07:33 UTC
Last modified: 27 May 2015 | 1:10:10 UTC

I'm getting some nasty errors too, with the GERARD_EQUI_26Apr_CXCL tasks. They're causing major TDRs (Windows Timeout Detection and Recovery resets), which in turn cause hardware-acceleration problems in other applications (like web browsing or gaming), and also driver problems where the clocks never go back to the 3D-mode clocks.

Admins: Please look into which batches need to be revoked, to prevent these problems. It's a major headache, for me at least.

1154144x3-GERARD_EQUI_26Apr_CXCL-0-1-RND9216_7
http://www.gpugrid.net/result.php?resultid=14210052
895456x5-GERARD_EQUI_26Apr_CXCL-0-1-RND9089_5
http://www.gpugrid.net/result.php?resultid=14210507

nanoprobe
Message 41156 - Posted: 27 May 2015 | 2:06:03 UTC

All my tasks are now erroring out. Suspending this project for now until this issue is resolved.

Eric
Message 41157 - Posted: 27 May 2015 | 4:52:30 UTC

I've actually been having issues with the Graphics drivers themselves crashing and windows having to recover.

Stoneageman
Message 41158 - Posted: 27 May 2015 | 5:46:47 UTC

Same here. Now have five GERARD_EQUI_26Apr_CXCL tasks crashed.

Gerard
Message 41160 - Posted: 27 May 2015 | 9:05:55 UTC

Could you please post your errors in this thread? I will cancel the batch if they persist. Thanks for your patience...

tito
Message 41161 - Posted: 27 May 2015 | 10:54:28 UTC

https://www.gpugrid.net/result.php?resultid=14210324
Short WU errored after 80 seconds on a 750 Ti.

nanoprobe
Message 41162 - Posted: 27 May 2015 | 11:10:02 UTC
Last modified: 27 May 2015 | 11:10:52 UTC

Could you please post your errors in this thread? I will cancel the batch if they persist. Thanks for your patience...



https://www.gpugrid.net/result.php?resultid=14210504

Gerard
Message 41164 - Posted: 27 May 2015 | 12:50:55 UTC
Last modified: 27 May 2015 | 12:52:26 UTC

We detected an unexpected parameterization error in some of the simulations and we have just cancelled them. Sorry for any inconvenience caused, and thank you for reporting it to us! If you find any other errors, please do not hesitate to tell us (hopefully this particular issue is already resolved).

Jacob Klein
Message 41166 - Posted: 27 May 2015 | 12:56:58 UTC

Excellent. Thank you!!

nanoprobe
Message 41168 - Posted: 27 May 2015 | 15:10:08 UTC

All short-run tasks are still failing here. Links to the last 4:

https://www.gpugrid.net/result.php?resultid=14213349

https://www.gpugrid.net/result.php?resultid=14211957

https://www.gpugrid.net/result.php?resultid=14211914

https://www.gpugrid.net/result.php?resultid=14211712

Jacob Klein
Message 41169 - Posted: 27 May 2015 | 16:07:07 UTC - in response to Message 41168.
Last modified: 27 May 2015 | 16:08:05 UTC

nanoprobe:

What is the exact make/model of your GPU? Do the tasks still fail when the Boost clock is set to the reference clock? My hunch is that your GPU is overclocked too much, either by the factory or by you.

"The simulation has become unstable. Terminating to avoid lock-up"
... generally means that you are overclocking too much, or have a hardware problem... from my experience.

[CSF] Thomas H.V. DUPONT
Message 41170 - Posted: 27 May 2015 | 17:14:22 UTC - in response to Message 41166.

Excellent. Thank you!!

+1 :)
____________
[CSF] Thomas H.V. Dupont
Founder of the team CRUNCHERS SANS FRONTIERES 2.0
www.crunchersansfrontieres

nanoprobe
Message 41171 - Posted: 27 May 2015 | 17:37:37 UTC - in response to Message 41169.
Last modified: 27 May 2015 | 17:39:06 UTC

nanoprobe:

What is the exact make/model of your GPU? Do the tasks still fail when the Boost clock is set to the reference clock? My hunch is that your GPU is overclocked too much, either by the factory or by you.

"The simulation has become unstable. Terminating to avoid lock-up"
... generally means that you are overclocking too much, or have a hardware problem... from my experience.

The cards are PNY 750 Tis: no factory O/C, no 6-pin PCI-E power plugs, about 60 W load at 99% utilization. They've been running stock out of the box since I bought them, and I've been running the short tasks on these cards since I got them without ever seeing the failure rate I've been experiencing lately.
If it were one card producing all or most of the errors I would suspect the card, but the tasks are failing on both cards.

zdnko
Message 41172 - Posted: 27 May 2015 | 17:52:14 UTC

1232906x8-GERARD_EQUI_26Apr_CXCL-0-1-RND1418_4 caused a lot of GPU driver crashes. Stopped!

Jacob Klein
Message 41173 - Posted: 27 May 2015 | 18:13:46 UTC - in response to Message 41171.
Last modified: 27 May 2015 | 18:21:17 UTC

nanoprobe:

Can you supply the exact model of the GPU, to confirm that it's not factory-overclocked?

Alternatively, could you use GPU-Z to confirm that the GPU Clock and Default Clock say 1020 MHz (which is the stock speed of a GTX 750 Ti, per http://en.wikipedia.org/wiki/GeForce_700_series)?

If it's anything above 1020, then it is in fact overclocked, and I recommend using EVGA Precision X to downclock it back to reference 1020 MHz, to see if it helps.
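
For anyone who would rather check this from a script than with GPU-Z, below is a minimal sketch (not part of anything GPUGrid ships) that reads the clocks the NVIDIA driver reports through NVML, assuming the pynvml Python bindings are installed. It only prints current versus driver-reported maximum clocks; you would still compare the core number against the 1020 MHz reference clock from the Wikipedia page above to judge whether the card is overclocked.

# Minimal sketch: list each GPU's current and maximum clocks via NVML.
# Assumes the 'pynvml' package is installed and a driver with NVML support.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(h)
        if isinstance(name, bytes):          # older pynvml returns bytes
            name = name.decode()
        core = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_GRAPHICS)
        core_max = pynvml.nvmlDeviceGetMaxClockInfo(h, pynvml.NVML_CLOCK_GRAPHICS)
        mem = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_MEM)
        mem_max = pynvml.nvmlDeviceGetMaxClockInfo(h, pynvml.NVML_CLOCK_MEM)
        print(f"GPU {i} {name}: core {core}/{core_max} MHz, mem {mem}/{mem_max} MHz")
finally:
    pynvml.nvmlShutdown()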

I'm getting frustrated trying to help by offering advice that gets ignored.

nanoprobe
Message 41190 - Posted: 29 May 2015 | 1:00:26 UTC - in response to Message 41173.

I'm getting frustrated trying to help by offering advice that gets ignored.

WOW! Let me offer you some advice: if it doesn't concern life, death or health, then it surely isn't worth getting frustrated over.

FWIW, the issue seems to have cleared up. The faulty WUs have been taken care of. Thanks for your help.

John C MacAlister
Message 41191 - Posted: 29 May 2015 | 3:05:39 UTC

I have had over 20 WUs fail... on my GTX 660 Ti devices. I will stop getting tasks and now go to bed....

Jacob Klein
Message 41192 - Posted: 29 May 2015 | 3:49:35 UTC - in response to Message 41190.
Last modified: 29 May 2015 | 3:52:39 UTC

nanoprobe:

I'm getting frustrated trying to help by offering advice that gets ignored.

WOW! Let me offer you some advice: if it doesn't concern life, death or health, then it surely isn't worth getting frustrated over.

FWIW, the issue seems to have cleared up. The faulty WUs have been taken care of. Thanks for your help.


There were some faulty WUs, but they have nothing to do with tasks erroring out with "Simulation has become unstable." messages and no other error messages. Errors like yours are usually a result of overclocking too much. Please keep my advice (lower clocks to reference clocks) in mind the next time you try to troubleshoot those errors.

Good luck,
Jacob

nanoprobe
Message 41194 - Posted: 29 May 2015 | 9:54:12 UTC

There were some faulty WUs, but they have nothing to do with tasks erroring out with "Simulation has become unstable." messages and no other error messages.

There is no way you could know this for sure.

Errors like yours are usually a result of overclocking too much.

As I stated before, these cards are not overclocked.

The problem came and went without me changing anything on my setup. Therefore my conclusion is that the WUs were the problem.

Moving on.

Jacob Klein
Message 41197 - Posted: 29 May 2015 | 12:54:28 UTC - in response to Message 41194.
Last modified: 29 May 2015 | 12:56:30 UTC

The problem is still ongoing, for you.
See: https://www.gpugrid.net/result.php?resultid=14213909
# The simulation has become unstable. Terminating to avoid lock-up (1)

I know you said there is no factory overclock, but I still would love to know what GPU-Z says for your "GPU Clock" and "Default Clock". If you refuse to share, so be it. And if it's anything above 1020, then it is in fact overclocked.

Killersocke
Message 41198 - Posted: 29 May 2015 | 14:20:26 UTC

Jacob:
We are all very well aware of the idea of overclocking.
But honestly: are you really thinking I will and need to shut down my system to get the GPU stuff running? Really?
Let's put it this way: if the GPU stuff is not running properly on a majority of systems, and several users have the same experience, and we (the users) do this for free - should we shut down an otherwise properly running system for that?
I really don't think so.
In my opinion, if the GPU work doesn't run - make it run, instead of going on about the overclocking stuff.
If the GPU work won't run properly, I'll simply switch to something else - you know, MY PC, MY time, MY decision.
Or to use a proverb: not my circus, not my monkeys.

Kind regards,

Jacob Klein
Message 41200 - Posted: 29 May 2015 | 15:31:45 UTC - in response to Message 41198.
Last modified: 29 May 2015 | 15:33:21 UTC

I have definitely had certain GPU things, such as games and GPUGrid tasks, crash or error ("Simulation has become unstable"), as a direct result of a factory-overclock that was too aggressive. If the GPU is overclocked at all, and you are trying to resolve any GPU problem, you should see if lowering the clocks resolves the problem.

Yes, honestly.

I have 3 factory-overclocked GPUs. GPU 1 was factory-overclocked way too aggressively, and I've had to dial it back quite a bit to be completely stable in my games and with GPUGrid. GPU 2 was factory-overclocked too little, and I could push it even farther before noticing problems. And GPU 3 was factory-overclocked just right.

Forgive me for trying to help.

skgiven (Volunteer moderator, Volunteer tester)
Message 41201 - Posted: 29 May 2015 | 15:35:07 UTC - in response to Message 41198.
Last modified: 29 May 2015 | 15:36:18 UTC

Nanoprobe, it may be the case that these Noelia WUs tax the card more than other WUs, and it does appear (from what I've seen) to affect the smaller/older cards more; the same WUs that fail on your GTX 750 Ti complete on other systems, but some also fail on the older/smaller cards.

While GPU temps look OK, the GDDR5 memory temps might be quite high. I would suggest reducing the GDDR5 clock and the GPU clock by 10% to see if that prevents the errors from recurring, or just crunch the long WUs, which are a different type (and are now fixed).
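
As a rough illustration of the arithmetic behind that 10% suggestion (the numbers below are placeholders, not authoritative reference clocks; plug in whatever GPU-Z reports for your own card), a minimal sketch:

# Minimal sketch: compute the target clocks for a ~10% reduction and the
# (negative) offsets you would dial in with a tool such as Precision X.
# The example clocks are placeholders, not official GTX 750 Ti values.
def underclock_targets(core_mhz, mem_mhz, cut=0.10):
    """Return (target_core, target_mem, core_offset, mem_offset) in MHz."""
    target_core = core_mhz * (1.0 - cut)
    target_mem = mem_mhz * (1.0 - cut)
    return target_core, target_mem, target_core - core_mhz, target_mem - mem_mhz

core, mem = 1020.0, 2700.0   # placeholder values for illustration only
tc, tm, dc, dm = underclock_targets(core, mem)
print(f"core: {core:.0f} -> {tc:.0f} MHz (offset {dc:+.0f}), "
      f"mem: {mem:.0f} -> {tm:.0f} MHz (offset {dm:+.0f})")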

I recently used XP with a couple of GPUs and found the drivers to be not great. I would also suggest a regular cold start, just in case of runaway errors, which appears to have been the case back on the 20th.

Good luck,
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Retvari Zoltan
Message 41202 - Posted: 29 May 2015 | 16:03:47 UTC - in response to Message 41198.
Last modified: 29 May 2015 | 16:16:14 UTC

Jacob:
We are all very well aware of the idea of overclocking.
But honestly: are you really thinking I will and need to shut down my system to get the GPU stuff running? Really?

Sometimes this is the only (and the fastest) way to fix a malfunctioning system. I had some GPUGrid app crashes in the past on one of my dual-GPU systems, which caused the other GPU to fail tasks too. In my opinion it's good practice to restart a Windows-based system once a week (by a scheduled task), regardless of whether it's running error-free, to maintain its stability - especially when running GPU and CPU tasks simultaneously.
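
As an illustration of that scheduled weekly restart idea, here is a minimal sketch that registers such a task through the standard Windows schtasks utility; the task name, day and time are arbitrary examples, and it needs to be run from an administrator prompt.

# Minimal sketch: register a weekly reboot task via the built-in schtasks.exe.
# Task name, day and time are arbitrary examples.
import subprocess

subprocess.run(
    [
        "schtasks", "/Create",
        "/TN", "WeeklyCruncherReboot",   # arbitrary task name
        "/TR", "shutdown /r /t 60",      # reboot with a 60-second warning
        "/SC", "WEEKLY", "/D", "SUN",    # every Sunday
        "/ST", "03:00",                  # at 03:00 local time
    ],
    check=True,                          # raise if schtasks reports an error
)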

Let's put it this way: if the GPU stuff is not running properly on a majority of systems, and several users have the same experience, and we (the users) do this for free - should we shut down an otherwise properly running system for that?

This is more of a rhetorical question, but - as you probably know - there's no warranty for any software (free or commercial) to work on every existing hardware configuration.
Besides, your question takes a set of other software as the reference for what qualifies a system as properly running, but from the "no warranty" point it follows that no such reference set exists. To put it another way: I wouldn't call a system properly running if GPUGrid tasks produce "The simulation has become unstable. Terminating to avoid lock-up" messages on that particular system only, while the same tasks run fine on the next host they are assigned to.

I really don't think so.
In my opinion, if the GPU work doesn't run - make it run, instead of going on about the overclocking stuff.

If someone asks for help, it's because they can't figure out the reason for the error, so it might be useful to try things that don't make sense at first sight. I have a GTX 780 Ti on which I had to reduce the GDDR5 clock to 2900 MHz (from 3500 MHz) to make it work with GPUGrid (it was brand new). GPUs (and other components) age, so they might not perform as well as before, and different tasks tax the GPU differently. You can't step into the same river twice.

If the GPU work won't run properly, I'll simply switch to something else - you know, MY PC, MY time, MY decision.
Or to use a proverb: not my circus, not my monkeys.

If all else fails, or you've grown tired of trying different workarounds, you can do that. Still, fixing the errors on a given system is not the project's responsibility.

nanoprobe
Message 41203 - Posted: 29 May 2015 | 17:23:50 UTC - in response to Message 41197.

The problem is still ongoing, for you.
See: https://www.gpugrid.net/result.php?resultid=14213909
# The simulation has become unstable. Terminating to avoid lock-up (1)

I know you said there is no factory overclock, but I still would love to know what GPU-Z says for your "GPU Clock" and "Default Clock". If you refuse to share, so be it. And if it's anything above 1020, then it is in fact overclocked.


I think 1 error out of 20 tasks is about the same as I was experiencing before the problem WUs arrived.
And just for the record, the task you linked to completed and validated. The last failed one was more than 2 days ago:
https://www.gpugrid.net/result.php?resultid=14213349

Jacob Klein
Message 41204 - Posted: 29 May 2015 | 17:29:54 UTC
Last modified: 29 May 2015 | 17:30:24 UTC

Please keep my advice (use lower clocks) in mind, the next time you try to troubleshoot any GPU errors.

[CSF] Thomas H.V. DUPONT
Message 41211 - Posted: 30 May 2015 | 6:33:42 UTC - in response to Message 41204.

Please keep my advice (use lower clocks) in mind, the next time you try to troubleshoot any GPU errors.

Jacob, please keep this point in mind: your tips are always appreciated and we're fortunate to have you! :)
I will also make adjustments on my GTX 760 via Precision X, because I also get errors with LONG RUNS (Gerard).
I will publish the settings and the results in this thread.

Of course, I'm also thinking of Retvari Zoltan and skgiven, whose advice is also very valuable :) Thanks guys!
____________
[CSF] Thomas H.V. Dupont
Founder of the team CRUNCHERS SANS FRONTIERES 2.0
www.crunchersansfrontieres
