Author |
Message |
skgiven Volunteer moderator Volunteer tester
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
Skip Da Shu is getting "Energies have become nan" task errors on this system.
All tasks error out with,
ERROR: file deven.cpp line 879: # Energies have become nan
System is Linux, Boinc Version 6.12.15.
Failed tasks are TONI_KKAL2 and GIANNI_DHFR1000 tasks. No other tasks ran.
The tasks have mostly failed on the GTS250, but also on the GT340.
Some tasks have started running on one card and later ran on the other (after a Boinc or system restart).
I would suggest you remove the GTS250 and reinstall the drivers if need be. |
|
|
MarkJ Volunteer moderator Volunteer tester
Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 Level
Scientific publications
|
I got one of these just now. Machine has a GTX570 in it. Running 266.58 drivers under Win 7 x64. It was a KASHIF_HIVPR work unit this time. It ran for 3 hours 46 mins before it died.
Link to wu here
____________
BOINC blog |
|
|
|
Sorted my old nan problem by underclocking the card's memory by 10% :-) |
|
|
MarkJ Volunteer moderator Volunteer tester
Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 Level
Scientific publications
|
And another one tonight here
I don't have the memory overclocked, but I do have the processor clock cranked up a little (1675 MHz for this run). For the previous wu I had it at 1700 MHz. I might try dropping it some more.
____________
BOINC blog |
|
|
|
And another one tonight here
I don't have the memory overclocked, but I do have the processor clock cranked up a little (1675 MHz for this run). For the previous wu I had it at 1700 MHz. I might try dropping it some more.
I observed that tasks with high GPU utilization (above 93%), typically GIANNI_DHFRs, KASHIF_HIVPRs and TONI_KKALs, are more prone to erroring out this way than others. The solution is either to raise the GPU voltage by 0.025V (and the fan speed too), or to lower the shader clock (or the GPU's memory clock) until these errors stop appearing. |
|
|
skgiven Volunteer moderator Volunteer tester
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
The GIANNI tasks sometimes failed when I was overclocking my GTX470's. Just reducing the clocks back to normal was enough, but some people had to up the voltage and some had to reduce the memory freq. Every GPU is different.
The temperatures rose a bit with these tasks as well, but I'm now using MSI Afterburner, which is configured to adjust fan speed automatically in response to temperature changes. |
|
|
skgiven Volunteer moderator Volunteer tester
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
Ton (ftpd) is seeing "Energies have become nan" failure messages and also "SWAN : FATAL : Cuda driver error 2bc in file 'swanlib_nv.c' in line 244" failures on his GTX295. They occur mostly on Toni's tasks, but seem to affect both long and short tasks, and there are similar errors on Ignasi's work.
(XP x86, driver: 27051, Boinc 6.10.60)
Is this a known issue that is being looked at? |
|
|
|
I've got one of the SWAN Cuda errors as well.
http://www.gpugrid.net/result.php?resultid=3882394
____________
Radio Caroline, the world's most famous offshore pirate radio station.
Great music since April 1964. Support Radio Caroline Team -
Radio Caroline |
|
|
ftpd
Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0 Level
Scientific publications
|
And again, after more than 7 hours of processing, 2 (two) WUs were cancelled on the GTX295.
And again I have to wait a long time for a new download.
A waste of valuable time/money/power etc!
Please look at it!
____________
Ton (ftpd) Netherlands |
|
|
skgiven Volunteer moderator Volunteer tester
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
I had 4 "ERROR: file tclutil.cpp line 31: get_Dvec() element 0 (b)
called boinc_finish" errors, but they all errored out inside 10sec. Using driver: 26724 on both systems, one with a GTX260 and the other with GTX470's.
The one task I recently had fail after some time (5h) was triggered by the system, again.
Ton, your problems might be exacerbated by the more recent Beta driver, but others have not reported problems with it, and the problems are there with the earlier driver too. So I think the problem is more likely to do with the tasks themselves - basically down to Toni, Ignasi and the rest of the team to sort out.
Good luck guys, |
|
|
Ross*
Joined: 6 May 09 Posts: 34 Credit: 443,507,669 RAC: 0 Level
Scientific publications
|
The following happened to several Wus with this driver on 2 boxes
A198-TONI_AGG1-8-100-RND4916_0
Workunit 2448115
Created 21 Apr 2011 19:47:36 UTC
Sent 21 Apr 2011 19:52:50 UTC
Received 21 Apr 2011 23:20:56 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 98 (0x62)
Computer ID 95964
Report deadline 26 Apr 2011 19:52:50 UTC
Run time 8796.666317
CPU time 1981.4
stderr out <core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 570"
# Clock rate: 1.56 GHz
# Total amount of global memory: 1275658240 bytes
# Number of multiprocessors: 15
# Number of cores: 120
# Device 1: "GeForce GTX 570"
# Clock rate: 1.56 GHz
# Total amount of global memory: 1275789312 bytes
# Number of multiprocessors: 15
# Number of cores: 120
MDIO ERROR: cannot open file "restart.coor"
ERROR: file deven.cpp line 879: # Energies have become nan
called boinc_finish
</stderr_txt>
]]>
Validate state Invalid
Claimed credit 35140.0810185185
Granted credit 0
application version Long runs (8-12 hours on fastest card) v6.13 (cuda31)
Also these
A560r5-TONI_AB1-21-100-RND5976_1
Workunit 2447770
Created 21 Apr 2011 15:12:33 UTC
Sent 21 Apr 2011 22:50:47 UTC
Received 22 Apr 2011 0:06:48 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 98 (0x62)
Computer ID 96625
Report deadline 26 Apr 2011 22:50:47 UTC
Run time 1046.854
CPU time 272.0657
stderr out <core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 580"
# Clock rate: 1.59 GHz
# Total amount of global memory: 1543045120 bytes
# Number of multiprocessors: 16
# Number of cores: 128
# Device 1: "GeForce GTX 580"
# Clock rate: 1.59 GHz
# Total amount of global memory: 1543176192 bytes
# Number of multiprocessors: 16
# Number of cores: 128
MDIO ERROR: warning: redefined atom parameters for ht
MDIO ERROR: warning: redefined atom parameters for ot
MDIO ERROR: warning: redefined atom parameters for cph1
MDIO ERROR: warning: redefined atom parameters for cph2
MDIO ERROR: warning: redefined atom parameters for nr1
MDIO ERROR: warning: redefined atom parameters for nr2
MDIO ERROR: warning: redefined atom parameters for hr3
MDIO ERROR: warning: redefined atom parameters for hr1
MDIO ERROR: warning: redefined bond parameters for ht ht
MDIO ERROR: warning: redefined bond parameters for ht ot
MDIO ERROR: warning: redefined bond parameters for nr1 cph1
MDIO ERROR: warning: redefined bond parameters for nr1 cph2
MDIO ERROR: warning: redefined bond parameters for nr2 cph1
MDIO ERROR: warning: redefined bond parameters for nr2 cph2
MDIO ERROR: warning: redefined bond parameters for cph1 cph1
MDIO ERROR: warning: redefined bond parameters for hr1 cph2
MDIO ERROR: warning: redefined bond parameters for hr3 cph1
MDIO ERROR: warning: redefined angle parameters for cph2 nr1 cph1
MDIO ERROR: warning: redefined angle parameters for cph2 nr2 cph1
MDIO ERROR: warning: redefined angle parameters for nr1 cph1 cph1
MDIO ERROR: warning: redefined angle parameters for nr1 cph2 nr2
MDIO ERROR: warning: redefined angle parameters for nr2 cph1 cph1
MDIO ERROR: warning: redefined angle parameters for nr1 cph2 hr1
MDIO ERROR: warning: redefined angle parameters for nr2 cph2 hr1
MDIO ERROR: warning: redefined angle parameters for hr3 cph1 cph1
MDIO ERROR: warning: redefined angle parameters for nr1 cph1 hr3
MDIO ERROR: warning: redefined angle parameters for nr2 cph1 hr3
MDIO ERROR: warning: redefined angle parameters for ht ot ht
MDIO ERROR: warning: redefined dihedral parameters for d cph2 nr1 cph1 cph1
MDIO ERROR: warning: redefined dihedral parameters for d cph2 nr2 cph1 cph1
MDIO ERROR: warning: redefined dihedral parameters for d nr1 cph1 cph1 hr3
MDIO ERROR: warning: redefined dihedral parameters for d nr1 cph2 nr2 cph1
MDIO ERROR: warning: redefined dihedral parameters for d nr2 cph1 cph1 nr1
MDIO ERROR: warning: redefined dihedral parameters for d nr2 cph2 nr1 cph1
MDIO ERROR: warning: redefined dihedral parameters for d hr1 cph2 nr1 cph1
MDIO ERROR: warning: redefined dihedral parameters for d hr1 cph2 nr2 cph1
MDIO ERROR: warning: redefined dihedral parameters for d hr3 cph1 cph1 hr3
MDIO ERROR: warning: redefined dihedral parameters for d hr3 cph1 nr1 cph2
MDIO ERROR: warning: redefined dihedral parameters for d hr3 cph1 nr2 cph2
MDIO ERROR: warning: redefined dihedral parameters for d nr2 cph1 cph1 hr3
MDIO ERROR: warning: redefined improper parameters for i hr1 nr1 nr2 cph2
MDIO ERROR: warning: redefined improper parameters for i hr1 nr2 nr1 cph2
MDIO ERROR: warning: redefined improper parameters for i hr3 cph1 nr1 cph1
MDIO ERROR: warning: redefined improper parameters for i hr3 cph1 nr2 cph1
MDIO ERROR: warning: redefined improper parameters for i hr3 nr1 cph1 cph1
MDIO ERROR: warning: redefined improper parameters for i hr3 nr2 cph1 cph1
MDIO ERROR: cannot open file "restart.coor"
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 580"
# Clock rate: 1.59 GHz
# Total amount of global memory: 1543045120 bytes
# Number of multiprocessors: 16
# Number of cores: 128
# Device 1: "GeForce GTX 580"
# Clock rate: 1.59 GHz
# Total amount of global memory: 1543176192 bytes
# Number of multiprocessors: 16
# Number of cores: 128
MDIO ERROR: warning: redefined atom parameters for ht
MDIO ERROR: warning: redefined atom parameters for ot
MDIO ERROR: warning: redefined atom parameters for cph1
MDIO ERROR: warning: redefined atom parameters for cph2
MDIO ERROR: warning: redefined atom parameters for nr1
MDIO ERROR: warning: redefined atom parameters for nr2
MDIO ERROR: warning: redefined atom parameters for hr3
MDIO ERROR: warning: redefined atom parameters for hr1
MDIO ERROR: warning: redefined bond parameters for ht ht
MDIO ERROR: warning: redefined bond parameters for ht ot
MDIO ERROR: warning: redefined bond parameters for nr1 cph1
MDIO ERROR: warning: redefined bond parameters for nr1 cph2
MDIO ERROR: warning: redefined bond parameters for nr2 cph1
MDIO ERROR: warning: redefined bond parameters for nr2 cph2
MDIO ERROR: warning: redefined bond parameters for cph1 cph1
MDIO ERROR: warning: redefined bond parameters for hr1 cph2
MDIO ERROR: warning: redefined bond parameters for hr3 cph1
MDIO ERROR: warning: redefined angle parameters for cph2 nr1 cph1
MDIO ERROR: warning: redefined angle parameters for cph2 nr2 cph1
MDIO ERROR: warning: redefined angle parameters for nr1 cph1 cph1
MDIO ERROR: warning: redefined angle parameters for nr1 cph2 nr2
MDIO ERROR: warning: redefined angle parameters for nr2 cph1 cph1
MDIO ERROR: warning: redefined angle parameters for nr1 cph2 hr1
MDIO ERROR: warning: redefined angle parameters for nr2 cph2 hr1
MDIO ERROR: warning: redefined angle parameters for hr3 cph1 cph1
MDIO ERROR: warning: redefined angle parameters for nr1 cph1 hr3
MDIO ERROR: warning: redefined angle parameters for nr2 cph1 hr3
MDIO ERROR: warning: redefined angle parameters for ht ot ht
MDIO ERROR: warning: redefined dihedral parameters for d cph2 nr1 cph1 cph1
MDIO ERROR: warning: redefined dihedral parameters for d cph2 nr2 cph1 cph1
MDIO ERROR: warning: redefined dihedral parameters for d nr1 cph1 cph1 hr3
MDIO ERROR: warning: redefined dihedral parameters for d nr1 cph2 nr2 cph1
MDIO ERROR: warning: redefined dihedral parameters for d nr2 cph1 cph1 nr1
MDIO ERROR: warning: redefined dihedral parameters for d nr2 cph2 nr1 cph1
MDIO ERROR: warning: redefined dihedral parameters for d hr1 cph2 nr1 cph1
MDIO ERROR: warning: redefined dihedral parameters for d hr1 cph2 nr2 cph1
MDIO ERROR: warning: redefined dihedral parameters for d hr3 cph1 cph1 hr3
MDIO ERROR: warning: redefined dihedral parameters for d hr3 cph1 nr1 cph2
MDIO ERROR: warning: redefined dihedral parameters for d hr3 cph1 nr2 cph2
MDIO ERROR: warning: redefined dihedral parameters for d nr2 cph1 cph1 hr3
MDIO ERROR: warning: redefined improper parameters for i hr1 nr1 nr2 cph2
MDIO ERROR: warning: redefined improper parameters for i hr1 nr2 nr1 cph2
MDIO ERROR: warning: redefined improper parameters for i hr3 cph1 nr1 cph1
MDIO ERROR: warning: redefined improper parameters for i hr3 cph1 nr2 cph1
MDIO ERROR: warning: redefined improper parameters for i hr3 nr1 cph1 cph1
MDIO ERROR: warning: redefined improper parameters for i hr3 nr2 cph1 cph1
MDIO ERROR: cannot open file "restart.coor"
ERROR: file deven.cpp line 879: # Energies have become nan
called boinc_finish
</stderr_txt>
]]>
Validate state Invalid
Claimed credit 38584.7222222222
Granted credit 0
application version Long runs (8-12 hours on fastest card) v6.13 (cuda31)
I have since gone back to the 266.58 driver and had no issues.
Cheers
Ross*
____________
|
|
|
skgiven Volunteer moderator Volunteer tester
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
The error is "Energies have become nan".
Not sure I would attribute this to the driver though; this error has been seen many times before under many drivers.
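To compare failures across drivers and cards, it can help to tally which error signatures show up in a task's stderr. What follows is only a minimal sketch: the three patterns are copied from the stderr excerpts quoted in this thread, while the function name `classify_stderr` and the `PATTERNS` table are made up for illustration and are not part of BOINC or the GPUGRID application.

```python
import re

# Signatures taken from the stderr excerpts quoted in this thread.
# The dictionary keys are hypothetical labels chosen for this sketch.
PATTERNS = {
    "energies_nan": re.compile(r"Energies have become nan"),
    "swan_cuda": re.compile(r"SWAN : FATAL : Cuda driver error (\w+)"),
    "missing_restart": re.compile(r'cannot open file "restart\.coor"'),
}


def classify_stderr(text):
    """Count how often each known signature appears in a stderr dump."""
    return {name: len(rx.findall(text)) for name, rx in PATTERNS.items()}


sample = '''MDIO ERROR: cannot open file "restart.coor"
ERROR: file deven.cpp line 879: # Energies have become nan
called boinc_finish'''

print(classify_stderr(sample))
```

Running this over the stderr of several failed tasks would show whether the nan error tends to appear alone or together with the SWAN driver errors.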
|
|
|
MarkJ Volunteer moderator Volunteer tester
Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 Level
Scientific publications
|
I would add that it's typically caused by overclocking. I had similar issues on my reference-design GTX570, which seemed to go away after dropping back to stock.
See the Energies have become nan thread
____________
BOINC blog |
|
|
|
I have work units failing, sporadically, with "energies have become nan". The programmer in me loves the mysterious nature of the message. The cruncher in me wonders, "What am I doing wrong?"
EVGA GTX-690 Signature, not OC'ed and no manual adjustments made to any settings.
1 GPU at about 82C, utilization about 87%
1 GPU at about 60C, utilization about 60%
Intel E8200, not OC'ed
8G ram
Antec Earth Watts 650
Single 7200RPM sata drive
Nothing else in the machine
Boinc 7.0.25
Win7, X64
Since I've never made manual adjustments to a gpu, if changes are necessary, please give me specific recommendations. I will be using EVGA's PrecisionX tool.
Thanks,
Ken
|
|
|
skgiven Volunteer moderator Volunteer tester
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
nan means "not a number".
I think this error occurs when the GPU has an unrecoverable failure, and as described in this thread is often a result of overclocking, temps being too high, or voltage being insufficient. Not all tasks are the same here; some utilize the GPU to a greater extent, demanding more from it. These tasks are more prone to failures. The lack of thread-safe code may also be an issue, but to some extent that would just hide a bad setup.
82°C is pushing it, slightly.
Try the generic recommendations for this situation:
Increase the GPU Fan speed so that it keeps the GPU below 70°C,
Reduce the GDDR5 frequency by 10% or 20% should that fail,
Increase the Voltage (if you can on that card), but only by the least amount (typically ~0.025V)
Also consider better cooling - adding a case fan or two, or just leaving the door off, slightly. Removing back plates might also be useful, as might a better CPU fan; one that radiates or blows less heat onto the GPU. If you don't get anywhere with the above, try downclocking the CPU in W7 Power management (not ideal but substantially reduces heat from the CPU).
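To illustrate why a single bad value is fatal for a task, here is a small sketch of how NaN spreads through a sum. It is not the actual check in deven.cpp, which is not shown in this thread; `total_energy` and `check_energy` are hypothetical names, and the numbers are invented.

```python
import math


def total_energy(terms):
    """Sum per-term energies; a single NaN term makes the whole sum NaN."""
    return sum(terms)


def check_energy(terms):
    """Return the total, or raise if it is no longer a number."""
    energy = total_energy(terms)
    if math.isnan(energy):
        # Hypothetical stand-in for the application's own error path.
        raise FloatingPointError("Energies have become nan")
    return energy


good = [-1.5, 0.25, 3.0]
bad = [-1.5, float("nan"), 3.0]  # one corrupted value, e.g. from a GPU glitch

print(check_energy(good))             # 1.75
print(math.isnan(total_energy(bad)))  # True
```

Because every later arithmetic step that touches a NaN also yields NaN, the simulation cannot recover on its own once one value has gone bad, so stopping at that point is the only sensible reaction.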
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help |
|
|
|
The definition of NaN from Wikipedia explains a lot. |
|
|
|
My candid feeling is that the 6.16 (Cuda 42) is an aggressive routine. My 470's are running at stock voltages and frequencies. Some potential adjustments (especially for the inexperienced of us) could shorten the life of these cards and reduce our contributions to this project. The previous coding ran nearly flawlessly for me. I take these suggestions to heart, but also with a grain of hesitation. I, for one, am not experienced enough to achieve high success on this Grid without compromising future potential. It is, as it always is, to each their own. Caution is always best. I may be wrong, but I am always willing to learn. |
|
|
|
My candid feeling is that the 6.16 (Cuda 42) is an aggressive routine.
I came to the same conclusion when I had to set my GTX 590 to 625MHz for the CUDA 4.2 client. It was running fine at 725MHz with the CUDA 3.1 client. However, the CUDA 4.2 client is 40% faster than the CUDA 3.1 client, so it can do 20% more work at the lower frequency.
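As a rough check of that figure, here is a small sketch of the arithmetic. It assumes throughput scales linearly with the GPU clock and that the 40% speedup applies at equal clocks; both are simplifying assumptions, and `relative_throughput` is a name made up for this example.

```python
def relative_throughput(speedup_at_equal_clock, old_clock_mhz, new_clock_mhz):
    """Throughput of the new setup relative to the old one.

    Assumes work done per second is proportional to the clock frequency.
    """
    return speedup_at_equal_clock * new_clock_mhz / old_clock_mhz


# CUDA 3.1 client at 725 MHz versus CUDA 4.2 client at 625 MHz.
gain = relative_throughput(1.40, 725, 625)
print(f"{gain:.2f}")  # 1.21
```

Under those assumptions the downclocked card does about 1.21 times the work, in line with the "20% more" quoted above.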
My 470's are running at stock voltages and frequencies.
These cards were made for gaming, not for crunching. When a card fails once in a 4 hour gaming session, the player hardly notices the glitch caused by that failure. So these cards may be under-volted by the factory settings, especially the GTX 470 and the GTX 480. When a 4 hour workunit experiences the same failure, it ends in an error message. It is debatable whether the client should try to continue from the last checkpoint in this case instead of aborting the task immediately. But if a card is unreliable, it's safer to discard the entire workunit.
Some potential adjustments (especially for the inexperienced of us) could shorten the life of these cards and reduce our contributions to this project.
Errors reduce the contributions to this project, and they waste a lot of the cruncher's electricity. If you lower your GPU frequency, it won't shorten your card's lifespan, but it will make the card more reliable at the same voltage.
The previous coding ran nearly flawlessly for me.
I have a Gigabyte GTX 480 with 1000mV GPU voltage by default. It was nearly flawless, while my ASUS GTX 480 at 1025mV was really flawless. I raised the Gigabyte's voltage to 1025mV, and it also became really flawless. That was more than 2 years ago, and these cards are still crunching 24/7 at an even higher voltage and frequency (equipped with better-than-factory cooling). |
|
|
|
Thank you Retvari for your reply. I'm making small changes to try and improve my reliability. The card's <less than ideal> gaming configuration is a valuable point and I have learned something. |
|
|
skgiven Volunteer moderator Volunteer tester
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
We could do with some feedback regarding these recent "Energies have become nan" errors:
Name 48_14-NOELIA_hfXA_long_30-0-2-RND7978_2
Workunit 4080964
Created 30 Jan 2013 | 10:26:34 UTC
Sent 30 Jan 2013 | 14:18:32 UTC
Received 30 Jan 2013 | 22:24:51 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 98 (0x62)
Computer ID 139265
Report deadline 4 Feb 2013 | 14:18:32 UTC
Run time 28,221.08
CPU time 28,208.29
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v6.17 (cuda42)
Stderr output
<core_client_version>7.0.44</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574.
Assertion failed: a, file swanlibnv2.cpp, line 59
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574.
Assertion failed: a, file swanlibnv2.cpp, line 59
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
MDIO: cannot open file "restart.coor"
MDIO: cannot open file "restart.coor"
ERROR: file deven.cpp line 1106: # Energies have become nan
called boinc_finish
</stderr_txt>
]]>
Thanks,
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help
|
|
|