Errors piling up, bad batch of NOELIA?

Message boards : Number crunching : Errors piling up, bad batch of NOELIA?

Author	Message
Khali Send message Joined: 13 Jan 14 Posts: 7 Credit: 360,154,533 RAC: 711,379 Level Scientific publications	Message 37808 - Posted: 4 Sep 2014 \| 16:35:27 UTC
	I have had several, nine and counting, Noelia tasks fail after only running for around 2 seconds. <core_client_version>7.2.42</core_client_version> <![CDATA[ <message> (unknown error) - exit code -98 (0xffffff9e) </message> <stderr_txt> # GPU [GeForce GTX 680] Platform [Windows] Rev [3301M] VERSION [60] # SWAN Device 1 : # Name : GeForce GTX 680 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:02:00.0 # Device clock : 1058MHz # Memory clock : 3004MHz # Memory width : 256bit # Driver version : r340_00 : 34052 ERROR: file mdioload.cpp line 162: No CHARMM parameter file specified 12:16:00 (3092): called boinc_finish </stderr_txt> ]]>
	ID: 37808 \| Rating: 0 \| rate: / Reply Quote

Betting Slip Send message Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0 Level Scientific publications	Message 37809 - Posted: 4 Sep 2014 \| 17:07:01 UTC - in response to Message 37808. Last modified: 4 Sep 2014 \| 17:08:52 UTC
	Have you just done an update from Windows Update WDDM? I did and have got multiple errors since then and can't uninstall despite trying to use restore. Update not mentioned in Uninstall Programs To Add Finally got one running.
	ID: 37809 \| Rating: 0 \| rate: / Reply Quote

MrJo Send message Joined: 18 Apr 14 Posts: 43 Credit: 1,192,135,172 RAC: 0 Level Scientific publications	Message 37811 - Posted: 4 Sep 2014 \| 17:18:40 UTC
	The same with me. Suddenly more than 30 errors. No changes in my system than updating the driver to the last version. But I have the same issue on machines where nothing has changed since weeks.. <core_client_version>7.2.42</core_client_version> <![CDATA[ <message> (unknown error) - exit code -98 (0xffffff9e) </message> <stderr_txt> # GPU [GeForce GTX 750 Ti] Platform [Windows] Rev [3301M] VERSION [60] # SWAN Device 0 : # Name : GeForce GTX 750 Ti # ECC : Disabled # Global mem : 2048MB # Capability : 5.0 # PCI ID : 0000:01:00.0 # Device clock : 1241MHz # Memory clock : 2700MHz # Memory width : 128bit # Driver version : r340_00 : 34052 ERROR: file mdioload.cpp line 162: No CHARMM parameter file specified 18:44:55 (2516): called boinc_finish </stderr_txt> ]]> ____________ Regards, Josef
	ID: 37811 \| Rating: 0 \| rate: / Reply Quote

Khali Send message Joined: 13 Jan 14 Posts: 7 Credit: 360,154,533 RAC: 711,379 Level Scientific publications	Message 37813 - Posted: 4 Sep 2014 \| 17:25:46 UTC - in response to Message 37809.
	Have you just done an update from Windows Update WDDM? I did and have got multiple errors since then and can't uninstall despite trying to use restore. Update not mentioned in Uninstall Programs To Add Finally got one running. No updates of any kind on my system. Looking at my GPU Grid account it shows I have one task in progress when I have nothing in my queue for GPU Grid. I have done a project rest and rebooted the system. Still getting the same problem. Instead of wasting time and bandwidth downloading tasks that are going to fail I have set the project to no new tasks until some one comes up with an answer.
	ID: 37813 \| Rating: 0 \| rate: / Reply Quote

MrJo Send message Joined: 18 Apr 14 Posts: 43 Credit: 1,192,135,172 RAC: 0 Level Scientific publications	Message 37815 - Posted: 4 Sep 2014 \| 17:35:00 UTC
	Current problem solving: Switched my preferences to ACEMD short runs (2-3 hours on fastest card). And all works fine again. Seems that the problem is caused by the long ones. ____________ Regards, Josef
	ID: 37815 \| Rating: 0 \| rate: / Reply Quote

neilp62 Send message Joined: 23 Nov 10 Posts: 14 Credit: 7,907,445,653 RAC: 2,537,628 Level Scientific publications	Message 37817 - Posted: 4 Sep 2014 \| 17:47:40 UTC - in response to Message 37813.
	Same issue here. 16 errors and increasing. No changes to platform; platform is >50% complete processing three NOELIA_tpam2 tasks without issue. NOELIA_TRPs fail within 3 seconds with following error: ERROR: file mdioload.cpp line 162: No CHARMM parameter file specified 10:34:25 (8728): called boinc_finish Switching to another project just to conserve internet bandwidth.
	ID: 37817 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,886,844,890 RAC: 19,857,568 Level Scientific publications	Message 37819 - Posted: 4 Sep 2014 \| 17:58:36 UTC Last modified: 4 Sep 2014 \| 18:02:46 UTC
	Just had two bad ones in quick succession: 260-NOELIA_TRP188-0-1-RND4118 364-NOELIA_TRP188-0-1-RND5307 Erroring on every computer they're sent to, so I'm pretty sure it's the task make-up, nothing local to our machines. Make that three: 337-NOELIA_TRP188-0-1-RND9634
	ID: 37819 \| Rating: 0 \| rate: / Reply Quote

klepel Send message Joined: 23 Dec 09 Posts: 189 Credit: 4,721,188,000 RAC: 1,849,266 Level Scientific publications	Message 37820 - Posted: 4 Sep 2014 \| 18:03:52 UTC Last modified: 4 Sep 2014 \| 18:04:07 UTC
	Me too!
	ID: 37820 \| Rating: 0 \| rate: / Reply Quote

TJ Send message Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level Scientific publications	Message 37825 - Posted: 4 Sep 2014 \| 20:01:14 UTC
	Indeed the third type of Noelia's WU's are errorsome. I have now 47 of them and wing(wo)man too. This seems be the error: file mdioload.cpp line 162: No CHARMM parameter file specified. It are xxx-NOELIA_TRPxxx WU's the other two types run like a train. We have to wait for Noelia to repair this. ____________ Greetings from TJ
	ID: 37825 \| Rating: 0 \| rate: / Reply Quote

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,239,065,968 RAC: 3,161,193 Level Scientific publications	Message 37826 - Posted: 4 Sep 2014 \| 20:19:45 UTC Last modified: 4 Sep 2014 \| 20:19:57 UTC
	I'm having the same issue with NOELIA_TRP188 workunits. All of these run to the following error after 2 seconds: ERROR: file mdioload.cpp line 162: No CHARMM parameter file specified
	ID: 37826 \| Rating: 0 \| rate: / Reply Quote

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 485 Credit: 11,126,027,899 RAC: 15,465,518 Level Scientific publications	Message 37828 - Posted: 4 Sep 2014 \| 21:33:51 UTC - in response to Message 37820.
	Please cancel these units already, they are all erroring out. ERROR: file mdioload.cpp line 162: No CHARMM parameter file specified 14:24:19 (2920): called boinc_finish
	ID: 37828 \| Rating: 0 \| rate: / Reply Quote

labrat42 Send message Joined: 13 May 10 Posts: 7 Credit: 452,806,864 RAC: 0 Level Scientific publications	Message 37830 - Posted: 5 Sep 2014 \| 3:35:25 UTC
	Have these errors been resolved? ____________ Bill42
	ID: 37830 \| Rating: 0 \| rate: / Reply Quote

neilp62 Send message Joined: 23 Nov 10 Posts: 14 Credit: 7,907,445,653 RAC: 2,537,628 Level Scientific publications	Message 37831 - Posted: 5 Sep 2014 \| 3:46:04 UTC - in response to Message 37830.
	Unclear if resolved. Within the last 30 minutes, I am no longer receiving NOELIA_TRP long run tasks. All three of my platforms have received NOELIA_tpam tasks, so I have switched back to the long runs for now.
	ID: 37831 \| Rating: 0 \| rate: / Reply Quote

MrJo Send message Joined: 18 Apr 14 Posts: 43 Credit: 1,192,135,172 RAC: 0 Level Scientific publications	Message 37832 - Posted: 5 Sep 2014 \| 6:55:07 UTC - in response to Message 37831.
	...I am no longer receiving NOELIA_TRP long run tasks. All three of my platforms have received NOELIA_tpam tasks, so I have switched back to the long runs for now. Me the same. Right now all seems to be fine. ____________ Regards, Josef
	ID: 37832 \| Rating: 0 \| rate: / Reply Quote

biodoc Send message Joined: 26 Aug 08 Posts: 183 Credit: 10,085,929,375 RAC: 1,709,448 Level Scientific publications	Message 37834 - Posted: 5 Sep 2014 \| 10:39:19 UTC Last modified: 5 Sep 2014 \| 10:49:13 UTC
	Feel free to send the NOELIA_TRP WUs my way. They run fine on my 780Ti under linux. http://www.gpugrid.net/workunit.php?wuid=10045657 I've finished 2 of these WUs successfully. PNY 780Ti reference card; no overclock Nvidia driver version: 337.25
	ID: 37834 \| Rating: 0 \| rate: / Reply Quote

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 37837 - Posted: 5 Sep 2014 \| 12:34:46 UTC - in response to Message 37826. Last modified: 5 Sep 2014 \| 12:37:03 UTC
	GPUGrid Admins: Retvari said it best: I'm having the same issue with NOELIA_TRP188 workunits. All of these run to the following error after 2 seconds: ERROR: file mdioload.cpp line 162: No CHARMM parameter file specified That is exactly what's happening for me. The NOELIA_TRP188 work units error out just like that, for Windows machines. http://www.gpugrid.net/workunit.php?wuid=10045781 http://www.gpugrid.net/workunit.php?wuid=10045192 http://www.gpugrid.net/workunit.php?wuid=10045206 Then, BOINC immediately reports the error (per your setting), gets a network backoff for GPUGrid, then starts asking my backup projects for work. So, I get to work on backup projects until you can fix this. Can you kindly please determine the problem, and cancel the work units that have problems, and then relay to us what happened? Thanks, Jacob
	ID: 37837 \| Rating: 0 \| rate: / Reply Quote

RaymondFO* Send message Joined: 22 Nov 12 Posts: 72 Credit: 14,040,706,346 RAC: 0 Level Scientific publications	Message 37838 - Posted: 5 Sep 2014 \| 12:49:41 UTC - in response to Message 37832. Last modified: 5 Sep 2014 \| 13:03:57 UTC
	...I am no longer receiving NOELIA_TRP long run tasks. All three of my platforms have received NOELIA_tpam tasks, so I have switched back to the long runs for now. Me the same. Right now all seems to be fine. I am still getting them, just not as frequently. Feel free to send the NOELIA_TRP WUs my way. They run fine on my 780Ti under linux. http://www.gpugrid.net/workunit.php?wuid=10045657 I've finished 2 of these WUs successfully. PNY 780Ti reference card; no overclock Nvidia driver version: 337.25 While you may think this to be true, however my linux computers have had their share of these errors, just not as frequently. FWIW, going back to Einstein@home until this is resolved. Enough is enough.
	ID: 37838 \| Rating: 0 \| rate: / Reply Quote

biodoc Send message Joined: 26 Aug 08 Posts: 183 Credit: 10,085,929,375 RAC: 1,709,448 Level Scientific publications	Message 37839 - Posted: 5 Sep 2014 \| 15:06:15 UTC - in response to Message 37838.
	While you may think this to be true, however my linux computers have had their share of these errors, just not as frequently. Good point. I've only had 2 WUs so that's certainly not enough for me to conclude that these WUs are ok on my machine.
	ID: 37839 \| Rating: 0 \| rate: / Reply Quote

MrJo Send message Joined: 18 Apr 14 Posts: 43 Credit: 1,192,135,172 RAC: 0 Level Scientific publications	Message 37847 - Posted: 5 Sep 2014 \| 18:42:45 UTC - in response to Message 37838.
	I am still getting them, just not as frequently. You're right. I just checked one of my PC's. Two attempts failed before the third has worked. 18:03 fail 18:16 fail 18:20 Ok ____________ Regards, Josef
	ID: 37847 \| Rating: 0 \| rate: / Reply Quote

klepel Send message Joined: 23 Dec 09 Posts: 189 Credit: 4,721,188,000 RAC: 1,849,266 Level Scientific publications	Message 37853 - Posted: 6 Sep 2014 \| 17:35:15 UTC
	Any up-date? Can I change back to long runs? My computers run without supervision over the week-end, so I do not like to pile up errors.
	ID: 37853 \| Rating: 0 \| rate: / Reply Quote

RaymondFO* Send message Joined: 22 Nov 12 Posts: 72 Credit: 14,040,706,346 RAC: 0 Level Scientific publications	Message 37857 - Posted: 7 Sep 2014 \| 11:04:36 UTC - in response to Message 37853.
	Any up-date? Can I change back to long runs? My computers run without supervision over the week-end, so I do not like to pile up errors. While you may still get a bad task here or there, I would venture to say the number of current bad tasks has sharply dwindled.
	ID: 37857 \| Rating: 0 \| rate: / Reply Quote

Bjarke Send message Joined: 1 Mar 09 Posts: 8 Credit: 74,871,366 RAC: 0 Level Scientific publications	Message 37862 - Posted: 9 Sep 2014 \| 5:39:43 UTC
	I have had 5 NOELIA fails since 3rd of September. The last unit was sent to my workstation 7 Sep 2014 12:59:07 UTC. tpamx3835-NOELIA_tpam2-1-2-RND0880_1 668-NOELIA_TRP188-0-1-RND1496_3 tpamx2552-NOELIA_tpam2-1-2-RND0011_0 tpamx4367-NOELIA_tpam2-0-2-RND6555_0 tpamx3206-NOELIA_tpam2-0-2-RND6114_0 Most of them run a long time before the error shows.
	ID: 37862 \| Rating: 0 \| rate: / Reply Quote

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,239,065,968 RAC: 3,161,193 Level Scientific publications	Message 37863 - Posted: 9 Sep 2014 \| 7:31:01 UTC - in response to Message 37862.
	Those which run for a long time before the error shows (the NOELIA_tpam2 workunits) are failing because they got a lot of "The simulation has become unstable. Terminating to avoid lock-up" messages before they actually fail. This kind of error is usually caused by: - too high GPU frequency - too high GDDR5 frequency - too high GPU temperature - too low GPU voltage The bad batch this thread is about consists of NOELIA_TRP188 workunits, which usually fail right after the start.
	ID: 37863 \| Rating: 0 \| rate: / Reply Quote

Bjarke Send message Joined: 1 Mar 09 Posts: 8 Credit: 74,871,366 RAC: 0 Level Scientific publications	Message 37874 - Posted: 11 Sep 2014 \| 7:23:43 UTC - in response to Message 37863.
	Those which run for a long time before the error shows (the NOELIA_tpam2 workunits) are failing because they got a lot of "The simulation has become unstable. Terminating to avoid lock-up" messages before they actually fail. This kind of error is usually caused by: - too high GPU frequency - too high GDDR5 frequency - too high GPU temperature - too low GPU voltage The bad batch this thread is about consists of NOELIA_TRP188 workunits, which usually fail right after the start. I disagree either of the 4 points being the issue. My Nvidia Quadro K4000 GPU is completely stock with absolutely no modifications or overclocking applied. So the frequencies are right. Also the case of my hostDell Precision T7610 is completely unmodified and the case fans are regulated automatically as always. The GPU run at 75 degree C which is well on the safe side. Further, I haven't performed a driver update for months. May I add that I haven't noticed any failed WU's on my system until now. Within 5 days 5 NOELIA WU's failed.
	ID: 37874 \| Rating: 0 \| rate: / Reply Quote

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,239,065,968 RAC: 3,161,193 Level Scientific publications	Message 37875 - Posted: 11 Sep 2014 \| 9:38:25 UTC - in response to Message 37874. Last modified: 11 Sep 2014 \| 9:39:43 UTC
	Those which run for a long time before the error shows (the NOELIA_tpam2 workunits) are failing because they got a lot of "The simulation has become unstable. Terminating to avoid lock-up" messages before they actually fail. This kind of error is usually caused by: - too high GPU frequency - too high GDDR5 frequency - too high GPU temperature - too low GPU voltage The bad batch this thread is about consists of NOELIA_TRP188 workunits, which usually fail right after the start. I disagree either of the 4 points being the issue. Of course there are more possibilities, but these 4 points are the most frequent ones, also these could be checked easily by tuning the card with software tools (like MSI Afterburner). Furthermore these errors could be caused by a faulty (or inadequate) power supply, and the aging of the components (especially the GPU). These are much harder to fix, but you can still have a stable system with these components if you reduce the GPU/GDDR5 frequency. It's better to have a 10% slower system than a system producing (more and more frequent) random errors. My Nvidia Quadro K4000 GPU is completely stock with absolutely no modifications or overclocking applied. So the frequencies are right. The statement in the second sentence is not in consequence of the first sentence. The frequencies (for the given system) are right when there are no errors. The GPUGrid client pushes the card very hard, like the infamous FurMark GPU test, so we had a lot of surprises over the years (regarding stock frequencies). Also the case of my host Dell Precision T7610 is completely unmodified and the case fans are regulated automatically as always. The GPU run at 75 degree C which is well on the safe side. Further, I haven't performed a driver update for months. It is really strange, that a card could have errors even below 80°C. I have two GTX 780Ti's in the same system, one of them is an NVidia standard design, the other is an OC model (BTW both of them Gigabyte). I had errors with the OC model right from the start while its temperature was under 70°C (only with GPUGrid, no other testing tools showed any errors), but reducing its GDDR5 frequency from 3500MHz to 2700MHz (!) solved my problem. After a BIOS update this card is running error free at 2900MHz, but it's still way below the factory setting. May I add that I haven't noticed any failed WU's on my system until now. Within 5 days 5 NOELIA WU's failed. If you check the logs of your successful tasks, those also have this "The simulation has become unstable. Terminating to avoid lock-up" messages, so you were lucky that those workunits were successful. If you check my (similar NOELIA) workunits, none of them has these messages. So, give it a try to reduce the GPU frequency (its harder to reduce the GDDR5 frequency, as you have to flash the GPU's BIOS).
	ID: 37875 \| Rating: 0 \| rate: / Reply Quote

Stefan Project administrator Project developer Project tester Project scientist Send message Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level Scientific publications	Message 37881 - Posted: 12 Sep 2014 \| 8:32:58 UTC - in response to Message 37875.
	The direct crashes should be fixed now.
	ID: 37881 \| Rating: 0 \| rate: / Reply Quote

TJ Send message Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level Scientific publications	Message 37888 - Posted: 13 Sep 2014 \| 0:26:43 UTC
	The Noelia's doing okay on my systems after last week hick up but now the latest Santi's, (final) have this error: ERROR: file mdioload.cpp line 81: Unable to read bincoordfile. I see errors with wing(wo)man too. In run with Win7. ____________ Greetings from TJ
	ID: 37888 \| Rating: 0 \| rate: / Reply Quote

Post to thread

Message boards : Number crunching : Errors piling up, bad batch of NOELIA?

	About	Science	Volunteers	Performance	Forum	Join us	Donate