Message boards : Number crunching : All Tasks Failed

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level
Gln
Message 24577 - Posted: 24 Apr 2012 | 14:03:10 UTC

On one of my computers, every task started to fail. I just restarted the system. Is there any way to get a new task now? Any ideas on what happened? This system has been running fine for weeks.

http://www.gpugrid.net/show_host_detail.php?hostid=117970

thank you

nenym
Joined: 31 Mar 09
Posts: 137
Credit: 1,308,230,581
RAC: 0
Level
Met
Message 24578 - Posted: 24 Apr 2012 | 16:39:49 UTC - in response to Message 24577.

is there any way to get a new task now?
I know a somewhat strange way, which affects the statistics of your host:
- detach from GPUGRID
- change the hostname
- reboot
- attach to GPUGRID
On the other attached projects only your hostname changes (as far as I remember).
You may run into problems if the host is connected to a LAN.

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level
Gln
Message 24581 - Posted: 25 Apr 2012 | 0:54:51 UTC - in response to Message 24578.

Thanks for the hack. I want to keep my stats so I will just let the machine idle for a day or so.

Did anyone else have this issue?

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level
Gln
Message 24583 - Posted: 25 Apr 2012 | 13:55:08 UTC - in response to Message 24581.

It looks like tasks continue to fail. Does anyone have any ideas of what might be wrong with this host?

Thx

Stoneageman
Joined: 25 May 09
Posts: 224
Credit: 34,057,374,498
RAC: 227
Level
Trp
Message 24585 - Posted: 25 Apr 2012 | 16:32:09 UTC

The original clock rate was 1.88 GHz. Now it's 1.46 GHz and still failing. Is this the same card? Try underclocking the memory by 20%. Does it run other projects OK?

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level
Gln
Message 24592 - Posted: 26 Apr 2012 | 12:17:04 UTC - in response to Message 24585.

It is the same card in the same computer. I lowered the clock rate to see if that would correct the condition.

That computer is down now. It should be running again this weekend. We will see if power was an issue.

thx

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level
Gln
Message 24595 - Posted: 26 Apr 2012 | 13:55:43 UTC - in response to Message 24592.

Same problem on a different host. Could the 275.33 drivers be the issue? I have a different host with 285 drivers and it appears to be working fine.

any help is appreciated

http://www.gpugrid.net/show_host_detail.php?hostid=119703

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level
Gln
Message 24606 - Posted: 28 Apr 2012 | 3:20:40 UTC - in response to Message 24595.

The hosts appear to be working correctly again. Were the work units bad?

RichF
Joined: 6 Jan 09
Posts: 7
Credit: 5,741,255
RAC: 0
Level
Ser
Message 24743 - Posted: 5 May 2012 | 15:04:40 UTC - in response to Message 24606.

All my WUs have been failing for the past couple of days, too. Is this a widespread problem, and how can we fix it? Thanks.

Old man
Joined: 24 Jan 09
Posts: 42
Credit: 16,676,387
RAC: 0
Level
Pro
Message 24746 - Posted: 5 May 2012 | 16:05:16 UTC
Last modified: 5 May 2012 | 16:05:52 UTC

Tasks have failed here as well.

Name: 9px10-MJHARVEY_MJHXA1-8-30-RND0616_5
Workunit: 3395291
Created: 5 May 2012 | 11:53:47 UTC
Sent: 5 May 2012 | 15:23:02 UTC
Received: 5 May 2012 | 15:26:05 UTC
Server state: Over
Outcome: Computation error
Client state: Computation error
Exit status: 98 (0x62)
Computer ID: 123486
Report deadline: 10 May 2012 | 15:23:02 UTC
Run time: 2.70
CPU time: 0.80
Validation state: Not validated
Credit: 0.00
Application version: ACEMD2: GPU molecular dynamics v6.16 (cuda31)
Stderr output

<core_client_version>6.12.34</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 470"
# Clock rate: 1.21 GHz
# Total amount of global memory: 1275658240 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Device 1: "GeForce GTX 260"
# Clock rate: 1.30 GHz
# Total amount of global memory: 891748352 bytes
# Number of multiprocessors: 27
# Number of cores: 216
MDIO: read error for file "input.coor", byte number 4: number of atoms (-45219840) != (47792) expected
ERROR: Unable to read bincoordfile

called boinc_finish

</stderr_txt>
]]>

name: 9px10-MJHARVEY_MJHXA1-8-30-RND0616
application: ACEMD2: GPU molecular dynamics
created: 4 May 2012 | 14:27:28 UTC
minimum quorum: 1
initial replication: 1
max # of error/total/success tasks: 7, 10, 6

Task; Computer; Sent; Time reported or deadline; Status; Run time (sec); CPU time (sec); Credit; Application
5326942; 124335; 4 May 2012 | 17:49:33 UTC; 4 May 2012 | 17:54:12 UTC; Error while downloading; 0.00; 0.00; ---; ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5327658; 112695; 4 May 2012 | 20:20:30 UTC; 4 May 2012 | 21:24:15 UTC; Error while computing; 2.07; 0.41; ---; ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5328368; 124628; 5 May 2012 | 2:08:11 UTC; 5 May 2012 | 2:14:55 UTC; Error while computing; 7.75; 0.00; ---; ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5329342; 105945; 5 May 2012 | 5:41:17 UTC; 5 May 2012 | 5:48:23 UTC; Error while computing; 3.67; 0.81; ---; ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5329857; 102639; 5 May 2012 | 11:26:08 UTC; 5 May 2012 | 11:53:44 UTC; Error while computing; 2.15; 0.53; ---; ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5330904; 123486; 5 May 2012 | 15:23:02 UTC; 5 May 2012 | 15:26:05 UTC; Error while computing; 2.70; 0.80; ---; ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5331534; ---; ---; ---; Unsent; ---; ---; ---

As you can see, all the other hosts have also failed to run this task :-(
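
The `MDIO: read error` line in the stderr output above says that the atom count read from the start of "input.coor" did not match the expected 47792. A minimal sketch of that kind of header check follows. The file layout used here (a 32-bit signed atom count followed by three 64-bit floats per atom, little-endian) is an assumption for illustration, not something confirmed in this thread, and the function names are hypothetical.

```python
import struct

HEADER_BYTES = 4          # assumed: one 32-bit signed atom count
BYTES_PER_ATOM = 3 * 8    # assumed: x, y, z stored as 64-bit floats


def atom_count_from_header(data, byteorder="<"):
    """Decode the atom count from the first four bytes of the file contents."""
    if len(data) < HEADER_BYTES:
        raise ValueError("data too short to hold an atom-count header")
    (count,) = struct.unpack(byteorder + "i", data[:HEADER_BYTES])
    return count


def check_coor_bytes(data, expected_atoms):
    """Return None if the contents look consistent, else a short message."""
    count = atom_count_from_header(data)
    if count != expected_atoms:
        return f"number of atoms ({count}) != ({expected_atoms}) expected"
    expected_size = HEADER_BYTES + BYTES_PER_ATOM * expected_atoms
    if len(data) != expected_size:
        return f"file size {len(data)} != {expected_size} expected"
    return None


# A header holding the bogus value reported in the log above.
bad_header = struct.pack("<i", -45219840)
print(check_coor_bytes(bad_header, 47792))
```

A nonsensical negative count like this one points at a damaged or wrongly built input file rather than at the GPU, which fits the fact that the same workunit failed on every host that received it.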

RichF
Joined: 6 Jan 09
Posts: 7
Credit: 5,741,255
RAC: 0
Level
Ser
Message 24747 - Posted: 5 May 2012 | 17:19:49 UTC - in response to Message 24743.

Here is the error message I've been getting. Any help would be appreciated.


Stderr output
<core_client_version>6.12.34</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# Using device 1
SWAN: FATAL : Unable to enumerate devices
Assertion failed: 0, file swanlib_nv.c, line 390

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

</stderr_txt>
]]>

Mika_at_home
Joined: 16 Apr 12
Posts: 2
Credit: 297,794
RAC: 0
Level

Message 24750 - Posted: 5 May 2012 | 22:12:28 UTC

I have also had failed workunits this week: three of the last five have failed. The first failed on Wednesday, the next on Friday, and the latest tonight. For all of them, the BOINC log contains messages like these:

5.5.2012 23:56:56 GPUGRID Computation for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 finished
5.5.2012 23:56:56 GPUGRID Output file 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1_1 for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 absent
5.5.2012 23:56:56 GPUGRID Output file 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1_2 for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 absent
5.5.2012 23:56:56 GPUGRID Output file 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1_3 for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 absent

The following ACEMD2 workunit failed on Friday:

1x21-MJHARVEY_MJHXA1-8-30-RND8065_0

Stderr output

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
- exit code -99 (0xffffff9d)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 560 Ti"
# Clock rate: 1.46 GHz
# Total amount of global memory: 1341849600 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO: cannot open file "restart.coor"
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 560 Ti"
# Clock rate: 1.46 GHz
# Total amount of global memory: 1341849600 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Using device 0

I also run Einstein@home, with about 2 million credit points, and their WUs have never failed. My graphics card is a Gigabyte GTX 560Ti 448, which runs at the reference clock speed of 723 MHz, and temps are between 70 and 75 C. I have lowered the fan speed with MSI Afterburner. It has been running GPUGRID workunits for about a week now. So should I suspect that my computer is causing these failures?

Thank you
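
The exit codes in these stderr dumps are printed twice, as a signed decimal and as a 32-bit hex value, for example `-99 (0xffffff9d)` above and `98 (0x62)` in the earlier posts. A small sketch of the conversion, assuming 32-bit two's complement (the helper names are made up for illustration):

```python
def exit_code_as_hex(code, bits=32):
    """Format a signed exit code the way the log shows it in parentheses."""
    return hex(code & ((1 << bits) - 1))


def hex_as_signed(text, bits=32):
    """Turn the hex form back into the signed decimal value."""
    value = int(text, 16)
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


print(exit_code_as_hex(-99))        # 0xffffff9d
print(exit_code_as_hex(98))         # 0x62
print(hex_as_signed("0xffffff9d"))  # -99
```

This only decodes the number; what a given exit code means is up to the application.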

NeoMetal*
Joined: 30 Mar 11
Posts: 1
Credit: 1,005,009
RAC: 0
Level
Ala
Message 24769 - Posted: 7 May 2012 | 0:42:13 UTC

I got a failed WU because of: MDIO: cannot open file "output.restart.coor"
First time I've ever seen that. The WU completed fine but errored when it tried to upload. No antivirus or backup running. Just a basic Win 7 install for crunching. This sucks: 21 hours wasted on a most likely valid WU because of a locked or disappearing file.

I see Mika_at_home has a similar error in his post above: MDIO: cannot open file "restart.coor". Is this happening to anyone else? Seems like a rash of errors recently. Could this be something needing fixing?

Stderr output

<core_client_version>7.0.25</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 560 Ti"
# Clock rate: 1.90 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 8
# Number of cores: 64
MDIO: cannot open file "output.restart.coor"
ERROR: get_Dvec() element 0 (b)
called boinc_finish

</stderr_txt>
]]>



NM*

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,206,655,749
RAC: 261,147
Level
Trp
Message 24770 - Posted: 7 May 2012 | 1:04:59 UTC

The MDIO: cannot open file "output.restart.coor" message is not a real error; it appears in every task, even in the successful ones.
Your real error message is ERROR: get_Dvec() element 0 (b), and I think that such an error cannot be caused by the upload, nor by "a locked or disappearing file". This error happened while the WU was being processed, probably near its completion, which is why it seems to have been caused by the upload.
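
The distinction made above can be sketched as a small stderr filter. It skips the `output.restart.coor` message, which is described in this thread as harmless, and keeps lines that start with `ERROR:`. The list of benign messages and the function name are assumptions for illustration, and other MDIO messages are deliberately not treated as benign.

```python
# Messages described in this thread as appearing even in successful tasks.
BENIGN_LINES = (
    'MDIO: cannot open file "output.restart.coor"',
)


def real_errors(stderr_text):
    """Return the lines of a stderr dump that look like genuine errors."""
    errors = []
    for raw in stderr_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and device-info comments
        if line in BENIGN_LINES:
            continue  # known noise, not the cause of the failure
        if line.startswith("ERROR:"):
            errors.append(line)
    return errors


sample = """# Using device 0
# There is 1 device supporting CUDA
MDIO: cannot open file "output.restart.coor"
ERROR: get_Dvec() element 0 (b)
called boinc_finish
"""
print(real_errors(sample))
```

On the sample, which is taken from the stderr output quoted earlier in the thread, only the `get_Dvec()` line is reported.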

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level
Gln
Message 24771 - Posted: 7 May 2012 | 2:49:32 UTC - in response to Message 24770.

Several of my work units failed very near the end of the calculation process. Any ideas on why? The clock rate has been reduced to see if that will correct the issue.

thank you

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 24781 - Posted: 7 May 2012 | 14:13:47 UTC - in response to Message 24750.

Mika_at_home, it seems that your tasks are getting suspended and resumed many times. I think there is more chance of failures when running this way. I suggest you configure BOINC Manager to allow GPU tasks to run while the system is in use.

All of the MJHARVEY tasks that failed on your system failed on at least one more system, and some repeatedly failed on many systems, suggesting an issue with the tasks themselves; the workunits show "errors: Too many errors (may have bug)".
Sometimes these issues are very difficult to track down, as they only rarely appear on some combinations of operating system/driver/GPU, but in the above 'Too many errors' case the problem seems independent of GPU, driver and operating system, and my guess is that it was a badly built task:
MDIO: read error for file "input.coor", byte number 4: number of atoms (-45219840) != (47525) expected
ERROR: Unable to read bincoordfile


I would be more concerned about the tasks that fail after 10K seconds than about those that fail after 2 seconds.
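
The reasoning above (a workunit that fails on many unrelated hosts is probably a bad task, while one that fails on a single host and succeeds elsewhere points at that host) can be written down as a rough heuristic. The threshold of three hosts and the function name are assumptions for illustration; the host IDs are the ones from the workunit table posted earlier in the thread.

```python
def classify_workunit(results, min_failed_hosts=3):
    """Guess where the problem lies from (host_id, did_fail) pairs."""
    failed = {host for host, did_fail in results if did_fail}
    succeeded = {host for host, did_fail in results if not did_fail}
    if failed and succeeded:
        return "suspect host"   # same task works elsewhere
    if len(failed) >= min_failed_hosts:
        return "suspect task"   # fails everywhere it was tried
    return "undetermined"       # not enough evidence either way


# Hosts that reported an error for 9px10-MJHARVEY_MJHXA1-8-30-RND0616.
results = [
    (124335, True),
    (112695, True),
    (124628, True),
    (105945, True),
    (102639, True),
    (123486, True),
]
print(classify_workunit(results))
```

This is only a first filter. It says nothing about tasks that run for hours before failing, which are the ones worth a closer look at clocks, temperatures and CPU load.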

Paul Raney, as different task types are failing on your system it's more likely that the issue is a setup one (GPU clock, overuse of CPU, interference from another program...). 'Energies have become nan' is often symptomatic of a GPU issue with the clock, voltage or temps (but may also be linked to overuse of the CPU).
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

wiyosaya
Joined: 22 Nov 09
Posts: 114
Credit: 589,114,683
RAC: 0
Level
Lys
Message 24784 - Posted: 7 May 2012 | 16:51:42 UTC

Running on Windows, one thing that I have done is to turn off the BOINC screen saver. After doing so, I have rarely had any GPUGrid WUs report computation error.

On most PCs these days with LCD monitors, screen savers are only eye candy, as LCD monitors do not suffer from burn-in as tube-based monitors did.

To elaborate a bit further, I set my screen saver to NONE on the two machines where I currently run GPUGrid. I am bringing a third machine on line in the next week or so, and I will also turn off the screen saver on that one, too.

Mika_at_home
Joined: 16 Apr 12
Posts: 2
Credit: 297,794
RAC: 0
Level

Message 24798 - Posted: 8 May 2012 | 11:50:51 UTC - in response to Message 24781.

skgiven, thanks for your analysis and advice. I have now completed one ACEMD2 workunit with GPU tasks set to run always. It didn't cause any problems, at least with web browsing and e-mail use. I also changed my screensaver to a simpler standard Windows screensaver. Now I will get my Einstein GPU WUs completed quicker, too. :)

-Mika

lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 1,308,281,672
RAC: 8,208,683
Level
Met
Message 24859 - Posted: 10 May 2012 | 4:57:56 UTC

All my GPUGRID WUs are failing. I even replaced my 9800 GTX with a GTX 680 and the WUs fail in less than 5 seconds.

I suspect the Nvidia driver. 301.10 is the only driver for the GTX 680.

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level
Phe
Message 24860 - Posted: 10 May 2012 | 5:25:24 UTC

gtx 680 for Windows has not been released yet. They are currently working on it. The Linux version was just released for beta; when the Windows version is released, it will be in beta as well.

lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 1,308,281,672
RAC: 8,208,683
Level
Met
Message 24886 - Posted: 10 May 2012 | 17:57:40 UTC

I just updated to the 301.24 beta driver and the WU failed again within 3 seconds.

Win7 x64
GTX 680
24GB RAM
i7 960 CPU

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level
Phe
Message 24887 - Posted: 10 May 2012 | 18:04:04 UTC

Read my above post.

Patience.

lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 1,308,281,672
RAC: 8,208,683
Level
Met
Message 25074 - Posted: 14 May 2012 | 17:01:23 UTC - in response to Message 24887.

I don't fully understand your post:

"gtx 680 for Windows has not been released yet"

Yes it has. I have one. Installed. Running.

Or are you referring to the GPUGRID software compatible with the GTX 680 CUDA code?

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level
Phe
Message 25077 - Posted: 14 May 2012 | 18:30:08 UTC

Sorry for the misunderstanding.

Currently, the CUDA app that is used for both the short and long tasks does not work on the 680 under either Windows or Linux. There is a beta app in progress that supports the Kepler series (CUDA 4.2) and also gives shorter runtimes on other series. However, this beta app is Linux-only at the moment.

Keep an eye on the Graphics Cards section for when the Windows version is released.

FYI: there are some projects that can run on Kepler, but they are not optimized for it (they don't work as well as they could). Currently, my 680 is on Einstein@Home until GPUGRID releases the beta app for Windows.

Hope this cleared things up.

lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 1,308,281,672
RAC: 8,208,683
Level
Met
Message 25092 - Posted: 15 May 2012 | 6:30:18 UTC - in response to Message 25077.

It did. Thanks!

lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 1,308,281,672
RAC: 8,208,683
Level
Met
Message 25169 - Posted: 20 May 2012 | 0:17:47 UTC

I've been getting the small beta (cuda42) WUs today that take < 2 min, and they all seem to have been processed and uploaded properly.

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level
Gln
Message 25451 - Posted: 2 Jun 2012 | 16:09:40 UTC - in response to Message 25169.

Guys:

Any ideas on why this one failed? http://www.gpugrid.net/workunit.php?wuid=3472235

I just moved all my cards around, so it might just be fallout from the changes.


____________
Thx - Paul

Note: Please don't use driver version 295 or 296! Recommended versions are 266 - 285.

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level
Gln
Message 25484 - Posted: 4 Jun 2012 | 12:03:01 UTC - in response to Message 25451.

Here is another weird failure

http://www.gpugrid.net/result.php?resultid=5459110

____________
Thx - Paul

Note: Please don't use driver version 295 or 296! Recommended versions are 266 - 285.

nenym
Joined: 31 Mar 09
Posts: 137
Credit: 1,308,230,581
RAC: 0
Level
Met
Message 25487 - Posted: 4 Jun 2012 | 12:49:25 UTC - in response to Message 25484.

# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 570"
# Clock rate: 1.88 GHz
# Total amount of global memory: 1275658240 bytes
# Number of multiprocessors: 15
# Number of cores: 120
# Device 1: "GeForce GTX 570"
# Clock rate: 1.62 GHz
# Total amount of global memory: 1275789312 bytes
# Number of multiprocessors: 15
# Number of cores: 120

Maybe it is the overclock of device 0.
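
The comparison made above, two cards of the same model reporting different clock rates, can be pulled out of a stderr dump automatically. This is a rough sketch that assumes the `# Device N: "name"` and `# Clock rate: X GHz` line format seen in this thread; the function names are hypothetical. A mismatch is only a hint that one card may be overclocked, not proof.

```python
import re

DEVICE_RE = re.compile(r'#\s*Device\s+(\d+):\s*"([^"]+)"')
CLOCK_RE = re.compile(r"#\s*Clock rate:\s*([\d.]+)\s*GHz")


def device_clocks(stderr_text):
    """Collect (index, name, clock in GHz) for each device listed in the log."""
    devices = []
    current = None
    for line in stderr_text.splitlines():
        match = DEVICE_RE.search(line)
        if match:
            current = {"index": int(match.group(1)),
                       "name": match.group(2),
                       "clock_ghz": None}
            devices.append(current)
            continue
        match = CLOCK_RE.search(line)
        if match and current is not None:
            current["clock_ghz"] = float(match.group(1))
    return devices


def clock_mismatches(devices):
    """Return (model, sorted clocks) for models that report differing clocks."""
    by_name = {}
    for dev in devices:
        if dev["clock_ghz"] is not None:
            by_name.setdefault(dev["name"], set()).add(dev["clock_ghz"])
    return [(name, sorted(clocks))
            for name, clocks in by_name.items() if len(clocks) > 1]


log = """# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 570"
# Clock rate: 1.88 GHz
# Device 1: "GeForce GTX 570"
# Clock rate: 1.62 GHz
"""
print(clock_mismatches(device_clocks(log)))
```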

Francis Butts
Joined: 6 Nov 08
Posts: 1
Credit: 65,381,567
RAC: 0
Level
Thr
Message 25493 - Posted: 4 Jun 2012 | 16:45:42 UTC

name 249-GIANNI_TEST7-0-5-RND2924
application ACEMD beta version
created 2 Jun 2012 | 9:34:56 UTC
I'm using an XPS 720 (4 cores, 64-bit Windows 7) with a GTX 560 Ti to work on this latest batch of work units. All have failed! Will you please acknowledge and advise?

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 25496 - Posted: 4 Jun 2012 | 18:58:51 UTC - in response to Message 25493.

Francis Butts, I suggest you stop trying to crunch the Beta App. We know there is a problem with it (there are lots of posts saying that the 6.45 app fails tasks).

http://www.gpugrid.net/prefs.php?subset=project

Run test applications? No
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help
