Message boards : Number crunching : 1-GERARD_MO_MOR WUs failing immediately

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 44012 - Posted: 18 Jul 2016 | 18:03:02 UTC

A couple of examples:

https://www.gpugrid.net/workunit.php?wuid=11674070

https://www.gpugrid.net/workunit.php?wuid=11674120

The error is always: ERROR: file force.cpp line 513: TCL evaluation of [calcforces]

However I do have a 0-GERARD_MO_MOR WU that's still running after 5 hours.

Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 11,126,177,899
RAC: 15,400,256
Level
Trp
Message 44016 - Posted: 18 Jul 2016 | 22:01:45 UTC - in response to Message 44012.

Same thing here!

Name e1s34_1-GERARD_MO_MOR_1-0-1-RND2358_2
Workunit 11674055
Created 18 Jul 2016 | 14:57:53 UTC
Sent 18 Jul 2016 | 14:57:56 UTC
Received 18 Jul 2016 | 19:18:31 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -98 (0xffffffffffffff9e) Unknown error number
Computer ID 263612
Report deadline 23 Jul 2016 | 14:57:56 UTC
Run time 3.25
CPU time 1.22
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Stderr output

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -98 (0xffffff9e)
</message>
<stderr_txt>
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1266MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
15:16:11 (6520): called boinc_finish

</stderr_txt>
]]>

Name e1s41_1-GERARD_MO_MOR_2-0-1-RND9697_0
Workunit 11674112
Created 18 Jul 2016 | 12:13:48 UTC
Sent 18 Jul 2016 | 12:40:09 UTC
Received 18 Jul 2016 | 19:18:31 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -98 (0xffffffffffffff9e) Unknown error number
Computer ID 263612
Report deadline 23 Jul 2016 | 12:40:09 UTC
Run time 3.45
CPU time 1.17
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Stderr output

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -98 (0xffffff9e)
</message>
<stderr_txt>
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1266MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
15:16:07 (2340): called boinc_finish

</stderr_txt>
]]>
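For anyone puzzled by the two different hex values in these reports: the task page shows 0xffffffffffffff9e while the stderr shows 0xffffff9e, but both are the same signed value, -98, written as 64-bit and 32-bit two's complement. A minimal Python sketch of that arithmetic follows. The helper names are made up for illustration, and nothing here says what -98 means inside the application.

```python
def as_unsigned_hex(code: int, bits: int) -> str:
    """Return the two's-complement hex form of a signed exit code."""
    mask = (1 << bits) - 1
    return hex(code & mask)


def from_unsigned(value: int, bits: int) -> int:
    """Interpret an unsigned value as a signed two's-complement integer."""
    if value >= 1 << (bits - 1):
        return value - (1 << bits)
    return value


# Exit code reported in the thread, in both widths.
print(as_unsigned_hex(-98, 32))
print(as_unsigned_hex(-98, 64))
# And back from the 32-bit form shown in the stderr output.
print(from_unsigned(0xFFFFFF9E, 32))
```

So the differing hex strings are only a display-width difference, not two different errors.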



[AF>Amis des Lapins]Abidi...
Joined: 22 Dec 12
Posts: 2
Credit: 272,996,387
RAC: 0
Level
Asn
Message 44018 - Posted: 19 Jul 2016 | 8:20:38 UTC - in response to Message 44016.

Same here:

Stderr output
<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -98 (0xffffff9e)
</message>
<stderr_txt>
# GPU [GeForce GTX 980] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:0F:00.0
# Device clock : 1215MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r352_00 : 35362
ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
23:14:00 (5980): called boinc_finish

</stderr_txt>
]]>

Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 11,126,177,899
RAC: 15,400,256
Level
Trp
Message 44021 - Posted: 19 Jul 2016 | 23:48:27 UTC

I had 2 more of these units error out:


Name e1s39_1-GERARD_MO_MOR_1-0-1-RND2698_6
Workunit 11674060
Created 19 Jul 2016 | 21:42:11 UTC
Sent 19 Jul 2016 | 21:42:18 UTC
Received 19 Jul 2016 | 21:44:52 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -98 (0xffffffffffffff9e) Unknown error number
Computer ID 263612
Report deadline 24 Jul 2016 | 21:42:18 UTC
Run time 3.08
CPU time 1.17
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Stderr output

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -98 (0xffffff9e)
</message>
<stderr_txt>
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:02:00.0
# Device clock : 1190MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
17:47:08 (6440): called boinc_finish

</stderr_txt>
]]>


Name e1s50_1-GERARD_MO_MOR_1-0-1-RND9002_1
Workunit 11674071
Created 19 Jul 2016 | 17:21:34 UTC
Sent 19 Jul 2016 | 17:21:49 UTC
Received 19 Jul 2016 | 21:37:22 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -98 (0xffffffffffffff9e) Unknown error number
Computer ID 30790
Report deadline 24 Jul 2016 | 17:21:49 UTC
Run time 6.00
CPU time 2.83
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Stderr output

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -98 (0xffffff9e)
</message>
<stderr_txt>
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:02:00.0
# Device clock : 1190MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r355_00 : 35582
ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
17:39:53 (3556): called boinc_finish

</stderr_txt>
]]>

This looks like a bad batch.
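One way to check that every failure really is the same error is to pull the ERROR line out of each stderr dump and count the distinct ones. The following is a rough Python sketch, assuming the stderr text has already been saved as strings. The function name and the regular expression are illustrative and only match the line format pasted in this thread.

```python
import re
from collections import Counter

# Matches lines such as:
# ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
ERROR_RE = re.compile(
    r"^ERROR: file (?P<file>\S+) line (?P<line>\d+): (?P<msg>.*)$",
    re.MULTILINE,
)


def summarize_errors(stderr_texts):
    """Count distinct (file, line, message) errors across stderr dumps."""
    counts = Counter()
    for text in stderr_texts:
        for m in ERROR_RE.finditer(text):
            counts[(m["file"], int(m["line"]), m["msg"].strip())] += 1
    return counts


# Excerpts copied from the two task reports above.
samples = [
    "# Driver version : r358_00 : 35906\n"
    "ERROR: file force.cpp line 513: TCL evaluation of [calcforces]\n"
    "17:47:08 (6440): called boinc_finish\n",
    "# Driver version : r355_00 : 35582\n"
    "ERROR: file force.cpp line 513: TCL evaluation of [calcforces]\n"
    "17:39:53 (3556): called boinc_finish\n",
]

error_counts = summarize_errors(samples)
for (src, line, msg), n in error_counts.items():
    print(f"{n}x {src}:{line} {msg}")
```

A single entry in the tally across different hosts and driver versions would support the bad-batch reading rather than a problem on any one machine.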


Greger
Joined: 6 Jan 15
Posts: 76
Credit: 24,195,148,333
RAC: 10,296,438
Level
Trp
Message 44023 - Posted: 20 Jul 2016 | 19:50:39 UTC
Last modified: 20 Jul 2016 | 19:51:07 UTC

Same here: 90.45% end up failing. What can we do about it, and how do the other 9.55% manage to succeed on those WUs? Each failure takes only 3-4 seconds this time, so little is lost, but I would like to see a lower error rate.

http://www.gpugrid.net/workunit.php?wuid=11674051
http://www.gpugrid.net/workunit.php?wuid=11674098
http://www.gpugrid.net/workunit.php?wuid=11674051

I got the same WU again, but on another host.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,887,618,176
RAC: 19,841,041
Level
Tyr
Message 44024 - Posted: 20 Jul 2016 | 21:34:13 UTC

I had e1s27_1-GERARD_MO_MOR_1-0-1-RND0098: it reached the maximum number of errors with no successful returns at all.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 44025 - Posted: 20 Jul 2016 | 21:39:49 UTC - in response to Message 44023.

A 90% failure rate says there is a problem with the model.
The fact that no tasks are available confirms there is a problem.

Why are we not using a beta queue to test such tasks?

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 44027 - Posted: 21 Jul 2016 | 0:38:03 UTC - in response to Message 44023.
Last modified: 21 Jul 2016 | 0:40:00 UTC

>> Same here: 90.45% end up failing. What can we do about it, and how do the other 9.55% manage to succeed on those WUs? Each failure takes only 3-4 seconds this time, so little is lost, but I would like to see a lower error rate.

Because the 0-GERARD_MO_MOR WUs seem OK while the 1-GERARD_MO_MOR WUs all fail.
All the admins are apparently either on vacation or asleep.
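The 0- versus 1- split can be checked against your own task list by grouping failed task names on the number just before "-GERARD_MO_MOR". Here is a small Python sketch, assuming that number is what the thread title calls the "1-" prefix. The regular expression and function name are illustrative, and the names are the five failed tasks posted in this thread (no 0- task name was posted, so none appears).

```python
import re
from collections import Counter

# Assumption: the digits right before "-GERARD_MO_MOR" are the prefix
# people refer to as "0-" or "1-" in this thread.
PREFIX_RE = re.compile(r"_(\d+)-GERARD_MO_MOR")


def batch_prefix(task_name):
    """Return the numeric prefix before -GERARD_MO_MOR, or None."""
    m = PREFIX_RE.search(task_name)
    return m.group(1) if m else None


# Failed task names reported in this thread.
failed = [
    "e1s34_1-GERARD_MO_MOR_1-0-1-RND2358_2",
    "e1s41_1-GERARD_MO_MOR_2-0-1-RND9697_0",
    "e1s39_1-GERARD_MO_MOR_1-0-1-RND2698_6",
    "e1s50_1-GERARD_MO_MOR_1-0-1-RND9002_1",
    "e1s27_1-GERARD_MO_MOR_1-0-1-RND0098",
]

failures_by_prefix = Counter(batch_prefix(name) for name in failed)
print(failures_by_prefix)
```

Every failure posted so far falls under the 1- prefix, which fits the observation above. Adding your own successful task names to a second list would show whether any 1- task ever completes.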

>> Why are we not using a beta queue to test such tasks?

We're open to theories...
