Beyond
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level
Scientific publications
|
Just received 5 of these:
GERARD_CXCL12LOCKMONO
Haven't seen this type before. All failed in about 1 second.
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
Most are failing on Linux too. Errors:
Stderr output
<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 176 (0xb0, -80)
</message>
<stderr_txt>
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1215MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 5000)
</stderr_txt>
]]>
Stderr output
<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 199 (0xc7, -57)
</message>
<stderr_txt>
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1215MHz
# Memory clock : 3505MHz
# Memory width : 256bit
SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert -57
</stderr_txt>
]]>
Stderr output
<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 176 (0xb0, -80)
</message>
<stderr_txt>
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1215MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 5000)
</stderr_txt>
]]>
Stderr output
<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 199 (0xc7, -57)
</message>
<stderr_txt>
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1215MHz
# Memory clock : 3505MHz
# Memory width : 256bit
SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert -57
</stderr_txt>
]]>
Maybe these are designed to 'fail early' if they are likely to fail at all?
Task 15166493 has reached 5.5% after 1h on my Linux system, so the odd one appears to be running normally.
1x39-GERARD_CXCL12LOCKMONO-0-3-RND8941_0
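For anyone puzzled by the three numbers in each "process exited with code" line above: they are the same exit status written as an unsigned byte, in hex, and as a signed byte, which is why code 199 lines up with "swan_assert -57". A minimal sketch in Python, assuming the status is read as an 8-bit two's-complement value; the helper names are made up for illustration and are not part of BOINC or the GPUGRID app:

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret an unsigned integer as a two's-complement signed value."""
    mask = (1 << bits) - 1
    value &= mask
    # Values with the top bit set represent negative numbers.
    return value - (1 << bits) if value >= 1 << (bits - 1) else value


def to_unsigned(value: int, bits: int) -> int:
    """Return the unsigned bit pattern of a signed integer."""
    return value & ((1 << bits) - 1)


# Exit codes quoted in the stderr output above, read as 8-bit values.
for code in (176, 199):
    print(f"{code} = {hex(code)} = {to_signed(code, 8)}")
```

This prints "176 = 0xb0 = -80" and "199 = 0xc7 = -57", matching the messages in the logs.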
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help |
|
|
skgiven
|
Server says:
Detailed computing status
Application            Unsent   In progress   Success   Error rate
GERARD_CXCL12LOCKMON   234      170           0         100%
http://www.gpugrid.net/server_status.php - scroll down
If it's a new batch, this is to be expected: lots of immediate failures will come back before any completed tasks. The first completed task might not report until ~7h after it was sent.
The first healthy-looking LOCKMONO task I'm running is now at 10%, and I've another on W10 at 1.3% after 11 mins. GPU usage is 80% at 73% power with temp throttling enabled, using 1GB GDDR at present; that all looks quite normal.
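As a rough check on whether the running tasks look like normal long runs, the progress figures above can be extrapolated to a total run time. This is only a sketch that assumes progress is linear in time (real tasks may not be); the function name is hypothetical:

```python
def estimate_total_hours(progress_percent: float, elapsed_minutes: float) -> float:
    """Extrapolate total run time in hours, assuming linear progress."""
    if not 0 < progress_percent <= 100:
        raise ValueError("progress_percent must be in (0, 100]")
    total_minutes = elapsed_minutes / (progress_percent / 100)
    return total_minutes / 60


# Figures quoted in the posts above.
print(round(estimate_total_hours(5.5, 60), 1))  # Linux task: 5.5% after 1h
print(round(estimate_total_hours(1.3, 11), 1))  # W10 task: 1.3% after 11 mins
```

Under that assumption the two tasks would take roughly 18 and 14 hours on those cards.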
|
|
|
I had a couple of these WUs fail as well before getting a couple of good WUs, which are crunching well right now. I hope they finish successfully.
Name 0x20-GERARD_CXCL12LOCKMONO-0-3-RND0335_2
Workunit 11645809
Created 22 Jun 2016 | 6:51:54 UTC
Sent 22 Jun 2016 | 8:20:34 UTC
Received 22 Jun 2016 | 9:06:57 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 263612
Report deadline 27 Jun 2016 | 8:20:34 UTC
Run time 0.00
CPU time 0.00
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Stderr output
<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:02:00.0
# Device clock : 1190MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# GPU 0 : 63C
# GPU 1 : 48C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 5000)
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:02:00.0
# Device clock : 1190MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# The simulation has become unstable. Terminating to avoid lock-up (1)
</stderr_txt>
]]>
Name 0x23-GERARD_CXCL12LOCKMONO-0-3-RND1112_0
Workunit 11645812
Created 21 Jun 2016 | 15:15:19 UTC
Sent 22 Jun 2016 | 5:09:24 UTC
Received 22 Jun 2016 | 5:53:14 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 263612
Report deadline 27 Jun 2016 | 5:09:24 UTC
Run time 0.00
CPU time 0.00
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Stderr output
<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1266MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# GPU 0 : 61C
# GPU 1 : 57C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 5000)
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1266MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# The simulation has become unstable. Terminating to avoid lock-up (1)
</stderr_txt>
]]>
|
|
|
Beyond
|
On all our machines (you, Bedrich and me), the failed ones all begin with "0x". Some are up to 5 errors with no successes. I have 4 more GERARD_CXCL12LOCKMONO tasks running now that begin with "1x" and seem to be progressing normally (one's over 30%).
|
|
Beyond
|
Looks like 4 have now been completed. Bet they all have the "1x" prefix, not "0x":
GERARD_CXCL12LOCKMON 36 368 4 98.46%
At least we know that some of them are OK.
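The 98.46% figure also hints at how many tasks have errored so far. A small sketch, assuming the server computes the error rate as errors / (errors + successes); that formula is my assumption and is not confirmed anywhere on the status page, and the function names are hypothetical:

```python
def error_rate(errors: int, successes: int) -> float:
    """Error rate in percent, assuming rate = errors / (errors + successes)."""
    return 100 * errors / (errors + successes)


def errors_from_rate(successes: int, rate_percent: float) -> int:
    """Invert the assumed formula to estimate the number of errored tasks."""
    if not 0 <= rate_percent < 100:
        raise ValueError("rate_percent must be in [0, 100)")
    fraction = rate_percent / 100
    return round(successes * fraction / (1 - fraction))


estimated = errors_from_rate(4, 98.46)
print(estimated, round(error_rate(estimated, 4), 2))
```

With 4 successes this gives about 256 errors, and 256 errors against 4 successes rounds back to 98.46%, so the numbers are at least consistent with that formula.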
|
|
|
eXaPower
Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level
Scientific publications
|
My host #208061:
0x83-GERARD_CXCL12LOCKMONO-0-3-RND0285_1
-97 (0xffffffffffffff9f) Unknown error number
Attempting restart (step 5000)
Run time 1.00
CPU time 0.00
Paradoxically, a -97 error is thought of as an overclocking problem.
3x96-GERARD_CXCL12LOCKMONO-0-3-RND9182_0
-97 (0xffffffffffffff9f) Unknown error number
Attempting restart (step 3845000)
Run time 10,433.28
CPU time 3,276.63
A note about e6s8_e5s7p0f230-GIANNI_MORC36bCHL1-0-1-RND7755_0:
It faulted at 99.992% WU completion a couple of days ago on my host.
ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
-98 (0xffffffffffffff9e) Unknown error number
Run time 68,745.34
CPU time 20,137.64
|
|
|
|
I had 2 of these WUs complete successfully. They were 1x and 2x WUs.
Name 1x28-GERARD_CXCL12LOCKMONO-0-3-RND5689_0
Workunit 11645918
Created 21 Jun 2016 | 15:17:47 UTC
Sent 22 Jun 2016 | 5:59:44 UTC
Received 22 Jun 2016 | 14:21:20 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 263612
Report deadline 27 Jun 2016 | 5:59:44 UTC
Run time 29,436.96
CPU time 29,315.81
Validate state Valid
Credit 294,750.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Name 2x67-GERARD_CXCL12LOCKMONO-0-3-RND9322_0
Workunit 11646058
Created 21 Jun 2016 | 15:21:10 UTC
Sent 22 Jun 2016 | 9:12:36 UTC
Received 22 Jun 2016 | 17:48:52 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 263612
Report deadline 27 Jun 2016 | 9:12:36 UTC
Run time 30,274.28
CPU time 30,138.36
Validate state Valid
Credit 294,750.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
I also had another 0x WU fail, which makes 3 for me.
Name 0x54-GERARD_CXCL12LOCKMONO-0-3-RND3534_2
Workunit 11645843
Created 22 Jun 2016 | 10:05:17 UTC
Sent 22 Jun 2016 | 13:11:49 UTC
Received 22 Jun 2016 | 13:33:01 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 30790
Report deadline 27 Jun 2016 | 13:11:49 UTC
Run time 1.31
CPU time 1.31
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Looks like the 0x WUs are all bad, and should be canceled.
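To make the pattern easier to see, the tasks quoted in this thread so far can be tallied by the "Nx" prefix of the task name. A minimal sketch: the names are copied from the posts above, the "error"/"success" labels are my own shorthand for the reported outcomes, and the helper names are hypothetical:

```python
import re
from collections import Counter, defaultdict

# (task name, outcome) pairs as reported in the posts above.
results = [
    ("0x20-GERARD_CXCL12LOCKMONO-0-3-RND0335_2", "error"),
    ("0x23-GERARD_CXCL12LOCKMONO-0-3-RND1112_0", "error"),
    ("0x83-GERARD_CXCL12LOCKMONO-0-3-RND0285_1", "error"),
    ("3x96-GERARD_CXCL12LOCKMONO-0-3-RND9182_0", "error"),
    ("1x28-GERARD_CXCL12LOCKMONO-0-3-RND5689_0", "success"),
    ("2x67-GERARD_CXCL12LOCKMONO-0-3-RND9322_0", "success"),
    ("0x54-GERARD_CXCL12LOCKMONO-0-3-RND3534_2", "error"),
]


def prefix_of(task_name: str) -> str:
    """Return the leading 'Nx' part of a task name, or '?' if absent."""
    match = re.match(r"(\d+x)", task_name)
    return match.group(1) if match else "?"


def tally_by_prefix(pairs):
    """Count outcomes per task-name prefix."""
    tally = defaultdict(Counter)
    for name, outcome in pairs:
        tally[prefix_of(name)][outcome] += 1
    return {prefix: dict(counts) for prefix, counts in tally.items()}


for prefix, counts in sorted(tally_by_prefix(results).items()):
    print(prefix, counts)
```

In this small sample every 0x task is an error and the 1x and 2x tasks are successes. The 3x96 task is also listed with -97, but after a run time of 10,433.28 s rather than in the first seconds, so it may be a different kind of failure.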
|
|
|
|
Enough with these 0x WUs already; I had 2 more fail on me.
Name 0x0-GERARD_CXCL12LOCKMONO-0-3-RND6293_6
Workunit 11645789
Created 23 Jun 2016 | 17:22:43 UTC
Sent 23 Jun 2016 | 18:54:17 UTC
Received 23 Jun 2016 | 19:40:13 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 263612
Report deadline 28 Jun 2016 | 18:54:17 UTC
Run time 13.19
CPU time 13.19
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Name 0x47-GERARD_CXCL12LOCKMONO-0-3-RND0003_4
Workunit 11645836
Created 24 Jun 2016 | 5:30:48 UTC
Sent 24 Jun 2016 | 7:04:56 UTC
Received 24 Jun 2016 | 7:31:29 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 30790
Report deadline 29 Jun 2016 | 7:04:56 UTC
Run time 1.22
CPU time 1.22
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
These WUs are bad. Please cancel them.
|
|
|