Message boards : Number crunching : Errors resuming after power outage
Author | Message |
---|---|
My computer recently restarted unexpectedly. It may have been a brief power outage, though I am not 100% sure.
Server state Over
Outcome Computation error
Client state Compute error
Exit status -52 (0xffffffffffffffcc) Unknown error number
Validate state Invalid
And they all had the following at the bottom of their stderr.txt:
SWAN : FATAL Unable to load module .mshake_kernel.cu. (702)
Can anything be done so that GPUGrid GPU tasks can be restarted and resumed in this scenario?
e13s16_e1s33f90-NOELIA_1mgx1-2-4-RND0021_0 http://www.gpugrid.net/result.php?resultid=13982550
e26s10_e20s4f232-SDOERR_villinpub2-0-1-RND0381_3 http://www.gpugrid.net/result.php?resultid=13983070
e15s46_e1s400f24-NOELIA_1mgx2-1-4-RND5323_0 http://www.gpugrid.net/result.php?resultid=13983199
2Mgx471-NOELIA_INSP-11-12-RND1315_0 http://www.gpugrid.net/result.php?resultid=13983283
e12s13_e4s36f65-NOELIA_1mgx1-3-4-RND7924_0 http://www.gpugrid.net/result.php?resultid=13983393
e15s46_e1s386f84-NOELIA_1mgx1-1-4-RND3709_0 http://www.gpugrid.net/result.php?resultid=13983801
Note: On this computer, I load my 3 GPUs with 2 tasks per GPU. | |
ID: 40492 | Rating: 0 | |
Seconded. | |
ID: 40507 | Rating: 0 | |
MJH: Any chance you might look at this problem? | |
ID: 40536 | Rating: 0 | |
I have two validation errors after an abrupt power-off of a host. The WUs resumed from checkpoints and completed, but ended in validation errors (two-GPU host). | |
ID: 40729 | Rating: 0 | |
I had the same problem 2 days ago. The WUs either crash immediately or they continue normally and then you get the validation error when they upload. Crashing immediately is not a new problem, but the validation error is. | |
ID: 40730 | Rating: 0 | |
I just had a WU that gave a Computation error after 23 hours of crunching because of a power failure. It happens to me every now and then, especially during rainy seasons when thunder causes the power in my house to trip. Over the years, I've probably lost 30-40 half-completed WUs this way. | |
ID: 47003 | Rating: 0 | |
Duane, | |
ID: 47005 | Rating: 0 | |
Duane, Thanks for the suggestion. I checked in my Device Manager under the drive > policies and find that the Write Caching box is already unchecked.... yet I still lost the WU after the power outage. But I don't think data corruption of the checkpoint is the issue. This is the report I see for the WU:
SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)
Seems after the power failure and reboot it has some kernel error? It is the exact same problem that the starter of this thread reported. But yet the next WU in the queue starts crunching fine after that.
<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -52 (0xffffffcc)
</message>
<stderr_txt>
# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 960
# ECC : Disabled
# Global mem : 2048MB
# Capability : 5.2
# PCI ID : 0000:28:00.0
# Device clock : 1291MHz
# Memory clock : 3600MHz
# Memory width : 128bit
# Driver version : r381_64 : 38165
# GPU 0 : 59C
# GPU 0 : 60C
# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 960
# ECC : Disabled
# Global mem : 2048MB
# Capability : 5.2
# PCI ID : 0000:28:00.0
# Device clock : 1291MHz
# Memory clock : 3600MHz
# Memory width : 128bit
# Driver version : r381_64 : 38165
# GPU 0 : 57C
Can't acquire lockfile - exiting
No heartbeat from core client for 30 sec - exiting
# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 960
# ECC : Disabled
# Global mem : 2048MB
# Capability : 5.2
# PCI ID : 0000:28:00.0
# Device clock : 1291MHz
# Memory clock : 3600MHz
# Memory width : 128bit
# Driver version : r381_64 : 38165
# GPU 0 : 58C
# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 960
# ECC : Disabled
# Global mem : 2048MB
# Capability : 5.2
# PCI ID : 0000:28:00.0
# Device clock : 1291MHz
# Memory clock : 3600MHz
# Memory width : 128bit
# Driver version : r381_64 : 38165
# GPU 0 : 57C
# GPU 0 : 58C
# GPU 0 : 59C
# GPU 0 : 60C
# GPU 0 : 61C
# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 960
# ECC : Disabled
# Global mem : 2048MB
# Capability : 5.2
# PCI ID : 0000:28:00.0
# Device clock : 1291MHz
# Memory clock : 3600MHz
# Memory width : 128bit
# Driver version : r381_64 : 38165
SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)
</stderr_txt>
]]> | |
ID: 47006 | Rating: 0 | |
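For anyone comparing stderr dumps like the ones quoted in this thread, here is a rough Python sketch that counts how many times a task was (re)started and pulls out the module name and number from the SWAN fatal line. The function name `summarize_stderr` and the two patterns are my own assumptions, based only on the log text pasted above; they are not part of BOINC or the GPUGrid application, and other app versions may format these lines differently.

```python
import re

# Assumed pattern: matches the fatal line exactly as quoted in this thread,
# e.g. "SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)".
FATAL_RE = re.compile(r"SWAN : FATAL Unable to load module (\S+?)\. \((\d+)\)")

# Assumed pattern: the banner printed each time the app starts or resumes,
# e.g. "# GPU [GeForce GTX 960] Platform [Windows] ...".
# Temperature lines such as "# GPU 0 : 59C" do not match (no bracket).
HEADER_RE = re.compile(r"# GPU \[[^\]]+\] Platform \[")


def summarize_stderr(text):
    """Return the start count and fatal-error details found in a stderr dump."""
    match = FATAL_RE.search(text)
    return {
        "starts": len(HEADER_RE.findall(text)),
        "fatal_module": match.group(1) if match else None,
        "fatal_code": int(match.group(2)) if match else None,
    }


if __name__ == "__main__":
    sample = (
        "# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]\n"
        "# GPU 0 : 59C\n"
        "# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]\n"
        "SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)\n"
    )
    print(summarize_stderr(sample))
```

Run against the two reports in this thread, this would only show that both fail while loading the same module with different numbers (702 and 719); it does not say what those numbers mean.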