Message boards : Number crunching : GPU units failing

Author Message
_Ryle_
Joined: 7 Jun 09
Posts: 24
Credit: 1,149,643,416
RAC: 9,306
Level
Met
Message 50313 - Posted: 28 Aug 2018 | 13:01:10 UTC

Looks like there is a problem with current GPU Workunits:

http://gpugrid.net/workunit.php?wuid=14409360

I've had around 6 fail in a short time, and my quota is now exceeded. Linux here.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Message 50315 - Posted: 28 Aug 2018 | 13:30:34 UTC - in response to Message 50313.

True, "Error reading parmtop file" means something is amiss in the WU.

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Level
Cys
Message 50319 - Posted: 28 Aug 2018 | 13:51:03 UTC

They all fail immediately, on both Windows and Linux.
Tullio

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,256,332,676
RAC: 29,209,412
Level
Trp
Message 50321 - Posted: 28 Aug 2018 | 15:50:34 UTC

Same thing here on Windows:

ERROR: file mdioload.cpp line 229: Error reading parmtop file

:-(

MartinKanne
Joined: 27 Dec 16
Posts: 6
Credit: 53,210,225
RAC: 0
Level
Thr
Message 50324 - Posted: 28 Aug 2018 | 18:15:08 UTC

In my case, short runs are working, but long runs are not (Windows).

_Ryle_
Joined: 7 Jun 09
Posts: 24
Credit: 1,149,643,416
RAC: 9,306
Level
Met
Message 50326 - Posted: 28 Aug 2018 | 18:37:46 UTC

Would it be a good idea to pull the bad batch out of the queue, since it has a 100% error rate according to the server status page? These workunits also have odd file names compared to "normal" ones, e.g. AGUAGUAGUA etc. I don't know whether that is of any significance, though.

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,256,332,676
RAC: 29,209,412
Level
Trp
Message 50327 - Posted: 28 Aug 2018 | 19:29:39 UTC - in response to Message 50326.

Is it a good idea to pull the bad batch out of the queue ...

It definitely is, particularly since once a (low) number of such tasks have failed on a given host, that host won't receive any new tasks (good or bad) for the next 24 hours.
So these hosts are "punished", although they didn't do anything wrong.

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,256,332,676
RAC: 29,209,412
Level
Trp
Message 50329 - Posted: 29 Aug 2018 | 10:58:58 UTC
Last modified: 29 Aug 2018 | 11:00:08 UTC

Can anyone explain to me why, while the Project Status Page shows hundreds of unsent "PABLO_2IDP..." tasks, all my hosts download only these faulty "PABLO_prod_1_UUAUACCUACCA_350K_2_ru" (and similar) tasks, although only a handful of them are left unsent?
This is rather annoying :-(
What I don't understand: once it became clear that they are all faulty, why were they not withdrawn from the queue?

As a consequence, I have given up crunching GPUGRID for the time being, and will turn to other projects.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 50330 - Posted: 29 Aug 2018 | 11:33:22 UTC - in response to Message 50329.

It would help if GPUGrid used a short beta test for new work units. The bad ones would show up in a hurry, and it would avoid filling up the main cache with them. It always takes time to clear out the faulty ones in BOINC; I don't think there is a fast way to do it.

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,256,332,676
RAC: 29,209,412
Level
Trp
Message 50338 - Posted: 30 Aug 2018 | 10:42:17 UTC - in response to Message 50330.

It would help if GPUGrid used a short beta test for new work units. The bad ones would show up in a hurry ...

In fact, I am surprised that new tasks are not tested at all before they are sent to the download queue :-(
