
Message boards : Number crunching : Bad batch of WU's

flashawk
Message 49260 - Posted: 13 Apr 2018 | 21:13:40 UTC
Last modified: 13 Apr 2018 | 21:15:01 UTC

I just shot up to 21 errors. I was watching a group of WU's start up, and all of them pushed my three 1080's to 2100 MHz and then failed. They are failing on everyone's computers; they are Adria's WU's.

WPrion
Message 49261 - Posted: 13 Apr 2018 | 21:25:54 UTC - in response to Message 49260.
Last modified: 13 Apr 2018 | 21:28:24 UTC

I just had two Pablo's crash as soon as they started.

e175s2_e44s10p0f2-PABLO_p27_wild_0_sj403-0
e174s112_e63s8p1f198-PABLO_p27_wild_0_sj403_IDP-0

Update - I just checked my task list. I've had seven bad Pablos today.

Win

flashawk
Message 49262 - Posted: 13 Apr 2018 | 21:31:02 UTC - in response to Message 49261.

Strange, both of their WU's are crashing. You have 7 from today; I was poking around, and everyone is getting errors on long WU's.

Retvari Zoltan
Message 49263 - Posted: 13 Apr 2018 | 22:45:41 UTC - in response to Message 49262.
Last modified: 13 Apr 2018 | 22:46:06 UTC

All tasks error out immediately with:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile
This should be fixed by the staff ASAP.

flashawk
Message 49265 - Posted: 14 Apr 2018 | 14:57:47 UTC

Any word yet? Has anyone gotten any WU's, and did they fail? Where are the moderators? I haven't seen any mods since I've been back. Also, has anyone heard anything about CPDN from those who crunch for them? Their forum and everything else has been down for almost 2 weeks now: no WU's, not a word. Their main page is up, but there's no mention of what happened. I know there are some people here who crunch for them; just wondering if they might have heard something.

Richard Haselgrove
Message 49266 - Posted: 14 Apr 2018 | 15:33:20 UTC - in response to Message 49265.

There was a batch of bad PABLO tasks created between about 17:30 and 18:00 UTC yesterday afternoon. I've watched some crash, and I've aborted some others (after checking first that they had failed on other machines). But there are good tasks created before and after.
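
For anyone who wants to do the same clean-up in bulk from the command line, a rough sketch using boinccmd is below. Treat it as a sketch only: abort is irreversible, check the workunit pages first, and the batch string and project URL are assumptions that you should verify against your own task list (boinccmd may also need the GUI RPC password, depending on your setup).

import re, subprocess

# Name fragment of the suspect batch and the project URL as registered in the client.
# Both are assumptions - verify them against "boinccmd --get_tasks" output first.
PROJECT_URL = "http://www.gpugrid.net/"
BAD_STRING = "PABLO_p27_wild_0_sj403"

tasks = subprocess.run(["boinccmd", "--get_tasks"], capture_output=True, text=True).stdout
# Task names appear on lines of the form "   name: <task name>".
for name in re.findall(r"^\s+name: (\S+)", tasks, re.M):
    if BAD_STRING in name:
        print("aborting", name)
        subprocess.run(["boinccmd", "--task", PROJECT_URL, name, "abort"])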

3de64piB5uZAS6SUNt1GFDU9d...
Message 49267 - Posted: 14 Apr 2018 | 20:12:00 UTC

I had quite a few today as well... scared me to death. Frankly, I suspected my four-month-old ASUS ROG GTX 1070 of being defective and was (figuratively) about to throw it out of the window... when I stumbled across the same error:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

Saved by the bell :)
____________
I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.

STFC9F22
Message 49277 - Posted: 16 Apr 2018 | 12:02:37 UTC

It seems another bad batch of PABLO_p27_wild_0_sj403_ID has just been released.

On 13 April around 18:10 I received seven of these which all failed after about 6 seconds (the event log reporting three absent files), and was then locked out, according to the event log, due to exceeding my daily quota.

I have just (16 April at 11:34) received two more tasks failing in the same manner, but have temporarily set the Project to 'No New Tasks' to avoid being locked out again.

The files shown as absent in the event log are:
16/04/2018 12:33:53 | GPUGRID | Output file e174s447_e62s29p1f212-PABLO_p27_wild_0_sj403_IDP-0-2-RND7636_1_1 for task e174s447_e62s29p1f212-PABLO_p27_wild_0_sj403_IDP-0-2-RND7636_1 absent
16/04/2018 12:33:53 | GPUGRID | Output file e174s447_e62s29p1f212-PABLO_p27_wild_0_sj403_IDP-0-2-RND7636_1_2 for task e174s447_e62s29p1f212-PABLO_p27_wild_0_sj403_IDP-0-2-RND7636_1 absent
16/04/2018 12:33:53 | GPUGRID | Output file e174s447_e62s29p1f212-PABLO_p27_wild_0_sj403_IDP-0-2-RND7636_1_3 for task e174s447_e62s29p1f212-PABLO_p27_wild_0_sj403_IDP-0-2-RND7636_1 absent

- although, as these are output files, I guess this might be a symptom of the failure rather than the cause.

Richard Haselgrove
Message 49278 - Posted: 16 Apr 2018 | 13:03:00 UTC - in response to Message 49277.
Last modified: 16 Apr 2018 | 13:07:17 UTC

- although, as these are output files, I guess this might be a symptom of the failure rather than the cause.

Yes, those are symptoms, not causes.

If you follow through your account (name link at top of page) / computer / workunit / task, you should be able to see something like workunit 13443713 - that one was well worth aborting.

And if you look at one of the errored tasks, the real cause:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

That was the earlier batch. Today's are possibly similar, but we need to see one to be sure. Your computers are hidden, and we don't have the 'find task by name' tool here, so we'll have to ask you to look it up for us.

Edit - thanks for the heads up, I've got one of those too. WU 13451812 is indeed the same as before, created 16 Apr 2018 | 11:22:43 UTC. That can go in the bit-bucket with the others.

Richard Haselgrove
Message 49279 - Posted: 16 Apr 2018 | 13:26:21 UTC

I've just been sent another from today's bad batch e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-2-RND8766. The files downloaded were

16/04/2018 14:16:44 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-LICENSE
16/04/2018 14:16:44 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-COPYRIGHT
16/04/2018 14:16:46 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-coor_file
16/04/2018 14:16:46 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-vel_file
16/04/2018 14:16:47 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-idx_file
16/04/2018 14:16:47 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-pdb_file
16/04/2018 14:16:49 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-psf_file
16/04/2018 14:16:52 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-par_file
16/04/2018 14:16:54 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-conf_file_enc
16/04/2018 14:16:55 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-metainp_file
16/04/2018 14:16:55 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-hills_file
16/04/2018 14:16:56 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-xsc_file
16/04/2018 14:16:56 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-prmtop_file

For comparison, I'm working on an older one, resent this morning but created on 11 April - e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-2-RND9196. Those files were called

16/04/2018 09:06:22 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-LICENSE
16/04/2018 09:06:22 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-COPYRIGHT
16/04/2018 09:06:23 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-0-2-RND9196_1
16/04/2018 09:06:23 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-0-2-RND9196_2
16/04/2018 09:06:24 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-0-2-RND9196_3
16/04/2018 09:06:24 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-pdb_file
16/04/2018 09:06:25 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-psf_file
16/04/2018 09:06:25 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-par_file
16/04/2018 09:06:26 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-conf_file_enc
16/04/2018 09:06:27 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-metainp_file
16/04/2018 09:06:27 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-0-2-RND9196_7
16/04/2018 09:06:28 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-0-2-RND9196_10
16/04/2018 09:06:28 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-prmtop_file

Quite a difference.
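
If anyone wants to repeat that comparison on their own event log, here is a rough sketch that does it mechanically. It assumes you have copied the "Started download of" lines for each workunit into two text files (the names bad_wu.log and good_wu.log are just placeholders), and it simply lists the per-file suffixes that differ between the two.

import re

def downloaded_suffixes(log_path, wu_prefix):
    # Collect the part of each downloaded file name that follows the workunit prefix.
    suffixes = set()
    with open(log_path) as log:
        for line in log:
            m = re.search(r"Started download of (\S+)", line)
            if m and m.group(1).startswith(wu_prefix):
                suffixes.add(m.group(1)[len(wu_prefix):].lstrip("-"))
    return suffixes

bad = downloaded_suffixes("bad_wu.log", "e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0")
good = downloaded_suffixes("good_wu.log", "e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1")
print("only in the bad unit: ", sorted(bad - good))
print("only in the good unit:", sorted(good - bad))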

robertmiles
Message 49280 - Posted: 16 Apr 2018 | 14:58:28 UTC

Two of my recent PABLO tasks gave Error while computing, with this error message in the stderr file:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

Could you check if this is due to a missing file that should have been sent with the task?

flashawk
Message 49281 - Posted: 16 Apr 2018 | 15:04:02 UTC

I just got 10 more bad Pablo's; that brings me to 34 total. The server is going to give me the boot for too many errors in such a short amount of time. I hope they figure this out soon.

Richard, do you have any idea what's going on over at CPDN? Everything has been down for 2 weeks or so and I was curious when they might be back up, thanks.

Richard Haselgrove
Message 49283 - Posted: 16 Apr 2018 | 15:50:55 UTC - in response to Message 49281.

Richard, do you have any idea what's going on over at CPDN? Everything has been down for 2 weeks or so and I was curious when they might be back up, thanks.

I received the same emails as have been quoted in the BOINC message board thread "CPDN project going offline this afternoon", but I've had no more specific news than that. Better to consolidate all the news that we do get in that thread, I think.

Richard Haselgrove
Message 49284 - Posted: 16 Apr 2018 | 16:01:21 UTC - in response to Message 49280.
Last modified: 16 Apr 2018 | 16:08:58 UTC

Could you check if this is due to a missing file that should have been sent with the task?

All the failed / failing tasks over the last four days have had the exact string

PABLO_p27_wild

in their name. I can see that I've completed at least one successfully with that string: also, at least one other with PABLO_p27_O43806_wild

I'll go and search the message logs to see what I can find, but I think any completely missing files would show up as a problem at the download stage, and never get as far as attempting to run. I think it's more likely that the contents are badly formatted in some way, and it won't be possible to compare good and bad after the event.

Edit - well, e173s16_e149s4p1f23-PABLO_p27_wild_0_sj403_IDP-1-2-RND2043 had file names with the workunit name embedded, like the second example in my comparison example earlier. I think that Pablo, or whoever is submitting the work on Pablo's behalf, might be using the wrong script/template when preparing the workunits.

mmonnin
Message 49286 - Posted: 16 Apr 2018 | 17:25:07 UTC

2nd batch of bad tasks today. 8 on the 13th and 4 more today.

Tuna Ertemalp
Message 49288 - Posted: 16 Apr 2018 | 17:43:52 UTC

So, one by one, my seven hosts will be jailed, wasting 12x 1080Ti and 2x TitanX... Something is very non-ideal with that picture. Given that bad batches like this happen with non-ignorable frequency, there should be a way to unblock in bulk the machines that were blocked/limited due to bad-batch issues, methinks.

flashawk
Message 49289 - Posted: 16 Apr 2018 | 18:31:20 UTC

Why don't they know what's going on? Don't they monitor their project?

Thanks Richard, sorry to bother you; I didn't think to check the BOINC forums. I hadn't heard anything on the CPDN forum even before they went down.

robertmiles
Message 49290 - Posted: 16 Apr 2018 | 19:10:17 UTC - in response to Message 49284.

Could you check if this is due to a missing file that should have been sent with the task?

All the failed / failing tasks over the last four days have had the exact string

PABLO_p27_wild

in their name. I can see that I've completed at least one successfully with that string: also, at least one other with PABLO_p27_O43806_wild

I'll go and search the message logs to see what I can find, but I think any completely missing files would show up as a problem at the download stage, and never get as far as attempting to run. I think it's more likely that the contents are badly formatted in some way, and it won't be possible to compare good and bad after the event.

Edit - well, e173s16_e149s4p1f23-PABLO_p27_wild_0_sj403_IDP-1-2-RND2043 had file names with the workunit name embedded, like the second example in my comparison example earlier. I think that Pablo, or whoever is submitting the work on Pablo's behalf, might be using the wrong script/template when preparing the workunits.

I'd expect the download stage to fail if the file was missing on the server, but only if the name of the file was included on the list of files sent with the task to tell the client what files to download before starting the task.

If the name of the file was missing from that list, I'd expect the download stage to download all the files on the list, report success for that stage, and the problem to become visible only when the application tries to open the file.

flashawk
Message 49291 - Posted: 17 Apr 2018 | 7:16:01 UTC

Why are these work units still in the queue? Is anyone running this program? Where are the moderators? This place is like an airplane on autopilot; it seems like some of these projects have no enthusiasm left.

flashawk
Message 49292 - Posted: 17 Apr 2018 | 7:16:05 UTC
Last modified: 17 Apr 2018 | 7:17:05 UTC

Double post

Betting Slip
Message 49293 - Posted: 17 Apr 2018 | 8:16:50 UTC - in response to Message 49291.
Last modified: 17 Apr 2018 | 8:53:36 UTC

Why are these work units still in the queue? Is anyone running this program? Where are the moderators? This place is like an airplane on autopilot; it seems like some of these projects have no enthusiasm left.


A similar post of mine about this project's participation with its contributors:


http://www.gpugrid.net/forum_thread.php?id=4585#47369

And another one:

http://www.gpugrid.net/forum_thread.php?id=4368&nowrap=true#48039

flashawk
Message 49294 - Posted: 17 Apr 2018 | 8:32:49 UTC - in response to Message 49293.

I'm under the impression they think we're all getting paid for crunching. I will never do data mining, ever!! I'll bet a lot of the dedicated crunchers who have been here for a while are now miners for hire.

STFC9F22
Message 49295 - Posted: 17 Apr 2018 | 9:25:23 UTC - in response to Message 49278.


If you follow through your account (name link at top of page) / computer / workunit / task, you should be able to see something like workunit 13443713 - that one was well worth aborting.


Richard, thank you for your post. I have received a single further failing task this morning, and following the link as you advised, it seems to have originated in yesterday's bad batch at 11:17 (UTC).

I notice that, following the links for all ten of my failed tasks (seven arising from the 13 April bad batch at around 17:50 UTC and three from the 16 April bad batch at around 11:20 UTC), they now all show exactly eight 'Error while computing' failures, so perhaps there is some mechanism whereby they are automatically pulled after eight failures on different computers?

I also notice on the Server Status page that PABLO_p27_wild_0_sj403_ID currently shows 740 successes and a 92.33% error rate which, if my maths is correct, suggests over 9,600 failures, potentially resulting in a large number of computers having been temporarily locked out. Frustrating though that is for us donors, it doesn't appear to have created a (GPU) processing backlog, which is perhaps an indication that the processing resource offered by donors currently far exceeds the requirements of the available work.
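
As a quick sanity check of that arithmetic (a sketch only, assuming the server-status error rate means errors divided by all results returned for the batch):

successes = 740
error_rate = 0.9233
total_returned = successes / (1 - error_rate)   # roughly 9,650 results returned
failures = total_returned - successes           # roughly 8,900 of them errors
print(round(total_returned), round(failures))

On that reading, the 9,600 figure is closer to the total number of results issued, with something like 8,900 of them failures; either way it is a great deal of wasted work and a lot of hosts at risk of hitting their daily quota.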

Tuna Ertemalp
Message 49299 - Posted: 17 Apr 2018 | 15:36:32 UTC - in response to Message 49288.
Last modified: 17 Apr 2018 | 15:36:43 UTC

So, one by one, my seven hosts will be jailed, wasting 12x 1080Ti and 2x TitanX... Something is very non-ideal with that picture. Given that bad batches like this happen with non-ignorable frequency, there should be a way to unblock in bulk the machines that were blocked/limited due to bad-batch issues, methinks.


Case in point: when my hosts are fully utilized, I would see "State: In progress (28)" under my account's Tasks page (I have a custom config file that tells BOINC that GPUGRID tasks are each 1 CPU + 0.5 GPU, which I've found works well for these cards, so 14 cards = 28 tasks). Last night I saw it at 26, then 22; I went to sleep, and this morning it was at 14; now it is 12.

For instance, when one of my single TitanX machines (http://www.gpugrid.net/results.php?hostid=205349) that has NOTHING ELSE going on in BOINC contacts GPUGRID, it gets:

4/17/2018 8:23:54 AM | GPUGRID | Sending scheduler request: To fetch work.
4/17/2018 8:23:54 AM | GPUGRID | Requesting new tasks for CPU and NVIDIA GPU
4/17/2018 8:23:56 AM | GPUGRID | Scheduler request completed: got 0 new tasks
4/17/2018 8:23:56 AM | GPUGRID | No tasks sent

Quite ironic when the Server Status says "Tasks ready to send 34,375"...

:(
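
For reference, the custom config file mentioned above is a per-project app_config.xml. A minimal sketch of what it might contain is below; the app name "acemdlong" and the Windows project path are assumptions, so check the app names listed in client_state.xml and your own BOINC data directory before using anything like this, then re-read config files from BOINC Manager (or restart the client).

from pathlib import Path

# Hypothetical location of the GPUGRID project folder - adjust to your installation.
PROJECT_DIR = Path(r"C:\ProgramData\BOINC\projects\www.gpugrid.net")

# Run two tasks per GPU: each task is counted as 1.0 CPU + 0.5 GPU.
APP_CONFIG = """\
<app_config>
  <app>
    <name>acemdlong</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
"""

(PROJECT_DIR / "app_config.xml").write_text(APP_CONFIG)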

Richard Haselgrove
Message 49300 - Posted: 17 Apr 2018 | 15:52:56 UTC - in response to Message 49295.

I notice that, following the links for all ten of my failed tasks (seven arising from the 13 April bad batch at around 17:50 UTC and three from the 16 April bad batch at around 11:20 UTC), they now all show exactly eight 'Error while computing' failures, so perhaps there is some mechanism whereby they are automatically pulled after eight failures on different computers?

Yes, on the workunit page, you should see a red banner above the task list saying

Too many errors (may have bug)

Once that appears (at this project, after 8 failures), no more are sent out.

flashawk
Message 49301 - Posted: 17 Apr 2018 | 16:47:50 UTC - in response to Message 49299.

Quite ironic when the Server Status says "Tasks ready to send 34,375"...

:(


Those are Quantum Chemistry WU's for your CPU.


STFC9F22
Message 49302 - Posted: 17 Apr 2018 | 16:53:12 UTC - in response to Message 49299.



For instance, when one of my single TitanX machines (http://www.gpugrid.net/results.php?hostid=205349) that has NOTHING ELSE going on in BOINC contacts GPUGRID, it gets:

4/17/2018 8:23:54 AM | GPUGRID | Sending scheduler request: To fetch work.
4/17/2018 8:23:54 AM | GPUGRID | Requesting new tasks for CPU and NVIDIA GPU
4/17/2018 8:23:56 AM | GPUGRID | Scheduler request completed: got 0 new tasks
4/17/2018 8:23:56 AM | GPUGRID | No tasks sent

Quite ironic when the Server Status says "Tasks ready to send 34,375"...

:(


Hi Tuna,

The 'Tasks ready to send' figure on the Server Status page is the total of all task types ready to send. There is a table beneath this showing a breakdown of 'Tasks by application' (although for some reason the totals always differ by one). You should see that almost all, if not all, of the unsent tasks are currently Quantum Chemistry tasks which do not run on GPUs or Windows; they run on Linux CPUs only. The Unsent Short runs and Unsent Long Runs figures show the work available for GPUs.

I cannot remember the exact wording, but when my own PC was temporarily jailed, requests for new work in the event log reported that the daily quota (3) had been exceeded. As the log you have posted above does not say this, I suspect you might not be locked out, and the reason you are not receiving tasks is simply that currently, most of the time, there are no GPU tasks available to send. As I understand it, tasks are only sent in response to requests from the client, so it is a matter of luck whether tasks are available when your PC makes its requests.

mmonnin
Message 49303 - Posted: 17 Apr 2018 | 17:02:29 UTC

I had 8 fail on the 13th at 18:38:35 UTC and received 4 more on the 15th at 6:17:47 UTC. So if they were jailed it was less than 2 days.

Tuna Ertemalp
Message 49304 - Posted: 17 Apr 2018 | 17:03:25 UTC - in response to Message 49302.

Ooops. Yup. I didn't scroll down far enough, I guess... :)

STFC9F22
Message 49307 - Posted: 17 Apr 2018 | 22:44:28 UTC - in response to Message 49303.

I had 8 fail on the 13th at 18:38:35 UTC and received 4 more on the 15th at 6:17:47 UTC. So if they were jailed it was less than 2 days.


Yes, as in these circumstances the event log refers to a daily quota being exceeded, I guess the lockout only lasts for a day or the remaining part of a day.

Erich56
Message 49320 - Posted: 19 Apr 2018 | 5:40:30 UTC - in response to Message 49284.

Richard wrote on April 16th:

All the failed / failing tasks over the last four days have had the exact string

PABLO_p27_wild

in their name.

These faulty tasks are still in the queue; I got one this morning:

http://gpugrid.net/result.php?resultid=17472190

again, as before: ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

Erich56
Message 49321 - Posted: 19 Apr 2018 | 6:43:45 UTC

Just now, the next faulty PABLO_p27_wild task :-(((
The fourth one within 6 hours, which is really annoying.

What I don't understand: don't the people at GPUGRID monitor what's happening?
This specific problem has been known for 6 days now, and still there are these faulty tasks in the queue.
Not nice at all :-(((

STFC9F22
Message 49323 - Posted: 19 Apr 2018 | 8:38:14 UTC - in response to Message 49320.


These faulty tasks are still in the queue; I got one this morning:

http://gpugrid.net/result.php?resultid=17472190



Hi Erich,

As Richard confirmed earlier in the thread there is a mechanism whereby tasks are automatically withdrawn after eight failures and it seems this is being relied on to remove the faulty batches (for example http://www.gpugrid.net/workunit.php?wuid=13443881).

If you look at the history of the Work Unit that you picked up http://gpugrid.net/workunit.php?wuid=13443681 you can see that it originated in the bad batch released on 13 April. After it failed for you, it was reissued to flashawk (who aborted it) and is currently reissued to an anonymous user but still requires three further failures to trigger automatic removal.

I guess that, because there is currently so little GPU work available, these remaining bad units are still a significant proportion of the available work. I agree that it is disappointing, and seems a little disrespectful to donors, that no action has been taken to remove them proactively.

Erich56
Message 49325 - Posted: 19 Apr 2018 | 10:26:25 UTC

@STFC9F22, thank you for your explanations and insights.

You are right, the way this problem is being handled by GPUGRID is a little disappointing for us donors, particularly since a donor is being punished by not getting any more tasks for a certain timespan because the host is considered "unreliable".
Exactly this happened to me last Friday with one of my hosts, and I think it's not okay at all.
The mechanism of "host punishment" should definitely be suspended in cases where the cause of the problem is a faulty task.

flashawk
Message 49327 - Posted: 19 Apr 2018 | 18:29:53 UTC - in response to Message 49323.
Last modified: 19 Apr 2018 | 18:30:43 UTC

Hi Erich,

As Richard confirmed earlier in the thread there is a mechanism whereby tasks are automatically withdrawn after eight failures and it seems this is being relied on to remove the faulty batches (for example http://www.gpugrid.net/workunit.php?wuid=13443881).


Most of us are, and have been, aware of this for some time. The problem is that when a computer gets too many errors it's put on a blacklist for a time; I know that right after I get an error I can't download any WU's for a while.

We shouldn't be taking these hits. It's as though he loaded up these last WU's, packed his bags and left.

PappaLitto
Message 49328 - Posted: 19 Apr 2018 | 18:36:07 UTC - in response to Message 49327.

It's as though he loaded up these last WU's, packed his bags and left.

You know, he might be on vacation. Scientists have real lives too.

robertmiles
Message 49329 - Posted: 19 Apr 2018 | 18:45:26 UTC - in response to Message 49327.
Last modified: 19 Apr 2018 | 18:47:41 UTC

Hi Erich,

As Richard confirmed earlier in the thread there is a mechanism whereby tasks are automatically withdrawn after eight failures and it seems this is being relied on to remove the faulty batches (for example http://www.gpugrid.net/workunit.php?wuid=13443881).


Most of us are, and have been, aware of this for some time. The problem is that when a computer gets too many errors it's put on a blacklist for a time; I know that right after I get an error I can't download any WU's for a while.

We shouldn't be taking these hits. It's as though he loaded up these last WU's, packed his bags and left.

When the tasks are withdrawn after eight failures, they should also no longer be counted as failures for the eight computers that ran them.

While this is being fixed, compute errors from 2015 and earlier years should also be removed from computers' failure lists, even for workunits that never had a successful task.

flashawk
Message 49331 - Posted: 19 Apr 2018 | 20:31:20 UTC - in response to Message 49329.

I have 8 failed WU's from 2013 through 2014 that won't go away; how do I get those removed? I've asked twice and got no response. Are there any mods here anymore?

Erich56
Message 49345 - Posted: 20 Apr 2018 | 15:55:57 UTC

I just noticed that many more of these faulty PABLO_p27_wild tasks are being distributed, even though they have an error rate of 88% (which means that nearly 9 out of 10 are bad).
Can anyone explain what sense this makes?

flashawk
Message 49346 - Posted: 20 Apr 2018 | 19:59:35 UTC - in response to Message 49345.

I just noticed that many more of these faulty PABLO_p27_wild tasks are being distributed, even though they have an error rate of 88% (which means that nearly 9 out of 10 are bad).
Can anyone explain what sense this makes?


I have 4 of them over 50% done right now and they seem fine; they may have been reworked.

Erich56
Message 49347 - Posted: 21 Apr 2018 | 6:42:38 UTC - in response to Message 49346.

...they may have been reworked.

Maybe so; I got 2 of them within the past 12 hours, and neither of them has failed so far :-)
