ignasi
We have modified some scheduling parameters to send retries for failed WUs to "reliable hosts".
Let us know if you see any problem with the scheduler.
thanks,
ignasi
We have modified some scheduling parameters to send retries for failed WUs to "reliable hosts".
Let us know if you see any problem with the scheduler.
Well, it still won't let me queue up 87 days' worth of work ... :)
GDF (project administrator)
So far it seems to work.
To let people know: a host with an error rate less than 5% that returns WUs within 24 hours on average is classified as RELIABLE. This is approximately 25% of all hosts.
The advantage for these hosts is that they will receive all available WUs, so they are unlikely to be left without work, while other hosts receive only the subset of WUs for which a failure is not a problem.
Rationale: for us it is very important, when we send out a batch of WUs, to get all of them back. If a single one is missing, we have to wait before we can perform the analysis.
gdf
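For illustration only, the classification described above could be sketched roughly like this (a minimal Python sketch; the function and parameter names are made up here, not taken from the actual scheduler code):

# Hypothetical sketch of the "reliable host" test described above.
def is_reliable(error_rate, avg_turnaround_hours):
    # error rate below 5% and results returned within 24 hours on average
    return error_rate < 0.05 and avg_turnaround_hours <= 24.0

print(is_reliable(0.02, 18.0))  # True: eligible for retries of failed WUs
print(is_reliable(0.02, 40.0))  # False: receives only the regular subset of WUs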
Can this be flagged in the computer data page? It should be one of the public data items ... not only will it let us know how our machines are doing, it can help with debugging them ...
This should also be rolled back into the UCB baseline if you get it working ... all the world wants to know ... :)
uBronan
Does this have to be within 24 hours? I think slower machines can be more reliable than faster ones.
On my machine I have never had a computation error, apart from mistakes caused by detaching the BOINC client when I did not mean to.
So, is this error message associated with this? It's for a brand-new machine with no history:
3/31/2009 12:23:29 PM|GPUGRID|Message from server: No work sent
3/31/2009 12:23:29 PM|GPUGRID|Message from server: (reached daily quota of 8 results)
3/31/2009 12:23:29 PM|GPUGRID|Message from server: (Project has no jobs available)
So, is this error message associated with this? It's for a brand-new machine with no history:
3/31/2009 12:23:29 PM|GPUGRID|Message from server: No work sent
3/31/2009 12:23:29 PM|GPUGRID|Message from server: (reached daily quota of 8 results)
3/31/2009 12:23:29 PM|GPUGRID|Message from server: (Project has no jobs available)
Do you have a link to the host? Is it this one? http://www.gpugrid.net/results.php?hostid=31246? If it is that host, well, it only has computation errors...
"error rate less than 5%"
Over what period of time, or over what number of results? I had problems (self-induced) when I first started, and it would be a shame for my PC (i7 + GTX295), which crunches 24/7 with a queue I normally keep at 0.75 days, not to be considered "reliable".
Steve
GDF (project administrator)
This is the standard BOINC averaging of results.
I am not sure over how long they compute the averages.
gdf
I will try to look around before I post next time :-)
AdaptiveReplication
BOINC maintains an estimate E(H) of host H's recent error rate. This is maintained as follows:
It is initialized to 0.1.
It is multiplied by 0.95 when H reports a correct (replicated) result.
It is incremented by 0.1 when H reports an incorrect (replicated) result.
Thus, it takes a long time to earn a good reputation and a short time to lose it.
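Written out in Python, that update rule looks roughly like this (a minimal sketch; the function name is made up and this is not the actual BOINC server code):

# Adaptive-replication error-rate estimate E(H), as described above.
def update_error_rate(estimate, result_ok):
    if result_ok:
        return estimate * 0.95  # correct result: slow decay toward zero
    return estimate + 0.1       # incorrect result: sharp penalty

# A new host starts at 0.1 and needs a run of good results to drop
# below the 0.05 "reliable" threshold mentioned earlier in the thread.
estimate = 0.1
for _ in range(20):
    estimate = update_error_rate(estimate, True)
print(round(estimate, 4))  # about 0.0358 after 20 consecutive successes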
If anyone is interested, I just built a quick spreadsheet: I copied all of my completed tasks into it and added the *reliability* formula ...
1. Paste the tasks table header starting at A1.
2. Clean the headers up so you can start pasting the actual results in A2.
3. In K1, enter 0.1.
4. Enter the following formula in K2:
=IF(F2="Success",K1*0.95,IF(F2="Client error",K1+0.1,K1))
5. Copy K2 down through all of the rows that have data.
I just barely make it, at 4.95% :-)
Steve
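For anyone who prefers a script to a spreadsheet, here is a rough Python equivalent of the same calculation (the outcome strings and example data below are placeholders, not a real task history). Note that the running estimate only matches a sequential update if the tasks are walked from oldest to newest, so sort the history accordingly before applying it:

# Rough script equivalent of the spreadsheet above.
outcomes = ["Success", "Success", "Client error", "Success"]  # example data only

estimate = 0.1  # BOINC's starting value for a new host
for outcome in outcomes:  # oldest result first
    if outcome == "Success":
        estimate *= 0.95
    elif outcome == "Client error":
        estimate += 0.1
    # any other state leaves the estimate unchanged, as in the formula above

print(f"estimated error rate: {estimate:.2%}")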
GDF (project administrator)
We will increase the threshold to just below 0.1, probably 0.09.
Zydor
A minor issue in the great scheme of things: what will happen if there is a repeat of the bad batch of WUs issued on 21 Mar? On that day a bad batch was let loose; it was swiftly detected and recalled, and held WUs from it were marked "cancelled by server".
However, those that got totalled prior to the recall were marked "Client Error, Compute Error". It happened in seconds, and those with a queue of circa 3/4/5 or more would have had all of them so marked in seconds flat as the GPU worked through the held ones. In my case it was only two before the recall. Under the new rules that would mean I would have to submit 28 or so "good" ones to be "reliable" once more. In my case, no real biggie, as that would only take circa 9 days or so, not exactly "the end of life on Planet Earth as we know it".
However, for those with a bigger queue it could be much longer, especially if they did a refill of another 5 or 6 before the recall and those got totalled as well; life on Planet Earth could get a little shaky :)
If it is not already factored into the coding for this, it should be reviewed, possibly with some kind of automatic labelling of a bad batch on recall, such that any search/calculation of the "reliability" factor ignores WUs from the batch that went bad prior to the recall.
It is a rare event for sure, but it has potential for grief if it occurred with mega crunchers and larger queues.
Regards
Zy
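For reference, Zydor's figure of about 28 is consistent with the update rule quoted earlier in the thread, assuming the host's estimate was close to zero before the bad batch: two errors add roughly 0.2 to E(H), and 0.2 * 0.95^n only falls back below the 0.05 threshold when n > ln(0.25)/ln(0.95), which is about 27, i.e. after roughly 28 consecutive good results.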
Zydor,
The "good" news is that the queues are small regardless in that I have 4 in flight and 4 in queue and that is the most I can have. The bad news is, and you are right, is that if there is a poisoned batch I could get a slew of them in short order and with 4 GPUs munching on them go through my daily quota in a very short time. (my daily norm is 15-16 a day on that system)
In the past if the staff were watching and the participant noticed they did do some occasional resets of peoples error rate so that they could get work again that day ... but that is hit or miss... |
Hi!
I am wondering if GPUGRID scheduling is somehow connected with other projects, e.g. Rosetta@home;
see this post and my problems:
http://boinc.bakerlab.org/rosetta/forum_thread.php?id=4841
Hi!
I am wondering if GPUGRID scheduling is somehow connected with other projects, e.g. Rosetta@home;
see this post and my problems:
http://boinc.bakerlab.org/rosetta/forum_thread.php?id=4841
No it isn't ... :)
Hi ...
I have been reporting your troubles and we are working on it. I am going to post a note over there ...