Zydor
Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0
A strange anomaly - on the face of it - occurred today. I currently have the buffer in preferences set to zero, as I am trying to minimise the queue size.
In previous efforts I set it to 0.1. That worked to a degree, but there were still occasions when the server sent two work units, not one. The latter happened when I had only one WU completing (with zero cache remaining). If I had one completing and one still in cache, it did nothing until the cache was empty and the current WU was almost completed, at which point it sent two.
I set it to zero, thinking it would send none - just refresh when the current WU is completed. It sends one, which is good, but this time when the current WU has about 8hrs to complete. The latter is no biggie and will do fine; my only query is whether this is as designed (sending early with a zero buffer set), or whether there is another setting I missed that I need to check.
If it's as designed, no problem, works for me; I am just aiming off in case there is something else involved I am not aware of, as on the face of it a zero buffer setting implies no download until completion, when in fact it's downloading early.
Regards
Zy
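The behaviour Zydor describes can be modelled in a few lines. This is a simplified sketch, not the actual BOINC client code; the function shape and its lookahead_hrs parameter are illustrative assumptions. The point is that the buffer preference sets a floor on cached work, while the client also plans slightly ahead so the device never sits idle, which is why a zero buffer can still trigger a download before the current WU finishes.

```python
# Simplified model of BOINC-style work fetch (NOT the actual client code).
def should_fetch(remaining_task_hrs, buffer_days, lookahead_hrs=0.0):
    """Return True if this model client would request more work.

    remaining_task_hrs: estimated hours left for each cached/running task
    buffer_days: the work-buffer preference, in days
    lookahead_hrs: hypothetical planning horizon (illustrative only)
    """
    cached_hrs = sum(remaining_task_hrs)
    floor_hrs = buffer_days * 24 + lookahead_hrs
    return cached_hrs <= floor_hrs

# With a 0.1-day buffer, one WU 8 hours from completion fetches nothing yet;
# with a planning horizon longer than 8 hours, even a zero buffer fetches.
print(should_fetch([8.0], 0.1))                    # False
print(should_fetch([8.0], 0.0, lookahead_hrs=12))  # True
```

Under this toy model, "zero buffer but a download with ~8 hours left" is consistent behaviour rather than a bug: the floor is zero, but the client still tops up once its planning horizon outruns the cached work.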
One thing about 6.5.0, maybe its only real flaw: its scheduler doesn't differentiate between CPU and GPU. Thus it may think your CPUs need more work to keep them from starving, so it downloads more WUs.
MrS
____________
Scanning for our furry friends since Jan 2002
Zydor
It was not that this time: I have a 5-10 day cache with other projects to cover their downtimes, but I don't need that with GPUGRID as it is very rarely down. Although I hear the thought - I understand from the SETI boards that BOINC is having dramas trying to get all the permutations and combinations re CPU/GPU correct. Personally I think they will never do it, and will in the end have to give some kind of manual "switch"; it is too complex, with too many possibilities that are conditioned on subjective opinion.
E.g. long-running models like CPDN suffer from pre-emption, where the BOINC scheduler will always pre-empt with shorter tasks from another project. CPDN models are valid for up to a year or more on "completion date" due to the long crunch times needed (up to 3-6 months per model), and crunchers who do them want continuous crunching, not pre-emption on the shorter theoretical completion times of other projects.
Producing logic for that one is a nightmare they will never resolve without a manually set "leave it alone" switch in BOINC preferences - hard-coding CPDN as an exception is obviously not acceptable in programming terms, and there will be others coming down the pipe with similar issues.
Regards
Zy
The buffer size, sadly, is set in the client for all projects. That means that if you set it to 0.1, you set it to 0.1 for all projects. Generally, if you have cable or DSL and are always on the Internet, there is little reason to have really large buffers. That does presume that you are attached to a variety of projects (3+) so that when one goes down the others can fill in the gaps.
The downside to that is when, in my case, Comcast decides to turn my "always on" access into sometimes-off access ... (and they wonder why I don't trust them with my phone service). In that case, without some work cached, you will run dry. For this instance I usually use 1 day, which to this point has been a decent balance between so much work I cannot sort through it and not enough to cover minor Comcast glitches.
As to the versions of BOINC ... they had a pretty decent system that was pretty much working when the GPU thing came up. Sadly, they have made major changes in how they plan to balance the workloads and to this point have not gotten them to work right. There are, I think, 6 major bugs/issues that I am pursuing with the developers, and I made a major effort this weekend to figure out the problems. The question now is whether changes will be in the offing.
Perhaps the major issue is that the design model Dr. Anderson has chosen assumes that when you sign up to a project you want to devote x percent of your resources to that project. He has not, to this point, considered that I might want to devote 10% of my CPU but 50% of my GPU to the project (see Resource Share Obsolete?). Even more complex is the question of "mixed" mode projects (like The Lattice Project) where both the GPU and CPU will be heavily loaded on the same task. I proposed a model for this which has been, as usual, ignored ... but who knows, five or ten years from now ...
As a side note, we have asked for more manual options and have been told that they are not in the cards as that would lead to <shudder> <shudder> "micromanagement" ...
Of course, never explained is why giving the participant controls is evil ...
Here are my current notes of the major system issues and the relevant threads on the developer and alpha mailing lists if you care to pursue them.
Paul's Known Issues
01) 6.6.20 may cause tasks to run up to 4 times longer (Fixed in 6.6.23?)
-- known to affect versions as early as 6.6.15, have not seen it in 6.6.23 (yet)
02) GPU / CPU debt imbalance builds until one side or the other does not fill the queue
-- Debt may accumulate against the wrong resource classes for projects
--- I have demonstrated it against GPU Grid, Rosetta participant has shown it against Rosetta (50/50 RaH and GPU Grid)
-- Requests may be registered against projects for the wrong resource class (Richard's reports)
Thread: 6.6.25 Work Fetch and Improper Debt Accumulation (Wrong Resource Class)
03) Resource scheduling unstable
-- done too often (no floor on the number of times executed per minute)
-- All tasks appear to be considered eligible for preemption
-- TSI is not respected
-- work mixes not kept interesting
-- false positive deadline issues abound
Thread: 6.6.25 and Unstable Resource Scheduling, TSI Not respected
04) Work fetch DoS attacks schedulers
-- asks for work in resource class that is not provided by project
-- double or triple requests are the norm not the exception.
-- hard to see with lots of logging turned on, with only Sched Op Debug it is pretty clear
-- System is not "bundling" requests so you may see a CPU request to a project followed by two GPU requests.
-- Not clear why, again a full log issue
Thread: Work Fetch, CUDA/GPU, back-off and DOS Attack of the 6.6.x clients
Thread: Work Scheduling (pt 2, Cooperative Scheduling?) (my post April 30, 2009 10:14:51 AM PDT)
05) DCF is not properly calculated
-- It is also not tracked by application, but by project (bad)
-- Capping DCF to 100 may be too much of a limitation (IBERCIVIS)
Thread: 6.6.25 and Broken Work Fetch Drives System Nuts
06) New CUDA code does not identify the test that failed.
-- I submitted code for demonstrating this in the logs
Thread: BOINC 6.6.25 released for Windows and Windows x64 (my post April 28, 2009 3:21:35 PM PDT)
07) Resource share as a model has significant flaws
-- May not want to have the same proportion for GPU as CPU
http://boinc.bakerlab.org/rosetta/forum_thread.php?id=4841
Thread: Resource Share Obsolete
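The DCF problem in item 05 can be illustrated with a toy model. The update rule and numbers below are my own, not BOINC's actual code; the point is only that when one correction factor covers a whole project, a fast app drags the factor down and the same project's slow app then gets wildly optimistic estimates.

```python
# Toy illustration of a per-project Duration Correction Factor (item 05).
class ProjectDCF:
    def __init__(self):
        self.value = 1.0  # multiplier applied to server runtime estimates

    def update(self, estimated_hrs, actual_hrs):
        # Nudge the factor toward the observed actual/estimated ratio.
        ratio = actual_hrs / estimated_hrs
        self.value += 0.1 * (ratio - self.value)

    def corrected_estimate(self, estimated_hrs):
        return self.value * estimated_hrs

dcf = ProjectDCF()
for _ in range(50):                       # a fast app beats its estimate 10x
    dcf.update(estimated_hrs=10.0, actual_hrs=1.0)

# A slow app of the SAME project really does take ~10 hours, but is now
# predicted to finish in about one:
print(round(dcf.corrected_estimate(10.0), 1))  # → 1.0
```

Tracking the factor per application instead of per project, as suggested above, would keep one app's speed from corrupting another's estimates.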
Zydor
Many Thanks Paul, very informative, and a timely reminder to all of us, as to how complex the BOINC scheduling is becoming. I know from your previous posts here and elsewhere how hard you work on this, so deserved kudos for the efforts, and more power to your elbow for future efforts.
If they can also stop long-running projects being pre-empted by shorter ones (and that applies not only to "long running" but to anywhere the user needs or wants a permanent "running flag" set on a WU), then that would resolve a very long-running sore with a lot of people.
I hear your frustration over the "manual intervention" issue, and strongly agree; here's hoping they finally get round to solving the issue now (with a manual switch) and work on a more elegant solution - if one is at all feasible - as time goes on. There are times when interim solutions need implementing pending "gold standard" elegance later - this is one of those.
The lack of such manual intervention drives a massive coach and horses through the whole scheduling strategy of the software, and frankly always will until it's addressed. At present I always end up "micro-managing", regardless of BOINC version, because pre-emption for some weird reason is not seen as a big issue.
Hopefully one day ..... :)
Regards
Zy
Well, I found that issue ...
One line change ...
Who knows how long that will take.
And I would trade all the kudos in the world for them to just pay a little attention to what I write. There is a secondary problem with at least the IBERCIVIS project, where they don't have the task configuration set up correctly, so the client drives the DCF through the roof. This means that all the internal calculations are cranked. If you reset IBERCIVIS and the tasks suddenly have 6-second run times, then you are seeing what I did this weekend. I posted that too ... I can't read Spanish, so I have no idea where to try to get the attention of the project admins; worse, I doubt that they would listen to me in any case. I did send that information on to UCB and hope that they will contact IBERCIVIS and get them on board.
If you could identify any other project where you have seen this, and can test as I suggest to show this is the effect, I would appreciate a heads-up; I can look into that project too ... though I think there is another lingering error elsewhere that is accentuating the problem. The issue now is that there are so many interlocking problems that it is hard to separate one from the next. If they can fix the one where I point right at the line where it is broken, that is step one. Then, I sent them detailed dumps that MAY allow them to find the other issues. But I am not sure what will happen ... so far no comment at all on the list ... on ANYTHING I posted. Sigh ... hopefully it is just that they all took the whole weekend off ... though Rom did an emergency build and posted some notes about it ...
Anyway, here is the core of the analysis post from this weekend for the unstable Resource Scheduler:
The comment for this code fragment indicates that the most preemptable task should be eligible if either the preemptable task has checkpointed or the scheduled result (should that not be the TO BE scheduled result?) is in deadline trouble.
In this case, the NQ task is in debt trouble, not deadline trouble. And in any case, the most preemptable task is from the AI project. And it is not all that eligible ... (in this case).
What I think is happening is this. The code fragment at line 845 or so loops through the list of active tasks and marks them as preempted.
The main processing loop for determining which tasks to run next is driven by the sorted list of ordered_scheduled_results (line 880). Because this list is driven by the ever-changing debt values of the tasks that ARE running, tasks like the CPDN task hadam3p_mnby_1980_2_006062432_4 can "move" down the list. Because it is marked as preempted in the next step (without regard to TSI or checkpoint state), this task is bumped by the task that appears earlier in the list of ordered_scheduled_results ... in this case the RCN task.
As I have stated before, the block at line 845 summarily preempts tasks and should NOT be there. This probably also means that the block that re-enables the NCI tasks is not needed either ...
The main disconnect seems to be between the list of what we want to run and what is running (ordered_scheduled_results vs. active_tasks). And the only place we should be marking a task as preempted is when we know darn well it should be preempted. I might note that this may also be why we have trouble with things like lock-files: we are preempting tasks that SHOULD NOT HAVE BEEN PREEMPTED ... and they should not have been preempted because they were not properly checkpointed.
ALTERNATIVE CHANGE:
Instead of deleting the two blocks of code (or commenting them out until after a test), the following is an alternative: for the running tasks, the next state by default should be CPU_SCHED_SCHEDULED, and the preempt loop should be left to do its work unmolested. In the NCI block, the one line of code that does this can be dropped. This is likely a "safer" change, but either change should do the trick.
What I would suggest is that a test build be made; if it is made for the XP Pro 32-bit world, I would be happy to test it to see if there are any IMMEDIATE bad things happening.
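The flaw described above can be caricatured in a few lines. This is a toy reconstruction from the description in this post, not the actual BOINC source; the task names, function shapes, and state strings are invented for illustration.

```python
# Buggy shape: every active task is summarily marked preempted, then the
# debt-sorted list fills the slots - so a long-running, uncheckpointed task
# can be bumped merely because debts reshuffled the ordering.
def schedule_buggy(ordered_scheduled, active, slots):
    next_state = {t: "PREEMPTED" for t in active}   # the "line 845" block
    for t in ordered_scheduled[:slots]:
        next_state[t] = "SCHEDULED"
    return next_state

# The alternative change above: running tasks default to SCHEDULED, and a
# running task is preempted only if it has actually checkpointed.
def schedule_safer(ordered_scheduled, active, slots, checkpointed):
    next_state = {t: "SCHEDULED" for t in active}
    free = slots - len(active)
    for t in ordered_scheduled:
        if t in next_state:
            continue
        if free > 0:
            next_state[t] = "SCHEDULED"
            free -= 1
        else:
            victim = next((a for a in active
                           if next_state[a] == "SCHEDULED" and a in checkpointed),
                          None)
            if victim is not None:
                next_state[victim] = "PREEMPTED"
                next_state[t] = "SCHEDULED"
    return next_state

active = ["CPDN_long", "AI_task"]                   # hypothetical tasks
debt_sorted = ["RCN_task", "AI_task", "CPDN_long"]  # debts moved CPDN down
print(schedule_buggy(debt_sorted, active, slots=2)["CPDN_long"])        # PREEMPTED
print(schedule_safer(debt_sorted, active, 2, {"AI_task"})["CPDN_long"])  # SCHEDULED
```

In the buggy shape the uncheckpointed CPDN task is bumped by a mere reordering; in the safer shape the checkpointed AI task is the one switched out, which is the behaviour the post argues for.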
Zydor
And the only place we should be marking a task as preempted is when we
know darn well it should be preempted.
The principle gets my vote. The path to achieving it appears to be well entrenched in a determination to code a solution ("I_Will_Code_This_It_Will_Not_Beat_Me" etc).
They will never achieve it, because the overall decision on what a user does or does not want to be subject to pre-emption is purely subjective on the user's part. It therefore follows, as clear as the difference between night and day, that it is something they will never succeed in coding into a scheduler as a total solution. Partial, for sure; many aspects of pre-emption can be coded. Ultimately, however, it must have a manual intervention mechanism; there is no way they will programme it all in.
I have yet to see a full-blown AI come anywhere near that ballpark, let alone a scheduler, and one never will, because the code in a scheduler (or an AI, for that matter) cannot, ultimately, read minds. This is where managerial oversight is so important: at times programmers surrounded by so many trees get lost in the woods, and forget that the overriding Gold Standard of any project is "Meet User Need", not "Demonstrate Perfect Code".
One Day :)
Regards
Zy
Um, well, I express it a different way, but it is a similar idea.
Dr. Anderson has this "vision" on how BOINC should be used and pays no attention to how it is used. So, he is coding to his vision and it is only by grace and luck that the system serves us as well as it does.
A couple of examples I have mentioned in the past. He thinks that people should be running multiple projects. It is not clear to me what that magic number is but it is more than one and less than 48. I would assume it is in the range of 5-10.
The trouble is that, according to the numbers from the stat sites, almost half of the participant population has only ever attached to one project. Now, that is partly because everyone who attaches and detaches in a short period of time falls into that group. But the number that attached to 4 or fewer is about 80% (eyeball guess) of the total population.
*MY* guess (from these numbers) is that most participants are single-project types with one or so safety projects. Hard to tell for sure, but I think that is a reasonable guess. Yet with Resource Share you cannot configure BOINC to do this automatically. What I mean is, if you say you want a share of 99 to 1, with the 1 being your safety project, well, you are going to be doing work for the safety project, which is not what you want at all. So many micromanage, attaching to and detaching from projects when their main project goes down. Were they **REALLY** interested in reducing "micromanagement", this is one area that needs addressing.
There are special-case projects, and I saw this with CPDN and Sudoku, where there is a need to shepherd tasks in the system. The Sudoku project was trying to finish up its tasks so that they could write their paper; no way for me to manage that without lots of TLC and fiddling to run off those last few tasks. For CPDN I want, and don't mind, downloading several tasks, but I never want more than one running at a time. Yet I cannot constrain my system to run at most x tasks from CPDN (or any other project for that matter). Another setting damned as "micromanagement" ...
Yet we have NNT, Suspend task and project, cc Config, and other tools that allow us to do just that... Micromanage the system to get to where we want to go. All we wanted was a "run till done" button to complement the suspend.
Last point about "the vision" thing: the other ideal is that no one monitors their system, so the operational mode is to run with no intervention. I agree that that is a good ideal. But I think there is a large segment that does want more control over channeling more or less of the work. You are correct that the best is human intervention, but the AI can come close if we were allowed to give it more "hints" as to what we want it to do. I mean, if I tell it that I only want it to run one CPDN task and a resource goes idle, well, fine, run the other task, but the queue had better darn well be empty before it makes that choice.
I don't think they are trying to demonstrate perfect code, but to perfect the code to realize the vision of how BOINC will be used in fantasy land. Meeting participants' needs, or even project needs, is not even close to a secondary goal. Were it so, the code in 6.6.24 to block non-"Best" GPUs would never have been implemented.
Anyway, history says that participant input is to be ignored because we don't know what we are talking about ...
Zydor
Anyway, history says that participant input is to be ignored because we don't know what we are talking about ...
rofl :) Keep smiling. History also shows the user always gets the last say in any software project, no matter the theory, because they just reach for the ultimate manual switch - the off button .......
The latter tends to focus minds wonderfully, no matter which area of theory anyone comes from, albeit it is always sad to have to go to that extreme to reach The Land of Common Sense.
Regards
Zy
Yet we have NNT, Suspend task and project, cc Config, and other tools that allow us to do just that... Micromanage the system to get to where we want to go. All we wanted was a "run till done" button to complement the suspend.
Last point about "the vision" thing, the other ideal is that no one monitors their system so the operational mode is to run with no intervention. I agree that that is a good ideal. But, I think that there is a large segment that does want more control over the channeling more or less of the work. You are correct the best is human intervention but the AI can come close if we were allowed to give it more "hints" as to what we want it to do. I mean if I tell it that I only want it to run one CPDN task and a resource goes idle, well, fine run the other task, but the queue better darn well be empty before it makes that choice.
Yes, yes, yes. I would pay money for a manager able to manage just this one single feature.
Anyway, history says that participant input is to be ignored because we don't know what we are talking about ...
rofl :) Keep smiling, History also shows the User always gets the last say in any software Project, no matter the theory, because they just reach for the ultimate manual switch - the off button .......
The latter tends to focus minds wonderfully no matter which area of theory anyone comes from, albeit its always sad to have to get to that extreme to reach The Land of Commonsense.
Regards
Zy
SaH has seen a significant decline in user population over some period of time. I have not looked at the graphs recently, but the initial speculation was that the economic downturn was the likely cause. I pointed out that several of the other production-class projects saw no decline at all, using the same source of data.
If economics were the issue, you would see a decline in participation across the board. The data did not support that. So what is the cause? Hostile postings, AP tasks, lack of project communication? Who knows; we don't collect the data (another suggestion ignored, that is, pop up a dialog: participate in a short survey on detach?).
And as for the ultimate switch ... yes, it would make my spouse very happy if I dropped out again. It would cut my electric bill by half or more. It would surely lower my frustration levels. I think the saddest thing is to see so many that used to have open minds that are so closed now ...
uBronan
Joined: 1 Feb 09 Posts: 139 Credit: 575,023 RAC: 0
Well, maybe people who run the honest and plain S@H application, only to have a result called a fake, might also hurt the project.
As happened with me: I saw many good results sent in being reported as false results (running the basic S@H cruncher).
I still think all the anonymous accounts, and the ones running under admin accounts, are not really healthy for any project.
On the open minds I must agree as well; I have been in contact with many great minds, but for a few years now these minds have been focused only on more money.
So I guess we have to live with that and vent our frustrations here on the forums sometimes. Lol, it's a lot better doing that than some idiots going on a killing spree in public places.
Yet, with Resource Share you cannot configure BOINC to do this automatically. What I mean is if you say you want share of 99 to 1 with the one being your safety project, well, you are going to be doing work for the safety project which is not what you want at all. So, many micromanage and attach and detach from projects when their main project goes down. Were they **REALLY** interested in reducing "micromanagement" this is one area that needs addressing.
you know, there are a lot of people out here who want to be able to assign resources to projects themselves, directly.
such as 1 core for project a, some cores for project b, and 1 core shared for c and d.
CoPro 1 for e, CoPro 2 for f ...
AND
set the buffer for a to 1 hour, for b to a week, for c to 2 days, and to 0 for d ...
this has been proposed so many times, and pooof. finally it's nothing more than wasted time to tell them what WE want.
it would have been easy to allow both ways by creating an "expert mode" or whatever to call it.
but DA and his gang simply do not want to give us such a thing.
i wouldn't even wonder if some special priority mode in that fubar scheduler for a special project were finally found - but that's very unlikely with that huge pile of crap code.
on the other hand - if this goes on, DA might manage to convince the active community that it's about time to come up with an alternative.
Well, there are some technical reasons not to try to schedule to specific cores. But I do get your notion: one would want to dedicate stricter percentages of CPU time to various projects, so that to a greater extent one core would be in service to one project, as you suggest. The only good news in this vein is that in 6.6.28 we finally see a Resource Scheduler that does not change its mind with every internal event (I was clocking 5-6 changes per minute over a 24-hour period, meaning that I could see tasks with 20 seconds of run time before being abandoned for the newest fad).
The problem is that there is a "mind set", as I noted, that BOINC will be used a certain way, and if you use it otherwise, that is your look-out - rather than looking at the ways participants actually want to use BOINC and making it work well for those various modes (Single-Project Fanatic, Low-Variety Project Sampler, High-Variety Project Dilettante, Alpha-Project Dare-Devil, and idiots like me that run it if it is issuing work (just about)). And so we have a tool that likely runs well the way Dr. Anderson wants to run BOINC, but does not work well for anyone else in the world.
As to special modes, would you take my word for it that there is nothing in the code that does that? There is nothing in the logic that grants special rights or priorities to any specific project. There are still some bugs that might seem to do that, but it is an accident of the way you have your Resource Shares set up, and the vagaries of life, that gives rise to the appearance of favoritism ...
There is a code-fork BOINC client out there by N. Alvarez; I keep losing the link to it, but I can always ask if you are seriously thinking of a change. One problem is that, as far as I know, it does not yet do CUDA ...
The only good news in this vein is that in 6.6.28 we finally see a Resource Scheduler that does not change its mind with every internal event (which I was clocking at 5-6 times per minute over a 24 hour period, meaning that I could see tasks with 20 seconds of run time before being abandoned for the newest fad).
yes, it has improved - but as far as i can see, it still ignores the "don't switch apps" setting. work fetch is still erratic.
And so, we have a tool that likely runs well to meet the way Dr. Anderson want to run BOINC, but does not work well for anyone else in the world.
i would not use "work" as a term.. ;)
As to special modes, would you take my word for it that there is nothing in the code that does that? There is nothing in the logic that grants special rights or priorities to any specific project.
i can take your word - too rusty with C and too lazy to dig around myself. it's just that feeling, when things happen, that somehow something is interfering ...
There are still some bugs that might seem to do that, but it is an accident of the way you have your resource Shares set up and the vagaries of life that give rise to the appearance of favoritism...
of course it has to be someone's fault - probably i'm just too silly to use boinc 6.x.x
There is a code fork BOINC client out there by N. Alvarez, I keep losing the link to it but can always ask if you are seriously thinking of a change. One problem is that as far as I know it does not yet do CUDA...
i know, and afaik it's linux only.
...There is a code fork BOINC client out there by N. Alvarez, I keep losing the link to it...
Synecdoche ;-)
____________
pixelicious.at - my little photoblog
The only good news in this vein is that in 6.6.28 we finally see a Resource Scheduler that does not change its mind with every internal event (which I was clocking at 5-6 times per minute over a 24 hour period, meaning that I could see tasks with 20 seconds of run time before being abandoned for the newest fad).
yes, it has improved - but as far as i can see, it still ignores that "don't switch apps" setting. work fetch is still erratic.
Um, we are both talking 6.6.28, aren't we?
To be taking advantage of the more stable Resource Scheduler you have to be running 6.6.28 ... of course, work fetch is still hammered and there are problems with LTD calculations as well ...
And there may be a lingering issue with the Resource Scheduler that in fact *MAY* still be causing some artificially long-running tasks. I need to figure out a way to explain it, and maybe look at the code again. But I am having trouble focusing so far today ...
Um, we are both talking 6.6.28, aren't we?
yes we are!
To be taking advantage of the more stable Resource Scheduler you have to be running 6.6.28 ... of course work fetch is still hammered and there are problems with LTD calculations as well ...
to me it looks like CPU and GPU resource share are still interfering - with both work fetch and the actual tasks running.
on my quad with a gtx260 - with the work buffer set to one day, it should not hit the limit of four GPUgrid tasks. this morning it had one running and one in the queue. some hours later it had requested the limit of 4 and was still trying to get more.
and i can't even see how many it will request now, because some bogus change killed that "requesting x seconds" message. WHAT an improvement ...
Um, we are both talking 6.6.28, aren't we?
yes we are!
To be taking advantage of the more stable Resource Scheduler you have to be running 6.6.28 ... of course work fetch is still hammered and there are problems with LTD calculations as well ...
to me it looks like CPU and GPU resource share is still interfering - for both, work fetch and actual tasks running.
on my quad with an gtx260 - with work buffer set to one day, it should not hit the limit of four tasks of GPUgrid. this morning it had one running and one in the queue. some hours later it had requested the limit of 4 and was still trying to get more.
and i can't even see how many it will request now because some bogus killed that "requesting x seconds" message. WHAT an improvement..
Everything you discuss here is work fetch only.
Resource scheduling is what task to run and how long to run it. If you have TSI set to the default of 1 hour, that means BOINC will likely change its mind on what task to run, and that may be the source of some of what you are seeing.
I set my TSI to 720 so that it is 12 hours, and tasks run to completion unless they take longer than 12 hours, at which point they may be switched out.
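The arithmetic behind that can be sketched as follows. The "switch between applications" interval (TSI) is the real preference; the function below is my own illustration of its effect, not BOINC code.

```python
# With time slices of TSI minutes, a task can be switched out roughly once
# per elapsed slice; raise TSI above the task length and it runs to completion.
def switch_opportunities(task_hours, tsi_minutes):
    slice_hours = tsi_minutes / 60.0
    return int(task_hours // slice_hours)

print(switch_opportunities(8.5, 60))   # default 1-hour TSI: 8 chances to switch
print(switch_opportunities(8.5, 720))  # TSI 720 (12 h): 0, runs to completion
```

This is why a TSI of 600 or 720, as discussed above, should leave sub-3-hour CPU tasks alone entirely unless deadlines intervene.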
every thing you discuss here is work fetch only.
Resource Scheduling is what task to run, and how long to run it. If you have TSI set to the default of 1 hour that means that BOINC will likely change its mind on what task to run. and so that may be the source of some of what you are seeing.
I set my TSI to 720 so that it is 12 hours and tasks run to completion unless they take longer than 12 hours to complete at which point they may be switched out.
that's the other story..
i'm running @ 600 - CPU tasks are all under 3 hours at the longest, and deadlines are all several days away. so it should not switch anything.
6.6.28 no longer tries to juggle everything around every other second, but it still does it from time to time.
anyway - shutdown time, going on vacation..
every thing you discuss here is work fetch only.
Resource Scheduling is what task to run, and how long to run it. If you have TSI set to the default of 1 hour that means that BOINC will likely change its mind on what task to run. and so that may be the source of some of what you are seeing.
I set my TSI to 720 so that it is 12 hours and tasks run to completion unless they take longer than 12 hours to complete at which point they may be switched out.
that's the other story..
i'm running @ 600 - CPU-tasks are all under 3 hours for the longest, deadlines are all several days away. so it should not switch anything.
6.6.28 does no longer try to juggle everything around every other second, but it still does it from time to time.
anyway - shutdown time, going on vacation..
Yeah, I saw that, and lost the message log because of another issue - so I am still on the hunt for a clear example of the instability ...