Remaining (Estimated) time is unusually high; duration correction factor unusually large

Author	Message
Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32263 - Posted: 25 Aug 2013 \| 13:25:59 UTC Last modified: 25 Aug 2013 \| 13:27:50 UTC
	Problem: Remaining (Estimated) time is unusually high; duration correction factor unusually large
	Recently I've noticed that my GPUGrid tasks say they will take a very long time (140 hours), when they normally complete in about 1-6 hours. I've tracked it down to the duration correction factor (DCF). This is usually between 1 and 2 on my system, but something is making it unusually large; right now it is above 9.
	Is the problem in:
	- the new beta acemd application?
	- the plan classes for that application?
	- the MJHARVEY TEST 10/12/13 tasks I've recently processed?
	- something else?
	Is there any way I can further diagnose this issue? We need to identify the problem, and correct it if possible.
	Thanks, Jacob

ExtraTerrestrial Apes Volunteer moderator Volunteer tester Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 Level Scientific publications	Message 32405 - Posted: 28 Aug 2013 \| 19:09:06 UTC - in response to Message 32263.
	It could easily be caused by a wrong estimate in some WU. But since no one else has replied, it doesn't seem to be a persistent issue. MrS ____________ Scanning for our furry friends since Jan 2002

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32409 - Posted: 28 Aug 2013 \| 19:25:16 UTC
	There was at least 1 other user that reported it also, in the beta news thread. All I can do is report the problem, with as much info as I can provide, like a good tester should. I wish it was properly investigated and fixed by the administrators.

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 32421 - Posted: 28 Aug 2013 \| 21:18:10 UTC - in response to Message 32409.
	I think the run time estimates are app based, so when a new batch of different WUs runs, it takes time for their estimated run time to auto-correct (on everyone's systems). I'm not aware of what's been done server side, but there have been lots of WU changes and app changes recently. Some very short WUs turned up in the Long queue and in the Beta app test, and different WUs appeared in the short queue. On my W7 system the remaining runtime of a Short NATHAN_baxbimx WU is still being significantly over-estimated - 40% done after 80min but remaining is >8h (though it's falling fast). A NATHAN_baxbimy WU on my 650TiBoost (Linux) system is closer to the mark - 25% done after 50min, estimated remaining is 3h 50min (but falling rapidly). I think the estimates were worse yesterday, and probably for the long runs too. So it's auto-correcting, but just slowly. IIRC an alternative to the existing correction factor equation was proposed some time ago by a very able party. ____________ FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help

nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 32432 - Posted: 29 Aug 2013 \| 0:25:28 UTC - in response to Message 32409.
	> There was at least 1 other user that reported it also, in the beta news thread. All I can do is report the problem, with as much info as I can provide, like a good tester should. I wish it was properly investigated and fixed by the administrators.
	You could go into the client_state.xml file and edit the DCF for GPUGrid tasks to whatever suits you. Problem solved.

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32433 - Posted: 29 Aug 2013 \| 0:42:18 UTC - in response to Message 32432.
	Until you get more tasks that throw it out of balance. Problem remains to be solved at the server.

TJ Send message Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level Scientific publications	Message 32436 - Posted: 29 Aug 2013 \| 1:05:51 UTC
	I see this too: the remaining time is not correct. I saw it first after we got a lot of MJH beta tests to optimize the app, and then "normal" WUs again. I thought it was not a big deal, as the WUs crunch at good speed. However, looking a bit more at the system, it could be that when BOINC says it needs 4 more hours to finish, it will not request new work. In reality these 4 hours mean 20 minutes, so it could ask for new work. This could explain why I did not get new work yesterday afternoon, but when I hit the "update" button I got a WU. If this is true, then it would be nice if it could be solved on the server side. ____________ Greetings from TJ

ExtraTerrestrial Apes Volunteer moderator Volunteer tester Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 Level Scientific publications	Message 32491 - Posted: 29 Aug 2013 \| 18:49:25 UTC - in response to Message 32433.
	> Until you get more tasks that throw it out of balance. Problem remains to be solved at the server.
	Agreed - it's not an option to let everyone manually edit config files. Since yesterday you convinced me it's not an isolated issue. MrS ____________ Scanning for our furry friends since Jan 2002

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32586 - Posted: 1 Sep 2013 \| 12:11:50 UTC - in response to Message 32491. Last modified: 1 Sep 2013 \| 12:16:19 UTC
	... And now my Duration Correction Factor is back up to 12! I think the MJHARVEY beta tasks are throwing it off (by possibly having bad estimates). Is there a way to conclusively prove this? Admins: Is there a way to fix this server-side so it stops happening? It may even be related to the recent EXIT_TIME_LIMIT_EXCEEDED errors people are getting!

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 32591 - Posted: 1 Sep 2013 \| 12:53:22 UTC - in response to Message 32586.
	Setting aside the WUs which are failing with 'max time exceeded' errors, which are a symptom of another as yet unfixed problem: the application reports its progress to the BOINC client regularly, giving the client sufficient information to linearly extrapolate the completion time accurately. If it's not getting it right, then that's a client bug. I've not heard of "duration correction factor" before, but it looks like the result of over-thinking the problem. MJH

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32592 - Posted: 1 Sep 2013 \| 12:58:16 UTC - in response to Message 32591. Last modified: 1 Sep 2013 \| 13:12:54 UTC
	I'm fairly certain DCF is very relevant.
	The BOINC client "remembers" how long a task was estimated to take when it received that task. Then, when the task is done, it compares the actual time against the estimated time. It stores the result as a client-side "factor" for the project (viewable in the project properties, called Duration Correction Factor, DCF), so that in the future, when it gets a task from that project, it can "dynamically adjust" the Remaining time to better reflect the estimate.
	The problem that we're seeing is that DCF is way larger than expected, all of a sudden. And I believe the root cause is that recent tasks might have been sent with very incorrect estimated times (or maybe very incorrect estimated sizes in FLOPS or FPOPS).
	Can you please check into the estimates of recent tasks? Are the beta ones or the "_TEST" ones possibly very incorrect? :)
	Right now my DCF (which is usually between 1.0 and 1.6) is at 29.88, and a long task (which usually completes in 8-16 hours) is currently estimated at 281 hours (~12 days!)
	Note: Projects can also choose to "not use" DCF, especially when they have various apps where estimates could be way off, per app. World Community Grid, for instance, has disabled DCF usage. I personally believe GPUGrid should keep using it.
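As a rough check on the figures in this post, here is a minimal sketch that divides the DCF back out of the displayed estimate. It assumes the displayed estimate is simply the uncorrected estimate multiplied by DCF; the function name is hypothetical and is not part of BOINC.

```python
def uncorrected_hours(displayed_hours: float, dcf: float) -> float:
    """Divide the DCF back out of a displayed estimate.

    Assumed model (not taken from BOINC source): displayed = uncorrected * DCF.
    """
    return displayed_hours / dcf


# Figures quoted in the post above: 281 hours displayed at a DCF of 29.88.
hours = uncorrected_hours(281.0, 29.88)
print(f"Uncorrected estimate: {hours:.1f} hours")
```

Under that assumption the uncorrected figure falls inside the 8-16 hour range quoted for long tasks, which would be consistent with the DCF, rather than the per-task estimate, being the part that is off.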

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 32593 - Posted: 1 Sep 2013 \| 13:13:14 UTC - in response to Message 32592.
	Well if this DCF thing is keeping a running average of WU runtimes then I expect it is simply losing its marbles when it encounters one of the WUs that make no progress. Am I correct in thinking that you like DCF because it gives you an estimate of WU runtime before the WU has begun executing? Certainly, once the WU is in progress the DCF is no longer relevant because the client has real data to work with. If it is still reporting the DCF-estimate completion time rather than a live extrapolation, that should be reported as a client bug. MJH

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32594 - Posted: 1 Sep 2013 \| 13:24:09 UTC - in response to Message 32593. Last modified: 1 Sep 2013 \| 14:17:34 UTC
	I'm not certain why it is "losing its marbles" lol, but I don't think it's related to a "no progress" task. Instead, I think it gets "adjusted" when a task is completed, and it compares the "estimated time before starting the task" versus "actual time after completing a task". The estimated task size is very important in that calculation, and it feels like some tasks went through with incorrect estimated sizes.
	I like DCF for several reasons. It accounts for "how BOINC normally runs, possibly fully loaded with CPU tasks", it accounts for computer-specific hardware scenarios that the project doesn't know about, maybe the user plays games while GPU crunching some of the time... all sorts of reasons to use it. I just wish it was app-based and not project-based. But, for GPUGrid, I think it's useful to still use, since you have been able to make accurate estimates for all your apps in the past, and you seem to have always used the same method of making those estimates regardless of app.
	Regarding how it works... Once the WU is in progress, I believe DCF is still used until the task finishes. (Not all tasks report linear progress like GPUGrid.) So, for instance, here where the DCF is very skewed, the "remaining" time may start at 281 hours, but for each clock-second that passes, the remaining time decreases by 30 seconds. Once tasks complete, the DCF gets adjusted for future tasks.
	Again, I believe the real issue is that something is wrong in the estimates of recent tasks.
	Have you investigated whether <rsc_fpops_est> ... was and is being set correctly for all recent tasks? If it's incorrect, it'll be the cause of DCF being improperly adjusted, resulting in the bad estimated times people are seeing.
	Also, have you investigated whether <rsc_fpops_bound> ... was and is being set correctly for all recent tasks? If it's incorrect, it'll be the cause of the "exceeded maximum time" EXIT_TIME_LIMIT_EXCEEDED errors people are seeing.

ExtraTerrestrial Apes Volunteer moderator Volunteer tester Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 Level Scientific publications	Message 32599 - Posted: 1 Sep 2013 \| 18:04:59 UTC
	I agree with what Jacob says here. MrS ____________ Scanning for our furry friends since Jan 2002

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32603 - Posted: 1 Sep 2013 \| 19:33:44 UTC
	My GTX 670 host 132158 has been running through a variety of Beta tasks. I recently finished one of the full-length NOELIA_KLEBEbeta tasks being put through the Beta queue; I'm now working through some more MJHARVEY_TEST.
	Runtimes vary, as you would expect: in very round figures, the NOELIA ran for 10 hours, and the test pieces for 100 minutes, 10 minutes, or 1 minute - a ratio of 600:1 between the longest and shortest.
	I've checked a few <rsc_fpops_est> and <rsc_fpops_bound> pairs, and they were identical for every workunit. The values have been:
	<rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
	<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
	Giving identical estimates to such a wide range of tasks is going, I'm afraid, to end in tears.
	BOINC Manager has translated those fpops_est into hours and minutes for me. For the cuda55 plan class, the estimate is roughly right for the NOELIA full task I've just completed - 8 hours 41 minutes. But to get that figure, BOINC has had to apply a DCF of 18.4 - it should be somewhere round 1.0.
	On the other hand, one of the test units was allocated under cuda42, and was showing an estimate of 24 minutes. Same size task, different estimates, can only happen one way, and here it is:
	<app_name>acemdbeta</app_name> <version_num>804</version_num> <platform>windows_intelx86</platform> <avg_ncpus>0.350000</avg_ncpus> <max_ncpus>0.571351</max_ncpus> <flops>52770755214365.750000</flops> <plan_class>cuda42</plan_class>
	<app_name>acemdbeta</app_name> <version_num>804</version_num> <platform>windows_intelx86</platform> <avg_ncpus>0.350000</avg_ncpus> <max_ncpus>0.666596</max_ncpus> <flops>2945205633626.370100</flops> <plan_class>cuda55</plan_class>
	Pulling those out into a more legible format, my speeds are supposed to be:
	cuda42: 52,770,755,214,365
	cuda55: 2,945,205,633,626
	That's 3 TeraFlops for cuda55, and over 50 TeraFlops for cuda42 - no, I don't think so.
	The marketing people at NVidia, as interpreted by BOINC, do give these cards a rating of 2915 GFLOPS peak, but I think we all know that not every flop is usable in the real world - pesky things like PCIe bus transfers, and memory read/writes, get in the way.
	The APR (Average Processing Rate) figures for that host are shown on the application details page. For mainstream processing under cuda42, I'm showing 147 (units: GigaFlops) for both short and long runs. About one-twentieth of theoretical peak feels about right, and matches the new DCF.
	For the new Beta runs I'm showing crazy speeds here as well - up to (and above) 80 TeraFlops.
	Guys, I'm sorry, but that's what happens when you send out 1-minute tasks without changing <rsc_fpops_est> from the value you used for 10 hour tasks.
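To illustrate why the same task size produces different estimates under the two plan classes, here is a minimal sketch using the figures quoted in this post. It assumes the client computes the displayed estimate as size / speed * DCF; the function and variable names are hypothetical.

```python
FPOPS_EST = 5_000_000_000_000_000.0  # <rsc_fpops_est> quoted above

# <flops> values quoted above for the two acemdbeta app versions.
FLOPS = {
    "cuda42": 52_770_755_214_365.75,
    "cuda55": 2_945_205_633_626.3701,
}

DCF = 18.4  # duration correction factor quoted above (rounded)


def estimate_seconds(fpops_est: float, flops: float, dcf: float) -> float:
    """Assumed client model: estimated runtime = size / speed * DCF."""
    return fpops_est / flops * dcf


estimates = {pc: estimate_seconds(FPOPS_EST, f, DCF) for pc, f in FLOPS.items()}
for plan_class, seconds in estimates.items():
    print(f"{plan_class}: {seconds / 3600:.2f} hours ({seconds / 60:.0f} minutes)")
```

Under that assumption the cuda55 figure lands close to the 8 hours 41 minutes quoted above. The cuda42 figure comes out at roughly half an hour, in the same region as the quoted 24 minutes; the DCF in force when that figure was displayed is not stated, so an exact match is not expected.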

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32605 - Posted: 1 Sep 2013 \| 20:08:15 UTC - in response to Message 32603. Last modified: 1 Sep 2013 \| 20:10:50 UTC
	Yes, a million times over! Admins: Regardless of whether we're still sending these tasks out, could we PLEASE make sure the code is correctly setting the values for them? They're causing estimation chaos!

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32607 - Posted: 1 Sep 2013 \| 20:53:50 UTC
	Disclaimer first: I speak as a long-term (interested) volunteer on a number of BOINC projects, but I have no first-hand experience of administering a BOINC server. What follows is my personal opinion only, but an opinion informed by seeing similar crises at a number of projects.
	My opinion is that the current issues largely arise from the code behind http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation. It appears that this code is exquisitely sensitive to outlier values in wu.rsc_fpops_est, and the hypersensitive code has a nasty habit of creeping out of the woodwork and biting sysadmins' ankles at times of stress. I think there are two reasons for that:
	1) It's exactly at times like this - working weekend overtime to chase out a stubborn application bug, with a project-full of volunteers clamouring for new (error-free) work - that harassed sysadmins are likely to take obvious shortcuts: put the new test app on the server with the same settings as last time, make the tasks shorter so the results come back sooner. I think we can all recognise, and empathise with, that syndrome.
	2) BOINC's operations manual, and systems training courses (if they run any), don't emphasise enough: no shortcut through here - here be dragons, and they will demand a sacrifice if you get it wrong. Unless you run a completely sandboxed test server, these beta apps - it's always beta apps - will screw your averages and estimates.
	This runtime estimation code has been active in the master code base for BOINC servers for over three years now, and as I said, I've seen problems like this several times - the most spectacular was at AQUA, where it spilled over into credit awards too. But to the best of my knowledge, no post-mortem has ever been carried out to find out why things go so badly wrong.
	In - again - my personal opinion only, this area of code needs to be re-visited, and re-written according to sound engineering - fault tolerant and error tolerant - principles (the way the original BOINC server code was written). However, I see no sign that the BOINC core programmers have any plans to do this. Some project administrators have looked into the code, and expressed frustration at having to work with it in its current form - Eric Korpela of SETI and Bernd Machenschalk of Einstein come to mind - but I've not heard of any improvements being migrated back into the master codebase.
	There is a workshop "for researchers, scientists and developers with significant experience or interest in BOINC" at Inria near Grenoble, at the end of this month. If anybody reading this has any sympathy with my point of view, that workshop might be a suitable opportunity to raise any concerns you might have.

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32608 - Posted: 1 Sep 2013 \| 21:05:57 UTC - in response to Message 32607.
	:) I hope we don't have to travel to France for a solution. Admins: Can you please solve this issue? We've given tons of information that should point you in the right direction I hope!

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 32609 - Posted: 1 Sep 2013 \| 23:22:50 UTC - in response to Message 32608.
	Many projects have separate apps (and queues) for different task types, as it makes things more manageable. However, at GPUGrid many different WU types are run in the same queue. For example, I've just run a NOELIA_KLEBbeta WU that took 13.43h and has an estimate of 5,000,000 GFLOPs, and a MJHARVEY_TEST11 that took 2.08h but also had an estimated WU requirement of 5,000,000 GFLOPs.
	I thought the estimated GFLOPs is set on an application basis. Is that correct? If that's the case then you cannot run multiple WU types in the same app queue without messing up the estimated runtimes.
	PS. In Project Preferences there are 4 apps (queues):
	- ACEMD short runs (2-3 hours on fastest card) for CUDA 4.2
	- ACEMD short runs (2-3 hours on fastest card) for CUDA 3.1 - deprecated?
	- ACEMD beta
	- ACEMD long runs (8-12 hours on fastest GPU) for CUDA 4.2
	In server status there are 3 apps:
	- Short runs (2-3 hours on fastest card)
	- ACEMD beta version
	- Long runs (8-12 hours on fastest card)
	____________ FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32610 - Posted: 1 Sep 2013 \| 23:47:00 UTC
	Each task is sent with an estimate. You can even view that estimate in the task properties. It can be different amongst tasks within the same app. We need the admins to perform an investigation, and to fix the problem.

TJ Send message Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level Scientific publications	Message 32611 - Posted: 2 Sep 2013 \| 0:22:34 UTC - in response to Message 32609. Last modified: 2 Sep 2013 \| 0:22:52 UTC
	The opposite happens too. I now have a NOELIA_KLEBEbeta-2-3... that has done 9% in 1 hour, but supposedly will be finished in 5 minutes....? And a HARVEY_TEST that will take about 15 minutes is expected to run 6h53m52s. If I have watched correctly, it all started with the HARVEY_TEST that took around 1 minute but had an estimate of 12 hours. ____________ Greetings from TJ

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32618 - Posted: 2 Sep 2013 \| 8:04:06 UTC - in response to Message 32610.
	> Each task is sent with an estimate. You can even view that estimate in the task properties.
	There are a number of components which go towards calculating that estimate. After playing around with those Beta tasks yesterday, I've now been given a re-sent NATHAN_KIDKIXc22 from the long queue (WU 4743036), so we can see what will happen when all this is over.
	From <workunit> in client_state:
	<name>I6R6-NATHAN_KIDKIXc22_6-12-50-RND1527</name>
	<app_name>acemdlong</app_name>
	<version_num>803</version_num>
	<rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
	<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
	From <app_version> in client_state:
	<app_name>acemdlong</app_name>
	<version_num>803</version_num>
	<platform>windows_intelx86</platform>
	<avg_ncpus>0.666596</avg_ncpus>
	<max_ncpus>0.666596</max_ncpus>
	<flops>142541780304.165830</flops>
	<plan_class>cuda55</plan_class>
	From <project> in client_state:
	<duration_correction_factor>19.676844</duration_correction_factor>
	It's the local BOINC client on your machine that puts all those figures into the calculator.
	Size: 5,000,000,000,000,000 (5 PetaFpops, 5 quadrillion calculations)
	Speed: 142,541,780,304 (142.5 GigaFlops)
	DCF: 19.67
	Put those together, and my calculator gets 690,213 seconds - 192 hours or 8 days.
	28% of the way through the task (in 2.5 hours), BOINC is still estimating 174 hours - over a week - to go: BOINC is very slow to switch from 'estimate' to 'experience' as a task is running.
	We're going to get a lot of panic (and possibly even aborted tasks) from inexperienced users before all this unwinds.
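The calculation described in this post can be reproduced with a few lines. This is a minimal sketch that takes the post's description at face value (size divided by speed, multiplied by DCF); it is not taken from the BOINC client source, and the variable names simply mirror the client_state tags quoted above.

```python
rsc_fpops_est = 5_000_000_000_000_000.0  # <rsc_fpops_est> from <workunit>
flops = 142_541_780_304.165830           # <flops> from <app_version>
dcf = 19.676844                          # <duration_correction_factor> from <project>

# Size / speed gives the uncorrected runtime; DCF then scales it.
estimated_seconds = rsc_fpops_est / flops * dcf

print(f"{estimated_seconds:,.0f} seconds")
print(f"{estimated_seconds / 3600:.0f} hours")
print(f"{estimated_seconds / 86400:.1f} days")
```

With the three values quoted above, this should agree with the 690,213 seconds (roughly 192 hours, or 8 days) given in the post.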

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,206,655,749 RAC: 261,147 Level Scientific publications	Message 32626 - Posted: 2 Sep 2013 \| 10:06:17 UTC
	The BOINC manager suspends a WU which has normally estimated run time when it receives a fresh WU which has overestimated run time (my personal high score is 2878(!) hours), which makes my batch programs think that they are stuck (actually they are, it's intentional to give priority to the task with the overestimated run time). This is really annoying.

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32627 - Posted: 2 Sep 2013 \| 10:21:36 UTC - in response to Message 32626.
	> The BOINC manager suspends a WU which has normally estimated run time when it receives a fresh WU which has overestimated run time (my personal high score is 2878(!) hours), which makes my batch programs think that they are stuck (actually they are, it's intentional to give priority to the task with the overestimated run time). This is really annoying.
	It depends which version of the BOINC client you run. I'm testing new BOINC versions as they come out, too - that rig is currently one step behind, on v7.2.10. The behaviour of 'stopping the current task, and starting a later one' when in High Priority was acknowledged to have been a bug, and has been corrected now. BOINC is hoping to promote v7.2.xx to 'recommended' status soon - that should cure your annoyance.

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32630 - Posted: 2 Sep 2013 \| 12:05:34 UTC - in response to Message 32626.
	> This is really annoying.
	I agree, it is annoying. I reported it as soon as I spotted it, over a week ago. Hopefully the admins take it a bit more seriously.

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32634 - Posted: 2 Sep 2013 \| 13:37:08 UTC - in response to Message 32630.
	>> This is really annoying.
	> I agree, it is annoying. I reported it as soon as I spotted it, over a week ago. Hopefully the admins take it a bit more seriously.
	The problem Retvari Zoltan is annoyed about - BOINC suspending one task and running a different one when high priority is needed - isn't something the project admins can solve (except by getting the estimates right so EDF isn't needed, obviously). Decisions about which task from the cache to run next are taken locally by the BOINC core client. v6.10.60 is getting quite old now - and yes, this bug has been around that long. It was fixed in v7.0.14:
	> EDF policy says we should run the ones with earliest deadlines. Note: this is how it used to be (as designed by John McLeod). I attempted to improve it, and got it wrong. [DA]

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,206,655,749 RAC: 261,147 Level Scientific publications	Message 32636 - Posted: 2 Sep 2013 \| 15:55:53 UTC - in response to Message 32634.
	> The problem Retvari Zoltan is annoyed about - BOINC suspending one task and running a different one when high priority is needed - isn't something the project admins can solve (except by getting the estimates right so EDF isn't needed, obviously).
	That's why I've posted about my annoyance here. This overestimation misleads the BOINC manager in another way: it won't ask for new work, since it thinks that there is enough work in its queue.
	> Decisions about which task from the cache to run next are taken locally by the BOINC core client. v6.10.60 is getting quite old now - and yes, this bug has been around that long. It was fixed in v7.0.14:
	There is another annoying bug, and an annoying GUI change, which keep me from upgrading from v6.10.60. The bug is in the calculation of the required CPU percentage for GPU tasks: it can change from below 0.5 to over 0.5, and on a dual GPU system that change results in a fluctuation of 1 CPU thread. v6.10.60 underestimates the required CPU percentage for Kepler based cards (0.04%), so the number of available CPUs won't fluctuate. This bug comes in handy. The annoying GUI change is the omitted "messages" tab (actually it's relocated to a submenu).

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32639 - Posted: 2 Sep 2013 \| 20:07:10 UTC
	I've just picked up a new Beta task - 7238202 New (to me) task type MJHARVEY_TESTSIN, and new cuda55 application version 8.06 Task has the same 5 PetaFpops estimate as we're used to seeing, and - what with it being a new app and all - speed estimate was 295 GigaFlops, lower than previously on this host. DCF is still high, so BOINC calculated the runtime as 87 hours. Looks like it's turning into a 100 minute job... (but still leapt into action in High Priority as soon as I let my AV complete the download).
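The post above says the DCF is "still high" without giving its value. As a rough illustration, and assuming the displayed estimate is size / speed * DCF, the factor can be backed out from the rounded figures quoted. The function name is hypothetical.

```python
def implied_dcf(displayed_hours: float, fpops_est: float, flops: float) -> float:
    """Back out the DCF from a displayed estimate.

    Assumed model (not taken from BOINC source): displayed = size / speed * DCF.
    """
    uncorrected_seconds = fpops_est / flops
    return displayed_hours * 3600 / uncorrected_seconds


# Rounded figures from the post above: 5 PetaFpops, 295 GigaFlops, 87 hours.
dcf = implied_dcf(87.0, 5e15, 295e9)
print(f"Implied DCF: {dcf:.1f}")
```

Because the inputs are rounded, the result is only approximate, but it sits in the same range as the DCF values of 18.4 and 19.68 reported for this host earlier in the thread.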

ExtraTerrestrial Apes Volunteer moderator Volunteer tester Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 Level Scientific publications	Message 32640 - Posted: 2 Sep 2013 \| 20:16:12 UTC - in response to Message 32636.
	You could configure the CPU percentage yourself with an app_config (which the later BOINCs support). I could send you mine, if you're interested. And the message.. well, it's annoying. But only really needed when there are problems. Which, fortunately, isn't all that often for me. Anyway.. let's wait for MJH to work through these posts! MrS ____________ Scanning for our furry friends since Jan 2002

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32644 - Posted: 2 Sep 2013 \| 22:24:25 UTC - in response to Message 32639.
	> I've just picked up a new Beta task - 7238202
	Completed in under 2 hours, and awarded 150,000 credits. This is getting silly.

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32658 - Posted: 3 Sep 2013 \| 14:07:59 UTC
	Just a heads-up: estimated times for the current v8.10 (cuda55) Beta are unusually low - it is likely that many runs (especially of the full-length production NOELIA_KLEBEbeta tasks being processed through the Beta queue) will fail with EXIT_TIME_LIMIT_EXCEEDED - after about an hour, on my GTX 670. If you are a BOINC user experienced and competent enough to edit client_state.xml - and taking all the standard safety warnings as read - you can avoid this by increasing <rsc_fpops_bound> for all GPUGrid Beta tasks. A couple of orders of magnitude should do it, maybe three for luck.
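To show why raising <rsc_fpops_bound> helps, here is a minimal sketch. It assumes the client aborts a task with EXIT_TIME_LIMIT_EXCEEDED once elapsed time exceeds <rsc_fpops_bound> divided by the app version's <flops>; that rule is an assumption here, not quoted from the thread. The <flops> value for v8.10 is not given, so the inflated cuda42 figure quoted earlier in the thread is borrowed purely for illustration.

```python
def time_limit_seconds(fpops_bound: float, flops: float) -> float:
    """Assumed rule: the task is aborted once elapsed time exceeds fpops_bound / flops."""
    return fpops_bound / flops


RSC_FPOPS_BOUND = 250_000_000_000_000_000.0  # <rsc_fpops_bound> quoted earlier in the thread
INFLATED_FLOPS = 52_770_755_214_365.75       # illustrative: cuda42 <flops> quoted earlier, not the v8.10 value

before = time_limit_seconds(RSC_FPOPS_BOUND, INFLATED_FLOPS)
after = time_limit_seconds(RSC_FPOPS_BOUND * 100, INFLATED_FLOPS)  # "a couple of orders of magnitude"

print(f"Original limit: {before / 60:.0f} minutes")
print(f"Raised limit:   {after / 3600:.0f} hours")
```

With a speed figure that inflated, the original limit is on the order of an hour, far shorter than a full-length task needs; multiplying the bound by 100 pushes the limit well beyond any realistic runtime.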

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32659 - Posted: 3 Sep 2013 \| 14:42:25 UTC - in response to Message 32658.
	Yup, I just had some tasks fail because of that poor server estimation. This is getting very ridiculous.
	> Just a heads-up: estimated times for the current v8.10 (cuda55) Beta are unusually low - it is likely that many runs (especially of the full-length production NOELIA_KLEBEbeta tasks being processed through the Beta queue) will fail with EXIT_TIME_LIMIT_EXCEEDED - after about an hour, on my GTX 670. If you are a BOINC user experienced and competent enough to edit client_state.xml - and taking all the standard safety warnings as read - you can avoid this by increasing <rsc_fpops_bound> for all GPUGrid Beta tasks. A couple of orders of magnitude should do it, maybe three for luck.

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 32670 - Posted: 4 Sep 2013 \| 9:15:44 UTC Last modified: 4 Sep 2013 \| 9:37:48 UTC
	This problem should be confined to the beta queue and is a side-effect of having issued a series of short-running WUs with the same fpops estimate as normal longer-running ones. Please let me know if you start to see this problem on the important acemdshort and acemdlong queues. There's no reason why it should be happening there (any more than usual), but the client is full of surprises. MJH
	ID: 32670 \| Rating: 0 \| rate: / Reply Quote

juan BFP Send message Joined: 11 Dec 11 Posts: 21 Credit: 145,887,858 RAC: 0 Level Scientific publications	Message 32671 - Posted: 4 Sep 2013 \| 9:26:40 UTC Last modified: 4 Sep 2013 \| 9:29:52 UTC
	All the WUs actually running on this host have an estimated time of +/- 130 hrs! (normal time to crunch a WU = 8-9 hrs) http://www.gpugrid.net/results.php?hostid=157835&offset=0&show_names=0&state=1&appid= So actually most of them are crunching in high priority mode.
	ID: 32671 \| Rating: 0 \| rate: / Reply Quote

Jim1348 Send message Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level Scientific publications	Message 32672 - Posted: 4 Sep 2013 \| 9:37:29 UTC - in response to Message 32670. Last modified: 4 Sep 2013 \| 9:39:50 UTC
	Please let me know if you start to see this problem on the important acemdshort and acemdlong queues. There's no reason why it should be happening there (any more than usual), but the client is full of surprises. MJH I have tried 3 betas in the long queue, and two have failed at almost exactly the same running times. One was a Noelia, and the other was a Harvey. (On a GTX 650 Ti under Win7 64-bit and BOINC 7.2.11 x64) 8.10 ACEMD beta version (cuda55) 109nx46-NOELIA_KLEBEbeta-0-3-RND7143_2 01:58:09 Reported: Computation error (197,) 8.10 ACEMD beta version (cuda55) 66-MJHARVEY_TEST10-42-50-RND0504_0 01:58:02 Reported: Computation error (197,) The Noelia was also reported as failed by four other people thus far, whereas the Harvey was completed successfully by someone running ACEMD beta version v8.05 (cuda42)
	ID: 32672 \| Rating: 0 \| rate: / Reply Quote

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 32673 - Posted: 4 Sep 2013 \| 9:40:43 UTC - in response to Message 32672.
	I have tried 3 betas in the long queue I'm not sure what you mean by that. Those are WUs from the acemdbeta queue, run by the beta application. The acemdlong queue isn't involved. MJH
	ID: 32673 \| Rating: 0 \| rate: / Reply Quote

Jim1348 Send message Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level Scientific publications	Message 32674 - Posted: 4 Sep 2013 \| 10:02:56 UTC - in response to Message 32673.
	Sorry. I thought you were asking about the betas. No problems with the longs thus far. All seven that I have received under CUDA 5.5 have been completed successfully.
	8.03 Long runs (cuda55) 063ppx239-NOELIA_FRAG063pp-2-4-RND4537_0 20:57:08
	8.03 Long runs (cuda55) 041px89-NOELIA_FRAG041p-3-4-RND5262_0 17:41:48
	8.03 Long runs (cuda55) 041px89-NOELIA_FRAG041p-2-4-RND5262_0 17:39:49
	8.03 Long runs (cuda55) 063ppx290-NOELIA_FRAG063pp-1-4-RND3152_0 20:59:32
	8.03 Long runs (cuda55) I35R7-NATHAN_KIDKIXc22_6-8-50-RND8566_0
	8.02 Long runs (cuda55) I50R6-NATHAN_KIDKIXc22_6-3-50-RND0333_0 17:48:16
	8.00 Long runs (cuda55) I81R8-NATHAN_KIDKIXc22_6-4-50-RND0944_0 17:44:35
	ID: 32674 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32676 - Posted: 4 Sep 2013 \| 10:12:44 UTC - in response to Message 32671.
	All the WU actualy running in this host has a estimated of +/- 130hrs! (normal time to crunching a WU = 8-9 hrs) http://www.gpugrid.net/results.php?hostid=157835&offset=0&show_names=0&state=1&appid= So actualy most of them are crunching in high priority mode. That's DCF in action. It will work itself down eventually, but may take 20 - 30 tasks with proper <rsc_fpops_est> to get there. Unless DCF has already reached over 90 - the normalisation process is slower in those extreme cases.
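For anyone who wants to see why DCF shoots up at once but drifts down slowly, here is a toy model in Python. It is only a sketch of the behaviour described in this post, not BOINC's actual code: the function names, the thresholds (1.1 and 0.1) and the weights (10% and 1%) are all assumptions chosen for illustration.

```python
def update_dcf(dcf, elapsed, raw_estimate,
               up_threshold=1.1, slow_threshold=0.1,
               fast_weight=0.1, slow_weight=0.01):
    """One DCF update after a task finishes (illustrative model only).

    raw_estimate is the runtime estimate before DCF is applied, so the
    displayed estimate is raw_estimate * dcf. Every threshold and weight
    here is an assumed value for this sketch.
    """
    raw_ratio = elapsed / raw_estimate
    adj_ratio = elapsed / (raw_estimate * dcf)
    if adj_ratio > up_threshold:
        # Task ran much longer than displayed: jump straight up.
        return raw_ratio
    # Task ran shorter than displayed: come down cautiously, and even
    # more cautiously when the displayed estimate was wildly too high.
    weight = slow_weight if adj_ratio < slow_threshold else fast_weight
    return dcf * (1 - weight) + raw_ratio * weight


def tasks_until(dcf, target, elapsed=1.0, raw_estimate=1.0, limit=10_000):
    """Count correctly estimated tasks needed for dcf to fall to target."""
    n = 0
    while dcf > target and n < limit:
        dcf = update_dcf(dcf, elapsed, raw_estimate)
        n += 1
    return n
```

With these assumed constants, a single badly underestimated task is enough to push DCF up, while coming back down from a DCF of 9 takes a couple of dozen correctly estimated tasks, and far longer from a very large DCF.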
	ID: 32676 \| Rating: 0 \| rate: / Reply Quote

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32677 - Posted: 4 Sep 2013 \| 10:14:54 UTC - in response to Message 32670. Last modified: 4 Sep 2013 \| 10:19:04 UTC
	MJH, Thanks for finding the cause of the problem. Has the problem been fixed, such that new tasks (including beta!) are sent with appropriate fpops estimates?
	Once the problem has been fixed at the server, if a user wants to immediately reset Duration Correction Factor (instead of waiting for it to adjust down over the course of a few days/weeks), they could do these steps:
	- Exit BOINC, including stopping running tasks
	- Open client_state.xml within the data directory
	- Search for the <project_name>GPUGRID</project_name> element
	- Find the <duration_correction_factor> element within that project element
	- Change the line to read: <duration_correction_factor>1</duration_correction_factor>
	- Restart BOINC
	- Monitor the value in the UI by viewing the GPUGrid project properties, to make sure it stays between 0.6 and 1.6.
	.... I just need to know when the problem has been fixed by the server, so I can begin testing the solution on my client (which has luckily not been full of too many surprises). Has it been fixed?
	This problem should be confined to the beta queue and is a side-effect of having issued a series of short-running WUs with the same fpops estimate as normal longer-running ones. Please let me know if you start to see this problem on the important acemdshort and acemdlong queues. There's no reason why it should be happening there (any more than usual), but the client is full of surprises. MJH
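The manual reset steps in this post could be scripted roughly as follows. This is a minimal Python sketch, not a supported tool: it assumes client_state.xml parses as well-formed XML and that each <project> element holds <project_name> and <duration_correction_factor> children, as the steps describe. The function name reset_dcf and the SAMPLE text are made up for the example. As with the manual edit, BOINC must be fully stopped first, and a backup copy of the file is wise.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a real client_state.xml.
SAMPLE = """<client_state>
<project>
    <project_name>GPUGRID</project_name>
    <duration_correction_factor>96.000000</duration_correction_factor>
</project>
</client_state>"""


def reset_dcf(state_xml, project_name="GPUGRID", new_value="1"):
    """Return the XML text with the named project's DCF set to new_value."""
    root = ET.fromstring(state_xml)
    changed = 0
    for project in root.iter("project"):
        if project.findtext("project_name", "").strip() != project_name:
            continue
        dcf = project.find("duration_correction_factor")
        if dcf is not None:
            dcf.text = new_value
            changed += 1
    if changed == 0:
        # Refuse to write anything if the expected element was not found.
        raise ValueError(f"no DCF entry found for project {project_name!r}")
    return ET.tostring(root, encoding="unicode")
```

In practice you would read the real file into a string, pass it through reset_dcf, and write the result back only after checking it by eye.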
	ID: 32677 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32679 - Posted: 4 Sep 2013 \| 10:31:54 UTC - in response to Message 32673.
	I have tried 3 betas in the long queue I'm not sure what you mean by that. Those are WUs from the acemdbeta queue, run by the beta application. The acemdlong queue isn't involved. MJH I think he means long tasks, like those NOELIA_KLEBE jobs, being processed through the Beta queue currently alongside your quick test pieces. The problem is that if you run a succession of short test units with full-size <rsc_fpops_est> values, then the BOINC server thinks your host is insanely fast - it thinks my GTX 670 can complete ACEMD beta version 8.11 tasks (bottom of linked list) at 79.2 TeraFlops. When BOINC attempts any reasonably long job (a bit over an hour, in my case), the client thinks something has gone wrong, and aborts it for taking too long. There's nothing the user can do to overcome that problem, except 'inoculate' each individual task as received, with a big (100x or 1000x) increase to <rsc_fpops_bound>
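The 'inoculation' described here could be sketched in Python as below. Treat it as an illustration under assumptions: it presumes client_state.xml parses as well-formed XML and that each task appears as a <workunit> element with <app_name> and <rsc_fpops_bound> children. The function name raise_fpops_bound and the SAMPLE_WU text are invented for the example. The same safety warnings apply as for any hand edit of client_state.xml: stop BOINC first and keep a backup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed workunit entry; the numbers are made up.
SAMPLE_WU = """<client_state>
<workunit>
    <name>example_beta_task</name>
    <app_name>acemdbeta</app_name>
    <rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>50000000000000000.000000</rsc_fpops_bound>
</workunit>
</client_state>"""


def raise_fpops_bound(state_xml, app_name="acemdbeta", factor=1000.0):
    """Multiply <rsc_fpops_bound> for every workunit of one application.

    Returns the modified XML text and the number of workunits changed.
    """
    root = ET.fromstring(state_xml)
    touched = 0
    for wu in root.iter("workunit"):
        if wu.findtext("app_name", "").strip() != app_name:
            continue
        bound = wu.find("rsc_fpops_bound")
        if bound is None:
            continue
        bound.text = f"{float(bound.text) * factor:f}"
        touched += 1
    return ET.tostring(root, encoding="unicode"), touched
```

Because the bound is stored per task, this has to be repeated for each new task as it arrives, which is why it is only a stopgap.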
	ID: 32679 \| Rating: 0 \| rate: / Reply Quote

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 32682 - Posted: 4 Sep 2013 \| 10:43:36 UTC - in response to Message 32679.
	The problem is, that if you run a succession of short test units with full-size <rsc_fpops_est> values, then the BOINC server thinks your host is insanely fast The server doesn't have any part in it - it's the client making that decision. Anyway, the last batch of short WUs has been submitted, with no more to follow. Hopefully the client will be as quick to correct itself back as before. MJH
	ID: 32682 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32683 - Posted: 4 Sep 2013 \| 10:57:28 UTC - in response to Message 32682.
	The problem is, that if you run a succession of short test units with full-size <rsc_fpops_est> values, then the BOINC server thinks your host is insanely fast The server doesn't have any part in it - it's the client making that decision. Anyway, the last batch of short WUs has been submitted, with no more to follow. Hopefully the client will be as quick to correct itself back as before. MJH I beg to differ. Please have a look at http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation - the problem is the host_app_version table described under 'The New System'. You have that here: the server calculates the effective speed of the host, based on an average of previously completed tasks. You can see - but not alter - those 'effective' speeds as 'Average Processing Rate' on the application details page for each host. That's what I linked for mine: the units are gigaflops. The server passes that effective flops rating to the client with each work allocation, and yes: it's the client which makes the final decision to abort work with EXIT_TIME_LIMIT_EXCEEDED - but it does so on the basis of data maintained and supplied by the server.
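To put rough numbers on that chain of cause and effect, here is a small Python sketch. It assumes the relations implied in this thread: the displayed estimate is about rsc_fpops_est / flops scaled by DCF, and the abort limit is about rsc_fpops_bound / flops, where flops is the speed figure the server hands to the client. The 79.2 TeraFlops figure is the one quoted earlier in the thread; the honest speed, the fpops estimate and the 10x bound are hypothetical values picked for the example.

```python
def displayed_estimate_seconds(rsc_fpops_est, flops, dcf=1.0):
    """Approximate runtime estimate shown to the user (assumed relation)."""
    return rsc_fpops_est / flops * dcf


def runtime_limit_seconds(rsc_fpops_bound, flops):
    """Approximate elapsed time after which the task is aborted (assumed)."""
    return rsc_fpops_bound / flops


HONEST_FLOPS = 0.2e12      # hypothetical real throughput of the card
INFLATED_FLOPS = 79.2e12   # APR reported after many short test tasks
FPOPS_EST = 5e15           # hypothetical estimate for a full-length task
FPOPS_BOUND = 10 * FPOPS_EST  # hypothetical bound, 10x the estimate

# How long the task really needs, versus when it gets cut off.
real_runtime = displayed_estimate_seconds(FPOPS_EST, HONEST_FLOPS)
limit_inflated = runtime_limit_seconds(FPOPS_BOUND, INFLATED_FLOPS)

# The same limit after multiplying the bound by 1000 by hand.
limit_raised = runtime_limit_seconds(1000 * FPOPS_BOUND, INFLATED_FLOPS)
```

With these made-up inputs the task needs several hours but is cut off after a few minutes, and a 1000x larger bound moves the limit well past the real runtime. The exact figures on any given host will differ.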
	ID: 32683 \| Rating: 0 \| rate: / Reply Quote

Jeremy Zimmerman Send message Joined: 13 Apr 13 Posts: 61 Credit: 726,605,417 RAC: 0 Level Scientific publications	Message 32684 - Posted: 4 Sep 2013 \| 12:36:36 UTC - in response to Message 32682. Last modified: 4 Sep 2013 \| 12:41:31 UTC
	In the last month, my two machines seem to have stabilized for errors. The last errors were the server-canceled runs on Aug 24, and before that Jul 30 - Aug 11 with the NOELIA runs. I allowed Beta runs a couple of days ago, and last night they started running on one machine. The HARVEYs went through with no problem. The NOELIA_KLEBEbeta did error with 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED after running ~30 of the HARVEYs. http://www.gpugrid.net/result.php?resultid=7246089
	When reviewing all the comments about rsc_fpops_bound, duration_correction_factor, etc., it seems that with the current setup, one-minute runs and 8-hour runs should not be mixed under the same app, since it is our side which calculates how long a task should take, and our side is limited by the app definitions.
	<app_version> <app_name>acemdshort</app_name> <version_num>800</version_num> <flops>741352314880.428590</flops> (0.74 TFlops) <file_name>acemd.800-42.exe</file_name>
	<app_version> <app_name>acemdshort</app_name> <version_num>802</version_num> <flops>243678296053.232120</flops> (0.24 TFlops) <file_name>acemd.802-55.exe</file_name>
	<app_version> <app_name>acemdlong</app_name> <version_num>803</version_num> <flops>133475073594.177120</flops> (0.13 TFlops) <file_name>acemd.800-55.exe</file_name>
	<app_version> <app_name>acemdlong</app_name> <version_num>803</version_num> <flops>185603220803.599580</flops> (0.19 TFlops) <file_name>acemd.800-55.exe</file_name>
	<app_version> <app_name>acemdbeta</app_name> <version_num>811</version_num> <flops>69462341903928.672000</flops> (69.46 TFlops) <file_name>acemd.811-55.exe</file_name>
	My two machines also make this a bit more complicated in that they both have a GTX680 and a GTX460, so it seems that the estimated time remaining is driven by the GTX460 speed. That seems to work out ok though. There seems to be no tracking ability, local or server, for the different cards, so this is a moot point.
	So if the acemdbeta app could be run as at least two different versions with an identical binary (e.g. 5min_811 and 8hr_811, for 5 minutes and 8 hours expected time), this could at least separate runs such as HARVEY and KLEBE, whose run times differ by orders of magnitude. This may reduce the number of errors due to timeouts.
	ID: 32684 \| Rating: 0 \| rate: / Reply Quote

juan BFP Send message Joined: 11 Dec 11 Posts: 21 Credit: 145,887,858 RAC: 0 Level Scientific publications	Message 32685 - Posted: 4 Sep 2013 \| 12:42:39 UTC Last modified: 4 Sep 2013 \| 12:43:11 UTC
	Thanks all for the explanations; the path explained by Jacob apparently fixes the times. Now the question remains: is it fixed on the server side? Can I allow new Beta WUs again, or is it better to wait a little more?
	ID: 32685 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32687 - Posted: 4 Sep 2013 \| 13:14:54 UTC - in response to Message 32685.
	Thanks all for te explanations, the path explained by Jacob aparently fix the times. Now the question remains, it´s fixed at the server side? can i allow to receive new Beta WU or is beter wait a little more? Having consulted the usual oracle on that, we think it would be wise to wait a little longer. Although changing DCF will change the displayed estimates for runtime, we don't think it affects the underlying calculations for EXIT_TIME_LIMIT_EXCEEDED. And whatever you change, DCF is re-calculated every time a task exits. If you happen to draw another NOELIA_KLEBEbeta, DCF will go right back up through the roof in one jump when it finishes. The only solution is a new Beta app installation, with a new set of APRs - and that has to be done on the server.
	ID: 32687 \| Rating: 0 \| rate: / Reply Quote

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32688 - Posted: 4 Sep 2013 \| 13:18:43 UTC - in response to Message 32687.
	Richard is right, as usual. I'm going to take my computer off of the beta queue, until a new app and new APRs are in place; too much wasted computing power is occurring. MJH: Please let us know when a new beta app (with new APRs) is released on the server. Thanks all for te explanations, the path explained by Jacob aparently fix the times. Now the question remains, it´s fixed at the server side? can i allow to receive new Beta WU or is beter wait a little more? Having consulted the usual oracle on that, we think it would be wise to wait a little longer. Although changing DCF will change the displayed estimates for runtime, we don't think it affects the underlying calculations for EXIT_TIME_LIMIT_EXCEEDED. And whatever you change, DCF is re-calculated every time a task exits. If you happen to draw another NOELIA_KLEBEbeta, DCF will go right back up through the roof in one jump when it finishes. The only solution is a new Beta app installation, with a new set of APRs - and that has to be done on the server.
	ID: 32688 \| Rating: 0 \| rate: / Reply Quote

juan BFP Send message Joined: 11 Dec 11 Posts: 21 Credit: 145,887,858 RAC: 0 Level Scientific publications	Message 32689 - Posted: 4 Sep 2013 \| 13:38:42 UTC - in response to Message 32687.
	Having consulted the usual oracle on that, we think it would be wise to wait a little longer. <Waiting>1<Waiting> hope not for a long time and thanks for the help. Richard is right, as usual. I'm going to take my computer off of the beta queue, until a new app and new APRs are in place; too much wasted computing power is occurring. Please let us know when a new beta app (with new APRs) is released on the server. +1 ____________
	ID: 32689 \| Rating: 0 \| rate: / Reply Quote

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 32692 - Posted: 4 Sep 2013 \| 13:51:18 UTC - in response to Message 32688.
	Richard is right, as usual. I'm going to take my computer off of the beta queue, until a new app and new APRs are in place; too much wasted computing power is occurring. MJH: Please let us know when a new beta app (with new APRs) is released on the server. Jacob - the beta testing is over now. 8.11 is the final revision and is out now on beta and short. Now, I know these wrong estimates have been a cause of frustration for you, but in fact the WUs haven't been going to waste - they've been doing enough work to help me fix the bugs I was looking at. MJH
	ID: 32692 \| Rating: 0 \| rate: / Reply Quote

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32693 - Posted: 4 Sep 2013 \| 13:57:46 UTC - in response to Message 32692.
	MJH: If people get stuck in a "Maximum Time Exceeded" loop, where they cannot complete their tasks, then surely their time will be wasted with the work, no? Richard knows what needs to be done in order to get people out of that scenario. It deals with APRs on the Application Details page - Average processing rate is way too high for some applications. As long as the "busted APR" applications aren't used anymore, then I think the problem will go away from a user's perspective, as DCF will eventually work its way back to 0.6-1.6. Since you say the beta is over, I take it we're done issuing tasks on these "busted APR" applications, right? (i.e. no future beta will ever use these applications) But I'm also hopeful you can do something to prevent the issue from happening in the future - maybe some safeguard that ensures proper fpops estimates. Just a thought. For your reference, I care more about the "Maximum time exceeded" errors than the "bad estimates" problems. The former cause lost work/time.
	ID: 32693 \| Rating: 0 \| rate: / Reply Quote

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 32694 - Posted: 4 Sep 2013 \| 14:11:43 UTC - in response to Message 32693.
	If people get stuck in a "Maximum Time Exceeded" loop, where they cannot complete their tasks, then surely their time will be wasted with the work, no? From a BOINC credit perspective, yes. But note that the completed WUs were receiving a generous award, which ought to have been some compensation. Importantly though, from a development perspective they weren't wasted. The failures I was interested in happened very quickly after start-up. If the WU ran long enough for MTE, it had run long enough to accomplish its purpose. But I'm also hopeful you can do something to prevent the issue from happening in the future - maybe some safeguard that ensures proper fpops estimates. Yes, of course! We've not tried this method of live debugging using short WUs before and weren't expecting this unfortunate side-effect. Next time the fpops estimate will be dialled down appropriately. MJH
	ID: 32694 \| Rating: 0 \| rate: / Reply Quote

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32695 - Posted: 4 Sep 2013 \| 14:16:14 UTC - in response to Message 32694. Last modified: 4 Sep 2013 \| 14:19:53 UTC
	But I'm also hopeful you can do something to prevent the issue from happening in the future - maybe some safeguard that ensures proper fpops estimates. Yes, of course! We've not tried this method of live debugging using short WUs before and weren't expecting this unfortunate side-effect. Next time the fpops estimate will be dialled down appropriately. MJH Thank you. That is what I/we needed to hear. We understand it's a bit of a learning experience, since you were trying a new way to weed out errors and move forward. I'm glad you know more about this issue:
	- APRs and app versions
	- How they affect fpops estimated
	- How fpops bound ends up affecting Maximum Time Exceeded
	- How the client keeps track of estimation using a project-wide variable [Duration Correction Factor (DCF)] to show estimated times in the UI
	Next time, I'm sure it'll go much more smoothly :) Thanks for your responses.
	ID: 32695 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32698 - Posted: 4 Sep 2013 \| 14:42:37 UTC
	I was just about to suggest that we wait until this urgent fine-tuning of the app was complete, but I see we've reached that point already if you're happy with v8.11. So I guess we're left with some MJHARVEY_TEST tasks with, ahem, 'creative' runtime estimates working their way through the system on the Beta queue, and in addition some of the full-length NOELIA_KLEBEbeta tasks. Are you able to monitor the numbers of tasks of each type still remaining in the queue? The best course of action might be to wait until all MJHARVEY_TEST tasks are complete and out of the way - or even help them along, with a 'cancel if unstarted' purge. Then, once we are sure that no MJHARVEY_TEST tasks are left or in any danger of being re-issued with current estimates, re-deploy the Beta app as version 812 (there doesn't have to be any change in the app itself). That could then be left to clean up the remaining NOELIA_KLEBEbeta tasks with a new app_version detail record, which would automatically be populated with more realistic APRs. There's nothing you can do at the server end to hasten the correction of DCF - that's entirely managed by the client, at our end. Those who feel confident will adjust their own, others will have to wait, but it'll come right in the end. One other small point while we're here - I noticed that the CUDA 5.5 app_version referenced the v5.5 cudart and cufft DLLs - correct. But it also referenced the equivalent CUDA 4.2 DLLs. Subject to checking at your end, that sounds like duplication, which might add an unnecessary bandwidth overhead as both sets of files are downloaded by new testers. Might be scope for a bit of a saving there.
	ID: 32698 \| Rating: 0 \| rate: / Reply Quote

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 32702 - Posted: 4 Sep 2013 \| 17:41:03 UTC - in response to Message 32698.
	There's nothing you can do at the server end to hasten the correction of DCF - that's entirely managed by the client, at our end. Those who feel confident will adjust their own, others will have to wait, but it'll come right in the end. Will disconnecting and re-attaching to the project force a reset? But it also referenced the equivalent CUDA 4.2 DLLs. Yes - 42 and 55 DLLs are both delivered irrespective of the app version. It's a side-effect of our deployment mechanism. Will probably fix it later. MJH
	ID: 32702 \| Rating: 0 \| rate: / Reply Quote

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32703 - Posted: 4 Sep 2013 \| 17:44:05 UTC - in response to Message 32702. Last modified: 4 Sep 2013 \| 17:44:28 UTC
	There's nothing you can do at the server end to hasten the correction of DCF - that's entirely managed by the client, at our end. Those who feel confident will adjust their own, others will have to wait, but it'll come right in the end. Will disconnecting and re-attaching to the project force a reset? Yes, I believe it will, but the user loses all the local stats for the project, plus any files that had been downloaded. For resetting DCF, I prefer to close BOINC, carefully edit the client_state.xml file, then reopen BOINC.
	ID: 32703 \| Rating: 0 \| rate: / Reply Quote

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32704 - Posted: 4 Sep 2013 \| 17:54:50 UTC - in response to Message 32698. Last modified: 4 Sep 2013 \| 18:12:12 UTC
	I just got the following task: http://www.gpugrid.net/result.php?resultid=7248506 Name 063px55-NOELIA_KLEBEbeta-2-3-RND9896_0 Created 4 Sep 2013 \| 17:38:22 UTC Sent 4 Sep 2013 \| 17:41:12 UTC Application version ACEMD beta version v8.11 (cuda55) I believe it is a new task on the beta queue, though I'm not sure if this app version has "busted APR" or not. Can we make sure that any new tasks are not using the "busted APR" app versions?
	ID: 32704 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32705 - Posted: 4 Sep 2013 \| 17:59:50 UTC - in response to Message 32702.
	There's nothing you can do at the server end to hasten the correction of DCF - that's entirely managed by the client, at our end. Those who feel confident will adjust their own, others will have to wait, but it'll come right in the end. Will disconnecting and re-attaching to the project force a reset? For DCF - detach/reattach will fix it, as will a simple 'Reset project' from BOINC Manager. Both routes will kill any tasks in progress, and will force a re-download of applications, DLLs and new tasks. People would probably wish to wait for a pause between jobs to do this: set 'No new tasks'; complete, upload and report all current work; and only then reset the project. For APR - 'Reset project' will do nothing, except kill tasks in progress and force the download of new ones. Detach/re-attach might help, but the BOINC server code in general tries to re-assign the previous HostID to a re-attaching host (if it recognises the IP address, Domain Name, and hardware configuration). If you get the same HostID, you get the APR values and other application details back, too. There are ways of forcing a new HostID, but they involve deliberately invoking BOINC's anti-cheating mechanism by falsifying the RPC sequence number.
	ID: 32705 \| Rating: 0 \| rate: / Reply Quote

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32706 - Posted: 4 Sep 2013 \| 18:02:40 UTC - in response to Message 32705.
	Richard, For APR: Does that mean that, to effectively solve the APR problem, the solution is to create a new app version for any app version that had tasks sent with bad estimates? Is that correct?
	ID: 32706 \| Rating: 0 \| rate: / Reply Quote

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 32708 - Posted: 4 Sep 2013 \| 18:04:49 UTC - in response to Message 32704.
	The last KLEBE beta WUs are being deleted now.
	ID: 32708 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32710 - Posted: 4 Sep 2013 \| 18:38:04 UTC - in response to Message 32706. Last modified: 4 Sep 2013 \| 18:38:53 UTC
	Richard, For APR: Does that mean that, to effectively solve the APR problem, the solution is to create a new app version for any app version that had tasks sent with bad estimates? Is that correct? Yes, that's the only way I know from a user perspective. There is supposed to be an Application Reset tool on the server operations web-admin page, but I don't know if it can be applied selectively: the code is here http://boinc.berkeley.edu/trac/browser/boinc-v2/html/ops/app_reset.php but there's no mention of it on the associated Wiki page http://boinc.berkeley.edu/trac/wiki/HtmlOps I'd advise consulting another BOINC server administrator before touching it: Oliver Bock (shown as the most recent contributor to that code page) can be contacted via Einstein@home or the BOINC email lists, and is normally very helpful.
	ID: 32710 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32711 - Posted: 4 Sep 2013 \| 18:40:49 UTC - in response to Message 32708.
	The last KLEBE beta WUs are being deleted now. Will that affect tasks in progress? :P I've just started one, with suitably modified <rsc_fpops_bound> (of course).
	ID: 32711 \| Rating: 0 \| rate: / Reply Quote

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 32712 - Posted: 4 Sep 2013 \| 20:40:12 UTC - in response to Message 32711. Last modified: 4 Sep 2013 \| 20:41:39 UTC
	While I expect you jest, for the rest of those who might be reading this thread, recently some work was aborted by mistake; tasks in progress as well as non started work. However, it was mentioned that a mechanism to avoid this will be used in the future. http://www.gpugrid.net/forum_thread.php?id=3448 http://www.gpugrid.net/forum_thread.php?id=3446&nowrap=true#32213 - just after starting (and finishing) the following beta, 1-MJHARVEY_TEST18-9-10-RND8069_0 4753024 4 Sep 2013 \| 20:21:07 UTC 4 Sep 2013 \| 20:24:14 UTC Completed and validated 148.76 9.55 150.00 ACEMD beta version v8.00 (cuda55) So, it's only the NOELIA betas that are being aborted!
	ID: 32712 \| Rating: 0 \| rate: / Reply Quote

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 32713 - Posted: 4 Sep 2013 \| 20:53:00 UTC - in response to Message 32712.
	Huh, I thought all the TEST18s were done already. MJH
	ID: 32713 \| Rating: 0 \| rate: / Reply Quote

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 32714 - Posted: 4 Sep 2013 \| 20:58:52 UTC - in response to Message 32713.
	Created 4 Sep 2013 \| 20:18:12 UTC Sent 4 Sep 2013 \| 20:21:07 UTC Received 4 Sep 2013 \| 20:24:14 UTC ____________ FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help
	ID: 32714 \| Rating: 0 \| rate: / Reply Quote

TJ Send message Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level Scientific publications	Message 32715 - Posted: 4 Sep 2013 \| 21:04:29 UTC
	I did a project reset, but the estimated time for new WUs afterwards is still wrong. A 1.5-minute WU was estimated at 1h45m, and a SANTI SR at 7h40m52s. The latter will be faster than that: it has already done 3% in 5m. ____________ Greetings from TJ
	ID: 32715 \| Rating: 0 \| rate: / Reply Quote

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32779 - Posted: 6 Sep 2013 \| 11:58:15 UTC
	I thought the problem had been solved, but now I don't know. Overnight I had completed the following 5 tasks, and now my Duration Correction Factor is the highest it's ever been, 96. Some new tasks are saying they'll take 848 hours to complete :)
	Do any of the following apps have "busted APR" from having tasks with inappropriate <rsc_fpops_est> values?
	- ACEMD beta version v8.11 (cuda55)
	- ACEMD beta version v8.13 (cuda42)
	- ACEMD beta version v8.13 (cuda55)
	063px55-NOELIA_KLEBEbeta-2-3-RND9896_0 4752598 4 Sep 2013 \| 17:41:12 UTC 6 Sep 2013 \| 4:58:59 UTC Completed and validated 106,238.51 11,702.89 119,000.00 ACEMD beta version v8.11 (cuda55)
	124-MJHARVEY_CRASH1-0-25-RND3516_1 4754388 5 Sep 2013 \| 16:39:18 UTC 6 Sep 2013 \| 4:58:21 UTC Completed and validated 17,615.38 7,935.19 18,750.00 ACEMD beta version v8.13 (cuda55)
	139-MJHARVEY_CRASH2-1-25-RND6442_0 4756103 6 Sep 2013 \| 4:49:11 UTC 6 Sep 2013 \| 7:51:44 UTC Completed and validated 10,781.12 10,669.78 18,750.00 ACEMD beta version v8.13 (cuda42)
	196-MJHARVEY_CRASH2-1-25-RND1142_1 4756328 6 Sep 2013 \| 7:51:00 UTC 6 Sep 2013 \| 10:49:48 UTC Completed and validated 10,559.78 10,479.89 18,750.00 ACEMD beta version v8.13 (cuda55)
	149-MJHARVEY_CRASH2-1-25-RND2885_0 4756187 6 Sep 2013 \| 4:53:45 UTC 6 Sep 2013 \| 11:00:47 UTC Completed and validated 21,826.75 5,600.89 18,750.00 ACEMD beta version v8.13 (cuda55)
	ID: 32779 \| Rating: 0 \| rate: / Reply Quote

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 32782 - Posted: 6 Sep 2013 \| 12:14:16 UTC - in response to Message 32779.
	All of the CRASH tasks are exact copies of SANTI-MAR4222s MJH
	ID: 32782 \| Rating: 0 \| rate: / Reply Quote

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32783 - Posted: 6 Sep 2013 \| 12:15:42 UTC - in response to Message 32782.
	Can you answer my question about the apps?
	ID: 32783 \| Rating: 0 \| rate: / Reply Quote

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 32784 - Posted: 6 Sep 2013 \| 12:23:05 UTC - in response to Message 32783.
	rsc_fpops_est is the same as when they went out on acemdshort. I neither understand what APR is nor know how to measure it.
	ID: 32784 \| Rating: 0 \| rate: / Reply Quote

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32785 - Posted: 6 Sep 2013 \| 12:33:36 UTC - in response to Message 32784. Last modified: 6 Sep 2013 \| 12:38:42 UTC
	Oh. I only know a little, but I'll share what I know. Richard knows a ton about it. From what I gather, each app-version that a host runs gets its own set of statistics. So if you click my computers, view the details of RacerX, and then click "Application details", you'll see all the app-versions I've run, including statistics on them. For the app-versions that were sent out with bad <rsc_fpops_est> values, you'll see that my "Average processing rate" (APR) is badly inflated (normal values for me are about 350 for the short-run queue and about 100 for the long-run queue, but the beta apps on my list show values like 4000, 7000, 16000). If the APR is that high, the server thinks the card is 40-160 times faster than it actually is! So, in a sense, we can use the APR to see whether apps had been sent tasks with inappropriate <rsc_fpops_est> values. At the client, the clear indication that a task with an incorrect <rsc_fpops_est> has been processed is that the Duration Correction Factor (DCF) gets jacked way up, and then the client's time estimates get jacked too. I was hoping that we had fixed all this "bad <rsc_fpops_est> values" business (which I've termed "busted-APR apps"), but last night it seems one of the tasks messed it up. I noticed because DCF was wrong and running tasks said 400+ hours. The main question I have: on the beta queue, which app version is the earliest where you believe the <rsc_fpops_est> issue was fully resolved? I'm pretty sure the answer is not "8.11"; I think the 8.11 task may have been the culprit out of the 5 tasks from a couple of posts up.
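	To make the APR inflation concrete, here is a rough sketch in Python. It assumes a simplified model in which the rate credited for one task is just the workunit's <rsc_fpops_est> divided by the elapsed time, expressed in GFLOPS; the real server averages over many tasks and does more besides. The FPOPS_EST value is a made-up placeholder, not GPUGrid's actual setting, and the one-minute and ten-hour run times are the ones mentioned elsewhere in this thread for the TEST workunits.

```python
def apr_gflops(rsc_fpops_est: float, elapsed_seconds: float) -> float:
    """Simplified per-task processing rate: claimed work / wall time, in GFLOPS."""
    return rsc_fpops_est / elapsed_seconds / 1e9


# Hypothetical fixed estimate attached to every workunit (not a real GPUGrid value).
FPOPS_EST = 5e15

normal_apr = apr_gflops(FPOPS_EST, 10 * 3600)  # a task that really runs about 10 hours
test_apr = apr_gflops(FPOPS_EST, 60)           # a one-minute task carrying the same estimate

# How much faster the card appears after the short task, under this model.
inflation = test_apr / normal_apr

print(f"normal APR: {normal_apr:.0f} GFLOPS")
print(f"short-task APR: {test_apr:.0f} GFLOPS")
print(f"apparent speed-up: {inflation:.0f}x")
```

	Because the estimate is the same for both tasks, the placeholder cancels out: the apparent speed-up depends only on the ratio of the two run times.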
	ID: 32785 \| Rating: 0 \| rate: / Reply Quote

TJ Send message Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0 Level Scientific publications	Message 32789 - Posted: 6 Sep 2013 \| 12:52:40 UTC - in response to Message 32782.
	"All of the CRASH tasks are exact copies of SANTI-MAR4222s MJH" That sounds promising. My 660 had big problems with Santi's SR and LR tasks, but so far all 4 CRASH betas I got finished with good results. ____________ Greetings from TJ
	ID: 32789 \| Rating: 0 \| rate: / Reply Quote

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 32792 - Posted: 6 Sep 2013 \| 13:06:34 UTC - in response to Message 32785.
	Thanks Jacob. In fact, our value for rsc_fpops_est has never changed: it is set the same for every WU we have ever put out on every queue. Evidently the mechanisms that depend on it being accurate had adjusted to accept our typical WU lengths as correct, but got very confused when I put out a whole batch of very short WUs with the same fpops estimate. MJH
	ID: 32792 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,832,166,430 RAC: 19,849,425 Level Scientific publications	Message 32794 - Posted: 6 Sep 2013 \| 13:19:27 UTC - in response to Message 32785.
	That's pretty much it. Some comments: Host names, as in your "view the details of RacerX", are only visible to the machine owner when logged in to their account. All other users - including the project staff - can only see the 'HostID' number, so it's better to quote (or even link) that. 
<rsc_fpops_est> is a property of the workunit, and hence of all tasks (including resends) generated from it. Workunits exist as entities in their own right - there's no such thing as a 'v8.11 workunit' - although the copy that got sent to your machine might well have appeared as a 'v8.11 task'. Another user might have got it as v8.06 or v8.13, depending on which version was active at the time the task was allocated to the host in question. If the test tasks (the current 'CRASH' series) are copies of SANTI-MAR4222, they will be long enough - as I think I've already said somewhere - not to cause any timing problems. A bit of distortion, sure - DCF should rise to maybe 4, but still in single figures, which will clear by itself. The problem with hugely distorted runtime estimates arose from the doctored 'TEST' workunits, some of which only ran for one minute while still carrying a <rsc_fpops_est> more appropriate for 10 hours. So long as any of those remain in the system, we could get recurrences, whichever version of the Beta app is deployed at the time a task for the WU is issued to a volunteer. On my Beta host, it looks as if estimates for v8.11 are thoroughly borked; I suspect they will be for all active participants. If anyone still has any tasks issued with that version, they may have problems running them, and if they get aborted and re-generated by BOINC (i.e., if there are any problems with the WU or task cancellation on the server), then the later Beta versions may get 'poisoned' too. But for the time being, v8.12 and v8.13 look clean for me. @ Matt - don't feel bad about not understanding APR. *Nobody* understands APR and everything that lies behind it. Except possibly David Anderson (who wrote it), and we're not even sure about him. Grown men (and women) have wept when they tried to walk the code... Best to wait and watch, I think, and see if the issues clear themselves up as the Beta queue tasks pass through the system and into oblivion.
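	The chain from a poisoned app version to absurd estimates can be sketched as follows. This is only an illustration under stated assumptions: that the client's estimate is roughly <rsc_fpops_est> divided by the speed the server believes in, multiplied by the project's DCF, and that the DCF climbs to about actual/estimated when a task badly overruns. The function name and FPOPS_EST are hypothetical; the APR values of 100 and 16000 are the ones quoted earlier in the thread.

```python
def estimated_hours(rsc_fpops_est: float, apr_gflops: float, dcf: float = 1.0) -> float:
    """Assumed client-side estimate: claimed work / believed speed, scaled by DCF."""
    return rsc_fpops_est / (apr_gflops * 1e9) * dcf / 3600


# Hypothetical fixed estimate attached to every workunit (not a real GPUGrid value).
FPOPS_EST = 5e15

sane = estimated_hours(FPOPS_EST, 100)        # app version with a believable APR
poisoned = estimated_hours(FPOPS_EST, 16000)  # app version whose APR was inflated

# A task on the poisoned version that really takes the "sane" time overruns by this ratio.
overrun = sane / poisoned

# Assumption: the project-wide DCF rises to about that ratio, so the next task
# on a healthy app version is over-estimated by the same factor.
inflated = estimated_hours(FPOPS_EST, 100, dcf=overrun)

print(f"sane estimate: {sane:.1f} h")
print(f"poisoned estimate: {poisoned:.2f} h")
print(f"overrun ratio (new DCF): {overrun:.0f}")
print(f"estimate shown afterwards: {inflated:.0f} h")
```

	Under the same assumed formula, the 848-hour estimate reported above at a DCF of 96 would correspond to a raw estimate of about 848 / 96, or roughly 8.8 hours.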
	ID: 32794 \| Rating: 0 \| rate: / Reply Quote

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32795 - Posted: 6 Sep 2013 \| 13:19:30 UTC - in response to Message 32792. Last modified: 6 Sep 2013 \| 13:23:15 UTC
	"Thanks Jacob. In fact, our value for rsc_fpops_est has never changed: it is set the same for every WU we have ever put out on every queue. Evidently the mechanisms that depend on it being accurate had adjusted to accept our typical WU lengths as correct, but got very confused when I put out a whole batch of very short WUs with the same fpops estimate. MJH" Right, I get that. But when that happens, it "ruins" the APR for the given app-version. So I guess what I was getting at is: which beta app-version is the first one that couldn't possibly have been ruined? It's okay if you don't have an answer. Edit: It sounds as if Richard is saying that a task could get reissued into an app-version and poison it, so I guess my question is a bit invalid. Anyway, after processing that 8.11 task and seeing the DCF and estimates jacked, I (again) closed BOINC, edited my client_state.xml file to reset the DCF, and restarted BOINC. Hopefully I don't get any more tasks that ruin app-versions. Thanks, Jacob
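	For anyone wanting to script the DCF reset described above, here is a minimal sketch. It assumes client_state.xml keeps each project in a <project> block containing <master_url> and <duration_correction_factor> elements; check your own file before relying on that. The function reset_dcf and the sample text are hypothetical, and the code deliberately edits the text in place rather than re-serialising the XML, so the rest of the file is left untouched. As in the post, BOINC should be fully shut down first, and keeping a backup copy of the file is sensible.

```python
import re


def reset_dcf(state_text: str, project_url: str, new_dcf: float = 1.0) -> str:
    """Return state_text with the DCF of the matching <project> block replaced."""
    project_re = re.compile(r"<project>.*?</project>", re.DOTALL)
    dcf_re = re.compile(
        r"(<duration_correction_factor>)[^<]*(</duration_correction_factor>)"
    )

    def fix_block(match):
        block = match.group(0)
        if project_url not in block:
            return block  # a different project: leave it untouched
        # Rewrite only the number between the two tags.
        return dcf_re.sub(
            lambda m: f"{m.group(1)}{new_dcf:.6f}{m.group(2)}", block
        )

    return project_re.sub(fix_block, state_text)


# Hypothetical excerpt standing in for a real client_state.xml.
sample = """<client_state>
<project>
    <master_url>http://www.gpugrid.net/</master_url>
    <duration_correction_factor>96.000000</duration_correction_factor>
</project>
<project>
    <master_url>http://example.org/other/</master_url>
    <duration_correction_factor>1.250000</duration_correction_factor>
</project>
</client_state>
"""

fixed = reset_dcf(sample, "gpugrid.net")
print(fixed)
```

	To apply it to a real file, read the file into a string, pass it through reset_dcf, and write the result back while the client is not running.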
	ID: 32795 \| Rating: 0 \| rate: / Reply Quote

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 32796 - Posted: 6 Sep 2013 \| 13:24:24 UTC - in response to Message 32795.
	"So, I guess what I was getting at is: Which beta app-version is the first one that couldn't possibly have been ruined? It's okay if you don't have an answer." If I understand what you are saying, it must be 8.12, since that was the first to do only CRASH WUs. MJH
	ID: 32796 \| Rating: 0 \| rate: / Reply Quote

Jacob Klein Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0 Level Scientific publications	Message 32797 - Posted: 6 Sep 2013 \| 13:25:38 UTC - in response to Message 32796.
	Thank you Matt and Richard.
	ID: 32797 \| Rating: 0 \| rate: / Reply Quote

Damaraland Send message Joined: 7 Nov 09 Posts: 152 Credit: 16,181,924 RAC: 0 Level Scientific publications	Message 33944 - Posted: 20 Nov 2013 \| 19:48:41 UTC
	I'm not sure whether you still want this info, or whether you need something more precise. GPU 0: CUDA: NVIDIA GPU 0: GeForce GTX 260 (driver version 331.65, CUDA version 6.0, compute capability 1.3, 896MB, 818MB available, 912 GFLOPS peak), running ACEMD beta version v8.15 (cuda55), task 77-KLAUDE_6429-0-2-RND1641_1. It is expected to finish in 22h and is 83% processed so far, which looks right. But on the other GPU I have a task whose time estimate is badly off: CUDA: NVIDIA GPU 1: GeForce GTX 560 Ti (driver version 331.65, CUDA version 6.0, compute capability 2.1, 1024MB, 847MB available, 1352 GFLOPS peak), running Long runs (8-12 hours on fastest card) v8.14 (cuda55). Progress is 43% after 7h, with 58h remaining!! I don't really care about estimated times; I'm posting this just in case it is of some help.
	ID: 33944 \| Rating: 0 \| rate: / Reply Quote
