Message boards : Number crunching : NOELIA tasks - when suspended or exited, often crash drivers
Devs, | |
ID: 29318 | Rating: 0 | rate: / Reply Quote | |
Thanks. | |
ID: 29319 | Rating: 0 | rate: / Reply Quote | |
> ... GTX 660 Ti (usually runs 2 GPUGrid tasks)

How do you get a GTX 660 TI to run TWO GPUGrid tasks?? ____________ | |
ID: 29629 | Rating: 0 | rate: / Reply Quote | |
> How do you get a GTX 660 TI to run TWO GPUGrid tasks??

You use an app_config.xml file. I'd recommend doing plenty of research beforehand, though, using the following links:

http://boinc.berkeley.edu/wiki/Client_configuration#Application_configuration
http://www.gpugrid.net/forum_thread.php?id=3319
http://www.gpugrid.net/forum_thread.php?id=3331

And if you happen to notice any tasks completing immediately while still granting credit, which is a bug we're still tracking down, then please discontinue the use of the app_config.xml file and post your results/info here:

http://www.gpugrid.net/forum_thread.php?id=3332

Regards, Jacob | |
ID: 29630 | Rating: 0 | rate: / Reply Quote | |
> How do you get a GTX 660 TI to run TWO GPUGrid tasks??

Blimey!! I'm on the case!!! Tom ____________ | |
ID: 29631 | Rating: 0 | rate: / Reply Quote | |
It often crashes the NVIDIA driver and leads to Computation Errors on tasks running across all GPUs, causing me to lose work, even from other projects. | |
ID: 29676 | Rating: 0 | rate: / Reply Quote | |
GPUGrid.net Devs: | |
ID: 30033 | Rating: 0 | rate: / Reply Quote | |
I have this exact same problem. My specs are | |
ID: 30035 | Rating: 0 | rate: / Reply Quote | |
Today I replaced my GTX 460 with a GTX 660. My first WU is a Noelia, which looks like it will complete in 12 hours; 25% done in three hours. Much better! | |
ID: 30290 | Rating: 0 | rate: / Reply Quote | |
> Today I replaced my GTX 460 with a GTX 660. My first WU is a Noelia, which looks like it will complete in 12 hours; 25% done in three hours. Much better!

You didn't say what you have in your app config, just "posted here". I don't see it in this thread at least. Apparently you're trying to run 2 WUs concurrently. If so, they won't make the 24 hour deadline. The new NATHANS are even longer. Are you trying to increase your credit? Even if they run without problem, you will end up with lower credit than running 1X on a GTX 660. | |
ID: 30292 | Rating: 0 | rate: / Reply Quote | |
> You didn't say what you have in your app config, just "posted here". I don't see it in this thread at least.

Thank you for responding. You're right. In this thread there is only a pointer to another thread. Sorry for the confusion.

> Apparently you're trying to run 2 WUs concurrently. If so, they won't make the 24 hour deadline. The new NATHANS are even longer. Are you trying to increase your credit? Even if they run without problem, you will end up with lower credit than running 1X on a GTX 660.

Ah! That's not what I had understood: that 50% + 50% = 100% but no bonuses... I just wonder why there has been so much kerfuffle here over a 'feature' (2x) that benefits no-one. Whatever, if only for the challenge I'd like to give 2x a try. Can you tell me why it does not work with the .xml file below? Thanks.

<app_config>
  <app>
    <name>acemdlong</name>
    <max_concurrent>9999</max_concurrent>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>0.001</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemd2</name>
    <max_concurrent>9999</max_concurrent>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>0.001</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemdshort</name>
    <max_concurrent>9999</max_concurrent>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>0.001</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
____________ | |
ID: 30299 | Rating: 0 | rate: / Reply Quote | |
> Apparently you're trying to run 2 WUs concurrently. If so, they won't make the 24 hour deadline. The new NATHANS are even longer. Are you trying to increase your credit? Even if they run without problem, you will end up with lower credit than running 1X on a GTX 660.

Jacob was running 2X on his 660 Ti with 3GB on the MUCH shorter NATHAN WUs that are now, unfortunately, gone. If your GPU won't make the 24hr deadline (including DL, UL & reporting time), then you will miss the 24hr bonus and your credit will take a significant hit. That's even if everything runs optimally: errors are likely to be more frequent. Running 1X should be better for the project too, as the time from WU generation to WU completion will most likely be less, an issue here. | |
ID: 30302 | Rating: 0 | rate: / Reply Quote | |
tomba, the <gpu_usage>1</gpu_usage> statement tells BOINC what fraction of a GPU to use for each task. Currently it is set to use 1 full GPU for each task, so only 1 task will run on each GPU at a time. If you set it to <gpu_usage>0.5</gpu_usage>, that tells BOINC to use half of a GPU for each task, which allows 2 tasks to run on each GPU (see the sketch below). Be sure to post your test results so we can see if it helped. | |
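A minimal sketch of what the acemdlong section from the file posted above would look like with that change (the other <app> sections would be edited the same way; only the <gpu_usage> value differs):

<app_config>
  <app>
    <name>acemdlong</name>
    <max_concurrent>9999</max_concurrent>
    <gpu_versions>
      <!-- 0.5 GPUs per task, so two tasks share one GPU -->
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>0.001</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

After saving the edited app_config.xml in the GPUGrid project folder, tell the BOINC client to re-read its config files (or restart it) so the change takes effect.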
ID: 30304 | Rating: 0 | rate: / Reply Quote | |
> tomba,

Wow! That works! Thank you!! I'm now running two Noelias, and below are the pre- and post-2x results from my GPU Monitor gadget. I will certainly report back on results! Many thanks. ____________ | |
ID: 30305 | Rating: 0 | rate: / Reply Quote | |
> I will certainly report back on results!

It's very early days but, for both running Noelia WUs, the "Remaining (estimated)" time is counting down much faster than one second per second... ____________ | |
ID: 30307 | Rating: 0 | rate: / Reply Quote | |
tomba, | |
ID: 30310 | Rating: 0 | rate: / Reply Quote | |
Developers: Devs, | |
ID: 30375 | Rating: 0 | rate: / Reply Quote | |
I will forward it. From a quick forum search it seems to be W7/W8 and driver related. So there might not be much we can do. Are you certain it only happens with Noelias and no other WUs? | |
ID: 30405 | Rating: 0 | rate: / Reply Quote | |
> I will forward it. From a quick forum search it seems to be W7/W8 and driver related. So there might not be much we can do. Are you certain it only happens with Noelias and no other WUs?

The issue happens sometimes when GPU tasks are suspended. This means it will hopefully be easy for you guys to reproduce. I believe I've only seen the problem on NOELIA tasks. For reference, I'm using Windows 8 x64, with the new v320.18 WHQL drivers. It should be a matter of letting the task run for some time (15 seconds), then suspending it... then just keep doing that several times, and hopefully you'll see the problem after a few tries. I'd be curious to know if you (or anyone at GPUGrid) can reproduce it? Thanks, Jacob | |
ID: 30406 | Rating: 0 | rate: / Reply Quote | |
Guys, | |
ID: 30418 | Rating: 0 | rate: / Reply Quote | |
> Guys,

Hello Matt,

The crash is on suspend. I've seen it happen when:
- I click "Activity -> Suspend GPU"
- I right-click the tray icon and choose "Snooze GPU"
- I manually suspend the task by clicking the task's "Suspend" button in BOINC
- BOINC suspends work because I started an app that is configured as an <exclusive_app> in my cc_config.xml file (see the sketch below)

I do use the "Leave applications in memory while suspended" setting, so I never lose my CPU tasks' work, and I don't believe that option affects the GPU tasks. However, next time I get a NOELIA task, I will try testing with that option off. Have you been able to reproduce the issue? | |
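For reference, a minimal sketch of that exclusive-app setup in cc_config.xml (the executable name somegame.exe is just a hypothetical placeholder; the file lives in the BOINC data directory and the client picks it up after being told to re-read its config files or after a restart):

<cc_config>
  <options>
    <!-- suspend all BOINC computing while this program is running -->
    <exclusive_app>somegame.exe</exclusive_app>
    <!-- or, to suspend only GPU computing: -->
    <exclusive_gpu_app>somegame.exe</exclusive_gpu_app>
  </options>
</cc_config>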
ID: 30419 | Rating: 0 | rate: / Reply Quote | |
Starting to get a few funky NOELIA tasks as well. GTX 580 + GTX 670, win7 x64, | |
ID: 30439 | Rating: 0 | rate: / Reply Quote | |
JugNut, | |
ID: 30444 | Rating: 0 | rate: / Reply Quote | |
For me the driver restart happens every time with the current Noelias (or rather: I have not observed it not happening) but not with other WUs. Win 8, drivers 320.18 and 314.22 (the last 2 WHQLs), "leave apps in memory" active (though I read it doesn't apply to GPUs, as it would be far too risky to leave something dirty or to run out of memory), and a GTX 660 Ti (Kepler). | |
ID: 30497 | Rating: 0 | rate: / Reply Quote | |
> For me the driver restart happens every time with the current Noelias (...) but not with other WUs.

Try a test with the watchdog disabled. Seems to be working for me. I'm also running XP, but I don't know if that has anything to do with it. | |
ID: 30516 | Rating: 0 | rate: / Reply Quote | |
Driver restart capability was introduced to mainstream desktops with the release of Vista, through the Windows Display Driver Model (WDDM). Driver restarting is unlikely to be an issue in XP, as the display driver architecture was very different. | |
ID: 30519 | Rating: 0 | rate: / Reply Quote | |
To answer your questions from earlier, Jacob: we have not been able to reproduce the error, unfortunately. We only have one box running right now testing Windows 7, and we have not received a NOELIA, since those tasks are now dwindling in number. We of course don't doubt that the problem is real, considering how many people have confirmed it; we just haven't been able to troubleshoot it yet. Even so, if it is being caused by the driver watchdog or some other Windows bug, we might not be able to do much about it. It will be interesting to see if it still occurs with the watchdog disabled. That should tell us a lot. | |
ID: 30522 | Rating: 0 | rate: / Reply Quote | |
Thank you for responding, Nate. | |
ID: 30523 | Rating: 0 | rate: / Reply Quote | |
Yes, we do test the WUs, but unfortunately (at the moment) we test them locally on our Linux machines rather than through GPUGrid, so we don't catch such problems. | |
ID: 30524 | Rating: 0 | rate: / Reply Quote | |
I'd also suggest setting up a test box with several OSes. Of course you can't test everything in advance, but if problems are reported under specific configurations you could react more quickly by just booting into an affected OS. | |
ID: 30549 | Rating: 0 | rate: / Reply Quote | |
NOELIA tasks are still very much an issue for me. | |
ID: 31557 | Rating: 0 | rate: / Reply Quote | |
Do not suspend, don't exit BOINC, haha. ;) | |
ID: 31672 | Rating: 0 | rate: / Reply Quote | |
Since I cannot trust any GPUGrid.net units to shutdown gracefully anymore, here has been my workaround: | |
ID: 31849 | Rating: 0 | rate: / Reply Quote | |
Unfortunately no :( There is really no time right now to focus on this. I understand it is quite a big problem and we are aware of it. | |
ID: 31862 | Rating: 0 | rate: / Reply Quote | |
> This is VERY UNFORTUNATE that I have to do this tedious workaround any time I want to use my GPU.

It's a Windows problem, not a GPUGrid problem. ____________ | |
ID: 31863 | Rating: 0 | rate: / Reply Quote | |
Hm, good to know. | |
ID: 31864 | Rating: 0 | rate: / Reply Quote | |
> Hm, good to know.

It can be a problem with any GPU project and many games. http://msdn.microsoft.com/en-us/library/windows/hardware/ff553893%28v=vs.85%29.aspx

I've posted a fix that worked for me in a couple of different places on this site, but I'll post it here again for anyone to try. My fix goes a little farther than the Windows suggestion, and it works for me. I can suspend and restart tasks, reboot the computer with tasks running, even do a hard shutdown and restart. The tasks always restart from where they were, with no errors or driver timeouts/restarts. YMMV, but this has worked well for me on ATI and Nvidia cards.

Copy and paste the entire code below (including the "Windows Registry Editor Version 5.00" part) into Notepad. Save it as "timeout fix.reg", or any other name you like as long as it ends with the .reg extension. After saving it, right-click on it and open it with Registry Editor. You'll get warnings about editing the registry; just click Yes and the keys will be added to your registry. Reboot and you should be good to go. This should stop the "driver has stopped responding" messages and the errors on WUs when the driver restarts. It will not affect anything else in the registry if it doesn't work.

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog]
"DisableBugCheck"="1"

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display]
"EaRecovery"="0" | |
ID: 31872 | Rating: 0 | rate: / Reply Quote | |
It's a kind-of-Windows problem which gets triggered by Noelia tasks, but certainly not by the "good old" trouble-free Nathans, and I don't think by the Santi SRs either, which I'm running now. | |
ID: 31883 | Rating: 0 | rate: / Reply Quote | |
Interesting: TJ is reporting no driver reset problems with the current Noelias over there. | |
ID: 31920 | Rating: 0 | rate: / Reply Quote | |
We have turned down priority on Noelia tasks. You should get less and less until she gets back. | |
ID: 31978 | Rating: 0 | rate: / Reply Quote | |
I greatly appreciate the stability my machine has had over the past couple weeks, due to not suspending any NOELIA tasks. | |
ID: 32317 | Rating: 0 | rate: / Reply Quote | |
Did you try nanoprobe's suggested fix? | |
ID: 32327 | Rating: 0 | rate: / Reply Quote | |
His suggested fix is to disable TDR, which I use for games and for other GPU applications. I rely on it. So, no, I didn't try it. | |
ID: 32329 | Rating: 0 | rate: / Reply Quote | |
Even with a 20-second registry-configured delay, Noelia WUs still trigger a driver restart when suspended, when changing apps, on CPU/BOINC Snooze, or when closing BOINC. | |
ID: 32334 | Rating: 0 | rate: / Reply Quote | |
Strange, that's never happened to me and I've suspended them dozens of times. | |
ID: 32340 | Rating: 0 | rate: / Reply Quote | |
> His suggested fix is to disable TDR, which I use for games and for other GPU applications. I rely on it. So, no, I didn't try it.

Same here: the watchdog saves me from real GPU errors often enough that I don't want to disable it. MrS ____________ Scanning for our furry friends since Jan 2002 | |
ID: 32408 | Rating: 0 | rate: / Reply Quote | |
I don't disable it either; I use a 20 sec delay (but I don't game). I've had numerous experiences where the mouse pointer freezes for a few seconds and then everything is as it was (without a driver restart and without WUs crashing). Prior to using the delay I had numerous crash-the-driver experiences! | |
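For anyone wanting to try the same thing, a minimal .reg sketch of such a delay, assuming the standard TDR registry values documented in the MSDN link posted earlier (TdrDelay is the watchdog timeout in seconds; save as a .reg file, merge it, and reboot):

Windows Registry Editor Version 5.00

; raise the GPU watchdog timeout from the default 2 seconds to 20 seconds
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:00000014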
ID: 32416 | Rating: 0 | rate: / Reply Quote | |
The next beta will have additional critical section locking that will hopefully mitigate this problem. | |
ID: 32458 | Rating: 0 | rate: / Reply Quote | |
Thank you a million times over for setting aside some time to solve this. | |
ID: 32463 | Rating: 0 | rate: / Reply Quote | |
Try out 8.02. Give it a damn good suspending. | |
ID: 32474 | Rating: 0 | rate: / Reply Quote | |
> Try out 8.02. Give it a damn good suspending.

Awesome - initial testing looks very promising! I cannot immediately make it crash. I will do more testing (especially with the exclusive-app logic that suspends tasks) later tonight.

Edit: I may have been able to make it still crash. Will test more later.

What did you change/fix? I'm a developer, and am very curious about what the change was. Also, is it a change that could improve exit logic for non-NOELIA tasks? | |
ID: 32478 | Rating: 0 | rate: / Reply Quote | |
The problem stems from BOINC killing off the process while a GPU operation is underway. The fix is to add BOINC critical-section locking around GPU operations. In the old app, not all GPU operations were so locked. http://boinc.berkeley.edu/trac/wiki/BasicApi There may be other circumstances under which a driver hang can be induced, but this should substantially reduce the incidence rate. It'll be good for all WUs. Indeed, it's not obvious why those poor NOELIAs always took the brunt of it. MJH | |
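For the curious, a minimal C++ sketch of that pattern using the BOINC API described at the link above (boinc_begin_critical_section/boinc_end_critical_section are the real API calls; launch_gpu_step() is just a hypothetical stand-in for the app's actual GPU work):

#include "boinc_api.h"

// Hypothetical stand-in for one chunk of GPU work (kernel launches, memcpys, ...).
static void launch_gpu_step() { /* CUDA/OpenCL calls would go here */ }

int main() {
    boinc_init();
    for (int step = 0; step < 100; ++step) {
        // While inside a critical section, the BOINC client will not suspend
        // or kill the process, so the GPU operation can't be cut off mid-flight.
        boinc_begin_critical_section();
        launch_gpu_step();
        boinc_end_critical_section();

        // Suspend/exit requests are then honoured here, between GPU operations.
        boinc_fraction_done((step + 1) / 100.0);
    }
    boinc_finish(0);
    return 0;
}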
ID: 32480 | Rating: 0 | rate: / Reply Quote | |
Hey MJH, glad to have you back! The project feels alive again.. thanks! | |
ID: 32494 | Rating: 0 | rate: / Reply Quote | |
> Try out 8.02. Give it a damn good suspending.

8.04 KLEBE tasks are still causing driver resets :( My scenario is that I have 2 of them running - 1 on my GTX 460 and 1 on my GTX 660 Ti - and I'm choosing "Suspend GPU" from the system tray. Can you please see if you need to add any more critical-section mutexes? Thanks, Jacob | |
ID: 32573 | Rating: 0 | rate: / Reply Quote | |
As frequently as before? MJH | |
ID: 32584 | Rating: 0 | rate: / Reply Quote | |
Frequency is quite hard to conclusively prove. | |
ID: 32585 | Rating: 0 | rate: / Reply Quote | |
I've got 2 WUs which I can't start anymore, because as soon as I resume them the NVIDIA driver crashes. | |
ID: 33577 | Rating: 0 | rate: / Reply Quote | |
Hype, | |
ID: 33578 | Rating: 0 | rate: / Reply Quote | |
Unclear if these are crashing the driver; there is no message saying that it has. | |
ID: 33596 | Rating: 0 | rate: / Reply Quote | |
More of the same type of problem, on a different NOELIA task. | |
ID: 33598 | Rating: 0 | rate: / Reply Quote | |
Could be a driver issue, try updating to the latest (beta) driver. | |
ID: 33601 | Rating: 0 | rate: / Reply Quote | |
> Unclear if these are crashing the driver; there is no message saying that it has.

I couldn't find the beta test drivers. However, when the 331.58 and 331.65 drivers came out, I installed them. No such crashes under either of these. | |
ID: 33655 | Rating: 0 | rate: / Reply Quote | |
> Unclear if these are crashing the driver; there is no message saying that it has.

Correction - the 331.65 driver makes such crashes less frequent (perhaps every other day), but does not stop them completely. | |
ID: 33684 | Rating: 0 | rate: / Reply Quote | |
Crashes less frequent with the 331.65 driver, but not gone. GPU workunits for other BOINC projects still running properly. | |
ID: 33686 | Rating: 0 | rate: / Reply Quote | |
> some update that looks likely to fix this problem.

??? Explain, please ::grabs popcorn:: | |
ID: 33687 | Rating: 0 | rate: / Reply Quote | |
> some update that looks likely to fix this problem.

I'm waiting for an Nvidia driver release, a BOINC release, or a GPUGRID application release before I enable GPUGRID workunits on that computer again. | |
ID: 33785 | Rating: 0 | rate: / Reply Quote | |
> some update that looks likely to fix this problem.

I've now installed BOINC 7.2.28 on the computer with the problem. I tried installing the 331.82 Nvidia driver a few times; it never installed correctly. I'm back to the 331.65 driver. GPUGRID workunits have run properly on that computer for the last few days, with no more driver crashes. | |
ID: 33977 | Rating: 0 | rate: / Reply Quote | |
> some update that looks likely to fix this problem.

This wasn't enough to fully fix the problem; however, the driver crashes no longer crash Windows as well. I'll watch to see if the new crashes cause enough problems that I need to put GPUGRID back on "No new tasks" on that computer. | |
ID: 33993 | Rating: 0 | rate: / Reply Quote | |
I've been trying to crash NOELIA tasks on my 660 Ti by repeatedly (20x) suspending and resuming them, but I can't. This seems like a Windows bug to me. Suggest installing Linux to fix it, or a script that peeks at the names of the GPUGrid tasks in your cache every 60 secs and aborts them if they're NOELIA. Either way would be a path of lesser resistance leading to greater productivity and oneness with the Buddha ;) | |
ID: 34002 | Rating: 0 | rate: / Reply Quote | |
The 8.14 application version fixed the problem that this thread describes. It has been resolved for a while now. | |
ID: 34003 | Rating: 0 | rate: / Reply Quote | |
But I couldn't crash them with the previous version either. | |
ID: 34004 | Rating: 0 | rate: / Reply Quote | |
It was a Windows only fix. | |
ID: 34013 | Rating: 0 | rate: / Reply Quote | |
Message boards : Number crunching : NOELIA tasks - when suspended or exited, often crash drivers