Message boards : Number crunching : Is it time to have a conversation about...
Author | Message |
---|---|
...the very high failure rates on some WU's and what I would consider high failure rates on most WU's? | |
ID: 42028 | Rating: 0 | |
Now that the error rates for different work units are posted in the Server Status page, I have to say that I'm totally shocked. | |
ID: 42359 | Rating: 0 | |
> Now that the error rates for different work units are posted in the Server Status page, I have to say that I'm totally shocked.

These data are not "normalized", so those hosts which fail every single workunit are included, which makes the numbers look worse than they really are. Failing a workunit usually takes much less time (less than a minute) than completing it (4~24+ hours), so these numbers start from a 100% failure rate, then get lower.

> I've had a couple of posts up trying to minimize errors on my home PCs, and if the numbers are true, I'm one of the lucky ones. My error rate is closer to one in twenty.

I'm having a similar error rate.

> If we can get 20% more FLOPS just by increasing system stability, then that's gotta be a top priority.

Those who care usually read this forum and are guided to take proper actions, but there are some who don't. This is a volunteer project, so there's no time & resources to find & warn every single contributor with failing hosts, but everybody's welcome to do that.

> I can't speak for everyone, but my cards have not been overclocked past the factory default...

Factory default is usually safe, but it's not a guarantee of error-free operation. Also, if overclocking is done carefully, it won't harm stability.

> ...and I've always found GPUGrid programs to be weirdly fragile.

They are, as they use the GPU in a different way than other apps, so it has to be stable all the time.

> Power outages will render a WU invalid.

Power outages can corrupt the checkpoint, as the data is not written to the HDD/SSD before the power goes off. Turning off delayed writes could resolve this, but it will also decrease overall system responsiveness, and reduce HDD/SSD lifetime a little.

> Windows automatic updates will render a WU invalid.

I think that could only happen when a GPU driver update comes with the Windows updates; otherwise only the system restart could harm the WU, but I don't have restart-related WU failures.

> One work unit crashing immediately on one GPU will somehow spill over and the other GPU will show up invalid.

I had such experiences on one of my hosts. It turned out that the 12V power pins on the 24-pin MB connector were burnt. It's recommended to use MBs with extra 12V connectors (Molex, SATA, or PCIe) for multiple GPU configurations.

> Sometimes work units will restart, showing 8 hours of computing but start at 0% again. After an extended time, those will be invalid.

I've had similar experiences. These workunits get stuck at some point, and when you restart them (or the system), they begin from 0%. When I catch them, I abort them.

> This doesn't happen with my other projects.

Other projects don't utilize the GPU to such an extent as GPUGrid does. But the power failure issue could happen on other projects too (if they do checkpoints too frequently, or don't do them at all, like CPDN). | |
ID: 42376 | Rating: 0 | |
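The point above about checkpoints being lost because data had not reached the disk when the power went off can be illustrated from the application side. The following is only a minimal sketch of one common technique (write to a temporary file, force it to the device, then rename it over the old checkpoint); nothing in this thread says GPUGrid's apps work this way, and the function name `write_checkpoint_atomically` and the file names are made up for illustration.

```python
import os
import tempfile


def write_checkpoint_atomically(path, data):
    """Hypothetical helper: replace `path` with `data` (bytes) in one step.

    After a crash, `path` should hold either the previous checkpoint or the
    new one, not a half-written mix of both.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the final rename
    # stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Ask the OS to push its cached data for this file to the device
            # instead of leaving it in a delayed-write buffer.
            os.fsync(tmp.fileno())
        # Swap the finished file into place under the real name.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Forcing every checkpoint to disk like this costs some I/O time on each write, which is the same kind of trade-off as the one mentioned above for turning off delayed writes system-wide.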
>> I've had a couple of posts up trying to minimize errors on my home PCs, and if the numbers are true, I'm one of the lucky ones. My error rate is closer to one in twenty.
>
> I'm having a similar error rate.

Wow. That is much too high. One of my systems went to 5% once so I took it apart and cleaned out the clogged fans. Anything above 1-2% is way too high. | |
ID: 42389 | Rating: 0 | |
Most of the time when I see a GPUGRID work unit fail it is due to one of the servers crashing. Sometimes it happens when BOINC crashes. | |
ID: 42394 | Rating: 0 | |
> Those who care usually read this forum and are guided to take proper actions, but there are some who don't. This is a volunteer project, so there's no time & resources to find & warn every single contributor with failing hosts, but everybody's welcome to do that.

That's a good idea. I can't afford more GPUs right now, but if I can help reactivate a few hosts, perhaps I can be of assistance. Retvari, I've just combed through my work unit list to see which tasks had failed with other users. I've sent out a few busybody notes to users with 100% error rates encouraging them to check the forums. You know hardware well beyond my abilities, will you keep an eye on the forums over the holidays? | |
ID: 42471 | Rating: 0 | |
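Combing through a task list for hosts with 100% error rates can be sketched in a few lines. This is a rough illustration only: the input format (pairs of host id and a success flag), the function name `summarize_hosts` and the `min_tasks` cutoff are all assumptions, not anything exported by the GPUGrid or BOINC servers. It also shows how much the overall failure rate changes once hosts that fail everything are left out.

```python
from collections import defaultdict


def summarize_hosts(tasks, min_tasks=5):
    """tasks: iterable of (host_id, succeeded) pairs (hypothetical format).

    Returns (always_failing_hosts, raw_failure_rate, filtered_failure_rate),
    where the filtered rate ignores hosts that failed every task.
    """
    per_host = defaultdict(lambda: [0, 0])  # host_id -> [succeeded, failed]
    for host_id, succeeded in tasks:
        per_host[host_id][0 if succeeded else 1] += 1

    # Hosts with no successes at all and enough tasks to rule out bad luck.
    always_failing = sorted(
        host for host, (ok, bad) in per_host.items()
        if ok == 0 and bad >= min_tasks
    )

    total = sum(ok + bad for ok, bad in per_host.values())
    failed = sum(bad for _, bad in per_host.values())
    raw_rate = failed / total if total else 0.0

    kept = [counts for host, counts in per_host.items()
            if host not in always_failing]
    kept_total = sum(ok + bad for ok, bad in kept)
    kept_failed = sum(bad for _, bad in kept)
    filtered_rate = kept_failed / kept_total if kept_total else 0.0

    return always_failing, raw_rate, filtered_rate


# Made-up example: host "A" fails one task in twenty, host "B" fails everything.
example_tasks = (
    [("A", True)] * 19 + [("A", False)]
    + [("B", False)] * 20
)
hosts, raw, filtered = summarize_hosts(example_tasks)
```

With this made-up data, the raw failure rate is 21 of 40 tasks (52.5%), while the rate without host "B" is 1 of 20 (5%), and "B" is the only host flagged for a note.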
> I've sent out a few busybody notes to users with 100% error rates encouraging them to check the forums. You know hardware well beyond my abilities, will you keep an eye on the forums over the holidays?

I will. However, I think there won't be much need for my expertise during the holidays, as I don't share your optimism that this time of the year is the best for turning "careless" volunteers into careful ones :) I sent some warning messages in the past, but I received very few responses. | |
ID: 42474 | Rating: 0 | |
Of about fifteen notes, I got one reply. A contributor was able to fix their problem. | |
ID: 42485 | Rating: 0 | |