Message boards : Number crunching : ADRIA extremely slow - not checkpointing
Something is wrong with ADRIA. It's not checkpointing. After 7 hours it was 99.5% done but had slowed to a crawl. I had to reboot, and it restarted from 0%.
ID: 56388
They run for a long time. My RTX 2070 took about 17 hours to complete. Just let it go.
ID: 56389
I just got my first WU and the clock is climbing. It's taken 1:40:00 for 6%; the estimated time is now at 26 h and climbing.
ID: 56396
Update: It reached 10% at 2h50m elapsed.
ID: 56397
I have one running on my power-limited RTX 2080 Ti.
ID: 56409
The progress indicator falls back to 0.666% after 4 minutes, then goes up in 0.666% increments roughly every 4 minutes, so it will take roughly 10 hours on that host. For the first minute, progress will be shown as zero. The next three minutes (variable according to the speed of your device) will be pseudo-progress invented by BOINC: no significance should be attached to the value shown. The 0.666% value and onwards will be genuine values, and can be used for estimating the final run time - as you have done. They show the points where the app writes a checkpoint file - I'm seeing "restart.chk" files being written in the slot directories. But the app doesn't appear to be able to detect or use these files to resume computation after a restart.
ID: 56412
Oh, I thought that the GPUGRID app was doing something strange. This must be a new feature; v7.9 showed 0.000%.
> The progress indicator falls back to 0.666% after 4 minutes, then goes up in 0.666% increments roughly every 4 minutes, so it will take roughly 10 hours on that host. The 0.666% value and onwards will be genuine values, and can be used for estimating the final run time - as you have done.
New estimation based on 20% done in 1 h 56 m: 9 h 40 m.
> They show the points where the app writes a checkpoint file - I'm seeing "restart.chk" files being written in the slot directories.
This is simply unacceptable behavior for such long workunits.
ID: 56418
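For anyone who wants to redo this estimate, the arithmetic is just a linear extrapolation from a genuine progress value; here is a minimal Python sketch (the function name is made up for illustration, and it is not part of BOINC or the GPUGRID app):

```python
# Minimal sketch (hypothetical helper): extrapolate the total run time from a
# genuine progress value, ignoring the initial pseudo-progress phase described
# in the posts above.

def estimate_total_runtime(progress_fraction, elapsed_seconds):
    """Assume progress advances linearly between checkpoints
    (observed here as ~0.666% roughly every 4 minutes)."""
    if progress_fraction <= 0:
        raise ValueError("need a genuine, non-zero progress value")
    return elapsed_seconds / progress_fraction

# Example from the post above: 20% done after 1 h 56 m -> ~9 h 40 m total.
elapsed = 1 * 3600 + 56 * 60                      # 6,960 s
total = estimate_total_runtime(0.20, elapsed)
print(f"estimated total: {total / 3600:.2f} h")   # ~9.67 h
```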
> The progress indicator falls back to 0.666% after 4 minutes, then goes up in 0.666% increments roughly every 4 minutes, so it will take roughly 10 hours on that host.
What about the wrapper_checkpoint.txt file in the slots? Is that the file used for checkpoint restarts with wrapper apps? Or is it always the restart.chk or restart.chk.bkp files that are used to restart a task?
ID: 56420
> What about the wrapper_checkpoint.txt file in the slots? Is that the file used for checkpoint restarts with wrapper apps?
I don't think this file is sufficient for that, as it is only 29 bytes long:
0 19978.890625 19938.000000
> Or is it always the restart.chk or restart.chk.bkp files that are used to restart a task?
I think these ones contain the real information (~9 MB).
ID: 56421
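If you want to watch these files yourself, something like the following Python sketch works; the slots path is an assumption for a default Linux BOINC install and will differ on other setups (on Windows it is typically C:\ProgramData\BOINC\slots):

```python
# Minimal sketch, assuming a default Linux BOINC install; the SLOTS path is an
# assumption. It lists the checkpoint files discussed above together with their
# sizes and how recently they were written.

import glob
import os
import time

SLOTS = "/var/lib/boinc-client/slots"   # assumed default location

for pattern in ("restart.chk*", "wrapper_checkpoint.txt"):
    for path in sorted(glob.glob(os.path.join(SLOTS, "*", pattern))):
        st = os.stat(path)
        age_min = (time.time() - st.st_mtime) / 60
        print(f"{path}: {st.st_size} bytes, last written {age_min:.1f} min ago")
```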
This task must have used the checkpoint files. It also was able to switch between devices without erroring out.
ID: 56424
> This task must have used the checkpoint files. It also was able to switch between devices without erroring out.
I confirm that. I restarted my Linux and Windows 10 hosts, and the GPUGRID app resumed just fine from the last checkpoint. These are single-GPU hosts, though.
ID: 56427
And here. One on a 1050 Ti was swapped out after 24 hours (that was enough in the good old days ...), but didn't lose the 35% done on restart.
ID: 56428
The one task I referenced that was able to switch and restart on a different device may have been a fluke or something.
ID: 56429
> The one task I referenced that was able to switch and restart on a different device may have been a fluke or something.
From what I recall, this issue of starting on a different device is quite intermittent and inconsistent. I always thought that a task would be fine if it restarted on the same device type (RTX 2070 -> RTX 2070, but different IDs) and that it only caused a problem if the hardware was significantly different (RTX 2070 -> GTX 1060, for example), but that doesn't seem to be the case. I've seen successful pickups and failed pickups in most situations, with no clear reason:
- sometimes it restarted on a different device ID and was fine
- sometimes it restarted on a different device ID and threw the error
- sometimes it got lucky and restarted on the same device ID and was fine
- sometimes it got lucky and restarted on the same device ID and still threw the error
Also from what I recall, I very rarely got failures from the MDAD tasks, but a much higher failure rate from restarts with PABLO tasks. I just try to never interrupt them anymore and let them run. I'll only interrupt running GPUGRID tasks now if it's due to something like a power outage.
ID: 56430
Up until this one outlier task, restarting on a different device type has always thrown an error for me. Two of my three hosts have mixed GPU types from different generations.
ID: 56431
Datapoint: my progress has been in 0.333% increments. (GTX 980 Ti)
ID: 56434
No luck so far with these new ADRIA WUs! Detected memory leaks! Does anybody have an idea what the cause might be? I moved this GPU from my Linux computer - never had a problem with the card there - to my child's gaming computer, where other projects run just fine; the child plays “Fortnite” on it once or twice a day and we switch it off at night. The Linux computer (http://www.gpugrid.net/results.php?hostid=523675) received the GTX 1660 Ti from the gaming computer and has produced two ERRORs so far. The cards are factory overclocked. Just a side question: how do you undervolt a GPU?
ID: 56446
AFAIK, there is no way to undervolt an Nvidia card.
ID: 56447
> AFAIK, there is no way to undervolt an Nvidia card.
*In Linux. You can in Windows with programs like MSI Afterburner.
ID: 56448
You can't undervolt in Linux directly, but you can lower the power limit if all you are trying to do is use less power.
ID: 56451
My first WU completed and validated. Finally.
ID: 56452
> You can't undervolt in Linux directly, but you can lower the power limit if all you are trying to do is use less power.
Yeah, I know about the power limiting; it's all you can do in Linux. But if you apply an overclock on top of the power limit, you can claw back some lost performance. You can probably do better on power for that 2060. I have a system with 2070s that I power limit to 150 W, and it completes tasks in about 60,000-61,000 s.
ID: 56453
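For reference, power limiting on Linux is usually done with nvidia-smi; here is a minimal Python wrapper as a sketch (the helper name and the 150 W figure are only illustrations taken from the post above, the command needs root, and the allowed range should be checked first with `nvidia-smi -q -d POWER`):

```python
# Hypothetical helper, not an official tool: shells out to nvidia-smi to cap
# board power on Linux. Requires root; the limit does not survive a reboot.

import subprocess

def set_power_limit(gpu_index: int, watts: int) -> None:
    """Set the software power cap for one GPU via nvidia-smi."""
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)],
        check=True,
    )

if __name__ == "__main__":
    set_power_limit(0, 150)   # e.g. cap GPU 0 at 150 W, as in the post above
```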
10% complete at 03:22:00. Oof. At least it's running. The prior attempt by another user errored out in a few seconds.
ID: 56454
I was able to exit/restart the client at 54% and it continued as expected. (Win64)
ID: 56463
And... scene.
ID: 56478
Be happy if they finish at all. Mine (3 at this time) ran between 5,000 and 90,000 sec on a GTX 1660 Super and all errored out. I have now aborted the last WU in progress...
ID: 56496
> Be happy if they finish at all. Mine (3 at this time) ran between 5,000 and 90,000 sec on a GTX 1660 Super and all errored out.
My 1660 Super takes about 96,000 seconds on Linux, maybe 100,000+ on Windows since the Windows app is slower. Of the 4 tasks on your 2-GPU Windows 10 host:
- 1 was aborted by the user.
- 1 failed due to a BOINC restart (or system restart) where it attempted to restart on a different device: it started on device 1, then tried to restart on device 0. This is a common and known situation that causes failures.
- 2 failed with “particle coordinate is nan” (nan = not a number). This commonly happens from too much overclock or from overheating.
GPUGRID tasks are quite intense, and these tasks are no exception. An overclock that's stable on another project or application can be unstable for GPUGRID. Try removing any overclock, make sure the GPU has sufficient airflow, and try to avoid restarting the system.
ID: 56497
Thanks Ian for the diagnostics! I have just reverted to the stock settings. Hope I'll catch a resend and can try again. Still weird, as all other apps work just fine with the mild OC setting. I've never had a single error thrown so far, except on MLC, but that's mostly due to the inherent nature of those tasks, where occasionally a WU results in a NaN error.
ID: 56498
A reduced number of a new kind of ADRIA task seems to be in the field.
ID: 56771
I also received a couple of these tasks, and concur that they complete in less time. My 2080 Ti completed them in about 9,000 s vs ~36,000 s for the longer-running tasks.
ID: 56772
In the case of my GTX 1650 and the mentioned task: 30,471 seconds versus ~170,000 seconds for the previous ADRIA tasks on this device.
ID: 56773