Message boards : Number crunching : ADRIA extremely slow - not checkpointing

GDB
Joined: 24 Oct 11
Posts: 4
Credit: 360,223,487
RAC: 158,597
Level
Asp
Message 56388 - Posted: 10 Feb 2021 | 13:53:32 UTC

Something is wrong with ADRIA: it's not checkpointing. After 7 hours it was 99.5% done but had slowed to a crawl. I had to reboot, and it restarted from 0%.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1067
Credit: 40,231,533,983
RAC: 4,653
Level
Trp
Message 56389 - Posted: 10 Feb 2021 | 14:34:59 UTC - in response to Message 56388.

They run for a long time. My RTX 2070 took about 17 hours to complete. Just let it go.

lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 1,141,837,359
RAC: 1,441,781
Level
Met
Message 56396 - Posted: 10 Feb 2021 | 22:59:36 UTC - in response to Message 56389.
Last modified: 10 Feb 2021 | 23:08:08 UTC

I just got my first WU and the clock is climbing. It's taken 1:40:00 for 6%; the estimated time is now at 26h and climbing.

GTX 980 Ti

lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 1,141,837,359
RAC: 1,441,781
Level
Met
Message 56397 - Posted: 11 Feb 2021 | 0:09:36 UTC - in response to Message 56396.

Update: It reached 10% at 2h50m elapsed.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Level
Trp
Message 56409 - Posted: 11 Feb 2021 | 11:33:54 UTC
Last modified: 11 Feb 2021 | 11:55:17 UTC

I have one running on my power-limited RTX 2080 Ti.
It's at 3.333% after 26m 30s, so it will take around 11h 45m.
Perhaps it will take a bit less than that, as the progress indicator restarted after a few minutes.
TBH this is the "usual" length of a long workunit (maybe 50% longer than usual).
EDIT:
I have another one running on another RTX 2080 Ti (not power limited, and on a Linux-based host).
The progress indicator falls back to 0.666% after 4 minutes, then goes up in 0.666% increments roughly every 4 minutes, so it will take roughly 10 hours on that host.
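
The extrapolation used in this post can be sketched in a few lines of Python. This is only an illustration, assuming progress is linear once the genuine values start; the function names are made up for the example and are not part of BOINC or the GPUGrid app.

```python
def estimate_total_seconds(elapsed_seconds: float, percent_done: float) -> float:
    """Linearly extrapolate the total run time from elapsed time and progress."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in the range (0, 100]")
    return elapsed_seconds * 100.0 / percent_done


def format_hm(seconds: float) -> str:
    """Render a duration as hours and minutes, rounded to the nearest minute."""
    total_minutes = round(seconds / 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


# One 0.666% step roughly every 4 minutes, as reported for the Linux host.
linux_host_total = estimate_total_seconds(4 * 60, 0.666)
print(format_hm(linux_host_total))  # roughly 10 hours
```

The same function works with any genuine progress reading: pass the elapsed seconds and the percentage shown at that moment.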

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1618
Credit: 8,437,794,351
RAC: 15,946,061
Level
Tyr
Message 56412 - Posted: 11 Feb 2021 | 12:11:26 UTC - in response to Message 56409.

The progress indicator falls back to 0.666% after 4 minutes, then goes up in 0.666% increments roughly every 4 minutes, so it will take roughly 10 hours on that host.

For the first minute, progress will be shown as zero. The next three minutes (variable according to the speed of your device) will be pseudo-progress invented by BOINC: no significance should be attached to the value shown.

The 0.666% value and onwards will be genuine values, and can be used for estimating the final run time - as you have done. They show the points where the app writes a checkpoint file - I'm seeing "restart.chk" files being written in the slot directories.

But the app doesn't appear to be able to detect or use these files to resume computation after a restart.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Level
Trp
Message 56418 - Posted: 11 Feb 2021 | 14:06:36 UTC - in response to Message 56412.

The progress indicator falls back to 0.666% after 4 minutes, then goes up in 0.666% increments roughly every 4 minutes, so it will take roughly 10 hours on that host.

For the first minute, progress will be shown as zero. The next three minutes (variable according to the speed of your device) will be pseudo-progress invented by BOINC: no significance should be attached to the value shown.
Oh, I thought the GPUGrid app was doing something strange. This must be a new feature; v7.9 showed 0.000%.

The 0.666% value and onwards will be genuine values, and can be used for estimating the final run time - as you have done.
New estimation based on 20% done in 1h 56m: 9h 40m

They show the points where the app writes a checkpoint file - I'm seeing "restart.chk" files being written in the slot directories.

But the app doesn't appear to be able to detect or use these files to resume computation after a restart.
This is simply unacceptable behavior for such long workunits.

Keith Myers
Joined: 13 Dec 17
Posts: 1335
Credit: 7,373,767,459
RAC: 13,419,957
Level
Tyr
Message 56420 - Posted: 11 Feb 2021 | 16:27:58 UTC - in response to Message 56412.

The progress indicator falls back to 0.666% after 4 minutes, then goes up in 0.666% increments roughly every 4 minutes, so it will take roughly 10 hours on that host.

For the first minute, progress will be shown as zero. The next three minutes (variable according to the speed of your device) will be pseudo-progress invented by BOINC: no significance should be attached to the value shown.

The 0.666% value and onwards will be genuine values, and can be used for estimating the final run time - as you have done. They show the points where the app writes a checkpoint file - I'm seeing "restart.chk" files being written in the slot directories.

But the app doesn't appear to be able to detect or use these files to resume computation after a restart.

What about the wrapper_checkpoint.txt file in the slots? Is that the file used to restart from a checkpoint with wrapper apps?

Or is it always the restart.chk or restart.chk.bkp files that are used to restart a task?

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Level
Trp
Message 56421 - Posted: 11 Feb 2021 | 16:43:15 UTC - in response to Message 56420.
Last modified: 11 Feb 2021 | 16:44:10 UTC

What about the wrapper_checkpoint.txt file in the slots? Is that the file used to restart from a checkpoint with wrapper apps?
I don't think this file is sufficient for that, as it is only 29 bytes long:
0 19978.890625 19938.000000

Or is it always the restart.chk or restart.chk.bkp files that are used to restart a task?
I think these are the ones that contain the real information (~9 MB).
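
To make the point about the 29-byte file concrete, here is a small Python sketch that parses the line quoted above. The field meanings (a completed-task counter, CPU seconds, elapsed seconds) are an assumption inferred from the values, not something confirmed in this thread, and the names `WrapperCheckpoint` and `parse_wrapper_checkpoint` are invented for the example.

```python
from typing import NamedTuple


class WrapperCheckpoint(NamedTuple):
    # Field meanings are assumed for illustration, not confirmed here.
    tasks_completed: int
    cpu_seconds: float
    elapsed_seconds: float


def parse_wrapper_checkpoint(text: str) -> WrapperCheckpoint:
    """Split the single whitespace-separated line into its three numbers."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    return WrapperCheckpoint(int(fields[0]), float(fields[1]), float(fields[2]))


# The exact line quoted in the post.
checkpoint = parse_wrapper_checkpoint("0 19978.890625 19938.000000")
print(checkpoint)
print(f"{checkpoint.elapsed_seconds / 3600:.1f} hours")
```

Three numbers can record how long the task has run, but they cannot hold simulation state, which fits the view that the much larger restart.chk files carry the real information.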

Keith Myers
Joined: 13 Dec 17
Posts: 1335
Credit: 7,373,767,459
RAC: 13,419,957
Level
Tyr
Message 56424 - Posted: 11 Feb 2021 | 18:11:45 UTC

This task must have used the checkpoint files. It also was able to switch between devices without erroring out.
https://www.gpugrid.net/result.php?resultid=32507970

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Level
Trp
Message 56427 - Posted: 11 Feb 2021 | 19:24:19 UTC - in response to Message 56424.

This task must have used the checkpoint files. It also was able to switch between devices without erroring out.
https://www.gpugrid.net/result.php?resultid=32507970

I confirm that. I restarted my Linux and Windows 10 hosts, and the GPUgrid app resumed just fine from the last checkpoint. These are single GPU hosts though.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1618
Credit: 8,437,794,351
RAC: 15,946,061
Level
Tyr
Message 56428 - Posted: 11 Feb 2021 | 19:38:00 UTC

And here. One on a 1050 Ti was swapped out after 24 hours (that was enough in the good old days ...), but didn't lose the 35% done on restart.

I do wish the devs could notify us when they've fixed reported problems. Toni?

Keith Myers
Joined: 13 Dec 17
Posts: 1335
Credit: 7,373,767,459
RAC: 13,419,957
Level
Tyr
Message 56429 - Posted: 11 Feb 2021 | 19:41:46 UTC - in response to Message 56428.

The one task I referenced that was able to switch and restart on a different device may have been a fluke or something.

I have another task that errored out for restarting on a different device already.
https://www.gpugrid.net/result.php?resultid=32509275

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1067
Credit: 40,231,533,983
RAC: 4,653
Level
Trp
Message 56430 - Posted: 11 Feb 2021 | 20:00:37 UTC - in response to Message 56429.

The one task I referenced that was able to switch and restart on a different device may have been a fluke or something.

I have another task that errored out for restarting on a different device already.
https://www.gpugrid.net/result.php?resultid=32509275


From what I recall, this "starting on a different device" issue is quite intermittent and inconsistent. I always thought that restarting on the same device type (RTX 2070 -> RTX 2070, but different IDs) would be fine, and that it only caused an issue if the hardware was significantly different (RTX 2070 -> GTX 1060, for example). But that doesn't seem to be the case, as I've seen successful pickups and failed pickups in most situations with no clear reason:

sometimes it restarted on a different device id and was fine
sometimes it restarted on a different device id and threw the error
sometimes it got lucky and restarted on the same device id and was fine
sometimes it got lucky and restarted on the same device id and still threw the error

Also, from what I recall, I very rarely got failures from the MDAD tasks, but a much higher failure rate from restarts with PABLO tasks. I just try never to interrupt them anymore and let them run. I'll only interrupt running GPUGRID tasks now if it's due to something like a power outage.

Keith Myers
Joined: 13 Dec 17
Posts: 1335
Credit: 7,373,767,459
RAC: 13,419,957
Level
Tyr
Message 56431 - Posted: 11 Feb 2021 | 20:10:55 UTC
Last modified: 11 Feb 2021 | 20:11:11 UTC

Up until this one outlier task, I have always gotten an error when restarting on a different device type. Two of my three hosts have mixed GPU types from different generations.

This daily driver has the same GPU type for all cards, RTX 2080 Hybrids.

I have never had a restart failure on any task on this host. Countless tasks have restarted on a different device ID and completed properly.

lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 1,141,837,359
RAC: 1,441,781
Level
Met
Message 56434 - Posted: 11 Feb 2021 | 20:32:19 UTC

Datapoint: My progress has been in 0.333% increments. (GTX 980 Ti)

klepel
Joined: 23 Dec 09
Posts: 189
Credit: 4,671,006,793
RAC: 2,493,917
Level
Arg
Message 56446 - Posted: 12 Feb 2021 | 0:46:06 UTC

No luck so far with these new ADRIA WUs!
This computer with an RTX 2070 (http://www.gpugrid.net/result.php?resultid=32508594) produced an ERROR:

Detected memory leaks!

Does anybody have an idea what the cause might be? I moved this GPU from my Linux computer, where I never had a problem with the card, to my child's gaming computer. Other projects run fine on it; my child plays “Fortnite” on it once or twice a day, and we switch it off at night.
The Linux computer (http://www.gpugrid.net/results.php?hostid=523675) received the GTX 1660 Ti from the gaming computer and has produced two ERRORs so far.
The cards are factory overclocked. Just a side question: how do you undervolt a GPU?

Keith Myers
Joined: 13 Dec 17
Posts: 1335
Credit: 7,373,767,459
RAC: 13,419,957
Level
Tyr
Message 56447 - Posted: 12 Feb 2021 | 1:16:03 UTC - in response to Message 56446.

AFAIK, there is no way to undervolt an Nvidia card.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1067
Credit: 40,231,533,983
RAC: 4,653
Level
Trp
Message 56448 - Posted: 12 Feb 2021 | 1:18:59 UTC - in response to Message 56447.

AFAIK, there is no way to undervolt an Nvidia card.


*in Linux

you can in Windows with programs like MSI Afterburner.

OCNfranz
Joined: 24 Nov 20
Posts: 1
Credit: 535,597,705
RAC: 6,689,739
Level
Lys
Message 56451 - Posted: 12 Feb 2021 | 2:38:20 UTC
Last modified: 12 Feb 2021 | 2:57:54 UTC

You can't undervolt in Linux directly, but you can lower the power limit if all you are trying to do is use less power.

check stock power limit with
nvidia-smi
~151W for my 1070
~170W for my 2060

set new power limit with
nvidia-smi --power-limit=***
*** is desired power limit in Watts
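
For anyone scripting this, the two steps above could be wrapped in Python roughly as follows. It is a sketch, not a tested tool: the helper names are invented for the example, the 150 W value and GPU index 0 are placeholders, and changing the limit normally needs root/administrator rights and a value within the range the driver allows for the card.

```python
import shutil
import subprocess
from typing import List, Optional


def query_power_command() -> List[str]:
    """Command that lists each GPU with its current and default power limit."""
    return [
        "nvidia-smi",
        "--query-gpu=index,name,power.limit,power.default_limit",
        "--format=csv",
    ]


def set_power_limit_command(watts: int, gpu_index: Optional[int] = None) -> List[str]:
    """Command that sets the power limit in watts, optionally for a single GPU."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]  # restrict the change to one GPU
    cmd.append(f"--power-limit={watts}")
    return cmd


def run(cmd: List[str]) -> str:
    """Run a command and return its output; fails if nvidia-smi is missing."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# Placeholder values: 150 W on GPU 0. Only prints the command; nothing is executed.
print(" ".join(set_power_limit_command(150, gpu_index=0)))
```

Calling `run(query_power_command())` on a machine with the NVIDIA driver installed would print the current limits before anything is changed.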

My 2060 is completing these tasks in around 69,500-70,000 seconds, or about 19.4 hours.


EDIT: I was brave and restarted my Win10 rig with one task at 54.667% complete, and it restarted the task at the same percentage. I will check it when it's done to see if there are any errors, but there is definitely a checkpoint being used.

lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 1,141,837,359
RAC: 1,441,781
Level
Met
Message 56452 - Posted: 12 Feb 2021 | 3:13:51 UTC
Last modified: 12 Feb 2021 | 3:18:24 UTC

My first WU completed and validated. Finally.

https://www.gpugrid.net/workunit.php?wuid=27023218

Sent 10 Feb 2021 | 17:42:37 UTC
Received 12 Feb 2021 | 2:30:35 UTC
Credit 435,937.50 (wow!)

GTX 980 Ti

My next WU is a retry of a failed attempt by another system. It's running A LOT faster than my first WU. I'll report back when it completes.

https://www.gpugrid.net/workunit.php?wuid=27025153

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1067
Credit: 40,231,533,983
RAC: 4,653
Level
Trp
Message 56453 - Posted: 12 Feb 2021 | 3:58:33 UTC - in response to Message 56451.

You can't undervolt in Linux directly, but you can lower the power limit if all you are trying to do is use less power.

check stock power limit with
nvidia-smi
~151W for my 1070
~170W for my 2060

set new power limit with
nvidia-smi --power-limit=***
*** is desired power limit in Watts

My 2060 is completing these tasks in around 69,500-70,000 seconds, or about 19.4 hours.


EDIT: I was brave and restarted my Win10 rig with one task at 54.667% complete, and it restarted the task at the same percentage. I will check it when it's done to see if there are any errors, but there is definitely a checkpoint being used.


yeah I know about the power limiting. it's all you can do in Linux. but if you apply an overclock on top of the power limit you can claw back some lost performance.

you can probably do better on power for that 2060.

I have a system with 2070s that I power limit to 150W, and it completes tasks in about 60,000-61,000s

lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 1,141,837,359
RAC: 1,441,781
Level
Met
Message 56454 - Posted: 12 Feb 2021 | 6:28:49 UTC - in response to Message 56452.
Last modified: 12 Feb 2021 | 6:29:45 UTC


My next WU is a retry of a failed attempt by another system. It's running A LOT faster than my first WU. I'll report back when it completes.

https://www.gpugrid.net/workunit.php?wuid=27025153


10% complete 03:22:00

Oof.

At least it's running. The prior attempt by another user errored-out in a few seconds.

lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 1,141,837,359
RAC: 1,441,781
Level
Met
Message 56463 - Posted: 12 Feb 2021 | 19:51:34 UTC

I was able to exit/restart the client at 54% and it continued as expected. (Win64)

lohphat
Joined: 21 Jan 10
Posts: 44
Credit: 1,141,837,359
RAC: 1,441,781
Level
Met
Message 56478 - Posted: 13 Feb 2021 | 17:17:32 UTC - in response to Message 56454.

And...scene.

Run time 106,374.11
CPU time 106,292.60
Validate state Valid
Credit 348,750.00

https://www.gpugrid.net/workunit.php?wuid=27025153

bozz4science
Joined: 22 May 20
Posts: 109
Credit: 102,786,176
RAC: 911,736
Level
Cys
Message 56496 - Posted: 14 Feb 2021 | 16:45:57 UTC
Last modified: 14 Feb 2021 | 17:12:32 UTC

Be happy if they finish at all. Mine (3 at this time) ran between 5,000 and 90,000 sec on a GTX 1660 Super and all errored out. I aborted the last WU in progress now...

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1067
Credit: 40,231,533,983
RAC: 4,653
Level
Trp
Message 56497 - Posted: 14 Feb 2021 | 17:18:25 UTC - in response to Message 56496.

Be happy if they finish at all. Mine (3 at this time) ran between 5,000 and 90,000 sec on a GTX 1660 Super and all errored out.


My 1660 Super takes about 96,000 seconds on Linux. Maybe 100,000+ on Windows since the Windows app is slower.

Of your 4 tasks on your 2-GPU Windows 10 host:

1 was aborted by the user.
1 failed due to a BOINC restart (or system restart) where it attempted to restart on a different device. It started on device 1, then tried to restart on device 0. This is a common and known situation that causes failures.
2 failed with “particle coordinate is nan” (nan = not a number). This commonly happens from too much overclock or overheating. GPUGRID tasks are quite intense, and these tasks are no exception. An overclock that’s stable on another project or application can be unstable for GPUGRID.

Try removing any overclock and make sure the GPU has sufficient airflow. Try to avoid restarting the system.

bozz4science
Joined: 22 May 20
Posts: 109
Credit: 102,786,176
RAC: 911,736
Level
Cys
Message 56498 - Posted: 14 Feb 2021 | 17:50:33 UTC

Thanks, Ian, for the diagnostics! I just reverted to stock settings. Hope I'll catch a resend and can try again. It's still weird, as all other apps work just fine with the mild OC setting. I've never had a single error thrown so far, except on MLC, but that's mostly due to the inherent nature of those tasks, where a WU occasionally results in a NaN error.

About that WU restarted on the other device: I noticed that too, and realized that, due to the dry spell here, I forgot to set up the <exclude_gpu> policy in my cc_config file on this new host, and the slow 750 Ti just happened to pick it up. That's solved as well for now.

I'll see how the 1660 Super handles tasks in the future!
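
For readers who have not set up a GPU exclusion before, the Python sketch below generates the kind of cc_config.xml fragment this post refers to. The device number is hypothetical (the thread does not say which index the 750 Ti has on this host), the helper name is invented for the example, and an existing cc_config.xml should be edited by hand rather than overwritten with this output.

```python
import xml.etree.ElementTree as ET


def build_exclude_gpu_config(project_url: str, device_num: int) -> str:
    """Build a minimal cc_config.xml that keeps one GPU away from one project."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    exclude = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(exclude, "url").text = project_url
    ET.SubElement(exclude, "device_num").text = str(device_num)
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical: device 1 is assumed to be the slow card to exclude from GPUGRID.
config_xml = build_exclude_gpu_config("https://www.gpugrid.net/", 1)
print(config_xml)
```

With an exclusion like this in place, a task can no longer resume on the excluded card after a restart.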

ServicEnginIC
Joined: 24 Sep 10
Posts: 581
Credit: 9,332,887,024
RAC: 19,884,276
Level
Tyr
Message 56771 - Posted: 15 Mar 2021 | 19:39:07 UTC

A small number of a new kind of ADRIA task seems to be in the field.
My host #557889 received one of them, named "e1s45_homeodomain_folded_100ns_44-ADRIA_HomeoFolded100ns-0-1-RND7761_0"
Progress for this task is about 66% after 5.5 hours on a GTX 1650 GPU.
The initial estimate therefore suggests that these tasks are much shorter than the previous ones.
For the moment, I'm not testing whether they checkpoint correctly...

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1067
Credit: 40,231,533,983
RAC: 4,653
Level
Trp
Message 56772 - Posted: 15 Mar 2021 | 22:42:45 UTC - in response to Message 56771.

I also received a couple of these tasks, and concur that they complete in less time. My 2080 Ti completed them in about 9,000s vs ~36,000s on the longer-running tasks.

Payout is 76,500 credits.

ServicEnginIC
Joined: 24 Sep 10
Posts: 581
Credit: 9,332,887,024
RAC: 19,884,276
Level
Tyr
Message 56773 - Posted: 16 Mar 2021 | 6:21:13 UTC - in response to Message 56772.

On my GTX 1650, the task mentioned took 30,471 seconds, versus ~170,000 seconds for the previous ADRIA tasks on this device.
The same 76,500 credits were awarded, since the result was returned in time for the full bonus.

e1s45_homeodomain_folded_100ns_44-ADRIA_HomeoFolded100ns-0-1-RND7761_0
