
Message boards : Number crunching : GERARD_MO_TRV_ WUs

Author Message
Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 11,126,027,899
RAC: 15,465,518
Level
Trp
Message 44029 - Posted: 22 Jul 2016 | 0:25:47 UTC

I was able to complete GERARD_MO_TRV_ WUs successfully:

Name e2s38_e1s1p0f339-GERARD_MO_TRV_2-0-1-RND8501_0
Workunit 11676480
Created 21 Jul 2016 | 10:42:20 UTC
Sent 21 Jul 2016 | 10:45:23 UTC
Received 21 Jul 2016 | 23:51:41 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 263612
Report deadline 26 Jul 2016 | 10:45:23 UTC
Run time 31,564.82
CPU time 31,399.55
Validate state Valid
Credit 244,050.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Stderr output

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<stderr_txt>
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1266MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# GPU 0 : 63C
# GPU 1 : 56C
# GPU 0 : 64C
# GPU 1 : 57C
# GPU 0 : 65C
# GPU 1 : 58C
# GPU 0 : 66C
# GPU 1 : 59C
# Time per step (avg over 10000000 steps): 3.156 ms
# Approximate elapsed time for entire WU: 31561.041 s
# PERFORMANCE: 54223 Natoms 3.156 ns/day 0.000 ms/step 0.000 us/step/atom
19:48:09 (3464): called boinc_finish

</stderr_txt>
]]>

Name e2s14_e1s32p0f291-GERARD_MO_TRV_2-0-1-RND8906_0
Workunit 11676456
Created 21 Jul 2016 | 10:41:34 UTC
Sent 21 Jul 2016 | 10:45:23 UTC
Received 22 Jul 2016 | 0:20:30 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 263612
Report deadline 26 Jul 2016 | 10:45:23 UTC
Run time 34,050.83
CPU time 33,910.92
Validate state Valid
Credit 244,050.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Stderr output

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<stderr_txt>
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1266MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# GPU 0 : 63C
# GPU 1 : 58C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:02:00.0
# Device clock : 1190MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# GPU 0 : 66C
# GPU 1 : 56C
# GPU 1 : 57C
# GPU 1 : 58C
# GPU 1 : 59C
# Time per step (avg over 9885000 steps): 3.407 ms
# Approximate elapsed time for entire WU: 34073.917 s
# PERFORMANCE: 54223 Natoms 3.407 ns/day 0.000 ms/step 0.000 us/step/atom
20:17:14 (5252): called boinc_finish

</stderr_txt>
]]>
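(As a sanity check, the reported run times are consistent with the per-step figures above: 3.156 ms/step × 10,000,000 steps ≈ 31,560 s for the first WU, and 3.407 ms/step × 10,000,000 steps ≈ 34,070 s for the second.)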


But these WUs are slow. GPU usage was 75% to 78% and GPU power draw was under 70% on a Windows 10 machine, compared to 80% to 89% GPU usage and 80%+ power draw for the other GERARD and ADRIA units on the same computer.
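For anyone who wants to compare these figures on their own host: the usage and power numbers above presumably come from a monitoring tool such as GPU-Z or MSI Afterburner; roughly the same readings can be sampled from the command line with nvidia-smi (a sketch; power.draw is only reported on cards and drivers that expose that sensor):

    nvidia-smi --query-gpu=utilization.gpu,power.draw,temperature.gpu --format=csv -l 5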




Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,239,065,968
RAC: 3,161,193
Level
Trp
Message 44077 - Posted: 1 Aug 2016 | 7:09:05 UTC
Last modified: 1 Aug 2016 | 7:09:29 UTC

One of my hosts is processing a GERARD_MO_TRV_2-0-1-RND9009_0 workunit, but its progress is very slow.
It's at 42% in 7h 47m on a Windows XP x64 host with a GTX980Ti and a Core i7-4790k CPU.
It's processing the GPUGrid workunit and 3 Rosetta@home workunits simultaneously; the GPU usage is 41~63%, averaging 48%.
There's only a small increase (~3%) in GPU usage if I suspend the Rosetta@home workunits.
SWAN_SYNC is on (how it's typically enabled is sketched after this list).
GPU core clock is 1391MHz.
GPU FB usage is 9~15% (avg: 12%).
GPU temperature is 54°C.
GPU memory clock is 3500MHz.
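(For readers unfamiliar with SWAN_SYNC: it is enabled by setting a system-wide environment variable that the GPUGrid (ACEMD) app reads at start-up. A minimal Windows sketch, with the caveat that the value shown is only the one historically mentioned on these forums, so check the project FAQ for what current app versions expect:

    setx SWAN_SYNC 0

then restart the BOINC client so the science app inherits the new environment. With SWAN_SYNC the app keeps a full CPU core busy polling the GPU instead of sleeping between kernels, which is why a free core per GPU task is recommended.)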

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 44078 - Posted: 1 Aug 2016 | 19:19:45 UTC

Join the club. They run slow, have low GPU usage (run cool) and give poor credit :-)

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,239,065,968
RAC: 3,161,193
Level
Trp
Message 44079 - Posted: 1 Aug 2016 | 21:16:19 UTC - in response to Message 44078.

Join the club. They run slow, have low GPU usage (run cool) and give poor credit :-)

It finished in 17h 41m.
The really strange thing is that the following GERARD_FXCXCL12RX_1153966 workunit was also very slow (with low GPU usage), while similar workunits were running at normal speed and GPU usage on my other hosts. I exited the BOINC manager, stopping the science applications (to make sure the GPUGrid app would continue without error), and then I restarted the OS (Windows XP x64). Since then it has been progressing at a normal rate.
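For anyone wanting to reproduce that shutdown procedure: asking the client itself to exit also stops the running science apps, e.g. (a sketch, assuming boinccmd is on the PATH):

    boinccmd --quit

or use the option in BOINC Manager's exit dialog to stop running tasks when the manager exits, then restart the machine.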

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 44080 - Posted: 2 Aug 2016 | 16:30:55 UTC - in response to Message 44079.

The really strange thing is that the following GERARD_FXCXCL12RX_1153966 workunit was also very slow (with low GPU usage), while similar workunits were running at normal speed and GPU usage on my other hosts. I exited the BOINC manager, stopping the science applications (to make sure the GPUGrid app would continue without error), and then I restarted the OS (Windows XP x64). Since then it has been progressing at a normal rate.

Unfortunately we're getting more and more of these "TRV" WUs. I have 5 running at the moment.

Betting Slip
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Level
Phe
Message 44081 - Posted: 2 Aug 2016 | 17:18:11 UTC - in response to Message 44080.

The really strange thing is that the following GERARD_FXCXCL12RX_1153966 workunit was also very slow (with low GPU usage), while similar workunits were running at normal speed and GPU usage on my other hosts. I exited the BOINC manager, stopping the science applications (to make sure the GPUGrid app would continue without error), and then I restarted the OS (Windows XP x64). Since then it has been progressing at a normal rate.

Unfortunately we're getting more and more of these "TRV" WUs. I have 5 running at the moment.


Why unfortunate? We are here to crunch units and hope something useful comes out of that activity; everything else is irrelevant.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,239,065,968
RAC: 3,161,193
Level
Trp
Message 44083 - Posted: 2 Aug 2016 | 19:30:17 UTC - in response to Message 44081.

Unfortunately we're getting more and more of these "TRV" WUs. I have 5 running at the moment.

Why unfortunate? We are here to crunch units and hope something useful comes out of that activity; everything else is irrelevant.
I would put what you've said in another perspective: since the GPUGrid project doesn't find the cure for a nasty disease every other day, these collateral factors can motivate some crunchers to participate (or leave), therefore it is unwise to treat them as "irrelevant". They surely come second to the results, but if the project's handling of the participants' frustration causes negative feedback, then there will be fewer crunchers and, as a consequence, fewer results. I think that every project should avoid this, especially this one, which has such a fragile GPU app.
I've tested these "TRV" workunits under Windows 7, and they experience a higher WDDM impact than the FXCXCLs; I guess there is more interaction between the CPU and the GPU for the TRVs. However, the 50% GPU usage surely came from some kind of error, which persisted after the WU finished. I can't recall such behavior from the past.

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 44086 - Posted: 2 Aug 2016 | 22:58:08 UTC - in response to Message 44083.

Unfortunately we're getting more and more of these "TRV" WUs. I have 5 running at the moment.

Why unfortunate? We are here to crunch units and hope something useful comes out of that activity; everything else is irrelevant.

I would put what you've said in another perspective: since the GPUGrid project doesn't find the cure for a nasty disease every other day, these collateral factors can motivate some crunchers to participate (or leave), therefore it is unwise to treat them as "irrelevant". They surely come second to the results, but if the project's handling of the participants' frustration causes negative feedback, then there will be fewer crunchers and, as a consequence, fewer results. I think that every project should avoid this, especially this one, which has such a fragile GPU app.
I've tested these "TRV" workunits under Windows 7, and they experience a higher WDDM impact than the FXCXCLs; I guess there is more interaction between the CPU and the GPU for the TRVs. However, the 50% GPU usage surely came from some kind of error, which persisted after the WU finished. I can't recall such behavior from the past.

My thoughts too. Always keep your customers (volunteers) as happy as possible, especially when they're paying a lot and not receiving anything in return (except for warm, fuzzy feelings).

Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 11,126,027,899
RAC: 15,465,518
Level
Trp
Message 44087 - Posted: 2 Aug 2016 | 23:55:53 UTC - in response to Message 44086.

Unfortunately we're getting more and more of these "TRV" WUs. I have 5 running at the moment.

Why unfortunate? We are here to crunch units and hope something useful comes out of that activity; everything else is irrelevant.

I would put what you've said in another perspective: since the GPUGrid project doesn't find the cure for a nasty disease every other day, these collateral factors can motivate some crunchers to participate (or leave), therefore it is unwise to treat them as "irrelevant". They surely come second to the results, but if the project's handling of the participants' frustration causes negative feedback, then there will be fewer crunchers and, as a consequence, fewer results. I think that every project should avoid this, especially this one, which has such a fragile GPU app.
I've tested these "TRV" workunits under Windows 7, and they experience a higher WDDM impact than the FXCXCLs; I guess there is more interaction between the CPU and the GPU for the TRVs. However, the 50% GPU usage surely came from some kind of error, which persisted after the WU finished. I can't recall such behavior from the past.

My thoughts too. Always keep your customers (volunteers) as happy as possible, especially when they're paying a lot and not receiving anything in return (except for warm, fuzzy feelings).


I agree.


Profile caffeineyellow5
Joined: 30 Jul 14
Posts: 225
Credit: 2,658,976,345
RAC: 0
Level
Phe
Message 44096 - Posted: 6 Aug 2016 | 10:21:41 UTC - in response to Message 44079.

The really strange thing is that the following GERARD_FXCXCL12RX_1153966 workunit was also very slow (with low GPU usage), while similar workunits were running at normal speed and GPU usage on my other hosts. I exited the BOINC manager, stopping the science applications (to make sure the GPUGrid app would continue without error), and then I restarted the OS (Windows XP x64). Since then it has been progressing at a normal rate.

I have seen this happen on my system that has the 3 980 Ti Classys in it. One card will under-perform by a good percentage. Restarting the OS is the only thing that brings all 3 back to normal performance; restarting BOINC does not. It usually follows a failed work unit, but it's always the same (bottom) slot. I have even rotated the cards several times (and will continue to do so, since heat removal is toughest on the middle card and I'm sure that causes extra wear and tear on it compared to the others). I just keep an eye on it and reboot when needed. I haven't needed to recently, but it happened quite often around the turn of the year, as I remember. I'm not sure what tasks we were running at the time that had errors like the MO_MORs have had recently. I have not seen this happen on any of the other machines with 2 cards in them or on the single-card systems.
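A quick way to pin down which card is the under-performer (and tie it to a physical slot even after rotating the cards) is to compare per-GPU clocks and utilization from the command line, for example (a sketch; the exact set of query fields supported depends on the driver, see nvidia-smi --help-query-gpu):

    nvidia-smi --query-gpu=index,pci.bus_id,clocks.sm,utilization.gpu,temperature.gpu --format=csv -l 10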
____________
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!"
Ephesians 6:18-20, please ;-)
http://tbc-pa.org

Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 11,126,027,899
RAC: 15,465,518
Level
Trp
Message 44098 - Posted: 7 Aug 2016 | 11:58:46 UTC

I normally don't run 2 tasks on 1 GPU simultaneously on this project, but I decided to run an experiment using these "MO" units.
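For reference, running two tasks per GPU like this is normally set up with an app_config.xml in the GPUGrid project directory. A minimal sketch follows; the <name> value is an assumption, so take the real app name from the <app> entries in client_state.xml:

    <app_config>
      <app>
        <name>acemdlong</name>
        <gpu_versions>
          <gpu_usage>0.5</gpu_usage>
          <cpu_usage>1.0</cpu_usage>
        </gpu_versions>
      </app>
    </app_config>

After saving it, use the manager's "Read config files" option (or restart the client) for the new limits to take effect.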

Here are 4 WUs which I ran using 1 CPU & 0.5 GPU:

e18s21_e13s127p0f196-GERARD_MO_MOR_2-0-1-RND8164_0 Run time 56,508.43 seconds
e18s20_e16s32p0f162-GERARD_MO_MOR_2-0-1-RND9014_0 Run time 54,610.82 seconds
e18s19_e11s4p0f108-GERARD_MO_MOR_2-0-1-RND5678_0 Run time 53,411.93 seconds
e18s16_e9s1p0f40-GERARD_MO_MOR_2-0-1-RND1625_0 Run time 55,218.16 seconds


The average run time for those 4 WUs is 54,937.34 seconds, but since I am running 2 at a time, the actual average GPU time per WU is half that, or 27,468.67 seconds. I think this actual GPU time is the number that should be used as the run time.


Here are 2 WUs which I ran using 1 CPU & 1 GPU from the first post in this thread:

e2s38_e1s1p0f339-GERARD_MO_TRV_2-0-1-RND8501_0 Run time 31,564.82 seconds
e2s14_e1s32p0f291-GERARD_MO_TRV_2-0-1-RND8906_0 Run time 34,050.83 seconds

The average run time for these 2 WUs is 32,807.83 seconds.


27,468.67 / 32,807.83 ≈ 0.837, i.e. more than a 16% reduction in effective run time per WU.

The GPU usage was 91% max, and power usage was 80%, on a Windows 10 computer. I don't think running a slow "MO" simultaneously with a normal-speed long GPUGRID WU would yield the same reduction in time, but you are welcome to experiment.

When I was running an "MO" WU simultaneously with an Einstein Arecibo WU, GPU usage was 94% max.

I did have one downclocking event during the experiment, but everything else ran smoothly.

I know I am comparing "MO" to "TRV", but their runtimes are about the same.





Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,886,844,890
RAC: 19,857,568
Level
Tyr
Message 44099 - Posted: 7 Aug 2016 | 13:15:27 UTC

Here's an example of a different problem caused by the MO_TRV WUs - I don't think it's been discussed before.

e12s9_e5s8p0f353-GERARD_MO_TRV_2-0-1-RND6047

The first user seems to be a part-time or multi-project cruncher - no quarrel with that, it's what BOINC was written for. As I write this, they have 5 completed tasks visible, with an average turnround of 3.53 days.

But as we know, these MO_TRV tasks run longer than usual, without any compensating increase in the estimated runtime passed to BOINC. Unsurprisingly, BOINC failed in its duty of completing the task by deadline - overrunning by about 2.5 hours.

So the server created a replacement task, and sent it to a replacement user - me, in this case.

So, I was asked to run a task which validated (science complete) when I was barely 15% done with it - that's a waste of resources. As you can see, I've aborted the redundant task, and with few tasks available these days, that GPU has moved back to another project, where there is still science to be done.

As an aside, if I'd allowed the wasteful task to continue, I would still have been awarded - without bonus - vastly more credit than at any other BOINC project I crunch for. So it wasn't the loss of bonus that caused me to abort the task, though I suspect it might have an influence on some people's thinking.

My major concern is the failure of the admins here to use the tools available - <rsc_fpops_est> - to help BOINC avoid deadline misses in the first place.
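For context on why that field matters: before the client has its own run-time statistics for an app version, its duration estimate is roughly

    estimated run time ≈ rsc_fpops_est / projected speed of the app version (FLOPS)

(a simplified sketch of BOINC's estimation logic; the client refines the estimate as completed results come in). So if the project leaves rsc_fpops_est at the value used for ordinary long WUs while the MO_TRV batch actually runs substantially longer, every scheduling and deadline decision starts from an estimate that is correspondingly too low.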
