Message boards : Number crunching : do not run CPU benchmarks [issue not reproduced]

Author Message
Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Message 8572 - Posted: 18 Apr 2009 | 18:27:32 UTC

I could be way off base but ...
Do not "Run CPU benchmarks" from BOINC Mgr while GPU crunching. I just did this a few minutes ago and it immediately caused my GPU WU to "compute error" when it resumed computation.

ERROR: mdsim.cu, line 572: Invalid device selected

My guess is that ripping the CPU out from underneath the GPU to do a benchmark is not a good idea. Perhaps one of the people here who is in contact with the BOINC devs might suggest they suspend GPU tasks when performing CPU benchmarks?

4/18/2009 1:58:08 PM Running CPU benchmarks
4/18/2009 1:58:08 PM Suspending computation - running CPU benchmarks
4/18/2009 1:58:39 PM Benchmark results:
4/18/2009 1:58:39 PM Number of CPUs: 8
4/18/2009 1:58:39 PM 3509 floating point MIPS (Whetstone) per CPU
4/18/2009 1:58:39 PM 11138 integer MIPS (Dhrystone) per CPU
4/18/2009 1:58:40 PM Resuming computation
4/18/2009 1:58:43 PM GPUGRID Computation for task xp490000-GIANNI_pYEpYV0304-5-10-RND_0 finished <-- finished because error above
4/18/2009 1:58:43 PM GPUGRID Output file xp490000-GIANNI_pYEpYV0304-5-10-RND_0_1 for task xp490000-GIANNI_pYEpYV0304-5-10-RND_0 absent
4/18/2009 1:58:43 PM GPUGRID Output file xp490000-GIANNI_pYEpYV0304-5-10-RND_0_2 for task xp490000-GIANNI_pYEpYV0304-5-10-RND_0 absent
4/18/2009 1:58:43 PM GPUGRID Output file xp490000-GIANNI_pYEpYV0304-5-10-RND_0_3 for task xp490000-GIANNI_pYEpYV0304-5-10-RND_0 absent
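
For anyone who wants to check the timing in a log like the one above, here is a minimal Python sketch that measures the gap between two events. It assumes every line starts with a timestamp in the `M/D/YYYY H:MM:SS AM/PM` form seen in this excerpt; the helper names `parse_line` and `seconds_between` are made up for illustration and are not part of BOINC.

```python
from datetime import datetime

# A few lines copied from the log above (trailing annotation left out).
LOG = """\
4/18/2009 1:58:08 PM Running CPU benchmarks
4/18/2009 1:58:08 PM Suspending computation - running CPU benchmarks
4/18/2009 1:58:39 PM Benchmark results:
4/18/2009 1:58:40 PM Resuming computation
4/18/2009 1:58:43 PM GPUGRID Computation for task xp490000-GIANNI_pYEpYV0304-5-10-RND_0 finished
"""


def parse_line(line):
    """Split one log line into (timestamp, message).

    Assumes the layout 'M/D/YYYY H:MM:SS AM/PM message'.
    """
    date_part, time_part, ampm, message = line.split(" ", 3)
    stamp = datetime.strptime(
        f"{date_part} {time_part} {ampm}", "%m/%d/%Y %I:%M:%S %p"
    )
    return stamp, message


def seconds_between(log_text, first_marker, second_marker):
    """Seconds from the first line containing first_marker to the next
    later line containing second_marker, or None if either is missing."""
    first = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        stamp, message = parse_line(line)
        if first is None:
            if first_marker in message:
                first = stamp
        elif second_marker in message:
            return (stamp - first).total_seconds()
    return None


if __name__ == "__main__":
    print(seconds_between(LOG, "Running CPU benchmarks", "Benchmark results"))
    print(seconds_between(LOG, "Resuming computation", "finished"))
```

On this excerpt the benchmark itself takes 31 seconds and the task ends 3 seconds after computation resumes.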

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 8576 - Posted: 18 Apr 2009 | 22:08:29 UTC - in response to Message 8572.

Which BOINC version?

MrS
____________
Scanning for our furry friends since Jan 2002

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Message 8577 - Posted: 18 Apr 2009 | 23:02:51 UTC - in response to Message 8576.

Sorry, I was so frustrated I didn't post any specs ...
Vista 64 Ult.
BOINC 6.6.20
EVGA GTX 295 - 182.5

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 8586 - Posted: 19 Apr 2009 | 8:50:27 UTC

Can anyone try to reproduce this issue, either with 6.6.20 or any other version? (I don't have access to my machine right now.)

MrS
____________
Scanning for our furry friends since Jan 2002

Dieter Matuschek
Joined: 28 Dec 08
Posts: 58
Credit: 231,884,297
RAC: 0
Message 8590 - Posted: 19 Apr 2009 | 9:50:02 UTC - in response to Message 8586.

Can anyone try to reproduce this issue <snip>

I just did it with BOINC 6.6.20 and Windows XP 32bit SP3: no problems encountered.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 8602 - Posted: 19 Apr 2009 | 11:52:46 UTC

So we don't need tests with other BOINC versions. Snow, would you mind trying to reproduce the error?

You could either do it at the beginning of a new WU or:
- forbid BOINC network access
- stop BOINC completely
- copy the entire BOINC user data folder
- start BOINC
- test
- in case of crash: shut BOINC down and copy the backup over the existing files
- otherwise: may have just been a coincidence
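
The copy and restore steps in the list above could be scripted roughly like this. It is only a sketch under assumptions: the function names are hypothetical, the location of the BOINC data folder depends on the installation and has to be supplied by you, and the script does not stop or start BOINC, so BOINC must already be shut down before either function is called.

```python
import shutil
from pathlib import Path


def backup_data_dir(data_dir, backup_dir):
    """Copy the whole data folder to backup_dir (which must not exist yet)."""
    data_dir, backup_dir = Path(data_dir), Path(backup_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"data folder not found: {data_dir}")
    if backup_dir.exists():
        raise FileExistsError(f"backup target already exists: {backup_dir}")
    shutil.copytree(data_dir, backup_dir)
    return backup_dir


def restore_data_dir(backup_dir, data_dir):
    """Copy the backup over the existing files in data_dir.

    Files created after the backup was taken are left in place;
    files that also exist in the backup are overwritten.
    Requires Python 3.8+ for dirs_exist_ok.
    """
    backup_dir, data_dir = Path(backup_dir), Path(data_dir)
    if not backup_dir.is_dir():
        raise FileNotFoundError(f"backup not found: {backup_dir}")
    shutil.copytree(backup_dir, data_dir, dirs_exist_ok=True)
    return data_dir
```

Typical use would be `backup_data_dir(your_data_folder, your_backup_folder)` before the test and `restore_data_dir(your_backup_folder, your_data_folder)` after a crash, with both paths filled in for your own machine.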

MrS
____________
Scanning for our furry friends since Jan 2002

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Message 8604 - Posted: 19 Apr 2009 | 11:55:19 UTC - in response to Message 8590.

The client is different for Vista 64-bit ... I really don't want to do this again, as I have already messed up my PPD enough recently, but if there are no other takers I suppose I could stop getting new tasks, suspend what I do have, let 1 start and try it out. Hopefully I don't crap out the rest at the same time :-) I lost 6 completed WUs on Friday when my internet connection was broken. When I finally did get back online 6 hours later (well within the deadline), my "lost results" WUs were sent back to me for processing a second time.

Steve

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Message 8605 - Posted: 19 Apr 2009 | 11:57:33 UTC - in response to Message 8604.

OK ... I will do this but will be going very slowly to avoid introducing other issues. Please be patient ... I will do my best to stop ranting and get you some real results :-)

Michael Goetz
Joined: 2 Mar 09
Posts: 124
Credit: 60,073,744
RAC: 0
Message 8608 - Posted: 19 Apr 2009 | 12:44:21 UTC - in response to Message 8605.
Last modified: 19 Apr 2009 | 12:52:27 UTC

BOINC runs benchmarks once a week on its own, so every computer that has been running GPUGRID more than a week has had benchmarks running and interrupting the WUs at least once. Since this is the only report of a problem, I would have to assume the benchmark system itself is not at fault and that the problem, if there is one, only occurs when benchmarks are triggered manually.

Anyway, I'll run a manual benchmark in a moment and report back later today on what happens to the WU in progress.

I'm running 32 bit Vista, however, so it won't necessarily tell us anything.

Mike

EDIT: I'm running 6.6.20, Vista 32 bit SP1, EVGA GTX280 and slightly old driver 180.48. The CPU is Q6600; CPU clocks are stock and GPU is factory OC. I ran the benchmarks and the WU is still running. I'll report back when it completes.
____________
Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Message 8609 - Posted: 19 Apr 2009 | 13:05:40 UTC - in response to Message 8608.
Last modified: 19 Apr 2009 | 13:07:02 UTC

I finished trying to make this happen again, and the good news is that it did not cause a compute error this time. I actually tried it many times (5) and even resumed network activity and tried again (2 times), and everything works fine. While I would be the first to chalk this up to *user hysteria* (I am a dev by profession), the log in my OP makes it seem a little too convenient that a compute error was generated immediately after a manual benchmark run and the WU was marked completed within 3 seconds. Perhaps we just keep this one in the *maybe someday* pile in case it happens again to someone else, and only at that point spend more time trying to recreate it.

My guess is that this probably needs a *perfect storm* of events. Perhaps it was trying to write a checkpoint at the same instant or something. Perhaps it even has something to do with CPU affinity: say GPU0 was using CPU0 at the moment it was suspended, the CPU0 caches were swapped out and refreshed to do the bench, and then upon resume the OS tried to use CPU1 to finish what GPU0 needed done and it all got tripped up.

Sorry for the alarmist tone and title in my OP, I will try to remain calmer in the future.

Steve

PS: I wish we could edit the title so it does not cause unnecessary concern now that we know it may not be an issue at all.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 8616 - Posted: 19 Apr 2009 | 14:14:36 UTC

I agree: it seems like your BOINC may have been in an unusual state.

MrS
____________
Scanning for our furry friends since Jan 2002

Michael Goetz
Joined: 2 Mar 09
Posts: 124
Credit: 60,073,744
RAC: 0
Message 8634 - Posted: 19 Apr 2009 | 19:21:49 UTC - in response to Message 8616.

And, for the record, my WU also completed without error.


____________
Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.
