Message boards : Number crunching : A workunit at 99.960% progress for at least 18 hours

robertmiles
Joined: 16 Apr 09
Posts: 503
Credit: 755,434,080
RAC: 210,052
Message 25585 - Posted: 8 Jun 2012 | 23:57:56 UTC

For many hours, I've had a workunit showing unusual
progress numbers:

Long runs (8-12 hours on fastest card) 6.16 (cuda31)
I4R57-NATHAN-RPS1120528-2-166-RND7359
Running
0.185 CPUs + 1 NVIDIA GPU
5000000 GFLOPS
00:41:51 (CPU at last checkpoint)
19:19:11 (CPU time) (slowly rising)
70:45:57 (Elapsed time) (and rising)
--- (Estimated time remaining; not changing)
99.960% (fraction done; not changing for several hours now)

Is there something wrong with this workunit?
Or is this just the way that application handles a serious
underestimate of the time the workunit should run?
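
As a rough aid for reading the figures above, here is a minimal sketch of the arithmetic, assuming that "CPU at last checkpoint" and "CPU time" are directly comparable, so that their difference approximates the CPU time a restart could lose. The helper names are hypothetical and are not part of BOINC.

```python
def hms_to_seconds(text):
    """Convert 'HH:MM:SS' to seconds; the hours field may exceed 24."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hms(total):
    """Convert a non-negative number of seconds back to 'HH:MM:SS'."""
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Values copied from the task properties quoted in this post.
cpu_at_checkpoint = hms_to_seconds("00:41:51")
cpu_time = hms_to_seconds("19:19:11")
elapsed = hms_to_seconds("70:45:57")

# CPU time accumulated since the last checkpoint (assumed to be at risk on restart).
unsaved_cpu = cpu_time - cpu_at_checkpoint
print(seconds_to_hms(unsaved_cpu))  # 18:37:20

# Share of the elapsed time that was spent as CPU time.
print(round(cpu_time / elapsed, 3))
```

If that assumption holds, the task had gone more than 18 hours of CPU time without a new checkpoint.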

robertmiles
Joined: 16 Apr 09
Posts: 503
Credit: 755,434,080
RAC: 210,052
Message 25590 - Posted: 9 Jun 2012 | 6:02:47 UTC - in response to Message 25585.

Now about 24 hours.

6/5/2012 10:25:37 PM | | No config file found - using defaults
6/5/2012 10:25:37 PM | | Starting BOINC client version 7.0.25 for windows_x86_64
6/5/2012 10:25:37 PM | | log flags: file_xfer, sched_ops, task
6/5/2012 10:25:37 PM | | Libraries: libcurl/7.21.6 OpenSSL/1.0.0d zlib/1.2.5
6/5/2012 10:25:37 PM | | Data directory: C:\ProgramData\BOINC
6/5/2012 10:25:37 PM | | Running under account Bobby
6/5/2012 10:25:37 PM | | Processor: 8 GenuineIntel Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz [Family 6 Model 42 Stepping 7]
6/5/2012 10:25:37 PM | | Processor: 256.00 KB cache
6/5/2012 10:25:37 PM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 sse4_2 syscall nx lm vmx smx tm2 popcnt aes pbe
6/5/2012 10:25:37 PM | | OS: Microsoft Windows 7: Professional x64 Edition, Service Pack 1, (06.01.7601.00)
6/5/2012 10:25:37 PM | | Memory: 15.98 GB physical, 31.96 GB virtual
6/5/2012 10:25:37 PM | | Disk: 136.03 GB total, 65.07 GB free
6/5/2012 10:25:37 PM | | Local time is UTC -5 hours
6/5/2012 10:25:37 PM | | NVIDIA GPU 0: GeForce GT 440 (driver version 301.42, CUDA version 4.20, compute capability 2.1, 1536MB, 1442MB available, 342 GFLOPS peak)
6/5/2012 10:25:37 PM | | OpenCL: NVIDIA GPU 0: GeForce GT 440 (driver version 301.42, device version OpenCL 1.1 CUDA, 1536MB, 1442MB available)

Simba123
Joined: 5 Dec 11
Posts: 147
Credit: 69,970,684
RAC: 0
Message 25593 - Posted: 9 Jun 2012 | 8:33:15 UTC

That sounds like something is wrong.
Have you tried restarting your computer?

robertmiles
Joined: 16 Apr 09
Posts: 503
Credit: 755,434,080
RAC: 210,052
Message 25598 - Posted: 9 Jun 2012 | 12:24:58 UTC

I have now. Lost everything done since about 52 hours.

One of the many GPU workunits downloaded during that situation took over.

robertmiles
Joined: 16 Apr 09
Posts: 503
Credit: 755,434,080
RAC: 210,052
Message 25599 - Posted: 9 Jun 2012 | 22:36:48 UTC

It has now restarted and completed successfully. It was returned not quite 4 days after it was sent.

Simba123
Joined: 5 Dec 11
Posts: 147
Credit: 69,970,684
RAC: 0
Message 25601 - Posted: 9 Jun 2012 | 22:58:09 UTC - in response to Message 25598.

I have now. Lost everything done since about 52 hours.

One of the many GPU workunits downloaded during that situation took over.


:-(

Sorry you lost all that work. I've had one get stuck before like that, and a simple restart fixed it.

Getting stuck at that percentage seems to have something to do with writing the completed file to disk before transmission. Perhaps a background task, such as a virus scan, is preventing disk writes? I'm not sure.

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Message 25607 - Posted: 10 Jun 2012 | 12:04:36 UTC - in response to Message 25601.

There might be a larger issue here. Two of the I1R45-NATHAN_RPS work units failed for me:

http://www.gpugrid.net/result.php?resultid=5479036
http://www.gpugrid.net/result.php?resultid=5478382

____________
Thx - Paul

Note: Please don't use driver version 295 or 296! Recommended versions are 266 - 285.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,206,655,749
RAC: 261,147
Message 25609 - Posted: 10 Jun 2012 | 12:38:13 UTC - in response to Message 25607.
Last modified: 10 Jun 2012 | 12:41:30 UTC

There might be a larger issue here. Two of the I1R45-NATHAN_RPS work units failed for me:

http://www.gpugrid.net/result.php?resultid=5479036
http://www.gpugrid.net/result.php?resultid=5478382

This is a different problem. You should follow the outcome of those workunits on other hosts to see whether the workunits themselves are at fault.
However, I caught a PAOLA_1H46-15-20 that had been running (at 99% GPU usage) on my PC for more than 20 hours while the progress indicator showed 9.23%. I paused and restarted this workunit; its running time indicator dropped back to 45 minutes and the progress indicator to 9%. Since then it has been running fine (1h14m, 13.4%).
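
Both cases in this thread show the same symptom: elapsed time keeps rising while the progress figure stays put. A minimal sketch of how that check could be automated follows, assuming you record (elapsed seconds, fraction done) readings yourself. The function name `is_stalled` and the sample values are hypothetical, and nothing here calls BOINC.

```python
def is_stalled(samples, window_seconds):
    """Return True if fraction done has not changed for at least window_seconds.

    samples: list of (elapsed_seconds, fraction_done) readings, oldest first.
    """
    if not samples:
        return False
    latest_time, latest_fraction = samples[-1]
    # Walk backwards from the newest reading.
    for elapsed, fraction in reversed(samples):
        if fraction != latest_fraction:
            # Progress changed more recently than the window, so not stalled.
            return False
        if latest_time - elapsed >= window_seconds:
            # Same progress value seen at least window_seconds ago.
            return True
    # Not enough history to cover the window.
    return False


HOUR = 3600

# Hypothetical readings resembling a task stuck at 99.960%.
stuck = [(0, 0.9990), (1 * HOUR, 0.9996), (6 * HOUR, 0.9996), (24 * HOUR, 0.9996)]
print(is_stalled(stuck, 18 * HOUR))  # True

# Hypothetical readings resembling a task that is progressing again.
healthy = [(0, 0.09), (2700, 0.09), (4440, 0.134)]
print(is_stalled(healthy, 18 * HOUR))  # False
```

The window is a judgment call. For long runs it should comfortably exceed the usual gap between progress updates, or a slow but healthy task would be flagged.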
