Author |
Message |
zooxit
Joined: 4 Jul 21 Posts: 23 Credit: 10,329,848,506 RAC: 60,227,476 Level
Scientific publications
|
Hi,
For some time now I have been noticing that WUs mostly end in errors on my Linux (Debian) computer with a GTX1070, while almost no errors are reported on my Windows computers with either a GTX1070 or an RTX3080.
The same app is running on all computers: Python apps for GPU hosts v4.03 (cuda1131)
Any ideas what the issue is? |
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1341 Credit: 7,682,021,308 RAC: 13,131,179 Level
Scientific publications
|
A lot of your errors are badly formatted tasks. Look at all the other failed wingmen that tried seven times before the task was retired.
But I also see an issue with your Debian system in its unarchive/decompression tools, which aren't handling the file archives correctly.
Check whether bzip2 is installed. |
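A quick way to run that check from Python is sketched below. It only tests whether a `bzip2` command is on the PATH and whether Python's own `bz2` module can compress and decompress a sample; it does not reproduce how the GPUGrid task actually unpacks its archives, and the helper name `bzip2_status` is made up for this example.

```python
import bz2
import shutil


def bzip2_status():
    """Report whether the bzip2 CLI is on PATH and whether bz2 round-trips data."""
    cli_path = shutil.which("bzip2")  # None if the command-line tool is missing
    sample = b"gpugrid test payload" * 100
    roundtrip_ok = bz2.decompress(bz2.compress(sample)) == sample
    return {"cli_path": cli_path, "roundtrip_ok": roundtrip_ok}


if __name__ == "__main__":
    print(bzip2_status())
```

If `cli_path` comes back as `None` on Debian, the tool is not installed or not on the PATH.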
|
|
Greg _BE
Joined: 30 Jun 14 Posts: 135 Credit: 121,356,939 RAC: 23,416 Level
Scientific publications
|
Windows systems are also getting: Exit status 195 (0xc3) EXIT_CHILD_FAILED
It's a common theme in all these tasks. |
|
|
zooxit
Joined: 4 Jul 21 Posts: 23 Credit: 10,329,848,506 RAC: 60,227,476 Level
Scientific publications
|
Hi,
thanks for the answers (I didn't have much time to troubleshoot lately, hence the late response...)
So, still troubleshooting, still not solved:
- tried Ubuntu, and now Win10, instead of Debian
- found that one RAM stick was faulty - removed it
- tried removing some of the graphics cards from the rig (originally there were 4 cards) - now I am down to only one graphics card on the troublesome computer (waiting for results)
... nothing helped. On this machine (GTX1070) I still get almost no Valid tasks (my other two machines (GTX1070 and RTX 3080) crunch as expected).
I guess the fact that changing the OS did not affect the outcome (still almost no Valid tasks) means there is a hardware problem(?)
Any ideas? |
|
|
|
Most of your errors seem to occur in the extraction phase.
____________
|
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1341 Credit: 7,682,021,308 RAC: 13,131,179 Level
Scientific publications
|
Check your storage system: a flaky hard drive/SSD/NVMe drive, bad cabling, or an incorrect transmission speed.
If your storage shares the PCIe bus with your GPUs, check that you are not trying to drive PCIe Gen2 storage and PCIe Gen3 GPUs at the same time.
As I mentioned previously, and as Ian also commented, your issue is in the decompression phase of the tasks, where the Python libraries get expanded.
Check that your storage is not running out of room. Check that your swapfile or swap partition is adequate. |
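The last two checks can be scripted roughly as follows. This is a minimal sketch: the function names are invented for the example, the swap part reads `/proc/meminfo` and so only works on Linux, and what counts as "enough" space or swap is not defined here.

```python
import shutil


def free_disk_gib(path="."):
    """Free space in GiB on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def parse_swap_kib(meminfo_text):
    """Extract SwapTotal and SwapFree (in KiB) from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            values[key] = int(rest.split()[0])
    return values


if __name__ == "__main__":
    # Point this at the BOINC data directory to check the right disk.
    print(f"free disk: {free_disk_gib('.'):.1f} GiB")
    try:
        with open("/proc/meminfo") as f:  # Linux only
            print(parse_swap_kib(f.read()))
    except FileNotFoundError:
        print("no /proc/meminfo here; check swap with your OS tools")
```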
|
|
zooxit
Joined: 4 Jul 21 Posts: 23 Credit: 10,329,848,506 RAC: 60,227,476 Level
Scientific publications
|
Thanks for the answers.
Debian, Ubuntu and now Win were all installed on different disks (all SSDs, admittedly rather old ones... sizes 128-512GB). Is this size enough?
Will check the BIOS settings.
Is it normal that the project status says there are 700+ tasks available, but my machine received less than an hour of work in almost two days (only 6 tasks, mostly errors)?
Also another question:
The computer with the RTX3080 (running 24/7) has averaged about 130,000 RAC lately - isn't that low for this graphics card?
|
|
|
gemini8
Joined: 3 Jul 16 Posts: 31 Credit: 2,212,787,676 RAC: 3,785,098 Level
Scientific publications
|
Regarding the 3080 and its credits:
First, you can't compare the Python tasks with ACEMD or the earlier Short- and Long-runs.
They work differently, and they are credited differently.
Second, I'm running a 'flat-footed' 1080 at about 180k credits per day.
(I call it flat-footed because I limit its power to 110 watts using nvidia-smi to hopefully give it a longer life.)
When monitoring the utilisation of the GPU I saw that it is hardly used, so I set the machine to run two GPUGrid Pythons beside one MilkyWay work-unit. I chose MilkyWay because that project's work-units also don't fill up the 1080 completely, and the GPUGrid work-units should have enough space to breathe. You could choose any other project that doesn't run at 100% for the same purpose. The credits for the second project can be added on top, but, as you can't really compare the different projects' crediting with one another for the same amounts of work, you should be able to get more than those 55k to 60k credits MilkyWay gives me.
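For reference, the 110 watt power limit mentioned above could be set along these lines. This is only a sketch: it assumes GPU index 0, uses `nvidia-smi`'s `-pl` (power limit) option, and normally needs root or administrator rights. The two function names are made up for the example.

```python
import subprocess


def power_limit_command(watts, gpu_index=0):
    """Build the nvidia-smi call that caps one GPU's power draw."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_power_limit(watts, gpu_index=0):
    """Run the command; raises CalledProcessError if nvidia-smi refuses."""
    return subprocess.run(power_limit_command(watts, gpu_index), check=True)


if __name__ == "__main__":
    # Only prints the command; call apply_power_limit(110) to actually set it.
    print(" ".join(power_limit_command(110)))
```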
____________
- - - - - - - - - -
Greetings, Jens |
|
|
zooxit
Joined: 4 Jul 21 Posts: 23 Credit: 10,329,848,506 RAC: 60,227,476 Level
Scientific publications
|
Thanks to everyone for the help and ideas.
So, the problem was in the first place one faulty stick of RAM, and the other problem was, as mentioned by others, disk space and page file. I bought new RAM, swapped in a 500GB disk and increased the swap file to 50GB or more.
Far fewer errors.
I also realized that these tasks are very CPU intensive:
- Ryzen 7 3700X is not able to 'feed' two GTX 1070 - all CPU cores are at maximum and the graphics cards are used at about 20%
- Ryzen 9 5900X uses 60-80% of its resources to 'feed' one RTX 3080 (GPU is used near 100%)
|
|
|
|
You should switch back to Linux. The Windows app seems really slow in comparison.
I’m running 2x RTX 3060, with 3x tasks on each, for a total of 6 tasks for the system. Individual task times are about 13 hrs, or about 4 hrs 20 min per task effective, since three run at once on each GPU. (That’s for full-length runs, excluding ones that end early.)
This system is on an EPYC 7443P 24-core processor. It is the same Zen3 architecture as your 5900X, but with twice the cores. Running 6x tasks uses ~80-90% of the CPU, and ~56GB of system memory.
You should be able to run 2x tasks on your 3080 for good completion times under Linux.
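The effective time per task follows from dividing the individual task time by the number of tasks running at once on a GPU. A small sketch of that arithmetic, using the figures from this post (the function name is just for illustration):

```python
def effective_time(task_hours, concurrent_tasks):
    """Return (hours, minutes) of wall time per finished task when tasks overlap."""
    total_minutes = round(task_hours * 60 / concurrent_tasks)
    return divmod(total_minutes, 60)


if __name__ == "__main__":
    hours, minutes = effective_time(13, 3)  # 13 h per task, 3 at once per GPU
    print(f"effective time per task: {hours} h {minutes} min")
    print(f"memory per task: {56 / 6:.1f} GB")  # ~56 GB across 6 tasks
```

By the same reasoning, the ~56GB of system memory across 6 tasks works out to roughly 9 GB per task, which is worth keeping in mind when sizing RAM and swap for 2x tasks on one card.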
____________
|
|
|
zooxit
Joined: 4 Jul 21 Posts: 23 Credit: 10,329,848,506 RAC: 60,227,476 Level
Scientific publications
|
Thanks for the hints. Will try Linux again. |
|
|