Message boards : Number crunching : Recently Python app work unit failed with computing error
With work unit:
ID: 59125
Result 33000731 (easier that way)
ID: 59127
That's a mobile GPU in a laptop. It's not usually recommended to even try running GPU tasks on a laptop.
ID: 59130
If you find out what is causing the problem, please let me know about it also.
ID: 59132
It looks like overheating, and the GPU driver dropped out of operation. Three nvidia-smi snapshots taken five seconds apart (driver 510.73.05, CUDA 11.6):

Wed Aug 24 17:28:52 2022
GPU 0: NVIDIA GeForce ... (00000000:02:00.0)  Temp 93C  Perf P0  Memory 3267MiB / 4096MiB  GPU-Util 94%
Processes: /usr/lib/xorg/Xorg 4MiB (G), bin/python 3261MiB (C)

Wed Aug 24 17:28:57 2022
GPU 0: NVIDIA GeForce ... (00000000:02:00.0)  Temp 90C  Perf P0  Memory 3267MiB / 4096MiB  GPU-Util 11%
Processes: /usr/lib/xorg/Xorg 4MiB (G), bin/python 3261MiB (C)

Wed Aug 24 17:29:02 2022
GPU 0: NVIDIA GeForce ... ERR!  "GPU is lost"  Temp ERR!  Pwr ERR!  GPU-Util ERR!
Processes: no running processes found
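If anyone wants to capture the same kind of evidence while a task runs, a minimal sketch along these lines works (it assumes nvidia-smi is on the PATH and a single GPU; the query fields and the 5-second interval are just illustrative):

```python
import subprocess
import time

# Fields reported by nvidia-smi's --query-gpu interface.
FIELDS = "timestamp,temperature.gpu,utilization.gpu,memory.used,memory.total"

def snapshot() -> str:
    # CSV output without headers or units keeps the log compact.
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True)
    return result.stdout.strip() or result.stderr.strip()

if __name__ == "__main__":
    # One line every 5 seconds; redirect to a file to keep a record.
    while True:
        print(snapshot(), flush=True)
        time.sleep(5)
```

Leaving that running (or redirected to a file) shows whether the temperature climbs right up to the point where the driver drops the GPU.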
ID: 59156
"It looks like overheating, and the GPU driver dropped out of operation."
Putting the laptop on a cooling stand, with fans blowing air directly into the laptop's ventilation inlet slots, may help. The slots are often underneath, but check your particular machine. And give them a good clean while you're down there!
ID: 59157
I run laptop GPUs. Take a small air blower to the vents and blow the dust out. I do it daily.
ID: 59158
"Result 33000731 (easier that way)"
It is clear that these GPU tasks fail on MX-series cards with insufficient GPU memory. I am unable to find any such "requirements" stated anywhere - I can't even find details of the applications. Can anyone help?
ID: 59168
"Result 33000731 (easier that way)"
I can't speak for Windows behavior, but the last several Linux tasks I processed used a little more than 3GB per task when running 1 task per GPU. With the help of a CUDA MPS server I can run 2 tasks concurrently on a 6GB GTX 1060 card, since some of the memory gets shared. I would say 3GB is the minimum needed per task, and at least 4GB to be comfortable. System memory usage is also quite high: about 8GB per task. But you should keep monitoring it; the project could change the requirements at any time if they want to run larger jobs or change the direction of their research.
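For anyone curious about the MPS part: the control daemon just has to be started before the BOINC client launches its tasks. A minimal sketch, assuming the driver's nvidia-cuda-mps-control binary is installed; the device index and the pipe/log directories below are illustrative:

```python
import os
import subprocess

# Environment for the MPS control daemon; client processes that should share
# the GPU need to see the same CUDA_MPS_PIPE_DIRECTORY.
env = dict(os.environ)
env["CUDA_VISIBLE_DEVICES"] = "0"                      # GPU to share (illustrative)
env["CUDA_MPS_PIPE_DIRECTORY"] = "/tmp/nvidia-mps"     # illustrative locations
env["CUDA_MPS_LOG_DIRECTORY"] = "/tmp/nvidia-mps-log"

# -d starts the daemon in the background.
subprocess.run(["nvidia-cuda-mps-control", "-d"], check=True, env=env)

# To shut it down later: echo quit | nvidia-cuda-mps-control
```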
ID: 59169
I have 2 laptops, the older one running an Nvidia GTX 1060 and the newer one an RTX 3060. The older laptop, though reaching a GPU temperature of ~90C, completes most Python tasks, though it may take up to 40 hours. The newer laptop, however, fails the vast majority of Python tasks, and that started quite recently. For example:

29/08/2022 9:32:39 PM | GPUGRID | Computation for task e00008a01599-ABOU_rnd_ppod_expand_demos24_3-0-1-RND7566_0 finished

or another task, with more logging detail:

29/08/2022 9:44:40 PM | GPUGRID | [task] Process for e00014a00897-ABOU_rnd_ppod_expand_demos24_3-0-1-RND0029_3 exited, exit code 195, task state 1

What could be the reason? The GPU has enough graphics memory, the laptop has 16 GB of RAM - the same as the older one - there is enough disk space, and the temperature doesn't rise above 55-60C. Acemd3 tasks never fail on the newer laptop (at least, I don't remember such failures) and usually finish in under 15 hours. It's only Python, and only recently. Why could that be? What should I enable in the logs to diagnose this better?
ID: 59173
"The newer laptop, however, fails the vast majority of Python tasks, and that started quite recently."
This is the more specific error you are getting on that system:

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

and I've also seen this in your errors:

RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes.

Researching your first error, and considering the second one, it's likely that memory is your problem. Are you trying to run multiple tasks at a time? If not, others with Windows systems have mentioned increasing the pagefile size to solve these issues. Have you done that as well?
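If you want to confirm whether the pagefile is the bottleneck, something like the following prints physical and pagefile (swap) usage while a task is running; psutil is a third-party package (pip install psutil) and the 10-second interval is just illustrative:

```python
import time

import psutil  # third-party: pip install psutil

# Sample RAM and pagefile (swap on Linux) usage every 10 seconds.
while True:
    vm = psutil.virtual_memory()
    sm = psutil.swap_memory()
    print(f"RAM  {vm.used / 2**30:5.1f} / {vm.total / 2**30:5.1f} GiB   "
          f"pagefile  {sm.used / 2**30:5.1f} / {sm.total / 2**30:5.1f} GiB",
          flush=True)
    time.sleep(10)
```

If the pagefile column sits near its maximum when the errors appear, the allocation failures above are almost certainly memory related.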
ID: 59174
Thank you so much! Well, I was suspecting the memory, but both laptops have a Windows-managed pagefile which was set to 40+GB. Is that enough, or should I increase it even more on the newer system? Besides, if it's Windows-managed, doesn't the pagefile size increase automatically on demand, even though that may cause issues for the processes requesting memory at that moment? Anyway, I have set the pagefile to 64GB now, as someone did in a similar situation. I expect that to be enough for the WUs to complete without issues. Thanks again.
ID: 59175
The problem with a Windows-managed pagefile is that it probably doesn't increase its size immediately when the Python application starts loading all of its spawned processes.
ID: 59176
"I can't speak for Windows behavior, but the last several Linux tasks I processed used a little more than 3GB per task when running 1 task per GPU."
On Linux I can still successfully run Python tasks on a GTX 1060 3GB. Using the Intel integrated graphics for the display helps. I also noticed that the Python app's GPU memory usage decreased by ~100MiB when I moved the Xorg process to the Intel GPU; I don't know exactly why.

Tue Aug 30 08:22:46 2022 (driver 470.141.03, CUDA 11.4)
GPU 0: NVIDIA GeForce ... (00000000:02:00.0)  Fan 28%  Temp 60C  Perf P2  Power 47W / 60W  Memory 2740MiB / 3019MiB  GPU-Util 95%
Processes: bin/python 2731MiB (C)

Before I switched to Intel graphics it was about:
Xorg 169 MiB
xfwm4 2 MiB
Python 2835 MiB
if I'm not wrong.
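To get the same per-process breakdown on another machine, nvidia-smi can list the compute (CUDA) processes directly; a small sketch, assuming a single NVIDIA GPU:

```python
import subprocess

# PID, executable name and GPU memory of every CUDA process on the card.
result = subprocess.run(
    ["nvidia-smi",
     "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True)

for line in result.stdout.strip().splitlines():
    print(line)  # e.g. "1672379, bin/python, 2731 MiB"
```

Graphics clients such as Xorg are not compute processes, so once the display is handled by the Intel GPU nothing but the Python task should show up here.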
ID: 59177
"The problem with a Windows-managed pagefile is that it probably doesn't increase its size immediately when the Python application starts loading all of its spawned processes."
Thanks for the details. I was thinking about setting only the minimum size of the pagefile, but then decided to avoid system hiccups while the pagefile grows. After all, that's Windows (: Since then my 2nd laptop has been processing tasks without issues. Once again, thanks to all who pointed me in the right direction.
ID: 59179
Can anyone give any insight into the reason for these WU crashes?
ID: 59189
"Can anyone give any insight into the reason for these WU crashes?"
Look at the error below the traceback section. First one:

ValueError: Object arrays cannot be loaded when allow_pickle=False

Second and third one:

BrokenPipeError: [WinError 232] The pipe is being closed

In at least two of these cases, other people who ran the same WUs also had errors, so it is likely to be a problem with the task itself and not your system.
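For what it's worth, that first error is exactly what numpy raises when a .npy file containing Python objects is loaded with the default allow_pickle=False, so it points at the data shipped with the WU rather than the host. A tiny reproduction (the file name is made up):

```python
import numpy as np

# An object array can only be stored via pickling.
np.save("demo.npy", np.array([{"reward": 1.0}], dtype=object))

# Loading with the default allow_pickle=False raises:
# ValueError: Object arrays cannot be loaded when allow_pickle=False
try:
    np.load("demo.npy")
except ValueError as err:
    print(err)

# The task's own code would have to opt in explicitly:
data = np.load("demo.npy", allow_pickle=True)
print(data[0])
```

Nothing a volunteer does on the host side changes that, which is consistent with the same WUs failing for other people too.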
ID: 59190
Thanks a lot!
ID: 59191
There are more and more problem WUs -_-

BrokenPipeError: [WinError 232] The pipe is being closed
https://www.gpugrid.net/result.php?resultid=33022703
BrokenPipeError: [WinError 109] The pipe has been ended
ID: 59194
I have been seeing a number of the same failures for quite some time as well.
ID: 59195
My tasks are still failing with computation errors.
ID: 59572
The WUs that you have processed are failing with a CUDA error. Your GPU driver version 512.15 is somewhat outdated; I would suggest updating to the latest driver version, 526.86.
ID: 59574
Message boards : Number crunching : Recently Python app work unit failed with computing error