Message boards : Number crunching : GPU utilization low (& varying)
nvidia-smi samples taken about 10 seconds apart (NVIDIA-SMI 515.48.07, Driver Version 515.48.07, CUDA Version 11.7; performance state P0, Pwr:Usage/Cap reads N/A / N/A; compute process bin/python, PID 12881, 3261MiB; /usr/lib/xorg/Xorg also holds 4MiB):

Timestamp                  Temp   GPU-Util   Memory-Usage
Sun Jun 12 01:33:47 2022   69C    55%        3267MiB / 4096MiB
Sun Jun 12 01:33:57 2022   68C     0%        3267MiB / 4096MiB
Sun Jun 12 01:34:07 2022   67C     6%        3267MiB / 4096MiB
Sun Jun 12 01:34:17 2022   66C    11%        3267MiB / 4096MiB
Sun Jun 12 01:34:27 2022   69C     4%        3267MiB / 4096MiB

Across these samples the GPU utilization is low, often below 10%, and it keeps fluctuating. Is this normal? Other applications (like Collatz) were using the NVIDIA GPU at 100% utilization.
ID: 58902
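For logging this over time, a lighter-weight alternative to repeated full nvidia-smi dumps is its query interface, which prints one CSV line per sample. A minimal sketch (the field names are the standard ones listed by nvidia-smi --help-query-gpu; the 10-second interval just mirrors the samples above):

    nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv -l 10

Redirecting that to a file makes it easy to see the utilization swinging between roughly 0% and 55% while the python task runs.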
Yes, this is normal for the python application.
ID: 58904
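For a finer-grained look at how bursty the python app is, nvidia-smi also has a device-monitoring mode that prints one line per interval. A minimal sketch (option meanings per nvidia-smi dmon -h; the 1-second interval is an arbitrary choice):

    nvidia-smi dmon -s u -d 1

The sm and mem columns should sit near zero between short compute bursts, which matches the fluctuating figures in the samples above.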
Then with another application (acemd3) the GPU utilization is continuously 100%, but it looks like the run time will be 5-6 days for the work unit (same driver, NVIDIA-SMI 515.48.07 / CUDA 11.7):

Sun Jun 12 15:34:45 2022: 71C, 100% GPU-Util, 180MiB / 4096MiB memory used, compute process bin/acemd3 (PID 40431, 174MiB), /usr/lib/xorg/Xorg 4MiB.
ID: 58912
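The client's own runtime estimate can be cross-checked from the command line with boinccmd. A minimal sketch (assumes boinccmd from the boinc-client package is run on the same host; the exact field labels vary a little between client versions):

    boinccmd --get_tasks | grep -Ei 'name|fraction done|remaining'

Dividing the elapsed time by the fraction done gives a rough projection of the total run time, which is where a 5-6 day figure on a low-power mobile GPU comes from.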
Yes, it's normal. acemd3 has full utilization; python has low/intermittent utilization.
ID: 58914
It looks like the GPU/driver gets kicked out while running acemd3:

Sat Jun 18 11:10:09 2022: 93C, 100% GPU-Util, 180MiB / 4096MiB, compute process bin/acemd3 (PID 5433, 174MiB)
Sat Jun 18 11:10:14 2022: nvidia-smi reports ERR! for fan, temperature, performance state and utilization, "GPU is lost", and no running processes found.

Is it related to overheating or to insufficient power?
ID: 58938
Looks like the card fell off the bus.
ID: 58939
Jun 18 11:02:31 mx kernel: [130120.250658] NVRM: GPU at PCI:0000:02:00: GPU-793994bc-1295-4395-dc48-7dd3d7b431e2
Jun 18 11:02:31 mx kernel: [130120.250663] NVRM: Xid (PCI:0000:02:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Jun 18 11:02:31 mx kernel: [130120.250665] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
Jun 18 11:02:31 mx kernel: [130120.250687] NVRM: A GPU crash dump has been created. If possible, please run
Jun 18 11:03:58 mx kernel: [    7.229407] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
Jun 18 11:10:14 mx kernel: [  396.557669] NVRM: GPU at PCI:0000:02:00: GPU-793994bc-1295-4395-dc48-7dd3d7b431e2
Jun 18 11:10:14 mx kernel: [  396.557674] NVRM: Xid (PCI:0000:02:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Jun 18 11:10:14 mx kernel: [  396.557676] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
Jun 18 11:10:14 mx kernel: [  396.557697] NVRM: A GPU crash dump has been created. If possible, please run
Jun 18 20:11:43 mx kernel: [    6.301529] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver

Yes, it says the GPU has fallen off the bus. It could be difficult to downgrade the driver version, and I am not sure why this occurs. Maybe the Linux driver also draws more power than the Windows driver.
ID: 58940
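To catch these events live (and to see whether they always follow a temperature spike), the kernel log can be followed while a task runs. A minimal sketch (journalctl -k is standard systemd; the pattern simply matches the NVRM/Xid lines quoted above):

    sudo journalctl -k -f | grep -Ei 'NVRM|Xid'

Xid 79 ("GPU has fallen off the bus") means the device stopped responding on the PCIe bus, which points towards a thermal or power event rather than a fault in the task itself.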
I would think it is overheating related, judging by what happens right after it starts up (now on NVIDIA-SMI 510.73.05, Driver Version 510.73.05, CUDA Version 11.6):

Sat Jun 18 15:56:49 2022: 93C, 100% GPU-Util, 180MiB / 4096MiB, compute process bin/acemd3 (PID 17273, 174MiB)
Sat Jun 18 15:56:54 2022: nvidia-smi reports ERR! values and "GPU is lost"; no running processes found.
ID: 58941
92 °C is, I believe, the standard throttle temperature for current Nvidia cards.
ID: 58942
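The thresholds for a specific card can be read back from the driver rather than assumed. A minimal sketch:

    nvidia-smi -q -d TEMPERATURE

On recent drivers this lists values such as GPU Slowdown Temp, GPU Shutdown Temp and GPU Max Operating Temp (some may read N/A on mobile parts); a card sitting at 93 °C under load is close to, or beyond, the slowdown point of many GeForce laptop GPUs.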
I would think it is overheating related, judging by what happens right after it starts up

I agree that the problem is most likely produced by overheating. The GPU detaching from the bus when reaching 93 ºC could be caused by two things:
- a hard-coded GPU self-protection mechanism kicking in, or
- an electromechanical problem, with solder at some GPU pin reaching its melting point and losing good electrical contact (very dangerous in the long run!).

IMHO, running ACEMD3 / ACEMD4 tasks, which are highly optimized to squeeze the maximum power out of the GPU, should be limited (if attempted at all) to very well cooled laptops. Keith Myers' previous wise advice could be useful. Python tasks are currently less demanding of GPU power, making them more appropriate for laptops that meet the requirements. Which apps to run can be selected on the Project preferences page.

Additionally, I've specifically treated laptop overheating problems in:
- Message #52937
- Message #57435
ID: 58945
There is probably very little that can be done if the laptop's cooling is not good enough.
ID: 58946
Deselect the acemd3 tasks. They drive a GPU to its maximum capabilities and will overwhelm a laptop's weak heat-pipe cooling solution.
ID: 58947
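Deselecting the app on the GPUGRID Project preferences page (typically under "Run only the selected applications") is the clean way to do that. A client-side alternative is an exclude_gpu entry in BOINC's cc_config.xml; a minimal sketch, where the project URL and the short app name "acemd3" are assumptions that should be checked against what your client actually reports:

    <cc_config>
      <options>
        <exclude_gpu>
          <url>https://www.gpugrid.net/</url>
          <app>acemd3</app>   <!-- assumed short app name; omit <app> to exclude the GPU for the whole project -->
        </exclude_gpu>
      </options>
    </cc_config>

After editing, "Options / Read config files" in the BOINC Manager (or restarting the client) applies the change.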
Also, based on the discussion below, they suggest the laptop has a "defective GPU":
ID: 58954
It looks like the GPU is drawing more power than the 90 W charger adapter can provide, and the battery level goes down even while the laptop is connected to the charger.
ID: 58955
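Whether the machine is really discharging under load can be confirmed from the kernel's power-supply interface. A minimal sketch (BAT0 is the usual name for the first battery but is an assumption here; check the names under /sys/class/power_supply/):

    cat /sys/class/power_supply/BAT0/status     # "Discharging" while plugged in confirms the deficit
    cat /sys/class/power_supply/BAT0/capacity   # remaining charge in percent

If the status stays at "Discharging" with the 90 W adapter connected, the combined CPU + GPU load really is more than the adapter can supply.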
You may still need a higher-power charger. Depending on the laptop, the next step up is probably 130 W.
ID: 58959