195 (0xc3) EXIT_CHILD_FAILED
Author | Message |
---|---|
My RTX 3080 machine completed a first task successfully. Afterwards, two more tasks crashed with a 195 (0xc3) EXIT_CHILD_FAILED error message and the following log (after only a few seconds of run time): Name e2s184_e1s254p0f959-ADRIA_AdB_KIXCMYB_HIP-0-2-RND9959_9 Workunit 27080023 Created 4 Oct 2021 | 9:59:05 UTC Sent 4 Oct 2021 | 10:48:16 UTC Received 4 Oct 2021 | 22:07:40 UTC Server state Over Outcome Computation error Client state Compute error Exit status 195 (0xc3) EXIT_CHILD_FAILED Computer ID 584499 Report deadline 9 Oct 2021 | 10:48:16 UTC Run time 25.51 CPU time 0.00 Validate state Invalid Credit 0.00 Application version New version of ACEMD v2.18 (cuda101) Stderr output <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> (unknown error) - exit code 195 (0xc3)</message> <stderr_txt> 00:05:49 (30732): wrapper (7.9.26016): starting 00:05:49 (30732): wrapper: running bin/acemd3.exe (--boinc --device 0) ACEMD failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch) 00:05:59 (30732): bin/acemd3.exe exited; CPU time 0.000000 00:05:59 (30732): app exit status: 0x1 00:05:59 (30732): called boinc_finish(195) 0 bytes in 0 Free Blocks. 186 bytes in 4 Normal Blocks. 1144 bytes in 1 CRT Blocks. 0 bytes in 0 Ignore Blocks. 0 bytes in 0 Client Blocks. Largest number used: 0 bytes. Total allocations: 239849 bytes. Dumping objects -> {323256} normal block at 0x000001B7D23D3BC0, 85 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 ..\api\boinc_api.cpp(309) : {323253} normal block at 0x000001B7D23D4940, 8 bytes long. Data: < 1Ò· > 00 00 31 D2 B7 01 00 00 {322608} normal block at 0x000001B7D23D3C60, 85 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 {321994} normal block at 0x000001B7D23D46C0, 8 bytes long. Data: <@Ê?Ò· > 40 CA 3F D2 B7 01 00 00 ..\zip\boinc_zip.cpp(122) : {146} normal block at 0x000001B7D23D3090, 260 bytes long. 
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 {133} normal block at 0x000001B7D23D48A0, 16 bytes long. Data: < ø<Ò· > 10 F8 3C D2 B7 01 00 00 00 00 00 00 00 00 00 00 {132} normal block at 0x000001B7D23CF810, 40 bytes long. Data: < H=Ò· conda-pa> A0 48 3D D2 B7 01 00 00 63 6F 6E 64 61 2D 70 61 {125} normal block at 0x000001B7D23CF340, 48 bytes long. Data: <--boinc --device> 2D 2D 62 6F 69 6E 63 20 2D 2D 64 65 76 69 63 65 {124} normal block at 0x000001B7D23D49E0, 16 bytes long. Data: <XN=Ò· > 58 4E 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {123} normal block at 0x000001B7D23D4C60, 16 bytes long. Data: <0N=Ò· > 30 4E 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {122} normal block at 0x000001B7D23D4850, 16 bytes long. Data: < N=Ò· > 08 4E 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {121} normal block at 0x000001B7D23D3DB0, 16 bytes long. Data: <àM=Ò· > E0 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {120} normal block at 0x000001B7D23D4030, 16 bytes long. Data: <¸M=Ò· > B8 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {119} normal block at 0x000001B7D23D4080, 16 bytes long. Data: < M=Ò· > 90 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {118} normal block at 0x000001B7D23D4120, 16 bytes long. Data: <pM=Ò· > 70 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {117} normal block at 0x000001B7D23D4990, 16 bytes long. Data: <HM=Ò· > 48 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {116} normal block at 0x000001B7D23D42B0, 16 bytes long. Data: < M=Ò· > 20 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {115} normal block at 0x000001B7D23D4D20, 496 bytes long. Data: <°B=Ò· bin/acem> B0 42 3D D2 B7 01 00 00 62 69 6E 2F 61 63 65 6D {65} normal block at 0x000001B7D23C2D80, 16 bytes long. Data: < ê—{÷ > 80 EA 97 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {64} normal block at 0x000001B7D23C2B50, 16 bytes long. Data: <@é—{÷ > 40 E9 97 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {63} normal block at 0x000001B7D23C2B00, 16 bytes long. 
Data: <øW”{÷ > F8 57 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {62} normal block at 0x000001B7D23C2AB0, 16 bytes long. Data: <ØW”{÷ > D8 57 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {61} normal block at 0x000001B7D23C3370, 16 bytes long. Data: <P ”{÷ > 50 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {60} normal block at 0x000001B7D23C2A60, 16 bytes long. Data: <0 ”{÷ > 30 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {59} normal block at 0x000001B7D23C3500, 16 bytes long. Data: <à ”{÷ > E0 02 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {58} normal block at 0x000001B7D23C3640, 16 bytes long. Data: < ”{÷ > 10 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {57} normal block at 0x000001B7D23C2A10, 16 bytes long. Data: <p ”{÷ > 70 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {56} normal block at 0x000001B7D23C3870, 16 bytes long. Data: < À’{÷ > 18 C0 92 7B F7 7F 00 00 00 00 00 00 00 00 00 00 Object dump complete. </stderr_txt> ]]> Name e4s109_e1s39p0f745-ADRIA_AdB_KIXCMYB_HIP-1-2-RND2493_0 Workunit 27081645 Created 4 Oct 2021 | 22:12:32 UTC Sent 4 Oct 2021 | 22:14:12 UTC Received 4 Oct 2021 | 22:16:12 UTC Server state Over Outcome Computation error Client state Compute error Exit status 195 (0xc3) EXIT_CHILD_FAILED Computer ID 584499 Report deadline 9 Oct 2021 | 22:14:12 UTC Run time 7.26 CPU time 0.00 Validate state Invalid Credit 0.00 Application version New version of ACEMD v2.18 (cuda101) Stderr output <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> (unknown error) - exit code 195 (0xc3)</message> <stderr_txt> 00:14:24 (14320): wrapper (7.9.26016): starting 00:14:24 (14320): wrapper: running bin/acemd3.exe (--boinc --device 0) ACEMD failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch) 00:14:26 (14320): bin/acemd3.exe exited; CPU time 0.000000 00:14:26 (14320): app exit status: 0x1 00:14:26 (14320): called boinc_finish(195) 0 bytes in 0 Free Blocks. 186 bytes in 4 Normal Blocks. 
1144 bytes in 1 CRT Blocks. 0 bytes in 0 Ignore Blocks. 0 bytes in 0 Client Blocks. Largest number used: 0 bytes. Total allocations: 241603 bytes. Dumping objects -> {323256} normal block at 0x000002061D1C3BC0, 85 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 ..\api\boinc_api.cpp(309) : {323253} normal block at 0x000002061D1C43F0, 8 bytes long. Data: < > 00 00 02 1D 06 02 00 00 {322608} normal block at 0x000002061D1C3C60, 85 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 {321994} normal block at 0x000002061D1C42B0, 8 bytes long. Data: <@Ê > 40 CA 1E 1D 06 02 00 00 ..\zip\boinc_zip.cpp(122) : {146} normal block at 0x000002061D1C3090, 260 bytes long. Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 {133} normal block at 0x000002061D1C3EF0, 16 bytes long. Data: <Ðò > D0 F2 1B 1D 06 02 00 00 00 00 00 00 00 00 00 00 {132} normal block at 0x000002061D1BF2D0, 40 bytes long. Data: <ð> conda-pa> F0 3E 1C 1D 06 02 00 00 63 6F 6E 64 61 2D 70 61 {125} normal block at 0x000002061D1BF180, 48 bytes long. Data: <--boinc --device> 2D 2D 62 6F 69 6E 63 20 2D 2D 64 65 76 69 63 65 {124} normal block at 0x000002061D1C4940, 16 bytes long. Data: <XN > 58 4E 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {123} normal block at 0x000002061D1C4490, 16 bytes long. Data: <0N > 30 4E 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {122} normal block at 0x000002061D1C4800, 16 bytes long. Data: < N > 08 4E 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {121} normal block at 0x000002061D1C47B0, 16 bytes long. Data: <àM > E0 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {120} normal block at 0x000002061D1C48A0, 16 bytes long. Data: <¸M > B8 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {119} normal block at 0x000002061D1C4710, 16 bytes long. Data: < M > 90 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {118} normal block at 0x000002061D1C48F0, 16 bytes long. 
Data: <pM > 70 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {117} normal block at 0x000002061D1C4990, 16 bytes long. Data: <HM > 48 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {116} normal block at 0x000002061D1C4A80, 16 bytes long. Data: < M > 20 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {115} normal block at 0x000002061D1C4D20, 496 bytes long. Data: < J bin/acem> 80 4A 1C 1D 06 02 00 00 62 69 6E 2F 61 63 65 6D {65} normal block at 0x000002061D1B36E0, 16 bytes long. Data: < ê—{÷ > 80 EA 97 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {64} normal block at 0x000002061D1B3410, 16 bytes long. Data: <@é—{÷ > 40 E9 97 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {63} normal block at 0x000002061D1B3820, 16 bytes long. Data: <øW”{÷ > F8 57 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {62} normal block at 0x000002061D1B33C0, 16 bytes long. Data: <ØW”{÷ > D8 57 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {61} normal block at 0x000002061D1B3190, 16 bytes long. Data: <P ”{÷ > 50 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {60} normal block at 0x000002061D1B3000, 16 bytes long. Data: <0 ”{÷ > 30 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {59} normal block at 0x000002061D1B2FB0, 16 bytes long. Data: <à ”{÷ > E0 02 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {58} normal block at 0x000002061D1B3320, 16 bytes long. Data: < ”{÷ > 10 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {57} normal block at 0x000002061D1B2F60, 16 bytes long. Data: <p ”{÷ > 70 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {56} normal block at 0x000002061D1B3140, 16 bytes long. Data: < À’{÷ > 18 C0 92 7B F7 7F 00 00 00 00 00 00 00 00 00 00 Object dump complete. </stderr_txt> ]]> Any idea what is going on? Very annoying is the fact that after these two consecutive crashes, it took the GPUGRID server 4 hours to send out a new task (which is now in progress) - making my machine uselessly idling for hours. Michael. ____________ President of Rechenkraft.net - Germany's first and largest distributed computing organization. | |
ID: 57480 | Rating: 0 | rate: / Reply Quote | |
Your computers are hidden, so I can't be certain, but your problem seems to be: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch). There are two versions of the new GPUGrid application: cuda1121 and cuda101. You will be able to see in your task list which worked, and which didn't work. Despite some posts to the contrary, the general consensus is that cuda1121 works on an RTX 3080, and cuda101 doesn't. And despite an assurance from the project that they have prevented the cuda101 application being sent to RTX cards, clearly they haven't. There's nothing you can do to prevent the wrong application being sent to your card: just take comfort from the fact that cuda101 tasks will fail very quickly, and you won't waste computing power on the tasks. The only hit you and the project are taking is the waste of bandwidth downloading the inappropriate tasks. | |
ID: 57481 | Rating: 0 | rate: / Reply Quote | |
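A likely reading of the nvrtc error above is that the CUDA 10.1 runtime compiler has no target for the Ampere architecture of an RTX 3080 (compute capability 8.6). The sketch below illustrates that check. The table lists only a few toolkit releases and is an illustration taken from NVIDIA's release notes, not something the GPUGrid application itself consults; the function name is hypothetical.

```python
# Highest compute capability each CUDA toolkit release can compile for.
# Illustrative subset only; not read from the GPUGrid app or scheduler.
MAX_COMPUTE_CAPABILITY = {
    (10, 1): (7, 5),  # up to Turing
    (11, 0): (8, 0),  # adds Ampere A100
    (11, 1): (8, 6),  # adds Ampere RTX 30xx
    (11, 2): (8, 6),
}


def toolkit_supports(toolkit, capability):
    """Return True if the toolkit can target the given compute capability."""
    return capability <= MAX_COMPUTE_CAPABILITY[toolkit]


RTX_3080 = (8, 6)
print(toolkit_supports((10, 1), RTX_3080))  # False
print(toolkit_supports((11, 2), RTX_3080))  # True
```

This only models the compiler side; whether a given task actually fails can also depend on how the application was built.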
Thank you Richard - only cuda1121 works for me. | |
ID: 57482 | Rating: 0 | rate: / Reply Quote | |
And despite an assurance from the project that they have prevented the cuda101 application being sent to RTX cards, clearly they haven't. I endorse this statement. I have been sent cuda101 tasks for my two RTX 3070s, the latest one this morning. There's nothing you can do to prevent the wrong application being sent to your card: just take comfort from the fact that cuda101 tasks will fail very quickly, and you won't waste computing power on the tasks. The only hit you and the project are taking is the waste of bandwidth downloading the inappropriate tasks. However, there is more to it: if one deletes erroneously downloaded cuda101 tasks from the BOINC task list too often, one will not receive any more tasks for the next 24 hours. Hence, this problem should be solved by the project team ASAP! | |
ID: 57484 | Rating: 0 | rate: / Reply Quote | |
I also don't quite understand what information determines the app version to be sent out. This task, for example, has been sent out 6 times before my host caught it. Once it was the 1121 app version; all others were sent out as the 101 app version. It failed on all previous hosts and went through 3 Ampere cards (3060Ti, 3070 & 3090). Seems to be quite an annoyance for anyone with the latest cards. And older cards take some serious chewing on the new tasks. Mine takes a little over 31 hrs. This project could be working much more efficiently if it were able to fully capture the potential of these RTX 3000 series cards. | |
ID: 57486 | Rating: 0 | rate: / Reply Quote | |
But there are two different failure modes - three of each: three missing DLLs (probably vcruntime140_1), and three wrong architecture (cuda101 on RTX) | |
ID: 57487 | Rating: 0 | rate: / Reply Quote | |
That sounds about right. I only meant to highlight the Ampere cards that all failed, obviously due to the wrong version having been sent to these hosts, while older gen cards somehow got the 1121 app version instead on some occasions. | |
ID: 57488 | Rating: 0 | rate: / Reply Quote | |
All the more reason to just retire the cuda101 app, and force everyone to update their drivers to use the cuda1121 app | |
ID: 57489 | Rating: 0 | rate: / Reply Quote | |
All the more reason to just retire the cuda101 app, and force everyone to update their drivers to use the cuda1121 app I disagree. People should be allowed their own choice of driver (you don't know why they've kept an older one), but the project should manage the minimum limits better. | |
ID: 57490 | Rating: 0 | rate: / Reply Quote | |
IMO, the "choice" of driver in the ranges of CUDA101 and CUDA1121 compatibility will be arbitrary. the list of supported products is exactly the same so it's not like some older GPU won't be supported anymore with the newer driver. Nvidia drivers are very stable and it's pretty rare that a new driver fully breaks something. CUDA101 requires driver 418.xx, CUDA1121 requires driver 461.xx, there's not a huge difference here. but even still there's a large range of "choice" between the minimum driver required for cuda1121 and the current latest driver release. they don't need to be bleeding edge. CUDA 11.2 was introduced almost a year ago, and it's currently up to CUDA 11.4 Update 2. | |
ID: 57491 | Rating: 0 | rate: / Reply Quote | |
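The driver minimums quoted above (418.xx for cuda101, 461.xx for cuda1121) can be turned into a small eligibility check. This is a minimal sketch under those stated numbers: the dictionary and function names are hypothetical, the minor versions are set to 0 because the post only gives "xx", and it says nothing about how the GPUGrid scheduler really assigns plan classes.

```python
# Minimum driver per app version, as quoted in the post above.
# The ".xx" minor versions are unknown, so 0 is assumed here.
MIN_DRIVER = {
    "cuda101": (418, 0),
    "cuda1121": (461, 0),
}


def parse_driver(version):
    """Turn a driver string such as '471.11' into a comparable tuple."""
    major, minor = version.split(".")
    return int(major), int(minor)


def runnable_apps(driver_version):
    """List the app versions whose minimum driver requirement is met."""
    driver = parse_driver(driver_version)
    return sorted(app for app, minimum in MIN_DRIVER.items() if driver >= minimum)


print(runnable_apps("441.20"))  # ['cuda101']
print(runnable_apps("471.11"))  # ['cuda101', 'cuda1121']
```

A driver that is new enough is necessary but not sufficient: as discussed earlier in the thread, the cuda101 app can still fail on an Ampere card even though the driver satisfies its minimum.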
it's not a huge change to get more recent drivers, and the observed negative impacts from the project maintaining two app versions far outweigh the impact of requiring a user to update their drivers. It can be if your computer is managed by your employer's domain controller, and group policy prevents you updating it yourself. Just an example. | |
ID: 57492 | Rating: 0 | rate: / Reply Quote | |
it's not a huge change to get more recent drivers, and the observed negative impacts from the project maintaining two app versions far outweigh the impact of requiring a user to update their drivers. in this case, it's MORE likely that these systems will (*should*) be updated to a recent driver, as any competent SA will (*should*) be keeping everything on the up and up in terms of security patches, and there has been a stronger push from Nvidia in this regard lately. | |
ID: 57493 | Rating: 0 | rate: / Reply Quote | |
I think we're far enough off topic. Let's leave it there. | |
ID: 57494 | Rating: 0 | rate: / Reply Quote | |
Well, this project's inability to deliver the proper GPU app/plan class to the corresponding GPU systems simply results in a massive loss of overall project performance: due to the repeated "compute errors", the clients do not receive further tasks for a while and sit idle for hours. I figure that this way, instead of two tasks per day, I can deliver only one. | |
ID: 57544 | Rating: 0 | rate: / Reply Quote | |
Glad at least one of my hosts is running, but all the others are NOT! | |
ID: 57572 | Rating: 0 | rate: / Reply Quote | |
Nvidia driver version 441.20 is a bit old. I am currently running version 471.11 on my Windows hosts. | |
ID: 57575 | Rating: 0 | rate: / Reply Quote | |
I second that. There's nothing you can do to prevent the wrong application being sent to your card: just take comfort from the fact that cuda101 tasks will fail very quickly, and you won't waste computing power on the tasks. Unfortunately, that is exactly NOT the case. Here is an example of a task which ran for almost 15 hrs before failing with an error: Task: https://www.gpugrid.net/result.php?resultid=32653715 Name e7s106_e5s196p1f1036-ADRIA_AdB_KIXCMYB_HIP-1-2-RND0214_4 Workunit 27082868 Created 11 Oct 2021 | 6:23:21 UTC Sent 11 Oct 2021 | 6:24:56 UTC Received 12 Oct 2021 | 9:32:05 UTC Server state Over Outcome Computation error Client state Compute error Exit status 195 (0xc3) EXIT_CHILD_FAILED Computer ID 588794 Report deadline 16 Oct 2021 | 6:24:56 UTC Run time 53,608.48 CPU time 53,473.36 Validate state Invalid Credit 0.00 Application version New version of ACEMD v2.18 (cuda101) Stderr output <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> (unknown error) - exit code 195 (0xc3)</message> <stderr_txt> 08:25:11 (11620): wrapper (7.9.26016): starting 08:25:11 (11620): wrapper: running bin/acemd3.exe (--boinc --device 0) Detected memory leaks! Dumping objects -> ..\api\boinc_api.cpp(309) : {323250} normal block at 0x000002C574996E70, 8 bytes long. Data: < $v > 00 00 24 76 C5 02 00 00 ..\lib\diagnostics_win.cpp(417) : {321999} normal block at 0x000002C576431310, 1080 bytes long. Data: < > FC 08 00 00 CD CD CD CD 0C 01 00 00 00 00 00 00 ..\zip\boinc_zip.cpp(122) : {149} normal block at 0x000002C57499EBA0, 260 bytes long. Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Object dump complete. 10:58:41 (4808): wrapper (7.9.26016): starting 10:58:41 (4808): wrapper: running bin/acemd3.exe (--boinc --device 0) Detected memory leaks! Dumping objects -> ..\api\boinc_api.cpp(309) : {323286} normal block at 0x000001DA265261F0, 8 bytes long. 
Data: < J& > 00 00 4A 26 DA 01 00 00 ..\lib\diagnostics_win.cpp(417) : {322035} normal block at 0x000001DA26591B80, 1080 bytes long. Data: <h > 68 1A 00 00 CD CD CD CD 20 01 00 00 00 00 00 00 ..\zip\boinc_zip.cpp(122) : {149} normal block at 0x000001DA2652EB00, 260 bytes long. Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Object dump complete. 11:30:47 (13592): wrapper (7.9.26016): starting 11:30:47 (13592): wrapper: running bin/acemd3.exe (--boinc --device 0) ACEMD failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch) 11:30:49 (13592): bin/acemd3.exe exited; CPU time 0.000000 11:30:49 (13592): app exit status: 0x1 11:30:49 (13592): called boinc_finish(195) 0 bytes in 0 Free Blocks. 298 bytes in 4 Normal Blocks. 1144 bytes in 1 CRT Blocks. 0 bytes in 0 Ignore Blocks. 0 bytes in 0 Client Blocks. Largest number used: 0 bytes. Total allocations: 130740 bytes. Dumping objects -> {323289} normal block at 0x0000014269701A70, 141 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 ..\api\boinc_api.cpp(309) : {323286} normal block at 0x00000142696C62F0, 8 bytes long. Data: < eiB > 00 00 65 69 42 01 00 00 {322649} normal block at 0x00000142697020F0, 141 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 {322036} normal block at 0x00000142696C6890, 8 bytes long. Data: <p siB > 70 1B 73 69 42 01 00 00 ..\zip\boinc_zip.cpp(122) : {149} normal block at 0x00000142696CE940, 260 bytes long. Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 {136} normal block at 0x00000142696C7060, 16 bytes long. Data: <p«liB > 70 AB 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {135} normal block at 0x00000142696CAB70, 40 bytes long. Data: <`pliB conda-pa> 60 70 6C 69 42 01 00 00 63 6F 6E 64 61 2D 70 61 {128} normal block at 0x00000142696CAB00, 48 bytes long. 
Data: <--boinc --device> 2D 2D 62 6F 69 6E 63 20 2D 2D 64 65 76 69 63 65 {127} normal block at 0x00000142696C6FC0, 16 bytes long. Data: <8øliB > 38 F8 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {126} normal block at 0x00000142696C6AC0, 16 bytes long. Data: < øliB > 10 F8 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {125} normal block at 0x00000142696C6A70, 16 bytes long. Data: <è÷liB > E8 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {124} normal block at 0x00000142696C6A20, 16 bytes long. Data: <À÷liB > C0 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {123} normal block at 0x00000142696C6C00, 16 bytes long. Data: < ÷liB > 98 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {122} normal block at 0x00000142696C6980, 16 bytes long. Data: <p÷liB > 70 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {121} normal block at 0x00000142696C70B0, 16 bytes long. Data: <P÷liB > 50 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {120} normal block at 0x00000142696C6930, 16 bytes long. Data: <(÷liB > 28 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {119} normal block at 0x00000142696C6570, 16 bytes long. Data: < ÷liB > 00 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {118} normal block at 0x00000142696CF700, 496 bytes long. Data: <peliB bin/acem> 70 65 6C 69 42 01 00 00 62 69 6E 2F 61 63 65 6D {68} normal block at 0x00000142696C62A0, 16 bytes long. Data: < ê¼ ö > 80 EA BC 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {67} normal block at 0x00000142696C6CF0, 16 bytes long. Data: <@é¼ ö > 40 E9 BC 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {66} normal block at 0x00000142696C6480, 16 bytes long. Data: <øW¹ ö > F8 57 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {65} normal block at 0x00000142696C6520, 16 bytes long. Data: <ØW¹ ö > D8 57 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {64} normal block at 0x00000142696C6840, 16 bytes long. Data: <P ¹ ö > 50 04 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {63} normal block at 0x00000142696C6B60, 16 bytes long. 
Data: <0 ¹ ö > 30 04 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {62} normal block at 0x00000142696C6390, 16 bytes long. Data: <à ¹ ö > E0 02 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {61} normal block at 0x00000142696C6250, 16 bytes long. Data: < ¹ ö > 10 04 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {60} normal block at 0x00000142696C66B0, 16 bytes long. Data: <p ¹ ö > 70 04 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {59} normal block at 0x00000142696C67F0, 16 bytes long. Data: < À· ö > 18 C0 B7 1A F6 7F 00 00 00 00 00 00 00 00 00 00 Object dump complete. </stderr_txt> ]]> Currently, I have another one with cuda101 (falsely selected by the server for this client) which is now running for several hours. Michael. ____________ President of Rechenkraft.net - Germany's first and largest distributed computing organization. | |
ID: 57598 | Rating: 0 | rate: / Reply Quote | |
...it actually caused my machine to crash and was re-starting after re-boot. So I aborted it. Michael. ____________ President of Rechenkraft.net - Germany's first and largest distributed computing organization. | |
ID: 57605 | Rating: 0 | rate: / Reply Quote | |
I had an "195 (0xc3) EXIT_CHILD_FAILED" case this afternoon, a few seconds after start: | |
ID: 57642 | Rating: 0 | rate: / Reply Quote | |
Look at your own link: | |
ID: 57643 | Rating: 0 | rate: / Reply Quote | |
Faulty data - a bad task. Not your fault. +1 Windows Operating System: EXCEPTIONAL CONDITION: src\mdio\bincoord.c, line 193: "nelems != 1" Linux Operating System: EXCEPTIONAL CONDITION: /home/user/conda/conda-bld/acemd3_1618916459379/work/src/mdio/bincoord.c, line 193: "nelems != 1" When this warning appears, it usually implies that there is some definition error in the task's initial parameters. It is a task that was badly constructed at origin, and it will fail on every host that receives it. Looking at Work Unit #27084895, to which this task belongs, it has previously failed on several other hosts, on both Windows and Linux operating systems. The fate of a Work Unit like this is to be retired once the maximum number of allowed failed tasks (7) is reached... | |
ID: 57644 | Rating: 0 | rate: / Reply Quote | |
New task from batch e1s34_0-ADRIA_GPCR2021_APJ_b0-0-1-RND2256 appears to break at start with Not A Number for coordinate: ACEMD failed: Particle coordinate is nan Errorcode: process exited with code 195 (0xc3, -61) WU: https://www.gpugrid.net/workunit.php?wuid=27087091 | |
ID: 57865 | Rating: 0 | rate: / Reply Quote | |
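The three numbers in "code 195 (0xc3, -61)" are the same exit status seen three ways: decimal, hexadecimal, and the low byte read as a signed 8-bit value. A minimal sketch of that arithmetic (the function name is hypothetical, not part of BOINC):

```python
def exit_code_views(code):
    """Return the hex form and the signed 8-bit reading of an exit code."""
    low_byte = code & 0xFF
    # Values of 128 and above wrap to negative when read as a signed byte.
    signed = low_byte - 256 if low_byte >= 128 else low_byte
    return hex(low_byte), signed


print(exit_code_views(195))  # ('0xc3', -61)
```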
Same here with two different Linux machines: | |
ID: 57866 | Rating: 0 | rate: / Reply Quote | |
Same here - RTX2060 card. Fails after 10 seconds. | |
ID: 57867 | Rating: 0 | rate: / Reply Quote | |
have the same problem with 4x Nvidia T600 | |
ID: 57868 | Rating: 0 | rate: / Reply Quote | |
Looks like we all have the same issue with NaN. I've bombed through a couple dozen today for wasted download cap. | |
ID: 57869 | Rating: 0 | rate: / Reply Quote | |
Same on my end. Had almost 80 tasks on a single machine today that all failed with said NaN error. Why were so many faulty tasks sent out in the first place? | |
ID: 57870 | Rating: 0 | rate: / Reply Quote | |
Thrown away 150 bad tasks today. | |
ID: 57871 | Rating: 0 | rate: / Reply Quote | |
Zero valid tasks returned overnight; it's clearly a badly constructed batch. | |
ID: 57872 | Rating: 0 | rate: / Reply Quote | |
I got 16 errored tasks: 8 of cuda101, 8 of cuda1121. No tasks successfully completed. "Particle coordinate is nan" and "The requested CUDA device could not be loaded". | |
ID: 57873 | Rating: 0 | rate: / Reply Quote | |
Looks like they are sending out corrected tasks now from that last batch. | |
ID: 57875 | Rating: 0 | rate: / Reply Quote | |
Right, | |
ID: 57876 | Rating: 0 | rate: / Reply Quote | |
Only partial success! | |
ID: 57877 | Rating: 0 | rate: / Reply Quote | |
here the tasks failed within a few seconds: | |
ID: 57878 | Rating: 0 | rate: / Reply Quote | |
My I7 machine with a RTX 2070 and my I3 machine with a GTX 1060 were likewise restarted with GPU Grid and several tasks have all repeatedly failed within 10-13 seconds of starting. Those two machines are running the same tasks - with 'Bandit' in their name - that are running successfully on other machines. So the problem lies in your machine setup, not in the tasks themselves. A lot of other machines have the same problem - you're not alone. Your tasks are failing with 'app exit status: 0xc0000135' - in all likelihood, you are missing a Microsoft runtime DLL file. Please refer to message 57353 | |
ID: 57879 | Rating: 0 | rate: / Reply Quote | |
excerpt from stderr: It's not the tasks that fail on Ampere cards - it's the CUDA 101 app. The 1121 app should be OK. | |
ID: 57880 | Rating: 0 | rate: / Reply Quote | |
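Several distinct causes hide behind the same 195 (0xc3) exit status in this thread, and the stderr text is what tells them apart. The sketch below is a hypothetical helper that matches the error strings quoted in the posts above against the diagnoses given by posters here. The causes are the thread's own conclusions, not official project documentation.

```python
# (stderr fragment, diagnosis offered in this thread); checked in order.
SIGNATURES = [
    ("invalid value for --gpu-architecture",
     "wrong app version for this GPU (cuda101 on an Ampere card)"),
    ("0xc0000135",
     "missing Microsoft runtime DLL (probably vcruntime140_1)"),
    ('"nelems != 1"',
     "badly constructed task; fails on every host"),
    ("Particle coordinate is nan",
     "faulty batch; fails on every host"),
]


def classify(stderr_text):
    """Return the first matching diagnosis, or 'unknown' if none matches."""
    for fragment, cause in SIGNATURES:
        if fragment in stderr_text:
            return cause
    return "unknown"


sample = "ACEMD failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)"
print(classify(sample))
```

The first two causes are host-side (app selection or machine setup), while the last two are task-side and cannot be fixed by the volunteer.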
excerpt from stderr: This was exactly the problem when the previous batch of WUs started: CUDA101 apps could not be crunched on Ampere cards, 1121 WUs went well. Then, someone here from the forum posted instructions on how to change the content of a specific file in the - I guess - GPUGRID project folder (I forgot which file that was). I followed these instructions, and from then on my RTX3070 cards were crunching both CUDA versions. Since this is no longer the case with the current batch, I suppose something must be different with these new WUs. From what I remember, about half of the WUs I had crunched over several weeks were 101, and about the other half were 1121. Of course, it is rather impractical to just try downloading task after task and hoping that a 1121 will show up some time. As is known, after a number of unsuccessful WUs, downloads of new ones are blocked for a day. I would have expected that the GPUGRID people have repaired this specific problem in the meantime. Which obviously is not the case. So they keep blocking an increasing number of hosts, which to me does not make any sense at all :-( | |
ID: 57881 | Rating: 0 | rate: / Reply Quote | |
I now remember: it was the Conda-pack.zip... file of which the content had to be changed. | |
ID: 57882 | Rating: 0 | rate: / Reply Quote | |
I would have expected that the GPUGRID people have repaired this specific problem in the meantime. Which obviously is not the case. So they keep blocking an increasing number of hosts, which to me does not make any sense at all :-( I agree completely. But since the project doesn't seem to be (effectively) learning the lessons from previous mistakes, the best we can do is to perform the analysis for them, draw attention to the precise causes, and do what we can to ensure that at least some scientific research is completed successfully. Just burning up tasks with failures, until the maximum error limit for the WU is reached, doesn't help anyone. The file you need to change is acemd3.exe - it can be found in your current conda-pack.zip.xxxxxx file, in the GPUGrid project folder. Check whether a newer version of that file has been downloaded since you last modified it. Mine is currently dated 05 October - later than our last major discussion on the subject. That zip pack should also contain vcruntime140_1.dll, but I don't know if simply placing it in the zip would help - it might need to be specifically added to the unpacking instruction list as well. | |
ID: 57883 | Rating: 0 | rate: / Reply Quote | |
the Conda-pack file which is currently in the GPUGRID folder is named "Conda-pack.zip.1d5...358" and is dated 4.10.21. | |
ID: 57884 | Rating: 0 | rate: / Reply Quote | |
Have a look at the job.xml.xxxxxx file. <job_desc> <task> <application>bin/acemd3.exe</application> <command_line>--boinc --device $GPU_DEVICE_NUM</command_line> <stdout_filename>progress.log</stdout_filename> <checkpoint_filename>restart.chk</checkpoint_filename> <fraction_done_filename>progress</fraction_done_filename> </task> <unzip_input> <zipfilename>conda-pack.zip</zipfilename> </unzip_input> </job_desc> and another dated yesterday that says <job_desc> <task> <application>bin/acemd3.exe</application> <command_line>--boinc --device $GPU_DEVICE_NUM</command_line> <stdout_filename>progress.log</stdout_filename> <checkpoint_filename>restart.chk</checkpoint_filename> <fraction_done_filename>progress</fraction_done_filename> </task> <unzip_input> <zipfilename>windows_x86_64__cuda101.zip</zipfilename> </unzip_input> </job_desc> To be certain, you'd need to look at the job specification in client_state.xml, but I think I'd go with the newest. Note that you'd also need to have the matching versions of cudart and cufft for the app you end up using. | |
ID: 57885 | Rating: 0 | rate: / Reply Quote | |
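To see at a glance which executable and which zip pack a given job.xml.xxxxxx file refers to, the file can be parsed with Python's standard library. This is a minimal sketch: the XML is the second job description quoted in the post above, embedded as a string so the example is self-contained, and `describe_job` is a hypothetical helper rather than a BOINC tool.

```python
import xml.etree.ElementTree as ET

# Job description as quoted in the post above (the newer of the two files).
JOB_XML = """
<job_desc>
  <task>
    <application>bin/acemd3.exe</application>
    <command_line>--boinc --device $GPU_DEVICE_NUM</command_line>
    <stdout_filename>progress.log</stdout_filename>
    <checkpoint_filename>restart.chk</checkpoint_filename>
    <fraction_done_filename>progress</fraction_done_filename>
  </task>
  <unzip_input>
    <zipfilename>windows_x86_64__cuda101.zip</zipfilename>
  </unzip_input>
</job_desc>
"""


def describe_job(xml_text):
    """Extract the application path and the zip pack name from a job description."""
    root = ET.fromstring(xml_text)
    return {
        "application": root.findtext("task/application"),
        "zip": root.findtext("unzip_input/zipfilename"),
    }


print(describe_job(JOB_XML))
# {'application': 'bin/acemd3.exe', 'zip': 'windows_x86_64__cuda101.zip'}
```

To inspect a real file, read its contents from the GPUGrid project folder and pass the text to `describe_job`; the exact file name suffix differs per download.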
Have a look at the job.xml.xxxxxx file. ... My job.xml.xxxxxx files look exactly like yours. Also date-wise. To me, this shows that the new tasks no longer use the former <zipfilename>conda-pack.zip< but rather the new <zipfilename>windows_x86_64__cuda101.zip< And since no "...cuda1121.zip" was downloaded into the GPUGRID folder, I suppose that the new WUs are running cuda101 only. Which further means that these new WUs will not work with Ampere cards :-( Looks as simple as that, most sadly :-( Unless someone here can report about successful completion of the new WUs with an Ampere card. If possible, some kind of confirmation/statement/explanation or whatever from the team would also help a lot. | |
ID: 57887 | Rating: 0 | rate: / Reply Quote | |
Unless someone here can report about successful completion of the new WUs with an Ampere card. I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine. All tasks on all cards have worked, am going to try some slower cards given the tasks are smaller. I have never renamed any of the project files. | |
ID: 57888 | Rating: 0 | rate: / Reply Quote | |
I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine. thanks for the information, sounds interesting. Could you please let me/us know whether your www.gpugrid.net folder (in BOINC > projects) contains any conda-pack.zip files (if yes, which ones?), and whether besides the "windows_x86_64_cuda101.zip.c0d...b21" it contains such a file with "...cuda1121" (instead of cuda101). | |
ID: 57889 | Rating: 0 | rate: / Reply Quote | |
I have completed a Windows x64 cuda1121 task, and I have a windows_x86_64__cuda1121.zip file on that machine. | |
ID: 57890 | Rating: 0 | rate: / Reply Quote | |
I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine. The problem discussed most recently only affects Windows hosts. If your hosts are running any kind of Linux distribution, it is normal that they aren't affected. | |
ID: 57891 | Rating: 0 | rate: / Reply Quote | |
What's the point in keeping the CUDA10 app alive? The CUDA11 app works on older cards as well. | |
ID: 57892 | Rating: 0 | rate: / Reply Quote | |
What's the point in keeping the CUDA10 app alive? The CUDA11 app works on older cards as well. I agree, and I've said this a few times as well. There's no point in keeping the CUDA101 app when there's the 1121 app. | |
ID: 57893 | Rating: 0 | rate: / Reply Quote | |
I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine. Too bad that the user PDW has hidden his computers in his profile, so there's no way to know which OS is being used ... unless he tells us. What's the point in keeping the CUDA10 app alive? The CUDA11 app works on older cards as well. Good question. | |
ID: 57894 | Rating: 0 | rate: / Reply Quote | |
I'm also estimating that this batch is considerably slighter than precedent ones, and my GTX 1660 Ti will be hitting full bonus with its current task. ServiceEnginIC, I noticed that your task completed in under 64,000 sec. My 1660 Ti is on track to finish in just under 88,000 sec. I am wondering what could be causing such a big difference. My task is a CUDA101 one running under Win 7, and yours is a CUDA1121 one running under Linux. Is either of these the culprit? | |
ID: 57896 | Rating: 0 | rate: / Reply Quote | |
Working under Linux helps to squeeze maximum performance. | |
ID: 57898 | Rating: 0 | rate: / Reply Quote | |
I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine. You asked: Unless someone here can report about successful completion of the new WUs with an Ampere card. As I posted previously I am using linux. | |
ID: 57905 | Rating: 0 | rate: / Reply Quote | |
Unless someone here can report about successful completion of the new WUs with an Ampere card. As I posted previously I am using linux. oh okay, thanks for the information (which explains why it works well on your system). | |
ID: 57906 | Rating: 0 | rate: / Reply Quote | |
No, it does not explain it. I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine. I've tried to run a CUDA 101 task on my Ubuntu 18.04.6 host on an RTX 3080 Ti (driver: 495.44), and it failed after a few minutes. <core_client_version>7.16.17</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
14:33:16 (1675): wrapper (7.7.26016): starting
14:33:23 (1675): wrapper (7.7.26016): starting
14:33:23 (1675): wrapper: running bin/acemd3 (--boinc --device 0)
ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
14:35:30 (1675): bin/acemd3 exited; CPU time 127.166324
14:35:30 (1675): app exit status: 0x1
14:35:30 (1675): called boinc_finish(195)
</stderr_txt>
]]> | |
ID: 57911 | Rating: 0 | rate: / Reply Quote | |
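One plausible reading of the nvrtc error above is that the runtime compiler shipped with the app does not know the card's architecture. A minimal sketch of that check follows. The table holds only the two toolkit versions discussed in this thread, with limits taken from NVIDIA's CUDA release notes (CUDA 10.1 targets up to compute capability 7.5, CUDA 11.2 up to 8.6). The function name `nvrtc_can_target` is made up for illustration, and reading "cuda101" and "cuda1121" as CUDA 10.1 and CUDA 11.2 is an assumption based on the app names. It also does not account for the successful cuda101 runs on Ampere reported elsewhere in this thread.

```python
# Highest compute capability each toolkit's runtime compiler can target,
# per NVIDIA's CUDA release notes. Only the versions seen in this thread.
MAX_COMPUTE_CAPABILITY = {
    (10, 1): (7, 5),  # CUDA 10.1: up to Turing (e.g. GTX 1660 Ti)
    (11, 2): (8, 6),  # CUDA 11.2: includes Ampere (e.g. RTX 3080 / 3080 Ti)
}


def nvrtc_can_target(toolkit, gpu_cc):
    """True if the given toolkit version knows the GPU's compute capability.

    toolkit: (major, minor) CUDA version, e.g. (10, 1)
    gpu_cc:  (major, minor) compute capability, e.g. (8, 6)
    """
    return gpu_cc <= MAX_COMPUTE_CAPABILITY[toolkit]


# RTX 3080 Ti is compute capability 8.6.
print(nvrtc_can_target((10, 1), (8, 6)))
print(nvrtc_can_target((11, 2), (8, 6)))
```

Tuples compare element by element in Python, so (8, 6) <= (7, 5) is False, which matches the "invalid value for --gpu-architecture" failure when a 10.1 compiler is asked to build for an 8.6 card.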
Another example: | |
ID: 57918 | Rating: 0 | rate: / Reply Quote | |
after yesterday's snafu, I picked up two cuda101 tasks this morning on my Linux Ubuntu 20.04 3080Ti system. currently running ok. been running about 20 mins now, and is utilizing the GPU @99% so it's definitely working. I basically executed a project reset yesterday on this host, so I don't think my previous modifications to swap out the 101 app for 1121 carried over. | |
ID: 57959 | Rating: 0 | rate: / Reply Quote | |
Your tasks are failing with 'app exit status: 0xc0000135' - in all likelihood, you are missing a Microsoft runtime DLL file. Please refer to message 57353. | |
ID: 57966 | Rating: 0 | rate: / Reply Quote | |
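Since two different failure signatures have now shown up in this thread behind the same 195 (0xc3) EXIT_CHILD_FAILED result, a small Python sketch for telling them apart from the stderr text may help. The pattern strings are copied from the logs posted here; the function name `diagnose` and the wording of the labels are just illustrative. 0xc0000135 is the Windows STATUS_DLL_NOT_FOUND code, which fits the missing runtime DLL explanation.

```python
import re

# (pattern, explanation) pairs for the failure signatures seen in this thread.
KNOWN_SIGNATURES = [
    (
        re.compile(r"nvrtc: error: invalid value for --gpu-architecture"),
        "runtime compiler rejected the requested GPU architecture",
    ),
    (
        re.compile(r"app exit status: 0xc0000135", re.IGNORECASE),
        "Windows STATUS_DLL_NOT_FOUND: a required DLL is missing",
    ),
]


def diagnose(stderr_text):
    """Return the explanations whose pattern occurs in the stderr text."""
    return [label for pattern, label in KNOWN_SIGNATURES if pattern.search(stderr_text)]


log = (
    "ACEMD failed: Error compiling program: "
    "nvrtc: error: invalid value for --gpu-architecture (-arch)\n"
    "app exit status: 0x1\n"
    "called boinc_finish(195)"
)
print(diagnose(log))
```

An empty list just means the log matches neither signature, not that the task was fine.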
I missed out on all the new work because I had to get new master lists on all the hosts when their 24 hour timeouts finally expired. | |
ID: 57967 | Rating: 0 | rate: / Reply Quote | |
I missed out on all the new work because I had to get new master lists on all the hosts when their 24 hour timeouts finally expired. I think the 24 hour (master file fetch) backoff is set by the client, rather than the server - so it can be over-ridden by a manual update. That's unlike the 'daily quota exceeded' and the 'last request too recent' backoffs, which are enforced by the server and can't be bypassed. I might use one of these boring lockdown days to force a client into 'master file fetch' mode, so I can see how it's recorded in client_state.xml, and hence how to remove it again - whenever and wherever that knowledge might be useful in the future. | |
ID: 57969 | Rating: 0 | rate: / Reply Quote | |
Manual updates did nothing but keep resetting the 24 hour timer backoff. | |
ID: 57971 | Rating: 0 | rate: / Reply Quote | |
That's because another failure doesn't reset the failure count. We need to find out where that's stored, and reduce it to less than 10. | |
ID: 57973 | Rating: 0 | rate: / Reply Quote | |
after yesterday's snafu, I picked up two cuda101 tasks this morning on my Linux Ubuntu 20.04 3080Ti system. currently running ok. been running about 20 mins now, and is utilizing the GPU @99% so it's definitely working. I basically executed a project reset yesterday on this host, so I don't think my previous modifications to swap out the 101 app for 1121 carried over. That's easy to check: the CUDA1121 is 990MB while the CUDA101 is 491MB (503406KB). I think it's impossible to run the CUDA101 app on the RTX 3000 series, as that was the main reason a CUDA11 app was demanded not so long ago. | |
ID: 57975 | Rating: 0 | rate: / Reply Quote | |
after yesterday's snafu, I picked up two cuda101 tasks this morning on my Linux Ubuntu 20.04 3080Ti system. currently running ok. been running about 20 mins now, and is utilizing the GPU @99% so it's definitely working. I basically executed a project reset yesterday on this host, so I don't think my previous modifications to swap out the 101 app for 1121 carried over. That's easy to check: the CUDA1121 is 990MB while the CUDA101 is 491MB (503406KB). my gpugrid project folder contains two compressed files for acemd3: x86_64-pc-linux-gnu__cuda101.zip.<alphanumeric> (515.5 MB) and x86_64-pc-linux-gnu__cuda1121.zip.<alphanumeric> (1.0 GB). so it seems it did indeed use the cuda101 code on my 3080Ti and both tasks succeeded. https://www.gpugrid.net/result.php?resultid=32707549 https://www.gpugrid.net/result.php?resultid=32701203 since both apps use the same filename of just 'acemd3', it's possible some bug is causing the wrong (or is it right? lol) one to be used or something to that effect. | |
ID: 57976 | Rating: 0 | rate: / Reply Quote | |
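The size check discussed above can be scripted. This is a minimal Python sketch that lists the app archives in a project folder together with their sizes. The glob pattern `*__cuda*.zip*` is an assumption based on the file names quoted in this thread, the function name `list_app_archives` is made up for illustration, and you have to supply the path to your own www.gpugrid.net project folder.

```python
from pathlib import Path


def list_app_archives(project_dir):
    """Return (file name, size in MB) for every *__cuda*.zip* file in project_dir."""
    rows = []
    for path in sorted(Path(project_dir).glob("*__cuda*.zip*")):
        # Sizes are reported in decimal megabytes (1 MB = 1,000,000 bytes).
        rows.append((path.name, path.stat().st_size / 1e6))
    return rows


def format_rows(rows):
    """Format the rows as one 'size  name' line per archive."""
    return [f"{size_mb:8.1f} MB  {name}" for name, size_mb in rows]
```

Calling `format_rows(list_app_archives("/path/to/projects/www.gpugrid.net"))` and printing each line shows at a glance whether both the cuda101 and cuda1121 packages are present. Note that this only tells you which archives were downloaded, not which one a given task actually unpacked and ran.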
since both apps use the same filename of just 'acemd3', it's possible some bug is causing the wrong (or is it right? lol) one to be used or something to that effect. Don't forget it could be this... http://www.gpugrid.net/forum_thread.php?id=5256&nowrap=true#57473 Have completed 5 of the recent cuda101 tasks on Ampere hosts now, a sixth is running and a seventh lined up. Have seen no failures as yet. | |
ID: 57978 | Rating: 0 | rate: / Reply Quote | |
I guess that you still use the "special" BOINC manager (compiled for SETI@home), and that handles apps in a different way. That would explain this anomaly. since both apps use the same filename of just 'acemd3', it's possible some bug is causing the wrong (or is it right? lol) one to be used or something to that effect. | |
ID: 57985 | Rating: 0 | rate: / Reply Quote | |
I guess that you still use the "special" BOINC manager (compiled for SETI@home), and that handles apps in a different way. That would explain this anomaly. No. No modified manager or client here, just the bog standard BOINC 7.16.6 | |
ID: 57987 | Rating: 0 | rate: / Reply Quote | |
... just the bog standard BOINC 7.16.6 You are recommended to upgrade to v7.16.20 - it's pretty good code, and - importantly - it has updated SSL security certificates needed by some BOINC projects. (Edit - the above advice applies only to Windows machines. If you're running Linux, you can ignore it. Your computers are hidden, so I don't know which applies) | |
ID: 57988 | Rating: 0 | rate: / Reply Quote | |
a few hours ago, I had another task which failed after a few seconds with | |
ID: 58071 | Rating: 0 | rate: / Reply Quote | |