195 (0xc3) EXIT_CHILD_FAILED

Author	Message
Michael H.W. Weber Joined: 9 Feb 16 Posts: 71 Credit: 607,916,391 RAC: 0	Message 57480 - Posted: 5 Oct 2021 | 8:42:33 UTC - Last modified: 5 Oct 2021 | 8:55:54 UTC
	My RTX 3080 machine completed a first task successfully. Afterwards, two more tasks crashed after only a few seconds of run time with a 195 (0xc3) EXIT_CHILD_FAILED error message and the following logs:
	Name: e2s184_e1s254p0f959-ADRIA_AdB_KIXCMYB_HIP-0-2-RND9959_9
	Workunit: 27080023
	Created: 4 Oct 2021 | 9:59:05 UTC
	Sent: 4 Oct 2021 | 10:48:16 UTC
	Received: 4 Oct 2021 | 22:07:40 UTC
	Server state: Over
	Outcome: Computation error
	Client state: Compute error
	Exit status: 195 (0xc3) EXIT_CHILD_FAILED
	Computer ID: 584499
	Report deadline: 9 Oct 2021 | 10:48:16 UTC
	Run time: 25.51
	CPU time: 0.00
	Validate state: Invalid
	Credit: 0.00
	Application version: New version of ACEMD v2.18 (cuda101)
	Stderr output:
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message> (unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
00:05:49 (30732): wrapper (7.9.26016): starting
00:05:49 (30732): wrapper: running bin/acemd3.exe (--boinc --device 0)
ACEMD failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
00:05:59 (30732): bin/acemd3.exe exited; CPU time 0.000000
00:05:59 (30732): app exit status: 0x1
00:05:59 (30732): called boinc_finish(195)
0 bytes in 0 Free Blocks. 186 bytes in 4 Normal Blocks. 1144 bytes in 1 CRT Blocks. 0 bytes in 0 Ignore Blocks. 0 bytes in 0 Client Blocks. Largest number used: 0 bytes. Total allocations: 239849 bytes. Dumping objects -> {323256} normal block at 0x000001B7D23D3BC0, 85 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 ..\api\boinc_api.cpp(309) : {323253} normal block at 0x000001B7D23D4940, 8 bytes long. Data: < 1Ò· > 00 00 31 D2 B7 01 00 00 {322608} normal block at 0x000001B7D23D3C60, 85 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 {321994} normal block at 0x000001B7D23D46C0, 8 bytes long. 
Data: <@Ê?Ò· > 40 CA 3F D2 B7 01 00 00 ..\zip\boinc_zip.cpp(122) : {146} normal block at 0x000001B7D23D3090, 260 bytes long. Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 {133} normal block at 0x000001B7D23D48A0, 16 bytes long. Data: < ø<Ò· > 10 F8 3C D2 B7 01 00 00 00 00 00 00 00 00 00 00 {132} normal block at 0x000001B7D23CF810, 40 bytes long. Data: < H=Ò· conda-pa> A0 48 3D D2 B7 01 00 00 63 6F 6E 64 61 2D 70 61 {125} normal block at 0x000001B7D23CF340, 48 bytes long. Data: <--boinc --device> 2D 2D 62 6F 69 6E 63 20 2D 2D 64 65 76 69 63 65 {124} normal block at 0x000001B7D23D49E0, 16 bytes long. Data: <XN=Ò· > 58 4E 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {123} normal block at 0x000001B7D23D4C60, 16 bytes long. Data: <0N=Ò· > 30 4E 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {122} normal block at 0x000001B7D23D4850, 16 bytes long. Data: < N=Ò· > 08 4E 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {121} normal block at 0x000001B7D23D3DB0, 16 bytes long. Data: <àM=Ò· > E0 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {120} normal block at 0x000001B7D23D4030, 16 bytes long. Data: <¸M=Ò· > B8 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {119} normal block at 0x000001B7D23D4080, 16 bytes long. Data: < M=Ò· > 90 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {118} normal block at 0x000001B7D23D4120, 16 bytes long. Data: <pM=Ò· > 70 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {117} normal block at 0x000001B7D23D4990, 16 bytes long. Data: <HM=Ò· > 48 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {116} normal block at 0x000001B7D23D42B0, 16 bytes long. Data: < M=Ò· > 20 4D 3D D2 B7 01 00 00 00 00 00 00 00 00 00 00 {115} normal block at 0x000001B7D23D4D20, 496 bytes long. Data: <°B=Ò· bin/acem> B0 42 3D D2 B7 01 00 00 62 69 6E 2F 61 63 65 6D {65} normal block at 0x000001B7D23C2D80, 16 bytes long. Data: < ê{÷ > 80 EA 97 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {64} normal block at 0x000001B7D23C2B50, 16 bytes long. 
Data: <@é{÷ > 40 E9 97 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {63} normal block at 0x000001B7D23C2B00, 16 bytes long. Data: <øW{÷ > F8 57 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {62} normal block at 0x000001B7D23C2AB0, 16 bytes long. Data: <ØW{÷ > D8 57 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {61} normal block at 0x000001B7D23C3370, 16 bytes long. Data: <P {÷ > 50 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {60} normal block at 0x000001B7D23C2A60, 16 bytes long. Data: <0 {÷ > 30 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {59} normal block at 0x000001B7D23C3500, 16 bytes long. Data: <à {÷ > E0 02 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {58} normal block at 0x000001B7D23C3640, 16 bytes long. Data: < {÷ > 10 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {57} normal block at 0x000001B7D23C2A10, 16 bytes long. Data: <p {÷ > 70 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {56} normal block at 0x000001B7D23C3870, 16 bytes long. Data: < À{÷ > 18 C0 92 7B F7 7F 00 00 00 00 00 00 00 00 00 00 Object dump complete. 
</stderr_txt>
]]>
	Name: e4s109_e1s39p0f745-ADRIA_AdB_KIXCMYB_HIP-1-2-RND2493_0
	Workunit: 27081645
	Created: 4 Oct 2021 | 22:12:32 UTC
	Sent: 4 Oct 2021 | 22:14:12 UTC
	Received: 4 Oct 2021 | 22:16:12 UTC
	Server state: Over
	Outcome: Computation error
	Client state: Compute error
	Exit status: 195 (0xc3) EXIT_CHILD_FAILED
	Computer ID: 584499
	Report deadline: 9 Oct 2021 | 22:14:12 UTC
	Run time: 7.26
	CPU time: 0.00
	Validate state: Invalid
	Credit: 0.00
	Application version: New version of ACEMD v2.18 (cuda101)
	Stderr output:
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message> (unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
00:14:24 (14320): wrapper (7.9.26016): starting
00:14:24 (14320): wrapper: running bin/acemd3.exe (--boinc --device 0)
ACEMD failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
00:14:26 (14320): bin/acemd3.exe exited; CPU time 0.000000
00:14:26 (14320): app exit status: 0x1
00:14:26 (14320): called boinc_finish(195)
0 bytes in 0 Free Blocks. 186 bytes in 4 Normal Blocks. 1144 bytes in 1 CRT Blocks. 0 bytes in 0 Ignore Blocks. 0 bytes in 0 Client Blocks. Largest number used: 0 bytes. Total allocations: 241603 bytes. Dumping objects -> {323256} normal block at 0x000002061D1C3BC0, 85 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 ..\api\boinc_api.cpp(309) : {323253} normal block at 0x000002061D1C43F0, 8 bytes long. Data: < > 00 00 02 1D 06 02 00 00 {322608} normal block at 0x000002061D1C3C60, 85 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 {321994} normal block at 0x000002061D1C42B0, 8 bytes long. Data: <@Ê > 40 CA 1E 1D 06 02 00 00 ..\zip\boinc_zip.cpp(122) : {146} normal block at 0x000002061D1C3090, 260 bytes long. Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 {133} normal block at 0x000002061D1C3EF0, 16 bytes long. 
Data: <Ðò > D0 F2 1B 1D 06 02 00 00 00 00 00 00 00 00 00 00 {132} normal block at 0x000002061D1BF2D0, 40 bytes long. Data: <ð> conda-pa> F0 3E 1C 1D 06 02 00 00 63 6F 6E 64 61 2D 70 61 {125} normal block at 0x000002061D1BF180, 48 bytes long. Data: <--boinc --device> 2D 2D 62 6F 69 6E 63 20 2D 2D 64 65 76 69 63 65 {124} normal block at 0x000002061D1C4940, 16 bytes long. Data: <XN > 58 4E 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {123} normal block at 0x000002061D1C4490, 16 bytes long. Data: <0N > 30 4E 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {122} normal block at 0x000002061D1C4800, 16 bytes long. Data: < N > 08 4E 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {121} normal block at 0x000002061D1C47B0, 16 bytes long. Data: <àM > E0 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {120} normal block at 0x000002061D1C48A0, 16 bytes long. Data: <¸M > B8 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {119} normal block at 0x000002061D1C4710, 16 bytes long. Data: < M > 90 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {118} normal block at 0x000002061D1C48F0, 16 bytes long. Data: <pM > 70 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {117} normal block at 0x000002061D1C4990, 16 bytes long. Data: <HM > 48 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {116} normal block at 0x000002061D1C4A80, 16 bytes long. Data: < M > 20 4D 1C 1D 06 02 00 00 00 00 00 00 00 00 00 00 {115} normal block at 0x000002061D1C4D20, 496 bytes long. Data: < J bin/acem> 80 4A 1C 1D 06 02 00 00 62 69 6E 2F 61 63 65 6D {65} normal block at 0x000002061D1B36E0, 16 bytes long. Data: < ê{÷ > 80 EA 97 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {64} normal block at 0x000002061D1B3410, 16 bytes long. Data: <@é{÷ > 40 E9 97 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {63} normal block at 0x000002061D1B3820, 16 bytes long. Data: <øW{÷ > F8 57 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {62} normal block at 0x000002061D1B33C0, 16 bytes long. 
Data: <ØW{÷ > D8 57 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {61} normal block at 0x000002061D1B3190, 16 bytes long. Data: <P {÷ > 50 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {60} normal block at 0x000002061D1B3000, 16 bytes long. Data: <0 {÷ > 30 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {59} normal block at 0x000002061D1B2FB0, 16 bytes long. Data: <à {÷ > E0 02 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {58} normal block at 0x000002061D1B3320, 16 bytes long. Data: < {÷ > 10 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {57} normal block at 0x000002061D1B2F60, 16 bytes long. Data: <p {÷ > 70 04 94 7B F7 7F 00 00 00 00 00 00 00 00 00 00 {56} normal block at 0x000002061D1B3140, 16 bytes long. Data: < À{÷ > 18 C0 92 7B F7 7F 00 00 00 00 00 00 00 00 00 00 Object dump complete.
</stderr_txt>
]]>
	Any idea what is going on? What is very annoying is that after these two consecutive crashes, it took the GPUGRID server 4 hours to send out a new task (which is now in progress), leaving my machine idling uselessly for hours.
	Michael.
	____________
	President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,316,430 RAC: 20,000,148	Message 57481 - Posted: 5 Oct 2021 | 8:59:03 UTC - in response to Message 57480.
	Your computers are hidden, so I can't be certain, but your problem seems to be:
	Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
	There are two versions of the new GPUGrid application: cuda1121 and cuda101. You will be able to see in your task list which worked, and which didn't work.
	Despite some posts to the contrary, the general consensus is that cuda1121 works on an RTX 3080, and cuda101 doesn't. And despite an assurance from the project that they have prevented the cuda101 application being sent to RTX cards, clearly they haven't.
	There's nothing you can do to prevent the wrong application being sent to your card: just take comfort from the fact that cuda101 tasks will fail very quickly, and you won't waste computing power on the tasks. The only hit you and the project are taking is the waste of bandwidth downloading the inappropriate tasks.
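The nvrtc message is what one would expect when a runtime compiler from an older toolkit is asked to target a newer GPU. Here is a minimal sketch of that mismatch, assuming NVIDIA's published numbers (CUDA 10.1 targets up to compute capability 7.5, CUDA 11.2 up to 8.6, and the RTX 30-series desktop cards are 8.6). The tables and the function name `can_compile_for` are illustrative only, not anything taken from the GPUGRID scheduler.

```python
# Highest compute capability each toolkit's compiler can target.
# Assumption: values from NVIDIA's CUDA release notes, not from the project.
MAX_COMPUTE_CAPABILITY = {
    "cuda101": (7, 5),   # CUDA 10.1: up to Turing (sm_75)
    "cuda1121": (8, 6),  # CUDA 11.2: up to Ampere GA10x (sm_86)
}

# Illustrative subset of the cards mentioned in this thread.
GPU_COMPUTE_CAPABILITY = {
    "GTX 1080": (6, 1),
    "RTX 2080": (7, 5),
    "RTX 3070": (8, 6),
    "RTX 3080": (8, 6),
}


def can_compile_for(plan_class: str, gpu: str) -> bool:
    """Return True if the toolkit behind plan_class knows the GPU's architecture."""
    return GPU_COMPUTE_CAPABILITY[gpu] <= MAX_COMPUTE_CAPABILITY[plan_class]


if __name__ == "__main__":
    for gpu in GPU_COMPUTE_CAPABILITY:
        for plan_class in MAX_COMPUTE_CAPABILITY:
            print(gpu, plan_class, can_compile_for(plan_class, gpu))
```

Under these assumptions an RTX 3080 is rejected for cuda101 and accepted for cuda1121, which matches what the task lists in this thread show.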

Michael H.W. Weber Joined: 9 Feb 16 Posts: 71 Credit: 607,916,391 RAC: 0	Message 57482 - Posted: 5 Oct 2021 | 9:13:09 UTC
	Thank you Richard - only cuda1121 works for me.
	Michael.
	____________
	President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Erich56 Joined: 1 Jan 15 Posts: 1132 Credit: 10,216,282,676 RAC: 29,658,016	Message 57484 - Posted: 5 Oct 2021 | 11:43:49 UTC - in response to Message 57481.
	"And despite an assurance from the project that they have prevented the cuda101 application being sent to RTX cards, clearly they haven't."
	I endorse this statement. I have been sent cuda101 tasks for my two RTX 3070s, the latest one this morning.
	"There's nothing you can do to prevent the wrong application being sent to your card: just take comfort from the fact that cuda101 tasks will fail very quickly, and you won't waste computing power on the tasks. The only hit you and the project are taking is the waste of bandwidth downloading the inappropriate tasks."
	However, there is more to it: if one deletes erroneously downloaded cuda101 tasks from the BOINC task list too often, one may not receive any more tasks for the next 24 hours.
	Hence, this problem should be solved by the project team ASAP!

bozz4science Joined: 22 May 20 Posts: 109 Credit: 102,786,176 RAC: 103,187	Message 57486 - Posted: 5 Oct 2021 | 13:02:45 UTC
	I also don't quite understand what information determines which app version is sent out. This task, for example, had been sent out 6 times before my host caught it. Once it was the 1121 app version; all the others were sent out as the 101 app version. It failed on all previous hosts and went through 3 Ampere cards (3060 Ti, 3070 and 3090).
	This seems to be quite an annoyance for anyone with the latest cards. And older cards take some serious chewing on the new tasks. Mine takes a little over 31 hrs. This project could be working much more efficiently if it were able to fully capture the potential of these RTX 3000 series cards.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,316,430 RAC: 20,000,148	Message 57487 - Posted: 5 Oct 2021 | 13:30:47 UTC - in response to Message 57486.
	But there are two different failure modes - three of each: three missing DLLs (probably vcruntime140_1), and three wrong architecture (cuda101 on RTX).
	You need all three to align - right version, on right architecture, with right software support - before it'll run. One out of eight is about the right probability.
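The "one out of eight" figure follows if each of the three conditions is treated as an independent coin flip, which is a simplifying assumption made here for illustration and not a measured rate. A small sketch that enumerates the cases:

```python
from itertools import product

# The three conditions named in the post; each is modelled as either met or not.
CONDITIONS = ("right version", "right architecture", "right software support")

# Every combination of met / not met: 2 ** 3 = 8 cases.
outcomes = list(product((True, False), repeat=len(CONDITIONS)))

# A task only runs when all three conditions are met at once.
working = [case for case in outcomes if all(case)]

# Assumption: all eight cases are equally likely.
probability = len(working) / len(outcomes)

if __name__ == "__main__":
    print(len(working), "of", len(outcomes), "cases run:", probability)
```

Only the single all-true case runs, so the estimate is 1/8 under that equal-likelihood assumption.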

bozz4science Joined: 22 May 20 Posts: 109 Credit: 102,786,176 RAC: 103,187	Message 57488 - Posted: 5 Oct 2021 | 13:36:57 UTC
	That sounds about right. I only meant to highlight the Ampere cards, which all obviously failed because the wrong version was sent to these hosts, while older-generation cards somehow got the 1121 app version instead on some occasions.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527	Message 57489 - Posted: 5 Oct 2021 | 13:41:17 UTC
	All the more reason to just retire the cuda101 app, and force everyone to update their drivers to use the cuda1121 app.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,316,430 RAC: 20,000,148	Message 57490 - Posted: 5 Oct 2021 | 13:45:41 UTC - in response to Message 57489.
	"All the more reason to just retire the cuda101 app, and force everyone to update their drivers to use the cuda1121 app"
	I disagree. People should be allowed their own choice of driver (you don't know why they've kept an older one), but the project should manage the minimum limits better.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527	Message 57491 - Posted: 5 Oct 2021 | 13:56:55 UTC - in response to Message 57490. Last modified: 5 Oct 2021 | 14:32:33 UTC
	IMO, the "choice" of driver in the ranges of CUDA101 and CUDA1121 compatibility will be arbitrary. The list of supported products is exactly the same, so it's not like some older GPU won't be supported anymore with the newer driver. Nvidia drivers are very stable and it's pretty rare that a new driver fully breaks something.
	CUDA101 requires driver 418.xx and CUDA1121 requires driver 461.xx; there's not a huge difference here. But even so, there's a large range of "choice" between the minimum driver required for cuda1121 and the current latest driver release. They don't need to be bleeding edge. CUDA 11.2 was introduced almost a year ago, and it's currently up to CUDA 11.4 Update 2.
	If you have some software issue that actively prevents installing a new driver, then fix your software issues. There's really no good reason not to update if you're already on hardware and drivers new enough to support the CUDA101 app. It's not a huge change to get more recent drivers, and the observed negative impacts from the project maintaining two app versions far outweigh the impact of requiring a user to update their drivers.
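The driver floors quoted above (418.xx for CUDA101, 461.xx for CUDA1121) can be turned into a simple eligibility check. This is a sketch only: the elided ".xx" is treated as ".0" purely for illustration, and `parse_driver` and `usable_plan_classes` are hypothetical helper names, not part of BOINC or the project's scheduler. It also checks the driver alone and says nothing about whether the GPU architecture suits the app.

```python
from typing import List, Tuple

# Minimum driver per plan class, as quoted in the post above.
# Assumption: the elided ".xx" is treated as ".0" for illustration.
MIN_DRIVER = {
    "cuda101": (418, 0),
    "cuda1121": (461, 0),
}


def parse_driver(version: str) -> Tuple[int, int]:
    """Split a driver string such as '441.20' into (441, 20)."""
    major, _, minor = version.partition(".")
    return int(major), int(minor or 0)


def usable_plan_classes(driver: str) -> List[str]:
    """List the plan classes whose minimum driver is satisfied."""
    installed = parse_driver(driver)
    return [name for name, minimum in MIN_DRIVER.items() if installed >= minimum]


if __name__ == "__main__":
    for driver in ("431.86", "441.20", "471.11"):
        print(driver, usable_plan_classes(driver))
```

With these thresholds, 431.86 and 441.20 qualify only for cuda101, while 471.11 qualifies for both.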

Richard Haselgrove Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,316,430 RAC: 20,000,148	Message 57492 - Posted: 5 Oct 2021 | 13:59:51 UTC - in response to Message 57491.
	"it's not a huge change to get more recent drivers, and the observed negative impacts from the project maintaining two app versions far outweigh the impact of requiring a user to update their drivers."
	It can be if your computer is managed by your employer's domain controller, and group policy prevents you updating it yourself. Just an example.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527	Message 57493 - Posted: 5 Oct 2021 | 14:04:46 UTC - in response to Message 57492.
	"it's not a huge change to get more recent drivers, and the observed negative impacts from the project maintaining two app versions far outweigh the impact of requiring a user to update their drivers."
	"It can be if your computer is managed by your employer's domain controller, and group policy prevents you updating it yourself. Just an example."
	In this case, it's MORE likely that these systems will (should) be kept on recent drivers, as any competent SA will (should) be keeping everything on the up and up in terms of security patches, and there has been a stronger push from Nvidia in this regard lately.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,316,430 RAC: 20,000,148	Message 57494 - Posted: 5 Oct 2021 | 14:15:13 UTC
	I think we're far enough off topic. Let's leave it there.

Michael H.W. Weber Joined: 9 Feb 16 Posts: 71 Credit: 607,916,391 RAC: 0	Message 57544 - Posted: 8 Oct 2021 | 12:13:48 UTC
	Well, this project's inability to deliver the proper GPU app/plan class to the corresponding GPU systems simply results in a massive loss of overall project performance: because of the repeated "compute errors", the clients do not receive further tasks for a while and sit idle for hours. I figure that this way I can deliver only one task per day instead of two.
	Well, not my problem. A second project is occupying the idle time now.
	Michael.
	____________
	President of Rechenkraft.net - Germany's first and largest distributed computing organization.

bcavnaugh Joined: 8 Nov 13 Posts: 56 Credit: 1,002,640,163 RAC: 0	Message 57572 - Posted: 11 Oct 2021 | 2:26:22 UTC - Last modified: 11 Oct 2021 | 2:59:44 UTC
	Glad at least one of my hosts is running, but all the others are NOT!
	2080 (441.20): running
	1080 (431.86): not running
	2080 (431.86): also not running
	What NVIDIA driver must we have to run GPUGRID?
	As you can see, even with the new or current runtimes it still fails: 14.29.30135.0 (current, VS 2022); the version with the tasks is 14.28.29325.2.
	https://live.staticflickr.com/65535/51574059037_5ae789d24d_b.jpg
	Update: For me, driver 441.20 seems to work on all my hosts. Yahoo!

jjch Joined: 10 Nov 13 Posts: 101 Credit: 15,569,300,388 RAC: 3,786,488	Message 57575 - Posted: 11 Oct 2021 | 4:26:33 UTC
	Nvidia driver version 441.20 is a bit old. I am currently running version 471.11 on my Windows hosts.

Michael H.W. Weber Joined: 9 Feb 16 Posts: 71 Credit: 607,916,391 RAC: 0	Message 57598 - Posted: 13 Oct 2021 | 10:29:25 UTC - in response to Message 57481. Last modified: 13 Oct 2021 | 10:31:09 UTC
	"Despite some posts to the contrary, the general consensus is that cuda1121 works on an RTX 3080, and cuda101 doesn't. And despite an assurance from the project that they have prevented the cuda101 application being sent to RTX cards, clearly they haven't."
	I second that.
	"There's nothing you can do to prevent the wrong application being sent to your card: just take comfort from the fact that cuda101 tasks will fail very quickly, and you won't waste computing power on the tasks."
	Unfortunately, that is exactly NOT the case. Here is an example of a task which ran for almost 15 hrs before failing with an error:
	Task: https://www.gpugrid.net/result.php?resultid=32653715
	Name: e7s106_e5s196p1f1036-ADRIA_AdB_KIXCMYB_HIP-1-2-RND0214_4
	Workunit: 27082868
	Created: 11 Oct 2021 | 6:23:21 UTC
	Sent: 11 Oct 2021 | 6:24:56 UTC
	Received: 12 Oct 2021 | 9:32:05 UTC
	Server state: Over
	Outcome: Computation error
	Client state: Compute error
	Exit status: 195 (0xc3) EXIT_CHILD_FAILED
	Computer ID: 588794
	Report deadline: 16 Oct 2021 | 6:24:56 UTC
	Run time: 53,608.48
	CPU time: 53,473.36
	Validate state: Invalid
	Credit: 0.00
	Application version: New version of ACEMD v2.18 (cuda101)
	Stderr output:
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message> (unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
08:25:11 (11620): wrapper (7.9.26016): starting
08:25:11 (11620): wrapper: running bin/acemd3.exe (--boinc --device 0)
Detected memory leaks! Dumping objects -> ..\api\boinc_api.cpp(309) : {323250} normal block at 0x000002C574996E70, 8 bytes long. Data: < $v > 00 00 24 76 C5 02 00 00 ..\lib\diagnostics_win.cpp(417) : {321999} normal block at 0x000002C576431310, 1080 bytes long. Data: < > FC 08 00 00 CD CD CD CD 0C 01 00 00 00 00 00 00 ..\zip\boinc_zip.cpp(122) : {149} normal block at 0x000002C57499EBA0, 260 bytes long. Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Object dump complete. 
10:58:41 (4808): wrapper (7.9.26016): starting
10:58:41 (4808): wrapper: running bin/acemd3.exe (--boinc --device 0)
Detected memory leaks! Dumping objects -> ..\api\boinc_api.cpp(309) : {323286} normal block at 0x000001DA265261F0, 8 bytes long. Data: < J& > 00 00 4A 26 DA 01 00 00 ..\lib\diagnostics_win.cpp(417) : {322035} normal block at 0x000001DA26591B80, 1080 bytes long. Data: <h > 68 1A 00 00 CD CD CD CD 20 01 00 00 00 00 00 00 ..\zip\boinc_zip.cpp(122) : {149} normal block at 0x000001DA2652EB00, 260 bytes long. Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Object dump complete.
11:30:47 (13592): wrapper (7.9.26016): starting
11:30:47 (13592): wrapper: running bin/acemd3.exe (--boinc --device 0)
ACEMD failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
11:30:49 (13592): bin/acemd3.exe exited; CPU time 0.000000
11:30:49 (13592): app exit status: 0x1
11:30:49 (13592): called boinc_finish(195)
0 bytes in 0 Free Blocks. 298 bytes in 4 Normal Blocks. 1144 bytes in 1 CRT Blocks. 0 bytes in 0 Ignore Blocks. 0 bytes in 0 Client Blocks. Largest number used: 0 bytes. Total allocations: 130740 bytes. Dumping objects -> {323289} normal block at 0x0000014269701A70, 141 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 ..\api\boinc_api.cpp(309) : {323286} normal block at 0x00000142696C62F0, 8 bytes long. Data: < eiB > 00 00 65 69 42 01 00 00 {322649} normal block at 0x00000142697020F0, 141 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 {322036} normal block at 0x00000142696C6890, 8 bytes long. Data: <p siB > 70 1B 73 69 42 01 00 00 ..\zip\boinc_zip.cpp(122) : {149} normal block at 0x00000142696CE940, 260 bytes long. Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 {136} normal block at 0x00000142696C7060, 16 bytes long. Data: <p«liB > 70 AB 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {135} normal block at 0x00000142696CAB70, 40 bytes long. 
Data: <`pliB conda-pa> 60 70 6C 69 42 01 00 00 63 6F 6E 64 61 2D 70 61 {128} normal block at 0x00000142696CAB00, 48 bytes long. Data: <--boinc --device> 2D 2D 62 6F 69 6E 63 20 2D 2D 64 65 76 69 63 65 {127} normal block at 0x00000142696C6FC0, 16 bytes long. Data: <8øliB > 38 F8 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {126} normal block at 0x00000142696C6AC0, 16 bytes long. Data: < øliB > 10 F8 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {125} normal block at 0x00000142696C6A70, 16 bytes long. Data: <è÷liB > E8 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {124} normal block at 0x00000142696C6A20, 16 bytes long. Data: <À÷liB > C0 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {123} normal block at 0x00000142696C6C00, 16 bytes long. Data: < ÷liB > 98 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {122} normal block at 0x00000142696C6980, 16 bytes long. Data: <p÷liB > 70 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {121} normal block at 0x00000142696C70B0, 16 bytes long. Data: <P÷liB > 50 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {120} normal block at 0x00000142696C6930, 16 bytes long. Data: <(÷liB > 28 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {119} normal block at 0x00000142696C6570, 16 bytes long. Data: < ÷liB > 00 F7 6C 69 42 01 00 00 00 00 00 00 00 00 00 00 {118} normal block at 0x00000142696CF700, 496 bytes long. Data: <peliB bin/acem> 70 65 6C 69 42 01 00 00 62 69 6E 2F 61 63 65 6D {68} normal block at 0x00000142696C62A0, 16 bytes long. Data: < ê¼ ö > 80 EA BC 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {67} normal block at 0x00000142696C6CF0, 16 bytes long. Data: <@é¼ ö > 40 E9 BC 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {66} normal block at 0x00000142696C6480, 16 bytes long. Data: <øW¹ ö > F8 57 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {65} normal block at 0x00000142696C6520, 16 bytes long. Data: <ØW¹ ö > D8 57 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {64} normal block at 0x00000142696C6840, 16 bytes long. 
Data: <P ¹ ö > 50 04 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {63} normal block at 0x00000142696C6B60, 16 bytes long. Data: <0 ¹ ö > 30 04 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {62} normal block at 0x00000142696C6390, 16 bytes long. Data: <à ¹ ö > E0 02 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {61} normal block at 0x00000142696C6250, 16 bytes long. Data: < ¹ ö > 10 04 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {60} normal block at 0x00000142696C66B0, 16 bytes long. Data: <p ¹ ö > 70 04 B9 1A F6 7F 00 00 00 00 00 00 00 00 00 00 {59} normal block at 0x00000142696C67F0, 16 bytes long. Data: < À· ö > 18 C0 B7 1A F6 7F 00 00 00 00 00 00 00 00 00 00 Object dump complete.
</stderr_txt>
]]>
	Currently, I have another cuda101 task (wrongly selected by the server for this client) which has now been running for several hours.
	Michael.
	____________
	President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Michael H.W. Weber Joined: 9 Feb 16 Posts: 71 Credit: 607,916,391 RAC: 0	Message 57605 - Posted: 14 Oct 2021 | 12:35:20 UTC - in response to Message 57598.
	"Currently, I have another one with cuda101 (falsely selected by the server for this client) which is now running for several hours."
	...it actually caused my machine to crash, and it started again after the reboot. So I aborted it.
	Michael.
	____________
	President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Erich56 Joined: 1 Jan 15 Posts: 1132 Credit: 10,216,282,676 RAC: 29,658,016	Message 57642 - Posted: 24 Oct 2021 | 15:16:50 UTC
	I had a "195 (0xc3) EXIT_CHILD_FAILED" case this afternoon, a few seconds after start:
	https://www.gpugrid.net/result.php?resultid=32657585
	Does anyone have any idea what the reason might have been?

Richard Haselgrove Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,316,430 RAC: 20,000,148	Message 57643 - Posted: 24 Oct 2021 | 15:52:09 UTC - in response to Message 57642.
	Look at your own link:
	EXCEPTIONAL CONDITION: src\mdio\bincoord.c, line 193: "nelems != 1"
	Faulty data - a bad task. Not your fault.

ServicEnginIC Joined: 24 Sep 10 Posts: 581 Credit: 9,775,312,024 RAC: 21,420,388	Message 57644 - Posted: 24 Oct 2021 | 16:25:46 UTC - in response to Message 57642.
	"Faulty data - a bad task. Not your fault."
	+1
	Windows Operating System:
	EXCEPTIONAL CONDITION: src\mdio\bincoord.c, line 193: "nelems != 1"
	Linux Operating System:
	EXCEPTIONAL CONDITION: /home/user/conda/conda-bld/acemd3_1618916459379/work/src/mdio/bincoord.c, line 193: "nelems != 1"
	When this warning appears, it usually implies that there is some definition error in the task's initial parameters. The task was badly constructed at origin, and it will fail on every host that receives it.
	Looking at Work Unit #27084895, to which this task belongs, it has previously failed on several other hosts, under both Windows and Linux. The fate of a Work Unit like this is to be retired once the maximum allowed number of failed tasks (7) is reached...
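Two distinct stderr messages have come up in this thread so far, and they point to different causes. As a rough sketch, a host owner could triage a failed task's stderr text like this; the needle strings are copied from the logs quoted above, while the labels and the function name `classify_failure` are this sketch's own wording and not anything produced by BOINC or ACEMD.

```python
# (substring to look for, explanation taken from the discussion above)
KNOWN_FAILURES = [
    (
        "invalid value for --gpu-architecture",
        "app version does not support this GPU (e.g. cuda101 on an RTX 30-series card)",
    ),
    (
        '"nelems != 1"',
        "faulty input data: a bad task that will fail on every host",
    ),
]


def classify_failure(stderr_text: str) -> str:
    """Return a short explanation for the first known message found in stderr."""
    for needle, explanation in KNOWN_FAILURES:
        if needle in stderr_text:
            return explanation
    return "unknown failure: read the full stderr output"


if __name__ == "__main__":
    sample = 'EXCEPTIONAL CONDITION: src\\mdio\\bincoord.c, line 193: "nelems != 1"'
    print(classify_failure(sample))
```

Both cases end with the same 195 (0xc3) EXIT_CHILD_FAILED status, so the stderr text is the only place where they can be told apart.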

Greger Joined: 6 Jan 15 Posts: 76 Credit: 24,192,102,249 RAC: 13,992,829	Message 57865 - Posted: 24 Nov 2021 | 17:42:50 UTC
	A new task from batch e1s34_0-ADRIA_GPCR2021_APJ_b0-0-1-RND2256 appears to break at start with Not A Number for a coordinate:
	ACEMD failed: Particle coordinate is nan
	Error code: process exited with code 195 (0xc3, -61)
	WU: https://www.gpugrid.net/workunit.php?wuid=27087091
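The three numbers in that error line are the same value written three ways: 195 in decimal, 0xc3 in hexadecimal, and -61 when the same byte is read as a signed 8-bit integer. A minimal sketch of the arithmetic follows; the function name `exit_code_views` is hypothetical, and whether the client derives -61 in exactly this way is an assumption, although the numbers agree.

```python
from typing import Tuple


def exit_code_views(code: int) -> Tuple[int, str, int]:
    """Return (unsigned value, hex string, signed 8-bit value) for an exit code."""
    unsigned = code & 0xFF  # keep only the low byte
    # Two's complement: byte values of 128 and above are negative when signed.
    signed = unsigned - 256 if unsigned >= 128 else unsigned
    return unsigned, hex(unsigned), signed


if __name__ == "__main__":
    print(exit_code_views(195))
```

For 195 this gives 0xc3 and 195 - 256 = -61, matching the three values in the log line.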

Richard Haselgrove Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,316,430 RAC: 20,000,148	Message 57866 - Posted: 24 Nov 2021 | 17:59:04 UTC - Last modified: 24 Nov 2021 | 18:38:56 UTC
	Same here with two different Linux machines:
	e1s25_0-ADRIA_GPCR2021_APJ_b0-0-1-RND8388_5
	e1s174_0-ADRIA_GPCR2021_APJ_b0-0-1-RND2767_0
	The first task has failed for multiple users. I was the first to attempt the second task, but it looks to have gone the same way.
	And the same error under Windows:
	e1s401_0-ADRIA_GPCR2021_APJ_b0-0-1-RND6370_3

curiously_indifferent Joined: 20 Nov 17 Posts: 21 Credit: 1,500,833,674 RAC: 3,717,133	Message 57867 - Posted: 24 Nov 2021 | 18:16:08 UTC
	Same here - RTX2060 card. Fails after 10 seconds.
	https://www.gpugrid.net/result.php?resultid=32662557
	https://www.gpugrid.net/result.php?resultid=32663017
	https://www.gpugrid.net/result.php?resultid=32663196
	https://www.gpugrid.net/result.php?resultid=32663734

ZUSE Joined: 10 Jun 20 Posts: 7 Credit: 885,090,897 RAC: 2,843,455	Message 57868 - Posted: 24 Nov 2021 | 18:45:20 UTC
	I have the same problem with 4x Nvidia T600.

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,656,423,724 RAC: 13,411,734 Level Scientific publications	Message 57869 - Posted: 24 Nov 2021 \| 20:10:02 UTC - in response to Message 57868.
	Looks like we all have the same issue with NaN. I've burned through a couple dozen tasks today, all of it wasted download cap.

bozz4science Send message Joined: 22 May 20 Posts: 109 Credit: 102,786,176 RAC: 103,187 Level Scientific publications	Message 57870 - Posted: 24 Nov 2021 \| 23:43:48 UTC
	Same on my end. Had almost 80 tasks on a single machine today that all failed with said NaN error. Why were so many faulty tasks sent out in the first place?

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,656,423,724 RAC: 13,411,734 Level Scientific publications	Message 57871 - Posted: 25 Nov 2021 \| 2:00:29 UTC
	Thrown away 150 bad tasks today.

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 581 Credit: 9,775,312,024 RAC: 21,420,388 Level Scientific publications	Message 57872 - Posted: 25 Nov 2021 \| 6:40:41 UTC
	Zero valid tasks returned overnight; it's clearly a faultily constructed batch. At least it seems that we'll have new work to crunch once it is corrected.

joeybuddy96 Send message Joined: 1 Apr 20 Posts: 9 Credit: 146,536,770 RAC: 0 Level Scientific publications	Message 57873 - Posted: 25 Nov 2021 \| 21:29:53 UTC
	I got 16 errored tasks: 8 of cuda101, 8 of cuda1121. No tasks successfully completed. The errors were "Particle coordinate is nan" and "The requested CUDA device could not be loaded".

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,656,423,724 RAC: 13,411,734 Level Scientific publications	Message 57875 - Posted: 26 Nov 2021 \| 18:46:48 UTC
	Looks like they are sending out corrected tasks now from that last batch. Have several running now correctly.

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 581 Credit: 9,775,312,024 RAC: 21,420,388 Level Scientific publications	Message 57876 - Posted: 26 Nov 2021 \| 22:25:15 UTC - in response to Message 57875.
	Right, the problems seem to have been solved in this new batch of ADRIA tasks. I also estimate that this batch is considerably lighter than previous ones, and my GTX 1660 Ti will be hitting full bonus with its current task.

Billy Ewell 1931 Send message Joined: 22 Oct 10 Posts: 40 Credit: 1,543,734,361 RAC: 3,766,888 Level Scientific publications	Message 57877 - Posted: 27 Nov 2021 \| 5:01:18 UTC - in response to Message 57876.
	Only partial success! My Xeon-powered machine with a GTX 1060 was restarted about 2 hours ago and is performing without a fault. My i7 machine with an RTX 2070 and my i3 machine with a GTX 1060 were likewise restarted with GPU Grid, and several tasks have all repeatedly failed within 10-13 seconds of starting. I changed all drivers without any impact. I suppose we let our boxes run until all tasks that choose to fail have succeeded and the few that are successful are recorded as winners.

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,216,282,676 RAC: 29,658,016 Level Scientific publications	Message 57878 - Posted: 27 Nov 2021 \| 9:04:18 UTC
	Here the tasks failed within a few seconds:
	https://www.gpugrid.net/result.php?resultid=32703425
	https://www.gpugrid.net/result.php?resultid=32698761
	Excerpt from stderr:
	ACEMD failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
	So I guess these new WUs are not running on Ampere cards (here: 2 x RTX3070). Previously, however, WUs were crunched perfectly on these cards, regardless of the CUDA version. Too bad :-(

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,316,430 RAC: 20,000,148 Level Scientific publications	Message 57879 - Posted: 27 Nov 2021 \| 9:05:10 UTC - in response to Message 57877.
	Quote: My i7 machine with an RTX 2070 and my i3 machine with a GTX 1060 were likewise restarted with GPU Grid and several tasks have all repeatedly failed within 10-13 seconds of starting.
	Those two machines are running the same tasks - with 'Bandit' in their name - that are running successfully on other machines. So the problem lies in your machine setup, not in the tasks themselves. A lot of other machines have the same problem - you're not alone.
	Your tasks are failing with 'app exit status: 0xc0000135' - in all likelihood, you are missing a Microsoft runtime DLL file. Please refer to message 57353.
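	For readers wondering why 0xc0000135 points at a missing DLL: on Windows that value is the NTSTATUS code STATUS_DLL_NOT_FOUND. A small Python sketch of such a lookup follows; the table is deliberately incomplete, and the function name `explain_exit_status` is made up for illustration.

```python
# A few well-known Windows NTSTATUS values (not an exhaustive list).
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000139: "STATUS_ENTRYPOINT_NOT_FOUND",
}


def explain_exit_status(text: str) -> str:
    """Translate a hex 'app exit status' string into a symbolic name."""
    value = int(text, 16)  # base 16 accepts an optional 0x prefix
    return NTSTATUS_NAMES.get(value, f"unknown status 0x{value:08x}")


print(explain_exit_status("0xc0000135"))
```

	The generic 'app exit status: 0x1' seen in the nvrtc failures is not in this table; it is just the application's own non-zero return code.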

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,316,430 RAC: 20,000,148 Level Scientific publications	Message 57880 - Posted: 27 Nov 2021 \| 9:08:06 UTC - in response to Message 57878.
	Quote: excerpt from stderr: ACEMD failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch) so I guess these new WUs are not running on Ampere cards (here: 2 x RTX3070).
	It's not the tasks that fail on Ampere cards - it's the CUDA 101 app. The 1121 app should be OK.
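	A rough illustration of that point, in Python. It assumes the app tags cuda101 and cuda1121 correspond to CUDA 10.1 and CUDA 11.2.1, whose runtime compilers target GPUs up to compute capability 7.5 and 8.6 respectively; the tables and the function name `app_can_target` are illustrative, not taken from the project.

```python
# Highest compute capability each toolkit's runtime compiler can target
# (assumed mapping from app tag to CUDA toolkit version).
MAX_COMPUTE_CAPABILITY = {
    "cuda101": (7, 5),   # CUDA 10.1: up to Turing
    "cuda1121": (8, 6),  # CUDA 11.2: up to Ampere GA10x
}

# Compute capability of some cards mentioned in this thread.
GPU_COMPUTE_CAPABILITY = {
    "GTX 1060": (6, 1),
    "GTX 1660 Ti": (7, 5),
    "RTX 2070": (7, 5),
    "RTX 3070": (8, 6),
    "RTX 3080": (8, 6),
}


def app_can_target(app: str, gpu: str) -> bool:
    """True if the app's toolkit knows the GPU's architecture."""
    return GPU_COMPUTE_CAPABILITY[gpu] <= MAX_COMPUTE_CAPABILITY[app]


for gpu in ("GTX 1660 Ti", "RTX 3070"):
    for app in ("cuda101", "cuda1121"):
        print(gpu, app, app_can_target(app, gpu))
```

	This only models the "invalid value for --gpu-architecture" failure. It does not account for the reports elsewhere in this thread of cuda101 tasks completing on Ampere hosts.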

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,216,282,676 RAC: 29,658,016 Level Scientific publications	Message 57881 - Posted: 27 Nov 2021 \| 9:31:07 UTC - in response to Message 57880.
	Quote: excerpt from stderr: ACEMD failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch) so I guess these new WUs are not running on Ampere cards (here: 2 x RTX3070).
	Quote: It's not the tasks that fail on Ampere cards - it's the CUDA 101 app. The 1121 app should be OK.
	This was exactly the problem when the previous batch of WUs started: CUDA101 apps could not be crunched on Ampere cards, while 1121 WUs went well. Then someone here on the forum posted instructions on how to change the content of a specific file in the - I guess - GPUGRID project folder (I forgot which file that was). I followed these instructions, and from then on my RTX3070 cards were crunching both CUDA versions.
	Since this is no longer the case with the current batch, I suppose something must be different with these new WUs. From what I remember, about half of the WUs I had crunched over several weeks were 101, and the other half were 1121.
	Of course, it is rather impractical to just try downloading task after task and hoping that a 1121 will show up some time. As is known, after a number of unsuccessful WUs, downloads of new ones are blocked for a day.
	I would have expected the GPUGRID people to have repaired this specific problem in the meantime, which obviously is not the case. So they keep blocking an increasing number of hosts, which to me does not make any sense at all :-(

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,216,282,676 RAC: 29,658,016 Level Scientific publications	Message 57882 - Posted: 27 Nov 2021 \| 9:38:01 UTC - in response to Message 57881.
	Quote: Then someone here on the forum posted instructions on how to change the content of a specific file in the - I guess - GPUGRID project folder (I forgot which file that was). I followed these instructions, and from then on my RTX3070 cards were crunching both CUDA versions.
	I now remember: it was the Conda-pack.zip... file whose content had to be changed.

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,316,430 RAC: 20,000,148 Level Scientific publications	Message 57883 - Posted: 27 Nov 2021 \| 9:53:27 UTC - in response to Message 57881.
	Quote: I would have expected that the GPUGRID people have repaired this specific problem in the meantime. Which obviously is not the case. So they keep blocking an increasing number of hosts, which to me does not make any sense at all :-(
	I agree completely. But since the project doesn't seem to be (effectively) learning the lessons from previous mistakes, the best we can do is to perform the analysis for them, draw attention to the precise causes, and do what we can to ensure that at least some scientific research is completed successfully. Just burning up tasks with failures, until the maximum error limit for the WU is reached, doesn't help anyone.
	The file you need to change is acemd3.exe - it can be found in your current conda-pack.zip.xxxxxx file, in the GPUGrid project folder. Check whether a newer version of that file has been downloaded since you last modified it. Mine is currently dated 05 October - later than our last major discussion on the subject.
	That zip pack should also contain vcruntime140_1.dll, but I don't know if simply placing it in the zip would help - it might need to be specifically added to the unpacking instruction list as well.
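	One way to check what a given pack actually contains, sketched in Python with the standard zipfile module. The function name `missing_files` is hypothetical, and matching by base name is an assumption, since the thread does not say where in the archive the DLL would sit.

```python
import posixpath
import zipfile


def missing_files(zip_source, wanted):
    """Return the wanted file names that are absent from the archive.

    zip_source may be a path or a binary file object. Members are
    compared by base name, case-insensitively.
    """
    with zipfile.ZipFile(zip_source) as archive:
        present = {posixpath.basename(name).lower()
                   for name in archive.namelist()}
    return [name for name in wanted if name.lower() not in present]
```

	Called as `missing_files(path_to_pack, ["acemd3.exe", "vcruntime140_1.dll"])`, it would return an empty list if both files are in the pack. Reading the archive this way does not modify it.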

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,216,282,676 RAC: 29,658,016 Level Scientific publications	Message 57884 - Posted: 27 Nov 2021 \| 10:54:44 UTC - in response to Message 57883.
	The Conda-pack file which is currently in the GPUGRID folder is named "Conda-pack.zip.1d5...358" and is dated 4.10.21. However, the file whose content I was changing was called "conda-pack.zip.aeb48...86a", and this file, for some reason, no longer exists in the GPUGRID folder. Maybe it was deleted this morning when GPUGRID was updated while I was retrieving new tasks. I am sure that both files were in the GPUGRID folder before. In fact, I remember that the change I had made in October was to overwrite the content of "conda-pack.zip.aeb48...86a" with the content of the file "Conda-pack.zip.1d5...358".
	What is new since this morning, among other files, is "windows_x86_64_cuda101.zip.c0d...b21", dated 27.11. (= date of download this morning). Whether a similar file was in the GPUGRID folder before and may have been deleted this morning, I do not know.
	So what I could do is copy "conda-pack.zip.aeb48...86a", of which I had saved a copy in the "documents" folder, to the GPUGRID folder. Whether it helps or not, I will only see after retrying a new task (if it happens to be CUDA101 again).
	Any other suggestions?

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,316,430 RAC: 20,000,148 Level Scientific publications	Message 57885 - Posted: 27 Nov 2021 \| 11:17:30 UTC - in response to Message 57884.
	Have a look at the job.xml.xxxxxx file. I have one dated 22 September that says
	<job_desc>
		<task>
			<application>bin/acemd3.exe</application>
			<command_line>--boinc --device $GPU_DEVICE_NUM</command_line>
			<stdout_filename>progress.log</stdout_filename>
			<checkpoint_filename>restart.chk</checkpoint_filename>
			<fraction_done_filename>progress</fraction_done_filename>
		</task>
		<unzip_input>
			<zipfilename>conda-pack.zip</zipfilename>
		</unzip_input>
	</job_desc>
	and another dated yesterday that says
	<job_desc>
		<task>
			<application>bin/acemd3.exe</application>
			<command_line>--boinc --device $GPU_DEVICE_NUM</command_line>
			<stdout_filename>progress.log</stdout_filename>
			<checkpoint_filename>restart.chk</checkpoint_filename>
			<fraction_done_filename>progress</fraction_done_filename>
		</task>
		<unzip_input>
			<zipfilename>windows_x86_64__cuda101.zip</zipfilename>
		</unzip_input>
	</job_desc>
	To be certain, you'd need to look at the job specification in client_state.xml, but I think I'd go with the newest. Note that you'd also need to have the matching versions of cudart and cufft for the app you end up using.
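	To compare several job.xml.xxxxxx files quickly, something like the following Python sketch could pull out the relevant fields. The XML is the newer job description quoted above; the function name `summarize_job` is made up for this example.

```python
import xml.etree.ElementTree as ET

# The newer of the two job descriptions quoted in the post above.
JOB_XML = """<job_desc>
  <task>
    <application>bin/acemd3.exe</application>
    <command_line>--boinc --device $GPU_DEVICE_NUM</command_line>
    <stdout_filename>progress.log</stdout_filename>
    <checkpoint_filename>restart.chk</checkpoint_filename>
    <fraction_done_filename>progress</fraction_done_filename>
  </task>
  <unzip_input>
    <zipfilename>windows_x86_64__cuda101.zip</zipfilename>
  </unzip_input>
</job_desc>"""


def summarize_job(xml_text: str) -> dict:
    """Extract the executable, its arguments and the packs to unzip."""
    root = ET.fromstring(xml_text)
    return {
        "application": root.findtext("task/application"),
        "command_line": root.findtext("task/command_line"),
        "zip_files": [e.text for e in root.findall("unzip_input/zipfilename")],
    }


print(summarize_job(JOB_XML))
```

	The zip_files entry shows at a glance which package a given job file will unpack, which is the difference between the two files above.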

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,216,282,676 RAC: 29,658,016 Level Scientific publications	Message 57887 - Posted: 27 Nov 2021 \| 13:46:24 UTC - in response to Message 57885.
	Quote: Have a look at the job.xml.xxxxxx file. ...
	My job.xml.xxxxxx files look exactly like yours, also date-wise. To me, this shows that the new tasks no longer use the former
	<zipfilename>conda-pack.zip</zipfilename>
	but rather the new
	<zipfilename>windows_x86_64__cuda101.zip</zipfilename>
	And since no "...cuda1121.zip" was downloaded into the GPUGRID folder, I suppose that the new WUs are running cuda101 only. This would further mean that these new WUs will not work with Ampere cards :-( Looks as simple as that, most sadly :-(
	Unless someone here can report successful completion of the new WUs with an Ampere card.
	If possible, some kind of confirmation/statement/explanation or whatever from the team would also help a lot.

PDW Send message Joined: 7 Mar 14 Posts: 15 Credit: 4,919,544,525 RAC: 31,461,700 Level Scientific publications	Message 57888 - Posted: 27 Nov 2021 \| 14:23:34 UTC - in response to Message 57887.
	Quote: Unless someone here can report about successful completion of the new WUs with an Ampere card.
	I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine. All tasks on all cards have worked; I am going to try some slower cards, given that the tasks are smaller. I have never renamed any of the project files.

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,216,282,676 RAC: 29,658,016 Level Scientific publications	Message 57889 - Posted: 27 Nov 2021 \| 14:49:32 UTC - in response to Message 57888.
	Quote: I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine.
	Thanks for the information, sounds interesting. Could you please let me/us know whether your www.gpugrid.net folder (in BOINC > projects) contains any conda-pack.zip files (if yes, which ones?), and whether, besides "windows_x86_64_cuda101.zip.c0d...b21", it contains such a file with "...cuda1121" (instead of cuda101)?

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,316,430 RAC: 20,000,148 Level Scientific publications	Message 57890 - Posted: 27 Nov 2021 \| 15:31:47 UTC - in response to Message 57889.
	I have completed a Windows x64 cuda1121 task, and I have a windows_x86_64__cuda1121.zip file on that machine. You can download a copy from my Google drive.

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 581 Credit: 9,775,312,024 RAC: 21,420,388 Level Scientific publications	Message 57891 - Posted: 27 Nov 2021 \| 15:39:37 UTC - in response to Message 57888.
	Quote: I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine.
	The problem discussed most recently only affects Windows hosts. If your hosts are running under any kind of Linux distribution, it is normal that they aren't affected.

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,206,655,749 RAC: 261,147 Level Scientific publications	Message 57892 - Posted: 27 Nov 2021 \| 15:42:46 UTC
	What's the point in keeping the CUDA10 app alive? The CUDA11 app works on older cards as well.

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57893 - Posted: 27 Nov 2021 \| 15:44:37 UTC - in response to Message 57892.
	Quote: What's the point in keeping the CUDA10 app alive? The CUDA11 app works on older cards as well.
	I agree and I've said this a few times also. No point in keeping the CUDA101 app when there's the 1121 app.

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,216,282,676 RAC: 29,658,016 Level Scientific publications	Message 57894 - Posted: 27 Nov 2021 \| 15:52:30 UTC - in response to Message 57891.
	Quote: I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine.
	Quote: This lastly commented problem is only impacting Windows hosts. If your hosts are running under any kind of Linux distribution, it is normal that they aren't being affected.
	Too bad that the user PDW has hidden his computers in the profile. So no idea what OS is being used ... unless he tells us.
	Quote: What's the point in keeping the CUDA10 app alive? The CUDA11 app works on older cards as well.
	Good question.

Werinbert Send message Joined: 12 May 13 Posts: 5 Credit: 100,032,540 RAC: 0 Level Scientific publications	Message 57896 - Posted: 27 Nov 2021 \| 21:18:57 UTC - in response to Message 57876.
	Quote: I'm also estimating that this batch is considerably slighter than precedent ones, and my GTX 1660 Ti will be hitting full bonus with its current task.
	ServicEnginIC, I noticed that your task completed in under 64,000 sec. My 1660 Ti is looking to finish in just under 88,000 sec. I am wondering what could be causing such a big difference. Mine is a CUDA101 task running under Win 7 and yours is a CUDA1121 task running under Linux. Is either of these the culprit?

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 581 Credit: 9,775,312,024 RAC: 21,420,388 Level Scientific publications	Message 57898 - Posted: 27 Nov 2021 \| 22:44:29 UTC - in response to Message 57896.
	Working under Linux helps to squeeze out maximum performance. Some optimized settings in BOINC Manager and moderate overclocking do the rest. In the "Managing non-high-end hosts" thread I try to share all that I know about it.

PDW Send message Joined: 7 Mar 14 Posts: 15 Credit: 4,919,544,525 RAC: 31,461,700 Level Scientific publications	Message 57905 - Posted: 28 Nov 2021 \| 10:05:52 UTC - in response to Message 57894.
	Quote: I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine.
	Quote: This lastly commented problem is only impacting Windows hosts. If your hosts are running under any kind of Linux distribution, it is normal that they aren't being affected.
	Quote: too bad that the user PDW has hidden his computers in the profile. So no idea what OS is being used ... unless he tells us.
	You asked:
	Quote: Unless someone here can report about successful completion of the new WUs with an Ampere card.
	As I posted previously I am using linux.

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,216,282,676 RAC: 29,658,016 Level Scientific publications	Message 57906 - Posted: 28 Nov 2021 \| 10:53:41 UTC - in response to Message 57905.
	Quote: Unless someone here can report about successful completion of the new WUs with an Ampere card.
	Quote: As I posted previously I am using linux.
	Oh okay, thanks for the information (which explains why it works well on your system).

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,206,655,749 RAC: 261,147 Level Scientific publications	Message 57911 - Posted: 28 Nov 2021 \| 13:55:24 UTC - in response to Message 57906. Last modified: 28 Nov 2021 \| 13:56:49 UTC
	Quote: I have Ampere cards completing 101 and 1121 tasks from the latest batch just fine.
	Quote: This lastly commented problem is only impacting Windows hosts. If your hosts are running under any kind of Linux distribution, it is normal that they aren't being affected.
	Quote: too bad that the user PDW has hidden his computers in the profile. So no idea what OS is being used ... unless he tells us.
	Quote: You asked: Unless someone here can report about successful completion of the new WUs with an Ampere card. As I posted previously I am using linux.
	Quote: oh okay, thanks for the information (which explains why it works well on your system).
	No, it does not explain it. I've tried to run a CUDA 101 task on my Ubuntu 18.04.6 host on an RTX 3080 Ti (driver: 495.44), and it failed after a few minutes.
	<core_client_version>7.16.17</core_client_version>
	<![CDATA[
	<message> process exited with code 195 (0xc3, -61)</message>
	<stderr_txt>
	14:33:16 (1675): wrapper (7.7.26016): starting
	14:33:23 (1675): wrapper (7.7.26016): starting
	14:33:23 (1675): wrapper: running bin/acemd3 (--boinc --device 0)
	ACEMD failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
	14:35:30 (1675): bin/acemd3 exited; CPU time 127.166324
	14:35:30 (1675): app exit status: 0x1
	14:35:30 (1675): called boinc_finish(195)
	</stderr_txt>
	]]>
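	Since every failure in this thread ends in the same exit code 195, the "ACEMD failed:" line is what separates a bad work unit from an app/GPU mismatch. A Python sketch for pulling it out of a stderr dump follows. The sample is laid out one log entry per line, which is an assumption about the original file, and the cause labels in `KNOWN_CAUSES` are this thread's interpretation rather than anything official.

```python
import re

# Sample stderr text, based on the log quoted above.
SAMPLE_STDERR = """\
14:33:23 (1675): wrapper: running bin/acemd3 (--boinc --device 0)
ACEMD failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
14:35:30 (1675): bin/acemd3 exited; CPU time 127.166324
14:35:30 (1675): app exit status: 0x1
14:35:30 (1675): called boinc_finish(195)
"""

# Hypothetical mapping from message fragments to causes discussed here.
KNOWN_CAUSES = {
    "Particle coordinate is nan": "faulty work unit",
    "invalid value for --gpu-architecture": "app build does not support this GPU",
}


def acemd_failure(stderr_text):
    """Return the text following 'ACEMD failed:', or None if absent."""
    # \s* also skips a line break, in case the message is on the next line.
    match = re.search(r"ACEMD failed:\s*(.+)", stderr_text)
    return match.group(1).strip() if match else None


def likely_cause(stderr_text):
    """Map the ACEMD failure message to one of the causes seen in this thread."""
    message = acemd_failure(stderr_text)
    if message is None:
        return "no ACEMD failure message found"
    for fragment, cause in KNOWN_CAUSES.items():
        if fragment in message:
            return cause
    return "unrecognised ACEMD failure"


print(likely_cause(SAMPLE_STDERR))
```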

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,206,655,749 RAC: 261,147 Level Scientific publications	Message 57918 - Posted: 28 Nov 2021 \| 14:26:19 UTC - in response to Message 57911. Last modified: 28 Nov 2021 \| 14:39:37 UTC
	Another example: http://www.gpugrid.net/result.php?resultid=32706825
	EDIT: 3rd attempt (failed as well): http://www.gpugrid.net/result.php?resultid=32706943

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57959 - Posted: 29 Nov 2021 \| 14:28:37 UTC
	After yesterday's snafu, I picked up two cuda101 tasks this morning on my Linux Ubuntu 20.04 3080Ti system. Currently running OK. It's been running about 20 mins now and is utilizing the GPU at 99%, so it's definitely working.
	I basically executed a project reset yesterday on this host, so I don't think my previous modifications to swap out the 101 app for 1121 carried over.

Billy Ewell 1931 Send message Joined: 22 Oct 10 Posts: 40 Credit: 1,543,734,361 RAC: 3,766,888 Level Scientific publications	Message 57966 - Posted: 29 Nov 2021 \| 18:07:32 UTC - in response to Message 57880.
	Quote: Your tasks are failing with 'app exit status: 0xc0000135' - in all likelihood, you are missing a Microsoft runtime DLL file. Please refer to message 57353.
	Richard: Thank you kindly for solving the problem. I installed both the x86 and x64 update installers and now both machines are processing GPU Grid tasks without fault.
	Billy Ewell 1931; celebrating the passage of my 90th birthday a few days ago, physically in good shape and still mentally quite capable.

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,656,423,724 RAC: 13,411,734 Level Scientific publications	Message 57967 - Posted: 29 Nov 2021 \| 18:09:21 UTC
	I missed out on all the new work because I had to get new master lists on all the hosts when their 24 hour timeouts finally expired.

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,316,430 RAC: 20,000,148 Level Scientific publications	Message 57969 - Posted: 29 Nov 2021 \| 18:24:03 UTC - in response to Message 57967.
	Quote: I missed out on all the new work because I had to get new master lists on all the hosts when their 24 hour timeouts finally expired.
	I think the 24 hour (master file fetch) backoff is set by the client, rather than the server - so it can be overridden by a manual update. That's unlike the 'daily quota exceeded' and the 'last request too recent' backoffs, which are enforced by the server and can't be bypassed.
	I might use one of these boring lockdown days to force a client into 'master file fetch' mode, so I can see how it's recorded in client_state.xml, and hence how to remove it again - whenever and wherever that knowledge might be useful in the future.

Keith Myers Send message Joined: 13 Dec 17 Posts: 1340 Credit: 7,656,423,724 RAC: 13,411,734 Level Scientific publications	Message 57971 - Posted: 29 Nov 2021 \| 19:54:05 UTC
	Manual updates did nothing but keep resetting the 24 hour timer backoff. Same with an update script running every 15 minutes. Backoff never got below 23 hours before resetting back to 24.

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,316,430 RAC: 20,000,148 Level Scientific publications	Message 57973 - Posted: 29 Nov 2021 \| 21:04:16 UTC - in response to Message 57971.
	That's because another failure doesn't reset the failure count. We need to find out where that's stored, and reduce it to less than 10.

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,206,655,749 RAC: 261,147 Level Scientific publications	Message 57975 - Posted: 29 Nov 2021 \| 21:37:29 UTC - in response to Message 57959.
	Quote: after yesterday's snafu, I picked up two cuda101 tasks this morning on my Linux Ubuntu 20.04 3080Ti system. currently running ok. been running about 20 mins now, and is utilizing the GPU @99% so it's definitely working. I basically executed a project reset yesterday on this host, so I don't think my previous modifications to swap out the 101 app for 1121 carried over.
	That's easy to check: the CUDA1121 is 990MB while the CUDA101 is 491MB (503406KB). I think it's impossible to run the CUDA101 app on the RTX 3000 series, as that was the main reason for demanding a CUDA11 client not so long ago.

Ian&Steve C. Send message Joined: 21 Feb 20 Posts: 1069 Credit: 40,231,533,983 RAC: 527 Level Scientific publications	Message 57976 - Posted: 30 Nov 2021 \| 2:50:49 UTC - in response to Message 57975.
	Quote: after yesterday's snafu, I picked up two cuda101 tasks this morning on my Linux Ubuntu 20.04 3080Ti system. currently running ok. been running about 20 mins now, and is utilizing the GPU @99% so it's definitely working. I basically executed a project reset yesterday on this host, so I don't think my previous modifications to swap out the 101 app for 1121 carried over.
	Quote: That's easy to check: the CUDA1121 is 990MB while the CUDA101 is 491MB (503406KB). I think it's impossible to run the CUDA101 on RTX3000 series, as that was the main reason demanding a CUDA11 client not so long ago.
	My gpugrid project folder contains two compressed files for acemd3:
	x86_64-pc-linux-gnu__cuda101.zip.<alphanumeric> (515.5 MB)
	x86_64-pc-linux-gnu__cuda1121.zip.<alphanumeric> (1.0 GB)
	So it seems it did indeed use the cuda101 code on my 3080Ti, and both tasks succeeded.
	https://www.gpugrid.net/result.php?resultid=32707549
	https://www.gpugrid.net/result.php?resultid=32701203
	Since both apps use the same filename of just 'acemd3', it's possible some bug is causing the wrong (or is it right? lol) one to be used, or something to that effect.
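	The size check suggested above can be scripted. Here is a Python sketch that lists the zipped app packages in a project folder with their sizes; the glob pattern is inferred from the file names quoted in this thread, and the function name `app_packages` is hypothetical.

```python
from pathlib import Path


def app_packages(project_dir):
    """Map each zipped app package in project_dir to its size in MB (10^6 bytes)."""
    return {
        path.name: round(path.stat().st_size / 1e6, 1)
        for path in sorted(Path(project_dir).glob("*cuda*.zip*"))
    }
```

	It would be called with the path of the www.gpugrid.net folder inside the BOINC projects directory, wherever that lives on the host. Note that this only shows which packages were downloaded, not which 'acemd3' binary a running task actually unpacked.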

PDW Send message Joined: 7 Mar 14 Posts: 15 Credit: 4,919,544,525 RAC: 31,461,700 Level Scientific publications	Message 57978 - Posted: 30 Nov 2021 \| 8:08:50 UTC - in response to Message 57976.
	Quote: since both apps use the same filename of just 'acemd3', it's possible some bug is causing the wrong (or is it right? lol) one to be used or something to that effect.
	Don't forget it could be this... http://www.gpugrid.net/forum_thread.php?id=5256&nowrap=true#57473
	Have completed 5 of the recent cuda101 tasks on Ampere hosts now, a sixth is running and a seventh lined up. Have seen no failures as yet.

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,206,655,749 RAC: 261,147 Level Scientific publications	Message 57985 - Posted: 1 Dec 2021 \| 9:43:31 UTC - in response to Message 57978.
	Quote: since both apps use the same filename of just 'acemd3', it's possible some bug is causing the wrong (or is it right? lol) one to be used or something to that effect.
	Quote: Don't forget it could be this... http://www.gpugrid.net/forum_thread.php?id=5256&nowrap=true#57473 Have completed 5 of the recent cuda101 tasks on Ampere hosts now, a sixth is running and a seventh lined up. Have seen no failures as yet.
	I guess that you still use the "special" BOINC manager (compiled for SETI@home), and that handles apps in a different way. That would explain this anomaly.

PDW Send message Joined: 7 Mar 14 Posts: 15 Credit: 4,919,544,525 RAC: 31,461,700 Level Scientific publications	Message 57987 - Posted: 1 Dec 2021 \| 9:59:34 UTC - in response to Message 57985.
	Quote: I guess that you still use the "special" BOINC manager (compiled for SETI@home), and that handles apps in a different way. That would explain this anomaly.
	No. No modified manager or client here, just the bog standard BOINC 7.16.6

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1620 Credit: 8,835,316,430 RAC: 20,000,148 Level Scientific publications	Message 57988 - Posted: 1 Dec 2021 \| 10:21:40 UTC - in response to Message 57987. Last modified: 1 Dec 2021 \| 10:24:55 UTC
	Quote: ... just the bog standard BOINC 7.16.6
	You are recommended to upgrade to v7.16.20 - it's pretty good code, and - importantly - it has updated SSL security certificates needed by some BOINC projects.
	(Edit - the above advice applies only to Windows machines. If you're running Linux, you can ignore it. Your computers are hidden, so I don't know which applies.)

Erich56 Send message Joined: 1 Jan 15 Posts: 1132 Credit: 10,216,282,676 RAC: 29,658,016 Level Scientific publications	Message 58071 - Posted: 11 Dec 2021 \| 15:16:28 UTC
	A few hours ago, I had another task which failed after a few seconds with
	195 (0xc3) EXIT_CHILD_FAILED
	ACEMD failed: Particle coordinate is nan
	https://www.gpugrid.net/workunit.php?wuid=27099407
	As can be seen, the task failed on a total of 8 different hosts. I am questioning the rationale behind sending out a faulty task 8 x :-(((
