Message boards : Number crunching : Major SNAFU in Effect
Author | Message |
---|---|
I noticed a ton of errors on a previously 100% reliable host tonight. Looks like a bad batch of WUs got pushed out, both IDP and KIX jobs are affected. | |
ID: 51786 | Rating: 0 | rate: / Reply Quote | |
http://www.gpugrid.net/results.php?hostid=490728 | |
ID: 51789 | Rating: 0 | rate: / Reply Quote | |
http://www.gpugrid.net/results.php?hostid=490728 Did someone forget to renew a license? | |
ID: 51790 | Rating: 0 | rate: / Reply Quote | |
I'm getting nothing but comp errors on these new tasks also. | |
ID: 51791 | Rating: 0 | rate: / Reply Quote | |
Same here, of course. But I haven't seen anyone from the project around here for a while. Is anyone at home? | |
ID: 51792 | Rating: 0 | rate: / Reply Quote | |
Same here as well. Error 212 on WUs that were running fine up to 4-5 hours ago. Sounds like a license thing to me as well. Suspended project until the issue is resolved. | |
ID: 51793 | Rating: 0 | rate: / Reply Quote | |
Have the same issues on two Linux machines, so not sure if this is a license thing. | |
ID: 51795 | Rating: 0 | rate: / Reply Quote | |
For the last 2 years, the License error usually comes after July 1st. 12 month license, I am assuming. | |
ID: 51796 | Rating: 0 | rate: / Reply Quote | |
Every task I had in my cache on 4 hosts errored out today. Since I don't run very high resource allotment, some tasks had been running a couple of hours a day with no issues until today. The hosts are processing other projects without any errors during this time. I'd have to guess a license expired today. | |
ID: 51798 | Rating: 0 | rate: / Reply Quote | |
Same. I have two Ubuntu machines that throw up nothing but immediate errors now. My two Windows crunchers are fine, though. | |
ID: 51799 | Rating: 0 | rate: / Reply Quote | |
The Linux app is broken (most probably its license expired). <core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 212 (0xd4, -44)</message>
<stderr_txt>
</stderr_txt>
]]> However, my Windows hosts are crunching happily, so I switched back to Windows on my Linux hosts. The GPUGrid staff need to act on this without delay. | |
ID: 51801 | Rating: 0 | rate: / Reply Quote | |
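For anyone puzzling over the three numbers in that error line: they look like the same exit status written three ways. A minimal Python sketch, assuming the third value is just the low byte reinterpreted as a signed 8-bit integer (the helper name `describe_exit_code` is made up for illustration and is not part of BOINC):

```python
def describe_exit_code(code: int) -> str:
    """Format an exit status as decimal, hex and signed 8-bit (illustrative helper)."""
    low_byte = code & 0xFF  # keep only the low 8 bits of the status
    # Two's-complement view of the same byte: values of 128 and up wrap negative.
    signed = low_byte - 256 if low_byte >= 128 else low_byte
    return f"{low_byte} ({low_byte:#x}, {signed})"


print(describe_exit_code(212))
```

This only shows that 212, 0xd4 and -44 are one value; it says nothing about what the app means by it.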
Same over here: | |
ID: 51802 | Rating: 0 | rate: / Reply Quote | |
The Linux ACEMD v9.19 apps were deployed on 13/14 February 2018 - so it possibly looks like a 15 month licence expiry. | |
ID: 51804 | Rating: 0 | rate: / Reply Quote | |
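The licence-term guess above is simple calendar arithmetic. A small Python sketch, assuming the 14 February 2018 deployment date quoted in the post and treating the 12- and 15-month terms purely as the posters' guesses (the real licence terms are not stated anywhere in this thread; `add_months` is a hypothetical helper):

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`, clamping the day."""
    month_index = start.month - 1 + months          # zero-based month count
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]  # number of days in target month
    return date(year, month, min(start.day, last_day))


deployed = date(2018, 2, 14)      # deployment date quoted in the post (assumed exact)
print(add_months(deployed, 12))   # where a 12-month term would end
print(add_months(deployed, 15))   # where a 15-month term would end
```

Comparing the two printed dates with the day the errors started is a quick way to see which guess fits better.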
A temporary fix for Linux users is to set your system date back 1 year. | |
ID: 51808 | Rating: 0 | rate: / Reply Quote | |
Is project leadership aware of the licensing expiration? Seems like someone should be keeping a tickler file for this so that renewals could happen before WU's start erroring out. | |
ID: 51810 | Rating: 0 | rate: / Reply Quote | |
Is project leadership aware of the licensing expiration? Apparently not. That's why this SNAFU. Seems like someone should be keeping a tickler file for this so that renewals could happen before WU's start erroring out. True. | |
ID: 51811 | Rating: 0 | rate: / Reply Quote | |
There wasn't any notification of the pending shutdown of the Quantum Chemistry (CPU) work units either, or when they might be restarted. | |
ID: 51814 | Rating: 0 | rate: / Reply Quote | |
I'm going to just suspend the project on all my hosts. The fact I have to exclude my Turing cards makes it difficult to work with the project anyway. | |
ID: 51816 | Rating: 0 | rate: / Reply Quote | |
Seems like someone should be keeping a tickler file for this so that renewals could happen before WU's start erroring out. Also in the past, license renewals were not done in time and tasks failed. Too bad, but it really seems that the people at GPUGRID simply forget about these things. | |
ID: 51817 | Rating: 0 | rate: / Reply Quote | |
Just in case anyone is still wondering, I've been sent WU 16485663. | |
ID: 51821 | Rating: 0 | rate: / Reply Quote | |
Got a PM reply from Toni: Oh gosh, thanks ... :-) | |
ID: 51822 | Rating: 0 | rate: / Reply Quote | |
Got a PM reply from Toni: Hey Richard, thanks for raising this with admins. Much appreciated! | |
ID: 51823 | Rating: 0 | rate: / Reply Quote | |
So hopefully we will be back up and running shortly :). Thanks for bringing it to Toni's attention. | |
ID: 51824 | Rating: 0 | rate: / Reply Quote | |
Will someone tell us when the FUBAR has finished??? | |
ID: 51825 | Rating: 0 | rate: / Reply Quote | |
The problem is still not resolved... | |
ID: 51828 | Rating: 0 | rate: / Reply Quote | |
Got a PM reply from Toni: What surprises me though is that no one from GPUGRID found out by themselves :-( | |
ID: 51830 | Rating: 0 | rate: / Reply Quote | |
I aborted all my GPU WUs to let someone with Windows run them. I was hoping the certificate would be renewed by now so I could finish the ones I had time invested in, which I suspended before they failed. No such luck :-(. Barely enough calendar time left to finish them anyway. | |
ID: 51838 | Rating: 0 | rate: / Reply Quote | |
Toni responded in this thread: http://www.gpugrid.net/forum_thread.php?id=4925&nowrap=true#51834 We are aware of the problem. We'd like to do a major version upgrade rather than continue fixing the old one. For the time being, I'm deprecating the app for linux so crunching goes on on Windows rather than erroring out. | |
ID: 51839 | Rating: 0 | rate: / Reply Quote | |
So it looks like time to find a new project for the majority of my machines. Only have 1 that still runs M$ | |
ID: 51841 | Rating: 0 | rate: / Reply Quote | |
I came back from the Pent to this. :( Thought my computers borked. | |
ID: 51844 | Rating: 0 | rate: / Reply Quote | |
So does anyone want to explain how a BOINC wrapper works? The docs don't really say anything about the mechanics involved. | |
ID: 51845 | Rating: 0 | rate: / Reply Quote | |
LHC@home uses a boincwrapper. All Windows, MAC OSX and other Linux distros can run their programs written in Scientific Linux. You must have VirtualBox installed. | |
ID: 51846 | Rating: 0 | rate: / Reply Quote | |
LHC@home uses a boincwrapper. All Windows, MAC OSX and other Linux distros can run their programs written in Scientific Linux. You must have VirtualBox installed. Nope, that's even more separation from the client, including the OS and environment details like specific libc versions. In the case of LHC they give the choice of VBox or setting up CVMFS and Singularity on your own, both of which are included in the vbox.vdi file. https://boinc.berkeley.edu/trac/wiki/WrapperApp If you DON'T want to include progress % complete, checkpointing, or GPU device # handling within your app, then the wrapper can do that. Don't expect it to be as efficient, as there is now another layer between the exe doing the calculations and the hardware. | |
ID: 51859 | Rating: 0 | rate: / Reply Quote | |
So we can no longer run this BOINC GPU Project under BOINC version 7.9.3 on Ubuntu 18.04.2 LTS [4.15.0-51-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)] Running NVIDIA GeForce GTX 1080 Ti (4095MB) driver: 390.11? | |
ID: 51863 | Rating: 0 | rate: / Reply Quote | |
So we can no longer run this BOINC GPU Project under BOINC version 7.9.3 on Ubuntu 18.04.2 LTS [4.15.0-51-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)] Running NVIDIA GeForce GTX 1080 Ti (4095MB) driver: 390.11? Correction: We cannot run this BOINC GPU Project (GPUGrid) on any Linux distro for a who-knows-how-long time period. | |
ID: 51866 | Rating: 0 | rate: / Reply Quote | |
I bet it won't be long before we get Linux WUs again. In the mean time there's asteroids, einstein, milkyway & seti to keep one busy. | |
ID: 51875 | Rating: 0 | rate: / Reply Quote | |
So does anyone want to explain how a BOINC wrapper works? The docs don't really say anything about the mechanics involved. From what I understand, it's a wrapper program they put around their normal (non-BOINC) science app that is used to invoke it. No pre-reqs. No need for vbox. That way the wrapper handles the BOINC interaction and allows the use of a non-BOINC app. See https://boinc.berkeley.edu/trac/wiki/WrapperApp for docs. ____________ BOINC blog | |
ID: 51883 | Rating: 0 | rate: / Reply Quote | |
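To make the "wrapper around a non-BOINC app" idea concrete, here is a rough Python sketch of the general pattern: a small supervisor starts the unmodified program as an ordinary child process, watches it, and passes its exit status upward. This is an illustration under stated assumptions, not the real thing: the actual BOINC wrapper is a compiled program configured through a job file and talks to the client via the BOINC API, and `run_wrapped` is a hypothetical name.

```python
import subprocess
import sys
import time


def run_wrapped(cmd: list[str], poll_seconds: float = 0.1) -> int:
    """Launch a stand-alone program and babysit it, wrapper-style (sketch only)."""
    child = subprocess.Popen(cmd)      # start the unmodified science app
    while child.poll() is None:        # None means the child is still running
        # A real wrapper would report CPU time and progress here, and honour
        # suspend, resume and abort requests coming from the client.
        time.sleep(poll_seconds)
    return child.returncode            # hand the app's exit status to the caller


if __name__ == "__main__":
    # Stand-in "science app": a child Python process that prints and exits.
    status = run_wrapped([sys.executable, "-c", "print('crunching...')"])
    print("app exited with", status)
```

Note that nothing here involves a virtual machine: the science app is just a normal process on the host, which is why a wrapper needs neither VirtualBox nor CPU virtualization support.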
Thanks, I had already read that document and was and still am confused. I gather it is not a VM. So I assume you don't need virtualization on the CPU? | |
ID: 51886 | Rating: 0 | rate: / Reply Quote | |
Thanks, I had already read that document and was and still am confused. I gather it is not a VM. So assume you don't need virtualization on the cpu? The wrapper does not need VBox. It's just another interface that performs the BOINC-related functions, while the project's 'math.exe' (or whatever is doing the crunching) ONLY performs calculations. VBox can set up the entire OS environment to satisfy all the specifics needed to crunch. If a project needs extra programs that do not typically come with an OS or are not normally installed by people, then those can be included in the vbox image. Again, with LHC as the example, Singularity and CVMFS are included in the image. They can also make one vbox image for both Windows and Linux host OSs. | |
ID: 51887 | Rating: 0 | rate: / Reply Quote | |
Is the BOINC wrapper a memory hog like virtualbox??? | |
ID: 51888 | Rating: 0 | rate: / Reply Quote | |
I'm trying to think of projects that use it. Going through project folders, it looks like DrugDiscovery CPU Goofy, MindModeling and CAS used it. DHEP, Gerasium, Moo, SRBase, Enigma, YoYo and Yafu are active projects that have a wrapper in the exe name. Some Yoyo ECM tasks can use like 8 GB, but I think that's the data, as it's limited to certain types. But nothing like LHC Atlas using 10 GB for the other projects. VBox apps are huge because it's an entire image. | |
ID: 51889 | Rating: 0 | rate: / Reply Quote | |
That still shows the Linux hosts responsible for 1/3 of the total credit. And since the percentage of Linux hosts is 37% compared to 54% for Windows hosts, the Linux hosts are showing a greater percentage of higher production hosts compared to Windows hosts. | |
ID: 51890 | Rating: 0 | rate: / Reply Quote | |
It would benefit the project to return the Linux hosts to participation. Which is why the PM which got Toni's attention had the subject line Research being delayed - Linux apps broken :-) | |
ID: 51891 | Rating: 0 | rate: / Reply Quote | |
Been a while, any news? | |
ID: 52084 | Rating: 0 | rate: / Reply Quote | |
Now the license of the Windows app has expired. | |
ID: 52426 | Rating: 0 | rate: / Reply Quote | |
Now the license of the Windows app has expired. August is the vacation month in Italy. Looking at the "about" I don't see a lot of diversity. Probably took off a week to get their heads out of the quantum clouds and socialize with opposite sex. | |
ID: 52427 | Rating: 0 | rate: / Reply Quote | |
August is vacation month in Italy . . . Most likely most take off the whole month . . . not just a week. ____________ | |
ID: 52429 | Rating: 0 | rate: / Reply Quote | |
They are in Spain, so I always figured they would head to Majorca. No one ever denied it at any rate. | |
ID: 52430 | Rating: 0 | rate: / Reply Quote | |
Same here, of course. But I haven't seen anyone from the project around here for a while. Is anyone at home? It looks to me like the two main researchers are about to get a flood of workunits that failed due to all of the tasks giving an error. If so, they will have to notify the programmer or programmers, and start an effort to fix the problem. If they're able to read and write in English, they'll then have little worthwhile to do other than tell us what happened, and when they expect a fix. | |
ID: 52431 | Rating: 0 | rate: / Reply Quote | |
Am I to assume this has been fixed and I can add my Linux machine here? Or are there no WUs for Linux as of yet? | |
ID: 52705 | Rating: 0 | rate: / Reply Quote | |
Am I to assume this has been fixed and I can add my Linux machine here? It's been fixed, though only the Windows app is released to the production line. You can add your Linux machine, but it will receive only beta test tasks for a while. Or are there no WUs for Linux as of yet? The workunits are common, but the new Linux app will be put into the production line only when the new Windows app is working as it should be. I know I'm crunching okay under Windows... Me too. | |
ID: 52706 | Rating: 0 | rate: / Reply Quote | |
Am I to assume this has been fixed and I can add my Linux machine here? It's been fixed, though only the Windows app is released to the production line. I am receiving non-Toni test tasks today for my Linux host. Looks like normal project work. https://www.gpugrid.net/result.php?resultid=21405079 https://www.gpugrid.net/result.php?resultid=21405557 https://www.gpugrid.net/result.php?resultid=21405187 https://www.gpugrid.net/result.php?resultid=21405090 | |
ID: 52708 | Rating: 0 | rate: / Reply Quote | |
I am receiving non-Toni test tasks today for my Linux host. Looks like normal project work. Yes, 'Application version: New version of ACEMD v2.06 (cuda100)' is the new normal. | |
ID: 52712 | Rating: 0 | rate: / Reply Quote | |
Being in check-in mode for months has got me so confused. I thought Toni asked not to run acemd3 on Linux as that's not what she needs to test. Or are we now good to go on Linux WUs??? | |
ID: 52714 | Rating: 0 | rate: / Reply Quote | |
I thought Toni asked not to run acemd3 on Linux as that's not what she needs to test. Yes, that is what he said. I am just surprised that they send them to Linux machines at all. Can't they block them? | |
ID: 52715 | Rating: 0 | rate: / Reply Quote | |
I received such tasks too. These are from the short queue. (Which is empty now, though.) I am receiving non-Toni test tasks today for my Linux host. Looks like normal project work. Yes, 'Application version: New version of ACEMD v2.06 (cuda100)' is the new normal. I think Toni put some workunits from the short queue to the "New version of ACEMD" queue from time to time to serve as a bit longer test. | |
ID: 52716 | Rating: 0 | rate: / Reply Quote | |
I've received only 1 since he's said that. If admins only want Windows hosts to receive the tasks then they could always deprecate the Linux app. | |
ID: 52717 | Rating: 0 | rate: / Reply Quote | |
I received such tasks too. These are from the short queue. (Which is empty now, though.) Agreed. My Windows hosts do not process from the short queue, only from the long queue and test queue. I am receiving ADRIA short work units from the test queue. This would seem to indicate ADRIA is becoming familiar with creating ACEMD3 work units. We are getting closer to full release of ACEMD3! | |
ID: 52723 | Rating: 0 | rate: / Reply Quote | |