GCHP full chemistry simulation #355
Hi @ast222340, this looks like a compiler problem. The error is occurring in one of the third-party libraries from NASA GMAO. Try using a newer version of Intel. I think 2018 may be too old.
Thank you @lizziel. I am now using the 2019 version, but again some errors come up. Please help me; I don't understand them.
Hi @ast222340, the error
Thank you @lizziel. You are helping me a lot. I tried, but the same error keeps coming. I have attached the GCHP.log file and the setCommonRunSettings file.
Your log file indicates you set GCHP to run with six cores:
However, it also indicates you only have one core available:
This indicates that when you run the GCHP program you are only assigning one core. What command are you using to run GCHP?
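For reference, a minimal sketch of a multi-core launch; the launcher name and options depend on the local MPI installation, and 6 is the GCHP minimum core count noted later in this thread:
```
mpirun -np 6 ./gchp > gchp.log 2>&1
```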
Thank you @lizziel. Our HPC team says all libraries are set up. I go to the run directory and type "gchp".
If you execute GCHP by simply typing
Thank you @lizziel. Happy New Year.
Hi @ast222340, Happy New Year to you as well! The log shows there is a new error, different from the one before, which is a good thing. There is also a traceback of the error that shows where in the code there is a problem. What model version are you using? The next step is to look at file
Thank you @lizziel, the model version is GCHP 14.2.3. Running with the commands 'mpirun -np 1 ./gchp' and 'mpirun -np 2 ./gchp'.
Hi @ast222340, you cannot run GCHP with fewer than 6 cores. If you do
Also see the GCHP 14.2.3 User Guide on running GCHP. The guide specifies 6 cores as the minimum and recommends using script
Hi @ast222340, I wonder if you are running out of memory. Also, it looks like the model is running multiple times at once given your log file. Are you running using the gchp.local.run script? How much memory do you have available? You can also try running with more than 6 cores. If you are running on a compute cluster we suggest running your job as a batch job, submitted to a scheduler with a specification of the number of nodes, cores, and memory. Your system administrators should be able to help you learn how to do that.
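As a loose sketch only, assuming the PBS scheduler used later in this thread; resource values and the project flag are placeholders to adapt, and in practice the launch should go through one of the GCHP sample run scripts rather than a bare mpirun:
```
#!/bin/bash
#PBS -N gchp
#PBS -l select=1:ncpus=24:mpiprocs=24:mem=100gb
#PBS -l walltime=12:00:00
cd $PBS_O_WORKDIR
mpirun -np 24 ./gchp > gchp.log 2>&1
```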
Thank you @lizziel. I tried to run on the HPC with 400 GB of memory (interactive node). Even taking more than 6 cores shows the same error (please check the gchp.log file). qsub -I -lselect=1:ncpus=24:mpiprocs=24:mem=400gb -lwalltime=01:00:00 -P geoschem.spons
I followed the error traceback to the line of code where it is failing. The traceback is this:
The first line is the code that I found, which is here.
This is a call to the ESMF library. What version are you using? Do you have file ESMF.rc in your run directory? If yes, open it and set the logKindFlag parameter to ESMF_LOGKIND_MULTI_ON_ERROR and run again.
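For illustration, the relevant ESMF.rc line would look like this (key and value names as given in the comment above; the rest of the file is left unchanged):
```
logKindFlag: ESMF_LOGKIND_MULTI_ON_ERROR
```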
Thank you @lizziel. The model version is GCHP 14.2.3, the ESMF version is 8.4.2, and the ESMF.rc file is present in the run directory.
20240114 003137.511 ERROR PET0 ESMCI_DistGrid_F.C:481 c_esmc_distgridget() Argument sizes do not match - 2nd dim of minIndexPDimPDe array must be of size 'deCount'
@tclune, have you ever seen this ESMF error before? It is occurring during this call in
@lizziel No - does not ring a bell. And we use DistGrid very rarely in MAPL, so have relatively little experience. I'm sure the error message is correct - some incorrect assumption about rank/sizes going on. Including @atrayano and @nasa-ben in case they have further insights. Do you have any other context you can provide? And/or what was different about this run? Stack trace?
As far as I know only @ast222340 has run into this problem. The configuration files look normal and the resource configuration is simple (1 node, 24 cores). The error itself is strange given the array with the size mismatch is allocated directly before the call. @ast222340, you could try going into this code and printing out deCount to see what is printed for each core. @tclune, here is the stack trace. Is this getting dimensions for the grid per core, and so would possibly have different deCount per core?
MAPL_Generic.F90 line 10225:
GCHP_GridCompMod.F90 line 306:
MAPL_CapGridComp.F90 line 655:
MAPL_CapGridComp.F90 line 959:
MAPL_Cap.F90 line 311:
MAPL_Cap.F90 line 258:
MAPL_Cap.F90 line 192:
MAPL_Cap.F90 line 169:
Thank you for the additional info. Hmm. In general, MAPL expects exactly 1 DE per PET (one local domain per process). There are some exceptions in History and ExtData to deal with coarse grids that end up with missing DEs on some PETs, but that should not be the case here. Hoping @nasa-ben or @atrayano have advice.
Thank you @lizziel @tclune. A problem has arisen. After seeing your discussion, I went to check the error from the beginning:
NOT using buffer I/O for file: cap_restart
Then I looked at file MAPL_CapGridComp.F90 within the MAPL package in src/MAPL. This will indicate the model is failing. Then came this error. Please check the conversation from 5 days back; I just copied your conversation below.
pe=00014 FAIL at line=00287 MaplGrid.F90 <status=508>
The first line is the code that I found, which is here.
This is a call to the ESMF library. What version are you using? Do you have file ESMF.rc in your run directory? If yes, open it and set the logKindFlag parameter to ESMF_LOGKIND_MULTI_ON_ERROR and run again. You should then get an ESMF error log file with more information. There will be one log file per core. The traceback indicates which core the error is on (PE #), e.g. the above traceback shows it is core 14.
Hi @ast222340, I am not quite following what you mean. Is your problem the ESMF_DistGridGet call, or is there a different
Thank you for the quick reply @lizziel. Yes, my problem is the ESMF_DistGridGet call part, that error, the crash code. Full details are in the GCHP.log file (previously uploaded).
@lizziel, if you're not busy, please take a look; I haven't solved this problem yet. I don't understand it, and I need to get it right.
@ast222340, I'm afraid it is difficult to figure out the issue here without being able to reproduce it. As I suggested earlier, you can add print statements to the location in the traceback where the error is happening and try to figure out what the issue is there. @tclune said that there should be 1 DE per PET, meaning variable deCount should be one when printed from each core.
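As a rough, hypothetical illustration of that suggestion (not the actual MAPL source; the insertion point is around line 287 of MaplGrid.F90 per the traceback above):
```fortran
! Hypothetical debug print to add just before the failing ESMF_DistGridGet call;
! per the note above, deCount should print as 1 on every core (PET).
write(*,*) 'DEBUG deCount on this PET = ', deCount
```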
Thank you @lizziel. I don't have any idea about this error. Mainly the error points to file "MAPL_CapGridComp.F90" at line 324.
Previously you said your error was in
@lizziel, I have deleted everything and configured the model again for GCHP 14.2.3. Now I am facing the above-mentioned problem.
If you follow the traceback you will see this error in the code:
This indicates the cores per node is greater than the number of cores available. You ran into this same issue earlier in this thread and I believe you had resolved it.
Thank you @lizziel. The error had not been solved before. Please tell me how to solve it; this is my current situation.
Aha, I figured out what the issue is. I searched for "something impossible happened" on the GCHP GitHub issues page and found this very old issue with the same error: #8. Apologies that I forgot about this possibility. You (or your system administrators) built ESMF with setting
Tagging @tclune that this is resolved.
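For context, and as an assumption not stated explicitly above: the MPI layer of an ESMF build is normally selected at build time with the ESMF_COMM environment variable, so the eventual fix of rebuilding ESMF against OpenMPI would look roughly like this:
```
# Assumed example: select a real MPI implementation before (re)building ESMF.
# A serial stub build (ESMF_COMM=mpiuni) would fail in exactly this way when
# launched on multiple cores.
export ESMF_COMM=openmpi
```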
Whew. Was afraid I'd have to dig into this one ...
@lizziel we hope to add either build time logic in CMake or a clearer runtime message in MAPL to minimize confusion the next time this happens. (Basically ensuring via Murphy that it will never happen again.)
@tclune Thank you! @ast222340 Please confirm this works.
Thank you @lizziel @tclune. I apologize for the late reply; our system admin was recompiling ESMF with OpenMPI. I have now run it for two days, and the run gets an error at the end. Two output files are also created in the "OutputDir".
First guess is that the checkpoint file already exists. GEOS separately moves files out of the way, so default MAPL settings crash if the file already exists. GCHP is different if I recall, so we added an option to overwrite an existing file. But I do not immediately remember how that option is activated. If @lizziel cannot provide those details, I'll look again in the morning. (Am in a meeting, but have enough spare cycles to type this message.)
@tclune is correct; some time ago (on 2021-01-08, for MAPL 2.5.0) we changed the default to be "clobber" for the netCDF formatter in PFIO.
GCHP will crash if
@ast222340, make sure you are using a GCHP run script from the
@atrayano, we are using MAPL 2.26.0 but the default is still
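If launching the executable by hand instead of through a sample run script, a minimal workaround consistent with the advice above (assuming the checkpoint filename mentioned later in this thread) is to delete the stale checkpoint before relaunching:
```
rm -f gcchem_internal_checkpoint
```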
My memory is that we only added a switch and preserved the original GEOS behavior as the default. The override would be at a much higher level in MAPL.
Thank you @lizziel @tclune @atrayano. When I ran it for one day, cap_restart was updated. Then I ran it for two days; the error shown above comes at the end, cap_restart is not updated, and no updated restart file is created, because I didn't remove the 'gcchem_internal_checkpoint' file that was created by the one-day run. Please clear up this part.
2. In HEMCO_Config.rc
3. In HISTORY.rc
# MODEL PHASE
#------------------------------------------------
# FORWARD for forward model, ADJOINT for adjoint model
Model_Phase=FORWARD
The study domain of my research is from 2016. Here the restart file of 2019 is provided. What can I do when I go to run from January 2016? Thank you...
Hi @ast222340, it sounds like you have GCHP successfully running now. As I said before, make sure you use a GCHP run script when submitting to ensure you do not have gcchem_internal_checkpoint for sequential runs, and to make sure you get all of the other functionality in the run script. Regarding your questions:
Regarding restart files, we recommend spinning up a GEOS-Chem restart file by starting a run prior to when your research period begins. You can rename the restart file to use it for any date. However, due to seasonality we recommend only changing the year.
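For example, a hedged sketch of reusing the provided restart file for a January 2016 start (filenames as given in the following comment; copying rather than renaming preserves the original file):
```
cp GEOSChem.Restart.fullchem.20190701_0000z.c48.nc4 \
   GEOSChem.Restart.fullchem.20160101_0000z.c48.nc4
```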
Any additional issues should go into a new GitHub issue.
Thank you @lizziel. I am not opening a new issue because my doubt comes from the above discussion. You are saying that I have a restart file "GEOSChem.Restart.fullchem.20190701_0000z.c48.nc4", and when I start my simulation from 2016 I should change the above file name to "GEOSChem.Restart.fullchem.20160101_0000z.c48.nc4" for January. So is this the right way? Because the original file was created for the 20190701 time, is there any effect on the simulation?
Name and Institution (Required)
Name: Subhadeep Ghosh
Institution: IIT Delhi
Confirm you have reviewed the following documentation
Description of your issue or question
Respected sir/madam,




Even if the configuration is done properly, the simulation is not running and some error is coming. I don't really understand what the errors are trying to say. The CMake and make logs and the run settings files are attached.
CmakeLog.txt
log.make.txt
MakeLog.txt
Here I have attached my setCommonRunSettings and PBS job script. It would be great if you could tell me about my mistake.
CAP.txt
ExtData.txt
GCHP.log
gchp_int.txt
HEMCO_Config.txt
HISTORY.txt
setCommonRunSettings.txt