GCHP run crashes almost immediately in MAPL_CapGridComp.F90 #443

Closed
InterstellarPenguin opened this issue Sep 23, 2024 · 7 comments
Labels: category: Debug Help (Request for help debugging GCHP); topic: Runtime (Related to runtime issues, e.g. simulation stops with error)

Comments

@InterstellarPenguin

Your name

Linyang Guo

Your affiliation

UCAS

What happened? What did you expect to happen?

Hi, everyone!
I ran into an issue while trying to run a carbon simulation.

What are the steps to reproduce the bug?

I tried to run a 10-year GCHP simulation on 864 cores across 27 nodes, and it crashed almost immediately. Here is the relevant part of output.log:

Please attach any relevant configuration and log files.

image image
I've seen a similar issue posted as #8 on GitHub, but I still have no idea how to fix it. My environment file is attached below: [GCHP.intel23.txt](https://github.com/user-attachments/files/17097294/GCHP.intel23.txt)

What GCHP version were you using?

14.4.3

What environment were you running GCHP on?

Local cluster

What compiler and version were you using?

ifort 2021.3.0

What MPI library and version were you using?

Intel MPI 2021.3.0

Will you be addressing this bug yourself?

Yes

Additional information

No response

@InterstellarPenguin InterstellarPenguin added the category: Bug Something isn't working label Sep 23, 2024
@lizziel
Contributor

lizziel commented Sep 23, 2024

Hi, thanks for reaching out about this. I see the same heartbeat message printed multiple times, but there should be only one. This makes me think there is a problem with MPI, possibly related to ESMF. The MAPL code that the error message points to supports this:

    call ESMF_VMGet(cap%vm, petcount=npes, mpicommunicator=comm, rc=status)
    _VERIFY(status)
    _ASSERT(CoresPerNode <= npes, 'something impossible happened')

Could you check that your ESMF build did not include 'mpiuni'? Also please post your GCHP log file here.
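One way to run that check is to search the files of the ESMF installation for the recorded `ESMF_COMM` value. The sketch below is only an illustration: the helper names `esmf_comm_values` and `uses_mpiuni` are made up for this example, and it assumes the value appears either as a compile flag (`-DESMF_COMM=...`) or as an `ESMF_COMM: ...` line in a file such as `esmf.mk`. The exact file and its location depend on how ESMF was installed on your system.

```shell
# Print every distinct ESMF_COMM value recorded in the given file.
# Assumed formats: "-DESMF_COMM=<value>" or "ESMF_COMM: <value>".
esmf_comm_values() {
    grep -oE 'ESMF_COMM[=:][[:space:]]*[A-Za-z0-9_]+' "$1" \
        | sed -E 's/^ESMF_COMM[=:][[:space:]]*//' \
        | sort -u
}

# Succeed (exit status 0) if the file records the mpiuni bypass library.
uses_mpiuni() {
    esmf_comm_values "$1" | grep -qx 'mpiuni'
}
```

For example, `uses_mpiuni /path/to/esmf.mk && echo "ESMF was built with mpiuni"` (with the placeholder path replaced by the real one) would flag a build that bypasses MPI.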

@lizziel lizziel self-assigned this Sep 23, 2024
@lizziel lizziel added category: Debug Help Request for help debugging GCHP topic: Runtime Related to runtime issues (e.g. simulation stops with error) and removed category: Bug Something isn't working labels Sep 23, 2024
@InterstellarPenguin
Author

InterstellarPenguin commented Sep 24, 2024

@lizziel Thanks for your reply. I checked the ESMF root directory and found the setting '-DESMF_COMM=mpiuni'.
image
Does this mean I should rebuild ESMF? Also, as the screenshot shows, I don't have permission to edit the files under /public/software/.../esmf/. Would it work to reinstall ESMF in my own directory? Here are my files related to this issue.
GCHP.log

@InterstellarPenguin
Author

I've solved this issue by rewriting the env files and updating the ESMF module. Thanks again @lizziel

@lizziel
Contributor

lizziel commented Sep 24, 2024

Great, I am glad that worked. Setting ESMF_COMM to mpiuni prior to building ESMF makes ESMF bypass MPI, so every process runs its own independent copy instead of cooperating in one parallel job. This is why there were duplicate messages in your log. From the ESMF docs:

Alternatively, ESMF comes with a single-processor MPI-bypass library which is the default for Linux and Darwin systems. To force the use of this bypass library set ESMF_COMM equal to "mpiuni".
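For anyone who needs to rebuild ESMF in a user-writable directory, here is a rough sketch of the settings involved, assuming the Intel compiler and Intel MPI reported in this issue. `ESMF_DIR`, `ESMF_INSTALL_PREFIX`, `ESMF_COMPILER` and `ESMF_COMM` are ESMF build variables; the function names `configure_esmf_env` and `build_esmf`, the paths passed in, and the `-j 8` job count are placeholders chosen for this example, not the steps actually used to resolve this issue.

```shell
# Export the ESMF build settings for an Intel compiler + Intel MPI build.
# $1: ESMF source tree, $2: user-writable install prefix (hypothetical paths).
configure_esmf_env() {
    export ESMF_DIR="$1"
    export ESMF_INSTALL_PREFIX="$2"
    export ESMF_COMPILER=intel     # ifort, as reported in this issue
    export ESMF_COMM=intelmpi      # a real MPI library, not the mpiuni bypass
}

# Build and install ESMF using the settings exported above.
build_esmf() {
    cd "$ESMF_DIR" || return 1
    make info | grep ESMF_COMM     # confirm the setting before a long build
    make -j 8 && make install
}
```

Usage would look like `configure_esmf_env "$HOME/src/esmf" "$HOME/opt/esmf"` followed by `build_esmf`. After that, the GCHP environment file would need to point at the new installation (as far as I know through `ESMF_ROOT`), and GCHP would need to be rebuilt against it.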

@Xinying331

Xinying331 commented Feb 28, 2025

Hi,

Could you please share the steps you took to update the ESMF module and modify the environment files? I encountered the exact same issue and have already recompiled the ESMF package, ensuring that all ESMF_COMM settings point to OpenMPI. However, I'm still receiving the same error message. I'm attaching my compile and install log files to this comment. I would greatly appreciate any guidance you can provide! @lizziel @InterstellarPenguin @LiamBindle

Image

compile.log

install.log

Thanks,
Xinying

@lizziel
Contributor

lizziel commented Feb 28, 2025

Hi @Xinying331, if ESMF_COMM is set to openmpi then I think you must be encountering a different problem. Please open a new GitHub issue for this. Thanks!

@Xinying331

Got it. Thank you Lizziel! @lizziel


3 participants