
geth consumes all ram; drops blocks, peers #20963

Closed
nysxah opened this issue Apr 23, 2020 · 19 comments


nysxah commented Apr 23, 2020

Hi, we are running several Geth nodes.

Every week, at least one node that is synced to the chain tip 'loses' 4-10k blocks and begins re-syncing them. At the same time, all peers are dropped/disconnected.

We are seeing OOM errors around the time this happens.

RAM usage creeps up to 100%, then the blocks & peers are dropped.

We upgraded one node from 8GB to 16GB of RAM, and it slowly consumed the additional memory until the issue happened again.

What could this be related to, where should we look, and which flags could we modify to potentially resolve this issue?

System information

Geth version: 1.9.12-stable
OS & Version: Ubuntu 16.04.6 LTS

Expected behaviour

Node stays in sync with the network.

Actual behaviour

Geth eats up all available RAM (8-16 GB), drops 4-10k blocks, and begins re-syncing them.

Screenshot 2020-04-22 23 50 16

also drops peers

Screenshot 2020-04-23 00 13 26

Steps to reproduce the behaviour

Unclear; the node is running and answering RPC requests via WebSockets. Sometimes the issue coincides with a sudden influx of requests to the node, but not always.

@karalabe
Member

Could you please provide the command you use to run Geth?

The reason for the resync is that the recent state is kept in memory for garbage collection. This, however, means that a crash loses all of that in-memory state, so when you restart, you need to reprocess the lost blocks.

@karalabe
Member

Could you also provide a memory chart? It would be nice to see the consumption over time.


cp0k commented Apr 23, 2020

The command we are using to run Geth:

/usr/bin/geth --rpcapi eth,web3,debug,txpool,net,shh,db,admin,debug --rpc --ws --wsapi eth,web3,debug,txpool,net,shh,db,admin,debug --wsorigins localhost --gcmode full --rpcport=8547 --maxpeers 250

Memory / system charts for the node in question:

Screen Shot 2020-04-23 at 5 02 08 PM

FYI, we are observing this exact same behavior on a node that is 100% idle, with absolutely zero requests thrown at it.

In case you are wondering why the memory chart has a bunch of sudden drops: we also have a bash script that checks total memory in use, and if it exceeds 80%, Geth is restarted. This was put in place as a band-aid until we get to the root cause of the issue.
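A minimal sketch of such a watchdog, assuming Geth runs as a systemd service named "geth" (the threshold, service name, and log path below are placeholders):

#!/usr/bin/env bash
# Restart geth when overall memory usage crosses a threshold (placeholder: 80%).
THRESHOLD=80

# Percentage of total memory currently in use, taken from `free` (used/total).
USED=$(free | awk '/^Mem:/ {printf "%d", $3/$2*100}')

if [ "$USED" -ge "$THRESHOLD" ]; then
    echo "$(date): memory at ${USED}%, restarting geth" >> /var/log/geth-watchdog.log
    systemctl restart geth
fi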

@karalabe
Member

Which version of Go did you build it with? 1.14.0 and 1.14.1 had a GC bug (golang/go#37525) that caused Geth to explode in memory use. It was fixed in 1.14.2.

@karalabe
Member

Another thing that could help: when your node enters this strange high-memory state, dangerously close to being killed, please run debug.stacks() from a Geth console. That will create a dump of all the running goroutines. If you share that with us, we can check whether there is some leak that results in the memory accumulation.
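For reference, the dump can be taken by attaching a console to the running node, roughly like this (the IPC path below is the default one and may need adjusting to your --datadir):

# attach a console to the running node
geth attach ~/.ethereum/geth.ipc

# then, at the console prompt:
> debug.stacks()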


cp0k commented Apr 24, 2020

Which version of Go did you build it with? 1.14.0 and 1.14.1 had a GC bug (golang/go#37525) that caused Geth to explode in memory use. It was fixed in 1.14.2.

Looks like we are running go 1.13.8:

# geth version
Geth
Version: 1.9.12-stable
Git Commit: b6f1c8dcc058a936955eb8e5766e2962218924bc
Git Commit Date: 20200316
Architecture: amd64
Protocol Versions: [65 64 63]
Go Version: go1.13.8
Operating System: linux
GOPATH=
GOROOT=/home/travis/.gimme/versions/go1.13.8.linux.amd64

Another thing that could help: when your node enters this strange high-memory state, dangerously close to being killed, please run debug.stacks() from a Geth console. That will create a dump of all the running goroutines. If you share that with us, we can check whether there is some leak that results in the memory accumulation.

Thank you for the tip! I'll definitely get back to you with the debug.stacks() output as soon as possible.

@drhashes

Adding --cache 2048 or --cache 1024 to your command line will reduce RAM consumption.
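For example, the command posted earlier in this thread would look roughly like this with --cache 2048 appended (a sketch, not a tested configuration):

/usr/bin/geth --rpcapi eth,web3,debug,txpool,net,shh,db,admin --rpc --ws \
  --wsapi eth,web3,debug,txpool,net,shh,db,admin --wsorigins localhost \
  --gcmode full --rpcport=8547 --maxpeers 250 --cache 2048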

@mtbitcoin

We are running into similar memory-leak issues, first with 1.9.12, so we upgraded to 1.9.13, but we encountered the same issues on multiple production servers. No other changes were made.

karalabe (Member) commented May 6, 2020

@mtbitcoin What flags are you running with?

@mtbitcoin

@karalabe I am running further tests, but it appears it might be related to someone intentionally DoSing the nodes by running eth_call or gas estimation requests; applying --rpc.gascap appears to have helped.
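For reference, the cap is applied as a command-line flag, roughly like this (a sketch; the gas value here is only illustrative, and <existing flags> stands for whatever flags the node already uses):

# limit the gas available to eth_call / eth_estimateGas served over RPC
geth <existing flags> --rpc.gascap 25000000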


cp0k commented May 6, 2020

Another thing that could help: when your node enters this strange high-memory state, dangerously close to being killed, please run debug.stacks() from a Geth console. That will create a dump of all the running goroutines. If you share that with us, we can check whether there is some leak that results in the memory accumulation.

As requested - https://pastebin.com/1mGvk4SQ

@karalabe I am running further tests, but it appears it might be related to someone intentionally DoSing the nodes by running eth_call or gas estimation requests; applying --rpc.gascap appears to have helped.

Thanks for the heads up! Will definitely give it a try.

Adding --cache 2048 or --cache 1024 to your command line will reduce RAM consumption.

Thanks! I'll try that as well :)


cp0k commented May 26, 2020

Another thing that could help: when your node enters this strange high-memory state, dangerously close to being killed, please run debug.stacks() from a Geth console. That will create a dump of all the running goroutines. If you share that with us, we can check whether there is some leak that results in the memory accumulation.

Péter, I replied with the requested information a couple of weeks ago; can you please review it and let us know if anything stands out?

https://pastebin.com/1mGvk4SQ

My organization and I were hoping this would be fixed in Geth 1.9.14-stable, but unfortunately we are seeing the same issue.

Please let me know if you require any additional information from our end. Thanks!

nysxah (Author) commented Jun 8, 2020

@karalabe @drhashes Adding the --cache flag has not helped. The interim workaround is to restart Geth once memory reaches a high threshold. How else can we help you debug this?

holiman (Contributor) commented Aug 6, 2020

We've been looking into this today and can't find any obvious culprit. Does this issue still appear with the most recent version? Also, if it does, a new stack trace would be great, since the code lines have changed.

karalabe (Member) commented Aug 6, 2020

Please check with the latest Geth and the latest Go. What would really help is to try to minimize the moving components. Let's try a 16GB machine, idling without RPC calls, just syncing with the network. That is what we're running all the time, and it should not go OOM.

If that works stably, let's add RPC into the mix. It would really help if you could tell us what API calls you are making. There are very easy ways to make a node go boom with the "correct" RPC requests.
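A stripped-down invocation along those lines might look roughly like this (a sketch; the data directory path is a placeholder, and no HTTP/WS endpoints are enabled, so no external calls can reach the node):

# sync-only node: no --rpc / --ws flags, so nothing can query it externally
geth --syncmode fast --cache 4096 --datadir /data/ethereum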


mtbitcoin commented Aug 6, 2020

This is no longer an issue for us (archive and default sync nodes) with the latest version.

Edit: the gas cap helped on our end (from what we could see, someone figured out they could DoS the nodes and was sending in calls with high gas limits).

@YpsilonOmega

I've got the same issue with the following set-up:

Geth version: 1.9.25-stable
OS & Version: Ubuntu 20.10
Go Version: go1.15.6
Hardware: Raspberry Pi 4, 8GB

Geth eats up all my RAM after a long time running.
+600MB after 20min
+5GB after 10 days (actually more due to Swap)

I started with the line:
sudo geth --syncmode fast --cache 2048 --datadir /mnt/ssd/ethereum

Afterwards I tried the solution presented by @mtbitcoin:
sudo geth --syncmode fast --cache 2048 --rpc.gascap 500000 --datadir /mnt/ssd/ethereum

However, a gas cap of 500000 seemed to be too high, so I changed it to 300000.
Filling up the RAM was much slower this time; however, it quickly consumed more than 200MB extra within a few minutes after reaching the configured cache of 2048MB.

@mtbitcoin
How much gas did you use for the gas cap, and is this a long-term fix?


noahh40 commented Aug 24, 2021


Also interested in @mtbitcoin's reply to the above question.

@shreethejaBandit


Should there be a gas limit? Interested in finding out more about this issue.
