[bug]: LND falls out of sync when Bitcoin Core's IP address changes #9353

Open
kallerosenbaum opened this issue Dec 12, 2024 · 6 comments
Labels
bug Unintended code behaviour P2 should be fixed if one has time
@kallerosenbaum

kallerosenbaum commented Dec 12, 2024

Background

We run two LND nodes in Kubernetes, and after restarting the backing Bitcoin Core node, we notice that LND falls out of sync with the blockchain.

This happens because, in our Kubernetes environment, the IP address of Bitcoin Core changes when it is restarted. synced_to_chain then becomes false and no new blocks are received.

Your environment

  • version of lnd: v0.18.2-beta
  • which operating system (uname -a on *Nix):
    Linux lnd-routing-0 6.8.0-1018-aws #19~22.04.1-Ubuntu SMP Wed Oct 9 17:10:38 UTC 2024 aarch64 Linux
    and Linux 9db991b293cb 6.1.0-26-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 Linux
  • version of btcd, bitcoind, or other backend: Bitcoin Core 27.0
  • any other relevant environment details: We run our stack in Kubernetes

Steps to reproduce

I'll show how I reproduce it in regtest, but we get the same issue in production (running in Kubernetes) too.

  • We run LND with the following config in docker-compose:
        --listen=0.0.0.0:9735
        --externalip=lnd-0
        --rpclisten=0.0.0.0:10009
        --bitcoin.active
        --bitcoin.node=bitcoind
        --bitcoin.regtest
        --bitcoind.rpcuser=test
        --bitcoind.rpcpass=password
        --bitcoind.rpchost=bitcoin:18443
        --bitcoind.zmqpubrawblock=tcp://bitcoin:18501
        --bitcoind.zmqpubrawtx=tcp://bitcoin:18502
        --norest
        --protocol.wumbo-channels

When running this, bitcoin resolves to 172.18.0.2.

  • Build some blocks and make sure LND is in sync by running lncli -network=regtest getinfo and checking that synced_to_chain is true.
  • Stop Bitcoin Core and restart it, but this time make sure it gets a new IP address, so that from now on bitcoin resolves to e.g. 172.18.0.6.
  • Build a block.
  • Run lncli -network=regtest getinfo. synced_to_chain will be false, but block_height and block_hash will reflect the most recent block.

After this, LND will not receive any new blocks, but it has apparently reconnected (presumably through RPC) to get the latest block hash. My guess is that ZMQ stops working due to the IP address change.
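The check in the last reproduction step can be scripted. The following is a minimal Go sketch, not part of lnd, that decodes the three getinfo fields named above from the JSON printed by lncli. The type chainInfo, the function parseChainInfo, and the sample values are hypothetical and exist only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chainInfo holds only the getinfo fields discussed in this report.
// Any other fields in the JSON are ignored by the decoder.
type chainInfo struct {
	SyncedToChain bool   `json:"synced_to_chain"`
	BlockHeight   uint32 `json:"block_height"`
	BlockHash     string `json:"block_hash"`
}

// parseChainInfo decodes the JSON output of `lncli getinfo`.
func parseChainInfo(raw []byte) (chainInfo, error) {
	var info chainInfo
	err := json.Unmarshal(raw, &info)
	return info, err
}

func main() {
	// Made-up sample mimicking the symptom: the tip is known,
	// but the node reports that it is not synced.
	sample := []byte(`{"synced_to_chain": false, "block_height": 102, "block_hash": "00ab"}`)

	info, err := parseChainInfo(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	if !info.SyncedToChain {
		fmt.Printf("not synced at height %d (%s)\n", info.BlockHeight, info.BlockHash)
	}
}
```

In practice the input would come from piping the lncli output into such a program rather than from a hard-coded sample.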

Expected behaviour

After reconnecting to the node it should eventually show "synced_to_chain": true. Alternatively (if it's a ZMQ connection issue) I'd expect LND to scream pretty loudly in the log.

Actual behaviour

"synced_to_chain": false indefinitely and we see no new logs of type

[INF] NTFN: New block: height=873198, sha=000000000000000000007b48042479e4f07ce2d6ae9a79c2a3ef5223dc78dd5c
@kallerosenbaum kallerosenbaum added bug Unintended code behaviour needs triage labels Dec 12, 2024
@Roasbeef
Member

Roasbeef commented Dec 12, 2024

Are you running with the health check system on? It's meant to catch failures like this, then cause a restart of lnd. It seems like you expect that lnd will resolve the bitcoind host again automatically, but atm we do the resolution once, then use the IP from there on.

Here're the health check params I'm referring to:

; The number of times we should attempt to query our chain backend before
; gracefully shutting down. Set this value to 0 to disable this health check.
; healthcheck.chainbackend.attempts=3

; The amount of time we allow a call to our chain backend to take before we fail
; the attempt. This value must be >= 1s.
; healthcheck.chainbackend.timeout=30s

; The amount of time we should backoff between failed attempts to query chain
; backend. This value must be >= 1s.
; healthcheck.chainbackend.backoff=2m

; The amount of time we should wait between chain backend health checks. This
; value must be >= 1m.
; healthcheck.chainbackend.interval=1m

@kallerosenbaum
Author

@Roasbeef yes, it's on, and in production we've set

--healthcheck.chainbackend.attempts=30

And we see the following from healthcheck after restart:


2024-12-04 09:55:59.568 [INF] HLCK: Health check: chain backend, call: 1 failed with: invalid http POST response (nil), method: uptime, id: 1215, last error=Post "http://bitcoin-0.bitcoin.crypto.svc.cluster.local:8332": dial tcp: lookup bitcoin-0.bitcoin.crypto.svc.cluster.local on 169.254.20.10:53: no such host, backing off for: 2m0s
2024-12-04 09:58:22.107 [INF] HLCK: Health check: chain backend, call: 2 failed with: invalid http POST response (nil), method: uptime, id: 1216, last error=Post "http://bitcoin-0.bitcoin.crypto.svc.cluster.local:8332": dial tcp: lookup bitcoin-0.bitcoin.crypto.svc.cluster.local on 169.254.20.10:53: no such host, backing off for: 2m0s
2024-12-04 10:00:44.648 [INF] HLCK: Health check: chain backend, call: 3 failed with: invalid http POST response (nil), method: uptime, id: 1217, last error=Post "http://bitcoin-0.bitcoin.crypto.svc.cluster.local:8332": dial tcp: lookup bitcoin-0.bitcoin.crypto.svc.cluster.local on 169.254.20.10:53: no such host, backing off for: 2m0s

Then it succeeds in connecting to the RPC port (in spite of the IP address change). So at least RPC can handle an IP address change. My guess is that it's the ZMQ connection that stops working, and the health check doesn't verify that connection. So the health check doesn't help here.

@saubyk saubyk added this to the 0.20.0 milestone Dec 19, 2024
@saubyk saubyk added P1 MUST be fixed or reviewed P2 should be fixed if one has time and removed needs triage P1 MUST be fixed or reviewed labels Dec 19, 2024
@Dominion5254

I'm running into something similar but it seems LND is not able to recover and connect to the new container IP of bitcoind. This occurred after updating Bitcoin Core from 28.0 -> 28.1 resulting in a new container IP for Bitcoin Core. LND logs show the failed Health Check, but it never seems to recover.

2025-03-17T14:43:06-06:00  2025-03-17 20:43:06.807 [CRT] SRVR: Health check: chain backend failed after 5 calls
2025-03-17T14:43:06-06:00  2025-03-17 20:43:06.807 [INF] SRVR: Sending request for shutdown
2025-03-17T14:43:06-06:00  2025-03-17 20:43:06.807 [INF] LTND: Received shutdown request.
2025-03-17T14:43:06-06:00  2025-03-17 20:43:06.808 [INF] LTND: Shutting down...
2025-03-17T14:43:06-06:00  2025-03-17 20:43:06.808 [INF] LTND: Gracefully shutting down.
2025-03-17T14:43:06-06:00  2025-03-17 20:43:06.808 [INF] NANN: Channel Status Manager shutting down...
2025-03-17T14:43:06-06:00  2025-03-17 20:43:06.818 [INF] HSWC: HTLC Switch shutting down...
2025-03-17T14:43:06-06:00  2025-03-17 20:43:06.828 [INF] NTFN: Cancelling epoch notification, epoch_id=6
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.530 [INF] HSWC: Onion processor shutting down...
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.530 [INF] HSWC: Decaying hash log received shutdown request
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.530 [INF] NTFN: Cancelling epoch notification, epoch_id=11
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.530 [INF] INVC: InvoiceRegistry shutting down...
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.530 [INF] INVC: InvoiceRegistry shutting down...
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.530 [INF] NTFN: Cancelling epoch notification, epoch_id=10
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.531 [INF] HSWC: InterceptableSwitch shutting down...
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.531 [INF] NTFN: Cancelling epoch notification, epoch_id=7
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.531 [INF] CRTR: Channel Router shutting down...
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.531 [INF] CNCT: ChainArbitrator shutting down...
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.531 [INF] NTFN: Cancelling epoch notification, epoch_id=8
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.531 [INF] FNDG: Funding manager shutting down...
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.531 [INF] BRAR: Breach arbiter shutting down...
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.531 [INF] UTXN: UTXO nursery shutting down...
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.531 [INF] NTFN: Cancelling epoch notification, epoch_id=5
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.531 [INF] DISC: Authenticated gossiper shutting down...
2025-03-17T14:43:35-06:00  2025-03-17 20:43:35.531 [INF] NTFN: Cancelling epoch notification, epoch_id=9
2025-03-17T14:43:50-06:00  2025-03-17 20:43:50.071 [ERR] RPCS: [/lnrpc.Lightning/GetInfo]: unable to get best block info: invalid http POST response (nil), method: getblockchaininfo, id: 811, last error=Post "http://bitcoind.embassy:8332": dial tcp 172.18.0.68:8332: connect: no route to host

Bitcoin Core's RPC is accessible at bitcoind.embassy:8332, but LND is not able to connect to the host until LND is restarted. It would be desirable for LND to be resilient to changes in Bitcoin Core's container IP instead of requiring the user to restart LND themselves.

Env:
Bitcoin Core 28.0 -> 28.1
LND 0.18.5
StartOS 0.3.5.1

@guggero
Collaborator

guggero commented Mar 18, 2025

@Dominion5254 how does Start9 configure the bitcoind host in the lnd settings? Does it give a container host name or does it resolve an IP and use that directly?
If it's the former, then it would be something lnd has to fix. If it's the latter then Start9 would need to fix that.

@Dominion5254

It is the former, bitcoind has a static hostname which LND uses.

@guggero
Collaborator

guggero commented Mar 20, 2025

Hmm, okay. Then I guess we need to make sure we re-resolve the IP address when reconnecting.
