
etcdctl cluster-health and member list commands do not work correctly #2711

Closed
kelseyhightower opened this issue Apr 19, 2015 · 9 comments · Fixed by #3178

@kelseyhightower
Contributor

etcd server version

/opt/bin/etcd --version
etcd version 2.0.9

etcd client version

/usr/local/bin/etcdctl --version
etcdctl version 2.0.9

Start a 3-node etcd cluster

vmrun list
Total running VMs: 3
/Users/kelseyhightower/Documents/Virtual Machines.localized/core0.vmwarevm/core0.vmx
/Users/kelseyhightower/Documents/Virtual Machines.localized/core1.vmwarevm/core1.vmx
/Users/kelseyhightower/Documents/Virtual Machines.localized/core2.vmwarevm/core2.vmx
etcdctl cluster-health
cluster is healthy
member 5ae3067007f7fb85 is healthy
member 7931e79c0d8b47c5 is healthy
member 987146e8925f10e5 is healthy
etcdctl member list
5ae3067007f7fb85: name=etcd2 peerURLs=http://192.168.12.52:2380 clientURLs=http://192.168.12.52:2379
7931e79c0d8b47c5: name=etcd0 peerURLs=http://192.168.12.50:2380 clientURLs=http://192.168.12.50:2379
987146e8925f10e5: name=etcd1 peerURLs=http://192.168.12.51:2380 clientURLs=http://192.168.12.51:2379

Power off one of the etcd members

vmrun stop /Users/kelseyhightower/Documents/Virtual\ Machines.localized/core0.vmwarevm/core0.vmx
vmrun list
Total running VMs: 2
/Users/kelseyhightower/Documents/Virtual Machines.localized/core1.vmwarevm/core1.vmx
/Users/kelseyhightower/Documents/Virtual Machines.localized/core2.vmwarevm/core2.vmx

The member list command fails

etcdctl -C http://192.168.12.50:2379,http://192.168.12.51:2379,http://192.168.12.52:2379 member list
context deadline exceeded

The cluster is reported as healthy, and no member is marked unhealthy, even though member 7931e79c0d8b47c5 is powered off.

etcdctl -C http://192.168.12.50:2379,http://192.168.12.51:2379,http://192.168.12.52:2379 cluster-health
cluster is healthy
member 5ae3067007f7fb85 is healthy
member 7931e79c0d8b47c5 is healthy
member 987146e8925f10e5 is healthy
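
One way to narrow down which member is actually unreachable (a sketch, not part of the original report, reusing the client URLs listed above) is to query each endpoint separately instead of passing all of them at once:

for ep in http://192.168.12.50:2379 http://192.168.12.51:2379 http://192.168.12.52:2379; do
  echo "== $ep"
  etcdctl -C "$ep" cluster-health || echo "$ep did not respond"
done

The powered-off member's endpoint should time out, while the remaining members still answer.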
@barakmich barakmich modified the milestone: v2.1.0-alpha.1 Apr 24, 2015
@yichengq
Contributor

This is basically the same one as what we met in 2.0.3: #2340

@mariusgrigaitis

Can confirm this.

@barakmich barakmich modified the milestones: v2.2.0, v2.1.0-alpha.1 May 15, 2015
@xiang90 xiang90 added the etcdctl label Jun 6, 2015
@durzo

durzo commented Jun 24, 2015

Running v2.1.0-alpha.1: after shutting down a node, cluster-health still reports that all members are healthy.

2015/06/24 09:29:30 etcdserver: failed to reach the peerURL(http://etcd2:7001) of member 7a9767de17ea4500 (Get http://etcd2:7001/version: net/http: request canceled while waiting for connection)

root@etcd3:~$ etcdctl cluster-health
cluster is healthy
member 7a9767de17ea4500 is healthy
member cb1f485859524c11 is healthy
member d555fc8f72be9146 is healthy

@xiang90
Contributor

xiang90 commented Jun 24, 2015

/cc @yichengq

@yichengq
Contributor

@durzo Did it eventually change to unhealthy? For how long did you see this false healthy status?

So far, we know that the implementation has some delay (around minutes) in reporting the health status of a hard-killed machine, and we plan to improve it in 2.2. The internal detail is that etcd 2.0 sends MsgApp asynchronously over an HTTP stream, which cannot reflect whether the receiving side works.

yichengq added a commit to yichengq/etcd that referenced this issue Jul 24, 2015
This method uses the raft status exposed at /debug/varz to determine the health of the cluster. It uses whether the commit index increases to determine cluster health, and whether the match index increases to determine member health.

This could fix bug etcd-io#2711, where a follower's unhealthiness goes undetected, because it does not rely on whether a message on the long-polling connection was sent.

This health check is stricter than the old one and reflects whether followers are healthy from the leader's point of view. For example, a follower that is receiving a snapshot will show up as unhealthy because it does not move forward.

`etcdctl cluster-health` will reflect the health view at the raft level, while connectivity checks reflect the health view at the transport level.
yichengq added a commit to yichengq/etcd that referenced this issue Jul 28, 2015
yichengq added a commit to yichengq/etcd that referenced this issue Jul 29, 2015
yichengq added a commit to yichengq/etcd that referenced this issue Jul 30, 2015
junxu pushed a commit to junxu/etcd that referenced this issue Aug 7, 2015
yichengq added a commit to yichengq/etcd that referenced this issue Aug 21, 2015
mwitkow pushed a commit to mwitkow/etcd that referenced this issue Sep 29, 2015
@kerk1v

kerk1v commented Feb 6, 2018

Hello,

It seems the behaviour is still the same in etcd v3.2.15, so I see no way for a cluster operator to manually confirm the health of an etcd v3 cluster. Any ideas to help with this, or is there an etcdctl command other than 'etcdctl member list' to check cluster health in etcd3?

Edited:

It seems

ETCDCTL_API=3 etcdctl --cert=/etc/etcd_k8s/etcd.pem --key /etc/etcd_k8s/etcd-key.pem --insecure-skip-tls-verify=true --endpoints=[https://master-1:2379,https://master-2:2370,https://master-3:2379] endpoint health

will check each of the nodes and report

https://master-3:2379 is healthy: successfully committed proposal: took = 3.665702ms
https://master-1:2379 is healthy: successfully committed proposal: took = 3.202865ms
https://master-2:2370 is unhealthy: failed to connect: dial tcp 192.168.33.102:2370: getsockopt: no route to host
Error:  unhealthy cluster

when one of the etcd cluster members is down; however, this requires the operator to have at least some knowledge of the etcd cluster.
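
On newer etcdctl v3 releases (an assumption; the flag may not be available in 3.2.x), the endpoint list can be discovered from the cluster itself with --cluster, so the operator only needs one reachable endpoint, for example:

ETCDCTL_API=3 etcdctl --cert=/etc/etcd_k8s/etcd.pem --key=/etc/etcd_k8s/etcd-key.pem --insecure-skip-tls-verify=true --endpoints=https://master-1:2379 endpoint health --cluster

etcdctl then fetches the member list first and runs the health check against every advertised client URL.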

@dxlr8r

dxlr8r commented Oct 25, 2018

I have the same issue with:

etcdctl version: 3.2.22
API version: 3.2

I get "unhealthy cluster" as well when using "etcdctl member list" even though 2/3 is online.

I did, however, notice that when going from 2/3 to 1/3 there was no leader:

[foo1@bar ~]# etcdctl3 endpoint status
Failed to get the status of endpoint https://foo1:2379 (context deadline exceeded)
Failed to get the status of endpoint https://foo2:2379 (context deadline exceeded)
https://foo3:2379, 6074b97ec42826bg, 3.2.22, 16 MB, false, 760, 20792785

Note the false in the last line.
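
The leader column is easier to read in table form (a sketch reusing the endpoints from this comment; TLS flags are omitted but would be needed on a secured cluster):

ETCDCTL_API=3 etcdctl --endpoints=https://foo1:2379,https://foo2:2379,https://foo3:2379 endpoint status -w table

The IS LEADER column then shows directly whether any of the reachable members currently holds leadership.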

@knraju483

knraju483 commented May 11, 2020

Use the command below for v3.3.xx etcdctl:
etcdctl --endpoints=https://192.168.56.113:2379,https://192.168.56.118:2379,https://192.168.56.119:2379 --key-file="/etc/kubernetes/pki/etcd/client-key.pem" --cert-file="/etc/kubernetes/pki/etcd/client.pem" --ca-file="/etc/kubernetes/pki/etcd/ca.pem" member list -w table

Use the command below for v3.4.7 etcdctl:

etcdctl --endpoints=https://192.168.56.113:2379,https://192.168.56.118:2379,https://192.168.56.119:2379 --key="/etc/kubernetes/pki/etcd/client-key.pem" --cert="/etc/kubernetes/pki/etcd/client.pem" --cacert="/etc/kubernetes/pki/etcd/ca.pem" member list -w table
+------------------+---------+----------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  |   NAME   |          PEER ADDRS         |         CLIENT ADDRS        | IS LEARNER |
+------------------+---------+----------+-----------------------------+-----------------------------+------------+
| 29338f91dec951c0 | started | master01 | https://192.168.56.113:2380 | https://192.168.56.113:2379 |      false |
| 438679b543748ad8 | started | master02 | https://192.168.56.118:2380 | https://192.168.56.118:2379 |      false |
| 48544942dc6b8509 | started | master03 | https://192.168.56.119:2380 | https://192.168.56.119:2379 |      false |
+------------------+---------+----------+-----------------------------+-----------------------------+------------+

@bu3ny

bu3ny commented Jul 13, 2020

I'm using the latest release of etcd at the time of writing this comment (etcd-v3.4.9) and the following command works for me:

[root@master01 ~]#  etcdctl --endpoints=https://192.168.122.101:2379,https://192.168.122.102:2379,https://192.168.122.103:2379   --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem member list -w table
+------------------+---------+----------+------------------------------+------------------------------+------------+
|        ID        | STATUS  |   NAME   |          PEER ADDRS          |         CLIENT ADDRS         | IS LEARNER |
+------------------+---------+----------+------------------------------+------------------------------+------------+
| 148f9f6172465414 | started | master02 | https://192.168.122.102:2380 | https://192.168.122.102:2379 |      false |
| 79ad015295a746a9 | started | master01 | https://192.168.122.101:2380 | https://192.168.122.101:2379 |      false |
| f857eddf41ed1741 | started | master03 | https://192.168.122.103:2380 | https://192.168.122.103:2379 |      false |
+------------------+---------+----------+------------------------------+------------------------------+------------+
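
member list only confirms that members are registered and started; to check that each member actually serves requests, the same connection flags can be combined with endpoint health (a sketch built from the commands already shown in this thread):

etcdctl --endpoints=https://192.168.122.101:2379,https://192.168.122.102:2379,https://192.168.122.103:2379 --cacert=/etc/etcd/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem endpoint health

This reports each endpoint as healthy or unhealthy individually, which is closer to what the old cluster-health command was intended to show.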
