Fix ironic rebalance race rocky #4

Merged

markgoddard merged 8 commits into stackhpc/rocky from fix-ironic-rebalance-race-rocky on Nov 21, 2019

Conversation


@markgoddard commented Nov 21, 2019

Includes fixes for nova-compute-ironic rebalance race conditions, backported from patches currently in review upstream.

This PR also includes Travis CI enablement and some backported patches needed to get the new functional test covering the issue working.

Balazs Gibizer and others added 4 commits November 21, 2019 09:18
The FakeDriver virt driver used in both the functional and the unit
tests relied on global state to configure the host and node names the
driver reports. This was hard to use when more than one compute service
is started. Global state is also dangerous.

It turned out that only a set of unit tests use multiple nodes per
compute; the rest of the tests can simply use a host=<hostname>,
nodes=[<hostname>] setup.

So this removes the global state.

Change-Id: I2cf2fcbaebc706f897ce5dfbff47d32117064f9c
(cherry picked from commit b5666fb)
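
To illustrate the pattern this commit moves towards, here is a minimal
sketch (hypothetical names, not the actual nova test code) of a fake
driver that derives its node list from per-instance state rather than
module-level globals:

    class FakeDriver:
        """Minimal fake virt driver reporting one node per compute host."""

        def __init__(self, host):
            # Per-instance state: host=<hostname>, nodes=[<hostname>], so
            # starting a second compute service cannot clobber the first.
            self.host = host
            self._nodes = [host]

        def get_available_nodes(self, refresh=False):
            return list(self._nodes)
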
Bugfix Icaf1bae8cb040b939f916a19ce026031ddb84af7 showed that restarting
a compute service in the functional env was unrealistic, causing faults
to slip through. During that bug fix only the minimal change was made
to the compute service restart in the functional env to reproduce the
reported fault. However, the restart of the compute service can be made
even more realistic.

This patch simulates a compute service restart in the functional env
by stopping the original compute service and starting a totally new
compute service for the same host and node. This way we can make sure
that we get a brand new ComputeManager in the new service and no
state can leak between the old and the new service.

This change revealed another shortcoming of the functional env.
In the real world the nova-compute service can be restarted without
losing any running servers on the compute host. But with the naive
implementation of this change the compute service is re-created. This
means that a new ComputeManager is instantiated, which loads a new
FakeDriver instance as well. That new FakeDriver instance then reports
an empty hypervisor. This behavior is not totally unrealistic, as it
simulates a compute host restart that also cleans the hypervisor state
(e.g. compute host redeployment). However this type of restart exposes
another bug in the code path that destroys and deallocates evacuated
instances from the source host. Therefore this patch implements the
compute service restart in a way that simulates only a service restart
and not a full compute host restart. A subsequent patch will add a test
that uses the clean hypervisor case to reproduce the revealed bug.

Related-Bug: #1724172
Change-Id: I9d6cd6259659a35383c0c9c21db72a9434ba86b1
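
A rough sketch of the restart simulation described above (the helper
names are illustrative, not the actual functional test API):

    def restart_compute_service(test, compute):
        # Stop the original service rather than reusing it.
        compute.stop()
        # Start a totally new service for the same host and node. This
        # builds a brand new ComputeManager, so no state can leak from
        # the old service into the "restarted" one.
        return test.start_service('compute', host=compute.host)
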
Bug 1853009 describes a race condition involving multiple nova-compute
services with ironic. As the compute services start up, the hash ring
rebalances, and the compute services have an inconsistent view of which
is responsible for a compute node.

The sequence of actions here is adapted from a real world log [1], where
multiple nova-compute services were started simultaneously. In some
cases mocks are used to simulate race conditions.

There are three main issues with the behaviour:

* host2 deletes the orphan compute node after host1 has taken
  ownership of it.

* host1 assumes that another compute service will not delete its nodes.
  Once a node is in rt.compute_nodes, it is not removed again unless the
  node is orphaned. This prevents host1 from recreating the compute
  node.

* host1 assumes that another compute service will not delete its
  resource providers. Once an RP is in the provider tree, it is not
  removed.

This functional test documents the current behaviour, with the idea that
it can be updated as this behaviour is fixed.

[1] http://paste.openstack.org/show/786272/

Co-Authored-By: Matt Riedemann <mriedem.os@gmail.com>

Change-Id: Ice4071722de54e8d20bb8c3795be22f1995940cd
Related-Bug: #1853009
Related-Bug: #1853159
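
As a simplified sketch of the loop where the race opens up (loosely
modelled on the compute manager's update_available_resource; the db and
placement parameters stand in for the real nova APIs):

    def update_available_resource(context, host, driver, db, placement):
        compute_nodes = db.get_compute_nodes_by_host(context, host)
        node_names = driver.get_available_nodes()
        for cn in compute_nodes:
            if cn.hypervisor_hostname not in node_names:
                # Race window: between the DB query above and this
                # delete, the hash ring may have moved cn to another
                # host, which has already taken ownership of it in its
                # resource tracker. The delete below destroys a live
                # node and its resource provider.
                cn.destroy()
                placement.delete_resource_provider(cn.uuid)
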
There is a race condition in nova-compute with the ironic virt driver as
nodes get rebalanced. It can lead to compute nodes being removed in the
DB and not repopulated. Ultimately this prevents these nodes from being
scheduled to.

The issue being addressed here is that if a compute node is deleted by a host
which thinks it is an orphan, then the compute host that actually owns the node
might not recreate it if the node is already in its resource tracker cache.

This change fixes the issue by clearing nodes from the resource tracker cache
for which a compute node entry does not exist. Then, when the available
resource for the node is updated, the compute node object is not found in the
cache and gets recreated.

Change-Id: I39241223b447fcc671161c370dbf16e1773b684a
Partial-Bug: #1853009
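
A minimal sketch of the cache-clearing step described above
(hypothetical method and helper names):

    def clean_compute_node_cache(rt, context, db):
        db_nodes = {cn.hypervisor_hostname
                    for cn in db.get_compute_nodes_by_host(context, rt.host)}
        for nodename in list(rt.compute_nodes):
            if nodename not in db_nodes:
                # Stale entry: another host deleted the DB record while
                # it was still cached here. Dropping it lets the next
                # update of available resources recreate the compute
                # node instead of assuming it still exists.
                del rt.compute_nodes[nodename]
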
@markgoddard markgoddard self-assigned this Nov 21, 2019
@markgoddard (Author) commented

I ended up backporting a couple of patches to get the multi-service functional test passing.

There is a race condition in nova-compute with the ironic virt driver as
nodes get rebalanced. It can lead to compute nodes being removed in the
DB and not repopulated. Ultimately this prevents these nodes from being
scheduled to.

The main race condition involved is in update_available_resources in
the compute manager. When the list of compute nodes is queried, there
is a compute node belonging to the host that the host does not expect
to be managing, i.e. it is an orphan. Between that time and deleting
the orphan, the real owner of the compute node takes ownership of it
(in the resource tracker). However, the node is still deleted, as the
first host is unaware of the ownership change.

This change prevents this from occurring by filtering on the host when
deleting a compute node. If another compute host has taken ownership of
a node, it will have updated the host field and this will prevent
deletion from occurring. The first host sees this has happened via the
ComputeHostNotFound exception, and avoids deleting its resource
provider.

Closes-Bug: #1853009
Related-Bug: #1841481
Change-Id: I260c1fded79a85d4899e94df4d9036a1ee437f02
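
The shape of the fix, as a hedged sketch (the host-constrained delete
and the helper names are illustrative; nova's ComputeHostNotFound
exception is assumed to be importable as shown):

    from nova import exception

    def delete_orphan_compute_node(context, cn, host, db, placement):
        try:
            # Only match the row if it is still owned by this host.
            db.compute_node_delete(context, cn.id, constraint_host=host)
        except exception.ComputeHostNotFound:
            # Another compute service took ownership between our query
            # and this delete; keep the node and its resource provider.
            return False
        placement.delete_resource_provider(cn.uuid)
        return True
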
In the fix for bug 1839560 [1][2], soft-deleted compute nodes may be
restored, to ensure we can reuse ironic node UUIDs as compute node
UUIDs. While this seems to largely work, it results in some nasty errors
being generated [3]:

    InvalidRequestError This session is in 'inactive' state, due to the
    SQL transaction being rolled back; no further SQL can be emitted
    within this transaction.

This happens because compute_node_create is decorated with
pick_context_manager_writer, which begins a transaction. While
_compute_node_get_and_update_deleted claims that calling a second
pick_context_manager_writer decorated function will begin a new
subtransaction, this does not appear to be the case.

This change removes pick_context_manager_writer from the
compute_node_create function, and adds a new _compute_node_create
function which ensures the transaction is finished if
_compute_node_get_and_update_deleted is called.

The new unit test added here fails without this change.

This change marks the removal of the final FIXME from the functional
test added in [4].

[1] https://bugs.launchpad.net/nova/+bug/1839560
[2] https://git.openstack.org/cgit/openstack/nova/commit/?id=89dd74ac7f1028daadf86cb18948e27fe9d1d411
[3] http://paste.openstack.org/show/786350/
[4] https://review.opendev.org/#/c/695012/

Change-Id: Iae119ea8776bc7f2e5dbe2e502a743217beded73
Closes-Bug: #1853159
Related-Bug: #1853009
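
A simplified sketch of the restructuring described above, based on the
commit message (this assumes the surrounding nova DB API module, where
pick_context_manager_writer, db_exc, models and
_compute_node_get_and_update_deleted are defined; details of the real
code may differ):

    def compute_node_create(context, values):
        # No pick_context_manager_writer here: the inner function owns
        # the transaction, so it has fully finished (and rolled back on
        # failure) before any recovery path starts a new one.
        try:
            return _compute_node_create(context, values)
        except db_exc.DBDuplicateEntry:
            return _compute_node_get_and_update_deleted(context, values)

    @pick_context_manager_writer
    def _compute_node_create(context, values):
        compute_node_ref = models.ComputeNode()
        compute_node_ref.update(values)
        compute_node_ref.save(context.session)
        return compute_node_ref
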
This was added in error in 1ed3f8f.

Change-Id: I94570d254f9dcdd716c5eb33672a7c1d1a738f6a
Adds pep8, py27, functional and docs tox environments.

Change-Id: Iede0a01d5c180b635278e1c27095eef4eac63672
@markgoddard markgoddard force-pushed the fix-ironic-rebalance-race-rocky branch from 7b463fe to 6a9790b Compare November 21, 2019 09:50
@JohnGarbutt (Member) left a comment

More functional test changes than I expected to see, which I don't really have time to get my head around, but the code seems good to go.

@markgoddard (Author) commented

CI failure was in docs, I think we can live with that.

@markgoddard markgoddard merged commit 55a6149 into stackhpc/rocky Nov 21, 2019
@markgoddard markgoddard deleted the fix-ironic-rebalance-race-rocky branch November 21, 2019 10:47
@markgoddard (Author) commented

Tagged as stackhpc/18.2.3.2.

markgoddard pushed a commit that referenced this pull request Jan 25, 2023
NOTE(sbauza): Stable policy allows us to proactively merge a backport without waiting for the parent patch to be merged (exception to rule #4 in [1]). Marking [stable-only] in order to silence nova-tox-validate-backport.

[1] https://docs.openstack.org/project-team-guide/stable-branches.html#appropriate-fixes

Related-Bug: #1996188
Change-Id: I5a399f1d3d702bfb76c067893e9c924904c8c360
markgoddard pushed a commit that referenced this pull request Jan 25, 2023
Trivial conflicts on xena only in:
	nova/conf/compute.py

NOTE(sbauza): Stable policy allows us to proactively merge a backport without waiting for the parent patch to be merged (exception to rule #4 in [1]). Marking [stable-only] in order to silence nova-tox-validate-backport.

[1] https://docs.openstack.org/project-team-guide/stable-branches.html#appropriate-fixes

Related-Bug: #1996188
Change-Id: I5a399f1d3d702bfb76c067893e9c924904c8c360
JohnGarbutt pushed a commit that referenced this pull request Jan 25, 2023
NOTE(sbauza): Stable policy allows us to proactively merge a backport without waiting for the parent patch to be merged (exception to rule #4 in [1]). Marking [stable-only] in order to silence nova-tox-validate-backport.

[1] https://docs.openstack.org/project-team-guide/stable-branches.html#appropriate-fixes

Conflicts vs victoria in:
	nova/conf/compute.py

Related-Bug: #1996188
Change-Id: I5a399f1d3d702bfb76c067893e9c924904c8c360
(cherry picked from commit bf8b6f698c23c631c0a71d77f2716ca292c45e74)
JohnGarbutt pushed a commit that referenced this pull request Jan 25, 2023
NOTE(sbauza): Stable policy allows us to proactively merge a backport without waiting for the parent patch to be merged (exception to rule #4 in [1]). Marking [stable-only] in order to silence nova-tox-validate-backport.

[1] https://docs.openstack.org/project-team-guide/stable-branches.html#appropriate-fixes

Conflicts vs wallaby in:
	nova/conf/compute.py
	nova/tests/unit/virt/test_images.py

Related-Bug: #1996188
Change-Id: I5a399f1d3d702bfb76c067893e9c924904c8c360
(cherry picked from commit bf8b6f698c23c631c0a71d77f2716ca292c45e74)