Fix ironic rebalance race rocky #4

Merged

markgoddard merged 8 commits into stackhpc/rocky from fix-ironic-rebalance-race-rocky on Nov 21, 2019

Conversation


@markgoddard commented Nov 21, 2019

Includes fixes for nova-compute-ironic rebalance race conditions, backported from patches currently in review upstream.

This PR also includes Travis CI enablement and some backported patches needed to get the new functional test covering the issue working.

Balazs Gibizer and others added 4 commits November 21, 2019 09:18
The FakeDriver virt driver used in both the functional and the unit
tests relied on global state to configure the host and node names the
driver reports. This was hard to use when more than one compute service
is started. Global state is also dangerous.

It turned out that only a set of unit tests use multiple nodes per
compute; the rest of the tests can simply use a host=<hostname>,
nodes=[<hostname>] setup.

So this removes the global state.

Change-Id: I2cf2fcbaebc706f897ce5dfbff47d32117064f9c
(cherry picked from commit b5666fb)
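
To illustrate the pattern this commit moves towards, here is a minimal
sketch (hypothetical names, not the actual nova test code) of a fake
driver that derives its node list from per-instance state rather than
module-level globals:

    class FakeDriver:
        """Minimal fake virt driver reporting one node per compute host."""

        def __init__(self, host):
            # Per-instance state: host=<hostname>, nodes=[<hostname>], so
            # starting a second compute service cannot clobber the first.
            self.host = host
            self._nodes = [host]

        def get_available_nodes(self, refresh=False):
            return list(self._nodes)
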
Bugfix Icaf1bae8cb040b939f916a19ce026031ddb84af7 showed that restarting
a compute service in the functional env was unrealistic, causing faults
to slip through. During that bug fix only the minimal change was made
to the compute service restart in the functional env to reproduce the
reported fault. However, the restart of the compute service can be made
even more realistic.

This patch simulates a compute service restart in the functional env
by stopping the original compute service and starting a totally new
compute service for the same host and node. This way we can make sure
that we get a brand new ComputeManager in the new service and no
state can leak between the old and the new service.

This change revealed another shortcoming of the functional env.
In the real world the nova-compute service can be restarted without
losing any running servers on the compute host. But with the naive
implementation of this change the compute service is re-created. This
means that a new ComputeManager is instantiated, which loads a new
FakeDriver instance as well. That new FakeDriver instance then reports
an empty hypervisor. This behavior is not totally unrealistic, as it
simulates a compute host restart that also cleans the hypervisor state
(e.g. compute host redeployment). However this type of restart exposes
another bug in the code path that destroys and deallocates evacuated
instances from the source host. Therefore this patch implements the
compute service restart in a way that simulates only a service restart
and not a full compute host restart. A subsequent patch will add a test
that uses the clean hypervisor case to reproduce the revealed bug.

Related-Bug: #1724172
Change-Id: I9d6cd6259659a35383c0c9c21db72a9434ba86b1
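
A rough sketch of the restart simulation described above (the helper
names are illustrative, not the actual functional test API):

    def restart_compute_service(test, compute):
        # Stop the original service rather than reusing it.
        compute.stop()
        # Start a totally new service for the same host and node. This
        # builds a brand new ComputeManager, so no state can leak from
        # the old service into the "restarted" one.
        return test.start_service('compute', host=compute.host)
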
Bug 1853009 describes a race condition involving multiple nova-compute
services with ironic. As the compute services start up, the hash ring
rebalances, and the compute services have an inconsistent view of which
is responsible for a compute node.

The sequence of actions here is adapted from a real world log [1], where
multiple nova-compute services were started simultaneously. In some
cases mocks are used to simulate race conditions.

There are three main issues with the behaviour:

* host2 deletes the orphan compute node after host1 has taken
  ownership of it.

* host1 assumes that another compute service will not delete its nodes.
  Once a node is in rt.compute_nodes, it is not removed again unless the
  node is orphaned. This prevents host1 from recreating the compute
  node.

* host1 assumes that another compute service will not delete its
  resource providers. Once an RP is in the provider tree, it is not
  removed.

This functional test documents the current behaviour, with the idea that
it can be updated as this behaviour is fixed.

[1] http://paste.openstack.org/show/786272/

Co-Authored-By: Matt Riedemann <mriedem.os@gmail.com>

Change-Id: Ice4071722de54e8d20bb8c3795be22f1995940cd
Related-Bug: #1853009
Related-Bug: #1853159
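
As a simplified sketch of the loop where the race opens up (loosely
modelled on the compute manager's update_available_resource; the db and
placement parameters stand in for the real nova APIs):

    def update_available_resource(context, host, driver, db, placement):
        compute_nodes = db.get_compute_nodes_by_host(context, host)
        node_names = driver.get_available_nodes()
        for cn in compute_nodes:
            if cn.hypervisor_hostname not in node_names:
                # Race window: between the DB query above and this
                # delete, the hash ring may have moved cn to another
                # host, which has already taken ownership of it in its
                # resource tracker. The delete below destroys a live
                # node and its resource provider.
                cn.destroy()
                placement.delete_resource_provider(cn.uuid)
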
There is a race condition in nova-compute with the ironic virt driver as
nodes get rebalanced. It can lead to compute nodes being removed in the
DB and not repopulated. Ultimately this prevents these nodes from being
scheduled to.

The issue being addressed here is that if a compute node is deleted by a host
which thinks it is an orphan, then the compute host that actually owns the node
might not recreate it if the node is already in its resource tracker cache.

This change fixes the issue by clearing nodes from the resource tracker cache
for which a compute node entry does not exist. Then, when the available
resource for the node is updated, the compute node object is not found in the
cache and gets recreated.

Change-Id: I39241223b447fcc671161c370dbf16e1773b684a
Partial-Bug: #1853009
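
A minimal sketch of the cache-clearing step described above
(hypothetical method and helper names):

    def clean_compute_node_cache(rt, context, db):
        db_nodes = {cn.hypervisor_hostname
                    for cn in db.get_compute_nodes_by_host(context, rt.host)}
        for nodename in list(rt.compute_nodes):
            if nodename not in db_nodes:
                # Stale entry: another host deleted the DB record while
                # it was still cached here. Dropping it lets the next
                # update of available resources recreate the compute
                # node instead of assuming it still exists.
                del rt.compute_nodes[nodename]
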
@markgoddard markgoddard self-assigned this Nov 21, 2019
@markgoddard (Author) commented

I ended up backporting a couple of patches to get the multi-service functional test passing.

There is a race condition in nova-compute with the ironic virt driver as
nodes get rebalanced. It can lead to compute nodes being removed in the
DB and not repopulated. Ultimately this prevents these nodes from being
scheduled to.

The main race condition involved is in update_available_resources in
the compute manager. When the list of compute nodes is queried, there
is a compute node belonging to the host that the host does not expect
to be managing, i.e. it is an orphan. Between that time and deleting
the orphan, the real owner of the compute node takes ownership of it
(in the resource tracker). However, the node is still deleted, as the
first host is unaware of the ownership change.

This change prevents this from occurring by filtering on the host when
deleting a compute node. If another compute host has taken ownership of
a node, it will have updated the host field and this will prevent
deletion from occurring. The first host sees this has happened via the
ComputeHostNotFound exception, and avoids deleting its resource
provider.

Closes-Bug: #1853009
Related-Bug: #1841481
Change-Id: I260c1fded79a85d4899e94df4d9036a1ee437f02
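
The shape of the fix, as a hedged sketch (the host-constrained delete
and the helper names are illustrative; nova's ComputeHostNotFound
exception is assumed to be importable as shown):

    from nova import exception

    def delete_orphan_compute_node(context, cn, host, db, placement):
        try:
            # Only match the row if it is still owned by this host.
            db.compute_node_delete(context, cn.id, constraint_host=host)
        except exception.ComputeHostNotFound:
            # Another compute service took ownership between our query
            # and this delete; keep the node and its resource provider.
            return False
        placement.delete_resource_provider(cn.uuid)
        return True
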
In the fix for bug 1839560 [1][2], soft-deleted compute nodes may be
restored, to ensure we can reuse ironic node UUIDs as compute node
UUIDs. While this seems to largely work, it results in some nasty errors
being generated [3]:

    InvalidRequestError This session is in 'inactive' state, due to the
    SQL transaction being rolled back; no further SQL can be emitted
    within this transaction.

This happens because compute_node_create is decorated with
pick_context_manager_writer, which begins a transaction. While
_compute_node_get_and_update_deleted claims that calling a second
pick_context_manager_writer decorated function will begin a new
subtransaction, this does not appear to be the case.

This change removes pick_context_manager_writer from the
compute_node_create function, and adds a new _compute_node_create
function which ensures the transaction is finished if
_compute_node_get_and_update_deleted is called.

The new unit test added here fails without this change.

This change marks the removal of the final FIXME from the functional
test added in [4].

[1] https://bugs.launchpad.net/nova/+bug/1839560
[2] https://git.openstack.org/cgit/openstack/nova/commit/?id=89dd74ac7f1028daadf86cb18948e27fe9d1d411
[3] http://paste.openstack.org/show/786350/
[4] https://review.opendev.org/#/c/695012/

Change-Id: Iae119ea8776bc7f2e5dbe2e502a743217beded73
Closes-Bug: #1853159
Related-Bug: #1853009
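
A simplified sketch of the restructuring described above, based on the
commit message (this assumes the surrounding nova DB API module, where
pick_context_manager_writer, db_exc, models and
_compute_node_get_and_update_deleted are defined; details of the real
code may differ):

    def compute_node_create(context, values):
        # No pick_context_manager_writer here: the inner function owns
        # the transaction, so it has fully finished (and rolled back on
        # failure) before any recovery path starts a new one.
        try:
            return _compute_node_create(context, values)
        except db_exc.DBDuplicateEntry:
            return _compute_node_get_and_update_deleted(context, values)

    @pick_context_manager_writer
    def _compute_node_create(context, values):
        compute_node_ref = models.ComputeNode()
        compute_node_ref.update(values)
        compute_node_ref.save(context.session)
        return compute_node_ref
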
This was added in error in 1ed3f8f.

Change-Id: I94570d254f9dcdd716c5eb33672a7c1d1a738f6a
Adds pep8, py27, functional and docs tox environments.

Change-Id: Iede0a01d5c180b635278e1c27095eef4eac63672
@markgoddard markgoddard force-pushed the fix-ironic-rebalance-race-rocky branch from 7b463fe to 6a9790b Compare November 21, 2019 09:50
@JohnGarbutt (Member) left a comment

More functional test changes than I expected to see, which I don't really have time to get my head around, but the code seems good to go.

@markgoddard (Author) commented

CI failure was in docs, I think we can live with that.

@markgoddard markgoddard merged commit 55a6149 into stackhpc/rocky Nov 21, 2019
@markgoddard markgoddard deleted the fix-ironic-rebalance-race-rocky branch November 21, 2019 10:47
@markgoddard (Author) commented

Tagged as stackhpc/18.2.3.2.

markgoddard pushed a commit that referenced this pull request Jan 25, 2023
NOTE(sbauza): Stable policy allows us to proactively merge a backport without waiting for the parent patch to be merged (exception to rule #4 in [1]). Marking [stable-only] in order to silence nova-tox-validate-backport.

[1] https://docs.openstack.org/project-team-guide/stable-branches.html#appropriate-fixes

Related-Bug: #1996188
Change-Id: I5a399f1d3d702bfb76c067893e9c924904c8c360
markgoddard pushed a commit that referenced this pull request Jan 25, 2023
Trivial conflicts on xena only in:
	nova/conf/compute.py

NOTE(sbauza): Stable policy allows us to proactively merge a backport without waiting for the parent patch to be merged (exception to rule #4 in [1]). Marking [stable-only] in order to silence nova-tox-validate-backport.

[1] https://docs.openstack.org/project-team-guide/stable-branches.html#appropriate-fixes

Related-Bug: #1996188
Change-Id: I5a399f1d3d702bfb76c067893e9c924904c8c360
JohnGarbutt pushed a commit that referenced this pull request Jan 25, 2023
NOTE(sbauza): Stable policy allows us to proactively merge a backport without waiting for the parent patch to be merged (exception to rule #4 in [1]). Marking [stable-only] in order to silence nova-tox-validate-backport.

[1] https://docs.openstack.org/project-team-guide/stable-branches.html#appropriate-fixes

Conflicts vs victoria in:
	nova/conf/compute.py

Related-Bug: #1996188
Change-Id: I5a399f1d3d702bfb76c067893e9c924904c8c360
(cherry picked from commit bf8b6f698c23c631c0a71d77f2716ca292c45e74)
JohnGarbutt pushed a commit that referenced this pull request Jan 25, 2023
NOTE(sbauza): Stable policy allows us to proactively merge a backport without waiting for the parent patch to be merged (exception to rule #4 in [1]). Marking [stable-only] in order to silence nova-tox-validate-backport.

[1] https://docs.openstack.org/project-team-guide/stable-branches.html#appropriate-fixes

Conflicts vs wallaby in:
	nova/conf/compute.py
	nova/tests/unit/virt/test_images.py

Related-Bug: #1996188
Change-Id: I5a399f1d3d702bfb76c067893e9c924904c8c360
(cherry picked from commit bf8b6f698c23c631c0a71d77f2716ca292c45e74)