Retry fragile tests or allow them to fail #189


Closed

Totktonada opened this issue Sep 23, 2019 · 0 comments

Assignees

Labels

feature

Member

Totktonada commented Sep 23, 2019

Manually rerunning tests to find out whether a failure is persistent wastes my time, and that is undesirable.

I see two ways to overcome this; both are still at the 'raw idea' stage.

Provide a way to set a retry count for fragile tests

It could be an option in suite.ini. If any of the attempts succeeds, the test should be considered passed.

We also need the ability to retry a hung test.

I was against retrying for a long time and hoped that we would fix all flaky tests at some point. It seems we are really unable to achieve this goal.
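
The retry idea above can be sketched in a few lines of Python. This is only an illustration of the proposal, not test-run's actual code: the names `run_once` and `passes_with_retries` are hypothetical, and treating a timeout as a failed attempt is an assumption about how a hung test could be handled.

```python
import subprocess
import sys
from typing import Callable, Sequence


def run_once(cmd: Sequence[str], timeout: float) -> bool:
    """Run one attempt; a non-zero exit code or a hang counts as a failure."""
    try:
        return subprocess.run(list(cmd), timeout=timeout).returncode == 0
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising, so the next
        # attempt starts without a leftover hung process.
        return False


def passes_with_retries(attempt: Callable[[], bool], retries: int) -> bool:
    """Return True as soon as any of the 1 + retries attempts succeeds."""
    for _ in range(1 + retries):
        if attempt():
            return True
    return False


if __name__ == "__main__":
    # Hypothetical test command: a Python one-liner that exits successfully.
    cmd = [sys.executable, "-c", "raise SystemExit(0)"]
    print(passes_with_retries(lambda: run_once(cmd, timeout=30.0), retries=2))
```

Separating the attempt from the retry loop keeps the "any successful attempt means passed" rule independent of how a single test is launched.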

Allow fragile tests to fail

Maybe it is worth doing that under a command-line option and using it in CI, but not locally.

Totktonada added feature raw idea labels

ligurio self-assigned this

Totktonada mentioned this issue

[1pt] gitlab-ci: implement testing rerun based on fragile lists tarantool/tarantool#5050

Closed

avtikhon added a commit that referenced this issue


          Enable test reruns on failed fragiled tests

838ef34

Added the ability to set a per-suite 'fragile_retries' option in the
suite.ini configuration file; it sets the number of reruns allowed
for a failed test from the 'fragile' list.

Part of #189.

avtikhon added a commit that referenced this issue


          Add ability to check failed tests w/ fragile list

7869fe9

Added the ability to check failed tests against the fragile list, to be
sure that the current failure matches the issue mentioned in the fragile list.

Closes #189.

avtikhon added a commit that referenced this issue


          Add ability to check failed tests w/ fragile list

55a8fca

Added the ability to check failed tests against the fragile list, to be
sure that the current failure matches the issue mentioned in the fragile list.
The fragile list should contain the results file checksums together with
the related issues, in the format:

  fragile = <basename of the test> ; gh-<issue> md5sum:<checksum>

Closes #189.

avtikhon mentioned this issue

Enable test reruns on failed fragiled tests #217

Merged

avtikhon added a commit that referenced this issue


          Enable test reruns on failed fragiled tests

9fd5c68

Added the ability to set a per-suite 'retries' option in the suite.ini
configuration file; it sets the number of reruns allowed for
failed tests from the 'fragile' list:

  fragile = {
    "retries": 10,
    "tests": {
        "bitset.test.lua": {
            "issues": [ "gh-4095" ]
        }
    }}

Part of #189

avtikhon added a commit that referenced this issue


          Add ability to check failed tests w/ fragile list

7956ee3

Added the ability to check failed tests against the fragile list, to be
sure that the current failure matches the issue mentioned in the fragile list.
The fragile list should contain the results file checksums together with
the related issues, in the format:

  fragile = {
    "retries": 10,
    "tests": {
        "bitset.test.lua": {
            "issues": [ "gh-4095" ],
            "checksums": [ "050af3a99561a724013995668a4bc71c", "f34be60193cfe9221d3fe50df657e9d3" ]
        }
    }}

Closes #189

avtikhon added a commit that referenced this issue


          Add ability to check results file checksum on fail

f801765

Added the ability to check the results file checksum when a test fails
and compare it with the checksums of the known issues mentioned in the
fragile list. The fragile list should contain the results file
checksums together with their issues, in the format:

  fragile = {
    "retries": 10,
    "tests": {
        "bitset.test.lua": {
            "issues": [ "gh-4095" ],
            "checksums": [ "050af3a99561a724013995668a4bc71c", "f34be60193cfe9221d3fe50df657e9d3" ]
        }
    }}

Closes #189

Totktonada closed this as completed in #217

Totktonada pushed a commit that referenced this issue


          Enable test reruns on failed fragiled tests

dfcb8b4

Added the ability to set a per-suite 'retries' option in the suite.ini
configuration file; it sets the number of reruns allowed for
failed tests from the 'fragile' list:

  fragile = {
    "retries": 10,
    "tests": {
        "bitset.test.lua": {
            "issues": [ "gh-4095" ]
        }
    }}

Part of #189
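
As a rough illustration of how such an option could be read, the sketch below parses a suite.ini-like text with Python's standard `configparser` and `json` modules. It assumes the option lives in a `[default]` section and that its value is JSON; the function name `load_fragile` and the fallback defaults are hypothetical and not taken from test-run itself.

```python
import configparser
import json

# Sample configuration text; the "[default]" section and the "core" line are
# assumptions made for the example.
SUITE_INI = """\
[default]
core = tarantool
fragile = {
    "retries": 10,
    "tests": {
        "bitset.test.lua": {
            "issues": [ "gh-4095" ]
        }
    }}
"""


def load_fragile(ini_text: str) -> dict:
    """Parse the 'fragile' option; a missing option means no retries."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # Indented continuation lines are joined into one multi-line value.
    raw = parser.get("default", "fragile", fallback="{}")
    fragile = json.loads(raw)
    fragile.setdefault("retries", 0)
    fragile.setdefault("tests", {})
    return fragile


if __name__ == "__main__":
    cfg = load_fragile(SUITE_INI)
    print(cfg["retries"], sorted(cfg["tests"]))
```

Note that `json.loads` rejects trailing commas, so the sample value has none, and that every continuation line of the value has to stay indented for `configparser` to treat it as part of the same option.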

Totktonada pushed a commit that referenced this issue


          Add ability to check results file checksum on fail

ec8c991

Added the ability to check the results file checksum when a test fails
and compare it with the checksums of the known issues mentioned in the
fragile list. The fragile list should contain the results file
checksums together with their issues, in the format:

  fragile = {
    "retries": 10,
    "tests": {
        "bitset.test.lua": {
            "issues": [ "gh-4095" ],
            "checksums": [ "050af3a99561a724013995668a4bc71c", "f34be60193cfe9221d3fe50df657e9d3" ]
        }
    }}

Closes #189
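
The checksum check described in this commit message might look roughly like the following. It is a sketch under stated assumptions, not the merged implementation: the helper names (`md5_hex`, `is_known_failure`, `should_retry`) are hypothetical, and the output of the failed run is passed in as bytes rather than read from a file.

```python
import hashlib

# Configuration in the shape shown above, already parsed into a dict.
FRAGILE = {
    "retries": 10,
    "tests": {
        "bitset.test.lua": {
            "issues": ["gh-4095"],
            "checksums": [
                "050af3a99561a724013995668a4bc71c",
                "f34be60193cfe9221d3fe50df657e9d3",
            ],
        }
    },
}


def md5_hex(data: bytes) -> str:
    """Hex md5 digest of the output produced by a failed run."""
    return hashlib.md5(data).hexdigest()


def is_known_failure(fragile: dict, test_name: str, output: bytes) -> bool:
    """True if the test is listed as fragile and the output checksum is known."""
    entry = fragile.get("tests", {}).get(test_name)
    if entry is None:
        return False
    return md5_hex(output) in entry.get("checksums", [])


def should_retry(fragile: dict, test_name: str, output: bytes, runs_done: int) -> bool:
    """Retry only a known failure, and only while reruns are left."""
    reruns_left = runs_done <= fragile.get("retries", 0)
    return reruns_left and is_known_failure(fragile, test_name, output)
```

With this shape, an unknown failure (a checksum that is not listed) is reported immediately instead of being retried, which matches the intent of tying each fragile test to a tracked issue.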

Totktonada assigned avtikhon and unassigned ligurio

Totktonada removed the raw idea label

Totktonada added a commit to tarantool/tarantool that referenced this issue


          test: update test-run

43482ee

Retry a failed test when it is marked as fragile (and several other
conditions are met, see below).

test-run already allows setting a list of fragile tests. They are run
one by one after all parallel ones in order to eliminate possible
resource starvation and to bring timings closer to those under which the
tests pass. See [1].

In practice this approach does not help much against our problem with
flaky tests. We decided to retry failed tests when they are known to be
fragile. See [2].

The core idea is to split responsibility: known flaky failures will not
distract a developer, but each fragile test will be marked
explicitly, tracked in an issue, and analyzed by the quality assurance
team.

The default behaviour is unchanged: each test from the fragile list
is run once after all parallel ones. But now it is possible to set
the number of retries.

Beware: the implementation does not allow setting just a retry count; it
also requires an md5sum of the failed test output (the so-called
reject file). The idea is to ensure that we retry the test only in
case of a known failure, not some other failure within the test.

This approach has a limitation: when a test fails, it may output
information that varies from run to run or depends on the base directory.
We should always verify the output before putting its checksum into the
configuration file.

Despite doubts regarding this approach, it looks simple, and we decided
to try it and revisit it if the need arises.

See configuration example in [3].

[1]: tarantool/test-run#187
[2]: tarantool/test-run#189
[3]: tarantool/test-run#217

Part of #5050
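
One way to act on the limitation mentioned in the commit message above is to compare the output of several failed runs before recording a checksum. The helper below is a hypothetical sketch (the name `stable_checksum` is not part of test-run): it returns a digest only when every collected output hashes to the same value.

```python
import hashlib
from typing import Iterable, Optional


def stable_checksum(outputs: Iterable[bytes]) -> Optional[str]:
    """Return the shared md5 hex digest of all outputs, or None if they differ."""
    digests = {hashlib.md5(data).hexdigest() for data in outputs}
    # Exactly one distinct digest means the failure output is reproducible.
    return digests.pop() if len(digests) == 1 else None


if __name__ == "__main__":
    # Outputs that embed a run-specific path do not yield a usable checksum.
    varying = [b"error in /tmp/run1/bitset\n", b"error in /tmp/run2/bitset\n"]
    print(stable_checksum(varying))
```

If the result is None, the output depends on something that changes between runs, and a checksum taken from a single run would not match the next occurrence of the same failure.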

avtikhon mentioned this issue

Need to make able to run flaky tests in parallel tarantool/tarantool-qa#80

Open
