test: flaky replication/election_qsync_stress.test.lua test #166

Closed
avtikhon opened this issue Oct 8, 2020 · 15 comments

@avtikhon
Contributor

avtikhon commented Oct 8, 2020

Tarantool version:
Tarantool 2.6.0-143-g0dc72812fb
Target: Linux-x86_64-Debug
Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=ON
Compiler: /usr/bin/cc /usr/bin/c++
C_FLAGS: -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -fprofile-arcs -ftest-coverage -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-gnu-alignof-expression -fno-gnu89-inline -Wno-cast-function-type -Werror
CXX_FLAGS: -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -fprofile-arcs -ftest-coverage -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Wno-cast-function-type -Werror

OS version:

  1. Linux (Debian 9)
  2. FreeBSD 12

Bug description:
Found 5 issues:

  1. https://gitlab.com/tarantool/tarantool/-/jobs/779113169#L5139

artifacts.zip

results file checksum: cc93d7c69c6368217634718bdf3de16c

[035] replication/election_qsync.test.lua             vinyl           
[035] TarantoolInpector.handle() received the following error:
[035] Traceback (most recent call last):
[035]   File "/builds/M4RrgQZ3/0/tarantool/tarantool/test-run/lib/inspector.py", line 94, in handle
[035]     result = self.parser.parse_preprocessor(line)
[035]   File "/builds/M4RrgQZ3/0/tarantool/tarantool/test-run/lib/preprocessor.py", line 73, in parse_preprocessor
[035]     return self.lua_eval(name, expr[1:-1])
[035]   File "/builds/M4RrgQZ3/0/tarantool/tarantool/test-run/lib/preprocessor.py", line 388, in lua_eval
[035]     self.servers[name].admin.reconnect()
[035] KeyError: 'election_replica0'
[035] [ fail ]
[035] 
[035] Test failed! Result content mismatch:
[035] --- replication/election_qsync.result	Thu Oct  8 00:07:13 2020
[035] +++ replication/election_qsync.reject	Thu Oct  8 04:11:02 2020
[035] @@ -102,27 +102,3 @@
[035]      old_leader_nr = new_leader_nr
[035]      old_leader = new_leader
[035]  end;
[035] - | ---
[035] - | ...
[035] -test_run:cmd('setopt delimiter ""');
[035] - | ---
[035] - | - true
[035] - | ...
[035] --- We're connected to some leader.
[035] -c.space.test:select{}
[035] - | ---
[035] - | - - [1]
[035] - |   - [2]
[035] - |   - [3]
[035] - |   - [4]
[035] - |   - [5]
[035] - |   - [6]
[035] - |   - [7]
[035] - |   - [8]
[035] - |   - [9]
[035] - |   - [10]
[035] - | ...
[035] -
[035] -test_run:drop_cluster(SERVERS)
[035] - | ---
[035] - | ...
[035] 
  2. https://gitlab.com/tarantool/tarantool/-/jobs/779113714#L5729

artifacts.zip

results file checksum: 3fb2e6cef4c8fa1d0edd8654fd2d8ef6

[021] replication/election_qsync.test.lua             vinyl           
[021] TarantoolInpector.handle() received the following error:
[021] Traceback (most recent call last):
[021]   File "/home/vagrant/tarantool/test-run/lib/inspector.py", line 94, in handle
[021]     result = self.parser.parse_preprocessor(line)
[021]   File "/home/vagrant/tarantool/test-run/lib/preprocessor.py", line 106, in parse_preprocessor
[021]     return self.server(stype, sname, options)
[021]   File "/home/vagrant/tarantool/test-run/lib/preprocessor.py", line 338, in server
[021]     return getattr(self, attr)(ctype, sname, opts)
[021]   File "/home/vagrant/tarantool/test-run/lib/preprocessor.py", line 218, in server_stop
[021]     self.connections[sname].disconnect()
[021] KeyError: 'election_replica2'
[021] [ fail ]
[021] 
[021] Test failed! Result content mismatch:
[021] --- replication/election_qsync.result	Thu Oct  8 04:07:26 2020
[021] +++ replication/election_qsync.reject	Fri May  8 08:24:37 2020
[021] @@ -103,6 +103,7 @@
[021]      old_leader = new_leader
[021]  end;
[021]   | ---
[021] + | - error: Can't modify data because this instance is in read-only mode.
[021]   | ...
[021]  test_run:cmd('setopt delimiter ""');
[021]   | ---
[021] @@ -113,16 +114,6 @@
[021]   | ---
[021]   | - - [1]
[021]   |   - [2]
[021] - |   - [3]
[021] - |   - [4]
[021] - |   - [5]
[021] - |   - [6]
[021] - |   - [7]
[021] - |   - [8]
[021] - |   - [9]
[021] - |   - [10]
[021]   | ...
[021]  
[021]  test_run:drop_cluster(SERVERS)
[021] - | ---
[021] - | ...
[021] 
  3. https://gitlab.com/tarantool/tarantool/-/jobs/795044495#L5273

artifacts.zip

results file checksum: 634bda94accdcdef7b1db3e14f28f445

[155] replication/election_qsync_stress.test.lua      memtx           
[155] TarantoolInpector.handle() received the following error:
[155] Traceback (most recent call last):
[155]   File "/builds/4BUsapPU/0/tarantool/tarantool/test-run/lib/inspector.py", line 94, in handle
[155]     result = self.parser.parse_preprocessor(line)
[155]   File "/builds/4BUsapPU/0/tarantool/tarantool/test-run/lib/preprocessor.py", line 106, in parse_preprocessor
[155]     return self.server(stype, sname, options)
[155]   File "/builds/4BUsapPU/0/tarantool/tarantool/test-run/lib/preprocessor.py", line 338, in server
[155]     return getattr(self, attr)(ctype, sname, opts)
[155]   File "/builds/4BUsapPU/0/tarantool/tarantool/test-run/lib/preprocessor.py", line 218, in server_stop
[155]     self.connections[sname].disconnect()
[155] KeyError: 'election_replica3'
[155] [ fail ]
[155] 
[155] Test failed! Result content mismatch:
[155] --- replication/election_qsync_stress.result	Fri Oct 16 11:31:28 2020
[155] +++ replication/election_qsync_stress.reject	Fri Oct 16 12:12:35 2020
[155] @@ -102,6 +102,7 @@
[155]      old_leader = new_leader
[155]  end;
[155]   | ---
[155] + | - error: Can't modify data because this instance is in read-only mode.
[155]   | ...
[155]  test_run:cmd('setopt delimiter ""');
[155]   | ---
[155] @@ -110,9 +111,7 @@
[155]  -- We're connected to some leader.
[155]  #c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
[155]   | ---
[155] - | - true
[155] + | - null
[155]   | ...
[155]  
[155]  test_run:drop_cluster(SERVERS)
[155] - | ---
[155] - | ...
[155] 

  4. https://gitlab.com/tarantool/tarantool/-/jobs/795376806#L5359

artifacts.zip

results file checksum: 36bcdae426c18a60fd13025c09f197d0

[164] --- replication/election_qsync_stress.result	Fri Oct 16 14:39:00 2020
[164] +++ replication/election_qsync_stress.reject	Fri Oct 16 14:48:14 2020
[164] @@ -110,7 +110,7 @@
[164]  -- We're connected to some leader.
[164]  #c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
[164]   | ---
[164] - | - true
[164] + | - null
[164]   | ...
[164]  
[164]  test_run:drop_cluster(SERVERS)
[164] 

  5. https://gitlab.com/tarantool/tarantool/-/jobs/795916093#L5738

artifacts.zip

results file checksum: 209c865525154a91435c63850f15eca0

[027] --- replication/election_qsync_stress.result	Fri Oct 16 19:25:53 2020
[027] +++ replication/election_qsync_stress.reject	Fri May  8 08:23:59 2020
[027] @@ -102,6 +102,7 @@
[027]      old_leader = new_leader
[027]  end;
[027]   | ---
[027] + | - error: Found uncommitted sync transactions from other instance with id 2
[027]   | ...
[027]  test_run:cmd('setopt delimiter ""');
[027]   | ---
[027] @@ -110,9 +111,7 @@
[027]  -- We're connected to some leader.
[027]  #c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
[027]   | ---
[027] - | - true
[027] + | - null
[027]   | ...
[027]  
[027]  test_run:drop_cluster(SERVERS)
[027] - | ---
[027] - | ...
[027] 

Steps to reproduce:

Optional (but very desirable):

  • coredump
  • backtrace
  • netstat
avtikhon referenced this issue in tarantool/tarantool Oct 8, 2020
Added for tests with issues:

  box/gh-5135-invalid-upsert.test.lua		gh-5376
  box/huge_field_map_long.test.lua		gh-5375
  replication/anon.test.lua			gh-5381
  replication/box_set_replication_stress.test.lua gh-4992
  replication/election_basic.test.lua		gh-5368
  replication/election_qsync.test.lua test	gh-5395
  replication/gh-3247-misc-iproto-sequence-value-not-replicated.test.lua gh-5380
  replication/gh-5298-qsync-recovery-snap.test.lua.test.lua gh-5379
  replication/show_error_on_disconnect.test.lua	gh-5371
  vinyl/gc.test.lua				gh-5383
  vinyl/gh-4957-too-many-upserts.test.lua	gh-5378
  vinyl/gh.test.lua				gh-5141
  vinyl/quota.test.lua				gh-5377
  vinyl/snapshot.test.lua			gh-4984
  vinyl/stat.test.lua				gh-4951
@PersDep

PersDep commented Oct 8, 2020

It also hits an assertion from time to time:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f0cf07da8b1 in __GI_abort () at abort.c:79
#2  0x00007f0cf07ca42a in __assert_fail_base (fmt=0x7f0cf0951a38 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x55e86001da89 "lsn > limbo->confirmed_lsn", 
    file=file@entry=0x55e86001d708 "/home/kosarev/tarantool/src/box/txn_limbo.c", line=line@entry=340, function=function@entry=0x55e86001dc40 <__PRETTY_FUNCTION__.12489> "txn_limbo_write_confirm")
    at assert.c:92
#3  0x00007f0cf07ca4a2 in __GI___assert_fail (assertion=0x55e86001da89 "lsn > limbo->confirmed_lsn", file=0x55e86001d708 "/home/kosarev/tarantool/src/box/txn_limbo.c", line=340, 
    function=0x55e86001dc40 <__PRETTY_FUNCTION__.12489> "txn_limbo_write_confirm") at assert.c:101
#4  0x000055e85fbdb753 in txn_limbo_write_confirm (limbo=0x55e860416a60 <txn_limbo>, lsn=9) at /home/kosarev/tarantool/src/box/txn_limbo.c:340
#5  0x000055e85fbdc328 in txn_limbo_force_empty (limbo=0x55e860416a60 <txn_limbo>, confirm_lsn=9) at /home/kosarev/tarantool/src/box/txn_limbo.c:622
#6  0x000055e85fbe291e in box_clear_synchro_queue () at /home/kosarev/tarantool/src/box/box.cc:1071
#7  0x000055e85fcb5cd3 in lbox_ctl_clear_synchro_queue (L=0x4151c0b8) at /home/kosarev/tarantool/src/box/lua/ctl.c:85
#8  0x000055e85fe400cb in lj_BC_FUNCC ()
#9  0x000055e85fdfb84b in jit_secure_call (L=0x4151c0b8, base=0x4151c118, nres=0) at lj_api.c:88
#10 0x000055e85fe02ae0 in lua_call (L=0x4151c0b8, nargs=0, nresults=-1) at lj_api.c:1133
#11 0x000055e85fca3b94 in execute_lua_eval (L=0x4151c0b8) at /home/kosarev/tarantool/src/box/lua/call.c:394
#12 0x000055e85fe400cb in lj_BC_FUNCC ()
#13 0x000055e85fe02e15 in lua_pcall (L=0x4151c0b8, nargs=1, nresults=-1, errfunc=0) at lj_api.c:1158
#14 0x000055e85fd87bd6 in luaT_call (L=0x4151c0b8, nargs=1, nreturns=-1) at /home/kosarev/tarantool/src/lua/utils.c:1016
#15 0x000055e85fca42e1 in box_process_lua (handler=HANDLER_EVAL, ctx=0x7f0ce79ffd10, ret=0x7f0ce79ffe60) at /home/kosarev/tarantool/src/box/lua/call.c:570
#16 0x000055e85fca43ff in box_lua_eval (expr=0x7f0ce701803d "box.ctl.clear_synchro_queue()!\220\021", expr_len=29, args=0x7f0ce79ffd80, ret=0x7f0ce79ffe60)
    at /home/kosarev/tarantool/src/box/lua/call.c:614
#17 0x000055e85fc130bb in box_process_eval (request=0x7f0ce7010430, port=0x7f0ce79ffe60) at /home/kosarev/tarantool/src/box/call.c:187
#18 0x000055e85fcc75f4 in tx_process_call (m=0x7f0ce70103b0) at /home/kosarev/tarantool/src/box/iproto.cc:1632
#19 0x000055e85fdb062e in cmsg_deliver (msg=0x7f0ce70103b0) at /home/kosarev/tarantool/src/lib/core/cbus.c:375
#20 0x000055e85fdb170a in fiber_pool_f (ap=0x7f0ced801008) at /home/kosarev/tarantool/src/lib/core/fiber_pool.c:64
#21 0x000055e85fba0269 in fiber_cxx_invoke(fiber_func, typedef __va_list_tag __va_list_tag *) (f=0x55e85fdb156a <fiber_pool_f>, ap=0x7f0ced801008) at /home/kosarev/tarantool/src/lib/core/fiber.h:788
#22 0x000055e85fdaae15 in fiber_loop (data=0x0) at /home/kosarev/tarantool/src/lib/core/fiber.c:879
#23 0x000055e86001556d in coro_init () at /home/kosarev/tarantool/third_party/coro/coro.c:110

@Gerold103

Gerold103 commented Oct 8, 2020

A minimal 100% reproducer. Follow the steps in the specified order on the specified instances.
However, it is not related to why the test sometimes hangs; that I don't know.

-- Instance 1
-- Step 1
fiber = require('fiber')
box.cfg{
    listen = 3313,
    replication = {'localhost:3313', 'localhost:3314'},
    replication_synchro_quorum = 2,
    replication_synchro_timeout = 1000000,
    read_only = false,
}
box.schema.user.grant('guest', 'super')
s1 = box.schema.create_space('test1', {is_sync = true})
_ = s1:create_index('pk')

-- Step 3
for i = 1, 10 do s1:replace{i} end

-- Step 5
box.cfg{
    replication_synchro_timeout = 0.001,
}
box.ctl.clear_synchro_queue()

-- Instance 2
-- Step 2
box.cfg{
    listen = 3314,
    replication = {'localhost:3313', 'localhost:3314'},
    replication_synchro_quorum = 3,
    replication_synchro_timeout = 1000000,
    read_only = true,
}

-- Step 4
box.cfg{read_only = false}
box.space.test1:replace{11}
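
If I read the backtrace above right, step 5 should then abort instance 1 on the lsn > limbo->confirmed_lsn assertion in txn_limbo_write_confirm(), reached through the same box.ctl.clear_synchro_queue() path.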

sergepetrenko referenced this issue in tarantool/tarantool Oct 12, 2020
Instead of producing one of 2^10 possible diffs when select{} fails,
log the select{} output when the space contains fewer than 10
elements.
Requested by @avtikhon for easier flaky test handling.

Related to #5395
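
The resulting check is the one-liner that appears verbatim in the failure diffs quoted in this issue:

-- Log the space contents when it holds fewer than 10 tuples, instead of
-- producing one of 2^10 possible result-file diffs.
#c.space.test:select{} == 10 or require('log').error(c.space.test:select{})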
avtikhon referenced this issue in tarantool/tarantool Dec 11, 2020
Found that the replication/election_qsync_stress.test.lua test may fail
on restarting instances. This occurs on heavily loaded hosts, when the
local call to stop an instance with SIGTERM fails to stop it. Decided to
use SIGKILL in the local stop call options to be sure that the instance
will be stopped.

Also found that, with the loop running inline, new hangs occurred on server start:

  --- replication/election_qsync_stress.result    Thu Nov 12 16:23:16 2020
  +++ var/128_replication/election_qsync_stress.result    Thu Nov 12 16:31:22 2020
  @@ -323,7 +323,7 @@
    | ...
       test_run:wait_cond(function() return c.space.test ~= nil and c.space.test:get{i} ~= nil end)
    | ---
  - | - true
  + | - false
    | ...
  @@ -380,7 +380,7 @@
    | ...
       test_run:wait_cond(function() return c.space.test ~= nil and c.space.test:get{i} ~= nil end)
    | ---
  - | - true
  + | - false
    | ...
  @@ -494,687 +494,3 @@
    | ---
    | ...
       test_run:cmd('start server '..old_leader..' with wait=True, wait_load=True, args="2 0.4"')
  - | ---
  - | - true
  - | …

but the test had already failed earlier, on getting 'c.space.test:get{i}'.
To avoid the hang and make the test flow more correct, log.error
messages and return calls were added. The test was also changed to run
each loop iteration in a function, so that return values can be checked
and the loop broken right after a failure.

Also found that the test hangs on recreation of the replica. It happened
because replica creation had the wait_load flag enabled, which makes
test-run wait for the replica creation mark in the replica log file in
the seek_wait subroutine. To fix it, replication_sync_timeout was set to
5 seconds.

Needed for #5395
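
A minimal sketch (not from the commit; it assumes the test's existing test_run harness and netbox connection c) of the per-iteration function shape described above:

-- Hypothetical shape of the reworked loop body: run each iteration in a
-- function so its result can be checked and the loop broken on failure.
local log = require('log')
local function check_tuple(i)
    local ok = test_run:wait_cond(function()
        return c.space.test ~= nil and c.space.test:get{i} ~= nil
    end)
    if not ok then
        log.error('tuple %d did not arrive', i)
        return false
    end
    return true
end

for i = 1, 10 do
    if not check_tuple(i) then break end
end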
@Gerold103

Cases 1 and 2 are not present in the election_qsync.test.lua file at all. They existed in election_qsync_stress.test.lua in the beginning, but in tarantool/tarantool@eb989bc they were merged with other failures having the same cause.

Cases 4 and 5 were failing because the test didn't wait for the old leader to send all data to the other nodes. As a result, the nodes could have less data than needed, and the space didn't have enough tuples at the end of the test. It seems to be fixed by tarantool/tarantool@bf0fbf3.

Cases 3 and 4 (the latter had 2 issues) seem to be caused by old code relying on box.info.election.role == 'leader' being enough to be writable. That is still a problem (see tarantool/tarantool#6689), but the test got a workaround in this commit: tarantool/tarantool@71377c2.

I have run both election_qsync.test.lua and its stress version on my machine for half an hour and couldn't get any failures. It probably still fails in CI, but I believe for other reasons, no longer related to this ticket.

In summary: I think @sergepetrenko fixed this ticket quite some time ago in the commits above. Sergey, can you try it on your machine too? @sergos, you could also try (Sergey P. and I are both on Macs, I think; we could use a person with Linux)?

@Gerold103

The most recent new error on master, not yet listed in this ticket, is here: https://github.com/tarantool/tarantool/runs/5590105468?check_suite_focus=true. I have no idea what might be causing it; apparently it fails on bootstrap.

@sergos

sergos commented Mar 23, 2022

After running ./test-run.py $(yes election_qsync | head -100) a couple of times, I got:

[141] replication/election_qsync_stress.test.lua      vinyl           
[141] * [QA Notice]
[141] *
[141] * Attempt to evaluate a command on the nonexistent server 'election_replica0'
[141] *
[141] 
[141] * [QA Notice]
[141] *
[141] * Attempt to stop already stopped server 'election_replica1'
[141] *
[141] [ fail ]
[141] 
[141] Test failed! Result content mismatch:
[141] --- replication/election_qsync_stress.result      Fri Feb 11 13:34:49 2022
[141] +++ var/rejects/replication/election_qsync_stress.reject  Wed Mar 23 18:26:30 2022
[141] @@ -111,6 +111,7 @@
[141]      old_leader = new_leader
[141]  end;
[141]   | ---
[141] + | - error: Attempt to evaluate a command on the nonexistent server 'election_replica0'
[141]   | ...
[141]  test_run:cmd('setopt delimiter ""');
[141]   | ---
[141] @@ -119,7 +120,7 @@
[141]  -- We're connected to some leader.
[141]  #c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
[141]   | ---
[141] - | - true
[141] + | - error: Peer closed
[141]   | ...
[141]  
[141]  test_run:drop_cluster(SERVERS)
[141] 
[141] Last 15 lines of Tarantool Log file [Instance "master"][/workspaces/t.sergos/test/var/141_replication/master.log]:
[141] 2022-03-23 18:25:18.773 [15158] main/121/console/unix/: I> election_replica1 -> election_replica2: waiting for connection
[141] 2022-03-23 18:25:18.792 [15158] main/121/console/unix/: I> connecting election_replica1 -> election_replica2: {"status":"follow"}
[141] 2022-03-23 18:25:18.792 [15158] main/121/console/unix/: I> election_replica1 -> election_replica2: connected
[141] 2022-03-23 18:25:18.792 [15158] main/121/console/unix/: I> election_replica3 -> election_replica2: waiting for connection
[141] 2022-03-23 18:25:18.816 [15158] main/121/console/unix/: I> connecting election_replica3 -> election_replica2: {"status":"follow"}
[141] 2022-03-23 18:25:18.816 [15158] main/121/console/unix/: I> election_replica3 -> election_replica2: connected
[141] 2022-03-23 18:25:18.816 [15158] main/121/console/unix/: I> election_replica3: waiting bootstrap
[141] 2022-03-23 18:25:18.831 [15158] main/121/console/unix/: I> election_replica3: bootstrapped
[141] 2022-03-23 18:25:18.831 [15158] main/121/console/unix/: I> election_replica1 -> election_replica3: waiting for connection
[141] 2022-03-23 18:25:18.862 [15158] main/121/console/unix/: I> connecting election_replica1 -> election_replica3: {"status":"follow"}
[141] 2022-03-23 18:25:18.862 [15158] main/121/console/unix/: I> election_replica1 -> election_replica3: connected
[141] 2022-03-23 18:25:18.862 [15158] main/121/console/unix/: I> election_replica2 -> election_replica3: waiting for connection
[141] 2022-03-23 18:25:18.877 [15158] main/121/console/unix/: I> connecting election_replica2 -> election_replica3: {"status":"follow"}
[141] 2022-03-23 18:25:18.878 [15158] main/121/console/unix/: I> election_replica2 -> election_replica3: connected
[141] 2022-03-23 18:25:18.878 [15158] main/121/console/unix/: I> full mesh connected
[141] Test "replication/election_qsync_stress.test.lua", conf: "vinyl"
[141]   from "fragile" list failed, rerunning ...

Which apparently has only one way to appear:

    new_leader_nr = get_leader(nrs)
    new_leader = 'election_replica'..new_leader_nr
    leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]

or in get_leader() itself, which can return 0, so that 0 is then used in the eval.
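
A minimal guard sketch (not from the original comment; it assumes the test's get_leader(nrs) helper and the test_run harness) would wait for a real leader number before using it:

-- Hypothetical fix sketch: wait until some instance actually becomes the
-- leader instead of using a possibly-zero leader number right away.
test_run:wait_cond(function()
    new_leader_nr = get_leader(nrs)
    return new_leader_nr ~= 0
end, 10)
new_leader = 'election_replica'..new_leader_nr
leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]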

@sergos

sergos commented Mar 28, 2022

Got another failure:

[081] replication/election_qsync_stress.test.lua      memtx           
[081] 
[081] [Instance "election_replica2" returns with non-zero exit code: 1]
[081] 
[081] Last 15 lines of Tarantool Log file [Instance "election_replica2"][/workspaces/t.sergos/test/var/081_replication/election_replica2.log]:
[081] 2022-03-28 13:40:14.541 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock sio.c:210 E> SocketError: connect to unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock, called on fd 26, aka unix/:(socket): No such file or directory
[081] 2022-03-28 13:40:14.541 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock I> will retry every 0.10 second
[081] 2022-03-28 13:40:14.541 [30745] main/113/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica1.sock I> remote master 6dd0c5ad-e4f9-4d3f-adc5-25d83bc0db84 at unix/:/workspaces/t.sergos/test/var/081_replication/election_replica1.sock running Tarantool 2.10.0
[081] 2022-03-28 13:40:14.543 [30745] main/114/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica2.sock I> remote master afebf07a-a327-4890-8eaa-7f8cfc159eaa at unix/:/workspaces/t.sergos/test/var/081_replication/election_replica2.sock running Tarantool 2.10.0
[081] 2022-03-28 13:40:14.845 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock I> remote master e9e613a7-bbc2-4250-9f3b-e13ad7e46b45 at unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock running Tarantool 2.10.0
[081] 2022-03-28 13:40:14.846 [30745] main/103/election_replica2 I> connected to 3 replicas
[081] 2022-03-28 13:40:14.847 [30745] main/103/election_replica2 I> bootstrapping replica from e9e613a7-bbc2-4250-9f3b-e13ad7e46b45 at unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock
[081] 2022-03-28 13:40:14.847 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock I> can't read row
[081] 2022-03-28 13:40:14.847 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock box.cc:2869 E> ER_LOADING: Instance bootstrap hasn't finished yet
[081] 2022-03-28 13:40:14.847 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock I> will retry every 0.10 second
[081] 2022-03-28 13:40:14.947 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock I> can't read row
[081] 2022-03-28 13:40:14.947 [30745] main/115/applier/unix/:/workspaces/t.sergos/test/var/081_replication/election_replica3.sock box.cc:198 E> ER_READONLY: Can't modify data on a read-only instance - state is election follower with term 2, leader is 1 (6dd0c5ad-e4f9-4d3f-adc5-25d83bc0db84)
[081] 2022-03-28 13:40:14.947 [30745] main/103/election_replica2 box.cc:198 E> ER_READONLY: Can't modify data on a read-only instance - state is election follower with term 2, leader is 1 (6dd0c5ad-e4f9-4d3f-adc5-25d83bc0db84)
[081] 2022-03-28 13:40:14.947 [30745] main/103/election_replica2 F> can't initialize storage: Can't modify data on a read-only instance - state is election follower with term 2, leader is 1 (6dd0c5ad-e4f9-4d3f-adc5-25d83bc0db84)
[081] 2022-03-28 13:40:14.947 [30745] main/103/election_replica2 F> can't initialize storage: Can't modify data on a read-only instance - state is election follower with term 2, leader is 1 (6dd0c5ad-e4f9-4d3f-adc5-25d83bc0db84)
[081] [ fail ]

@sergos

sergos commented Mar 28, 2022

or in get_leader() itself, which can return 0, so that 0 is then used in the eval.

Apparently, we have an infinite election: replica2 (id == 2) tried to deliver its vote to replica3 (id == 1), and after two sequential votes for replica3 it starts voting for itself. Note that replica2 keeps receiving votes from replica3, but not vice versa.

replica2:
2022-03-28 14:10:52.663 [76330] main/112/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock I> can't read row
2022-03-28 14:10:52.663 [76330] main/112/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 26, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-03-28 14:10:52.663 [76330] main/112/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock I> will retry every 0.10 second
2022-03-28 14:10:52.663 [76330] relay/unix/:(socket)/101/main coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 27, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-03-28 14:10:52.663 [76330] relay/unix/:(socket)/101/main I> exiting the relay loop
2022-03-28 14:10:52.948 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: message {term: 4, vote: 1, state: candidate, vclock: {1: 9, 3: 3}} from 1
2022-03-28 14:10:52.948 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: bump term to 4, follow
2022-03-28 14:10:52.948 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: vote for 1, follow
2022-03-28 14:10:52.948 [76330] main/118/raft_worker I> RAFT: persisted state {term: 4, vote: 1}
2022-03-28 14:10:53.356 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: message {term: 5, vote: 1, state: candidate, vclock: {1: 9, 3: 3}} from 1
2022-03-28 14:10:53.356 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: bump term to 5, follow
2022-03-28 14:10:53.356 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: vote for 1, follow
2022-03-28 14:10:53.356 [76330] main/118/raft_worker I> RAFT: persisted state {term: 5, vote: 1}
2022-03-28 14:10:53.767 [76330] main I> RAFT: begin new election round
2022-03-28 14:10:53.767 [76330] main I> RAFT: bump term to 6, follow
2022-03-28 14:10:53.767 [76330] main I> RAFT: vote for 2, follow 
2022-03-28 14:10:53.767 [76330] main/118/raft_worker I> RAFT: persisted state {term: 6, vote: 2}
2022-03-28 14:10:53.767 [76330] main/118/raft_worker I> RAFT: enter candidate state with 1 self vote
2022-03-28 14:10:53.781 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: message {term: 6, vote: 1, state: candidate, vclock: {1: 9, 3: 3}} from 1
2022-03-28 14:10:53.781 [76330] main/114/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica3.sock I> RAFT: vote request is skipped - competing candidate
2022-03-28 14:10:54.179 [76330] main I> RAFT: begin new election round

replica3:
2022-03-28 14:10:49.513 [2385] main/116/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica2.sock I> can't read row
2022-03-28 14:10:49.513 [2385] main/116/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica2.sock xrow.c:229 E> ER_INVALID_MSGPACK: Invalid MsgPack - packet body
2022-03-28 14:10:49.587 [2385] main/127/main I> subscribed replica b5e078b0-b862-4d11-b726-7dba38835615 at fd 27, aka unix/:(socket), peer of unix/:(socket)
2022-03-28 14:10:49.587 [2385] main/127/main I> remote vclock {1: 9, 3: 3} local vclock {0: 3, 1: 9, 3: 3}
2022-03-28 14:10:49.588 [2385] main/106/gc I> wal/engine cleanup is resumed
2022-03-28 14:10:49.592 [2385] relay/unix/:(socket)/101/main I> recover from `/workspaces/t.sergos/test/var/083_replication/election_replica3/00000000000000000011.xlog'
2022-03-28 14:10:52.663 [2385] main/117/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock I> can't read row
2022-03-28 14:10:52.663 [2385] main/117/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 28, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-03-28 14:10:52.663 [2385] main/117/applier/unix/:/workspaces/t.sergos/test/var/083_replication/election_replica1.sock I> will retry every 0.10 second
2022-03-28 14:10:52.663 [2385] relay/unix/:(socket)/101/main coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 27, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-03-28 14:10:52.663 [2385] relay/unix/:(socket)/101/main I> exiting the relay loop
2022-03-28 14:10:52.947 [2385] main I> RAFT: begin new election round
2022-03-28 14:10:52.947 [2385] main I> RAFT: bump term to 4, follow
2022-03-28 14:10:52.947 [2385] main I> RAFT: vote for 1, follow
2022-03-28 14:10:52.947 [2385] main/118/raft_worker I> RAFT: persisted state {term: 4, vote: 1}
2022-03-28 14:10:52.948 [2385] main/118/raft_worker I> RAFT: enter candidate state with 1 self vote
2022-03-28 14:10:53.355 [2385] main I> RAFT: begin new election round
2022-03-28 14:10:53.355 [2385] main I> RAFT: bump term to 5, follow
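
To make the deadlock concrete, here is a toy model of that one-way vote flow (a sketch, nothing Tarantool-specific; the reachability matrix just mirrors the logs above). With the old leader gone, two instances remain, but the quorum is still 2, so a candidate needs the other survivor's vote:

    /* split_vote.c: toy model of the stuck election, not Tarantool code.
     * replica3 (raft id 1) never hears replica2 (raft id 2) because the
     * applier feeding the 2 -> 3 direction is dead. */
    #include <stdio.h>
    #include <stdbool.h>

    enum { N = 2, QUORUM = 2 };
    static const int raft_id[N] = {1, 2};       /* replica3, replica2 */
    /* hears[a][b]: node a receives what node b sends. */
    static const bool hears[N][N] = {
        {true, false},                  /* replica3 hears only itself */
        {true, true},                   /* replica2 hears both */
    };

    int main(void)
    {
        for (int cand = 0; cand < N; cand++) {
            int votes = 0;
            for (int voter = 0; voter < N; voter++) {
                /* A vote counts only if the request reaches the voter
                 * AND the reply makes it back to the candidate. */
                if (hears[voter][cand] && hears[cand][voter])
                    votes++;
            }
            printf("candidate id %d: %d vote(s), quorum %d -> %s\n",
                   raft_id[cand], votes, QUORUM,
                   votes >= QUORUM ? "leader" : "no leader, bump term");
        }
        return 0;
    }

Both candidates are stuck at their single self vote, which is exactly the endless term bumping in the logs.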

ylobankov transferred this issue from tarantool/tarantool on Apr 15, 2022
@sergepetrenko

I've run election_qsync_stress for quite some time, but couldn't reproduce the election hang that @sergos mentioned.

The ER_READONLY error was fixed recently in scope of tarantool/tarantool#6966

It seems that the case with infinite elections is caused by replica3's applier failing with ER_INVALID_MSGPACK. This explains why two instances may enter the infinite election loop: replica3 doesn't receive vote requests from replica2 (its applier is dead), so replica2 never receives 2 votes. At the same time, replica3 never receives votes for itself from replica2, so replica3 never receives 2 votes either. This results in infinite elections.

What causes the ER_INVALID_MSGPACK is a mystery, though. A similar error was fixed in scope of tarantool/tarantool#4040, so this might be another incarnation of that bug.
@sergos, could you please set log_level=6 in election_replica.lua and try to reproduce this once again? With log_level=6 the replica will dump the malformed row once ER_INVALID_MSGPACK occurs. This will give us a chance to understand what's causing the error.

@sergos

sergos commented Apr 26, 2022

I've got the error, but not the infinite elections. The reason surprises me: it's in the relay, not the applier:

2022-04-26 11:26:13.548 [53611] main/138/main I> set 'replication_synchro_timeout' configuration option to 1000
2022-04-26 11:26:13.717 [53611] main/113/main I> subscribed replica 38acc283-910b-40db-8ecc-e5ab45764352 at fd 23, aka unix/:(socket), peer of unix/:(socket)
2022-04-26 11:26:13.717 [53611] main/113/main I> remote vclock {1: 9} local vclock {0: 2, 1: 9, 3: 2}
2022-04-26 11:26:13.718 [53611] relay/unix/:(socket)/101/main I> recover from `/workspaces/t.sergos/test/var/027_replication/election_replica2/00000000000000000005.xlog'
2022-04-26 11:26:13.718 [53611] relay/unix/:(socket)/102/main:reader V> Got a corrupted row:
2022-04-26 11:26:13.718 [53611] relay/unix/:(socket)/102/main:reader V> 00000000: 81 00 00 81 26 81 01 09 03 01
2022-04-26 11:26:13.719 [53611] relay/unix/:(socket)/101/main xrow.c:229 E> ER_INVALID_MSGPACK: Invalid MsgPack - packet body
2022-04-26 11:26:13.719 [53611] relay/unix/:(socket)/101/main I> exiting the relay loop
2022-04-26 11:26:13.761 [53611] main/114/applier/unix/:/workspaces/t.sergos/test/var/027_replication/election_replica1.sock I> subscribed
2022-04-26 11:26:13.761 [53611] main/114/applier/unix/:/workspaces/t.sergos/test/var/027_replication/election_replica1.sock I> remote vclock {1: 9, 3: 2} local vclock {0: 2, 1: 9, 3: 2}
2022-04-26 11:26:13.761 [53611] main/114/applier/unix/:/workspaces/t.sergos/test/var/027_replication/election_replica1.sock I> RAFT: message {term: 3, leader: 3, leader is seen: true, state: follower} from 1

Unfortunately, the xlog file is missing, even after I removed the test from the fragile list.
I run the tests on the dev1 box in an Ubuntu Docker container, using a bind-type mount:

        "Mounts": [
            {
                "Type": "bind",
                "Source": "/home/s.ostanevich/workspaces",
                "Destination": "/workspaces",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            }
        ],

I will try to reproduce without Docker.

@sergepetrenko

The reason surprises me: it's in the relay, not the applier:

Wow, that's something already. Will take a look.

@sergos

sergos commented Apr 26, 2022

Reproduced on plain el7.

2022-04-26 17:32:05.853 [36183] main/113/main I> subscribed replica 92a2b565-790f-4a2c-aa4e-2264a93a1d8e at fd 29, aka unix/:(socket), peer of unix/:(socket)
2022-04-26 17:32:05.854 [36183] main/113/main I> remote vclock {1: 5} local vclock {0: 1, 1: 5}
>>>> 2022-04-26 17:32:05.857 [36183] relay/unix/:(socket)/101/main I> recover from `/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica2/00000000000000000004.xlog'
2022-04-26 17:32:05.866 [36183] main/127/main I> subscribed replica 70f0ed78-3d15-4bcc-9fd0-87ec1ba7ea20 at fd 31, aka unix/:(socket), peer of unix/:(socket)
2022-04-26 17:32:05.866 [36183] main/127/main I> remote vclock {1: 5} local vclock {0: 1, 1: 5}
2022-04-26 17:32:05.866 [36183] main/127/main I> RAFT: fencing resumed
2022-04-26 17:32:05.867 [36183] main/106/gc I> wal/engine cleanup is resumed
>>>> 2022-04-26 17:32:05.868 [36183] relay/unix/:(socket)/101/main I> recover from `/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica2/00000000000000000004.xlog'
2022-04-26 17:32:05.911 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> subscribed
2022-04-26 17:32:05.911 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> remote vclock {1: 5} local vclock {0: 1, 1: 5}
2022-04-26 17:32:05.912 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 2, leader: 1, leader is seen: true, state: follower} from 3
2022-04-26 17:32:05.914 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 2, leader: 1, leader is seen: true, state: follower} from 3
2022-04-26 17:32:05.914 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> leaving orphan mode
2022-04-26 17:32:07.011 [36183] main/115/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica1.sock I> can't read row
2022-04-26 17:32:07.011 [36183] main/115/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica1.sock coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 25, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-04-26 17:32:07.011 [36183] main/115/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica1.sock I> will retry every 0.10 second
>>>> 2022-04-26 17:32:07.011 [36183] relay/unix/:(socket)/101/main coio.c:349 E> SocketError: unexpected EOF when reading from socket, called on fd 31, aka unix/:(socket), peer of unix/:(socket): Broken pipe
2022-04-26 17:32:07.011 [36183] relay/unix/:(socket)/101/main I> exiting the relay loop
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 2, leader: 1, state: follower} from 3
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: begin new election round
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: bump term to 3, follow
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: vote for 2, follow
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 3, vote: 3, state: candidate, vclock: {1: 9}} from 3
2022-04-26 17:32:07.353 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: vote request is skipped - already voted in this term
2022-04-26 17:32:07.353 [36183] main/118/raft_worker I> RAFT: persisted state {term: 3, vote: 2}
2022-04-26 17:32:07.353 [36183] main/118/raft_worker I> RAFT: enter candidate state with 1 self vote
2022-04-26 17:32:07.766 [36183] main I> RAFT: begin new election round
2022-04-26 17:32:07.766 [36183] main I> RAFT: bump term to 4, follow
2022-04-26 17:32:07.766 [36183] main I> RAFT: vote for 2, follow
2022-04-26 17:32:07.766 [36183] main/118/raft_worker I> RAFT: persisted state {term: 4, vote: 2}
2022-04-26 17:32:07.766 [36183] main/118/raft_worker I> RAFT: enter candidate state with 1 self vote
2022-04-26 17:32:07.767 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 4, vote: 2, state: follower} from 3
2022-04-26 17:32:07.767 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: enter leader state with quorum 2
>>>> 2022-04-26 17:32:07.767 [36183] relay/unix/:(socket)/102/main:reader V> Got a corrupted row:
2022-04-26 17:32:07.767 [36183] relay/unix/:(socket)/102/main:reader V> 00000000: 81 00 00 81 26 81 01 09 02 01
2022-04-26 17:32:07.767 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: message {term: 4, vote: 2, leader: 2, leader is seen: true, state: follower} from 3
2022-04-26 17:32:07.767 [36183] main/117/applier/unix/:/export/s.ostanevich/workspaces/t.sergos2/test/var/029_replication/election_replica3.sock I> RAFT: vote request is skipped - the leader is already known - 2
>>>> 2022-04-26 17:32:07.767 [36183] relay/unix/:(socket)/101/main xrow.c:229 E> ER_INVALID_MSGPACK: Invalid MsgPack - packet body
2022-04-26 17:32:07.767 [36183] relay/unix/:(socket)/101/main I> exiting the relay loop
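
For reference, those ten bytes decode cleanly right up to the vclock and then fall apart. A standalone decoding sketch (it assumes Tarantool's msgpuck library, https://github.com/tarantool/msgpuck, and the build line is a guess):

    /* decode_row.c: walk the corrupted ack row from the log above.
     * Assumed build: cc decode_row.c -lmsgpuck */
    #include <stdio.h>
    #include <msgpuck.h>

    int main(void)
    {
        const char row[] = "\x81\x00\x00\x81\x26\x81\x01\x09\x02\x01";
        const char *p = row;
        const char *end = row + sizeof(row) - 1;

        printf("header: map(%u)\n", mp_decode_map(&p));     /* 0x81 */
        printf("  key %llu (IPROTO_REQUEST_TYPE)\n",
               (unsigned long long)mp_decode_uint(&p));     /* 0x00 */
        printf("  value %llu (IPROTO_OK, an ack)\n",
               (unsigned long long)mp_decode_uint(&p));     /* 0x00 */
        printf("body: map(%u)\n", mp_decode_map(&p));       /* 0x81 */
        printf("  key 0x%02llx (IPROTO_VCLOCK)\n",
               (unsigned long long)mp_decode_uint(&p));     /* 0x26 */
        uint32_t n = mp_decode_map(&p);                     /* 0x81: ONE entry */
        printf("  vclock: map(%u)\n", n);
        for (uint32_t i = 0; i < n; i++) {
            unsigned long long id = mp_decode_uint(&p);
            unsigned long long lsn = mp_decode_uint(&p);
            printf("    {%llu: %llu}\n", id, lsn);          /* {1: 9} */
        }
        /* The declared body is over, yet 0x02 0x01 ({2: 1}) remains. */
        printf("trailing bytes: %ld\n", (long)(end - p));
        return 0;
    }

The vclock map header promises a single component, but a second one ({2: 1}) follows it, so the reader finds garbage past the declared body and fails with ER_INVALID_MSGPACK. The commit below pins down how such a row gets produced.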

@sergepetrenko

Found the issue together with @sergos:
tarantool/tarantool#7089

sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Apr 27, 2022
When applier ack writer was moved to applier thread, it was overlooked
that it would start sharing replicaset.vclock between two threads.

This could lead to the following replication errors on master:

 relay//102/main:reader V> Got a corrupted row:
 relay//102/main:reader V> 00000000: 81 00 00 81 26 81 01 09 02 01

Such a row has an incorrectly-encoded vclock: `81 01 09 02 01`.
When writer fiber encoded the vclock length (`81`), there was only one
vclock component: {1: 9}, but at the moment of iterating over the
components, another WAL write was reported to TX thread, which bumped
the second vclock component {1: 9, 2: 1}.

Let's fix the race by delivering a copy of current replicaset vclock to
the applier thread.

Closes tarantool#7089
Part-of tarantool/tarantool-qa#166

NO_DOC=internal fix
NO_CHANGELOG=internal fix
NO_TEST=hard to test
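
A minimal sketch of that race and of the copy-based fix. The toy_vclock type and both function names are hypothetical stand-ins for the actual applier code; only the msgpuck encoders (mp_encode_map(), mp_encode_uint()) are real:

    /* vclock_race.c: why encoding from a shared vclock is unsafe.
     * Assumed build: cc vclock_race.c -lmsgpuck */
    #include <msgpuck.h>

    struct toy_vclock {
        uint32_t size;
        struct { uint64_t id, lsn; } comp[32];
    };

    /* BUGGY: encodes straight from a vclock shared with the TX thread. */
    static char *encode_ack_vclock(char *p, const struct toy_vclock *shared)
    {
        p = mp_encode_map(p, shared->size); /* emits 0x81 while size == 1 */
        /* If the TX thread bumps the vclock to {1: 9, 2: 1} right here,
         * the loop emits TWO components under the ONE-entry map header
         * already written: that is the `81 01 09 02 01` tail above. */
        for (uint32_t i = 0; i < shared->size; i++) {
            p = mp_encode_uint(p, shared->comp[i].id);
            p = mp_encode_uint(p, shared->comp[i].lsn);
        }
        return p;
    }

    /* FIXED, in the commit's spirit: the writer works on a snapshot
     * taken on the owning thread, so the header and the contents
     * cannot diverge mid-encoding. */
    static char *encode_ack_vclock_fixed(char *p,
                                         const struct toy_vclock *shared)
    {
        struct toy_vclock copy = *shared;
        p = mp_encode_map(p, copy.size);
        for (uint32_t i = 0; i < copy.size; i++) {
            p = mp_encode_uint(p, copy.comp[i].id);
            p = mp_encode_uint(p, copy.comp[i].lsn);
        }
        return p;
    }

    int main(void)
    {
        struct toy_vclock v = {1, {{1, 9}}};
        char buf[64];
        /* Single-threaded here, so both paths emit 81 01 09; the
         * difference only shows up under a concurrent bump. */
        return (encode_ack_vclock(buf, &v) - buf) !=
               (encode_ack_vclock_fixed(buf, &v) - buf);
    }
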
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 11, 2022
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 11, 2022
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 16, 2022
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 17, 2022
kyukhin pushed a commit to tarantool/tarantool that referenced this issue May 18, 2022
kyukhin pushed a commit to tarantool/tarantool that referenced this issue May 18, 2022
@sergepetrenko

Seems like we've fixed all the known issues with this test. No failures on my machine after 1000 runs.
Closing.

mkokryashkin pushed a commit to mkokryashkin/tarantool that referenced this issue Sep 9, 2022