Description
Hi all,
It all started in our Heroku review environments when we upgraded our application from Erlang 21.3.7 / Elixir 1.8.2 to Erlang 22.0.7 / Elixir 1.9.1. At that point, messages were not being consumed or produced, and thousands of timeout errors were flooding the logs. We noticed that, for some reason, downgrading Erlang back to 21.3.7 resolved the timeouts. We then bumped to kafka_ex 0.10, which seemed to improve things a bit (fewer timeouts in the logs, and messages being consumed again).
However, in the last week we started getting the same timeouts in production, although at a lower frequency (~10 timeouts/day).
Some extra information:
- In the Heroku review-envs we use Basic Kafka instances (Multi-Tenant);
- In production, we use a Private Standard Kafka instance;
- We currently have 28 topics created.
I also tried kafka_ex from the master branch, with the new kayrock-based API, but it didn't seem to help: lots of timeouts keep flooding the logs.
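For reference, our kafka_ex configuration is roughly along these lines. This is an illustrative sketch, not our exact settings: the broker host is the anonymized one from the logs, and the sync_timeout value is just what we experimented with while debugging:

```elixir
# config/config.exs -- sketch, values are placeholders/assumptions
config :kafka_ex,
  # anonymized Heroku Kafka endpoint from the logs (SSL on port 9096)
  brokers: [{"xxxxxxxxx.amazonaws.com", 9096}],
  use_ssl: true,
  # on master, the new kayrock-based client is selected via this version string
  kafka_version: "kayrock",
  # raised from the default while trying to work around the timeouts
  sync_timeout: 10_000
```

Raising sync_timeout did not make the errors go away, only less frequent.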
The error on 0.10:
2019-11-22T22:25:24.679435+00:00 app[web.1]: 22:25:24.672 [error] GenServer #PID<0.2066.0> terminating
2019-11-22T22:25:24.679445+00:00 app[web.1]: ** (MatchError) no match of right hand side value: :timeout
2019-11-22T22:25:24.679448+00:00 app[web.1]: (kafka_ex) lib/kafka_ex/gen_consumer.ex:678: KafkaEx.GenConsumer.consume/1
2019-11-22T22:25:24.679450+00:00 app[web.1]: (kafka_ex) lib/kafka_ex/gen_consumer.ex:630: KafkaEx.GenConsumer.handle_info/2
2019-11-22T22:25:24.679452+00:00 app[web.1]: (stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
2019-11-22T22:25:24.679454+00:00 app[web.1]: (stdlib) gen_server.erl:711: :gen_server.handle_msg/6
2019-11-22T22:25:24.679458+00:00 app[web.1]: (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
2019-11-22T22:25:24.679461+00:00 app[web.1]: Last message: :timeout
2019-11-22T22:25:24.679465+00:00 app[web.1]: State: %KafkaEx.GenConsumer.State{acked_offset: 0, auto_offset_reset: :earliest, commit_interval: 5000, commit_threshold: 500, committed_offset: 0, consumer_module: FOO.Consumer, consumer_state: nil, current_offset: 0, fetch_options: [auto_commit: false, worker_name: #PID<0.2289.0>], group: "foo.xxx.yyy", last_commit: -576460703586, partition: 0, topic: "bar.baz.bla", worker_name: #PID<0.2289.0>}
And the error now with master:
2019-11-28T19:30:08.763137+00:00 app[web.1]: 19:30:08.762 [debug] Successfully connected to broker "xxxxxxxxx.amazonaws.com":9096
2019-11-28T19:30:08.816252+00:00 app[web.1]: 19:30:08.815 [debug] Successfully connected to broker "xxxxxxxxx.amazonaws.com":9096
2019-11-28T19:30:08.863527+00:00 app[web.1]: 19:30:08.863 [error] Receiving data from broker "xxxxxxxxx.amazonaws.com":9096 failed with :timeout
2019-11-28T19:30:08.869546+00:00 app[web.1]: 19:30:08.869 [debug] Successfully connected to broker "xxxxxxxxx.amazonaws.com":9096
2019-11-28T19:30:08.879048+00:00 app[web.1]: 19:30:08.878 [error] GenServer #PID<0.9105.93> terminating
2019-11-28T19:30:08.879052+00:00 app[web.1]: ** (stop) {:error, :timeout}
2019-11-28T19:30:08.879053+00:00 app[web.1]: Last message: :timeout
2019-11-28T19:30:08.879063+00:00 app[web.1]: State: %KafkaEx.GenConsumer.State{acked_offset: 0, api_versions: %{fetch: 0, offset_commit: 0, offset_fetch: 0}, auto_offset_reset: :earliest, commit_interval: 5000, commit_threshold: 500, committed_offset: 0, consumer_module: FOO.Consumer, consumer_state: nil, current_offset: 0, fetch_options: [auto_commit: false, worker_name: #PID<0.19074.91>], generation_id: 136, group: "foo-51686.foo_service", last_commit: -576424651131, member_id: "kafka_ex-fe6cdc27-cf12-4c45-9beb-76fcc2a1cdf8", partition: 0, topic: "foo.zzz.yyy", worker_name: #PID<0.19074.91>}
2019-11-28T19:30:08.926883+00:00 app[web.1]: 19:30:08.926 [debug] Successfully connected to broker "xxxxxxxxx.amazonaws.com":9096
2019-11-28T19:30:08.971555+00:00 app[web.1]: 19:30:08.971 [debug] Successfully connected to broker "xxxxxxxxx.amazonaws.com":9096
The "Successfully connected..." and "Receiving data from broker..." messages are also logged indefinitely, many per second.
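For completeness, the consumers are started via KafkaEx.ConsumerGroup under our application supervisor, along the lines of the sketch below. Module, group, and topic names are the anonymized ones from the logs above, and commit_interval matches the 5000 shown in the state dumps; the rest of the wiring is simplified:

```elixir
# Simplified sketch of the consumer supervision tree (names anonymized).
children = [
  %{
    id: KafkaEx.ConsumerGroup,
    start:
      {KafkaEx.ConsumerGroup, :start_link,
       [FOO.Consumer, "foo.xxx.yyy", ["bar.baz.bla"], [commit_interval: 5_000]]}
  }
]

Supervisor.start_link(children, strategy: :one_for_one)
```

When the timeouts hit, these GenConsumer processes terminate as shown above and get restarted by the supervisor, which is why the connect/timeout log lines repeat endlessly.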