Segmentation fault with Stackdriver output plugin #2580
Another similar, but slightly different stack trace:
I had no reason to think an upgrade to 1.5.7 would improve things, but the following are stack traces captured from some Pods post-upgrade:
I updated my Fluent Bit config; the diff is as follows:
The configuration in its entirety is now:
I will continue to monitor, but I've noticed an immediate improvement in the health of the Pods.
FYI: re-opening since @JeffLuoo will take a look at this.
Hi @slewiskelly, thanks for the detailed issue description. I have some questions regarding this issue: is the configuration for 1.5.7 the same as the one for 1.5.6? And,
@JeffLuoo, thanks for taking a look.
The configuration was the same between versions, until I made the configuration update described in #2580 (comment).
I can't find any historical logs with that specific message. For the most part, the only logs observed before crashes occurred were:
@slewiskelly I see. Thanks for the update! I just checked the code and found that the log message:
will only show up if the log level of Fluent Bit is set to "debug". What this message means is that there is no field in your JSON message with the key:
so it will use the tag value of the log to assign the value of local_resource_id. local_resource_id is just the name of the variable I used to assign the metadata value in the final log entry for the k8s_container resource type. According to the error message:
the error can be narrowed down to this function:
I will add some information to the error message (the local_resource_id that is passed in to this function). This will help us debug better and see whether the local_resource_id passed in is valid or not. We might also need @slewiskelly to reproduce the error again to see what the value of local_resource_id is, since I tested it locally but still did not see the error. Thank you!
cc @erain: Hi Yu, I am wondering whether you have seen this kind of error before, since I don't have access to deploy Fluent Bit in a GKE environment. Thank you!
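For readers following along, here is a minimal, hypothetical C sketch of the fallback described above. It is not the actual out_stackdriver source; the function name and arguments are assumptions for illustration. The idea is that the record's own field, when present, provides local_resource_id, and only when it is absent does the tag take its place; the later tokenization of that value for the k8s_container resource type is where an unexpected tag shape could plausibly lead to the crash being investigated.

```c
/* Hypothetical sketch only -- not the actual out_stackdriver code.
 * It illustrates the fallback described above: if the record carries no
 * local_resource_id field, the log's tag is used instead. */
#include <stdio.h>

/* Return the value to use as local_resource_id: prefer the field taken
 * from the JSON record, otherwise fall back to the tag. */
static const char *resolve_local_resource_id(const char *record_field,
                                             const char *tag)
{
    if (record_field != NULL && record_field[0] != '\0') {
        return record_field;
    }
    /* For the k8s_container resource type this value is later split into
     * namespace / pod / container parts; a value that does not have that
     * shape is what a stricter validity check (as proposed above) would
     * flag before any further parsing. */
    return tag;
}

int main(void)
{
    /* Record that carries the field explicitly: the field wins. */
    printf("%s\n", resolve_local_resource_id(
        "k8s_container.default.my-pod.my-container",
        "kube.var.log.containers.my-pod_default_my-container"));

    /* Record without the field: the tag is used as-is. */
    printf("%s\n", resolve_local_resource_id(
        NULL,
        "kube.var.log.containers.my-pod_default_my-container"));
    return 0;
}
```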
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
Just did an install on CentOS 8 and I'm encountering this issue, seeing a similar segfault in
Stack trace (relevant parts, e.g. no threads in epoll_wait):
Fluent Bit is installed from Google Cloud's Ops Agent (I don't think this is Google-specific, though):
My bad, this was the result of using
Did anyone find a fix for this?
We faced the same error in our environments (on Kubernetes, fluent-bit v1.8.15). Some crash logs (debug log enabled):
Bug Report
Describe the bug
When deploying Fluent Bit into our Kubernetes cluster, a portion of the Pods crash with the following errors:
To Reproduce
It's difficult to provide reproducible steps given the nature of the environment.
Given some advice on how to troubleshoot further, I will be able to provide more specific information than what I have provided in the "Additional context" section.
Expected behavior
Fluent Bit should not crash, and/or should display more information at the appropriate log level (error or above).
Your Environment
fluent/fluent-bit:1.5.6
Additional context
Fluent Bit is deployed in a multi-tenant environment with a variety of log formats (though mostly JSON formatted).
I've tested Fluent Bit on Kubernetes with Stackdriver quite extensively using JSON-formatted log files and have not observed these issues. It is only when deploying to a heterogeneous environment that I observe the failures.
When the Pods are restarted, only a portion of them crash (for an indeterminate amount of time). However, it seems they do eventually recover.
I have collected some debug logs, but I can't make any correlation after a cursory look over them. I can share them, but I will first have to ensure there is no sensitive information included.