feat(matcher): cache labels matcher regexp results #4345
Conversation
In some scenarios Alertmanager might use labels regexp matching heavily:

- Inhibition
- Silences
- Dispatch

The same alert label strings are matched against the same regexp expressions repeatedly. We have observed ~25% cumulative CPU time spent in `regexp.(*Regexp).MatchString` in captured CPU profiles.

This change implements a cache in the labels matcher to store regexp results. It improves the performance of silence queries by ~39-47% in existing benchmarks. Similar improvements are expected in inhibition and dispatch when regexp matching is used.

```
goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/silence
cpu: Apple M3 Pro
                                     │ silence-main.txt │     silence-cached.txt     │
                                     │      sec/op      │    sec/op      vs base     │
Mutes/1_silence_mutes_alert-12             540.7n ± 1%     540.5n ± 0%        ~ (p=0.271 n=10)
Mutes/10_silences_mute_alert-12            1.391µ ± 0%     1.402µ ± 1%   +0.83% (p=0.000 n=10)
Mutes/100_silences_mute_alert-12           8.626µ ± 2%     8.739µ ± 1%   +1.30% (p=0.001 n=10)
Mutes/1000_silences_mute_alert-12          102.1µ ± 1%     106.4µ ± 1%   +4.21% (p=0.000 n=10)
Mutes/10000_silences_mute_alert-12         1.121m ± 1%     1.117m ± 0%   -0.36% (p=0.004 n=10)
Query/100_silences-12                     14.413µ ± 0%     8.206µ ± 2%  -43.06% (p=0.000 n=10)
Query/1000_silences-12                    153.99µ ± 0%     93.09µ ± 1%  -39.54% (p=0.000 n=10)
Query/10000_silences-12                    1.979m ± 0%     1.060m ± 1%  -46.45% (p=0.000 n=10)
geomean                                    36.66µ          29.89µ       -18.46%

                                     │ silence-main.txt │     silence-cached.txt     │
                                     │       B/op       │     B/op       vs base     │
Mutes/1_silence_mutes_alert-12            1.180Ki ± 0%    1.180Ki ± 0%        ~ (p=1.000 n=10) ¹
Mutes/10_silences_mute_alert-12           4.227Ki ± 0%    4.227Ki ± 0%        ~ (p=1.000 n=10) ¹
Mutes/100_silences_mute_alert-12          33.32Ki ± 0%    33.32Ki ± 0%        ~ (p=1.000 n=10) ¹
Mutes/1000_silences_mute_alert-12         304.7Ki ± 0%    304.7Ki ± 0%   +0.00% (p=0.000 n=10)
Mutes/10000_silences_mute_alert-12        3.526Mi ± 0%    3.526Mi ± 0%   +0.02% (p=0.000 n=10)
Query/100_silences-12                     4.753Ki ± 0%    4.750Ki ± 0%   -0.06% (p=0.000 n=10)
Query/1000_silences-12                    39.92Ki ± 0%    39.91Ki ± 0%   -0.04% (p=0.000 n=10)
Query/10000_silences-12                   523.8Ki ± 0%    523.7Ki ± 0%   -0.01% (p=0.000 n=10)
geomean                                   45.44Ki         45.43Ki        -0.01%
¹ all samples are equal

                                     │ silence-main.txt │     silence-cached.txt     │
                                     │    allocs/op     │   allocs/op    vs base     │
Mutes/1_silence_mutes_alert-12              19.00 ± 0%      19.00 ± 0%        ~ (p=1.000 n=10) ¹
Mutes/10_silences_mute_alert-12             40.00 ± 0%      40.00 ± 0%        ~ (p=1.000 n=10) ¹
Mutes/100_silences_mute_alert-12            139.0 ± 0%      139.0 ± 0%        ~ (p=1.000 n=10) ¹
Mutes/1000_silences_mute_alert-12          1.050k ± 0%     1.050k ± 0%        ~ (p=1.000 n=10) ¹
Mutes/10000_silences_mute_alert-12         10.09k ± 0%     10.10k ± 0%   +0.09% (p=0.000 n=10)
Query/100_silences-12                       32.00 ± 0%      32.00 ± 0%        ~ (p=1.000 n=10) ¹
Query/1000_silences-12                      128.0 ± 0%      128.0 ± 0%        ~ (p=1.000 n=10) ¹
Query/10000_silences-12                    1.038k ± 0%     1.038k ± 0%        ~ (p=1.000 n=10) ¹
geomean                                     216.1           216.1        +0.01%
¹ all samples are equal
```

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
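For illustration, a minimal sketch of what memoizing `MatchString` results per matcher could look like. The package, type, and function names here are illustrative, not the PR's actual code; the `^(?:…)$` anchoring mirrors how Prometheus-style matchers anchor their regexps.

```go
package matcher

import (
	"regexp"
	"sync"
)

// cachingRegexpMatcher memoizes MatchString results per matcher.
// Illustrative only: a real implementation would need bounds/eviction.
type cachingRegexpMatcher struct {
	re    *regexp.Regexp
	mtx   sync.RWMutex
	cache map[string]bool // label value -> match result
}

func newCachingRegexpMatcher(pattern string) (*cachingRegexpMatcher, error) {
	// Fully anchor the pattern, as Prometheus-style matchers do.
	re, err := regexp.Compile("^(?:" + pattern + ")$")
	if err != nil {
		return nil, err
	}
	return &cachingRegexpMatcher{re: re, cache: map[string]bool{}}, nil
}

// Matches returns the cached result when the same label value has been
// seen before, otherwise it runs the regexp and stores the outcome.
func (m *cachingRegexpMatcher) Matches(v string) bool {
	m.mtx.RLock()
	res, ok := m.cache[v]
	m.mtx.RUnlock()
	if ok {
		return res
	}
	res = m.re.MatchString(v)
	m.mtx.Lock()
	m.cache[v] = res
	m.mtx.Unlock()
	return res
}
```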
You never evict stale items from the cache, but matchers are long lived (consider the case of routes). The Alertmanager will just accumulate memory as the cache grows.
This was my thinking as well, but then I concluded that memory usage would map statically to the number of items configured in the Alertmanager and Prometheus config? The more items you have, the more memory is needed, or am I missing something 👀 Nevertheless, @siavashs, I think adding metrics on cache hits, misses and entries would help map memory increases to cache items, so that if OOMs happen we can pin them to this cache.
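A hypothetical sketch of such instrumentation with client_golang; the metric and package names are made up for illustration and are not part of this PR:

```go
package matcher

import "github.com/prometheus/client_golang/prometheus"

// matcherCacheMetrics tracks cache effectiveness and size so memory growth
// can be attributed to the cache. Metric names are illustrative.
type matcherCacheMetrics struct {
	hits    prometheus.Counter
	misses  prometheus.Counter
	entries prometheus.Gauge
}

func newMatcherCacheMetrics(r prometheus.Registerer) *matcherCacheMetrics {
	m := &matcherCacheMetrics{
		hits: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "alertmanager_matcher_cache_hits_total",
			Help: "Total number of label matcher cache hits.",
		}),
		misses: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "alertmanager_matcher_cache_misses_total",
			Help: "Total number of label matcher cache misses.",
		}),
		entries: prometheus.NewGauge(prometheus.GaugeOpts{
			Name: "alertmanager_matcher_cache_entries",
			Help: "Current number of entries in the label matcher cache.",
		}),
	}
	if r != nil {
		r.MustRegister(m.hits, m.misses, m.entries)
	}
	return m
}
```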
Memory usage will increase as the number of unique label sets seen over time increases. For example, if you have 100 label matchers in memory, but you only see 100 unique alerts over a 24 hour period, then in the worst case you will cache 100 * 100 = 10,000 entries in those 24 hours. If, on the other hand, you have the same number of label matchers in memory, but churn through 1 million unique alerts over a 24 hour period (consider large multi-tenant systems like Grafana Mimir), then you will cache 100 * 1 million = 100 million entries in memory in those 24 hours. As you never evict them, the cache will keep accumulating until you OOM.
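A back-of-the-envelope sketch of that worst case; the ~64 bytes per entry figure is an assumption (map overhead plus a short key and a bool), not a measurement:

```go
package main

import "fmt"

func main() {
	matchers := 100                    // regexp matchers held in memory
	uniqueValues := 1_000_000          // unique label values churned over 24h
	entries := matchers * uniqueValues // worst case: 100 million cached results

	// Assumed average cost per cached entry; a guess, not a measurement.
	const approxBytesPerEntry = 64
	fmt.Printf("worst case: %d entries, ~%d MiB\n",
		entries, entries*approxBytesPerEntry/(1<<20))
}
```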
What is your suggestion for cache expiry? Should we use a hard (configurable) limit like 1000 entries, or use last hit time and/or number of hits to evict stale items?
Well, I think some analysis is required here. With a hard limit like 1000 entries and no expiration, your cache could turn into a 100% miss rate as the data changes over time. With something like LRU, I suspect you will find very high churn when you have lots of alerts: Prometheus resends all of its alerts to the Alertmanager at regular intervals, and I anticipate this will manifest as a high cache miss rate. However, feel free to run some experiments here. Probably an LFU cache makes the most sense, as you want to cache the most frequent inputs. You may also find different access patterns for dispatching, silences and inhibitions, as these run at different times. What's the problem you are trying to solve, out of interest? Is it something specific like silences, or just general regex matching?
I also think LFU would be the best option here. In our setup we have noticed Alertmanager's dispatcher is using 25% of the total CPU time during alert spikes to do label matching with regex expressions. The root cause is the (bad) practice of using a […]

I'll try to implement an LFU cache and run some experiments with benchmarks for the dispatcher and regex benchmarks for the inhibitor.
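A minimal sketch of what a bounded LFU cache could look like; names and the naive O(n) eviction scan are illustrative, not the eventual implementation. Eviction happens inline in `Set`, so no background goroutine is needed.

```go
package matcher

import "sync"

// lfuCache is a bounded, least-frequently-used cache for regexp match
// results. The O(n) eviction scan keeps the sketch short; a real cache
// would use a frequency-bucket structure.
type lfuCache struct {
	mtx     sync.Mutex
	maxSize int
	items   map[string]*lfuEntry
}

type lfuEntry struct {
	result bool
	hits   uint64
}

func newLFUCache(maxSize int) *lfuCache {
	return &lfuCache{maxSize: maxSize, items: make(map[string]*lfuEntry, maxSize)}
}

// Get returns the cached result and bumps the entry's frequency counter.
func (c *lfuCache) Get(key string) (result, ok bool) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	e, found := c.items[key]
	if !found {
		return false, false
	}
	e.hits++
	return e.result, true
}

// Set stores a result, evicting the least frequently used entry first
// when the cache is full.
func (c *lfuCache) Set(key string, result bool) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	if _, exists := c.items[key]; !exists && len(c.items) >= c.maxSize {
		victim, minHits := "", ^uint64(0)
		for k, e := range c.items {
			if e.hits < minHits {
				victim, minHits = k, e.hits
			}
		}
		delete(c.items, victim)
	}
	c.items[key] = &lfuEntry{result: result, hits: 1}
}
```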
With the cache, be conscious of the overhead of the cache itself. Some large multi-tenant installations can have 100,000s of label matchers across various configuration files, silences, inhibition rules, etc. I would also avoid using a goroutine for eviction.
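A hypothetical benchmark sketch (not part of this PR) for gauging the cost of the cache itself under churn, where every label value is new and each lookup is therefore a miss:

```go
package matcher

// (in a _test.go file) Under full churn the cache only adds lookup cost and
// unbounded memory, which is the overhead worth measuring here.

import (
	"fmt"
	"regexp"
	"sync"
	"testing"
)

func BenchmarkRegexpChurn(b *testing.B) {
	res := make([]*regexp.Regexp, 100)
	for i := range res {
		res[i] = regexp.MustCompile(fmt.Sprintf("^instance-%d-.*$", i))
	}

	b.Run("uncached", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			res[i%len(res)].MatchString(fmt.Sprintf("instance-%d-%d", i%len(res), i))
		}
	})

	b.Run("cached", func(b *testing.B) {
		var cache sync.Map // label value -> match result, never evicted
		for i := 0; i < b.N; i++ {
			v := fmt.Sprintf("instance-%d-%d", i%len(res), i)
			if _, ok := cache.Load(v); !ok {
				cache.Store(v, res[i%len(res)].MatchString(v))
			}
		}
	})
}
```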