feat(matcher): cache labels matcher regexp results #4345

Open

wants to merge 1 commit into base: main

Conversation

siavashs
Contributor

@siavashs siavashs commented Apr 8, 2025

In some scenarios, Alertmanager may use label regexp matching heavily:

  • Inhibition
  • Silences
  • Dispatch

The same alert label strings are matched against the same regexp expressions repeatedly.

We have observed ~25% of cumulative CPU time spent in `regexp.(*Regexp).MatchString` in captured CPU profiles.

This change implements a cache in the labels matcher to store regexp match results.

It improves the performance of silence queries by ~39-47% in existing benchmarks. Similar improvements are expected in inhibition and dispatch when regexp matching is used.

```
goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/silence
cpu: Apple M3 Pro
                                   │ silence-main.txt │         silence-cached.txt          │
                                   │      sec/op      │   sec/op     vs base                │
Mutes/1_silence_mutes_alert-12            540.7n ± 1%   540.5n ± 0%        ~ (p=0.271 n=10)
Mutes/10_silences_mute_alert-12           1.391µ ± 0%   1.402µ ± 1%   +0.83% (p=0.000 n=10)
Mutes/100_silences_mute_alert-12          8.626µ ± 2%   8.739µ ± 1%   +1.30% (p=0.001 n=10)
Mutes/1000_silences_mute_alert-12         102.1µ ± 1%   106.4µ ± 1%   +4.21% (p=0.000 n=10)
Mutes/10000_silences_mute_alert-12        1.121m ± 1%   1.117m ± 0%   -0.36% (p=0.004 n=10)
Query/100_silences-12                    14.413µ ± 0%   8.206µ ± 2%  -43.06% (p=0.000 n=10)
Query/1000_silences-12                   153.99µ ± 0%   93.09µ ± 1%  -39.54% (p=0.000 n=10)
Query/10000_silences-12                   1.979m ± 0%   1.060m ± 1%  -46.45% (p=0.000 n=10)
geomean                                   36.66µ        29.89µ       -18.46%

                                   │ silence-main.txt │          silence-cached.txt           │
                                   │       B/op       │     B/op      vs base                 │
Mutes/1_silence_mutes_alert-12           1.180Ki ± 0%   1.180Ki ± 0%       ~ (p=1.000 n=10) ¹
Mutes/10_silences_mute_alert-12          4.227Ki ± 0%   4.227Ki ± 0%       ~ (p=1.000 n=10) ¹
Mutes/100_silences_mute_alert-12         33.32Ki ± 0%   33.32Ki ± 0%       ~ (p=1.000 n=10) ¹
Mutes/1000_silences_mute_alert-12        304.7Ki ± 0%   304.7Ki ± 0%  +0.00% (p=0.000 n=10)
Mutes/10000_silences_mute_alert-12       3.526Mi ± 0%   3.526Mi ± 0%  +0.02% (p=0.000 n=10)
Query/100_silences-12                    4.753Ki ± 0%   4.750Ki ± 0%  -0.06% (p=0.000 n=10)
Query/1000_silences-12                   39.92Ki ± 0%   39.91Ki ± 0%  -0.04% (p=0.000 n=10)
Query/10000_silences-12                  523.8Ki ± 0%   523.7Ki ± 0%  -0.01% (p=0.000 n=10)
geomean                                  45.44Ki        45.43Ki       -0.01%
¹ all samples are equal

                                   │ silence-main.txt │          silence-cached.txt          │
                                   │    allocs/op     │  allocs/op   vs base                 │
Mutes/1_silence_mutes_alert-12             19.00 ± 0%    19.00 ± 0%       ~ (p=1.000 n=10) ¹
Mutes/10_silences_mute_alert-12            40.00 ± 0%    40.00 ± 0%       ~ (p=1.000 n=10) ¹
Mutes/100_silences_mute_alert-12           139.0 ± 0%    139.0 ± 0%       ~ (p=1.000 n=10) ¹
Mutes/1000_silences_mute_alert-12         1.050k ± 0%   1.050k ± 0%       ~ (p=1.000 n=10) ¹
Mutes/10000_silences_mute_alert-12        10.09k ± 0%   10.10k ± 0%  +0.09% (p=0.000 n=10)
Query/100_silences-12                      32.00 ± 0%    32.00 ± 0%       ~ (p=1.000 n=10) ¹
Query/1000_silences-12                     128.0 ± 0%    128.0 ± 0%       ~ (p=1.000 n=10) ¹
Query/10000_silences-12                   1.038k ± 0%   1.038k ± 0%       ~ (p=1.000 n=10) ¹
geomean                                    216.1         216.1       +0.01%
¹ all samples are equal
```

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

@grobinson-grafana
Collaborator

You never evict stale items from the cache, but matchers are long-lived (consider the case of routes). The Alertmanager will just accumulate memory by growing the sync.Map until it OOMs?

@OGKevin
Contributor

OGKevin commented Apr 9, 2025

> You never evict stale items from the cache, but matchers are long-lived (consider the case of routes). The Alertmanager will just accumulate memory by growing the sync.Map until it OOMs?

This was my thinking as well, but then I concluded that memory usage will map statically to the number of items configured in the Alertmanager and Prometheus configs? The more items you have, the more memory is needed, or am I missing something 👀

Nevertheless, @siavashs, I think adding metrics on cache hits, misses and entries would help map memory increases to cache items, so that if OOMs happen we can pin them to this cache.

@grobinson-grafana
Collaborator

grobinson-grafana commented Apr 9, 2025

> This was my thinking as well, but then I concluded that memory usage will map statically to the number of items configured in the Alertmanager and Prometheus configs? The more items you have, the more memory is needed, or am I missing something 👀
>
> Nevertheless, @siavashs, I think adding metrics on cache hits, misses and entries would help map memory increases to cache items, so that if OOMs happen we can pin them to this cache.

Memory usage will increase as the number of unique label sets over time increases. For example, if you have 100 label matchers in memory, but you only see 100 unique alerts over a 24-hour period, then in the worst case you will cache 100 × 100 = 10,000 entries in those 24 hours.

If, on the other hand, you have the same number of label matchers in memory, but churn 1 million unique alerts over a 24-hour period (consider large multi-tenant systems like Grafana Mimir), then you will cache 100 × 1 million = 100 million entries in memory in those 24 hours. As you never evict them, the cache will keep accumulating until you OOM.

@siavashs
Contributor Author

siavashs commented Apr 9, 2025

> Memory usage will increase as the number of unique label sets over time increases. For example, if you have 100 label matchers in memory, but you only see 100 unique alerts over a 24-hour period, then in the worst case you will cache 100 × 100 = 10,000 entries in those 24 hours.
>
> If, on the other hand, you have the same number of label matchers in memory, but churn 1 million unique alerts over a 24-hour period (consider large multi-tenant systems like Grafana Mimir), then you will cache 100 × 1 million = 100 million entries in memory in those 24 hours. As you never evict them, the cache will keep accumulating until you OOM.

What is your suggestion for cache expiry? Should we use a hard (configurable) limit like 1000 entries, or use last-hit time and/or number of hits to evict stale items?

@grobinson-grafana
Collaborator

grobinson-grafana commented Apr 9, 2025

> What is your suggestion for cache expiry? Should we use a hard (configurable) limit like 1000 entries, or use last-hit time and/or number of hits to evict stale items?

Well, I think some analysis is required here.

With a hard limit like 1000 entries and no expiration, your cache could end up with a 100% miss rate as the data changes over time.

With something like LRU, I suspect you will find very high churn when you have lots of alerts. Prometheus resends all of its alerts to the Alertmanager at regular intervals, and I anticipate this will manifest as a high cache miss rate. However, feel free to run some experiments here.

An LFU cache probably makes the most sense, as you want to cache the most frequent inputs.

You may also find different access patterns for dispatching, silences and inhibitions, as these run at different times.

What's the problem you are trying to solve, out of interest? Is it something specific like silences, or just general regexp matching?

@siavashs
Contributor Author

siavashs commented Apr 9, 2025

> An LFU cache probably makes the most sense, as you want to cache the most frequent inputs.
>
> What's the problem you are trying to solve, out of interest? Is it something specific like silences, or just general regexp matching?

I also think LFU would be the best option here.

In our setup we have noticed Alertmanager's dispatcher using ~25% of the total CPU time during alert spikes to do label matching with regexp expressions. The root cause is the (bad) practice of using a notify label to define routing on the alerts themselves; for example, alerts have labels like notify="foo bar baz". The dispatcher then uses regexp matchers to decide which receivers to route the alerts to.
Since the notify label rarely changes, my expectation is that the dispatcher will use much less CPU time during alert spikes, with the side effect of more memory usage.
Memory is not a big concern, at least in our setup, which usually allocates ~16GB; this already includes active alerts and all the routes, so my expectation is that the cache will use only a fraction of that. Even a 2-4x increase in memory usage is acceptable if the dispatcher works 2x faster during alert spikes.

I'll try to implement an LFU cache and run some experiments with benchmarks for the dispatcher and regexp benchmarks for the inhibitor.

@grobinson-grafana
Collaborator

With the cache, be conscious of the overhead of the cache itself. Some large multi-tenant installations can have 100,000s of label matchers across various configuration files, silences, inhibition rules, etc. I would also avoid using a goroutine for eviction.
