feat(matcher): cache labels matcher regexp results #4345

Open

wants to merge 1 commit into base: main

Conversation

siavashs
Contributor

@siavashs siavashs commented Apr 8, 2025

In some scenarios, Alertmanager may use label regexp matching heavily:

  • Inhibition
  • Silences
  • Dispatch

The same alert label strings are matched against the same regexp expressions repeatedly.

We have observed ~25% of cumulative CPU time spent in `regexp.(*Regexp).MatchString` in captured CPU profiles.

This change implements a cache in the labels matcher to store regexp match results.

It improves the performance of silence queries by ~39-47% in existing benchmarks. Similar improvements are expected in inhibition and dispatch when regexp matching is used.

```
goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/silence
cpu: Apple M3 Pro
                                   │ silence-main.txt │         silence-cached.txt          │
                                   │      sec/op      │   sec/op     vs base                │
Mutes/1_silence_mutes_alert-12            540.7n ± 1%   540.5n ± 0%        ~ (p=0.271 n=10)
Mutes/10_silences_mute_alert-12           1.391µ ± 0%   1.402µ ± 1%   +0.83% (p=0.000 n=10)
Mutes/100_silences_mute_alert-12          8.626µ ± 2%   8.739µ ± 1%   +1.30% (p=0.001 n=10)
Mutes/1000_silences_mute_alert-12         102.1µ ± 1%   106.4µ ± 1%   +4.21% (p=0.000 n=10)
Mutes/10000_silences_mute_alert-12        1.121m ± 1%   1.117m ± 0%   -0.36% (p=0.004 n=10)
Query/100_silences-12                    14.413µ ± 0%   8.206µ ± 2%  -43.06% (p=0.000 n=10)
Query/1000_silences-12                   153.99µ ± 0%   93.09µ ± 1%  -39.54% (p=0.000 n=10)
Query/10000_silences-12                   1.979m ± 0%   1.060m ± 1%  -46.45% (p=0.000 n=10)
geomean                                   36.66µ        29.89µ       -18.46%

                                   │ silence-main.txt │          silence-cached.txt           │
                                   │       B/op       │     B/op      vs base                 │
Mutes/1_silence_mutes_alert-12           1.180Ki ± 0%   1.180Ki ± 0%       ~ (p=1.000 n=10) ¹
Mutes/10_silences_mute_alert-12          4.227Ki ± 0%   4.227Ki ± 0%       ~ (p=1.000 n=10) ¹
Mutes/100_silences_mute_alert-12         33.32Ki ± 0%   33.32Ki ± 0%       ~ (p=1.000 n=10) ¹
Mutes/1000_silences_mute_alert-12        304.7Ki ± 0%   304.7Ki ± 0%  +0.00% (p=0.000 n=10)
Mutes/10000_silences_mute_alert-12       3.526Mi ± 0%   3.526Mi ± 0%  +0.02% (p=0.000 n=10)
Query/100_silences-12                    4.753Ki ± 0%   4.750Ki ± 0%  -0.06% (p=0.000 n=10)
Query/1000_silences-12                   39.92Ki ± 0%   39.91Ki ± 0%  -0.04% (p=0.000 n=10)
Query/10000_silences-12                  523.8Ki ± 0%   523.7Ki ± 0%  -0.01% (p=0.000 n=10)
geomean                                  45.44Ki        45.43Ki       -0.01%
¹ all samples are equal

                                   │ silence-main.txt │          silence-cached.txt          │
                                   │    allocs/op     │  allocs/op   vs base                 │
Mutes/1_silence_mutes_alert-12             19.00 ± 0%    19.00 ± 0%       ~ (p=1.000 n=10) ¹
Mutes/10_silences_mute_alert-12            40.00 ± 0%    40.00 ± 0%       ~ (p=1.000 n=10) ¹
Mutes/100_silences_mute_alert-12           139.0 ± 0%    139.0 ± 0%       ~ (p=1.000 n=10) ¹
Mutes/1000_silences_mute_alert-12         1.050k ± 0%   1.050k ± 0%       ~ (p=1.000 n=10) ¹
Mutes/10000_silences_mute_alert-12        10.09k ± 0%   10.10k ± 0%  +0.09% (p=0.000 n=10)
Query/100_silences-12                      32.00 ± 0%    32.00 ± 0%       ~ (p=1.000 n=10) ¹
Query/1000_silences-12                     128.0 ± 0%    128.0 ± 0%       ~ (p=1.000 n=10) ¹
Query/10000_silences-12                   1.038k ± 0%   1.038k ± 0%       ~ (p=1.000 n=10) ¹
geomean                                    216.1         216.1       +0.01%
¹ all samples are equal
```

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

@grobinson-grafana
Collaborator

You never evict stale items from the cache, but matchers are long-lived (consider the case of routes). The Alertmanager will just accumulate memory by growing the sync.Map until it OOMs?

@OGKevin
Contributor

OGKevin commented Apr 9, 2025

> You never evict stale items from the cache, but matchers are long-lived (consider the case of routes). The Alertmanager will just accumulate memory by growing the sync.Map until it OOMs?

This was my thinking as well, but then I concluded that memory usage will map statically to the number of items configured in the Alertmanager and Prometheus configs? The more items you have, the more memory is needed, or am I missing something 👀

Nevertheless, @siavashs, I think adding metrics on cache hits, misses and entries would help map memory increases to cache items, so that if OOMs happen we can pin them to this cache.

@grobinson-grafana
Collaborator

grobinson-grafana commented Apr 9, 2025

> This was my thinking as well, but then I concluded that memory usage will map statically to the number of items configured in the Alertmanager and Prometheus configs? The more items you have, the more memory is needed, or am I missing something 👀
>
> Nevertheless, @siavashs, I think adding metrics on cache hits, misses and entries would help map memory increases to cache items, so that if OOMs happen we can pin them to this cache.

Memory usage will increase as the number of unique label sets over time increases. For example, if you have 100 label matchers in memory, but you only see 100 unique alerts over a 24-hour period, then in the worst case you will cache 100 × 100 = 10,000 entries in those 24 hours.

If, on the other hand, you have the same number of label matchers in memory, but churn 1 million unique alerts over a 24-hour period (consider large multi-tenant systems like Grafana Mimir), then you will cache 100 × 1 million = 100 million entries in memory in those 24 hours. As you never evict them, the cache will keep accumulating until you OOM.

@siavashs
Contributor Author

siavashs commented Apr 9, 2025

> Memory usage will increase as the number of unique label sets over time increases. For example, if you have 100 label matchers in memory, but you only see 100 unique alerts over a 24-hour period, then in the worst case you will cache 100 × 100 = 10,000 entries in those 24 hours.
>
> If, on the other hand, you have the same number of label matchers in memory, but churn 1 million unique alerts over a 24-hour period (consider large multi-tenant systems like Grafana Mimir), then you will cache 100 × 1 million = 100 million entries in memory in those 24 hours. As you never evict them, the cache will keep accumulating until you OOM.

What is your suggestion for cache expiry? Should we use a hard (configurable) limit like 1000 entries, or use last-hit time and/or number of hits to evict stale items?

@grobinson-grafana
Collaborator

grobinson-grafana commented Apr 9, 2025

> What is your suggestion for cache expiry? Should we use a hard (configurable) limit like 1000 entries, or use last-hit time and/or number of hits to evict stale items?

Well, I think some analysis is required here.

With a hard limit like 1000 entries and no expiration, your cache could end up with a 100% miss rate as the data changes over time.

With something like LRU, I suspect you will find very high churn when you have lots of alerts. Prometheus resends all of its alerts to the Alertmanager at regular intervals, and I anticipate this will manifest as a high cache miss rate. However, feel free to run some experiments here.

An LFU cache probably makes the most sense, as you want to cache the most frequent inputs.

You may also find different access patterns for dispatching, silences and inhibitions, as these run at different times.

What's the problem you are trying to solve, out of interest? Is it something specific like silences, or just general regexp matching?

@siavashs
Contributor Author

siavashs commented Apr 9, 2025

> An LFU cache probably makes the most sense, as you want to cache the most frequent inputs.
>
> What's the problem you are trying to solve, out of interest? Is it something specific like silences, or just general regexp matching?

I also think LFU would be the best option here.

In our setup we have noticed Alertmanager's dispatcher using ~25% of the total CPU time during alert spikes to do label matching with regexp expressions. The root cause is the (bad) practice of using a notify label to define routing on the alerts themselves; for example, alerts have labels like notify="foo bar baz". The dispatcher then uses regexp matchers to decide which receivers to route the alerts to.
Since the notify label rarely changes, my expectation is that the dispatcher will use much less CPU time during alert spikes, with the side effect of more memory usage.
Memory is not a big concern, at least in our setup, which usually allocates ~16GB; this already includes active alerts and all the routes, so my expectation is that the cache will use only a fraction of that. Even a 2-4x increase in memory usage is acceptable if the dispatcher works 2x faster during alert spikes.

I'll try to implement an LFU cache and run some experiments with benchmarks for the dispatcher and regexp benchmarks for the inhibitor.

@grobinson-grafana
Collaborator

With the cache, be conscious of the overhead of the cache itself. Some large multi-tenant installations can have 100,000s of label matchers across various configuration files, silences, inhibition rules, etc. I would also avoid using a goroutine for eviction.
