
Speed up calculation of truncated normal mean and cdf #652


Merged: 12 commits merged into main, Jan 24, 2024

Conversation

@DanTanAtAims (Contributor)

Instead of using the built-in mean, cdf, and truncation functions provided by Distributions.jl, use the explicit formula of the truncated normal distribution.

Uses the approximated error function provided by SpecialFunctions.jl (added as a dependency).

This does not resolve the memory issues mentioned in issue #572.
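
For reference, a minimal sketch of the explicit formula being described, assuming the usual closed form for a Normal(mu, stdev) truncated to [lower, upper] (names here are illustrative, not the exact functions added in this PR):

```julia
using SpecialFunctions: erf

# Sketch only: explicit mean of X ~ Normal(mu, stdev) truncated to [lower, upper]:
#   E[X] = mu + stdev * (phi(alpha) - phi(beta)) / (Phi(beta) - Phi(alpha))
# where alpha = (lower - mu) / stdev and beta = (upper - mu) / stdev.
normal_pdf(z) = exp(-z^2 / 2) / sqrt(2 * pi)    # standard normal pdf
normal_cdf(z) = (1 + erf(z / sqrt(2))) / 2      # standard normal cdf via the error function

function explicit_truncnorm_mean(mu, stdev, lower, upper)
    alpha = (lower - mu) / stdev
    beta = (upper - mu) / stdev
    return mu + stdev * (normal_pdf(alpha) - normal_pdf(beta)) /
                        (normal_cdf(beta) - normal_cdf(alpha))
end
```

Evaluating the formula directly avoids constructing Distributions.jl objects on every call, which is what the profiling later in this thread points to as the main cost.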

  • Instead of using built-in mean, cdf and truncation functions provided by Distributions.jl, use the explicit formula of the truncated normal distribution.
  • Uses the approximated error function provided by SpecialFunctions.jl.
  • Removed unused variable
  • Removed unused distribution variable
@ConnectedSystems (Collaborator)

General comments:

  • There are a few other spots (at least two, from memory) where the truncated normal mean is used. Please check the spec document I sent you earlier.
  • Could you add some high-level tests? Something that checks the method produces values within some small error threshold of the original approach, just in case someone accidentally changes something in the future (see the sketch after this list).
  • There are some minor formatting issues we'll check when we next chat.
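
A hedged sketch of what such a test could look like, assuming the PR's truncated_normal_mean(mu, stdev, lower, upper) signature (test cases and tolerance below are placeholders):

```julia
using Test
using Distributions: Normal, truncated, mean

# Regression-style check: the fast implementation should stay within a small
# error threshold of the original Distributions.jl approach.
@testset "truncated_normal_mean matches Distributions.jl" begin
    cases = [(0.0, 1.0, -1.0, 1.0), (2.0, 0.5, 1.5, 3.0), (5.0, 2.0, 0.0, 4.0)]
    for (mu, stdev, lower, upper) in cases
        expected = mean(truncated(Normal(mu, stdev), lower, upper))
        @test abs(truncated_normal_mean(mu, stdev, lower, upper) - expected) < 1e-6
    end
end
```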

@ConnectedSystems (Collaborator) left a comment:

Some minor concerns, mostly to do with formatting.

@ConnectedSystems (Collaborator) commented Jan 21, 2024

Some performance notes:

Trial runs with the Moore domain (256 scenarios).

Prior to the changes in this PR, bleaching_mortality() took ~38% of runtime, with adjust_DHW_distribution() taking another ~13% (51% total).

Initial runtime was: ~1 min 40 secs (estimate - forgot to actually write it down)
Second run took: ~1 min 20 secs

[Pasted image 20240119194729]

With changes:

  • bleaching_mortality() takes ~13% + 9% (22%),
  • adjust_DHW_distribution() takes another 12% (total of 34% of runtime).

Trajectories look as expected:

[Pasted image 20240119210342]

The image above is misleading as samples are biased towards guided scenarios, so here's one where I make sure there is an equivalent number of samples for each scenario type.

[Pasted image 20240119210951]

@DanTanAtAims (Contributor, Author)

I'll change interventions/seeding.jl to use the new calculation.

Should I export the functions truncated_normal_mean and truncated_normal_cdf so that they can be used in interventions/seeding.jl and in the tests?

@ConnectedSystems (Collaborator)

Shouldn't need to export those, no - I think the functions need to be moved elsewhere (e.g., outside of corals/growth.jl) because they're not specific to growth.

@ConnectedSystems (Collaborator)

Right now I'm thinking Ecosystem.jl, does that sound good to you?

@DanTanAtAims (Contributor, Author)

I've come across a potential solution to the numerical issues we were having with the new implementation. I found the implementation here that Distributions.jl currently uses.

It turns out mean(truncated(Normal(...), ...)) uses this implementation, but profiling shows the calls to truncated() and Normal() themselves are big slowdowns.

I implemented a nearly identical version of it, excluding some error checks we don't need, and benchmarked both. The result is nearly twice as fast as our previous attempt and agrees with mean(truncated(Normal())), which is unsurprising as it's the same calculation. The benchmark results are attached: the first benchmark is the original implementation and the second is the new implementation from Distributions.jl minus error checks.

[tnm_bench]
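
For illustration, a sketch of the kind of calculation being referred to, working with standardised bounds and using expm1 to keep the numerator accurate when the bounds are close together; the actual Distributions.jl internals include additional case handling omitted here, and the names are hypothetical:

```julia
using SpecialFunctions: erf

# Sketch: mean of a standard normal truncated to [a, b],
#   E[Z | a < Z < b] = sqrt(2/pi) * (exp(-a^2/2) - exp(-b^2/2)) / (erf(b/sqrt(2)) - erf(a/sqrt(2)))
# with the numerator rewritten via expm1 to avoid cancellation.
function tn_mean_standardised(a, b)
    num = -exp(-a^2 / 2) * expm1(-(b^2 - a^2) / 2)   # == exp(-a^2/2) - exp(-b^2/2)
    den = erf(b / sqrt(2)) - erf(a / sqrt(2))
    return sqrt(2 / pi) * num / den
end

# Shift and scale back to the original Normal(mu, stdev):
tn_mean(mu, stdev, lower, upper) =
    mu + stdev * tn_mean_standardised((lower - mu) / stdev, (upper - mu) / stdev)
```

Calling a closed form like this directly also skips the truncated() and Normal() construction that the profiling flagged as the main slowdown.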

@ConnectedSystems (Collaborator)

Hmm, interesting - the potential maximum is about double your earlier implementation, but the median is indeed half.
Either way, very happy with this as it's ~17% of runtime compared to the usual mean(truncated()) approach.

Let me know when this is ready for a final review :)

  • Updated new truncated normal mean calculations with tests. The new calculation is the same as the Distributions.jl calculation minus checks.
  • Added truncated normal cdf calculations and tests; added functions
  • Swapped use of mean(Truncated(Normal(mu, stdev), lower, upper)) to use truncated_normal_mean
@DanTanAtAims (Contributor, Author)

The pull request is ready for final review.

The truncated normal mean has nearly equivalent accuracy to the built-in implementation we were using, but is much faster.

The truncated normal cdf function has equivalent performance, and I haven't found any speed-ups that are stable for large deviations from the normal mean. However, the new truncated_normal_cdf doesn't return NaN for those large values, unlike the built-in cdf we were using.
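
For context, a minimal sketch of a direct erf-based truncated normal cdf (illustrative only; the merged truncated_normal_cdf may handle edge cases, such as x outside [lower, upper] or extreme bounds, differently):

```julia
using SpecialFunctions: erf

# Sketch: cdf at x of Normal(mu, stdev) truncated to [lower, upper],
#   F(x) = (Phi(xi) - Phi(alpha)) / (Phi(beta) - Phi(alpha)),
# written directly in terms of erf. Assumes lower <= x <= upper.
function truncnorm_cdf_sketch(x, mu, stdev, lower, upper)
    s = stdev * sqrt(2)
    num = erf((x - mu) / s) - erf((lower - mu) / s)
    den = erf((upper - mu) / s) - erf((lower - mu) / s)
    return num / den
end
```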

@@ -336,6 +336,7 @@ end
include("clustering.jl")
include("data_loading.jl")
include("domain.jl")
include("Ecosystem.jl")
@ConnectedSystems (Collaborator) commented on the diff:

Not sure how I feel about this, but okay to leave as is.

@ConnectedSystems (Collaborator) left a comment:

Thanks @DanTanAtAims

Some mostly style-related issues to fix, then we're good to go.

  • Addressed style issues.
  • Used random sampling for testing of truncated normal functions
  • Fixed grammar error
@DanTanAtAims (Contributor, Author)

Thanks @ConnectedSystems for the comments.

I've addressed the issues raised and swapped the testing to draw random numbers using the same testing bounds as before.

Not sure if I mentioned this earlier, but testing the truncated normal cdf becomes difficult when the bounds exceed 10 standard deviations from the given mean, as the built-in function sometimes returns NaN unexpectedly. As far as I know these bounds aren't exceeded in ADRIA; however, the function we're currently using won't throw an error.

I could add a warning to the cdf function indicating possible loss of accuracy if we ever hit these bounds?

@ConnectedSystems (Collaborator) commented Jan 24, 2024

> I could add a warning to the cdf function indicating possible loss of accuracy if we ever hit these bounds?

Could you add it as a @debug level log please?

Same usage as @info and @warn

https://docs.julialang.org/en/v1/stdlib/Logging/
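
A minimal sketch of how that could be wired in (the function name and threshold check are hypothetical; only the @debug usage itself follows the linked Logging docs):

```julia
# @debug is used like @info / @warn but is hidden by default; enable it with, e.g.,
# ENV["JULIA_DEBUG"] = "Main" or a custom logger from the Logging stdlib.
function check_truncation_bounds(mu, stdev, lower, upper)
    if max(abs(lower - mu), abs(upper - mu)) > 10 * stdev
        @debug "Truncation bounds exceed 10 standard deviations from the mean; possible loss of accuracy in truncated_normal_cdf."
    end
    return nothing
end
```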

  • Debug log possible loss of accuracy when truncated bounds exceed 10 standard deviations from normal mean.
  • Fixed doc string function signature and corrected comment grammar
  • fix spelling error
@ConnectedSystems (Collaborator) left a comment:

Looking good, thanks!

@ConnectedSystems merged commit 61e7c5f into main on Jan 24, 2024
@ConnectedSystems deleted the truncated-norm-speedup branch on January 24, 2024 at 04:53
@ConnectedSystems (Collaborator) commented Jan 24, 2024

Hmm, unfortunately we seem to be almost back at square one.

bleaching_mortality!() + adjust_DHW_distribution() take up ~50% of runtime.

[image]

Most of the time is spent in erf() and logerf(). Let's leave this for now and see if we can come up with anything else.

@ConnectedSystems (Collaborator)

Just had a thought: it could be that there is a small speed-up, but the reason it's spending so much time in those functions is that they're called a lot; it's just the nature of the use context. We can discuss more when Pedro gets back.

@DanTanAtAims mentioned this pull request on Aug 1, 2024