Efficient Simulation
The tutorial simulation walks through the steps of a single simulation. In many cases, we will want to run a large number of realizations for a hypothetical deployment, sampling from all the variable inputs to see how much the results vary. This basic Monte Carlo approach is flexible and can be used to answer lots of questions (hence the name Monty). But the quality of Monte Carlo results depends on the number of samples: more is better. Setting up efficient repeated simulations requires some attention, and performance depends almost entirely on whether iterations avoid unnecessary memory allocation. The example below shows how to do this.
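To make the sample-size point concrete, here is a minimal sketch in plain Julia, independent of Monty, showing a Monte Carlo estimate tightening as the number of realizations grows (the `estimate` helper and the variable names are illustrative, not part of Monty's API):

```julia
using Random, Statistics

r = Xoshiro(1)

# Monte Carlo estimate of the mean of a Uniform(0, 1) draw; the true
# value is 0.5 and the standard error shrinks like 1/sqrt(n)
estimate(r, n) = mean(rand(r, n))

rough = estimate(r, 100)        # noisy
tight = estimate(r, 1_000_000)  # much closer to 0.5
```

With a million draws the estimate lands very close to 0.5; with a hundred it wanders noticeably. The same logic applies to repeated deployment simulations: more realizations, tighter answers.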
Setup
First, we load the modules
using Monty, Random, Distributions, Meshes
using DataFrames
using GeoStatsFunctions
import CairoMakie as Mke
const rng = Xoshiro(42)
and set up the geometry for a hypothetical deployment. For this example, I use a simple rectangular grid split up into alternating treatment/control strips. The grid has 100 total cells, each of which will have one composite sample taken from it.
grid = CartesianGrid((10,10), (0.0, 0.0), (10.0, 10.0))
treatment = reduce(vcat, [grid[i,:] for i ∈ 1:2:10])
control = reduce(vcat, [grid[i,:] for i ∈ 2:2:10])
treatment_points = treatment .|> centroid
control_points = control .|> centroid
In this example, we use an intentional "grid centroid jitter," which means every realization of the sample plan moves the sample points to random locations inside each grid cell. This can help avoid potential issues with putting sample points along straight lines. Here is a quick visualization of one realization of this jitter on the grid.
plannedjitter = GridCentroidJitter(10.0, 2.0)
points = plannedjitter.(grid .|> centroid)
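Conceptually, the jitter replaces each centroid with a uniformly random point inside its own cell. A minimal sketch of that idea in plain Julia (a hypothetical `jittercentroid` helper for illustration, not Monty's actual `GridCentroidJitter` implementation):

```julia
using Random

r = Xoshiro(2)

# hypothetical stand-in for grid centroid jitter: move a centroid (x, y)
# to a uniform random point inside its square cell of side w
function jittercentroid(r, x, y, w)
    (x + w * (rand(r) - 0.5), y + w * (rand(r) - 0.5))
end

px, py = jittercentroid(r, 5.0, 5.0, 10.0)
# the jittered point always stays inside the 10×10 cell centered at (5, 5)
```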
Pre-allocation
Running efficient simulations means avoiding unnecessary memory allocations inside the loop. To avoid them, we allocate the memory we need for simulations once and reuse it for each realization, copying each realization's final output into a results data structure along the way and returning those results at the end. We also need to set the parameters and distributions that each simulation is realized from.
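The general pattern looks like this in plain Julia, with ordinary arrays standing in for Monty's preallocated structures: allocate buffers once, then mutate them in place with bang functions like `rand!` so the loop itself allocates nothing new.

```julia
using Random

r = Xoshiro(3)

nsamples, nrealizations = 200, 100
buffer = zeros(nsamples)        # working memory, allocated once
results = zeros(nrealizations)  # one summary value per realization

for i in 1:nrealizations
    rand!(r, buffer)                    # fill in place, no new allocation
    results[i] = sum(buffer) / nsamples # store only the final output
end
```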
Here I allocate everything needed for repeated simulations. First, I create the sample plan and allocate space for cores (samp) and simulations (sim). Then I define the variables needed to realize an individual simulation, which capture both operational and physical/chemical variability.
# location-paired sampling at t=0 and t=1
plan = pairedsampleplan(treatment_points, control_points, [0.0, 1.0])
# allocate space for realized core locations, 10 composites per sample
samp = CoreSet(plan, 10)
# allocate space for the simulation
sim = Simulation((:Ca, :Mg), samp)
# spreading patterns with streaks, as in the tutorial
Q = GaussianSimulator(samp, 4.0)
Q_cov = GaussianCovariance(
MetricBall((10.0, 1000.0)),
nugget=(4.0 / 10)^2,
sill=(4.0 / 3)^2,
)
# circular compositing pattern with a radius of 1
stencil = CircleStencil(ncore(sim), 1.0)
# sampler location error with standard deviation of 2
samplerjitter = Jitter(2.0)
# core location error with standard deviation of 0.1
corejitter = Jitter(0.1)
# constant feedstock density across all simulations
sim.ρf .= 3e3
# cross-correlated baseline soil concentrations
cs = GaussianCosimulator(samp, (Ca=0.002, Mg=0.001), 0.7)
cs_cov = SphericalCovariance(nugget=5e-5^2, sill=1e-4^2, range=20.0)
# variable sample depths
depth = TriangularDist(0.06, 0.14)
# somewhat variable soil bulk density
ρs = Normal(1e3, 50)
# feedstock mean concentrations with 5 % variability
cf = (Ca=0.07, Mg=0.05)
σf = 0.05
# leaching models for each element
Ca_leaching = ExponentialLeaching(λ=1.0)
Mg_leaching = ExponentialLeaching(λ=1.5)
# analytical noise
σ_analysis = (Ca=0.04, Mg=0.03)
Simulation Stacks
Monty has a convenience function for repeating a simulation and dumping the results into a final stack of arrays. This function is meant to be used with Julia's do-block syntax. This means that we write the simulation routine into the function call directly. See below.
sims = simulationstack(100, sim, samp, plan) do
executeplan!(
samp,
plan,
stencil=stencil,
plannedjitter=plannedjitter,
samplerjitter=samplerjitter,
corejitter=corejitter,
)
updategaussian!(Q, samp.points, Q_cov)
spreading!(rng, sim, plan, Q)
clamp!(sim.Q, 0.0, Inf)
unmixed!(rng, sim, depth=depth)
rand!(rng, ρs, sim.ρs)
feedstockconcentration!(rng, sim, cf, σf)
updategaussian!(cs, samp.points, cs_cov)
soilconcentration!(rng, sim, cs)
clamp!(sim.cs[:Ca], 0.0, Inf)
clamp!(sim.cs[:Mg], 0.0, Inf)
leaching!(sim, :Ca, Ca_leaching, plan)
leaching!(sim, :Mg, Mg_leaching, plan)
massloss!(sim, plan, x -> (x[:Ca] / 2 + x[:Mg] / 2))
analyze!(rng, sim, (Ca=0.01, Mg=0.005), 0.005)
end
╭────────────────────╮
│ 100×3×200 DimStack │
├────────────────────┴────────────────────────────────────────── dims ┐
↓ realization Sampled{Int64} 1:100 ForwardOrdered Regular Points,
→ analyte Categorical{Symbol} [:Ca, :Mg, :mass] ForwardOrdered,
↗ sample Sampled{Int64} 1:200 ForwardOrdered Regular Points
├─────────────────────────────────────────────────────────────────────┴ layers ┐
:data eltype: Float64 dims: realization, analyte, sample size: 100×3×200
:x eltype: Float64 dims: realization, sample size: 100×200
:y eltype: Float64 dims: realization, sample size: 100×200
:control eltype: Bool dims: sample size: 200
:location eltype: UInt16 dims: sample size: 200
:round eltype: UInt16 dims: sample size: 200
:time eltype: Float64 dims: sample size: 200
└──────────────────────────────────────────────────────────────────────────────┘
The simulationstack function above is passed 100, the number of realizations, then the data structures we need to simulate (sim, samp, plan). Then the do block contains the entire script for our simulation.
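For readers less familiar with do blocks: they are ordinary Julia syntax for passing an anonymous function as the first argument of a call. A generic example, unrelated to Monty:

```julia
# these two calls are equivalent; the do block becomes map's first argument
a = map(x -> x^2, 1:3)

b = map(1:3) do x
    x^2
end
# both produce [1, 4, 9]
```

simulationstack uses the same mechanism: the do block is the per-realization function it calls repeatedly.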
The simulationstack function handles the storage of results automatically, returning a stack of arrays in the form of a DimStack from DimensionalData.jl. As the readout above indicates, the simulation results are packed into the data member of the results, with associated information like control flags in the other members.
For example, to see the first realization, we can just slice it out of the results:
sims[:data][1,:,:]
╭────────────────────────────────╮
│ 3×200 DimArray{Float64,2} data │
├────────────────────────────────┴────────────────────────── dims ┐
↓ analyte Categorical{Symbol} [:Ca, :Mg, :mass] ForwardOrdered,
→ sample Sampled{Int64} 1:200 ForwardOrdered Regular Points
└─────────────────────────────────────────────────────────────────┘
↓ → 1 2 3 … 199 200
:Ca 0.00438357 0.00295182 0.0040116 0.00197784 0.00191481
:Mg 0.00275947 0.00140839 0.00246664 0.00093164 0.000890766
:mass 1.23311 1.23982 1.29882 1.18181 1.1442
What happens next depends on what we want to know. If you just want to save the data, use tonetcdf or tocsv. You can also convert the entire body of simulations directly into a DataFrame. Here are the first 5 rows of the dataframe for all simulations:
df = sims |> DataFrame
| Row | realization | analyte | sample | data | x | y | control | location | round | time |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Symbol | Int64 | Float64 | Float64 | Float64 | Bool | UInt16 | UInt16 | Float64 |
| 1 | 1 | Ca | 1 | 0.00438357 | 5.11737 | 7.17448 | false | 1 | 1 | 0.0 |
| 2 | 2 | Ca | 1 | 0.00404551 | 3.06514 | 5.21896 | false | 1 | 1 | 0.0 |
| 3 | 3 | Ca | 1 | 0.0037263 | 8.82374 | 6.10879 | false | 1 | 1 | 0.0 |
| 4 | 4 | Ca | 1 | 0.0052264 | 5.81479 | 5.78723 | false | 1 | 1 | 0.0 |
| 5 | 5 | Ca | 1 | 0.00399728 | 5.58126 | 8.46768 | false | 1 | 1 | 0.0 |
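Once the results are in a DataFrame, ordinary DataFrames.jl tooling applies. For instance, a sketch of summarizing values across realizations with `groupby`/`combine`, shown on a tiny hand-made frame rather than the full df above:

```julia
using DataFrames, Statistics

# a tiny stand-in for the simulation DataFrame above
tiny = DataFrame(
    realization=[1, 1, 2, 2],
    analyte=[:Ca, :Mg, :Ca, :Mg],
    data=[0.004, 0.002, 0.005, 0.003],
)

# mean of each analyte across realizations
summarized = combine(groupby(tiny, :analyte), :data => mean => :mean)
```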
Or you can slice/select from the simulations and convert that portion into a frame. Here are the first 5 rows of this dataframe for only the first realization:
df = unstack(sims[1,:,:] |> DataFrame, :analyte, :data)
| Row | sample | x | y | control | location | round | time | Ca | Mg | mass |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Float64 | Float64 | Bool | UInt16 | UInt16 | Float64 | Float64? | Float64? | Float64? |
| 1 | 1 | 5.11737 | 7.17448 | false | 1 | 1 | 0.0 | 0.00438357 | 0.00275947 | 1.23311 |
| 2 | 2 | 5.85044 | 9.75474 | false | 1 | 2 | 1.0 | 0.00295182 | 0.00140839 | 1.23982 |
| 3 | 3 | 10.4717 | 14.703 | false | 2 | 1 | 0.0 | 0.0040116 | 0.00246664 | 1.29882 |
| 4 | 4 | 10.244 | 14.1288 | false | 2 | 2 | 1.0 | 0.00290332 | 0.00139308 | 1.28623 |
| 5 | 5 | 7.52915 | 25.6157 | false | 3 | 1 | 0.0 | 0.00423808 | 0.00277797 | 1.25464 |
Notes
- In the example above, almost all of the computation time is spent updating the Gaussian simulators (performing cholesky! factorizations). Without spatial correlation/structure, the simulations are very fast. Benchmarks with a similar simulation script indicate that simulations without spatial correlation are about 1000 times faster, though this is only a very rough comparison.
- Without spatial correlation, the computational cost scales linearly and individual realizations are extremely fast, on the order of tens of microseconds. At that rate, it would take about 1 second to run 100k realizations of a hypothetical deployment.
- With spatial correlation, the timing depends roughly cubically on the total number of cores in the simulation because of the Cholesky factorizations, so it may vary a lot. To speed things up, one option is to use spatial correlation but not relocate the sample points for every realization. This avoids refactorizing the covariance matrix, and the values at the points can still be drawn at random from a persistent multivariate normal distribution.
- Benchmarks also verify that the simulation loop above allocates zero memory, which is good.
- What goes into the simulation script, inside the do block, is up to you. If any parts of the simulation are shared across realizations, leave them out of the script. For example, the ρf field of the simulation above is assumed to be the same in all realizations, so it's set outside the simulation loop.
- I think the do-block syntax helps keep the focus on simulation choices, instead of how the results arrays are allocated and filled. We just have to write the script, and then the looping and storing for each realization is handled out of view.
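The fixed-locations trick from the notes can be sketched with plain LinearAlgebra: factorize the covariance matrix once, then draw any number of correlated realizations by multiplying the Cholesky factor against fresh standard-normal vectors. This uses a hypothetical 3-point covariance for illustration, not Monty's API:

```julia
using LinearAlgebra, Random

r = Xoshiro(7)

# hypothetical covariance matrix for three fixed sample points
Σ = [1.0 0.5 0.2;
     0.5 1.0 0.5;
     0.2 0.5 1.0]

C = cholesky(Σ).L  # O(n³) factorization, done only once

z = zeros(3)  # standard-normal draws
x = zeros(3)  # correlated realization
for _ in 1:1000
    randn!(r, z)
    mul!(x, C, z)  # x now has covariance Σ; no refactorization, no allocation
end
```

Each loop iteration costs only a matrix-vector product, which is why holding the sample locations fixed across realizations is so much cheaper than re-jittering them.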