# Ensembles and statistics

## Quantifying the duration of change
How could you quantify the duration of a change using a single number? In other words, what sort of summary statistic can you use to decide whether one trajectory goes up earlier than another one? Try to go for the simplest such summary statistic.
Here’s a simple summary statistic that will do the job. Let \(\overline{p}\) refer to the mean \(p\) across the population of our variational learners; i.e. \(\overline{p}\) is the average probability that \(G_1\) is used. We will then find out the time point at which a simulation first satisfies \(\overline{p} > 0.5\), i.e. the earliest time at which \(G_1\) has more than 50% usage. Let \(T\) refer to this time point.
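In code, the idea amounts to "the index of the first element exceeding 0.5". Here is a minimal sketch, assuming the per-step values of \(\overline{p}\) for a single run have already been collected into a vector (the helper name `first_crossing` is ours, not part of any package):

```julia
# hypothetical helper: the first index at which the mean trajectory
# exceeds 0.5; this is T for one trajectory (modulo the fact that our
# simulations count time from 0, while Julia indexes from 1)
first_crossing(p_means) = findfirst(>(0.5), p_means)

first_crossing([0.1, 0.3, 0.6, 0.7])   # returns 3
```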
## The statistical test
Once you have such a number for each trajectory, you have a set of these numbers. What kind of statistical test could you use to decide whether the set of numbers for \(\beta = 0.1\) is significantly different from the set of numbers for \(\beta = 0.5\)? (Hint: you want a test that compares two means from two samples.)
We get a single value of \(T\) for each simulation trajectory; for example, if we repeat the simulation 100 times for each value of \(\beta\), we have two sets of 100 \(T\) numbers. To test whether there is a statistically significant difference between these sets of numbers, we can use a two-sample t-test.
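To get a feel for what the test does before applying it to real simulation output, here is a toy example on made-up samples; `EqualVarianceTTest` and `pvalue` come from HypothesisTests.jl:

```julia
using HypothesisTests, Random

Random.seed!(1)
a = randn(100) .+ 5.0   # sample with true mean 5.0
b = randn(100) .+ 5.5   # sample with true mean 5.5

t = EqualVarianceTTest(a, b)
pvalue(t)   # typically well below 0.05 here, so we'd reject the null of equal means
```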
## Implementation
Once you have answers to the above questions, you can try and implement the following procedure:
- Instead of 10 simulations, use `ensemblerun!` to produce simulated trajectories for 100 repetitions for each \(\beta\).
- Then figure out how to extract your summary statistic from these data.
- Finally, carry out the statistical test in order to make a decision.
### a. Running the simulations
We first load all the necessary ingredients:

using Random
using Agents
using Graphs
using Statistics
using DataFrames
using HypothesisTests
include("VL2.jl")
using .VL
Here is our function that creates one model:
function make_model(beta)
    G = watts_strogatz(50, 8, beta)
    space = GraphSpace(G)
    model = StandardABM(NetworkVL, space,
                        agent_step! = VL_step!)
    for i in 1:50
        add_agent_single!(model, 0.01, 0.01, 0.2, 0.1)
    end
    return model
end
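As a quick sanity check (assuming `VL2.jl` provides `NetworkVL` and `VL_step!` as in the earlier sessions), we can build one model and confirm that it holds 50 agents:

```julia
test_model = make_model(0.1)   # one model on a Watts–Strogatz graph with β = 0.1
nagents(test_model)            # should return 50
```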
We can set the PRNG seed for reproducibility:
Random.seed!(1539)
Create vectors of models using array comprehensions:
models1 = [make_model(0.1) for i in 1:100]
models2 = [make_model(0.5) for i in 1:100]
Use `ensemblerun!` to simulate and obtain the mean of `p`:
data1, _ = ensemblerun!(models1, 10_000; adata = [(:p, mean)])
data2, _ = ensemblerun!(models2, 10_000; adata = [(:p, mean)])
Verify that the dataframes look the way we’d expect them to:
data1
Row | time | mean_p | ensemble |
---|---|---|---|
 | Int64 | Float64 | Int64 |
1 | 0 | 0.01 | 1 |
2 | 1 | 0.0103 | 1 |
3 | 2 | 0.010197 | 1 |
4 | 3 | 0.010095 | 1 |
5 | 4 | 0.00999408 | 1 |
6 | 5 | 0.00989414 | 1 |
7 | 6 | 0.0097952 | 1 |
8 | 7 | 0.00989725 | 1 |
9 | 8 | 0.00999827 | 1 |
10 | 9 | 0.00989829 | 1 |
11 | 10 | 0.00979931 | 1 |
12 | 11 | 0.00990131 | 1 |
13 | 12 | 0.0100023 | 1 |
⋮ | ⋮ | ⋮ | ⋮ |
1000089 | 9989 | 0.990469 | 100 |
1000090 | 9990 | 0.990564 | 100 |
1000091 | 9991 | 0.990659 | 100 |
1000092 | 9992 | 0.990752 | 100 |
1000093 | 9993 | 0.990845 | 100 |
1000094 | 9994 | 0.990736 | 100 |
1000095 | 9995 | 0.990829 | 100 |
1000096 | 9996 | 0.990921 | 100 |
1000097 | 9997 | 0.991011 | 100 |
1000098 | 9998 | 0.991101 | 100 |
1000099 | 9999 | 0.99099 | 100 |
1000100 | 10000 | 0.99108 | 100 |
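Each ensemble member contributes 10,001 rows (times 0 through 10,000), so with 100 repetitions we expect 100 × 10,001 = 1,000,100 rows in total. A quick way to confirm this:

```julia
size(data1)   # (1000100, 3): rows × columns
```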
### b. Obtaining the summary statistics
Now for the tricky part: in order to determine \(T\) for a simulation run, we need to find the lowest value of `time` such that `mean_p` exceeds 0.5, for each value of `ensemble` separately.

Reading the part about indexing in the DataFrames.jl documentation, we find that the following command takes a subset of the original dataframe, containing only those rows on which the value of `mean_p` is greater than 0.5:
data1[data1.mean_p .> 0.5, :]
Row | time | mean_p | ensemble |
---|---|---|---|
 | Int64 | Float64 | Int64 |
1 | 4543 | 0.500058 | 1 |
2 | 4545 | 0.500459 | 1 |
3 | 4546 | 0.501054 | 1 |
4 | 4547 | 0.502444 | 1 |
5 | 4548 | 0.503219 | 1 |
6 | 4549 | 0.502587 | 1 |
7 | 4550 | 0.501361 | 1 |
8 | 4551 | 0.501548 | 1 |
9 | 4552 | 0.503332 | 1 |
10 | 4553 | 0.504099 | 1 |
11 | 4554 | 0.504658 | 1 |
12 | 4555 | 0.504611 | 1 |
13 | 4556 | 0.505565 | 1 |
⋮ | ⋮ | ⋮ | ⋮ |
524271 | 9989 | 0.990469 | 100 |
524272 | 9990 | 0.990564 | 100 |
524273 | 9991 | 0.990659 | 100 |
524274 | 9992 | 0.990752 | 100 |
524275 | 9993 | 0.990845 | 100 |
524276 | 9994 | 0.990736 | 100 |
524277 | 9995 | 0.990829 | 100 |
524278 | 9996 | 0.990921 | 100 |
524279 | 9997 | 0.991011 | 100 |
524280 | 9998 | 0.991101 | 100 |
524281 | 9999 | 0.99099 | 100 |
524282 | 10000 | 0.99108 | 100 |
This operation returns a new dataframe, which we can now save in a new variable:
df1 = data1[data1.mean_p .> 0.5, :]
All we need to do now to extract the \(T\) numbers is to obtain the first row from this new dataframe for each separate `ensemble`. How do we do this?
The answer is something known as split–apply–combine. This procedure allows us to first split a dataframe based on the values in one column (in our case, `ensemble`), then carry out an operation on each of the resulting dataframes individually, and finally combine them back into a single dataframe. The `groupby` function from DataFrames.jl is used for the splitting; here, we split on the `ensemble` column:
df1 = groupby(df1, :ensemble)
Then, we use `combine` to apply an operation to each individual dataframe in the grouping. The function we apply is `minimum`, which returns the smallest element in an array. In this case, we wish to obtain the smallest `time` for each individual dataframe:
df1 = combine(df1, :time => minimum)
Row | ensemble | time_minimum |
---|---|---|
 | Int64 | Int64 |
1 | 1 | 4543 |
2 | 2 | 4431 |
3 | 3 | 4648 |
4 | 4 | 4421 |
5 | 5 | 4985 |
6 | 6 | 5691 |
7 | 7 | 5601 |
8 | 8 | 5243 |
9 | 9 | 4732 |
10 | 10 | 4749 |
11 | 11 | 5069 |
12 | 12 | 4342 |
13 | 13 | 4325 |
⋮ | ⋮ | ⋮ |
89 | 89 | 6073 |
90 | 90 | 4126 |
91 | 91 | 4386 |
92 | 92 | 4041 |
93 | 93 | 6995 |
94 | 94 | 5015 |
95 | 95 | 4455 |
96 | 96 | 4101 |
97 | 97 | 4528 |
98 | 98 | 5090 |
99 | 99 | 3895 |
100 | 100 | 5525 |
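As an aside, the same split–apply–combine pattern works with any grouping column and any reducing function. A minimal toy example on made-up data:

```julia
using DataFrames, Statistics

# hypothetical data: two groups of three observations each
toy = DataFrame(g = [1, 1, 1, 2, 2, 2], x = [5, 3, 8, 7, 2, 9])

# split on g, reduce each group's x, recombine into one dataframe
combine(groupby(toy, :g), :x => minimum, :x => mean)
# 2×3 DataFrame: (g = 1, x_minimum = 3, x_mean ≈ 5.33) and (g = 2, x_minimum = 2, x_mean = 6.0)
```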
Back to our data: the \(T\) values we're interested in are now in the `time_minimum` column; let's store them in a new variable for ease of use:
T1 = df1.time_minimum
100-element Vector{Int64}:
4543
4431
4648
4421
4985
5691
5601
5243
4732
4749
5069
4342
4325
⋮
6073
4126
4386
4041
6995
5015
4455
4101
4528
5090
3895
5525
We can now perform all the same steps for the second set of simulations:
df2 = data2[data2.mean_p .> 0.5, :]
df2 = groupby(df2, :ensemble)
df2 = combine(df2, :time => minimum)
T2 = df2.time_minimum
100-element Vector{Int64}:
4075
4992
4488
5096
4602
4885
5299
4444
5876
4178
4397
4873
4753
⋮
4924
4594
4410
5430
4672
4682
4807
4276
4807
4444
5099
4310
### c. Carrying out the statistical test
The t-test can be performed using `EqualVarianceTTest` from HypothesisTests.jl (see the documentation):
EqualVarianceTTest(T1, T2)
Two sample t-test (equal variance)
----------------------------------
Population details:
parameter of interest: Mean difference
value under h_0: 0
point estimate: -41.98
95% confidence interval: (-192.0, 108.0)
Test summary:
outcome with 95% confidence: fail to reject h_0
two-sided p-value: 0.5817
Details:
number of observations: [100,100]
t-statistic: -0.5518436540887174
degrees of freedom: 198
empirical standard error: 76.07227099371633
The “test summary” bit tells us that the test failed to reject the null hypothesis (which in this case states that the mean \(T\) values between the two sets of simulations do not differ). Thus, we do not have any evidence that there is actually a difference in the speed with which trajectories reach \(p = 0.5\) between the two sets of simulations.
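If you want the p-value as a plain number, for instance to automate the decision, HypothesisTests.jl's `pvalue` function can be applied to the test object:

```julia
p = pvalue(EqualVarianceTTest(T1, T2))   # 0.5817 for the runs above
p < 0.05                                 # false: no significant difference at the 5% level
```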
## Bonus: pipes
To obtain the vector of \(T\) values from a simulation history, we did this:
df1 = data1[data1.mean_p .> 0.5, :]
df1 = groupby(df1, :ensemble)
df1 = combine(df1, :time => minimum)
T1 = df1.time_minimum
Notice that what we're doing here is taking the contents of a variable (`df1`), carrying out some operation, and putting the result back in the same variable. Julia, like many modern programming languages, supports an operation known as the pipe, which makes this kind of process simpler. The idea is that the result of an operation is piped into the following operation, whose result is then piped into the next, and so on. In Julia, the pipe operator is `|>`; combined with the `@pipe` macro from the Pipe.jl package, the following does exactly the same as the above code snippet:
using Pipe
@pipe data1[data1.mean_p .> 0.5, :] |> groupby(_, :ensemble) |> combine(_, :time => minimum) |> _.time_minimum
100-element Vector{Int64}:
4543
4431
4648
4421
4985
5691
5601
5243
4732
4749
5069
4342
4325
⋮
6073
4126
4386
4041
6995
5015
4455
4101
4528
5090
3895
5525
Notice that the underscore (`_`) symbol takes the place of the "anonymous" variable. The `@pipe` macro does the magic of populating this temporary variable for you, so you don't need to do it yourself.
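If the long pipeline above is hard to parse at first, a smaller example may help. Note that the bare `|>` operator only feeds a value to a one-argument function; the underscore is what lets multi-argument calls (and indexing) participate:

```julia
using Pipe

4 |> sqrt                            # base Julia pipe: 2.0
@pipe 4 |> sqrt(_)                   # the same, with an explicit slot for the value
@pipe [3, 1, 2] |> sort(_) |> _[1]   # the underscore can appear anywhere: returns 1
```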
What this means is that, to create the `T1` and `T2` arrays, all we need are the following two lines of code:
T1 = @pipe data1 |> _[_.mean_p .> 0.5, :] |> groupby(_, :ensemble) |> combine(_, :time => minimum) |> _.time_minimum
T2 = @pipe data2 |> _[_.mean_p .> 0.5, :] |> groupby(_, :ensemble) |> combine(_, :time => minimum) |> _.time_minimum
Whether you find using pipes more natural than explicitly creating temporary variables (such as `df1` above) boils down to personal preference and experience. If you're like me, you will initially find pipes confusing, but the more programming experience you gather, the more natural pipes become. Having said this, it's good to point out that using a pipe is never necessary; whatever you can do with a pipe, you can also do without.
## Bonus 2: plotting the distributions of \(T\)
The statistical test found no evidence of a difference between the two sets of \(T\) numbers. Can we visualize this somehow? A common way of doing this is with a boxplot. Here's how we can do it in Julia.
# load the StatsPlots package
using StatsPlots

# create dataframes: first column is the beta value, second column is the T values
set1 = DataFrame(beta="0.1", T=T1)
set2 = DataFrame(beta="0.5", T=T2)

# join these dataframes (literally, put one on top of the other, "vertical catenation")
sets = vcat(set1, set2)

# plot
@df sets boxplot(:beta, :T, label="time to p = 0.5")
We see that the distributions of \(T\) numbers overlap to a large extent; this is the visual counterpart of the test's verdict that we have no evidence of a difference between the two sets.
Peruse the StatsPlots.jl documentation to learn more about the `@df` macro and all the visualization functions you can use with dataframes.