Overview
The analogs package implements a general framework for distance-based neighborhood models. Every analysis follows a two-stage pattern:
- First, select a neighborhood of analog locations from a reference pool, based on climate similarity, geographic proximity, or both.
- Optionally, summarize the neighborhood using counts, weighted means, regression coefficients, or other statistics.
Common methods in climate change ecology—climate velocity, analog
impact models, analog availability—are all specific configurations of
this framework. So are geographically weighted regression (GWR) and
inverse distance-weighted (IDW) interpolation, which use purely
geographic neighborhoods. The package implements these via a flexible
core function, analog_search(), along with simplified
wrappers for common analysis types.
The table below shows how each wrapper maps to the framework:
| Function | Neighborhood | Summary | Use case |
|---|---|---|---|
analog_velocity() |
k nearest geographic, climate-constrained | Pairs | Where is this climate moving? |
analog_similarity() |
k nearest climatic, geographically-constrained | Pairs | What climates are reachable? |
analog_availability() |
All within thresholds | Count | Where do analogs exist? |
analog_density() |
All within thresholds | Weighted sum | How strong are analog matches? |
analog_impact() |
All within thresholds | Weighted mean | What ecological conditions to expect? |
analog_regression() |
Flexible | Local regression | How do covariates predict outcomes within neighborhoods? |
As described below, these functions fall into two categories: analog
enumeration functions (which find analogs for each focal site and either
return them directly or summarize their abundance or quality), and
analog-based prediction functions (which fit local models predicting an
outcome variable y observed across each focal site’s
analogs). Note that while this document covers the key wrapper
functions, you can also implement a broader range of analyses by calling
analog_search() directly.
Data
All functions in the package require two data inputs: x
(the set of focal locations for which analogs are to be found) and
pool (the set of reference locations to search for
analogs). Both these components need to have spatial coordinates and
environmental data (traditionally climate, though other variables can
also be used) for each location. Prediction functions also require
y (outcome variable(s)). Data can be provided as matrices,
data.frames, or SpatRasters. x and pool must
have the same set of environmental variables, but can be in different
formats and can represent different sets of sites.
Most of the examples in this vignette are focused on climate change,
and assume that x and pool come from different
time periods with different climates. In these situations, you typically
have a choice about which dataset to map to which of the two parameters,
and swapping them can give different insights. For example, as shown in
the climate velocity section below, swapping the parameter mapping
switches the analysis from forward to reverse velocity; while it’s only
shown for velocity, this is also possible for the other climate change
analyses. For non-climate-change applications like GWR and IDW, shown
toward the end of this vignette, x and pool
can be from the same time period.
The package includes an example dataset of climate rasters covering a portion of western North America at approximately 5 km resolution, with two scaled climate variables (climatic water deficit and actual evapotranspiration) for a historical baseline (1981–2010) and a future projection (2041–2070, SSP585). These are used throughout this vignette. Let’s unpack them and make some quick maps:
clim <- example_rasters()
hist <- clim$historic
fut <- clim$future
plot(hist, main = c("Historical CWD", "Historical AET"))
Distance metrics
Analog computations are centered around pairwise distance
calculations between every location in x and every location
in pool. Internally, the package calculates geographic
distances and environmental distances, each with their own distinct
handling.
For a given focal location in x, geographic and
environmental distances to locations in pool are used in
two ways. They’re used as hard cutoffs (via max_geog and
max_clim) to select the set of analogs, and they’re often
also used as weights (via kernel) to place higher value on
nearby/similar analogs that may be more relevant for climate adaptation
or for statistical prediction.
Geographic distance
The package automatically detects whether coordinates are
longitude/latitude or projected, and uses great-circle or planar
distance accordingly. You can override this with
coord_type = "lonlat" or
coord_type = "projected". Geographic thresholds
(max_geog) are in kilometers for lon/lat data and in
projection units for projected data.
Climate distance
By default, climate distance is Euclidean. When using multiple climate variables, it is recommended to scale them first to avoid artifacts from differing units. The example data used here is already scaled.
For dataset-wide Mahalanobis distance (accounting for covariance
among climate variables), use mahalanobis_transform() to
pre-whiten the data:
transformed <- mahalanobis_transform(x = hist, pool = fut)
vel_mahal <- analog_velocity(
x = transformed$x,
pool = transformed$pool,
max_clim = .25,
k = 1
)For site-specific covariance (e.g., based on local year-to-year
climate variability), supply pre-computed covariance matrices via the
x_cov parameter.
Kernel functions
The kernel parameter controls how analog weights decay
with distance, which is relevant for some stat options.
Available kernel shapes:
-
"uniform"— all analogs weighted equally -
"gaussian_clim"/"gaussian_geog"— Gaussian kernel on climate or geographic distance -
"inverse_clim"/"inverse_geog"— inverse distance weighting -
"gaussian_joint"/"inverse_joint"— kernels on combined climate and geographic distance
The theta parameter controls the kernel bandwidth. For
Gaussian kernels, theta is the standard deviation of the
kernel (but note that the appropriate value depends on the number of
climate variables; see below). For joint kernels, theta is
a 2-element vector c(theta_clim, theta_geog).
When using a kernel that weights climatic and/or geographic
distances, the combination of parameters should be set so that kernel
weight decays to near zero at the max_clim or
max_geog boundary. This allows the kernel function to drive
the results while the max cutoffs function mainly to reduce run time by
minimizing unneeded computation.
The helper function kernel_params() is useful for
determining what values to choose for theta,
max_clim, and max_geog. It gives theoretical
answers to the questions, “How big should theta be in order
for my kernel to capture a given fraction of pool sites for the typical
focal site?” and “How big should max_clim or
max_geog be so that it truncates only a given percentage of
kernel weight for the typical focal site?” The answers are theoretical,
assuming that pool follows a multivariate normal
distribution (as approximately achieved by
mahalanobis_transform()) or a uniform distribution, and
that the focal site is randomly sampled from pool.
Importantly, the answers are also subject to the curse of dimensionality
and depend on the number of climate variables, so intuition does not
always hold.
For example, if you had d = 2 climate variables and
wanted to size your Gaussian climate kernel so that it captures 10% of
the pool for the average site during the baseline era (and
assuming you had run mahanobis_transform() so that your
data were approximately multivariate normal), here’s how you get kernel
parameter recommendations:
kernel_params(fraction = 0.10, d = 2, loss = 0.01,
kernel = "gaussian", data_dist = "mvn")
#> $theta
#> [1] 0.3333333
#>
#> $max
#> [1] 0.9597052Analog weights
Beyond the kernel weights described above, the framework supports two other ways to weight pool sites in summary statistics. Both get multiplied through any kernel weighting and influence which analogs receive more weight in the aggregation, but do not affect which analogs are selected.
-
Cell-area weights correct for the fact that raster
grid cells often have non-uniform area. Without correction, naive
summaries over multiple analog grid cells are biased toward regions of
small cells (e.g. high latitudes on a lon-lat grid, or distorted regions
of a non-equal-area projection). When
poolis a raster, analog functions compute and apply cell-area weights by default. You can override this withcell_area_weight = FALSEto disable it. -
User weights let you express ecological or
methodological variation across pool sites — for example, sampling
effort, ecological intactness, occurrence probabilities for a focal
species, or any other per-site weight that should influence aggregation.
Pass them via the
weightargument as a numeric vector, single-column matrix or data frame, or single-layer SpatRaster matchingpool. Values must be non-negative;NAis allowed and is treated as zero (the site contributes nothing). User weights are reported alongside the analog index in pair-mode output, making it easy to filter or post-process based on the weight of each match. - Also note that a fourth source of weighting, sample
weights, arises only when you set
downsample < 1to accelerate queries on a large pool (as discussed below). The package rescales the surviving points’ weights to keep aggregations unbiased, and these weights get multiplied with the other three. You generally do not need to think about them; they are surfaced as asample_weightcolumn in pair-mode output for the rare case where you want to inspect them.
Analog enumeration functions
This category of analyses includes functions that simply identify and
return analogs (e.g. climate_velocity()) and functions that
instead return a summary of the count or weights of analogs. Let’s look
at them each in turn.
Climate velocity
Analog-based climate velocity metrics are widely used to assess dispersal constraints under climate change. Forward or outbound velocity, the distance from a site to the nearest location with a projected future climate similar to the site’s historical/current climate, represents the migration required for current residents to track baseline climate conditions. Reverse or inbound velocity, the distance from a site to the nearest location with a historical climate similar to the site’s projected future climate, represents constraints on the arrival of new residents adapted to the site’s future conditions.
These metrics are implemented as a geographic nearest-neighbor search
under a hard climate constraint, with reverse velocity implemented by
swapping the climate data for the two eras. For each focal site, the
result returns data on its single nearest geographic analog (subject to
the climate similarity threshold max_clim). The geographic
distance to that analog, divided by the time elapsed, is the velocity.
Sites with no analog in the reference pool—an important category— have
NA results.
fwd_vel <- analog_velocity(
x = hist,
pool = fut,
max_clim = 0.25,
k = 1
)
rev_vel <- analog_velocity(
x = fut,
pool = hist,
max_clim = 0.25,
k = 1
)
decades <- 6
vel <- c(fwd_vel$geog_dist, rev_vel$geog_dist) / decades
plot(vel, range = range(minmax(vel)),
main = c("Forward velocity (km/dec)", "Reverse velocity (km/dec)"))
In addition to geographic and climatic distances, the result also
includes the spatial coordinates of each site’s nearest analog (as well
as the analog’s index in the pool). This can be used for a
variety of downstream analyses, including visualizing the direction from
a site to its analog location:
fwd_vel$bearing <- geosphere::bearing(crds(fwd_vel, na.rm = FALSE),
values(fwd_vel)[, c("analog_x", "analog_y")])
plot(fwd_vel$bearing, main = c("Bearing to forward analog (deg)"),
col = rainbow(12))
Climate similarity
Climate similarity answers the inverse question from climate velocity: for each location’s climate, what is the most similar climate within a given geographic radius? This finds the most climatically similar locations that are geographically reachable, which can provide insight into how much climate change organisms must tolerate given limited dispersal capacity.
sim <- analog_similarity(
x = fut,
pool = hist,
max_geog = 100,
k = 1
)
plot(sim$clim_dist, main = "Similarity of best nearby climate analog")
Analog availability
How many analogs exist within specified climate and geographic thresholds? This maps the density of suitable analogs across the landscape. Locations with large numbers of nearby analogs have greater potential for adaptive dispersal under climate change, while locations with zero analogs represent novel or disappearing climates.
avail <- analog_availability(
x = fut,
pool = hist,
max_clim = 0.5,
max_geog = 200
)
plot(avail, main = "Analog availability (count)")
Analog density
Similar to availability, but weighted by climate kernel-based
similarity and/or geographic proximity rather than simply counted. This
captures both the quantity and quality (proximity/suitability) of analog
matches. By default, raw density is normalized by dividing it by its
theoretical maximum value (calculated from kernel,
max_geog, and theta) to make it
interpretable.
intens <- analog_density(
x = fut,
pool = hist,
max_clim = 0.6,
max_geog = 150,
kernel = "gaussian_joint",
theta = c(0.2, 50)
)
plot(intens, main = "Analog density")
Prediction functions
Compared to the functions covered above that simply enumerate
analogs, analog-based prediction functions find analogs and then fit a
model that summarizes additional properties (outcome variables
y) of those analogs. Model options include weighted means
and categorical counts (via analog_impact()) and ordinary
and ridge regression on additional covariates (via
analog_regression()).
Options for assessing uncertainty include returning standard errors
on fitted means and coefficients (via the se parameter) and
running various forms of cross-validation within pool (via
analog_cv()).
Analog impact models
Analog impact models (AIMs) predict ecological outcomes
(y) under climate change. For each focal site, this
approach finds locations within geographic range that have current
climates similar to the focal site’s future climate, then computes an
average (or class count for categorical y) of their
ecological characteristics, weighted by climatic similarity. As an
example, let’s use CWD values from the historical period as a stand-in
for an ecological response variable. To assess uncertainty in weighted
means, let’s also specify se = "ess" to compute standard
errors based on effective sample size for each grid cell.
# Use historical CWD as a proxy ecological variable
eco_var <- hist$CWD
impact <- analog_impact(
x = fut,
pool = hist,
y = eco_var,
max_clim = 0.5,
max_geog = 200,
kernel = "gaussian_clim",
theta = 0.15,
se = "ess"
)
plot(impact[[c("weighted_mean", "se_weighted_mean")]],
main = c("Predicted ecological state", "Standard error"))
The kernel and theta parameters control how
analog influence decays with climate distance. A Gaussian kernel with
theta = 0.15 means analogs at a climate distance of 0.15
receive about 60% weight, while those at 0.45 (= 3 × theta) receive
almost none. The hard threshold max_clim provides an
absolute cutoff.
Spatial interpolation
The analog framework is also useful for analyses outside climate
change scenarios. When x and pool share the
same climate era, analog_impact() performs climate- and/or
geographically-informed interpolation: for each grid cell, it finds
sample locations with similar climates or nearby locations, then
computes a weighted average of their measured values. This is useful for
mapping ecological variables from sparse field observations onto a
continuous grid. Unlike purely geographic interpolation methods (inverse
distance weighting, kriging), this approach can also weight observations
by climate similarity, so a distant site with matching climate can
contribute more than a nearby site with different climate.
The example below shows an interpolation informed by both climatic
similarity and geographic proximity, with their relative importance
determined by the theta parameters and the shape of the
selected kernel function.
# Simulate sparse field observations at 500 random locations
set.seed(123)
n_sites <- 500
cells <- sample(which(!is.na(values(hist[[1]]))), n_sites)
sites <- as.data.frame(hist, xy = TRUE, na.rm = FALSE)[cells, ]
sites$observed <- 2 * sites$CWD - 0.5 * sites$AET + rnorm(n_sites, sd = 0.3)
# Interpolate onto the full grid
interp <- analog_impact(
x = hist,
pool = as.matrix(sites[, c("x", "y", "CWD", "AET")]),
y = sites$observed,
max_clim = 1,
max_geog = 200,
kernel = "gaussian_joint",
theta = c(0.15, 100)
)
plot(interp[["weighted_mean"]], main = "Climate-informed spatial interpolation")
Local regression
analog_regression() fits a weighted linear regression
model within each analog neighborhood. This is useful if your outcome
variable y has important relationships with additional
predictors beyond the variables (climate and geography) used to define
the analog search kernel. In the example below we’ll use AET and its
square as covariates. The function returns and returns the coefficients
(and optionally, their standard errors) for each location in
x, and also returns predicted values if
x_covariates is provided.
The lambda parameter controls the amount of ridge
regularization. The default (lambda = 0) is ordinary
weighted least squares regression, with no regularization. Setting
lambda > 0 adds regularization, which shrinks
high-variance coefficients toward zero; this is useful for stabilizing
predictions when some neighborhoods have few analogs relative to the
number of covariates, or when covariates are strongly correlated. As
lambda -> Inf, the intercept term approaches the
weighted mean, i.e. the behavior of analog_impact(). The
typical approach for choosing a lambda value is
cross-validation (see the section on analog_cv() below),
but here we’ll arbitrarily pick a modest value.
# Simulate covariates for the pool (just using AET for expediency)
set.seed(42)
pool_covariates <- data.frame(
aet = as.vector(values(hist$AET)),
aet2 = as.vector(values(hist$AET))^2
)
# Do the same for x (not required but enables predicted values)
x_covariates <- data.frame(
aet = as.vector(values(fut$AET)),
aet2 = as.vector(values(fut$AET))^2
)
# Fit analog regression model
fit <- analog_regression(
x = fut,
pool = hist,
y = eco_var,
covariates = pool_covariates,
x_covariates = x_covariates,
max_clim = 0.25,
max_geog = 200,
kernel = "gaussian_clim",
theta = 0.15,
lambda = 1,
se = "ess" # request standard errors
)
# Plot a few of the output variables
plot(fit[[c("coef_aet", "se_aet", "pred")]],
main = c("A coefficient", "The coefficient's SE", "Predicted value"),
nr = 1)
Geographically weighted regression
The same regression machinery supports geographically weighted regression (GWR) by using geographic neighborhoods and geographic distance kernel weighting. This is a different configuration of the same underlying framework — no climate constraint, geographic kernel, local regression on covariates.
# GWR: geographic neighborhood, geographic kernel weighting
gwr <- analog_regression(
x = hist,
pool = hist,
y = eco_var,
covariates = pool_covariates,
select = "all",
max_geog = 100,
max_clim = NULL,
kernel = "inverse_geog",
theta = 30,
lambda = 0
)
plot(gwr[[c("coef_intercept", "coef_aet")]],
main = c("GWR intercept", "GWR coefficient: cov1"),
range = c(-2, 2), fill_range = TRUE)
Cross-validation
Cross-validation (CV) is a technique that involves comparing observed
data (y values for a given focal site) to predicted
versions of those same values made using a model that was fit on a
subset of the data that excludes the focal site. CV is useful for
estimating prediction error on historical data (to understand
uncertainty) and for calibrating model parameters (to improve predictive
accuracy).
analog_cv() provides CV workflows for
analog_impact() and analog_regression(). It
runs an analog analysis in cross-validation mode and returns held-out
predictions alongside observed values and residuals. It supports
leave-one-out (LOO) and k-fold cross-validation methods. LOO is the
default and for analog models (in contrast to traditional statistical
models) actually runs faster than k-fold because separate models are
already being fit for each focal site. You can plot and analyze CV
results directly, or pass them to cv_performance() to
calculate a basic set of overall performance metrics (e.g., R-squared
for continuous outcomes, AUC for binary outcomes, etc.).
Importantly, cross-validation is internal to
pool—it does not take an x argument
and can’t quantify how well the model predicts values in an independent
dataset of focal sites. Thus it should not be assumed that CV
performance metrics represent model transferability to data from
different regimes (such as an x dataset representing future
climates).
As an example, let’s run LOO cross-validation for an analog impact
model, plot a map of the residuals, and quantify the model’s
performance. If we wanted to tune a model parameter (labmda
max_geog, theta, etc.), we could call
cv_performance(analog_cv(...)) repeatedly with different
parameter values to identify the best-performing settings.
# Run cross-validation and plot residuals
cv <- analog_cv(
fun = analog_impact, pool = hist, y = eco_var,
max_geog = 200, max_clim = 0.5, theta = 0.15,
se = "ess", cv_method = "loo"
)
#> Error: `y` must have exactly 228823 rows/cells to match pool.
plot(cv$residual, main = "Cross-validation residual")
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'plot': object 'cv' not found
# Calculate prediction error metrics
cv_performance(cv)
#> Error: object 'cv' not foundComputational performance
The package is designed for large-scale analyses involving millions of pairwise comparisons.
Pre-built indices
When running multiple queries against the same reference pool, build the lattice index once and reuse it:
idx <- build_analog_index(hist)
# Multiple queries reuse the same index
avail_tight <- analog_availability(fut, idx, max_clim = 0.3, max_geog = 100)
avail_loose <- analog_availability(fut, idx, max_clim = 0.8, max_geog = 300)Index tuning
By default, the lattice resolution is auto-tuned for each query. For repeated queries with the same configuration, you can tune once and reuse:
res <- tune_index_res(
x = fut, pool = hist,
stat = "count",
max_clim = 0.5, max_geog = 200,
verbose = TRUE
)
idx <- build_analog_index(hist, index_res = res)Parallel processing
Use the n_threads parameter to parallelize across focal
locations:
result <- analog_velocity(fut, hist, max_clim = 0.5, k = 1, n_threads = 4)Large raster datasets
For rasters too large to fit in memory,
tiled_analog_search() processes the focal data in spatial
tiles:
result <- tiled_analog_search(
x = very_large_raster,
pool = idx,
stat = "count",
max_clim = 0.5,
max_geog = 200,
n_tiles = 16
)Downsampling
For very large reference pools, downsampling reduces computation. The
package uses an adaptive sampling routine that reduces the effects on
precision by downsampling more heavily in dense regions of
climatic-geographic space in order to preserve coverage in sparse
regions. As noted above, each pool site in the downsampled
data gets a sample weight that’s used to correct summary statistics so
that downsampling doesn’t bias the results:
result <- analog_availability(
x = fut, pool = hist,
max_clim = 0.5, max_geog = 200,
downsample = 0.1
)