Local weighted regression across analog neighborhoods
Source:R/analog_regression.R
analog_regression.RdFits a weighted local regression of y on covariates within
each focal location's analog neighborhood. Analog neighborhoods are
defined by climatic similarity, geographic proximity, or both, while
covariates capture additional predictors that influence outcomes within
each neighborhood but are not captured by to the search dimensions.
This generalizes the weighted mean — which averages over all within-neighborhood
variation — by resolving variation driven by these auxiliary predictors.
Supports ordinary and ridge-penalized weighted least squares. With purely geographic
neighborhoods, this is equivalent to geographically weighted regression
(GWR); with climate-based neighborhoods, it extends the analog impact
model (AIM) framework to incorporate local covariate effects. This function is
a wrapper that calls analog_search() with "regression" included in stat.
Usage
analog_regression(
x,
pool,
y,
weight = NULL,
covariates,
x_covariates = NULL,
max_geog = NULL,
max_clim = NULL,
select = "all",
k = NULL,
kernel = c("gaussian_clim", "inverse_clim", "gaussian_geog", "inverse_geog",
"gaussian_joint", "inverse_joint", "uniform"),
theta = NULL,
lambda = 0,
stat = c("count", "ess", "regression"),
se = c("none", "ess", "design"),
normalize = "auto",
x_cov = NULL,
coord_type = "auto",
index_res = "auto",
cell_area_weight = "auto",
n_threads = NULL,
progress = FALSE,
...
)Arguments
- x
Focal locations for which regressions will be fit. Should be a matrix/data.frame with columns x, y, and climate variables, or a SpatRaster with climate variable layers.
- pool
The reference dataset to search for analogs. Either:
Matrix/data.frame with columns x, y, and climate variables, or SpatRaster with climate variable layers, OR
An
analog_indexobject created bybuild_analog_index()(for repeated queries).
- y
Response variable(s) to model via local regression. Can be a numeric vector, matrix, data.frame, or SpatRaster. Must have exactly the same number of rows/cells as
pool. A separate regression is fit for each variable.- weight
Optional pool site weights for use in aggregation. Numeric vector, single-column matrix/data.frame, or single-layer SpatRaster, with one value per row/cell of
pool. For aggregation stats like"weighted_mean","regression", etc., weights multiply through the weighted aggregation alongside any kernel weighting and cell-area weighting; they do not influence which analogs are selected byknn_*modes (selection remains distance-only). They are reported in pair mode as auser_weightcolumn. Values must be non-negative;NAis allowed and treated as 0 (the point is excluded from aggregation). DefaultNULLmeans no user-supplied weights.If you want to exclude a static subset of pool sites entirely, masking
pool(and any associatedy/covariates) upfront is more efficient than passingweight = 0for those sites, since the lattice index will not have to scan or distance-compute against them. Useweight = 0for cases where the mask varies per query against a shared index, or where some sites have a continuous weight and others should be excluded.- covariates
Predictor variables for local regression, supplied at pool locations. Can be a numeric vector (single covariate), matrix, data.frame, or SpatRaster. Must have exactly the same number of rows/cells as
pool. Column/layer names carry through to output. These variables are NOT used for the analog search itself – only for regression within each neighborhood.- x_covariates
Optional predictor values at focal (
x) locations, at which to evaluate the local regressions. When supplied, the output includes one additional column (pred) or one column per response variable (pred_{yname}for multi-y), giving the fitted value at the focal. Must have exactly the same number of rows/cells asx, and the same column/layer names ascovariates. DefaultNULLreturns only coefficients.- max_geog
Maximum geographic distance constraint (default: NULL = no geographic constraint). When specified, only reference locations within this distance are considered. Radius units should be specified in kilometers if
coord_type = "lonlat", or in projected coordinate units ifcoord_type = "projected".- max_clim
Maximum climate distance constraint (default: NULL = no climate constraint). Can be either:
A scalar: Euclidean radius in climate space (e.g., 0.5)
A vector: Per-variable absolute differences (length must equal number of climate variables)
Only reference locations within this climate distance are considered. When
x_covis provided, scalar thresholds are interpreted in Mahalanobis distance units.- select
Character string specifying the analog selection strategy. One of:
"all"(default): Select all analogs that satisfy themax_climandmax_geogconstraints."knn_clim": For each focal, select up tokanalogs with smallest climate distance, subject to filters."knn_geog": For each focal, select up tokanalogs with smallest geographic distance, subject to filters.
- k
Number of nearest analogs to return per focal location for kNN selection modes. Required when
selectis"knn_geog"or"knn_clim"; must beNULLforselect = "all".- kernel
Kernel decay function for weighting matches, used only when
statincludes a weighted aggregation ("sum_weights","mean_weights","weighted_sum","weighted_mean","ess","regression", or"tabulate"). One of:"uniform": All matches weighted equally (kernel weight = 1.0)."inverse_clim": Inverse climate distance, kernel weight = 1 / (climate_distance + eps), with epsilon given bytheta."inverse_geog": Inverse geographic distance, kernel weight = 1 / (geographic_distance + eps), with epsilon given bytheta."gaussian_clim": Gaussian kernel on climate distance, kernel weight = exp(-climate_distance^2 / (2 sigma^2)), with sigma given bytheta."gaussian_geog": Gaussian kernel on geographic distance, kernel weight = exp(-geographic_distance^2 / (2 sigma^2)), with sigma given bytheta."gaussian_joint": Gaussian kernel on combined distance, kernel weight = exp(-(clim_dist^2 / (2 sigma_clim^2) + geog_dist^2 / (2 sigma_geog^2))), with sigmas given bytheta."inverse_joint": Inverse joint distance, kernel weight = 1 / (sqrt(clim_dist^2 + geog_dist^2) + eps), with epsilon given bytheta.
- theta
Optional numeric parameter controlling the shape of the weighting
kernel, used wheneverkernelis active (i.e. wheneverstatincludes a weighted aggregation) andkernelis not"uniform". Interpretation depends onkernel:For
"inverse_clim"or"inverse_geog": epsilon value added to distances (scalar; default: 1e-12 for climate, 1e-6 for geography).For
"gaussian_clim"or"gaussian_geog": sigma bandwidth parameter (scalar; larger values = slower decay with distance).For
"gaussian_joint"or"inverse_joint": 2-element vectorc(theta_clim, theta_geog)(defaults: 1 for climate, 1 for geography).
See
kernel_params()for help choosingthetaandmax_clim/max_geogvalues that work well together.- lambda
Ridge penalty parameter (default: 0, giving ordinary weighted least squares). Higher values shrink high-variance coefficients toward zero, causing the intercept to approach the weighted mean as
lambda -> Inf. Useful when some neighborhoods have few analogs relative to the number of covariates, or when covariates are strongly inter-correlated.- stat
Statistic(s) to compute.
"regression"is always included. Additional stats like"count","ess", and"weighted_mean"can be requested alongside regression coefficients. Default includes"count"and"ess"for diagnostics.- se
Standard-error framing to apply to SE-supporting stats (
"weighted_mean"and"regression"). One of:"none"(default): no SE columns are returned."ess": effective-sample-size framing. Forweighted_mean,SE = sqrt(var_w(y) / n_eff), wheren_eff = (Σw)² / Σw²is Kish's effective sample size andvar_w(y) = Σwy²/Σw - ȳ_w². For regression,Var(β̂) = σ²_ess · (X'WX + λI)⁻¹, with residual variance corrected usingn_eff - pdegrees of freedom."design": design-based framing (no assumption that weights are precisions). Forweighted_mean,SE = sqrt(Σ w²(y - ȳ_w)²) / Σw.
- normalize
One of
TRUE,FALSE, or"auto"(default). Only used ifstatincludes"sum_weights"or"tabulate"andpoolis a raster. When active, results for these stats are divided by a global scalar so that they represent a fraction of a theoretically "perfect" scenario where the full search area withinmax_geogis occupied wall-to-wall by cells whose climate exactly matchesx. See details underanalog_search()for more info.- x_cov
Optional focal-specific covariance matrices for Mahalanobis distance calculations. Should be a matrix or data.frame with one row per focal location and one column per unique covariance component, or a SpatRaster with a layer for each component. For n climate variables, there are n*(n+1)/2 unique components, ordered as: variances first (diagonals), then covariances (upper triangle by row).
- coord_type
Coordinate system type:
"auto"(default): Automatically detect from coordinate ranges."lonlat": Unprojected lon/lat coordinates (uses great-circle distance; assumesmax_geogis in km)."projected": Projected XY coordinates (uses planar distance; assumesmax_geogis in projection units).
- index_res
Tuning parameter giving the number of bins per dimension of the internally-used lattice search index. Either:
A positive integer.
"auto"(the default): Automatically tune the index resolution by optimizing compute time on a subsample of focal points. If focal has relatively few rows, auto-tuning is skipped and a default resolution of 16 is used. Auto-tuning is not supported whendownsample < 1, because the speed-optimal resolution can sometimes result in higher uncertainty of stat results under downsampling. In that case setindex_resexplicitly; finer values (e.g. 32) generally give better accuracy at the possible cost of query speed.
Ignored if
poolis ananalog_index(uses index's resolution).- cell_area_weight
Controls cell-area weighting when
poolis a raster. One of"auto"(default; on for raster pools, off otherwise),TRUE(force on; errors ifpoolis not a SpatRaster), orFALSE(force off). Cell-area weights correct aggregation statistics for non-uniform cell areas (e.g. lonlat grids near the poles, or projected grids on non-equal-area projections); they are computed viaterra::cellSize()and normalized to mean 1. Whenpoolis a pre-builtanalog_index, this argument must agree with the index's stored configuration:cell_area_weight = FALSEerrors if the index was built with cell-area weighting on (rebuild the index instead).- n_threads
Optional integer number of threads to use for the computation. If
NULL(default), the global RcppParallel setting is used (seeRcppParallel::setThreadOptions).- progress
Logical; if
TRUE, display a progress bar during computation. Progress tracking works by splitting the focal dataset into chunks and processing them sequentially. Useful for large datasets. Default isFALSE.- ...
additional arguments passed to
analog_search()(usually unneeded)
Value
A data.frame (or SpatRaster if x is a SpatRaster) with one
row per focal location containing:
index,x,y: Focal location identifiersColumns for any additional stats requested (e.g.,
count,ess)coef_intercept: Regression interceptcoef_{name}: Regression slope for each covariateWhen
se != "none":se_interceptandse_{name}giving standard errors of each coefficientWith multiple
yvariables: columns are namedcoef_{coeff}_{varname}(andse_{coeff}_{varname}when SEs are returned)When
x_covariatesis supplied:pred(single-y) orpred_{varname}(multi-y) giving the fitted value at each focal.
See analog_search() for column-naming conventions across stats
and metadata() for attached metadata attributes.
Details
Method
For each focal location, the function:
Selects analog pool locations based on
select,max_clim,max_geog, andkComputes distance-based kernel weights for each analog (via
kernelandtheta)Fits a weighted least squares regression of
yoncovariatesusing these weights, with optional ridge penaltylambdaReturns the regression coefficients (intercept + slopes), and optionally their standard errors (see
seinanalog_search()).
The math: beta = (X'WX + lambda * I_p)^{-1} X'Wy, where W is diagonal
weights, X is the design matrix (intercept + covariates), and I_p penalizes
covariate coefficients only (not the intercept).
Relationship to Weighted Mean
With lambda -> Inf, covariate coefficients shrink to zero and the
intercept converges to the weighted mean. With lambda = 0 (the default),
the full local regression is used. If covariates are centered (weighted
mean zero within each neighborhood), the intercept equals the weighted
mean at any lambda. The lambda parameter thus provides smooth
interpolation between a simple weighted average and a full local regression.
Common Configurations
Geographically weighted regression (
select = "all"or"knn_geog", withmax_geogand geographic weights): Local spatial regression using geographic proximity to define and weight neighborhoods, equivalent to GWR.Climate-nearest regression (
select = "knn_clim", withmax_geogandk): Fixed-size neighborhoods of the k most similar climates within geographic range.
Prediction
To evaluate fitted values at the focal locations, pass covariate values at
those locations via x_covariates. The output will then include a pred
column (single-y case) or pred_{yname} columns (multi-y case) alongside
the usual coefficient columns.
If you only need coefficients, leave x_covariates = NULL.
See also
analog_search() for the underlying flexible analog search function;
analog_impact() for the standard AIM workflow;
analog_cv() for cross-validation of regression fits.
Examples
if (FALSE) { # \dontrun{
# GWR-style spatial regression
gwr_result <- analog_regression(
x = sites,
pool = sites,
y = sites$income,
covariates = data.frame(education = sites$edu, access = sites$access),
select = "knn_geog",
k = 50,
max_clim = NULL,
kernel = "gaussian_geog",
theta = 20,
se = "ess"
)
# With focal-side covariates, to obtain fitted predictions
pred_result <- analog_regression(
x = future_sites,
pool = sites,
y = sites$income,
covariates = data.frame(education = sites$edu), # at pool
x_covariates = data.frame(education = future_sites$edu) # at focals
)
head(pred_result[, c("coef_intercept", "coef_education", "pred")])
} # }