Pre-builds a reusable lattice index from reference climate data. The index can be queried multiple times with different focal points and parameters, avoiding the need to rebuild the lattice for each query.
Usage
build_analog_index(
pool,
coord_type = c("auto", "lonlat", "projected"),
index_res = 16,
downsample = 1,
seed = NULL
)
Arguments
- pool
The reference dataset to search for analogs. Should be a matrix/data.frame with columns x, y, and climate variables, or a SpatRaster with climate variable layers.
- coord_type
Coordinate system type:
"auto" (default): Automatically detect from coordinate ranges.
"lonlat": Unprojected lon/lat coordinates (uses great-circle distance; assumes max_geog is in km).
"projected": Projected XY coordinates (uses planar distance; assumes max_geog is in projection units).
- index_res
Tuning parameter giving the number of bins per dimension of the internally-used lattice search index. Either:
A positive integer (the default here is 16).
"auto": Automatically tune the index resolution by optimizing compute time on a subsample of focal points. If focal has relatively few rows, auto-tuning is skipped and a default resolution of 16 is used.
Ignored if pool is an analog_index (uses index's resolution).
- downsample
Optional downsampling rate (0-1) indicating the proportion of points in pool to retain. Downsampling reduces memory use and improves query speed at the cost of some precision; adaptive stratified sampling is used to minimize loss of precision. The default is 1.0 (no downsampling). See Details for more info.
- seed
Optional random seed for reproducible downsampling. If NULL (default), uses the current R random state.
Value
An S3 object of class "analog_index" containing:
The compiled lattice index (internal C++ structure)
Reference data
Metadata: coordinate type, dimensions, ranges, resolution
Diagnostics: bin counts, occupancy statistics, and downsampling info
Details
The lattice index is a multidimensional grid of bins, built over both geographic and climate dimensions. This structure enables efficient analog searches by first filtering and sorting bins of similar points before computing exact results. For lon/lat coordinates, the index uses ECEF (Earth-Centered Earth-Fixed) space internally for optimal performance.
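As a rough illustration of the idea, the base R sketch below converts lon/lat to ECEF coordinates and assigns each point to a lattice bin across geographic and climate dimensions. It is not the package's internal implementation (which is compiled C++); the helpers lonlat_to_ecef() and bin_ids(), the spherical-Earth radius, and the toy data are all assumptions made for the example. One property that makes ECEF convenient is that straight-line (chord) distance between two points increases monotonically with their great-circle distance, so ordinary axis-aligned bins can be used for geographic filtering.

```r
# Hypothetical helper: lon/lat in degrees to ECEF coordinates (km),
# assuming a spherical Earth of radius r_km.
lonlat_to_ecef <- function(lon, lat, r_km = 6371) {
  lam <- lon * pi / 180
  phi <- lat * pi / 180
  cbind(
    X = r_km * cos(phi) * cos(lam),
    Y = r_km * cos(phi) * sin(lam),
    Z = r_km * sin(phi)
  )
}

# Hypothetical helper: assign each row of a numeric matrix to a bin
# index (1..res) along every dimension, using equal-width bins.
bin_ids <- function(m, res = 16) {
  apply(m, 2, function(v) {
    rng <- range(v)
    if (rng[1] == rng[2]) return(rep(1L, length(v)))
    pmin(res, as.integer(floor((v - rng[1]) / (rng[2] - rng[1]) * res)) + 1L)
  })
}

# Toy pool with columns x, y, and two climate variables.
set.seed(1)
pts <- data.frame(
  x = runif(1000, -125, -100),
  y = runif(1000, 30, 50),
  temp = rnorm(1000, 10, 5),
  precip = rexp(1000, 1 / 800)
)

# Lattice dimensions: three ECEF axes plus the climate variables.
features <- cbind(
  lonlat_to_ecef(pts$x, pts$y),
  as.matrix(pts[, c("temp", "precip")])
)
bins <- bin_ids(features, res = 16)

# One key per lattice cell; points sharing a key fall in the same bin.
keys <- apply(bins, 1, paste, collapse = "-")
occupancy <- table(keys)
length(occupancy)  # number of occupied bins
```

A query can then restrict exact distance calculations to the bins that could plausibly contain matches, which is the filtering step described above.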
Index resolution (index_res) controls the granularity of the lattice: the number
of bins along each dimension. The optimal value depends on your data size and query patterns.
Use tune_index_res() to find the best resolution for your use case,
or accept the default of 16 which works well for many applications.
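To give a feel for the trade-off, the base R sketch below counts occupied bins and average bin occupancy at several resolutions for a toy two-variable pool. It only illustrates why resolution matters (coarse bins hold many points each, fine bins multiply the number of bins to visit); it does not reproduce what tune_index_res() does, and occupancy_at() is a hypothetical helper.

```r
set.seed(42)
# Toy pool: two climate variables only (geographic dimensions omitted for brevity).
pool <- cbind(temp = rnorm(5000, 10, 5), precip = rnorm(5000, 800, 200))

# Hypothetical helper: occupied bin count and mean points per occupied bin
# when each dimension is cut into `res` equal-width bins.
occupancy_at <- function(m, res) {
  ids <- apply(m, 2, function(v) cut(v, breaks = res, labels = FALSE))
  counts <- table(ids[, 1], ids[, 2])
  occupied <- sum(counts > 0)
  c(res = res, occupied_bins = occupied, mean_per_bin = nrow(m) / occupied)
}

res_tab <- t(sapply(c(4, 8, 16, 32), occupancy_at, m = pool))
res_tab
```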
Downsampling
For very large datasets, downsampling can significantly improve memory usage
and query speed, at the cost of some precision. The downsample parameter controls
the target fraction of the data points in pool that are retained in the index.
Downsampling uses an adaptive stratified approach: densely-packed bins are thinned more
aggressively while sparse bins are preserved, which helps reduce imprecision
in sparse regions compared to fully random sampling. Note: The actual rate may be
higher than requested if maintaining at least one point per occupied bin requires
it (common with sparse data or fine-grained binning); check index$downsample_actual.
Each remaining analog in the downsampled pool gets a sample_weight indicating
the number of points it represents in the original pool; this weight is the inverse
of the sampling rate in the analog's index bin. For pair queries (stat = "none"),
results include each analog's sample_weight. For aggregation stats (count, sum,
mean, etc.), sampling weights are used internally to automatically correct for
the downsampling bias.
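The weighting logic can be sketched in base R as follows. This is one possible stratified scheme, assumed for illustration: it caps the number of points kept per bin, so dense bins are thinned and sparse bins are kept whole. The package's actual sampling rule may differ, and the variable names (bin, value, cap) are hypothetical. The weight definition, original bin count divided by retained bin count, follows the description above.

```r
set.seed(123)
# Toy data: bin membership for 10,000 points with very uneven bin density,
# plus a quantity whose mean differs between bins.
bin <- sample(1:50, 10000, replace = TRUE, prob = (1:50)^3)
value <- rnorm(10000, mean = bin)

target <- 0.1
n_b <- table(bin)  # points per occupied bin

# Assumed rule: pick the per-bin cap whose retained total is closest to the
# target, always keeping at least one point per occupied bin.
caps <- seq_len(max(n_b))
kept_total <- sapply(caps, function(k) sum(pmin(n_b, k)))
cap <- caps[which.min(abs(kept_total - target * length(bin)))]

# Keep sparse bins whole; sample `cap` points from denser bins.
keep <- unlist(lapply(split(seq_along(bin), bin), function(idx) {
  if (length(idx) <= cap) idx else sample(idx, cap)
}), use.names = FALSE)

# sample_weight: inverse of the sampling rate in each retained point's bin.
key <- as.character(bin[keep])
n_kept <- table(bin[keep])
sample_weight <- as.numeric(n_b[key]) / as.numeric(n_kept[key])

# Requested vs. achieved sampling rate.
c(requested = target, actual = length(keep) / length(bin))

# The weighted mean corrects the bias that the naive mean picks up
# from thinning dense bins more heavily.
est <- c(
  full = mean(value),
  naive = mean(value[keep]),
  weighted = weighted.mean(value[keep], sample_weight)
)
est
```

Because the weights within each bin sum to that bin's original count, weighted counts and sums over the downsampled points estimate the corresponding totals in the full pool.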
Examples
if (FALSE) { # \dontrun{
# Build index with default settings
index <- build_analog_index(climate_data)
# Build with explicit resolution
index <- build_analog_index(climate_data, index_res = 20)
# Build with downsampling for large datasets
index <- build_analog_index(
  large_climate_data,
  index_res = 16,
  downsample = 0.1, # Retain roughly 10% of the points in pool
  seed = 123        # Reproducible sampling
)
# Query the index multiple times
v1 <- analog_velocity(sites1, pool = index, max_clim = 0.5)
v2 <- analog_velocity(sites2, pool = index, max_clim = 0.3)
a1 <- analog_availability(sites3, pool = index, max_clim = 0.5, max_geog = 100)
} # }