Build Analog Index — build_analog_index

Pre-builds a reusable lattice index from reference climate data. The index can be queried multiple times with different focal points and parameters, avoiding the need to rebuild the lattice for each query.

Usage

build_analog_index(
  pool,
  coord_type = c("auto", "lonlat", "projected"),
  index_res = 16,
  downsample = 1,
  seed = NULL
)

Arguments

pool

The reference dataset to search for analogs. Should be a matrix/data.frame with columns x, y, and climate variables, or a SpatRaster with climate variable layers.

coord_type

Coordinate system type:

"auto" (default): Automatically detect from coordinate ranges.
"lonlat": Unprojected lon/lat coordinates (uses great-circle distance; assumes max_geog is in km).
"projected": Projected XY coordinates (uses planar distance; assumes max_geog is in projection units).
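The two distance modes can be illustrated with a minimal base-R sketch. The helper names haversine_km() and planar_dist() are hypothetical and not part of the package, and the spherical Earth radius of 6371 km is an assumption made here for illustration; the package's internal distance code may differ.

```r
# Great-circle distance (km) between lon/lat points via the haversine formula.
# Assumes a spherical Earth of radius 6371 km (illustrative choice).
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Planar (Euclidean) distance, in whatever units the projection uses.
planar_dist <- function(x1, y1, x2, y2) {
  sqrt((x2 - x1)^2 + (y2 - y1)^2)
}

# One degree of longitude covers less ground at 60 N than at the equator,
# which is why lon/lat coordinates should not be treated as planar.
d_equator <- haversine_km(0, 0, 1, 0)
d_60n     <- haversine_km(0, 60, 1, 60)

# Projected coordinates: ordinary straight-line distance
d_planar  <- planar_dist(0, 0, 3, 4)
```

With "lonlat" data, a geographic threshold is therefore interpreted in km, while with "projected" data it is interpreted in the units of the projection.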

index_res

Tuning parameter giving the number of bins per dimension of the internally-used lattice search index. Either:

A positive integer. The default in build_analog_index() is 16, as shown in Usage.
"auto": Automatically tune the index resolution by optimizing compute time on a subsample of focal points. If focal has relatively few rows, auto-tuning is skipped and a default resolution of 16 is used.

Ignored if pool is an analog_index (uses index's resolution).

downsample

Optional downsampling rate (0-1) indicating the proportion of points in pool to retain. Downsampling reduces memory use and improves query speed at the cost of some precision; adaptive stratified sampling is used to minimize loss of precision. The default is 1.0 (no downsampling). See Details for more info.

seed

Optional random seed for reproducible downsampling. If NULL (default), the current R random state is used.

Value

An S3 object of class "analog_index" containing:

The compiled lattice index (internal C++ structure)
Reference data
Metadata: coordinate type, dimensions, ranges, resolution
Diagnostics: bin counts, occupancy statistics, and downsampling info

Details

The lattice index is a multidimensional grid of bins, built over both geographic and climate dimensions. This structure enables efficient analog searches by first filtering and sorting bins of similar points before computing exact results. For lon/lat coordinates, the index uses ECEF (Earth-Centered Earth-Fixed) space internally for optimal performance.
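The binning idea can be sketched in base R as follows. This is an illustrative approximation only, not the package's C++ implementation: lonlat_to_ecef() and bin_points() are hypothetical helpers, the ECEF conversion assumes a sphere of radius 6371 km rather than an ellipsoid, and equal-width bins over each dimension's range are an assumption.

```r
# Spherical approximation of lon/lat -> ECEF coordinates (km).
# True ECEF uses an ellipsoid; a sphere keeps the sketch short.
lonlat_to_ecef <- function(lon, lat, radius = 6371) {
  lam <- lon * pi / 180
  phi <- lat * pi / 180
  cbind(X = radius * cos(phi) * cos(lam),
        Y = radius * cos(phi) * sin(lam),
        Z = radius * sin(phi))
}

# Assign each row to a lattice cell using index_res equal-width bins
# per dimension (bin numbers run from 1 to index_res).
bin_points <- function(vals, index_res = 16) {
  vals <- as.matrix(vals)
  mins <- apply(vals, 2, min)
  spans <- apply(vals, 2, max) - mins
  spans[spans == 0] <- 1                 # guard against constant columns
  scaled <- sweep(sweep(vals, 2, mins, "-"), 2, spans, "/")
  bins <- floor(scaled * index_res) + 1
  bins[bins > index_res] <- index_res    # column maximum goes in the last bin
  storage.mode(bins) <- "integer"
  bins
}

# Simulated pool: geographic coordinates plus two climate variables
set.seed(1)
n <- 200
lon <- runif(n, -120, -100)
lat <- runif(n, 30, 50)
clim <- cbind(temp = rnorm(n, 15, 5), precip = rnorm(n, 800, 200))

# Bin over geographic (ECEF) and climate dimensions together
dims <- cbind(lonlat_to_ecef(lon, lat), clim)
bins <- bin_points(dims, index_res = 4)

# Points sharing a cell ID are candidates to be filtered together
bin_id <- apply(bins, 1, paste, collapse = "-")
occupancy <- table(bin_id)
```

A query can then discard whole cells that lie outside the geographic or climate thresholds before computing exact distances for the points in the remaining cells.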

Index resolution (index_res) controls the granularity of the lattice, that is, the number of bins along each geographic and climate dimension. The optimal value depends on your data size and query patterns. Use tune_index_res() to find the best resolution for your use case, or accept the default of 16, which works well for many applications.

Downsampling

For very large datasets, downsampling can significantly improve memory usage and query speed, at the cost of some precision. The downsample parameter controls the target fraction of the data points in pool that are retained in the index. Downsampling uses an adaptive stratified approach: densely-packed bins are thinned more aggressively while sparse bins are preserved, which helps reduce imprecision in sparse regions compared to fully random sampling. Note: The actual rate may be higher than requested if maintaining at least one point per occupied bin requires it (common with sparse data or fine-grained binning); check index$downsample_actual.
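One way such a scheme could work is sketched below. This is a hypothetical illustration, not the package's algorithm: downsample_bins() is an invented helper that caps the number of points kept per bin, choosing the smallest cap that reaches the target fraction while always keeping at least one point per occupied bin.

```r
# Hypothetical cap-based stratified downsampling over bin IDs.
# bin_id: character vector of bin labels; rate: target fraction in (0, 1].
downsample_bins <- function(bin_id, rate, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  groups <- split(seq_along(bin_id), as.character(bin_id))
  counts <- lengths(groups)              # named vector of points per bin
  target <- rate * length(bin_id)

  # Smallest per-bin cap whose retained total reaches the target.
  # A cap of at least 1 means every occupied bin keeps a point.
  caps <- seq_len(max(counts))
  kept_totals <- vapply(caps, function(k) sum(pmin(counts, k)), numeric(1))
  cap <- caps[which(kept_totals >= target)[1]]

  # Dense bins are thinned to the cap; sparse bins are kept whole
  keep <- unlist(lapply(groups, function(idx) {
    if (length(idx) <= cap) idx else sample(idx, cap)
  }), use.names = FALSE)
  keep <- sort(keep)

  # Weight = inverse of the sampling rate within each point's bin
  bin_rate <- pmin(counts, cap) / counts
  list(
    keep = keep,
    sample_weight = as.numeric(1 / bin_rate[as.character(bin_id[keep])]),
    downsample_actual = length(keep) / length(bin_id)
  )
}

# One dense bin, one moderate bin, one singleton bin
bin_id <- c(rep("a", 90), rep("b", 9), "c")

res <- downsample_bins(bin_id, rate = 0.25, seed = 123)

# A very low requested rate cannot be met exactly, because each of the
# three occupied bins must keep at least one point.
res_low <- downsample_bins(bin_id, rate = 0.01, seed = 123)
```

In this sketch, res_low$downsample_actual exceeds the requested rate, which mirrors the situation described in the note above.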

Each remaining analog in the downsampled pool gets a sample_weight indicating the number of points it represents in the original pool; this weight is the inverse of the sampling rate in the analog's index bin. For pair queries (stat = "none"), results include each analog's sample_weight. For aggregation stats (count, sum, mean, etc.), sampling weights are used internally to automatically correct for the downsampling bias.
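The effect of the weights on aggregate statistics can be shown with simulated data. The sketch below does not call the package; the data frame, bin labels, and sampling rates are invented for illustration, and only base R functions are used.

```r
set.seed(42)

# Original pool: a dense bin with values near 10, a sparse bin near 20
full <- data.frame(
  bin   = c(rep("dense", 1000), rep("sparse", 10)),
  value = c(rnorm(1000, mean = 10), rnorm(10, mean = 20))
)

# Downsample: keep 10% of the dense bin and all of the sparse bin
keep <- c(sample(1000, 100), 1001:1010)
samp <- full[keep, ]

# sample_weight = inverse of the within-bin sampling rate
samp$sample_weight <- ifelse(samp$bin == "dense", 1 / 0.1, 1)

# Unweighted statistics over-represent the sparse bin
naive_mean <- mean(samp$value)

# Weighted statistics approximate those of the original pool
weighted_count <- sum(samp$sample_weight)
weighted_mean  <- weighted.mean(samp$value, samp$sample_weight)
full_mean      <- mean(full$value)
```

Here weighted_count recovers the size of the original pool, and weighted_mean lies much closer to full_mean than naive_mean does. When working with pair queries (stat = "none"), the returned sample_weight values can be used in the same way for custom summaries.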

Examples

if (FALSE) { # \dontrun{
# Build index with default settings
index <- build_analog_index(climate_data)

# Build with explicit resolution
index <- build_analog_index(climate_data, index_res = 20)

# Build with downsampling for large datasets
index <- build_analog_index(
  large_climate_data,
  index_res = 16,
  downsample = 0.1,  # Retain roughly 10% of the points in pool
  seed = 123         # Reproducible sampling
)

# Query the index multiple times
v1 <- analog_velocity(sites1, pool = index, max_clim = 0.5)
v2 <- analog_velocity(sites2, pool = index, max_clim = 0.3)
a1 <- analog_availability(sites3, pool = index, max_clim = 0.5, max_geog = 100)
} # }