Segment a dataset by each row once, then compute vicinities of samples in the neighborhood.

Given an entire dataset, this function uses each instance in it to demarcate a neighborhood based on the selected features. Then, for each neighborhood, the vicinity of every sample to it is computed. The result is an N x N matrix, where the entry \(m_{i,j}\) is the vicinity of sample \(s_j\) to neighborhood \(N_i\).

vicinities(
  df,
  selectedFeatureNames = c(),
  shiftAmount = 0.1,
  doEcdf = FALSE,
  ecdfMinusOne = FALSE,
  retainMinValues = 0,
  useParallel = NULL
)

Arguments

df	data.frame to compute the matrix of vicinities for.
selectedFeatureNames	vector of names of features to use for computing the vicinity/centrality of each sample to each neighborhood.
shiftAmount	numeric DEFAULT 0.1 optional amount to shift each feature's probability by. This is useful when the centrality does not need to be an actual probability and many features are selected. To obtain actual probabilities, this needs to be 0, and you must use the ECDF.
doEcdf	boolean DEFAULT FALSE whether to use the ECDF instead of the EPDF to find the likelihood of continuous values.
ecdfMinusOne	boolean DEFAULT FALSE only has an effect if the ECDF is used. If true, uses 1 minus the ECDF to find the probability of a continuous value. Whether this is useful depends on how you intend to interpret the result.
retainMinValues	DEFAULT 0 the number of samples to retain during segmentation. For separating a neighborhood, this value typically should be 0, so that no samples are included that are not within it. However, for very sparse data or a large number of variables, it might still make sense to retain samples.
useParallel	boolean DEFAULT NULL whether to use parallelism or not. Setting this to true requires that a parallel backend has been registered beforehand. If parallel computing is enabled, then each neighborhood is computed separately.
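
The useParallel argument expects a parallel backend to be registered before the call. The following is a minimal sketch of one way to do that. It assumes that mmb picks up a backend registered through the foreach framework, and it uses the doParallel package for the registration; both are assumptions, not something this page states. The worker count of 2 is arbitrary.

```r
# Sketch only: assumes mmb uses a backend registered for 'foreach',
# and that the 'doParallel' package is installed.
if (requireNamespace("mmb", quietly = TRUE) &&
    requireNamespace("doParallel", quietly = TRUE)) {
  cl <- parallel::makeCluster(2)       # start two local worker processes
  doParallel::registerDoParallel(cl)   # register the backend before the call

  m <- mmb::vicinities(df = iris[1:10, ], useParallel = TRUE)

  parallel::stopCluster(cl)            # release the workers again
  print(dim(m))                        # one row and one column per sample
}
```

Stopping the cluster afterwards avoids leaving idle worker processes behind.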

Value

matrix with \(N^2\) entries (N being the number of rows of the data.frame). Each row i corresponds to the neighborhood demarcated by sample i, and each column j holds the vicinity of sample \(s_j\) to that neighborhood. No value on the diagonal is zero, because each neighborhood always contains the sample it was demarcated by, and that sample has a similarity greater than zero to it.
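
To make the row and column convention concrete, the sketch below indexes a small matrix in base R. The values are copied from the top-left 3 x 3 block of the first example's printed output on this page; they only illustrate the indexing and are not what the function would return for a three-row data.frame. The variable names are illustrative.

```r
# Top-left 3 x 3 block of the first example's printed output (illustration only).
m <- matrix(
  c(5.646927, 0.0000000, 0.00000,
    5.646927, 0.6764991, 0.00000,
    5.646927, 0.0000000, 1.61051),
  nrow = 3, byrow = TRUE,
  dimnames = list(as.character(1:3), as.character(1:3))
)

# Row i: vicinities of all samples to the neighborhood demarcated by sample i.
neighborhood2 <- m[2, ]

# Single entry m[i, j]: vicinity of sample j to neighborhood i.
v23 <- m[2, 3]

# Samples with a non-zero vicinity to neighborhood 2.
members2 <- which(neighborhood2 > 0)

# Each sample has a non-zero vicinity to its own neighborhood.
diagonalNonZero <- all(diag(m) > 0)
```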

Examples

w <- mmb::getWarnings()
mmb::setWarnings(FALSE)
#> [1] FALSE
mmb::vicinities(df = iris[1:10,])
#> Warning: provided 4 variables to replace 1 variables
#> Warning: provided 2 variables to replace 1 variables
#> Warning: provided 2 variables to replace 1 variables
#> Warning: provided 4 variables to replace 1 variables
#> Warning: provided 10 variables to replace 1 variables
#> Warning: provided 2 variables to replace 1 variables
#> Warning: provided 6 variables to replace 1 variables
#>           1         2       3        4       5        6        7        8
#> 1  5.646927 0.0000000 0.00000   0.0000  0.0000 1.518046  0.00000 0.000000
#> 2  5.646927 0.6764991 0.00000   0.0000 18.2328 1.518046  0.00000 2.738358
#> 3  5.646927 0.0000000 1.61051   0.0000 18.2328 1.518046  0.00000 2.738358
#> 4  0.000000 0.0000000 0.00000 144.9559  0.0000 1.518046  0.00000 2.738358
#> 5  0.000000 0.0000000 0.00000   0.0000 18.2328 1.518046  0.00000 0.000000
#> 6  0.000000 0.0000000 0.00000   0.0000  0.0000 1.518046  0.00000 0.000000
#> 7  0.000000 0.0000000 0.00000   0.0000  0.0000 1.518046 58.97993 0.000000
#> 8  0.000000 0.0000000 0.00000   0.0000  0.0000 1.518046  0.00000 2.738358
#> 9  5.646927 0.6764991 0.00000 144.9559 18.2328 1.518046 58.97993 2.738358
#> 10 0.000000 0.0000000 0.00000   0.0000  0.0000 1.518046  0.00000 2.738358
#>          9      10
#> 1  0.00000 0.00000
#> 2  0.00000 0.00000
#> 3  0.00000 0.00000
#> 4  0.00000 0.00000
#> 5  0.00000 0.00000
#> 6  0.00000 0.00000
#> 7  0.00000 0.00000
#> 8  0.00000 0.00000
#> 9  1.61051 0.00000
#> 10 0.00000 1.61051

# Run the same, but use the ECDF and retain more values:
mmb::vicinities(df = iris[1:10,], doEcdf = TRUE, retainMinValues = 10)
#> Warning: provided 10 variables to replace 1 variables
#> Warning: provided 10 variables to replace 1 variables
#> Warning: provided 10 variables to replace 1 variables
#> Warning: provided 10 variables to replace 1 variables
#> Warning: provided 10 variables to replace 1 variables
#> Warning: provided 10 variables to replace 1 variables
#> Warning: provided 10 variables to replace 1 variables
#> Warning: provided 10 variables to replace 1 variables
#> Warning: provided 10 variables to replace 1 variables
#> Warning: provided 10 variables to replace 1 variables
#>         1      2      3      4      5      6      7      8      9     10
#> 1  0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237
#> 2  0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237
#> 3  0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237
#> 4  0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237
#> 5  0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237
#> 6  0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237
#> 7  0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237
#> 8  0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237
#> 9  0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237
#> 10 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237 0.6237
mmb::setWarnings(w)
#> [1] TRUE
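
The shiftAmount argument description says that actual probabilities require a shift of 0 together with the ECDF. A hedged sketch of such a call, restricted to two features, could look as follows. The choice of features is arbitrary, the variable name p is illustrative, and no output is shown because it was not produced for this page.

```r
# Sketch only: combines arguments documented above; the feature choice is arbitrary.
if (requireNamespace("mmb", quietly = TRUE)) {
  p <- mmb::vicinities(
    df = iris[1:10, ],
    selectedFeatureNames = c("Sepal.Length", "Sepal.Width"),
    shiftAmount = 0,   # no shift, as required for actual probabilities
    doEcdf = TRUE      # use the ECDF instead of the EPDF
  )
  print(dim(p))        # one row and one column per sample
}
```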
