metrics_as_scores.data package
Submodules
metrics_as_scores.data.pregenerate module
This module contains top-level functions that are used in highly parallel scenarios for pre-generating densities for your own datasets, either from previously computed fits for random variables or empirical densities.
- metrics_as_scores.data.pregenerate.generate_densities(dataset: ~metrics_as_scores.distribution.distribution.Dataset, clazz: type[metrics_as_scores.distribution.distribution.Density] = <class 'metrics_as_scores.distribution.distribution.Empirical'>, unique_vals: ~typing.Optional[bool] = None, resample_samples=250000, dist_transform: ~metrics_as_scores.distribution.distribution.DistTransform = DistTransform.NONE, num_jobs: ~typing.Optional[int] = None) dict[str, metrics_as_scores.distribution.distribution.Density] [source]
Generates a set of Density objects for a certain DistTransform. For each combination, we will later save one file that is then to be used in the web application, as generating these on-the-fly would take too long.
- dataset:
Dataset
Required for obtaining quantity types, contexts, and filtered data.
- clazz:
type[Density]
A type of empirical density to generate densities for.
- unique_vals:
bool
Used to conditionally add some jitter to the data to make all data points unique. This is automatically set to True if the class is Empirical, because this class is for continuous RVs. If the data is not continuous (real), then setting this to True will make it so.
- resample_samples:
int
Unsigned integer, passed forward to the Density type given by clazz.
- dist_transform:
DistTransform
The chosen transformation for the data.
- Return type:
dict[str, Density]
- Returns:
A dictionary where the key is made of the context and quantity type, and the value is the generated Empirical density.
- metrics_as_scores.data.pregenerate.fits_to_MAS_densities(dataset: Dataset, distns_dict: dict[int, metrics_as_scores.data.pregenerate_fit.FitResult], dist_transform: DistTransform, use_continuous: bool) dict[str, Union[metrics_as_scores.distribution.distribution.Parametric, metrics_as_scores.distribution.distribution.Parametric_discrete]] [source]
Converts previously produced parametric fits to Density objects that can be loaded and used in the web application. Similar to generate_densities(), this method also returns a dictionary with generated parametric densities.
- dataset:
Dataset
Required for obtaining quantity types, contexts, and filtered data.
- distns_dict:
dict[int, FitResult]
Dictionary with all fit results for a data transform. The int-key is just the previously used grid index and not relevant here.
- dist_transform:
DistTransform
The chosen transformation for the data.
- use_continuous:
bool
Used to select and generate densities based on either continuous (True) RVs or discrete RVs.
- Return type:
dict[str, Union[Parametric, Parametric_discrete]]
- Returns:
A dictionary where the key is made of the context and quantity type, and the value is the generated Union[Parametric, Parametric_discrete] density.
- metrics_as_scores.data.pregenerate.generate_empirical(dataset: Dataset, densities_dir: Path, clazz: Union[Empirical, KDE_approx], transform: DistTransform) None [source]
Generates a set of empirical (continuous) densities for a given density type (Empirical or KDE_approx) and data transform.
- dataset:
Dataset
Required for obtaining quantity types, contexts, and filtered data.
- densities_dir:
Path
The directory to store the generated densities. The name of the resulting file is derived from the density type used and the data transform.
- clazz:
Union[Empirical, KDE_approx]
The type of density you wish to create.
- transform:
DistTransform
The chosen transformation for the data.
- Returns:
This method does not return anything but only writes the result to disk.
- metrics_as_scores.data.pregenerate.generate_parametric(dataset: Dataset, densities_dir: Path, fits_dir: Path, clazz: Union[Parametric, Parametric_discrete], transform: DistTransform) None [source]
Generates a set of parametric densities for a given density type (Parametric or Parametric_discrete) and data transform.
- dataset:
Dataset
Required for obtaining quantity types, contexts, and filtered data.
- densities_dir:
Path
The directory to store the generated densities. The name of the resulting file is derived from the density type used and the data transform.
- clazz:
Union[Parametric, Parametric_discrete]
The type of density you wish to create.
- transform:
DistTransform
The chosen transformation for the data.
- Returns:
This method does not return anything but only writes the result to disk.
- metrics_as_scores.data.pregenerate.generate_empirical_discrete(dataset: Dataset, densities_dir: Path, transform: DistTransform) None [source]
Generates discrete empirical densities for a given data transform. Only the type Empirical_discrete is used for this.
- dataset:
Dataset
Required for obtaining quantity types, contexts, and filtered data.
- densities_dir:
Path
The directory to store the generated densities. The name of the resulting file is derived from the density type used and the data transform.
- transform:
DistTransform
The chosen transformation for the data.
- Returns:
This method does not return anything but only writes the result to disk.
metrics_as_scores.data.pregenerate_distns module
This module contains a single function that is used in highly parallel scenarios for fitting continuous and discrete random variables to data.
- metrics_as_scores.data.pregenerate_distns.generate_parametric_fits(ds: Dataset, num_jobs: int, fitter_type: type[metrics_as_scores.distribution.fitting.Fitter], dist_transform: DistTransform, selected_rvs_c: list[type[scipy.stats._distn_infrastructure.rv_continuous]], selected_rvs_d: list[type[scipy.stats._distn_infrastructure.rv_discrete]], data_dict: dict[str, nptyping.ndarray.NDArray], transform_values_dict: dict[str, float], data_discrete_dict: dict[str, nptyping.ndarray.NDArray], transform_values_discrete_dict: dict[str, float]) list[metrics_as_scores.data.pregenerate_fit.FitResult] [source]
The thinking is this: To each data series we can always fit a continuous distribution, whether it’s discrete or continuous data. The same is not true the other way round, i.e., we must not fit a discrete distribution if the data is known to be continuous. Therefore, we do the following:
- Regardless of the data, always attempt to fit a continuous RV.
- For all discrete data, also attempt to fit a discrete RV.
That means that for discrete data, we will have two kinds of fitted RVs. Also, when fitting a continuous RV to discrete data, we will add jitter to the data.
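The two-step idea above can be illustrated with a minimal sketch. It does not call generate_parametric_fits(); the sample data, the jitter width, and the chosen distributions are assumptions made purely for illustration, and the package's own jitter and fitting logic may differ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Hypothetical discrete sample (e.g., counts); not taken from any real dataset.
discrete_data = rng.poisson(lam=4.0, size=500).astype(float)

# Add a small uniform jitter so that the data points become unique.
# The width of 1e-3 is an arbitrary choice for this sketch.
jitter = rng.uniform(low=-1e-3, high=1e-3, size=discrete_data.shape)
jittered_data = discrete_data + jitter

# 1) A continuous RV is always attempted; here it is fitted to the jittered data.
loc, scale = stats.norm.fit(jittered_data)

# 2) For discrete data, a discrete RV is attempted as well. scipy's rv_discrete
#    has no fit() method, so this sketch uses the sample mean, which is the
#    maximum likelihood estimate of the Poisson rate.
lam_hat = discrete_data.mean()

print(f"norm: loc={loc:.3f}, scale={scale:.3f}; poisson: lambda={lam_hat:.3f}")
```

For discrete data this yields two candidate fits (one continuous, one discrete), matching the description above.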
- ds:
Dataset
The data, needed for obtaining quantity types and contexts. Also passed forward to fit().
- num_jobs:
int
Degree of parallelization used.
- fitter_type:
type[Fitter]
The class for the fitter to use, either Fitter or FitterPymoo.
- dist_transform:
DistTransform
The transform for which to generate parametric fits. Later, we will save a single file per transform, containing all related fits.
- selected_rvs_c:
list[type[rv_continuous]]
Continuous RVs to attempt to fit.
- selected_rvs_d:
list[type[rv_discrete]]
Discrete RVs to attempt to fit.
- data_dict:
dict[str, NDArray[Shape["*"], Float]]
A dictionary where the key consists of the context and the quantity type. For each entry, it contains a 1-D array of data used for fitting.
- transform_values_dict:
dict[str, float]
Similar to data_dict, this dictionary contains the transformation value that was used to transform the data in the 1-D array.
- data_discrete_dict:
dict[str, NDArray[Shape["*"], Float]]
Like data_dict, but for discrete RVs fitted to discrete data.
- transform_values_discrete_dict:
dict[str, float]
Like transform_values_dict, but for the discrete data.
- Returns:
A list of FitResult objects.
metrics_as_scores.data.pregenerate_fit module
This is an extra module that holds functions globally, such that we can exploit multiprocessing effortlessly. Here, the main fit() function is defined.
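Why module-level placement matters for multiprocessing can be shown with a small sketch. Worker processes typically receive functions via pickle, which serializes a function by reference to its module and name. The example uses statistics.mean from the standard library as a stand-in for any module-level function; it is not part of this package.

```python
import pickle
import statistics


def make_nested():
    """Returns a function defined inside another function (not at module level)."""
    def nested(x):
        return x * x
    return nested


# A module-level function is pickled by reference and can be restored elsewhere.
restored = pickle.loads(pickle.dumps(statistics.mean))

# A nested function cannot be looked up by module and name, so pickling fails.
try:
    pickle.dumps(make_nested())
    nested_is_picklable = True
except (pickle.PicklingError, AttributeError):
    nested_is_picklable = False

print(restored([1, 2, 3]), nested_is_picklable)
```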
- metrics_as_scores.data.pregenerate_fit.Continuous_RVs_dict: dict[str, type[scipy.stats._distn_infrastructure.rv_continuous]] = {'alpha_gen': <class 'scipy.stats._continuous_distns.alpha_gen'>, 'anglit_gen': <class 'scipy.stats._continuous_distns.anglit_gen'>, 'arcsine_gen': <class 'scipy.stats._continuous_distns.arcsine_gen'>, 'argus_gen': <class 'scipy.stats._continuous_distns.argus_gen'>, 'beta_gen': <class 'scipy.stats._continuous_distns.beta_gen'>, 'betaprime_gen': <class 'scipy.stats._continuous_distns.betaprime_gen'>, 'bradford_gen': <class 'scipy.stats._continuous_distns.bradford_gen'>, 'burr12_gen': <class 'scipy.stats._continuous_distns.burr12_gen'>, 'burr_gen': <class 'scipy.stats._continuous_distns.burr_gen'>, 'cauchy_gen': <class 'scipy.stats._continuous_distns.cauchy_gen'>, 'chi2_gen': <class 'scipy.stats._continuous_distns.chi2_gen'>, 'chi_gen': <class 'scipy.stats._continuous_distns.chi_gen'>, 'cosine_gen': <class 'scipy.stats._continuous_distns.cosine_gen'>, 'crystalball_gen': <class 'scipy.stats._continuous_distns.crystalball_gen'>, 'dgamma_gen': <class 'scipy.stats._continuous_distns.dgamma_gen'>, 'dweibull_gen': <class 'scipy.stats._continuous_distns.dweibull_gen'>, 'erlang_gen': <class 'scipy.stats._continuous_distns.erlang_gen'>, 'expon_gen': <class 'scipy.stats._continuous_distns.expon_gen'>, 'exponnorm_gen': <class 'scipy.stats._continuous_distns.exponnorm_gen'>, 'exponpow_gen': <class 'scipy.stats._continuous_distns.exponpow_gen'>, 'exponweib_gen': <class 'scipy.stats._continuous_distns.exponweib_gen'>, 'f_gen': <class 'scipy.stats._continuous_distns.f_gen'>, 'fatiguelife_gen': <class 'scipy.stats._continuous_distns.fatiguelife_gen'>, 'fisk_gen': <class 'scipy.stats._continuous_distns.fisk_gen'>, 'foldcauchy_gen': <class 'scipy.stats._continuous_distns.foldcauchy_gen'>, 'foldnorm_gen': <class 'scipy.stats._continuous_distns.foldnorm_gen'>, 'gamma_gen': <class 'scipy.stats._continuous_distns.gamma_gen'>, 'gausshyper_gen': <class 
'scipy.stats._continuous_distns.gausshyper_gen'>, 'genexpon_gen': <class 'scipy.stats._continuous_distns.genexpon_gen'>, 'genextreme_gen': <class 'scipy.stats._continuous_distns.genextreme_gen'>, 'gengamma_gen': <class 'scipy.stats._continuous_distns.gengamma_gen'>, 'genhalflogistic_gen': <class 'scipy.stats._continuous_distns.genhalflogistic_gen'>, 'genhyperbolic_gen': <class 'scipy.stats._continuous_distns.genhyperbolic_gen'>, 'geninvgauss_gen': <class 'scipy.stats._continuous_distns.geninvgauss_gen'>, 'genlogistic_gen': <class 'scipy.stats._continuous_distns.genlogistic_gen'>, 'gennorm_gen': <class 'scipy.stats._continuous_distns.gennorm_gen'>, 'genpareto_gen': <class 'scipy.stats._continuous_distns.genpareto_gen'>, 'gibrat_gen': <class 'scipy.stats._continuous_distns.gibrat_gen'>, 'gilbrat_gen': <class 'scipy.stats._continuous_distns.gilbrat_gen'>, 'gompertz_gen': <class 'scipy.stats._continuous_distns.gompertz_gen'>, 'gumbel_l_gen': <class 'scipy.stats._continuous_distns.gumbel_l_gen'>, 'gumbel_r_gen': <class 'scipy.stats._continuous_distns.gumbel_r_gen'>, 'halfcauchy_gen': <class 'scipy.stats._continuous_distns.halfcauchy_gen'>, 'halfgennorm_gen': <class 'scipy.stats._continuous_distns.halfgennorm_gen'>, 'halflogistic_gen': <class 'scipy.stats._continuous_distns.halflogistic_gen'>, 'halfnorm_gen': <class 'scipy.stats._continuous_distns.halfnorm_gen'>, 'hypsecant_gen': <class 'scipy.stats._continuous_distns.hypsecant_gen'>, 'invgamma_gen': <class 'scipy.stats._continuous_distns.invgamma_gen'>, 'invgauss_gen': <class 'scipy.stats._continuous_distns.invgauss_gen'>, 'invweibull_gen': <class 'scipy.stats._continuous_distns.invweibull_gen'>, 'johnsonsb_gen': <class 'scipy.stats._continuous_distns.johnsonsb_gen'>, 'johnsonsu_gen': <class 'scipy.stats._continuous_distns.johnsonsu_gen'>, 'kappa3_gen': <class 'scipy.stats._continuous_distns.kappa3_gen'>, 'kappa4_gen': <class 'scipy.stats._continuous_distns.kappa4_gen'>, 'ksone_gen': <class 
'scipy.stats._continuous_distns.ksone_gen'>, 'kstwo_gen': <class 'scipy.stats._continuous_distns.kstwo_gen'>, 'kstwobign_gen': <class 'scipy.stats._continuous_distns.kstwobign_gen'>, 'laplace_asymmetric_gen': <class 'scipy.stats._continuous_distns.laplace_asymmetric_gen'>, 'laplace_gen': <class 'scipy.stats._continuous_distns.laplace_gen'>, 'levy_gen': <class 'scipy.stats._continuous_distns.levy_gen'>, 'levy_l_gen': <class 'scipy.stats._continuous_distns.levy_l_gen'>, 'loggamma_gen': <class 'scipy.stats._continuous_distns.loggamma_gen'>, 'logistic_gen': <class 'scipy.stats._continuous_distns.logistic_gen'>, 'loglaplace_gen': <class 'scipy.stats._continuous_distns.loglaplace_gen'>, 'lognorm_gen': <class 'scipy.stats._continuous_distns.lognorm_gen'>, 'lomax_gen': <class 'scipy.stats._continuous_distns.lomax_gen'>, 'maxwell_gen': <class 'scipy.stats._continuous_distns.maxwell_gen'>, 'mielke_gen': <class 'scipy.stats._continuous_distns.mielke_gen'>, 'moyal_gen': <class 'scipy.stats._continuous_distns.moyal_gen'>, 'nakagami_gen': <class 'scipy.stats._continuous_distns.nakagami_gen'>, 'ncf_gen': <class 'scipy.stats._continuous_distns.ncf_gen'>, 'nct_gen': <class 'scipy.stats._continuous_distns.nct_gen'>, 'ncx2_gen': <class 'scipy.stats._continuous_distns.ncx2_gen'>, 'norm_gen': <class 'scipy.stats._continuous_distns.norm_gen'>, 'norminvgauss_gen': <class 'scipy.stats._continuous_distns.norminvgauss_gen'>, 'pareto_gen': <class 'scipy.stats._continuous_distns.pareto_gen'>, 'pearson3_gen': <class 'scipy.stats._continuous_distns.pearson3_gen'>, 'powerlaw_gen': <class 'scipy.stats._continuous_distns.powerlaw_gen'>, 'powerlognorm_gen': <class 'scipy.stats._continuous_distns.powerlognorm_gen'>, 'powernorm_gen': <class 'scipy.stats._continuous_distns.powernorm_gen'>, 'rayleigh_gen': <class 'scipy.stats._continuous_distns.rayleigh_gen'>, 'rdist_gen': <class 'scipy.stats._continuous_distns.rdist_gen'>, 'recipinvgauss_gen': <class 
'scipy.stats._continuous_distns.recipinvgauss_gen'>, 'reciprocal_gen': <class 'scipy.stats._continuous_distns.reciprocal_gen'>, 'rice_gen': <class 'scipy.stats._continuous_distns.rice_gen'>, 'semicircular_gen': <class 'scipy.stats._continuous_distns.semicircular_gen'>, 'skew_norm_gen': <class 'scipy.stats._continuous_distns.skew_norm_gen'>, 'skewcauchy_gen': <class 'scipy.stats._continuous_distns.skewcauchy_gen'>, 'studentized_range_gen': <class 'scipy.stats._continuous_distns.studentized_range_gen'>, 't_gen': <class 'scipy.stats._continuous_distns.t_gen'>, 'trapezoid_gen': <class 'scipy.stats._continuous_distns.trapezoid_gen'>, 'triang_gen': <class 'scipy.stats._continuous_distns.triang_gen'>, 'truncexpon_gen': <class 'scipy.stats._continuous_distns.truncexpon_gen'>, 'truncnorm_gen': <class 'scipy.stats._continuous_distns.truncnorm_gen'>, 'truncpareto_gen': <class 'scipy.stats._continuous_distns.truncpareto_gen'>, 'truncweibull_min_gen': <class 'scipy.stats._continuous_distns.truncweibull_min_gen'>, 'tukeylambda_gen': <class 'scipy.stats._continuous_distns.tukeylambda_gen'>, 'uniform_gen': <class 'scipy.stats._continuous_distns.uniform_gen'>, 'vonmises_gen': <class 'scipy.stats._continuous_distns.vonmises_gen'>, 'wald_gen': <class 'scipy.stats._continuous_distns.wald_gen'>, 'weibull_max_gen': <class 'scipy.stats._continuous_distns.weibull_max_gen'>, 'weibull_min_gen': <class 'scipy.stats._continuous_distns.weibull_min_gen'>, 'wrapcauchy_gen': <class 'scipy.stats._continuous_distns.wrapcauchy_gen'>}
Dictionary of continuous random variables that are supported by scipy.stats. Note this is a dictionary of types, rather than instances.
- metrics_as_scores.data.pregenerate_fit.Discrete_RVs_dict: dict[str, type[scipy.stats._distn_infrastructure.rv_discrete]] = {'bernoulli_gen': <class 'scipy.stats._discrete_distns.bernoulli_gen'>, 'betabinom_gen': <class 'scipy.stats._discrete_distns.betabinom_gen'>, 'binom_gen': <class 'scipy.stats._discrete_distns.binom_gen'>, 'boltzmann_gen': <class 'scipy.stats._discrete_distns.boltzmann_gen'>, 'dlaplace_gen': <class 'scipy.stats._discrete_distns.dlaplace_gen'>, 'geom_gen': <class 'scipy.stats._discrete_distns.geom_gen'>, 'hypergeom_gen': <class 'scipy.stats._discrete_distns.hypergeom_gen'>, 'logser_gen': <class 'scipy.stats._discrete_distns.logser_gen'>, 'nbinom_gen': <class 'scipy.stats._discrete_distns.nbinom_gen'>, 'nchypergeom_fisher_gen': <class 'scipy.stats._discrete_distns.nchypergeom_fisher_gen'>, 'nchypergeom_wallenius_gen': <class 'scipy.stats._discrete_distns.nchypergeom_wallenius_gen'>, 'nhypergeom_gen': <class 'scipy.stats._discrete_distns.nhypergeom_gen'>, 'planck_gen': <class 'scipy.stats._discrete_distns.planck_gen'>, 'poisson_gen': <class 'scipy.stats._discrete_distns.poisson_gen'>, 'randint_gen': <class 'scipy.stats._discrete_distns.randint_gen'>, 'skellam_gen': <class 'scipy.stats._discrete_distns.skellam_gen'>, 'yulesimon_gen': <class 'scipy.stats._discrete_distns.yulesimon_gen'>, 'zipf_gen': <class 'scipy.stats._discrete_distns.zipf_gen'>, 'zipfian_gen': <class 'scipy.stats._discrete_distns.zipfian_gen'>}
Dictionary of discrete random variables that are supported by scipy.stats. Note this is a dictionary of types, rather than instances.
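The distinction between types and instances can be sketched as follows. This example does not import the two dictionaries; instead it obtains a type via type(scipy.stats.norm), which is the class named norm_gen. Passing name="norm" when instantiating mirrors how scipy creates its own instance and is an illustrative choice here.

```python
from scipy import stats

# The dictionary values are classes such as norm_gen, not ready-to-use RVs.
norm_type = type(stats.norm)
print(norm_type.__name__)

# A type must be instantiated before it can be used for fitting or evaluation.
rv = norm_type(name="norm")

# fit() on the instance returns the estimated parameters (loc, scale for norm).
params = rv.fit([0.1, -0.3, 0.2, 0.5, -0.4])
print(params)
```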
- metrics_as_scores.data.pregenerate_fit.get_data_tuple(ds: Dataset, qtype: str, dist_transform: DistTransform, continuous_transform: bool = True) list[tuple[str, nptyping.ndarray.NDArray]] [source]
This method is part of the workflow for computing parametric fits. For a specific type of quantity and transform, it creates datasets for all available contexts.
- ds:
Dataset
- qtype:
str
The type of quantity to get datasets for.
- dist_transform:
DistTransform
The chosen distribution transform.
- continuous_transform:
bool
Whether the transform is real-valued or must be converted to integer.
- Return type:
list[tuple[str, NDArray[Shape["*"], Float]]]
- Returns:
A list of two-element tuples, in line with the return type. The first element is a key that identifies the context, the quantity type, and whether the data was computed using unique values (see Dataset.transform()). The second element is the 1-D array of data.
- class metrics_as_scores.data.pregenerate_fit.FitResult[source]
Bases: TypedDict
This class is derived from TypedDict and holds all properties related to a single fit result, that is, a single specific configuration that was fit to a 1-D array of data.
- grid_idx: int
- dist_transform: str
- transform_value: Optional[float]
- params: dict[str, Union[float, int]]
- context: str
- qtype: str
- rv: str
- type: str
- stat_tests: StatisticalTestJson
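The following sketch shows what a dictionary with these fields might look like. FitResultSketch is a hypothetical stand-in defined only for this example, StatisticalTestJson is approximated by a plain dict, and all values are made up.

```python
from typing import Any, Optional, TypedDict, Union


class FitResultSketch(TypedDict):
    """Hypothetical stand-in that mirrors the fields listed above."""
    grid_idx: int
    dist_transform: str
    transform_value: Optional[float]
    params: dict[str, Union[float, int]]
    context: str
    qtype: str
    rv: str
    type: str
    stat_tests: dict[str, Any]  # assumption: StatisticalTestJson treated as a plain dict


# All values below are invented for illustration.
example: FitResultSketch = {
    "grid_idx": 0,
    "dist_transform": "NONE",
    "transform_value": None,
    "params": {"loc": 0.0, "scale": 1.0},
    "context": "some_context",
    "qtype": "some_quantity",
    "rv": "norm_gen",
    "type": "continuous",
    "stat_tests": {},
}

print(sorted(example.keys()))
```

At runtime a TypedDict is an ordinary dict, so such a result can be serialized or passed between processes like any other dictionary.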
- metrics_as_scores.data.pregenerate_fit.fit(ds: ~metrics_as_scores.distribution.distribution.Dataset, fitter_type: type[metrics_as_scores.distribution.fitting.Fitter], grid_idx: int, row, dist_transform: ~metrics_as_scores.distribution.distribution.DistTransform, the_data: ~nptyping.ndarray.NDArray[~nptyping.base_meta_classes.Shape[*], ~numpy.float64], the_data_unique: ~nptyping.ndarray.NDArray[~nptyping.base_meta_classes.Shape[*], ~numpy.float64], transform_value: ~typing.Optional[float], write_temporary_results: bool = False) FitResult [source]
This is the main stand-alone function that computes a single parametric fit to a single 1-D array of data. This function is used in Parallel contexts and, therefore, lives on module top level so it can be serialized.
- ds:
Dataset
The data, needed for obtaining quantity types and contexts. Also passed forward to fit().
- fitter_type:
type[Fitter]
The class for the fitter to use, either Fitter or FitterPymoo.
- grid_idx:
int
This is only used so it can be stored in the FitResult. This method itself does not have access to the grid.
- dist_transform:
DistTransform
The transform for which to generate parametric fits. Later, we will save a single file per transform, containing all related fits.
- the_data:
NDArray[Shape["*"], Float]
The 1-D data used for fitting the RV.
- the_data_unique:
NDArray[Shape["*"], Float]
1-D array of data. In case of continuous data, it is the same as the_data. In case of discrete data, the data in this array contains a slight jitter so as to make all data points unique. Using this data is relevant for conducting statistical goodness-of-fit tests.
- Returns:
The FitResult. If the RV could not be fitted, then the parameters in the fitting result will have a value of None. This is so this method does not throw exceptions. In case of a failure, no statistical tests are computed, either.
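Because failures are signalled through None values rather than exceptions, callers may want to filter results before using them. The sketch below is one possible way to do so. The helper fit_succeeded and the sample results are hypothetical, and since the text above does not specify whether params itself or its individual values are None, both cases are handled.

```python
from typing import Any, Optional


def fit_succeeded(result: dict[str, Any]) -> bool:
    """Hypothetical helper: True only if all fitted parameters are present."""
    params: Optional[dict[str, Any]] = result.get("params")
    if params is None:
        return False
    return all(value is not None for value in params.values())


# Made-up results, reduced to the two fields relevant here.
results = [
    {"rv": "norm_gen", "params": {"loc": 0.0, "scale": 1.0}},
    {"rv": "gamma_gen", "params": {"a": None, "loc": None, "scale": None}},
    {"rv": "beta_gen", "params": None},
]

usable = [r for r in results if fit_succeeded(r)]
print([r["rv"] for r in usable])
```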
Module contents
This package contains functions that are needed for the large-scale fitting of continuous and discrete random variables to your own data, as well as functions for pre-generating densities for your own datasets that are then used in the interactive web application.