metrics_as_scores.data package

Submodules

metrics_as_scores.data.pregenerate module

This module contains top-level functions that are used in highly parallel scenarios for pre-generating densities for your own datasets, either from previously computed fits for random variables or from empirical densities.

metrics_as_scores.data.pregenerate.generate_densities(dataset: ~metrics_as_scores.distribution.distribution.Dataset, clazz: type[metrics_as_scores.distribution.distribution.Density] = <class 'metrics_as_scores.distribution.distribution.Empirical'>, unique_vals: ~typing.Optional[bool] = None, resample_samples=250000, dist_transform: ~metrics_as_scores.distribution.distribution.DistTransform = DistTransform.NONE, num_jobs: ~typing.Optional[int] = None) dict[str, metrics_as_scores.distribution.distribution.Density][source]

Generates a set of Density objects for a certain DistTransform. For each combination, we will later save one file that is then to be used in the web application, as generating these on-the-fly would take too long.

dataset: Dataset

Required for obtaining quantity types, contexts, and filtered data.

clazz: type[Density]

A type of empirical density to generate densities for.

unique_vals: bool

Used to conditionally add some jitter to the data in order to make all data points unique. This is automatically set to True if the class is Empirical, because this class is for continuous RVs. If the data is not continuous (real), then setting this to True will make it so.

resample_samples: int

Unsigned integer, passed forward to the Density type given by clazz.

dist_transform: DistTransform

The chosen transformation for the data.

Return type:

dict[str, Density]

Returns:

A dictionary where the key is made of the context and quantity type, and the value is the generated Empirical density.
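The effect of unique_vals can be illustrated with a minimal sketch. It is not the package's implementation: the helper add_jitter, its scale, and its seed are assumptions made for illustration only. It shows how a very small random offset turns tied, discrete observations into distinct real values.

```python
import random


def add_jitter(values, scale=1e-6, seed=0):
    """Return a copy of `values` with a tiny random offset added to each point.

    Hypothetical helper; `scale` and `seed` are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [v + rng.uniform(-scale, scale) for v in values]


discrete = [1, 1, 2, 2, 2, 3]
jittered = add_jitter(discrete)

# Number of distinct values before and after adding jitter.
print(len(set(discrete)), len(set(jittered)))
```

The offsets are kept several orders of magnitude below the spacing of the data, so the overall shape of the distribution is practically unchanged.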

metrics_as_scores.data.pregenerate.fits_to_MAS_densities(dataset: Dataset, distns_dict: dict[int, metrics_as_scores.data.pregenerate_fit.FitResult], dist_transform: DistTransform, use_continuous: bool) dict[str, Union[metrics_as_scores.distribution.distribution.Parametric, metrics_as_scores.distribution.distribution.Parametric_discrete]][source]

Converts previously produced parametric fits to Density objects that can be loaded and used in the web application. Similar to generate_densities(), this method also returns a dictionary with generated parametric densities.

dataset: Dataset

Required for obtaining quantity types, contexts, and filtered data.

distns_dict: dict[int, FitResult]

Dictionary with all fit results for a data transform. The int-key is just the previously used grid index and not relevant here.

dist_transform: DistTransform

The chosen transformation for the data.

use_continuous: bool

Used to select and generate densities based on either continuous RVs (True) or discrete RVs (False).

Return type:

dict[str, Union[Parametric, Parametric_discrete]]

Returns:

A dictionary where the key is made of the context and quantity type, and the value is the generated Union[Parametric, Parametric_discrete] density.
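As a rough sketch of the first step of such a conversion, the following groups fit results by context and quantity type. Everything here is an assumption for illustration: the key format, the values "continuous" and "discrete" of the type field, and the helper group_fits are hypothetical, and how the package chooses among several candidate fits is not shown.

```python
def group_fits(distns_dict, use_continuous):
    """Group usable fit results under a '<context>_<qtype>' key (assumed format)."""
    wanted = "continuous" if use_continuous else "discrete"  # assumed field values
    grouped = {}
    for fit in distns_dict.values():  # the int keys (grid indexes) are ignored
        if fit["type"] != wanted or fit["params"] is None:
            continue  # wrong kind of RV, or the fit failed
        key = f'{fit["context"]}_{fit["qtype"]}'
        grouped.setdefault(key, []).append(fit)
    return grouped


# Hypothetical fit results, reduced to the fields used above.
example_fits = {
    0: {"context": "A", "qtype": "M1", "rv": "norm_gen", "type": "continuous", "params": {"loc": 0.0, "scale": 1.0}},
    1: {"context": "A", "qtype": "M1", "rv": "poisson_gen", "type": "discrete", "params": {"mu": 2.0}},
    2: {"context": "A", "qtype": "M1", "rv": "gamma_gen", "type": "continuous", "params": None},
}
continuous_candidates = group_fits(example_fits, use_continuous=True)
discrete_candidates = group_fits(example_fits, use_continuous=False)
```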

metrics_as_scores.data.pregenerate.generate_empirical(dataset: Dataset, densities_dir: Path, clazz: Union[Empirical, KDE_approx], transform: DistTransform) None[source]

Generates a set of empirical (continuous) densities for a given density type (Empirical or KDE_approx) and data transform.

dataset: Dataset

Required for obtaining quantity types, contexts, and filtered data.

densities_dir: Path

The directory in which to store the generated densities. The name of the resulting file is derived from the density type and data transform used.

clazz: Union[Empirical, KDE_approx]

The type of density you wish to create.

transform: DistTransform

The chosen transformation for the data.

Returns:

This method does not return anything but only writes the result to disk.
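A minimal sketch of how one file per combination of density type and transform could be named. The naming scheme, the helper density_file, and the stand-in classes are hypothetical and only illustrate the idea; the actual file names used by the package may differ.

```python
from enum import Enum
from pathlib import Path


class TransformSketch(Enum):
    """Hypothetical stand-in for DistTransform."""
    NONE = 1
    EXAMPLE = 2  # placeholder member, not taken from the package


class EmpiricalSketch:
    """Hypothetical stand-in for a density type."""


def density_file(densities_dir, clazz, transform):
    """Build a file path from the density type and the transform (assumed scheme)."""
    return densities_dir / f"densities_{clazz.__name__}_{transform.name}.pickle"


target = density_file(Path("densities"), EmpiricalSketch, TransformSketch.NONE)
print(target.name)
```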

metrics_as_scores.data.pregenerate.generate_parametric(dataset: Dataset, densities_dir: Path, fits_dir: Path, clazz: Union[Parametric, Parametric_discrete], transform: DistTransform) None[source]

Generates a set of parametric densities for a given density type (Parametric or Parametric_discrete) and data transform.

dataset: Dataset

Required for obtaining quantity types, contexts, and filtered data.

densities_dir: Path

The directory in which to store the generated densities. The name of the resulting file is derived from the density type and data transform used.

clazz: Union[Parametric, Parametric_discrete]

The type of density you wish to create.

transform: DistTransform

The chosen transformation for the data.

Returns:

This method does not return anything but only writes the result to disk.

metrics_as_scores.data.pregenerate.generate_empirical_discrete(dataset: Dataset, densities_dir: Path, transform: DistTransform) None[source]

Generates discrete empirical densities for a given data transform. Only uses the type Empirical_discrete for this.

dataset: Dataset

Required for obtaining quantity types, contexts, and filtered data.

densities_dir: Path

The directory in which to store the generated densities. The name of the resulting file is derived from the density type and data transform used.

transform: DistTransform

The chosen transformation for the data.

Returns:

This method does not return anything but only writes the result to disk.

metrics_as_scores.data.pregenerate_distns module

This module contains a single function that is used in highly parallel scenarios for fitting continuous and discrete random variables to data.

metrics_as_scores.data.pregenerate_distns.generate_parametric_fits(ds: Dataset, num_jobs: int, fitter_type: type[metrics_as_scores.distribution.fitting.Fitter], dist_transform: DistTransform, selected_rvs_c: list[type[scipy.stats._distn_infrastructure.rv_continuous]], selected_rvs_d: list[type[scipy.stats._distn_infrastructure.rv_discrete]], data_dict: dict[str, nptyping.ndarray.NDArray], transform_values_dict: dict[str, float], data_discrete_dict: dict[str, nptyping.ndarray.NDArray], transform_values_discrete_dict: dict[str, float]) list[metrics_as_scores.data.pregenerate_fit.FitResult][source]

The thinking is this: To each data series we can always fit a continuous distribution, whether it’s discrete or continuous data. The same is not true the other way round, i.e., we must not fit a discrete distribution if the data is known to be continuous. Therefore, we do the following:

  • Regardless of the data, always attempt to fit a continuous RV

  • For all discrete data, also attempt to fit a discrete RV

That means that for discrete data, we will have two kinds of fitted RVs. Also, when fitting a continuous RV to discrete data, we will add jitter to the data.
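The rule above can be sketched as a small planning function. The helper plan_fits and the tuple layout are assumptions for illustration and are not part of the package.

```python
def plan_fits(is_discrete, continuous_rvs, discrete_rvs):
    """Return (rv_name, kind, needs_jitter) tuples for one data series."""
    # Continuous RVs are always attempted; discrete data gets jitter for them.
    plan = [(rv, "continuous", is_discrete) for rv in continuous_rvs]
    # Discrete RVs are only attempted if the data is known to be discrete.
    if is_discrete:
        plan += [(rv, "discrete", False) for rv in discrete_rvs]
    return plan


plan_for_real_data = plan_fits(False, ["norm_gen", "gamma_gen"], ["poisson_gen"])
plan_for_count_data = plan_fits(True, ["norm_gen", "gamma_gen"], ["poisson_gen"])
```

For real-valued data the plan contains only continuous RVs without jitter; for count data it contains the continuous RVs (with jitter) followed by the discrete ones.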

ds: Dataset

The data, needed for obtaining quantity types and contexts. Also passed forward to fit().

num_jobs: int

Degree of parallelization used.

fitter_type: type[Fitter]

The class for the fitter to use, either Fitter or FitterPymoo.

dist_transform: DistTransform

The transform for which to generate parametric fits. Later, we will save a single file per transform, containing all related fits.

selected_rvs_c: list[type[rv_continuous]]

Continuous RVs to attempt to fit.

selected_rvs_d: list[type[rv_discrete]]

Discrete RVs to attempt to fit.

data_dict: dict[str, NDArray[Shape["*"], Float]]

A dictionary where the key consists of the context and the quantity type. For each entry, it contains a 1-D array of data used for fitting.

transform_values_dict: dict[str, float]

Similar to data_dict, this dictionary contains the transformation value that was used to transform the data in the 1-D array.

data_discrete_dict: dict[str, NDArray[Shape["*"], Float]]

Like data_dict, but for discrete RVs fitted to discrete data.

transform_values_discrete_dict: dict[str, float]

Like transform_values_dict, but for the discrete data.

Returns:

A list of FitResult objects.

metrics_as_scores.data.pregenerate_fit module

This is an extra module that holds functions globally, such that we can exploit multiprocessing effortlessly. Here, the main fit() function is defined.

metrics_as_scores.data.pregenerate_fit.Continuous_RVs_dict: dict[str, type[scipy.stats._distn_infrastructure.rv_continuous]] = {'alpha_gen': <class 'scipy.stats._continuous_distns.alpha_gen'>, 'anglit_gen': <class 'scipy.stats._continuous_distns.anglit_gen'>, 'arcsine_gen': <class 'scipy.stats._continuous_distns.arcsine_gen'>, 'argus_gen': <class 'scipy.stats._continuous_distns.argus_gen'>, 'beta_gen': <class 'scipy.stats._continuous_distns.beta_gen'>, 'betaprime_gen': <class 'scipy.stats._continuous_distns.betaprime_gen'>, 'bradford_gen': <class 'scipy.stats._continuous_distns.bradford_gen'>, 'burr12_gen': <class 'scipy.stats._continuous_distns.burr12_gen'>, 'burr_gen': <class 'scipy.stats._continuous_distns.burr_gen'>, 'cauchy_gen': <class 'scipy.stats._continuous_distns.cauchy_gen'>, 'chi2_gen': <class 'scipy.stats._continuous_distns.chi2_gen'>, 'chi_gen': <class 'scipy.stats._continuous_distns.chi_gen'>, 'cosine_gen': <class 'scipy.stats._continuous_distns.cosine_gen'>, 'crystalball_gen': <class 'scipy.stats._continuous_distns.crystalball_gen'>, 'dgamma_gen': <class 'scipy.stats._continuous_distns.dgamma_gen'>, 'dweibull_gen': <class 'scipy.stats._continuous_distns.dweibull_gen'>, 'erlang_gen': <class 'scipy.stats._continuous_distns.erlang_gen'>, 'expon_gen': <class 'scipy.stats._continuous_distns.expon_gen'>, 'exponnorm_gen': <class 'scipy.stats._continuous_distns.exponnorm_gen'>, 'exponpow_gen': <class 'scipy.stats._continuous_distns.exponpow_gen'>, 'exponweib_gen': <class 'scipy.stats._continuous_distns.exponweib_gen'>, 'f_gen': <class 'scipy.stats._continuous_distns.f_gen'>, 'fatiguelife_gen': <class 'scipy.stats._continuous_distns.fatiguelife_gen'>, 'fisk_gen': <class 'scipy.stats._continuous_distns.fisk_gen'>, 'foldcauchy_gen': <class 'scipy.stats._continuous_distns.foldcauchy_gen'>, 'foldnorm_gen': <class 'scipy.stats._continuous_distns.foldnorm_gen'>, 'gamma_gen': <class 'scipy.stats._continuous_distns.gamma_gen'>, 'gausshyper_gen': <class 
'scipy.stats._continuous_distns.gausshyper_gen'>, 'genexpon_gen': <class 'scipy.stats._continuous_distns.genexpon_gen'>, 'genextreme_gen': <class 'scipy.stats._continuous_distns.genextreme_gen'>, 'gengamma_gen': <class 'scipy.stats._continuous_distns.gengamma_gen'>, 'genhalflogistic_gen': <class 'scipy.stats._continuous_distns.genhalflogistic_gen'>, 'genhyperbolic_gen': <class 'scipy.stats._continuous_distns.genhyperbolic_gen'>, 'geninvgauss_gen': <class 'scipy.stats._continuous_distns.geninvgauss_gen'>, 'genlogistic_gen': <class 'scipy.stats._continuous_distns.genlogistic_gen'>, 'gennorm_gen': <class 'scipy.stats._continuous_distns.gennorm_gen'>, 'genpareto_gen': <class 'scipy.stats._continuous_distns.genpareto_gen'>, 'gibrat_gen': <class 'scipy.stats._continuous_distns.gibrat_gen'>, 'gilbrat_gen': <class 'scipy.stats._continuous_distns.gilbrat_gen'>, 'gompertz_gen': <class 'scipy.stats._continuous_distns.gompertz_gen'>, 'gumbel_l_gen': <class 'scipy.stats._continuous_distns.gumbel_l_gen'>, 'gumbel_r_gen': <class 'scipy.stats._continuous_distns.gumbel_r_gen'>, 'halfcauchy_gen': <class 'scipy.stats._continuous_distns.halfcauchy_gen'>, 'halfgennorm_gen': <class 'scipy.stats._continuous_distns.halfgennorm_gen'>, 'halflogistic_gen': <class 'scipy.stats._continuous_distns.halflogistic_gen'>, 'halfnorm_gen': <class 'scipy.stats._continuous_distns.halfnorm_gen'>, 'hypsecant_gen': <class 'scipy.stats._continuous_distns.hypsecant_gen'>, 'invgamma_gen': <class 'scipy.stats._continuous_distns.invgamma_gen'>, 'invgauss_gen': <class 'scipy.stats._continuous_distns.invgauss_gen'>, 'invweibull_gen': <class 'scipy.stats._continuous_distns.invweibull_gen'>, 'johnsonsb_gen': <class 'scipy.stats._continuous_distns.johnsonsb_gen'>, 'johnsonsu_gen': <class 'scipy.stats._continuous_distns.johnsonsu_gen'>, 'kappa3_gen': <class 'scipy.stats._continuous_distns.kappa3_gen'>, 'kappa4_gen': <class 'scipy.stats._continuous_distns.kappa4_gen'>, 'ksone_gen': <class 
'scipy.stats._continuous_distns.ksone_gen'>, 'kstwo_gen': <class 'scipy.stats._continuous_distns.kstwo_gen'>, 'kstwobign_gen': <class 'scipy.stats._continuous_distns.kstwobign_gen'>, 'laplace_asymmetric_gen': <class 'scipy.stats._continuous_distns.laplace_asymmetric_gen'>, 'laplace_gen': <class 'scipy.stats._continuous_distns.laplace_gen'>, 'levy_gen': <class 'scipy.stats._continuous_distns.levy_gen'>, 'levy_l_gen': <class 'scipy.stats._continuous_distns.levy_l_gen'>, 'loggamma_gen': <class 'scipy.stats._continuous_distns.loggamma_gen'>, 'logistic_gen': <class 'scipy.stats._continuous_distns.logistic_gen'>, 'loglaplace_gen': <class 'scipy.stats._continuous_distns.loglaplace_gen'>, 'lognorm_gen': <class 'scipy.stats._continuous_distns.lognorm_gen'>, 'lomax_gen': <class 'scipy.stats._continuous_distns.lomax_gen'>, 'maxwell_gen': <class 'scipy.stats._continuous_distns.maxwell_gen'>, 'mielke_gen': <class 'scipy.stats._continuous_distns.mielke_gen'>, 'moyal_gen': <class 'scipy.stats._continuous_distns.moyal_gen'>, 'nakagami_gen': <class 'scipy.stats._continuous_distns.nakagami_gen'>, 'ncf_gen': <class 'scipy.stats._continuous_distns.ncf_gen'>, 'nct_gen': <class 'scipy.stats._continuous_distns.nct_gen'>, 'ncx2_gen': <class 'scipy.stats._continuous_distns.ncx2_gen'>, 'norm_gen': <class 'scipy.stats._continuous_distns.norm_gen'>, 'norminvgauss_gen': <class 'scipy.stats._continuous_distns.norminvgauss_gen'>, 'pareto_gen': <class 'scipy.stats._continuous_distns.pareto_gen'>, 'pearson3_gen': <class 'scipy.stats._continuous_distns.pearson3_gen'>, 'powerlaw_gen': <class 'scipy.stats._continuous_distns.powerlaw_gen'>, 'powerlognorm_gen': <class 'scipy.stats._continuous_distns.powerlognorm_gen'>, 'powernorm_gen': <class 'scipy.stats._continuous_distns.powernorm_gen'>, 'rayleigh_gen': <class 'scipy.stats._continuous_distns.rayleigh_gen'>, 'rdist_gen': <class 'scipy.stats._continuous_distns.rdist_gen'>, 'recipinvgauss_gen': <class 
'scipy.stats._continuous_distns.recipinvgauss_gen'>, 'reciprocal_gen': <class 'scipy.stats._continuous_distns.reciprocal_gen'>, 'rice_gen': <class 'scipy.stats._continuous_distns.rice_gen'>, 'semicircular_gen': <class 'scipy.stats._continuous_distns.semicircular_gen'>, 'skew_norm_gen': <class 'scipy.stats._continuous_distns.skew_norm_gen'>, 'skewcauchy_gen': <class 'scipy.stats._continuous_distns.skewcauchy_gen'>, 'studentized_range_gen': <class 'scipy.stats._continuous_distns.studentized_range_gen'>, 't_gen': <class 'scipy.stats._continuous_distns.t_gen'>, 'trapezoid_gen': <class 'scipy.stats._continuous_distns.trapezoid_gen'>, 'triang_gen': <class 'scipy.stats._continuous_distns.triang_gen'>, 'truncexpon_gen': <class 'scipy.stats._continuous_distns.truncexpon_gen'>, 'truncnorm_gen': <class 'scipy.stats._continuous_distns.truncnorm_gen'>, 'truncpareto_gen': <class 'scipy.stats._continuous_distns.truncpareto_gen'>, 'truncweibull_min_gen': <class 'scipy.stats._continuous_distns.truncweibull_min_gen'>, 'tukeylambda_gen': <class 'scipy.stats._continuous_distns.tukeylambda_gen'>, 'uniform_gen': <class 'scipy.stats._continuous_distns.uniform_gen'>, 'vonmises_gen': <class 'scipy.stats._continuous_distns.vonmises_gen'>, 'wald_gen': <class 'scipy.stats._continuous_distns.wald_gen'>, 'weibull_max_gen': <class 'scipy.stats._continuous_distns.weibull_max_gen'>, 'weibull_min_gen': <class 'scipy.stats._continuous_distns.weibull_min_gen'>, 'wrapcauchy_gen': <class 'scipy.stats._continuous_distns.wrapcauchy_gen'>}

Dictionary of continuous random variables that are supported by scipy.stats. Note this is a dictionary of types, rather than instances.

metrics_as_scores.data.pregenerate_fit.Discrete_RVs_dict: dict[str, type[scipy.stats._distn_infrastructure.rv_discrete]] = {'bernoulli_gen': <class 'scipy.stats._discrete_distns.bernoulli_gen'>, 'betabinom_gen': <class 'scipy.stats._discrete_distns.betabinom_gen'>, 'binom_gen': <class 'scipy.stats._discrete_distns.binom_gen'>, 'boltzmann_gen': <class 'scipy.stats._discrete_distns.boltzmann_gen'>, 'dlaplace_gen': <class 'scipy.stats._discrete_distns.dlaplace_gen'>, 'geom_gen': <class 'scipy.stats._discrete_distns.geom_gen'>, 'hypergeom_gen': <class 'scipy.stats._discrete_distns.hypergeom_gen'>, 'logser_gen': <class 'scipy.stats._discrete_distns.logser_gen'>, 'nbinom_gen': <class 'scipy.stats._discrete_distns.nbinom_gen'>, 'nchypergeom_fisher_gen': <class 'scipy.stats._discrete_distns.nchypergeom_fisher_gen'>, 'nchypergeom_wallenius_gen': <class 'scipy.stats._discrete_distns.nchypergeom_wallenius_gen'>, 'nhypergeom_gen': <class 'scipy.stats._discrete_distns.nhypergeom_gen'>, 'planck_gen': <class 'scipy.stats._discrete_distns.planck_gen'>, 'poisson_gen': <class 'scipy.stats._discrete_distns.poisson_gen'>, 'randint_gen': <class 'scipy.stats._discrete_distns.randint_gen'>, 'skellam_gen': <class 'scipy.stats._discrete_distns.skellam_gen'>, 'yulesimon_gen': <class 'scipy.stats._discrete_distns.yulesimon_gen'>, 'zipf_gen': <class 'scipy.stats._discrete_distns.zipf_gen'>, 'zipfian_gen': <class 'scipy.stats._discrete_distns.zipfian_gen'>}

Dictionary of discrete random variables that are supported by scipy.stats. Note this is a dictionary of types, rather than instances.

metrics_as_scores.data.pregenerate_fit.get_data_tuple(ds: Dataset, qtype: str, dist_transform: DistTransform, continuous_transform: bool = True) list[tuple[str, nptyping.ndarray.NDArray]][source]

This method is part of the workflow for computing parametric fits. For a specific type of quantity and transform, it creates datasets for all available contexts.

ds: Dataset

qtype: str

The type of quantity to get datasets for.

dist_transform: DistTransform

The chosen distribution transform.

continuous_transform: bool

Whether the transform is real-valued or must be converted to integer.

Return type:

list[tuple[str, NDArray[Shape["*"], Float]]]

Returns:

A list of tuples of two elements. The first element is a key that identifies the context, the quantity type, and whether the data was computed using unique values (see Dataset.transform()). The second element is the 1-D array of data.

class metrics_as_scores.data.pregenerate_fit.FitResult[source]

Bases: TypedDict

This class is derived from TypedDict and holds all properties related to a single fit result, that is, a single specific configuration that was fit to a 1-D array of data.

grid_idx: int
dist_transform: str
transform_value: Optional[float]
params: dict[str, Union[float, int]]
context: str
qtype: str
rv: str
type: str
stat_tests: StatisticalTestJson
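For orientation, a fit result with the fields listed above might be constructed as in the following sketch. The class FitResultSketch mirrors the listed fields, with a plain dict standing in for StatisticalTestJson; all values are made up for illustration.

```python
from typing import Optional, TypedDict, Union


class FitResultSketch(TypedDict):
    """Illustrative mirror of the fields listed for FitResult."""
    grid_idx: int
    dist_transform: str
    transform_value: Optional[float]
    params: dict[str, Union[float, int]]
    context: str
    qtype: str
    rv: str
    type: str
    stat_tests: dict  # placeholder for StatisticalTestJson


# All values below are hypothetical.
result: FitResultSketch = {
    "grid_idx": 0,
    "dist_transform": "NONE",
    "transform_value": None,
    "params": {"loc": 0.0, "scale": 1.0},
    "context": "A",
    "qtype": "M1",
    "rv": "norm_gen",
    "type": "continuous",
    "stat_tests": {},
}
```

Because a TypedDict is an ordinary dict at runtime, such results can be serialized and passed between processes without extra work.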
metrics_as_scores.data.pregenerate_fit.fit(ds: ~metrics_as_scores.distribution.distribution.Dataset, fitter_type: type[metrics_as_scores.distribution.fitting.Fitter], grid_idx: int, row, dist_transform: ~metrics_as_scores.distribution.distribution.DistTransform, the_data: ~nptyping.ndarray.NDArray[~nptyping.base_meta_classes.Shape[*], ~numpy.float64], the_data_unique: ~nptyping.ndarray.NDArray[~nptyping.base_meta_classes.Shape[*], ~numpy.float64], transform_value: ~typing.Optional[float], write_temporary_results: bool = False) FitResult[source]

This is the main stand-alone function that computes a single parametric fit to a single 1-D array of data. This function is used in Parallel contexts and, therefore, lives on module top level so it can be serialized.

ds: Dataset

The data, needed for obtaining quantity types and contexts. Also passed forward to fit().

fitter_type: type[Fitter]

The class for the fitter to use, either Fitter or FitterPymoo.

grid_idx: int

This is only used so it can be stored in the FitResult. This method itself does not have access to the grid.

dist_transform: DistTransform

The transform for which to generate parametric fits. Later, we will save a single file per transform, containing all related fits.

the_data: NDArray[Shape["*"], Float]

The 1-D data used for fitting the RV.

the_data_unique: NDArray[Shape["*"], Float]

1-D Array of data. In case of continuous data, it is the same as the_data. In case of discrete data, the data in this array contains a slight jitter as to make all data points unique. Using this data is relevant for conducting statistical goodness of fit tests.

Returns:

The FitResult. If the RV could not be fitted, then the parameters in the fit result will have a value of None. This is so that this method does not throw exceptions. In case of a failure, no statistical tests are computed either.
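The pattern of a top-level worker that reports failure through its result instead of raising can be sketched as follows. This is not the package's fit(): the function fit_normal_safely, the reduced result dictionary, and the use of the statistics module as a stand-in fitter are assumptions for illustration.

```python
import statistics


def fit_normal_safely(grid_idx, data):
    """Estimate a normal RV's parameters; never raise, report failure as None."""
    try:
        params = {
            "loc": statistics.fmean(data),
            "scale": statistics.stdev(data),  # raises for fewer than 2 points
        }
    except Exception:
        params = None  # failed fit: the caller inspects the result instead
    return {"grid_idx": grid_idx, "rv": "norm", "params": params}


series = [[1.0, 2.0, 3.0], [5.0]]  # the second series is too short to fit
results = [fit_normal_safely(i, d) for i, d in enumerate(series)]
```

Defining the worker at module level, rather than as a nested function or lambda, is what allows it to be pickled and dispatched to worker processes; a failed fit then shows up as one result with params set to None instead of aborting the whole batch.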

Module contents

This package contains functions that are needed for fitting large numbers of continuous and discrete random variables to your own data, as well as functions for pre-generating densities for your own datasets that are then used in the interactive web application.