module varspark.core

This module includes the core variant-spark API.

class varspark.core.FeatureSource(_jvm, _vs_api, _jsql, sql, _jfs)[source]
importance_analysis(**kwargs)[source]

Builds a random forest classifier.

Parameters:
  • label_source – The ingested label source.
  • n_trees (int) – The number of trees to build in the forest.
  • mtry_fraction (float) – The fraction of variables to try at each split.
  • oob (bool) – Whether to calculate the OOB (out-of-bag) error.
  • seed (int) – Random seed to use.
  • batch_size (int) – The number of trees to build in one batch.
  • var_ordinal_levels (int) – The number of levels for ordinal variables.
Returns: Importance analysis model.
Return type: ImportanceAnalysis
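
A minimal usage sketch; features is a FeatureSource (e.g. from VarsparkContext.import_vcf), labels comes from VarsparkContext.load_label, and the parameter values shown are illustrative assumptions, not documented defaults:

    # Build a random forest and the corresponding importance analysis model.
    ia = features.importance_analysis(
        label_source=labels,  # the ingested label source
        n_trees=500,          # number of trees in the forest
        mtry_fraction=0.1,    # fraction of variables tried at each split
        oob=True,             # also compute the out-of-bag error estimate
        seed=13,              # random seed for reproducibility
        batch_size=100,       # trees built per batch
    )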

class varspark.core.ImportanceAnalysis(_jia, sql)[source]

Model for random-forest-based importance analysis.

important_variables(**kwargs)[source]

Gets the top limit important variables as a list of (name, importance) tuples, where:
  • name (string) – variable name
  • importance (double) – Gini importance
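
For example, assuming the keyword argument is named limit as the description suggests:

    # Fetch the ten most important variables as (name, importance) tuples.
    for name, importance in ia.important_variables(limit=10):
        print(name, importance)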

oob_error()[source]

OOB (out-of-bag) error estimate for the model.

Return type: float
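
A one-line usage sketch (meaningful only when the model was built with oob=True):

    print('OOB error:', ia.oob_error())
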
variable_importance()[source]

Returns a DataFrame with the Gini importance of variables.

The DataFrame has two columns:
  • variable (string) – variable name
  • importance (double) – Gini importance
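
The result is a regular Spark DataFrame, so the usual DataFrame API applies; a sketch:

    # Show the ten variables with the highest Gini importance.
    df = ia.variable_importance()
    df.orderBy(df.importance.desc()).show(10)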

varspark.core.VariantsContext

alias of varspark.core.VarsparkContext

class varspark.core.VarsparkContext(ss, silent=False)[source]

The main entry point for VariantSpark functionality.

import_vcf(**kwargs)[source]

Imports features from a VCF file.
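
A sketch; the parameter name vcf_file_path is an assumption, not documented above:

    # Ingest a VCF file as a FeatureSource for downstream analysis.
    features = vc.import_vcf(vcf_file_path='data/input.vcf')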

load_label(**kwargs)[source]

Loads the label source file.

Parameters:
  • label_file_path – The file path for the label source file
  • col_name – The name of the column containing the labels
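
For example (the file path and column name are illustrative):

    # Load the response variable used as label_source in importance_analysis.
    labels = vc.load_label(label_file_path='data/labels.csv',
                           col_name='label')
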
classmethod spark_conf(conf=<pyspark.conf.SparkConf object>)[source]

Adds the necessary options to the Spark configuration. Note: in client mode these need to be set up using --jars or --driver-class-path.
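
A sketch of creating a context with this configuration:

    from pyspark.sql import SparkSession
    from varspark.core import VarsparkContext

    # Extend the default SparkConf with the VariantSpark jars, then
    # build a SparkSession and the VariantSpark context on top of it.
    conf = VarsparkContext.spark_conf()
    ss = SparkSession.builder.config(conf=conf).getOrCreate()
    vc = VarsparkContext(ss)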

stop()[source]

Shuts down the VariantsContext.
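
Putting it all together, a minimal end-to-end sketch (the file paths, label column name and forest parameters are illustrative assumptions):

    from pyspark.sql import SparkSession
    from varspark.core import VarsparkContext

    ss = SparkSession.builder \
        .config(conf=VarsparkContext.spark_conf()) \
        .getOrCreate()
    vc = VarsparkContext(ss)

    features = vc.import_vcf(vcf_file_path='data/input.vcf')
    labels = vc.load_label(label_file_path='data/labels.csv', col_name='label')

    ia = features.importance_analysis(label_source=labels, n_trees=500, seed=13)
    print(ia.important_variables(limit=10))
    print('OOB error:', ia.oob_error())

    vc.stop()  # shut down the VariantsContext when done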