module varspark.core

This module includes the core variant-spark API.

class varspark.core.FeatureSource(_jvm, _vs_api, _jsql, sql, _jfs)[source]
importance_analysis(**kwargs)[source]

Builds a random forest classifier.

Parameters:
  • label_source – The ingested label source.
  • n_trees (int) – The number of trees to build in the forest.
  • mtry_fraction (float) – The fraction of variables to try at each split.
  • oob (bool) – Whether to calculate the OOB (out-of-bag) error.
  • seed (int) – Random seed to use.
  • batch_size (int) – The number of trees to build in one batch.
  • var_ordinal_levels (int) – The number of levels for ordinal variables.
Returns: Importance analysis model.
Return type: ImportanceAnalysis
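
A minimal usage sketch; features is a FeatureSource (e.g. from VarsparkContext.import_vcf), labels comes from VarsparkContext.load_label, and the parameter values shown are illustrative assumptions, not documented defaults:

    # Build a random forest and the corresponding importance analysis model.
    ia = features.importance_analysis(
        label_source=labels,  # the ingested label source
        n_trees=500,          # number of trees in the forest
        mtry_fraction=0.1,    # fraction of variables tried at each split
        oob=True,             # also compute the out-of-bag error estimate
        seed=13,              # random seed for reproducibility
        batch_size=100,       # trees built per batch
    )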

class varspark.core.ImportanceAnalysis(_jia, sql)[source]

Model for random-forest-based importance analysis.

important_variables(**kwargs)[source]

Gets the top limit important variables as a list of (name, importance) tuples, where:
  • name (string) – variable name
  • importance (double) – Gini importance
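
For example, assuming the keyword argument is named limit as the description suggests:

    # Fetch the ten most important variables as (name, importance) tuples.
    for name, importance in ia.important_variables(limit=10):
        print(name, importance)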

oob_error()[source]

OOB (out-of-bag) error estimate for the model.

Return type: float
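
A one-line usage sketch (meaningful only when the model was built with oob=True):

    print('OOB error:', ia.oob_error())
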
variable_importance()[source]

Returns a DataFrame with the Gini importance of variables.

The DataFrame has two columns:
  • variable (string) – variable name
  • importance (double) – Gini importance
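
The result is a regular Spark DataFrame, so the usual DataFrame API applies; a sketch:

    # Show the ten variables with the highest Gini importance.
    df = ia.variable_importance()
    df.orderBy(df.importance.desc()).show(10)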

varspark.core.VariantsContext

alias of varspark.core.VarsparkContext

class varspark.core.VarsparkContext(ss, silent=False)[source]

The main entry point for VariantSpark functionality.

import_vcf(**kwargs)[source]

Imports features from a VCF file.
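
A sketch; the parameter name vcf_file_path is an assumption, not documented above:

    # Ingest a VCF file as a FeatureSource for downstream analysis.
    features = vc.import_vcf(vcf_file_path='data/input.vcf')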

load_label(**kwargs)[source]

Loads the label source file.

Parameters:
  • label_file_path – The file path for the label source file
  • col_name – The name of the column containing the labels
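
For example (the file path and column name are illustrative):

    # Load the response variable used as label_source in importance_analysis.
    labels = vc.load_label(label_file_path='data/labels.csv',
                           col_name='label')
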
classmethod spark_conf(conf=<pyspark.conf.SparkConf object>)[source]

Adds the necessary options to the Spark configuration. Note: in client mode these need to be set up using --jars or --driver-class-path.
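
A sketch of creating a context with this configuration:

    from pyspark.sql import SparkSession
    from varspark.core import VarsparkContext

    # Extend the default SparkConf with the VariantSpark jars, then
    # build a SparkSession and the VariantSpark context on top of it.
    conf = VarsparkContext.spark_conf()
    ss = SparkSession.builder.config(conf=conf).getOrCreate()
    vc = VarsparkContext(ss)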

stop()[source]

Shuts down the VariantsContext.
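
Putting it all together, a minimal end-to-end sketch (the file paths, label column name and forest parameters are illustrative assumptions):

    from pyspark.sql import SparkSession
    from varspark.core import VarsparkContext

    ss = SparkSession.builder \
        .config(conf=VarsparkContext.spark_conf()) \
        .getOrCreate()
    vc = VarsparkContext(ss)

    features = vc.import_vcf(vcf_file_path='data/input.vcf')
    labels = vc.load_label(label_file_path='data/labels.csv', col_name='label')

    ia = features.importance_analysis(label_source=labels, n_trees=500, seed=13)
    print(ia.important_variables(limit=10))
    print('OOB error:', ia.oob_error())

    vc.stop()  # shut down the VariantsContext when done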