package varspark.hail

This package provides variant-spark integration with Hail.

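from hail import *
import varspark.hail  # makes the variant-spark functions available on hail.VariantDataset
hc = HailContext(sc)  # sc is an existing SparkContext
vds = hc.import_vcf(...)
...
# Build a 1000-tree random forest for the response expression sa.pheno.label
via = vds.importance_analysis("sa.pheno.label", n_trees=1000)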

module varspark.hail.extend

Created on 7 Nov 2017

@author: szu004

class varspark.hail.extend.VariantsDatasetFunctions(*args, **kwargs)

Extension to hail.VariantDataset that adds variant-spark related functions.

importance_analysis(**kwargs)

Builds a random forest classifier for the response variable defined by y_expr.

Parameters:
  • y_expr (str) – Response expression. Must evaluate to Boolean or numeric with all values 0 or 1.
  • n_trees (int) – The number of trees to build in the forest.
  • mtry_fraction (float) – The fraction of variables to try at each split.
  • oob (bool) – Whether to calculate the OOB (out-of-bag) error.
  • seed (long) – Random seed to use.
  • batch_size (int) – The number of trees to build in one batch.
Returns: Importance analysis model.
Return type: ImportanceAnalysis
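
To make the roles of mtry_fraction and batch_size more concrete, here is a minimal sketch of how such parameters are commonly interpreted in random forest implementations. It is not taken from the variant-spark source: the helper name plan_forest, the rounding rule, and the variable count are all assumptions made for illustration.

```python
def plan_forest(n_variables, n_trees, mtry_fraction, batch_size):
    """Illustrative only: derive the per-split variable count and batch sizes.

    The rounding rule below is an assumption, not variant-spark's documented
    behaviour.
    """
    if not 0.0 < mtry_fraction <= 1.0:
        raise ValueError("mtry_fraction must be in (0, 1]")
    if n_trees < 1 or batch_size < 1:
        raise ValueError("n_trees and batch_size must be positive")

    # Number of variables tried at each split; never fewer than one.
    mtry = max(1, int(mtry_fraction * n_variables))

    # Trees are grouped into batches of at most batch_size trees.
    n_batches = -(-n_trees // batch_size)  # integer ceiling division
    batches = [min(batch_size, n_trees - i * batch_size) for i in range(n_batches)]
    return mtry, batches


# Hypothetical dataset with 10,000 variants.
mtry, batches = plan_forest(
    n_variables=10000, n_trees=1000, mtry_fraction=0.25, batch_size=300
)
print(mtry, batches)
```

A larger batch_size means fewer passes to build the forest, typically at the cost of more memory per pass, while a smaller mtry_fraction makes individual trees less correlated.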

pairwise_operation(**kwargs)

Computes a pairwise operation on encoded genotypes. Currently implemented operations include:

  • manhattan : the Manhattan distance
  • euclidean : the Euclidean distance
  • sharedAltAlleleCount: count of shared alternative alleles
  • anySharedAltAlleleCount: count of variants that share at least one alternative allele
Parameters: operation_name – name of the operation. One of manhattan, euclidean, sharedAltAlleleCount, anySharedAltAlleleCount
Returns: A symmetric no_of_samples x no_of_samples matrix with the result of the pairwise computation.
Return type: hail.KinshipMatrix
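
The four operations can be sketched in plain Python to show what each one measures. This is an illustration under stated assumptions, not the variant-spark implementation: it assumes genotypes are encoded as the number of alternative alleles (0, 1 or 2), that sharedAltAlleleCount sums the per-variant minimum of the two counts, and that anySharedAltAlleleCount counts variants where both samples carry at least one alternative allele. The function names and the sample data are hypothetical.

```python
import math


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shared_alt_allele_count(a, b):
    # Assumed definition: alternative alleles common to both samples.
    return sum(min(x, y) for x, y in zip(a, b))


def any_shared_alt_allele_count(a, b):
    # Assumed definition: variants where both samples have an alt allele.
    return sum(1 for x, y in zip(a, b) if x > 0 and y > 0)


OPERATIONS = {
    "manhattan": manhattan,
    "euclidean": euclidean,
    "sharedAltAlleleCount": shared_alt_allele_count,
    "anySharedAltAlleleCount": any_shared_alt_allele_count,
}


def pairwise(genotypes, operation_name):
    """Return a symmetric n_samples x n_samples matrix as nested lists."""
    op = OPERATIONS[operation_name]
    n = len(genotypes)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Compute the upper triangle once and mirror it.
            matrix[i][j] = matrix[j][i] = op(genotypes[i], genotypes[j])
    return matrix


# Three hypothetical samples, four variants each (alt allele counts).
samples = [
    [0, 1, 2, 0],
    [1, 1, 0, 0],
    [2, 0, 2, 1],
]
print(pairwise(samples, "manhattan"))
print(pairwise(samples, "sharedAltAlleleCount"))
```

Because each operation is symmetric in its two arguments, only the upper triangle needs to be computed, which is why the result is a symmetric matrix.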

module varspark.hail.rf

Created on 10 Nov 2017

@author: szu004

class varspark.hail.rf.ImportanceAnalysis(hc, _jia)

Model for random forest based importance analysis

important_variants(**kwargs)

Gets the top n most important loci.

Parameters: n_limit (int) – the maximum number of loci to return
Returns: A KeyTable with the variant in the first column and importance in the second.
Return type: hail.KeyTable
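
The selection that n_limit describes amounts to a top-n query over (variant, importance) pairs. The following sketch shows that idea on an in-memory dictionary rather than a hail.KeyTable; the helper top_n_variants, the variant identifiers, and the importance scores are all hypothetical.

```python
import heapq


def top_n_variants(importances, n_limit):
    """Return up to n_limit (variant, importance) pairs, most important first."""
    return heapq.nlargest(n_limit, importances.items(), key=lambda kv: kv[1])


# Made-up importance scores keyed by made-up variant identifiers.
importances = {
    "1:1000:A:T": 0.12,
    "1:2000:G:C": 0.45,
    "2:1500:T:G": 0.03,
    "3:9000:C:A": 0.30,
}
top = top_n_variants(importances, n_limit=2)
print(top)
```

As in the returned KeyTable, each row pairs a variant with its importance, ordered from most to least important.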

oob_error

OOB (Out of Bag) error estimate for the model

Return type: float
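
For readers unfamiliar with the out-of-bag estimate, here is a minimal sketch of the general technique: each sample is classified by majority vote over only those trees whose bootstrap sample did not include it, and the error is the fraction of such samples that are misclassified. This is a generic illustration, not variant-spark's implementation; the function compute_oob_error and the toy "trees" (represented as a set of in-bag indices plus a list of fixed predictions) are assumptions.

```python
from collections import Counter


def compute_oob_error(labels, trees):
    """Estimate the OOB error.

    labels: true class per sample.
    trees:  list of (in_bag_indices, predictions), where predictions[i] is
            the tree's predicted class for sample i.
    """
    errors = 0
    scored = 0
    for i, truth in enumerate(labels):
        # Only trees that did not train on sample i may vote on it.
        votes = Counter(
            predictions[i] for in_bag, predictions in trees if i not in in_bag
        )
        if not votes:
            continue  # sample was in-bag for every tree, so it cannot be scored
        scored += 1
        if votes.most_common(1)[0][0] != truth:
            errors += 1
    return errors / scored if scored else float("nan")


labels = [0, 1, 1, 0]
trees = [
    ({0, 1}, [0, 1, 1, 1]),  # out-of-bag for samples 2 and 3
    ({1, 2}, [0, 1, 0, 1]),  # out-of-bag for samples 0 and 3
    ({0, 3}, [1, 1, 1, 0]),  # out-of-bag for samples 1 and 2
]
error = compute_oob_error(labels, trees)
print(error)
```

In this toy data only sample 3 is misclassified by its out-of-bag trees, so the estimate is one error out of four scored samples.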