Running importance analysis with Python API

This is a VariantSpark example notebook.

One of the main applications of VariantSpark is the discovery of genomic variants correlated with a response variable (e.g. case vs. control) using random forest Gini importance.
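As a rough illustration of what Gini importance measures, the sketch below computes the drop in Gini impurity produced by a single split on toy data. It shows the general idea only and is not VariantSpark's implementation; the function names and the toy labels are made up for this example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Drop in Gini impurity when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Hypothetical toy labels: 1 = case, 0 = control
parent = [1, 1, 1, 0, 0, 0]

# A variant that separates cases from controls perfectly
perfect_split = impurity_decrease(parent, [1, 1, 1], [0, 0, 0])

# A variant that leaves both children as mixed as the parent
useless_split = impurity_decrease(parent, [1, 0], [1, 1, 0, 0])

print(perfect_split, useless_split)
```

In a random forest, decreases like these are accumulated over all splits that use a given variable, so variables that separate the classes well end up with higher importance.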

chr22_1000.vcf is a very small sample of the chromosome 22 VCF file from the 1000 Genomes Project.

chr22-labels.csv is a CSV file with sample response variables (labels). The labels directly represent the number of alternative alleles each sample carries at a specific genomic position. For example, column 22_16050408 holds labels derived from the variants at chromosome 22, position 16050408. We would therefore expect position 22:16050408 in the VCF file to be strongly correlated with the label 22_16050408.
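To make "number of alternative alleles" concrete, here is a minimal sketch that counts non-reference alleles in VCF-style genotype strings. It is not necessarily how chr22-labels.csv was produced; the helper name, the sample names and the genotypes are hypothetical.

```python
import re


def alt_allele_count(genotype):
    """Count non-reference alleles in a VCF GT string such as '0|1' or '1/1'.

    Missing alleles ('.') are ignored, which is an assumption of this sketch.
    """
    alleles = re.split(r"[|/]", genotype)
    return sum(1 for allele in alleles if allele not in ("0", "."))


# Hypothetical genotypes for four samples at a single position
genotypes = {
    "sample_a": "0|0",
    "sample_b": "0|1",
    "sample_c": "1|1",
    "sample_d": "1|0",
}

labels_by_sample = {
    sample: alt_allele_count(gt) for sample, gt in genotypes.items()
}
print(labels_by_sample)
```

Under this encoding a diploid sample gets a label of 0, 1 or 2, so the label is a direct function of the genotype at that position.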

Both data sets are located in the ../data directory.

This notebook demonstrates how to run importance analysis on these data with the VariantSpark Python API.

Step 1: Create a Spark session with the VariantSpark jar attached.

[1]:

import varspark as vs
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.jars', vs.find_jar()).getOrCreate()

Step 2: Create a VarsparkContext using the SparkSession object (here injected as spark):

[2]:

vc = vs.VarsparkContext(spark, silent=True)

Step 3: Load the features (features) and the labels (labels) from the data files.

[3]:

features = vc.import_vcf('../data/chr22_1000.vcf')
labels = vc.load_label('../data/chr22-labels.csv', '22_16050408')

Step 4: Run the importance analysis and retrieve the most important variables:

[4]:

ia = features.importance_analysis(labels, seed=13, n_trees=500, batch_size=20)
top_variables = ia.important_variables()

Step 5: Display the results.

[5]:

print("%s\t%s" % ('Variable', 'Importance'))
for var_and_imp in top_variables:
    print("%s\t%s" % var_and_imp)

Variable        Importance
22_16050408_T_C 0.000875899681306875
22_16050678_C_T 0.0008045828330856887
22_16053197_G_T 0.0006258776975143016
22_16051882_C_T 0.0005914004839298169
22_16051107_C_A 0.0005911526429890821
22_16051480_T_C 0.0005362221508961817
22_16052838_T_A 0.0004994650434540958
22_16052656_T_C 0.0004932212678113746
22_16053435_G_T 0.00046980813216784275
22_16054283_C_T 0.0004692021189492525
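The variable names in the output appear to follow a contig_position_ref_alt pattern. The sketch below parses them under that assumption (inferred from the output, not from the VariantSpark documentation) and checks whether the top-ranked variant sits at the position the label was derived from. The Variant type and parse_variable helper are hypothetical, and the rows are copied from the first three lines of the output above.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    contig: str
    position: int
    ref: str
    alt: str


def parse_variable(name):
    """Split a name like '22_16050408_T_C' into its parts.

    Assumes exactly four underscore-separated fields, so contig names that
    themselves contain underscores would need different handling.
    """
    contig, position, ref, alt = name.split("_")
    return Variant(contig, int(position), ref, alt)


# First three (variable, importance) rows from the output above
rows = [
    ("22_16050408_T_C", 0.000875899681306875),
    ("22_16050678_C_T", 0.0008045828330856887),
    ("22_16053197_G_T", 0.0006258776975143016),
]

# Position encoded in the label column name 22_16050408
label_position = 16050408

best_name, best_importance = max(rows, key=lambda row: row[1])
best_variant = parse_variable(best_name)
matches_label = best_variant.position == label_position
print(best_variant, matches_label)
```

Here the highest-ranked variable is at position 16050408, the same position the label 22_16050408 was derived from, which is the result anticipated in the description of the labels file.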

For more information on using VariantSpark and the Python API please visit the documentation.