Command Line Reference¶
variant-spark requires an existing spark 2.4+ installation (either a local one or a cluster one).
To run variant-spark use:
./variant-spark [(--spark|--local) <spark-options>* --] [<command>] <command-options>*
In order to obtain the list of the available commands use:
./variant-spark -h
In order to obtain help for a specific command (for example importance) use:
./variant-spark importance -h
You can use --spark
marker before the command to pass spark-submit options to variant-spark. The list of spark options needs to be terminated with --
, e.g:
./variant-spark –spark –master yarn-client –num-executors 32 – importance ….
Please, note that --spark
needs to be the first argument of variant-spark
You can also run variant-spark in the --local
mode. In this mode variant-spark will ignore any Hadoop or Spark configuration files and run in the local mode for both Hadoop and Spark.
In particular in this mode all file paths are interpreted as local file system paths. Also any parameters passed after –local and before – are ignored.
For example:
./variant-spark --local -- importance -if data/chr22_1000.vcf -ff data/chr22-labels.csv -fc 22_16051249 -v -rn 500 -rbs 20 -ro
Note:
The difference between running in --local
mode and in --spark
with local
master is that in the latter case Spark uses the hadoop filesystem configuration and the input files need to be copied to this filesystem (e.g. HDFS)
Also the output will be written to the location determined by the hadoop filesystem settings. In particular paths without schema e.g. ‘output.csv’ will be resolved with the hadoop default filesystem (usually HDFS)
To change this behavior you can set the default filesystem in the command line using spark.hadoop.fs.default.name option. For example to use local filesystem as the default use:
variant-spark --spark ... --conf "spark.hadoop.fs.default.name=file:///" ... -- importance ... -of output.csv
You can also use the full URI with the schema to address any filesystem for both input and output files e.g.:
variant-spark --spark ... --conf "spark.hadoop.fs.default.name=file:///" ... -- importance -if hdfs:///user/data/input.csv ... -of output.csv