Setting up snpArcher

Environment Setup

First, you will need to have Mamba installed. Follow the link and use the “Fresh Install (recommended)” directions.

Mamba is a faster version of conda. Conda is a package manager that makes it easy to create local environments with pre-configured versioning for your favorite packages.

Once Mamba is installed, create a conda environment with snakemake. These are the only two dependencies you need for the pipeline to work, the workflow will create mamba environments for each rule, and there is no need to install each package separately.

mamba create -c conda-forge -c bioconda -n snparcher "snakemake>=9" "python==3.11.4"
mamba activate snparcher

If you encounter issues, please see the Snakemake docs for detailed installation instructions.

Next, clone the snpArcher github repo to your machine:

git clone https://github.com/harvardinformatics/snpArcher.git
cd snpArcher

Creating a sample sheet

In order to determine what outputs to create, snpArcher requires sample sheet file. This comma-separated file contains the required sample metadata about your samples in order to run the workflow. At a minimum, the snpArcher pipeline requires that each sample have a unique sample name, a reference genome accession or a path to a fasta file, and an SRA accession, or path to two paired end fastq files.

Below are all of the accepted fields for a sample sheet:

Field	Description
BioSample	The name of the sample. This will be the sample name in the final VCF
LibraryName	LibraryID for sample, this can be the same or different than BioSample
Run	The SRR for the sample, if applicable. If not, must be some unique value. It is often the lane number if samples are sequenced on multiple lanes.
refGenome	Reference genome accession, if applicable. See note
refPath	Path to local reference genome, if applicable. See note
BioProject	If applicable. Otherwise any value is acceptable.
fq1	Optional if no SRR value in Run. Path to read 1 for sample
fq2	Optional if no SRR value in Run. Path to read 2 for sample
SampleType	Optional. Triggers postproccesing module. Accepted values are ‘include’ or ‘exclude’
lat	Optional. Decimal latitude for sample, required to generate map in QC dashboard.
long	Optional. Decimal longitude for sample, required to generate map in QC dashboard.

Note

refGenome is always required. refPath specifying the path to a reference fasta file is optional, but when specified, a name for the assembly (in refGenome) must also be included.

If you are using the same reference genome for all samples in your sample sheet, you can omit the refGenome and/or refPath column from the sample sheet and specify these fields in the config file. See config setup below for more details.

It is important to note that samples are proccessed together based on their refGenome metadata, so all BioSamples that share a reference genome will ultimately end up in the same final vcf file. If you are mapping multiple populations / species to a single reference genome, and want separate VCF files for each population / species, you will need to split your final vcf after the pipeline completes, or run multiple indpendent sample sheets in different results directories.

If your reads (and, optionally, your local reference genome) are stored in somewhere seperate of the workflow (e.g.: a scratch disk) then you can specify the path to your reads using the fq1 and fq2 fields, and the location of your reference genome fasta in the refPath field.

Using data from NCBI SRA

If you’d like to reanalyze an existing NCBI SRA BioProject, please follow these instructions to quickly create a sample sheet.

Go to the BioProject overview web page on the SRA.
In the subheading Project Data there is a table with the columns Resource Name and Number of Links. Click the link in the Number of Links column in the SRA Experiments row. You will be redirected to a search results page.
Near the top of the search results page, click the link Send results to Run Selector
On the Run Selector page, you can select/deselect samples you’d like to include/exclude in your sample sheet by using the checkboxes.
Once you are done selecting samples, click the Metadata button in the Download column in the table near the middle of the page.
This will download a a comma separated file called SraRunTable.txt.
Open the file in the editor of your choice, and add a column named refGenome. In this column, enter the reference genome accession you want to use for every row in the sheet.
Save the sample sheet, it is now ready to use.

Using local data

A python script workflow/snparcher_utils/write_samples.py is included to help write the sample sheet for you. In order to use this script, you must have organized all of your fastq files in to one directory. The script requies you provide a file with one sample per name that maps uniquely to a pair of fastq files in the afformentioned directory. The script also requires either a reference genome accession or path to reference fasta.

Note

This script cannot currently handle multiple sequencing runs per sample. Please see below for how to handle this case.

Usage details:

Argument	Description
`-s / --sample_list`	Path to a sample list. One sample per line
`-f / --fastq_dir`	Path to directory containing ALL fastq files. It is assumed that each fastq file will contain the sample name uniquely.
`-r / --ref`	Path to reference fasta. Mutually exclusive with -a
`-a / --acc`	NCBI accession of reference. Mutually exclusive with -r

Handling samples with more than one pair of reads

In order to specify samples that were sequenced multiple times in your sample sheet, you must:

Create a duplicate row for each unit of sequencing
Ensure the BioSample value is the same across all rows for the sample.
Give each row a unique Run value. This allows snpArcher to collect all read pairs for a BioSample. All runs for a sample will be mapped separately to the genome and subsequently merged.
Give each row a unique LibraryName value, if applicable. Used for marking duplicates, LibraryName should be the same in cases where the same library was sequenced multiple times. If a sample had multiple libraries prepared for it, then LibraryName should be unique for each library. See here for more info.

For example, consider we have 2 samples: A and B. Sample A was sequenced 3 times, 2 of which were derived from the same library prep, and the other was a unique library. Sample B was only sequenced once. Below is how the sample sheet would look in order to define these relationships. Note, only the relevant fields have been included.

BioSample	LibraryName	Run
sample_A	lib_A_1	1
sample_A	lib_A_1	2
sample_A	lib_A_2	3
sample_B	lib_B_1	4

Configuring snpArcher

Workflow variables such as tool settings and workflow options are set in config/config.yaml. Resource settings such as threads and memory are controlled per tool in the workflow-profiles/default/config.yaml.

Core configuration

The following options in config/config.yaml must be set before running snpArcher:

Option	Description	Type	Required	Default
`samples`	Path to CSV sample sheet.	`str`	`True`	`None`
`reference.name`	Reference genome name used for output filenames.	`str`	`True`	`None`
`reference.source`	Reference source (local path, URL, or accession).	`str`	`True`	`None`
`variant_calling.tool`	Variant caller implementation to run (`gatk`, `sentieon`, `bcftools`, `deepvariant`, or `parabricks`).	`str`	`True`	`gatk`
`variant_calling.sentieon.license`	Sentieon license value/path (used when tool is `sentieon`).	`str`	`False`	`""`
`intervals.enabled`	Use split-by-intervals calling workflow.	`bool`	`False`	`True`
`callable_sites.coverage.enabled`	Enable coverage-based callable region filtering.	`bool`	`False`	`True`
`callable_sites.generate_bed_file`	Generate `results/callable_sites/callable_sites.bed` from enabled callable-site sources.	`bool`	`False`	`True`

Other options

The following options can be adjusted based on your needs and your dataset.

Variant Calling Options

Option	Description	Type
`intervals.min_nmer`	Minimum span of Ns used to split reference for interval generation.	`int`
`intervals.num_gvcf_intervals`	Maximum number of GVCF intervals to create.	`int`
`intervals.db_scatter_factor`	Used to calculate the target number of DB intervals before complexity-aware post-processing (`num_db_intervals = scatter_factor * num_samples * num_gvcf_intervals`).	`float`
`intervals.db_max_intervals_per_shard`	Maximum interval records allowed in each final DB shard; set to `0` to disable this cap.	`int`
`intervals.db_max_contigs_per_shard`	Maximum unique contigs allowed in each final DB shard; set to `0` to disable this cap.	`int`
`intervals.min_contig_length`	Exclude interval records on contigs shorter than this length; `0` keeps all contigs. This changes called regions but does not filter the reference used for mapping.	`int`
`variant_calling.expected_coverage`	Coverage profile used to set caller tuning (`low` or `high`).	`str`
`variant_calling.ploidy`	Ploidy for variant calling step.	`int`
`variant_calling.gatk.het_prior`	Heterozygosity prior passed to GATK GenotypeGVCFs.	`float`
`variant_calling.gatk.concat_batch_size`	Max number of interval VCFs/gVCFs merged per staged concat job. Lower values reduce per-job file pressure; higher values reduce number of rounds.	`int`
`variant_calling.gatk.concat_max_rounds`	Safety limit for staged concat rounds before failing with a config error.	`int`
`variant_calling.bcftools.min_mapq`	Minimum mapping quality for bcftools mpileup.	`int`
`variant_calling.bcftools.min_baseq`	Minimum base quality for bcftools mpileup.	`int`
`variant_calling.bcftools.max_depth`	Maximum per-file depth for bcftools mpileup.	`int`
`variant_calling.deepvariant.model_type`	DeepVariant model type (`WGS`, `WES`, `PACBIO`, `ONT_R104`, `HYBRID_PACBIO_ILLUMINA`).	`str`
`variant_calling.deepvariant.num_shards`	Number of shards (threads) used by DeepVariant.	`int`
`variant_calling.parabricks.container_image`	Apptainer/Singularity image path for Parabricks (required when tool is `parabricks`).	`str`
`variant_calling.parabricks.num_gpus`	Number of GPUs requested for Parabricks haplotype calling.	`int`
`variant_calling.parabricks.num_cpu_threads`	CPU threads passed to Parabricks haplotype calling.	`int`
`variant_calling.parabricks.extra_args`	Extra CLI flags appended to `pbrun haplotypecaller`.	`str`

When variant_calling.tool is bcftools, deepvariant, or parabricks, sample rows with input_type: gvcf are rejected at config validation time.

intervals.enabled controls interval-split HaplotypeCaller for the GATK backend. Parabricks uses interval-split joint genotyping (GenomicsDBImport/GenotypeGVCFs) regardless of intervals.enabled. snpArcher automatically reserves native memory for GenomicsDBImport and applies GATK contig merging to DB shards with many whole-contig intervals.

variant_calling.long_contig_mode controls VCF/gVCF indexing for references with contigs longer than the Tabix/TBI coordinate limit. The default, auto, detects this from an existing reference .fai when available. Set it to true for first runs from URL/accession references with known long contigs. Long-contig mode is supported for gatk, bcftools, and deepvariant; generated archive gVCFs stay compressed and are indexed with CSI (.csi).

DeepVariant execution uses Snakemake’s native container: support with docker://google/deepvariant:1.10.0. When running with variant_calling.tool: "deepvariant", enable container execution with --use-apptainer or --software-deployment-method conda apptainer.

Parabricks execution expects NVIDIA GPUs and an Apptainer/Singularity image path in variant_calling.parabricks.container_image. Parabricks HaplotypeCaller also follows variant_calling.expected_coverage to set --min-pruning and --min-dangling-branch-length.

Callable Sites Options

Option	Description	Type
`callable_sites.generate_bed_file`	Generate a final callable-sites BED. If both callable sources are enabled, the final BED is the union of both inputs followed by sorting and merging.	`bool`
`callable_sites.mappability.min_score`	Remove regions with mappability score lower than this threshold.	`float`
`callable_sites.mappability.kmer`	Kmer size used to compute mappability.	`int`
`callable_sites.mappability.merge_distance`	Merge passing mappability regions within this many base pairs.	`int`
`callable_sites.coverage.merge_distance`	Merge passing coverage regions within this many base pairs.	`int`

GenMap indexing is selected automatically from the decompressed reference FASTA size. snpArcher uses the default divsufsort indexer for references up to 2 GiB, switches to -S 20 above 2 GiB to reduce RAM usage, and switches to -A skew above 5 GiB to favor lower-memory indexing on very large genomes.

Coverage Filtering Options

If callable_sites.coverage.enabled is set to True, then these options control coverage-based filtering:

Option	Description	Type
`callable_sites.coverage.fraction`	Minimum fraction of BAM-backed samples that must be callable for a site to be included in the coverage BED.	`float`
`callable_sites.coverage.min_coverage`	Minimum depth threshold passed to `clam loci`. Use `auto` to set it to `max(1, floor(global_mean_coverage / 2))`.	`float` or `auto`
`callable_sites.coverage.max_coverage`	Maximum depth threshold passed to `clam loci`. Use `auto` to set it to `ceil(global_mean_coverage * 2)`.	`float` or `auto`
`callable_sites.coverage.merge_distance`	Merge passing coverage regions within this many base pairs.	`int`

When any samples use input_type: gvcf, snpArcher can still compute coverage statistics for BAM-backed samples, but it will not create results/callable_sites/coverage.bed. This avoids producing a coverage BED from only the non-gVCF subset of a mixed cohort. If mappability is enabled, the final callable-sites BED uses mappability only; otherwise snpArcher warns and skips final BED generation.

If callable_sites.generate_bed_file is true and no callable-sites BED sources are enabled for the cohort, snpArcher warns and skips final BED generation.

If postprocess is enabled but no final callable-sites BED will be produced, snpArcher warns and disables the postprocess module for that run.

Module Options

Please refer to the modules page for each module’s options.

Resources

Compute resources (threads, memory, etc.) as well as Snakemake arguments are set by the workflow profile located in workflow-profiles/default/config.yaml. This profile is used by default when running Snakemake. To specify a different profile, use the --workflow-profile option in your Snakemake command.

In the profile you can set resources to be applied to all rules via the default-resources key. You can override this default per-rule by editing or adding entries under the set-threads and/or set-resources key.

To configure temporary file location for the entire workflow, set default-resources.tmpdir in the workflow profile.

Note

We recommend you use --workflow-profile to set resources for the workflow run. To set Snakemake options such as executor, retries, etc, use --profile. Read here for more info.

Threads

The profile controls how many threads (or CPU cores) a rule can use via the set-threads key. We have provided reasonable defaults, though you may need to adjust depending on the resources available on your system/cluster.

Note

Many rules can only use 1 thread, and providing more threads will not decrease runtime or improve performance. Please refer to the profiles/default/config.yaml for details.

Memory and other resources

The profile controls how much memory and what other resources a rule can use via the default-resources and set-resources keys. snpArcher does not hard-code GATK memory in the rules; the workflow profile is the source of truth for scheduler memory requests and Java heap sizes. When executing snpArcher on a cluster or the cloud, specifying memory is important as these environments will typically kill jobs that use more memory than they requested.

Other resources, such as slurm_partition, runtime, etc. can also be set here if they are required by your cluster. We have provided a SLURM profile profiles/slurm that has the most common SLURM resources.

Note

Snakemake allows you to dynamically assign resources. We use the attempt keyword to specify memory. For example, attempt * 16000 will provide 16 GB on the first attempt of the rule, and 32 GB on the second attempt if the job is retried. This behavior requires the -T/--retries Snakemake option.

For GATK and Picard rules, set both mem_mb and mem_mb_reduced. mem_mb is the scheduler request. mem_mb_reduced is passed directly to Java -Xmx, so it should be an integer number of MB and lower than mem_mb to leave room for native memory and process overhead.

set-resources:
  gatk_genomics_db_import:
    mem_mb: attempt * 32000
    mem_mb_reduced: attempt * 24000
    runtime: 240