# Examples
On this page you will find an example project scenario and how to set up and run it using snpArcher.

In this example, we have 10 resequenced individuals we would like to generate variant calls for. We will cover creating the sample sheet, selecting config options, and running the workflow.
## Directory structure
First, let's set up our directories as suggested in our [executing](./executing.md#optional-directory-setup) instructions. Let's assume we are working in a directory called `workdir/`, and the snpArcher repository has already been cloned there. We have also already created the `snparcher` conda env as instructed in the [setup docs](./setup.md#environment-setup).

1. Let's create a directory to organize this project and future ones, and call it `projects`. Then, create a new directory inside it for this project, which we'll call `secretarybird_reseq`.
```
.
├── projects
│   └── secretarybird_reseq
└── snpArcher
    └── ...
```
```{note}
Not all files and directories are shown, only relevant ones. 
```
2. Copy the snpArcher config directory `snpArcher/config` to `projects/secretarybird_reseq`:
```
.
├── projects
│   └── secretarybird_reseq
│       └── config
│           └── config.yaml
└── snpArcher
    └── ...
```

3. Assume we already have all our sequence data and reference genome on our system, stored in a different location `/storage/data`. We do not need to move the raw data to our project directory. 
```{note}
We'll cover the cases of using SRA data and RefSeq genomes later on in this example.
```
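
The directory steps above can be sketched as a few shell commands. This is a minimal sketch that assumes you run it from `workdir/` with snpArcher already cloned there; the guard and its message are only illustrative and not part of snpArcher.

```shell
# Run from workdir/, next to the cloned snpArcher repository.
# Create the projects directory and this project's directory in one step.
mkdir -p projects/secretarybird_reseq

# Copy snpArcher's config directory into the project (step 2).
if [ -d snpArcher/config ]; then
  cp -r snpArcher/config projects/secretarybird_reseq/
else
  echo "snpArcher/config not found; run this from workdir/" >&2
fi
```

The raw data in `/storage/data` is left where it is, as described in step 3.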
## Sample sheet setup
Now we need to set up our sample sheet to inform snpArcher of our samples and their metadata. You can use any editor to create the sheet, as long as it is a CSV file. We will save the sample sheet in our project's config directory: `projects/secretarybird_reseq/config/samples.csv`. Below is the final sample sheet that we will use going forward, with explanations of each column following.

For a more comprehensive explanation of the sample sheet, see [creating a sample sheet](./setup.md#creating-a-sample-sheet).


### Final sample sheet
```
BioSample,LibraryName,Run,fq1,fq2,lat,long
bird_1,bird_1_lib,1,/storage/data/bird_1_R1.fq.gz,/storage/data/bird_1_R2.fq.gz,-8.758119,-36.280061
bird_2,bird_2_lib,2,/storage/data/bird_2_R1.fq.gz,/storage/data/bird_2_R2.fq.gz,-72.336165,35.751903
bird_3,bird_3_lib,3,/storage/data/bird_3_R1.fq.gz,/storage/data/bird_3_R2.fq.gz,-11.874137,-5.382251
bird_4,bird_4_lib,4,/storage/data/bird_4_R1.fq.gz,/storage/data/bird_4_R2.fq.gz,-73.235723,-145.261219
bird_5,bird_5_lib,5,/storage/data/bird_5_R1.fq.gz,/storage/data/bird_5_R2.fq.gz,88.08701,-52.658705
bird_6,bird_6_lib,6,/storage/data/bird_6_R1.fq.gz,/storage/data/bird_6_R2.fq.gz,69.640536,-12.971862
bird_7,bird_7_lib,7,/storage/data/bird_7_R1.fq.gz,/storage/data/bird_7_R2.fq.gz,18.608941,-100.485774
bird_8,bird_8_lib,8,/storage/data/bird_8_R1.fq.gz,/storage/data/bird_8_R2.fq.gz,-36.570632,-102.38721
bird_9,bird_9_lib,9,/storage/data/bird_9_R1.fq.gz,/storage/data/bird_9_R2.fq.gz,-88.592265,157.406505
bird_10,bird_10_lib,10,/storage/data/bird_10_R1.fq.gz,/storage/data/bird_10_R2.fq.gz,40.106437,-58.649016
```
### Description of Columns
1. **BioSample**: This is the name for the sample.
2. **LibraryName**: Identifier for the sample's sequencing library. This is especially important if you have samples that were sequenced multiple times across multiple lanes, which is not the case in this example. See [here](./setup.md#handling-samples-with-more-than-one-pair-of-reads) for more details.
3. **Run**: If we were using reads from the SRA, this is where the sample's SRR accession would go. However, since we have local data, this just has to be a unique value.
4. **fq1**: Path to the first read of the pair. Absolute paths are recommended. If we were using SRA data, this column would be omitted.
5. **fq2**: Path to the second read of the pair. The same notes as for fq1 apply.
6. **lat**: Decimal latitude for the sample, used to generate the map in the QC module output.
7. **long**: Decimal longitude for the sample, used to generate the map in the QC module output.

```{note}
If your project has multiple genomes, you can add the refPath and refGenome columns.
```
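
Before moving on, it can be worth sanity-checking the sheet. The function below is a rough sketch, not part of snpArcher: `check_sample_sheet` is a hypothetical helper that assumes the column order shown above (Run in column 3, fq1 and fq2 in columns 4 and 5) and no commas inside field values. It reports duplicated `Run` values and read files that do not exist.

```shell
# Hypothetical helper: print problems found in a local-data sample sheet.
check_sample_sheet() {
  sheet="$1"

  # Run values must be unique; list any value that appears more than once.
  # tail -n +2 skips the header row.
  tail -n +2 "$sheet" | cut -d, -f3 | sort | uniq -d | sed 's/^/duplicate Run: /'

  # Every fq1/fq2 path should point to an existing file.
  tail -n +2 "$sheet" | cut -d, -f4,5 | tr ',' '\n' |
    while read -r fq || [ -n "$fq" ]; do
      [ -e "$fq" ] || echo "missing file: $fq"
    done
}
```

From `workdir/`, we could call it as `check_sample_sheet projects/secretarybird_reseq/config/samples.csv`. No output means neither problem was found.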

## Config file setup
Now that we've created our sample sheet, we need to edit the config file we copied earlier: `projects/secretarybird_reseq/config/config.yaml`. This file sets the main options that control snpArcher's outputs. Refer to the [setup section](./setup.md#configuring-snparcher) for more details.

In our example we are using all of the default options. This will configure snpArcher to perform variant calling using GATK with the scatter-by-intervals approach. Also, we have set our reference genome name and path since we want to use the same genome for all samples in our sample sheet.

```
samples: "config/samples.csv"

reference:
  name: "bird_genome"
  source: "/storage/data/bird.fa.gz"

variant_calling:
  tool: "gatk" # "gatk", "sentieon", "bcftools", "deepvariant", or "parabricks"
  expected_coverage: "low"
  ploidy: 2
  gatk:
    het_prior: 0.005
  sentieon:
    license: ""
  bcftools:
    min_mapq: 20
    min_baseq: 20
    max_depth: 250
  deepvariant:
    model_type: "WGS"
    num_shards: 8
  parabricks:
    container_image: "/path/to/parabricks.sif" # required when tool == "parabricks"
    num_gpus: 1
    num_cpu_threads: 16
    extra_args: ""
```

## Profile setup
Snakemake uses profile YAML files to specify commonly used command line arguments, so you don't have to remember all of the arguments you need. Read more about profiles [here](https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles). To specify a profile, you can use the `--workflow-profile` option when running Snakemake.

First, let's copy snpArcher's profiles into our project directory so we can edit them for this project:

```
cp -r snpArcher/workflow-profiles projects/secretarybird_reseq
```

The profile also enables you to specify the compute resources any of snpArcher's rules can use. This is done via the YAML keys `default-resources`, `set-resources`, and `set-threads`. `default-resources` applies to all rules, while `set-resources` applies to individual rules and overrides the default. `set-threads` sets the thread count for individual rules; there is no way to set a default thread value.

First, we will specify how many threads each rule can use. This works the same way in both the default and SLURM profiles. Both profiles come with reasonable default thread values, but you may need to adjust them based on your system or cluster.

Let's say we wanted the alignment step (bwa mem) to use more threads:
```
# ...
set-threads:
  bwa_mem: 16 # Changed from 8 to 16.
# ...
```
Next, we will specify memory and other resources. This step only applies if you are running on a SLURM cluster.

In our example cluster, we have two compute partitions, "short" and "long". We want to put long-running jobs on the "long" partition, and the rest on "short". Additionally, the "short" partition has a time limit of 1 hour and "long" has a time limit of 10 hours, so we will specify that.

First, let's specify the default resources:
```
default-resources:
  mem_mb: attempt * 16000
  mem_mb_reduced: attempt * 14400 # Java -Xmx for GATK/Picard rules.
  slurm_partition: "short" # This line was changed
  slurm_account: # Same as sbatch -A. Not all clusters use this.
  runtime: 60 # In minutes 
```
Then, let's modify the specific resources for the intervalized GATK HaplotypeCaller step:
```
set-resources:
  # ... other rules
  gatk_haplotypecaller_interval:
    mem_mb: attempt * 16000
    mem_mb_reduced: attempt * 14400
    slurm_partition: "long" # This line was changed
    runtime: 600 # This line was changed
```
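
Both resource blocks above scale memory with `attempt`, which Snakemake sets to the number of the current try for a job, starting at 1. The sketch below only reproduces that arithmetic so the requested values are easy to see; `print_requested_memory` is a hypothetical helper, and a job is retried only if retries are enabled (for example with Snakemake's `--retries` option).

```shell
# Hypothetical helper: show the memory requested on each attempt,
# using the same expressions as the profile (attempt * 16000, attempt * 14400).
print_requested_memory() {
  attempt=1
  while [ "$attempt" -le "$1" ]; do
    echo "attempt $attempt: mem_mb=$((attempt * 16000)) mem_mb_reduced=$((attempt * 14400))"
    attempt=$((attempt + 1))
  done
}

print_requested_memory 3
# attempt 1: mem_mb=16000 mem_mb_reduced=14400
# attempt 2: mem_mb=32000 mem_mb_reduced=28800
# attempt 3: mem_mb=48000 mem_mb_reduced=43200
```

In these values, `mem_mb_reduced` stays at 90% of `mem_mb` on every attempt.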

## Running the workflow
We are now ready to run the workflow! From our working directory we can run the command:
```
snakemake -s snpArcher/workflow/Snakefile -d projects/secretarybird_reseq --workflow-profile projects/secretarybird_reseq/workflow-profiles/default
```
This instructs Snakemake to use snpArcher's workflow file and to run in the project directory we set up, using the config and sample sheet we created there.

If we were on a SLURM cluster, we would add `--executor slurm` to our command:
```
snakemake --executor slurm -s snpArcher/workflow/Snakefile -d projects/secretarybird_reseq --workflow-profile projects/secretarybird_reseq/workflow-profiles/default
```
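
Before launching a full run, it can help to preview what Snakemake plans to do. The sketch below adds Snakemake's `-n` (`--dry-run`) flag to the same command, which lists the jobs without executing them. The surrounding guard and its message are only illustrative; they assume we are in `workdir/` with the `snparcher` conda env active.

```shell
# Preview the planned jobs without running anything.
if command -v snakemake >/dev/null 2>&1 && [ -f snpArcher/workflow/Snakefile ]; then
  snakemake -n -s snpArcher/workflow/Snakefile -d projects/secretarybird_reseq --workflow-profile projects/secretarybird_reseq/workflow-profiles/default
else
  echo "snakemake or the Snakefile was not found; run this from workdir/ with the snparcher env active" >&2
fi
```

If the dry run reports missing input files, the paths in the sample sheet or config file are the first place to look.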