v1 to v2 Migration Guide

This guide maps legacy snpArcher v1-style inputs to the v2 contracts used by the current workflow.

Sample sheet migration

v2 uses a minimal manifest:

  • required: sample_id, input_type, input

  • optional: library_id, mark_duplicates

Column mapping

v1 column

v2 column

Notes

BioSample

sample_id

Must be alphanumeric plus ._- only.

Run (SRR/ERR/DRR)

input

Use with input_type: srr.

fq1,fq2

input

Join as fq1;fq2 and set input_type: fastq.

LibraryName

library_id

Optional; defaults to sample_id.

mark_duplicates (global or per-row)

mark_duplicates

Optional; defaults to true.

refGenome,refPath

config reference

v2 keeps reference in config, not sample sheet.

Example: v1 fastq row -> v2 row

v1:

BioSample,LibraryName,Run,fq1,fq2
bird_1,bird_1_lib,1,/data/bird_1_R1.fastq.gz,/data/bird_1_R2.fastq.gz

v2:

sample_id,input_type,input,library_id,mark_duplicates
bird_1,fastq,/data/bird_1_R1.fastq.gz;/data/bird_1_R2.fastq.gz,bird_1_lib,true

Example: v1 SRA row -> v2 row

v1:

BioSample,Run
bird_2,SRR12345678

v2:

sample_id,input_type,input
bird_2,srr,SRR12345678

Config migration

v2 uses nested config keys. Core keys are:

  • samples

  • reference.name

  • reference.source

  • variant_calling.*

  • intervals.*

  • callable_sites.*

  • modules.*

Common legacy -> v2 config mapping

v1 key

v2 key

Notes

refGenome

reference.name

Required in v2.

refPath

reference.source

Can be local path, URL, or accession.

sentieon

variant_calling.tool

Set to sentieon (else gatk, bcftools, deepvariant, or parabricks as needed).

sentieon_lic

variant_calling.sentieon.license

String path or license value.

minNmer

intervals.min_nmer

Integer.

cov_filter

callable_sites.coverage.enabled

Boolean.

mappability_k

callable_sites.mappability.kmer

Integer.

mappability_min

callable_sites.mappability.min_score

Number in [0,1].

mappability_merge

callable_sites.mappability.merge_distance

Integer.

Notes

  • v2 no longer expects per-sample reference columns in the manifest.

  • Optional metadata columns from v1 (for example lat, long, SampleType) are not part of the v2 core manifest contract.

  • If a sample appears in multiple rows, library_id identifies the library and may repeat across rows for the same library (for example, multiple lanes/runs); it only needs to differ when a sample truly has multiple distinct libraries.