Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
302 views
in Technique[技术] by (71.8m points)

bioinformatics - Snakemake: unknown output/input files after splitting by chromosome

To speed up a certain snakemake step I would like to:

  • split my bamfile per chromosome using
    bamtools split -in sample.bam --reference
    this results in files named as sample.REF_{chromosome}.bam
  • perform variant calling on each resulting in e.g. sample.REF_{chromosome}.vcf
  • recombine the obtained vcf files using vcf-concat (VCFtools) using
    vcf-concat file1.vcf file2.vcf file3.vcf > sample.vcf

The problem is that I don't know a priori which chromosomes may be in my bam file. So I cannot specify accurately the output of bamtools split. Furthermore, I'm not sure how to make the input of vcf-concat to take all vcf files.

I thought of using a samples.fofn and do something like

rule split_bam:
    input:
        bam = "alignment/{sample}.bam",
        pattern = "alignment/{sample}.REF_"
    output:
        alignment/anon.splitbams.fofn
    log:
        "logs/bamtools_split/{sample}.log"
    shell:
        "bamtools split -in {input.bam} -reference && 
         ls alignment/{input.pattern}*.bam | sed 's/.bam/.vcf/' > {output}"

And use the same fofn for concatenating the obtained vcf files. But this feels like a very awkward hack and I'd appreciate your suggestions.


EDIT 20180409

As suggested by @jeeyem I tried the dynamic() functions, but I can't figure it out.

My complete snakefile is on GitHub, the dynamic part is at lines 99-133.

The error I get is: InputFunctionException in line 44 of /home/wdecoster/DR34/SV-nanopore.smk: KeyError: 'anon___snakemake_dynamic' Wildcards: sample=anon___snakemake_dynamic (with anon an anonymized {sample} identifier)

Running with --debug-dag gives (last parts before erroring): candidate job cat_vcfs wildcards: sample=anon candidate job nanosv wildcards: sample=anon___snakemake_dynamic, chromosome=_ candidate job samtools_index wildcards: aligner=split_ngmlr, sample=anon___snakemake_dynamic.REF__ candidate job split_bam wildcards: sample=anon___snakemake_dynamic, chromosome=_ InputFunctionException in line 44 of /home/wdecoster/DR34/SV-nanopore.smk: KeyError: 'anon___snakemake_dynamic' Wildcards: sample=anon___snakemake_dynamic

Which shows that the wildcard is misinterpreted?


Cheers, Wouter

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can lookup the chromosome names from the bam header, or a corresponding .fai file for the used reference. This can be done at the beginning of your Snakefile. Then, you can use expand("alignment/{{sample}}.REF_{chromosome}.bam", chromosome=chromosomes) to define the output files of that rule. No need to use dynamic.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...