Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
388 views
in Technique[技术] by (71.8m points)

indexing - Snakemake: confusion on how to access config files properly

This question follows on from a question I asked previously and it regards understanding how to access config files correctly using Snakemake. I have a specific problem I need to address which I'll ask first and a general problem understanding how indexing works which I'll ask second.

I'm using snakemake to run and ATAC-seq pipeline from Alignment/QC through to motif analysis.

A: Specific Question

I'm trying to add a rule called trim_galore_pe to trim adapters from my fastq files before alignment and an error statement is thrown from snakemake as the names of the output files generated by trim galore do not match what is expected by snakemake. This is because I cannot work out how to write the output file statement correctly in my snakemake file to make the names match.

An example of the names generated by TRIM GALORE contain SRA numbers, for example:

trimmed_fastq_files/SRR2920475_1_val_1.fq.gz

Whereas the file expected by snakemake contain sample references and should read:

trimmed_fastq_files/Corces2016_4983.7B_Mono_1_val_1.fq.gz

This also affects the subsequent rules after the trim_galore_pe rule. I need to work out a way to use the info in my config file to generate the output files required.

For all rules after those shown in the Snakefile I need files to be named by sample name i.e. Corces2016_4983.7A_Mono. It would also be useful for all the FAST_QC and MULTIQC rules shown in the Snakefile below to have the sample names in the output file name structure, which they all already do in the current Snakefile.

However, inputs for Bowtie2, the FASTQC rules and input and output of the trim_galore_pe rules need to contain the SRA numbers. The problem starts at the output of trim_galore and influences all downstream rules.

Although I have extracted SRA numbers in previous rules, I'm not sure how to do this when not using the fastq_files folder which is explicitly stated in the config file. By introducing the trim_galore_pe rule I have effectively moved a new set of SRA files into the new trimmed_fastq_files folder. How to extract only the SRA number from the list of SRA files config file containing the old folder names whilst referencing the new trimmed_fastq_files folder in the Snakefile is the crux of my issue.

I hope this is clear.

Here is my config file:

samples:
    Corces2016_4983.7A_Mono: fastq_files/SRR2920475
    Corces2016_4983.7B_Mono: fastq_files/SRR2920476
cell_types:
    Mono:
    - Corces2016_4983.7A
index: /home/genomes_and_index_files/hg19

Here is my Snakefile:

# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])

rule all:
    input:
        expand("FastQC/PRETRIM/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
        expand("bam_files/{sample}.bam", sample=config["samples"]),
        "FastQC/PRETRIM/fastq_multiqc.html",
        "FastQC/POSTTRIM/fastq_multiqc.html"

rule fastqc_pretrim:
    input:
        sample=lambda wildcards: f"{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/PRETRIM/{sample}_{num}_fastqc.html",
        zip="FastQC/PRETRIM/{sample}_{num}_fastqc.zip"
    wrapper:
        "0.23.1/bio/fastqc"

rule multiqc_fastq_pretrim:
    input:
        expand("FastQC/PRETRIM/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
    output:
        "FastQC/PRETRIM/fastq_multiqc.html"
    wrapper:
        "0.23.1/bio/multiqc"

rule trim_galore_pe:
    input:
        sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
    output:
        "trimmed_fastq_files/{sample}_1_val_1.fq.gz",
        "trimmed_fastq_files/{sample}_1.fastq.gz_trimming_report.txt",
        "trimmed_fastq_files/{sample}_2_val_2.fq.gz",
        "trimmed_fastq_files/{sample}_2.fastq.gz_trimming_report.txt"
    params:
        extra="--illumina -q 20"
    log:
        "logs/trim_galore/{sample}.log"
    wrapper:
        "0.23.1/bio/trim_galore/pe"

rule fastqc_posttrim:
    input:
        "trimmed_fastq_files/{sample}_1_val_1.fq.gz", "trimmed_fastq_files/{sample}_2_val_2.fq.gz"
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/POSTTRIM/{sample}_{num}_fastqc.html",
        zip="FastQC/POSTTRIM/{sample}_{num}_fastqc.zip"
    wrapper:
        "0.23.1/bio/fastqc"

rule multiqc_fastq_posttrim:
    input:
        expand("FastQC/POSTTRIM/{sample}_{num}.trim_fastqc.html", sample=config["samples"], num=['1', '2'])
    output:
        "FastQC/POSTTRIM/fastq_multiqc.html"
    wrapper:
        "0.23.1/bio/multiqc"

rule bowtie2:
    input:
        "trimmed_fastq_files/{sample}_1_val_1.fq.gz", "trimmed_fastq_files/{sample}_2_val_2.fq.gz"
    output:
        "bam_files/{sample}.bam"
    log:
        "logs/bowtie2/{sample}.txt"
    params:
       index=config["index"],  # prefix of reference genome index (built with bowtie2-build),
       extra=""
    threads: 8
    wrapper:
        "0.23.1/bio/bowtie2/align"

This currently runs, and give a full job list using snakemake -np, but throws the error mentioned above.

B: General question

Is there an online resource that explains succinctly how to reference a config file using python, particularly with reference to snakemake? The online docs are pretty insufficient and assume prior knowledge of python.

My programming experience is mainly in bash and R but I like Snakemake and do generally understand how dictionaries and lists work in python and how to reference items stored within them. However I find the complex use of bracketing, wildcards and inverted commas in some of the Snakemake rules above confusing so tend to struggle when trying to reference different parts of file names in the config file. I want to understand fully how to utilise these elements.

For example, in a rule such as this from the Snakefile posted above:

sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2]) 

What is actually happening in this command? My understanding is that we are accessing the config file using config['samples'] and we are using the [wildcards.sample] part to explicitly access the fastq_files/SRR2920475 part of the config file. The expand allows us to iterate through each item in the config file that fit the parameters in the command, i.e all the SRA files, and the lambda wildcards is needed to use the wildcards call in the command. What I'm uncertain about is:

  1. What does the f do just after the expand and why is it needed?
  2. Why does config['samples'] contain inverted commas within the square brackets but but inverted commas are not needed around [wildcards.sample]?
  3. Why are the single and double curly brackets used?
  4. Looking at the Snakefile above, a few of the rules contain parts assigning a sequence of numbers to num, but again these numbers are sometimes enclosed around inverted commas and sometimes not...why?

Any advice, tips, pointers would be greatly appreciated.

C: Clarification on suggestions made below by @bli

I have edited my config file as you suggested in your comment and omitted the folder names leaving only the SRA numbers. This make sense to me, but I have a couple of other issues preventing me getting this Snakefile running.

New config file:

samples:
    Corces2016_4983.7A_Mono: SRR2920475
    Corces2016_4983.7B_Mono: SRR2920476
cell_types:
    Mono:
    - Corces2016_4983.7A
index: /home/c1477909/genomes_and_index_files/hg19

New Snakefile:

# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])

rule all:
    input:
        expand("FastQC/PRETRIM/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
        expand("bam_files/{sample}.bam", sample=config["samples"]),
        "FastQC/PRETRIM/fastq_multiqc.html",
        "FastQC/POSTTRIM/fastq_multiqc.html",

rule fastqc_pretrim:
    input:
      lambda wildcards: f"fastq_files/{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/PRETRIM/{sample}_{num}_fastqc.html",
        zip="FastQC/PRETRIM/{sample}_{num}_fastqc.zip"
    wrapper:
        "0.23.1/bio/fastqc"

rule multiqc_fastq_pretrim:
    input:
        expand("FastQC/PRETRIM/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
    output:
        "FastQC/PRETRIM/fastq_multiqc.html"
    wrapper:
        "0.23.1/bio/multiqc"

rule trim_galore_pe:
    input:
        lambda wildcards: expand(f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
    output:
        "trimmed_fastq_files/{wildcards.sample}_1_val_1.fq.gz",
        "trimmed_fastq_files/{wildcards.sample}_1.fastq.gz_trimming_report.txt",
        "trimmed_fastq_files/{wildcards.sample}_2_val_2.fq.gz",
        "trimmed_fastq_files/{wildcards.sample}_2.fastq.gz_trimming_report.txt"
    params:
        extra="--illumina -q 20"
    log:
        "logs/trim_galore/{sample}.log"
    wrapper:
        "0.23.1/bio/trim_galore/pe"

rule fastqc_posttrim:
    input:
        lambda wildcards: expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2])
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/POSTTRIM/{sample}_{num}_fastqc.html",
        zip="FastQC/POSTTRIM/{sample}_{num}_fastqc.zip"
    wrapper:
        "0.23.1/bio/fastqc"

rule multiqc_fastq_posttrim:
    input:
        expand("FastQC/POSTTRIM/{sample}_{num}.trim_fastqc.html", sample=config["samples"], num=['1', '2'])
    output:
        "FastQC/POSTTRIM/fastq_multiqc.html"
    wrapper:
        "0.23.1/bio/multiqc"

rule bowtie2:
    input:
        lambda wildcards: expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2])
    output:
        "bam_files/{sample}.bam"
    log:
        "logs/bowtie2/{sample}.txt"
    params:
        index=config["index"],  # prefix of reference genome index (built with bowtie2-build),
        extra=""
    threads: 8
    wrapper:
        "0.23.1/bio/bowtie2/align"

Using these new files everything worked fine initially, a partial job list was created by snakemake -np. However, this is because half of the complete job list had already been run; that is the trimmed_fastq_files folder was generated and the correctly named trimmed fastq files were in place within it. When I deleted all the previously created files to see if the entire new version of the Snakefile would work properly, snakemake -np failed, stating that there were missing input files for the rules downstream of the trim_galore_pe rule.

As you can see I


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I'll try to answer your question B, and give extra details that I hope can be useful for you and others.

Edit: I added some attempts at answering question C at the end.

Regarding quotes

First, regarding what you call "inverted" commas, they are usually called "single quotes", and they are used in python to build strings. Double quotes can also be used for the same purpose. The main difference is when you try to create strings that contain quotes. Using double quotes allows you to create strings containing single quotes, and vice-versa. Otherwise, you need to "escape" the quote using backslashes (""):

s1 = 'Contains "double quotes"'
s1_bis = "Contains "double quotes""
s2 = "Contains 'single quotes'"
s2_bis = 'Contains 'single quotes''

(I tend to prefer double quotes, that's just a personal taste.)

Decomposing the example

rule trim_galore_pe:
    input:
        sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])

You are assigning a function (lambda wildcards: ...) to a variable (sample), which happens to belong to the input section of a rule.

This will cause snakemake to use this function when it comes to determine the input of a particular instance of the rule, based on the current values of the wildcards (as inferred from the current value of the output it wants to generate).

For clarity, one could very likely rewrite this by separating the function definition from the rule declaration, without using the lambda construct, and it would work identically:

def determine_sample(wildcards):
    return expand(
        f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz",
        num=[1,2])

rule trim_galore_pe:
    input:
        sample = determine_sample

expand is a snakemake-specific function (but you can import it in any python program or interactive interpreter with from snakemake.io import expand), that makes it easier to generate lists of strings. In the following interactive python3.6 session we will try to reproduce what happens when you use it, using different native python constructs.

Accessing the configuration

# We'll try to see how `expand` works, we can import it from snakemake
from snakemake.io import expand
    ?
# We want to see how it works using the following example
# expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])

# To make the example work, we will first simulate the reading
# of a configuration file
import yaml

config_text = """
samples:
    Corces2016_4983.7A_Mono: fastq_files/SRR2920475
    Corces2016_4983.7B_Mono: fastq_files/SRR2920476
cell_types:
    Mono:
    - Corces2016_4983.7A
index: /home/genomes_and_index_files/hg19
"""
# Here we used triple quotes, to have a readable multi-line string.
?
# The following is equivalent to what snakemake does with the configuration file:
config = yaml.load(config_text)
config

Output:

{'cell_types': {'Mono': ['Corces2016_4983.7A']},
 'index': '/home/genomes_and_index_files/hg19',
 'samples': {'Corces2016_4983.7A_Mono': 'fastq_files/SRR2920475',
  'Corces2016_4983.7B_Mono': 'fastq_files/SRR2920476'}}

We obtained a dictionary in which the key "samples" is associated with a nested dictionary.

# We can access the nested dictionary as follows
config["samples"]
# Note that single quotes could be used instead of double quotes
# Python interactive interpreter uses single quotes when it displays strings

Output:

{'Corces2016_4983.7A_Mono': 'fastq_files/SRR2920475',
 'Corces2016_4983.7B_Mono': 'fastq_files/SRR2920476'}
# We can access the value corresponding to one of the keys
# again using square brackets
config["samples"]["Corces2016_4983.7A_Mono"]

Output:

'fastq_files/SRR2920475'
# Now, we will simulate a `wildcards` object that has a `sample` attribute
# We'll use a namedtuple for that
# https://docs.python.org/3/library/collections.html#collections.namedtuple
from collections import namedtuple
Wildcards = namedtuple("Wildcards", ["sample"])
wildcards = Wildcards(sample="Corces2016_4983.7A_Mono")
wildcards.sample

Output:

'Corces2016_4983.7A_Mono'

Edit (15/11/2018): I found out a better way of creating wildcards:

from snakemake.io import Wildcards
wildcards = Wildcards(fromdict={"sample": "Corces2016_4983.7A_Mono"})

# We can use this attribute as a key in the nested dictionary
# instead of using directly the string
config["samples"][wildcards.sample]
# No quotes here: `wildcards.sample` is a string variable

Output:

'fastq_files/SRR2920475'

Deconstructing the expand

# Now, the expand of the example works, and it results in a list with two strings
expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])    
# Note: here, single quotes are used for the string "sample",
# in order not to close the opening double quote of the whole string

Output:

['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']
# Internally, I think what happens is something similar to the following:
filename_template = f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz"

# This template is then used for each element of this "list comprehension"    
[filename_template.format(num=num) for num in [1, 2]]

Output:

['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']
# This is equivalent to building the list using a for loop:
filenames = []
for num in [1, 2]:
    filename = filename_template.format(num=num)
    filenames.append(filename)
filenames

Output:

['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']

String templates and formatting

# It is interesting to have a look at `filename_template`    
filename_template

Output:

'fastq_files/SRR2920475_{num}.fastq.gz'
# The part between curly braces can be substituted
# during a string formatting operation:
"fastq_files/SRR2920475_{num}.fastq.gz".format(num=1)

Output:

'fastq_files/SRR2920475_1.fastq.gz'

Now let's further show how string formatting can be used.

# In python 3.6 and above, one can create formatted strings    
# in which the values of variables are interpreted inside the string    
# if the string is prefixed with `f`.
# That's what happens when we create `filename_template`:
filename_template = f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz"    
filename_template    

Output:

'fastq_files/SRR2920475_{num}.fastq.gz'

Two substitutions happened during the formatting of the string:

  1. The value of config['samples'][wildcards.sample] was used to make the first part of the string. (Single quotes were used around sample because this python expression was inside a string built with double quotes.)

  2. The double brackets around num were reduced to single ones as part of the formatting operation. That's why we can then use this again in further formatting operations involving num.

# Equivalently, without using 3.6 syntax:    
filename_template = "{filename_prefix}_{{num}}.fastq.gz".format(
    filename_prefix = config["samples"][wildcards.sample])
filename_template

Output:

'fastq_files/SRR2920475_{num}.fastq.gz'
# We could achieve the same by first extracting the value
# from the `config` dictionary    
filename_prefix = config["samples"][wildcards.sample]
filename_template = f"{filename_prefix}_{{num}}.fastq.gz"
filename_template

Output:

'fastq_files/SRR2920475_{num}.fastq.gz'
# Or, equivalently:
filename_prefix = config["samples"][wildcards.sample]
filename_template = "{filename_prefix}_{{num}}.fastq.gz".format(
    filename_prefix=filename_prefix)
filename_template

Output:

'fastq_files/SRR2920475_{num}.fastq.gz'
# We can actually perform string formatting on several variables
# at the same time:
filename_prefix = config["samples"][wildcards.sample]
num = 1
"{filename_prefix}_{num}.fastq.gz".format(
    filename_prefix=filename_prefix, num=num)

Output:

'fastq_files/SRR2920475_1.fastq.gz'
# Or, using 3.6 formatted strings
filename_prefix = config["samples"][wildcards.sample]
num = 1
f"{filename_prefix}_{num}.fastq.gz"

Output:

'fastq_files/SRR2920475_1.fastq.gz'
# We could therefore build the result of the expand in a single step:
[f"{config['samples'][wildcards.sample]}_{num}.fastq.gz" for num in [1, 2]]

Output:

['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']

Comments about question C

The following is a bit complex, in terms of how Python will build the string:

input:
    lambda wildcards: expand(f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])

But it should work, as we can see in the following simulation:

from collections import namedtuple
from snakemake.io

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...