I have 4 different dataframes containing information on genes predicted with Augustus for 2 different species. For each species, I ran the prediction both with its own training parameters and with those of the other species (the training parameters of sp1 for sp2, and the training parameters of sp2 for sp1).
Here is an example of the naming syntax to make this clearer.
0035: Lepidoptera
0042: WASP
g1.t1_0035_0035 :
this gene was predicted with the database of species 0035 and its own training parameters.
g1.t1_0035_0042 :
this gene was predicted with the database of species 0035 and the training parameters of species 0042.
g1.t1_0042_0042 :
this gene was predicted with the database of species 0042 and its own training parameters.
g1.t1_0042_0035 :
this gene was predicted with the database of species 0042 and the training parameters of species 0035.
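To make the naming convention concrete, here is a minimal sketch of how such an ID could be split into its three parts. The function name split_gene_id is hypothetical, and the sketch assumes that the last two underscore-separated fields are always the database species and the training-parameter species.

```python
def split_gene_id(fasta_id):
    """Split 'g1.t1_0035_0042' into (gene, database species, training species)."""
    # Drop a leading '>' if the ID comes straight from a FASTA header line.
    fasta_id = fasta_id.lstrip(">").strip()
    # Split from the right so that any '_' inside the gene name is left intact.
    gene, database, training = fasta_id.rsplit("_", 2)
    return gene, database, training


gene, database, training = split_gene_id("g1.t1_0035_0042")
# gene is the name used in the CSV files; database and training give the file suffix.
```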
I now have 4 dataframes like this one:
gene_name scaf_name scaf_length cov_depth GC
g3.t1 scaffold 6 56786 79 0.39
g4.t1 scaffold 6 56786 79 0.39
g1.t1 scaffold 256 789765 86 0.42
g2.t1 scaffold 890 3456 85 0.40
g5.t1 scaffold 1234 590 90 0.41
As you can see, the gene names here do not carry the _number1_number2 suffix,
but each file corresponds to a specific situation. Here are the file names:
ggf_0042_0042.csv for all the genex_0042_0042
ggf_0042_0035.csv for all the genex_0042_0035
ggf_0035_0035.csv for all the genex_0035_0035
ggf_0035_0042.csv for all the genex_0035_0042
What I would like to do is simply parse a FASTA file, for example:
>g13600.t1_0042_0042
MERVINTQLLRYLEDHQLISDRQYGFR...
>g34744.t1_0042_0035
MSVPAHVAQIFEAIRRSGQQIDED...
>g28436.t1_0035_0042
WKKAKAENALDSYHHNHLMSEE...
>g14327.t1_0042_0042
MTYGAETWSLTVGLVRKLRVTQR...
>g30148.t1_0035_0042
MLRPVLSSKLPTNTKLRVYKTYIRSRLTY...
>g24481.t1_0035_0035
PCAGSNIKLKGTECFEKSFEVCLRNY...
and say:
if the gene name ends with the suffix _0035_0035, then go into the file ggf_0035_0035.csv,
grab the row corresponding to the same gene name, and add this row to a new dataframe.
Here is a hypothetical example of the output:
gene_name scaf_name scaf_length cov_depth GC
g345.t1_0035_0035 scaffold 567 56778 78 0.39
g23.t1_0042_0035 scaffold 43 434 79 0.43
g46.t1_0042_0042 scaffold 276 785660 87 0.41
g2.t1_0042_0035 scaffold 845 345656 87 0.40
and so on...
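For illustration, the lookup described above could be sketched with pandas roughly as follows. This is only a sketch under assumptions: the function names read_fasta_ids and collect_rows are hypothetical, each table is assumed to have a gene_name column holding the name without its suffix, and tables is assumed to be a dict mapping a suffix such as "0035_0042" to the dataframe loaded from the matching ggf_<suffix>.csv file.

```python
import pandas as pd


def read_fasta_ids(lines):
    """Return the header IDs (without '>') from an iterable of FASTA lines."""
    return [line[1:].split()[0] for line in lines if line.startswith(">")]


def collect_rows(fasta_ids, tables):
    """Fetch, for each FASTA ID, the matching row from the table of its suffix.

    tables: dict mapping a suffix like '0035_0042' to a DataFrame
    (assumption: one DataFrame per ggf_<suffix>.csv file).
    """
    rows = []
    for fasta_id in fasta_ids:
        # 'g1.t1_0035_0042' -> gene 'g1.t1', suffix '0035_0042'
        gene, database, training = fasta_id.rsplit("_", 2)
        table = tables[f"{database}_{training}"]
        match = table[table["gene_name"] == gene].copy()
        if match.empty:
            continue  # gene not present in that file; skipped in this sketch
        # Restore the full ID so the origin of each row stays visible.
        match["gene_name"] = fasta_id
        rows.append(match)
    if not rows:
        return pd.DataFrame()
    return pd.concat(rows, ignore_index=True)


# Loading the four files might look like this; the default comma separator
# is an assumption, so adjust sep= to the real file layout:
# suffixes = ["0035_0035", "0035_0042", "0042_0042", "0042_0035"]
# tables = {s: pd.read_csv(f"ggf_{s}.csv") for s in suffixes}
```

With the tables loaded, the IDs can be read from the open FASTA file with read_fasta_ids and passed to collect_rows, which returns one dataframe with the full gene names restored.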