SIGffRid experimental software
Many genomes have been
sequenced and annotated, but in most cases, even for E. coli, we
know very few of the regulatory motifs involved in transcription
initiation (the first step of gene expression). Here, we are interested in
the binding sites of the sigma subunit of RNA polymerase, which is involved in
recruiting the polymerase to DNA. Several characteristics make these motifs hard to
find with current general-purpose methods:
- they are composed of two words ('boxes') named '-35' and '-10'
according to their position relative to the transcription start,
- the two boxes can be separated by a spacer of variable length,
- they can be degenerate, with only two letters conserved (a fuzzy box is generally associated with an 'extended' one),
- even if we know their location (they lie upstream of genes), the high
number of comparisons required gives a signal-to-noise ratio too low
for effective detection.
The aim of SIGffRid is to determine useful criteria to target comparisons.
Each step is, and will be, designed according to biological knowledge and/or statistics.
- we search for compound motifs present in the upstream regions of A PAIR of ORTHOLOGS (1),
- we search ONLY for compound motifs WITH TWO BOXES OVER-REPRESENTED in
the WHOLE GENOME of the bacterium we are interested in (2), allowing low
spacer variability,
- we CROSS the results obtained on pairs of upstream sequences ACCORDING to GAPPED SEEDS (3),
- as this is not sufficient (cf. next step), we group sequence sets by
using the lower probability of obtaining a given letter neighbouring
shared seeds (3),
- we evaluate the compound 'consensus motif' given by a sequence set by
comparing the 'motif density' in upstream sequences and in the whole
genome. The more specific to upstream sequences a motif is, the better we
rate it.
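The last criterion can be illustrated with a minimal sketch. It assumes, since the text above does not define it, that 'motif density' means the number of motif occurrences per base, and that specificity is the ratio of the upstream density to the whole-genome density. The function names are hypothetical and only one strand is scanned, for brevity.

```python
import re


def count_hits(pattern, sequence):
    """Count possibly overlapping matches of a regex pattern in a sequence."""
    return sum(1 for _ in re.finditer(f"(?=(?:{pattern}))", sequence))


def density(pattern, sequences):
    """Occurrences per base over a collection of sequences (assumed definition)."""
    total_length = sum(len(s) for s in sequences)
    if total_length == 0:
        return 0.0
    return sum(count_hits(pattern, s) for s in sequences) / total_length


def specificity(pattern, upstream_sequences, genome):
    """Ratio of upstream density to whole-genome density (assumed score)."""
    upstream = density(pattern, upstream_sequences)
    whole = density(pattern, [genome])
    if whole == 0:
        return float("inf") if upstream > 0 else 0.0
    return upstream / whole
```

Under these assumptions, a ratio well above 1 indicates a motif concentrated in upstream sequences rather than spread over the genome.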
1: ORTHOLOGUES: definition and interest
Orthologues are genes from phylogenetically related species stemming from a common ancestral gene.
Biological hypotheses:
- supposed:
  - orthologues may have kept the same regulatory mechanisms, and therefore closely related binding sites for regulatory proteins
- currently accepted:
  - sequence conservation is related to biological function (provided the bacteria are not too closely related)
2: Compound motifs search / Boxes over-representation
|
|
|
We
consider as putative -35 and -10 boxes only words that are
over-represented in the whole genome of the bacterium we are interested
in.
To identify them, we run R'MES (Schbath
1997) (Gaussian approximation version, using the -fam option so as to take
into account the sense strand and its inverted complement). Given a
sequence, R'MES reports every statistically over-represented (and
under-represented) word, taking into account the probabilities of
the subwords it is composed of.
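To make concrete what counting a word together with its inverted complement involves, here is a small sketch of the counting step alone. It is not R'MES and does not reproduce its statistics: no expected count or significance score is computed, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(word):
    """Inverted complement of a lowercase DNA word."""
    return word.translate(_COMPLEMENT)[::-1]


def count_overlapping(word, sequence):
    """Number of (possibly overlapping) occurrences of word in sequence."""
    return sum(
        sequence.startswith(word, i)
        for i in range(len(sequence) - len(word) + 1)
    )


def both_strand_count(word, sequence):
    """Occurrences of a word or of its inverted complement on the given strand.

    A word equal to its own inverted complement is counted only once.
    """
    total = count_overlapping(word, sequence)
    inverted = reverse_complement(word)
    if inverted != word:
        total += count_overlapping(inverted, sequence)
    return total
```

Deciding whether such a count is unexpectedly high requires a background model of the sequence, which is the part handled by R'MES.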
So, we search for pairs of interesting words such that:
- they are conserved in the two upstream sequences of the orthologues,
- the spacer between them is nearly constant in the two sequences.
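The two conditions above can be sketched as a brute-force search. This is an illustrative simplification, not the SIGffRid implementation: it requires exact word conservation, and the tolerance `tol` on the spacer difference is a hypothetical parameter. The default spacer range of 14 to 20 is the one quoted in the results below for the sigma70 family.

```python
from itertools import product


def occurrences(word, sequence):
    """Start positions of every (possibly overlapping) occurrence of word."""
    return [
        i for i in range(len(sequence) - len(word) + 1)
        if sequence.startswith(word, i)
    ]


def spacers(box35, box10, sequence, min_sp, max_sp):
    """Spacer lengths at which box35 is followed by box10 in the sequence."""
    found = set()
    for i, j in product(occurrences(box35, sequence), occurrences(box10, sequence)):
        spacer = j - (i + len(box35))
        if min_sp <= spacer <= max_sp:
            found.add(spacer)
    return found


def conserved_pairs(words, seq_a, seq_b, min_sp=14, max_sp=20, tol=1):
    """Word pairs present in both sequences with nearly the same spacer."""
    result = []
    for box35, box10 in product(words, repeat=2):
        sp_a = spacers(box35, box10, seq_a, min_sp, max_sp)
        sp_b = spacers(box35, box10, seq_b, min_sp, max_sp)
        matches = sorted((a, b) for a in sp_a for b in sp_b if abs(a - b) <= tol)
        if matches:
            result.append((box35, box10, matches))
    return result
```

Here `words` stands for the set of over-represented words retained in the previous step, and `seq_a` and `seq_b` for the upstream sequences of a pair of orthologues.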
3: CROSS of results -obtained for each pair of sequences- according to GAPPED SEEDS
We applied the SIGffRid algorithm to two related bacteria, Streptomyces coelicolor A3(2) and Streptomyces avermitilis.
Pairs of intergenic upstream sequences linked by orthologous relationships were first grouped according to function families from the MBGD database (Uchiyama 2003, Uchiyama 2007) by using an option of SIGffRid.
We used the seeds {###, ####, ##**#, #**##, #*#*#, #***#, #**#, #*#} and a spacer range between the conserved candidate -35 and -10 boxes of 14 to 20 (the default range for sigma factors belonging to the sigma70 family).
We obtained:
- 113 motifs for S. coelicolor,
- 65 motifs for S. avermitilis.
The following file summarizes the most interesting results obtained by SIGffRid when run on S. coelicolor and S. avermitilis: main SIGffRid results.
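The seed notation used above is not defined on this page. The sketch below assumes the usual spaced-seed reading, in which '#' marks a position that must match and '*' a position that is ignored. It shows how two sequences could be compared under one seed; the function names are hypothetical and this is not the SIGffRid code.

```python
SEEDS = ["###", "####", "##**#", "#**##", "#*#*#", "#***#", "#**#", "#*#"]


def seed_key(window, seed):
    """Keep the letters under '#' and mask the letters under '*'."""
    return "".join(c if m == "#" else "*" for c, m in zip(window, seed))


def seed_index(sequence, seed):
    """Map each masked window of the sequence to its start positions."""
    index = {}
    for i in range(len(sequence) - len(seed) + 1):
        key = seed_key(sequence[i:i + len(seed)], seed)
        index.setdefault(key, []).append(i)
    return index


def shared_seed_hits(seq_a, seq_b, seed):
    """Masked words found in both sequences, with their positions in each."""
    index_a = seed_index(seq_a, seed)
    index_b = seed_index(seq_b, seed)
    return {k: (index_a[k], index_b[k]) for k in index_a.keys() & index_b.keys()}
```

For example, under the seed `##**#` the words ggaat and ggcat share the masked key `gg**t`, whereas under the contiguous seed `####` they share nothing.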
To validate our algorithm, we searched for motifs resembling the SigR binding site.
The results that are particularly interesting have a pink background.
Two motifs were very close to known SigR binding sites (ggaatn18gtt):
- ggaatn(16,19)gtt,
- gggaan(18,20)cgtt.
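The motif notation used on this page can be turned into a regular expression for scanning sequences. The sketch below assumes the reading suggested by the examples: 'n' is any nucleotide, 'n18' a run of exactly 18, and 'n(16,19)' a run of 16 to 19. The function `motif_to_regex` is a hypothetical helper, not part of SIGffRid.

```python
import re

# One token is a ranged spacer n(a,b), a fixed spacer nK, or a single letter.
_TOKEN = re.compile(r"n\((\d+),(\d+)\)|n(\d+)|([acgtn])")


def motif_to_regex(motif):
    """Translate a motif such as 'ggaatn(16,19)gtt' into a regex string."""
    parts = []
    position = 0
    for match in _TOKEN.finditer(motif):
        if match.start() != position:
            raise ValueError(f"unexpected character at offset {position}")
        position = match.end()
        low, high, fixed, letter = match.groups()
        if low is not None:
            parts.append(f"[acgt]{{{low},{high}}}")
        elif fixed is not None:
            parts.append(f"[acgt]{{{fixed}}}")
        elif letter == "n":
            parts.append("[acgt]")
        else:
            parts.append(letter)
    if position != len(motif):
        raise ValueError(f"unexpected character at offset {position}")
    return "".join(parts)


known_site = "ggaat" + "c" * 18 + "gtt"  # artificial instance of ggaatn18gtt
print(motif_to_regex("ggaatn(16,19)gtt"))  # ggaat[acgt]{16,19}gtt
print(bool(re.fullmatch(motif_to_regex("ggaatn(16,19)gtt"), known_site)))  # True
```

The artificial site with an 18-base spacer matches the first predicted motif because 18 lies within its 16 to 19 spacer range.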
The following table presents the files corresponding to all motifs given by SIGffRid that overlap known SigR binding sites.
The following table presents the files corresponding to all motifs given by SIGffRid that overlap known HrdB binding sites.
| Sigma factor binding sites for HrdB |
| tgacan(17,20)an3t |
- sequences, sites and positions |
- gene functions and putative promoters. Matches the known chiA, chiB (Saito et al. 2000) and rrnD (Baylis and Bibb 1988, Kang et al. 1997) promoters. |
| ttgacn(19,20)ancnt |
- sequences, sites and positions |
- gene functions and putative promoters. Matches the chiA promoter (Saito et al. 2000). |
| ttgan(18,19)cta |
- sequences, sites and positions |
- gene functions and putative promoters. Matches the known chiC and chiD promoters (Delic et al. 1992, Saito et al. 2000). |
| cngn(18,21)taggct |
- sequences, sites and positions |
- gene functions and putative promoters. Matches the known sigB promoter (Cho et al. 2001). |
The following table presents the files corresponding to the motif given by SIGffRid that overlaps the known BldN binding site.
The following table presents the files corresponding to the motif similar to the known Bacillus subtilis SigE binding site (Roels et al. 1992).
The following table presents the files corresponding to the motif given by SIGffRid that is probably related to sporulation (probable sigma factor binding sites).
The following table presents the files corresponding to the motif given by SIGffRid that is close to the recA promoter, and most of whose targeted genes are related to DNA repair/replication/damage (variants of the motif in Ahel et al. 2002).
| DNA-damage induced transcription factor binding sites |
| tgtcagtn(14,15)tng |
- sequences, sites and positions |
- gene functions and putative promoters. Some targeted genes are highly homologous to the Escherichia coli DNA-damage inducible genes dinP, priA, radA, dinG, recQ, in addition to DNA glycosylases (e.g. ung), excinuclease (e.g. uvrB SC), polymerase I genes, and genes related to DNA replication (e.g. dnaE and dnaN, encoding respectively the alpha and beta subunits of PolIII, and recF) |
| tgtcagtn14tng |
- sequences, sites and positions |
| tgtcagtgn(9,12)ang |
- sequences, sites and positions |
| tgtcagtn(12,14)tng |
- sequences, sites and positions |
The prediction of the BldD binding site demonstrates the versatility of SIGffRid, which is able to infer transcription factor binding sites that are not sigma factor binding sites.
The following table presents the files corresponding to the motif given by SIGffRid that overlaps the known BldD binding site.
The following table presents the files corresponding to the other, possibly interesting, motifs given by SIGffRid.