SIGffRid experimental software

     

Context and Aim

Many genomes have been sequenced and annotated, but in most cases, even for E. coli, we know very few of the regulatory motifs involved in transcription initiation (the first step of gene expression). Here, we are interested in the binding sites of the sigma subunit of RNA polymerase, which is involved in recruiting the polymerase to DNA. Several characteristics make these motifs hard to find with current general-purpose methods:
- they are composed of two words ('boxes') named '-35' and '-10' according to their position relative to the transcription start,
- the spacer between the two boxes can vary in length,
- they can be degenerate, with only two letters conserved (a fuzzy box is generally associated with an 'extended' one),
- even though we know their location (upstream of genes), the large number of comparisons required gives a signal-to-noise ratio too low for effective detection.

The aim of SIGffRid is to determine useful criteria to target comparisons. Each step is, and will be, designed according to biological knowledge and/or statistics.
- we search for compound motifs present in the upstream regions of A PAIR of ORTHOLOGS (1),
- we search ONLY for compound motifs WITH TWO BOXES OVER-REPRESENTED in the WHOLE GENOME of the bacteria we are interested in (2), allowing low spacer variability,
- we CROSS the results obtained on pairs of upstream sequences ACCORDING to GAPPED SEEDS (3),
- since this is not sufficient (cf. next step), we group sequence sets by using the lowest probability of obtaining a given letter next to shared seeds (3),
- we evaluate the compound 'consensus motif' given by a sequence set by comparing the 'motif density' in upstream sequences and in the whole genome. The more specific to upstream sequences a motif is, the better we consider it.
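
The density comparison in the last step can be illustrated with a minimal Python sketch. It assumes that 'motif density' means the number of occurrences per scanned nucleotide and that specificity is a simple ratio of the two densities; this definition, the function names and the single-strand scan are illustrative assumptions, not SIGffRid's actual scoring.

```python
import re

def count_occurrences(pattern, sequence):
    """Count (possibly overlapping) start positions where the regex matches."""
    lookahead = re.compile("(?=(?:" + pattern + "))")
    return sum(1 for _ in lookahead.finditer(sequence))

def motif_density(pattern, sequences):
    """Occurrences per nucleotide over a collection of sequences (assumed definition)."""
    total_length = sum(len(s) for s in sequences)
    if total_length == 0:
        return 0.0
    hits = sum(count_occurrences(pattern, s) for s in sequences)
    return hits / total_length

def specificity_ratio(pattern, upstream_sequences, genome):
    """Ratio of upstream density to whole-genome density (hypothetical score).

    Only the given strand is scanned; a real analysis would also
    consider the reverse complement.
    """
    upstream_density = motif_density(pattern, upstream_sequences)
    genome_density = motif_density(pattern, [genome])
    if genome_density == 0:
        return float("inf") if upstream_density > 0 else 0.0
    return upstream_density / genome_density
```

A ratio well above 1 would indicate a motif that is concentrated in upstream regions rather than spread over the genome.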

     

1: ORTHOLOGUES: definition and interest


Orthologues are genes from phylogenetically related species stemming from a common ancestral gene. Biological hypotheses:
  • supposed:
      - orthologues may have kept the same regulatory mechanisms, and therefore closely related binding sites for regulatory proteins
  • currently accepted:
      - sequence conservation is related to biological function (if the bacteria are not too closely related)
[Figure: Orthologues]

     

2: Compound motifs search / Boxes over-representation


We consider as putative -35 and -10 boxes only words that are over-represented in the whole genome of the bacteria we are interested in.
To identify them, we run R'MES (Schbath 1997) (Gaussian approximation version, using the -fam option so as to take into account the sense strand and its reverse complement). Given a sequence, R'MES reports every statistically over-represented (and under-represented) word, taking into account the probabilities of the subwords it is composed of.
We therefore search for pairs of interesting words such that:
  • they are conserved in the two upstream sequences of the orthologues
  • the spacer between them is nearly constant in the two sequences
[Figure: Comparisons]
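
The two conditions above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `words` argument stands in for the over-represented words reported by R'MES, the helper names are hypothetical, and "nearly constant" is interpreted as a spacer difference of at most `max_spacer_diff` nucleotides (an assumed tolerance). The default spacer range of 14 to 20 is the one quoted in the validation section.

```python
def find_box_pairs(sequence, words, min_spacer, max_spacer):
    """Map each (left word, right word) pair to the spacer lengths observed."""
    positions = {}
    for word in words:
        start = sequence.find(word)
        while start != -1:
            positions.setdefault(word, []).append(start)
            start = sequence.find(word, start + 1)
    pairs = {}
    for left, left_starts in positions.items():
        for right, right_starts in positions.items():
            for i in left_starts:
                for j in right_starts:
                    # Spacer = gap between the end of the left box and
                    # the start of the right box.
                    spacer = j - (i + len(left))
                    if min_spacer <= spacer <= max_spacer:
                        pairs.setdefault((left, right), set()).add(spacer)
    return pairs

def conserved_pairs(seq_a, seq_b, words,
                    min_spacer=14, max_spacer=20, max_spacer_diff=1):
    """Keep word pairs found in both sequences with a nearly equal spacer.

    max_spacer_diff is an assumed tolerance, not a SIGffRid parameter.
    """
    pairs_a = find_box_pairs(seq_a, words, min_spacer, max_spacer)
    pairs_b = find_box_pairs(seq_b, words, min_spacer, max_spacer)
    conserved = {}
    for pair in pairs_a.keys() & pairs_b.keys():
        compatible = sorted(
            (sa, sb)
            for sa in pairs_a[pair]
            for sb in pairs_b[pair]
            if abs(sa - sb) <= max_spacer_diff
        )
        if compatible:
            conserved[pair] = compatible
    return conserved
```

Each entry of the result lists the spacer lengths seen in the first and second sequence for one candidate pair of boxes.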

     

3: CROSSING of results (obtained for each pair of sequences) according to GAPPED SEEDS


Under construction

     

SIGffRid validation

We applied the SIGffRid algorithm to two related bacteria, Streptomyces coelicolor A3(2) and Streptomyces avermitilis.
Pairs of intergenic upstream sequences linked by orthologous relationships were first grouped according to function families from the MBGD database (Uchiyama 2003, Uchiyama 2007) using an option of SIGffRid.
We used the seeds {###, ####, ##**#, #**##, #*#*#, #***#, #**#, #*#} and a spacer range of 14 to 20 between the conserved putative -35 and -10 boxes (the default range for sigma factors belonging to the sigma70 family).
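
To make the seed notation concrete, here is a small Python sketch. It assumes the usual spaced-seed convention, in which '#' marks a position that must be identical and '*' a position that is free to differ; the page does not define the symbols explicitly, so this reading, like the function names, is an assumption rather than a description of SIGffRid's implementation.

```python
SEEDS = ["###", "####", "##**#", "#**##", "#*#*#", "#***#", "#**#", "#*#"]

def seed_key(word, seed):
    """Project a word onto a seed: keep letters under '#', mask letters under '*'."""
    if len(word) != len(seed):
        raise ValueError("word and seed must have the same length")
    return "".join(c if s == "#" else "*" for c, s in zip(word, seed))

def shared_seed_hits(seq_a, seq_b, seed):
    """Return the masked words that occur in both sequences under the given seed."""
    span = len(seed)

    def keys(sequence):
        return {
            seed_key(sequence[i:i + span], seed)
            for i in range(len(sequence) - span + 1)
        }

    return keys(seq_a) & keys(seq_b)
```

For example, under the seed `#*#*#` the words `ttgac` and `ttgcc` share the key `t*g*c`, so they would count as a common hit even though they differ at a masked position.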

We obtained:
- 113 motifs for S. coelicolor,
- 65 motifs for S. avermitilis.

The following file summarizes the most interesting results obtained by SIGffRid when run on S. coelicolor and S. avermitilis:

main SIGffRid results.

To validate our algorithm, we searched for motifs resembling the SigR binding site.
The results that are particularly interesting have a pink background.
Two motifs were very close to the known SigR binding site (ggaatn18gtt):
- ggaatn(16,19)gtt,
- gggaan(18,20)cgtt.
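
The motif notation used on this page can be turned into a regular expression for scanning sequences, as in the following Python sketch. It assumes that `n` stands for any nucleotide, `n18` or `n(18)` for exactly 18 of them, and `n(16,19)` for 16 to 19 of them; the function names are hypothetical and only the given strand is scanned.

```python
import re

# "n" optionally followed by "(low,high)", "(count)" or a bare count.
_SPACER = re.compile(r"n(?:\((\d+)(?:,(\d+))?\)|(\d+))?")

def motif_to_regex(motif):
    """Translate e.g. 'ggaatn(16,19)gtt' into 'ggaat[acgt]{16,19}gtt'."""
    def expand(match):
        low, high, fixed = match.groups()
        if fixed is not None:          # n18
            return "[acgt]{%s}" % fixed
        if low is None:                # bare n
            return "[acgt]"
        if high is None:               # n(18)
            return "[acgt]{%s}" % low
        return "[acgt]{%s,%s}" % (low, high)   # n(16,19)
    return _SPACER.sub(expand, motif.lower())

def find_sites(motif, sequence):
    """Return (start, matched text) for every start position where the motif fits.

    A lookahead is used so that overlapping sites are reported; for a
    given start only the longest admissible spacer is returned.
    """
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.lower())]
```

Character classes already present in a motif, such as `[ta]`, are valid regular expression syntax and pass through unchanged.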


The following table presents the files corresponding to all motifs given by SIGffRid that overlap known SigR binding sites.


Sigma factor binding sites for SigR
ggaatn(16,19)gtt - sequences, sites and positions - genes functions and putative promoters. We also present a cross-check with macroarray data obtained under oxidative conditions (diamide medium; personal communication: we are grateful to Dimitris Kallifidas and Mark Paget for providing these unpublished data). The motifs match the binding site consensuses in Streptomyces coelicolor A3(2) (GGAATn(18)GTT) and Mycobacterium tuberculosis (GGGAAn(18,19)CGTT) (Paget et al. 2001).
gggaan(18,20)cgtt - sequences, sites and positions

The following table presents the files corresponding to all motifs given by SIGffRid that overlap known HrdB binding sites.


Sigma factor binding sites for HrdB
tgacan(17,20)an3t - sequences, sites and positions - genes functions and putative promoters. Matches chiA, chiB (Saito et al. 2000) and rrnD (Baylis and Bibb 1988, Kang et al. 1997) known promoters.
ttgacn(19,20)ancnt - sequences, sites and positions - genes functions and putative promoters. Matches chiA promoter (Saito et al. 2000).
ttgan(18,19)cta - sequences, sites and positions - genes functions and putative promoters. Matches chiC and chiD known promoters (Delic et al. 1992, Saito et al. 2000).
cngn(18,21)taggct - sequences, sites and positions - genes functions and putative promoters. Matches known sigB promoter (Cho et al. 2001).

The following table presents the files corresponding to the motif given by SIGffRid that overlaps the known BldN binding site.


Sigma factor binding sites for BldN
cgtaan(18,19)gtt - sequences, sites and positions - genes functions and putative promoters. Matches the sole known binding site for BldN in bldM promoter (Bibb et al. 2000, CGTAACn(16)CGTTGA).

The following table presents the files corresponding to the motif similar to known Bacillus subtilis SigE binding site (Roels et al. 1992).


catan(15,17)tac - sequences, sites and positions - genes functions and putative promoters

The following table presents the files corresponding to the motifs given by SIGffRid that are probably related to sporulation (probable sigma factor binding sites).


cngn(14,16)agtaa - sequences, sites and positions - genes functions and putative promoters. Motif responsible for bldB17 and bldB28 mutants (Pope et al. 1998)
agtaan(13,15)cng - sequences, sites and positions - genes functions and putative promoters. This motif matches the proposed binding site (Ueda et al. 2005).

The following table presents the files corresponding to the motif given by SIGffRid that is close to the recA promoter and most of whose target genes are related to DNA repair/replication/damage (variants of the motif in Ahel et al. 2002).


DNA-damage induced transcription factor binding sites
tgtcagtn(14,15)tng - sequences, sites and positions - genes functions and putative promoters. Some targeted genes are highly homologous to the Escherichia coli DNA-damage inducible genes dinP, priA, radA, dinG, recQ, in addition to DNA glycosylases (e.g. ung), excinuclease (e.g. uvrB SC), polymerase I genes, and genes related to DNA replication (e.g. dnaE and dnaN, encoding respectively the alpha and beta subunits of PolIII, and recF).
tgtcagtn14tng - sequences, sites and positions
tgtcagtgn(9,12)ang - sequences, sites and positions
tgtcagtn(12,14)tng - sequences, sites and positions


The BldD binding site prediction demonstrates the versatility of SIGffRid, which is able to infer transcription factor binding sites that are not sigma factor binding sites.

The following table presents the files corresponding to the motif given by SIGffRid that overlaps the known BldD binding site.


Transcription factor binding sites for BldD
[ta]gtgan(18,20)tn2c - sequences, sites and positions - genes functions and putative promoters. This motif is similar to the proposed consensus for BldD transcription factor binding site (Elliot et al. 2001, AGTgAn(m)TCACc)
Proposed transcription factor binding sites for BldB (or a variant of the BldD binding site?)
[ta]gtgan(16,18)cnt - sequences, sites and positions - genes functions and putative promoters. BldB is thought to be transcriptionally auto-regulated (Pope et al. 1998)

tgtgan(18,20)tna - sequences, sites and positions - genes functions and putative promoters
tnan(16,18)tgtga - sequences, sites and positions - genes functions and putative promoters
tgtgan(17,18)tnt - sequences, sites and positions - genes functions and putative promoters
tntn(16,19)ctgtga - sequences, sites and positions - genes functions and putative promoters

The following table presents the files corresponding to the other, possibly interesting, motifs given by SIGffRid.


taan(17,21)gtta - sequences, sites and positions - genes functions and putative promoters

cgggn(13,15)tta - sequences, sites and positions - genes functions and putative promoters
cccgn(14,15)gtaa - sequences, sites and positions - genes functions and putative promoters