Supplementary material for "Fastcompare: a non-alignment approach for genome-scale discovery of DNA and mRNA regulatory elements using network-level conservation"

Authors are Olivier Elemento and Saeed Tavazoie. They can be contacted at elemento AT princeton.edu and tavazoie AT genomics.princeton.edu


Distributed programs and scripts:

- download the full distribution (binaries and source) for both Linux and Windows (contains all the files described below).

- download individual platform-specific programs (written in the C language)


Program                         Linux                                     Windows                                 Source
fastcompare                     fastcompare_LINUX.bin                     fastcompare_WIN.exe                     fastcompare.c
do_fastcompare_conserved_set    do_fastcompare_conserved_set_LINUX.bin    do_fastcompare_conserved_set_WIN.exe    do_fastcompare_conserved_set.c
do_fastcompare_coconservation   do_fastcompare_coconservation_LINUX.bin   do_fastcompare_coconservation_WIN.exe   do_fastcompare_coconservation.c

- k-mer lists

5-mers    DNA    RNA
6-mers    DNA    RNA
7-mers    DNA    RNA
8-mers    DNA    RNA
9-mers    DNA    RNA
10-mers  DNA    RNA
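
For readers who want to build a k-mer list of their own, a minimal Python sketch follows. It simply enumerates every possible k-mer over a four-letter alphabet. This is an assumption about the list contents: the page does not say whether the distributed lists are exhaustive, whether reverse complements are collapsed, or whether the RNA lists use U rather than T. The function name enumerate_kmers is hypothetical.

```python
from itertools import product

# Assumed alphabets: T for DNA, U for RNA (not confirmed by the distributed files).
ALPHABETS = {"DNA": "ACGT", "RNA": "ACGU"}


def enumerate_kmers(k, molecule="DNA"):
    """Return all 4**k k-mers over the chosen alphabet, in lexicographic order."""
    alphabet = ALPHABETS[molecule]
    return ["".join(letters) for letters in product(alphabet, repeat=k)]


if __name__ == "__main__":
    kmers = enumerate_kmers(5, "DNA")
    # 4**5 = 1024 distinct 5-mers, from AAAAA to TTTTT
    print(len(kmers), kmers[0], kmers[-1])
```

The number of k-mers grows as 4^k, so a 10-mer list built this way has 1,048,576 entries.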

- download pcre.dll for the above Windows binaries (not used in fastcompare.exe)

- download individual platform-independent Perl scripts

do_fastcompare_alignment

do_fastcompare_randomization

- download the following Perl modules (required for the above Perl scripts): 

Sets.pm 
Table.pm
ClustalW.pm
Fasta.pm
Sequence.pm


- download the following external packages: 

ClustalW:     EBI
Perl:             ActivePerl for Windows


Additional scripts (may require a few minor modifications to run on your platform)


annotate_genome_using_orthologous_proteins.pl requires standalone BLAST and GeneWise to be installed locally, together with the above Perl modules and two additional ones: MyBlast.pm and GeneWise.pm. Usage is:
perl annotate_genome_using_orthologous_proteins.pl proteins.fa genome.fa

proteins.fa contains one protein per gene (usually the longest splice form). proteins.fa and genome.fa must first be formatted using:
formatdb -i proteins.fa -p T -o T
formatdb -i genome.fa -p F -o T

best_reciprocal_blast_hits_protein_protein.pl requires standalone BLAST to be installed, together with the above Perl modules and MyBlast.pm. Usage is:
perl best_reciprocal_blast_hits_protein_protein.pl proteins1.fa proteins2.fa

proteins1.fa and proteins2.fa must first be formatted using:
formatdb -i proteins1.fa -p T -o T
formatdb -i proteins2.fa -p T -o T
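
The reciprocal-best-hit criterion suggested by the script name can be sketched in Python as follows. This is not the Perl script itself: it assumes the best BLAST hit in each direction has already been extracted into a dictionary, and it ignores E-value thresholds and tie handling, which are not described here. The function name and the gene identifiers are hypothetical.

```python
def reciprocal_best_hits(best_1_to_2, best_2_to_1):
    """Return sorted pairs (a, b) where b is a's best hit and a is b's best hit."""
    pairs = []
    for a, b in best_1_to_2.items():
        # Keep the pair only if the reverse search points back to the same protein.
        if best_2_to_1.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)


# Hypothetical best-hit tables, one per search direction.
best_1_to_2 = {"geneA": "geneX", "geneB": "geneY", "geneC": "geneY"}
best_2_to_1 = {"geneX": "geneA", "geneY": "geneB", "geneZ": "geneC"}

print(reciprocal_best_hits(best_1_to_2, best_2_to_1))
# [('geneA', 'geneX'), ('geneB', 'geneY')]
```

In this toy input geneC is dropped because its best hit, geneY, has geneB as its own best hit.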

Usage: perl extract_upstream_sequences_from_genome.pl annotation.txt genome.fa lengthU lengthD minlen

annotation.txt contains gene coordinates, with one line per gene (usually the longest splice form), in the following format:
GENE [tab] CHROMOSOME/SCAFFOLD [tab] START_PROTEIN [tab] END_PROTEIN [tab] STRAND [tab] START_TRANSCRIPT [tab] END_TRANSCRIPT
Example:
CG10018    3R    1380857     1383262     1     1380812     1383514
CG10019    2L    3745692     3752190     1     3745692     3752190
CG10021    2L    3779726     3783284     1     3779693     3784129
CG10023    2R    14945969    14951459    -1    14945513    14952105
CG10026    2L    19452110    19453522    1     19451607    19453655
CG10029    3R    3141555     3142891     -1    3141555     3142891
CG10031    2L    3698586     3699040     -1    3698555     3699091
CG10032    3R    3097314     3098231     1     3097314     3098231
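
To check an annotation file before running the extraction scripts, the seven-column format above can be parsed with a short Python sketch like the one below. The class and function names are hypothetical. Splitting on any whitespace rather than strictly on tabs is a convenience assumption that only holds if gene and scaffold names contain no spaces.

```python
from typing import NamedTuple


class GeneAnnotation(NamedTuple):
    gene: str
    chromosome: str
    start_protein: int
    end_protein: int
    strand: int
    start_transcript: int
    end_transcript: int


def parse_annotation_line(line):
    """Parse one annotation line into a GeneAnnotation record."""
    fields = line.split()  # accepts tabs or runs of spaces
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    gene, chromosome = fields[0], fields[1]
    start_p, end_p, strand, start_t, end_t = (int(x) for x in fields[2:])
    if strand not in (1, -1):
        raise ValueError(f"strand must be 1 or -1, got {strand}")
    return GeneAnnotation(gene, chromosome, start_p, end_p, strand, start_t, end_t)


record = parse_annotation_line("CG10023\t2R\t14945969\t14951459\t-1\t14945513\t14952105")
print(record.gene, record.strand, record.start_transcript)
# CG10023 -1 14945513
```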


lengthU is the length of the region upstream of the transcription start site (TSS). lengthD is the length of the region downstream of the TSS (usually 0). genome.fa needs to be formatted using:
formatdb -i genome.fa -p F -o T

Usage: perl extract_3utr_sequences_from_genome.pl annotation.txt genome.fa minlen

annotation.txt contains gene coordinates as described above, with one line per gene (usually the longest splice form). minlen is the minimal length of a 3'UTR. genome.fa needs to be formatted using:
formatdb -i genome.fa -p F -o T
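
The coordinate arithmetic implied by the two extraction scripts can be sketched in Python as below. This is an illustration under stated assumptions, not a description of what the Perl scripts actually do: coordinates are taken to be 1-based and inclusive, START is assumed to be smaller than END on both strands (as in the example lines), the TSS is assumed to be START_TRANSCRIPT on strand 1 and END_TRANSCRIPT on strand -1, and the 3'UTR is assumed to run from the end of the protein-coding region to the end of the transcript. Whether END_PROTEIN includes the stop codon is not stated. Both function names are hypothetical.

```python
def upstream_interval(strand, start_transcript, end_transcript, length_u, length_d=0):
    """Return (start, end) of the region around the TSS, in genome coordinates."""
    if strand == 1:
        tss = start_transcript
        start, end = tss - length_u, tss - 1 + length_d
    else:
        tss = end_transcript
        start, end = tss + 1 - length_d, tss + length_u
    return max(start, 1), end  # do not run off the start of the scaffold


def utr3_interval(strand, start_protein, end_protein,
                  start_transcript, end_transcript, minlen):
    """Return (start, end) of the assumed 3'UTR, or None if shorter than minlen."""
    if strand == 1:
        start, end = end_protein + 1, end_transcript
    else:
        start, end = start_transcript, start_protein - 1
    if end - start + 1 < minlen:
        return None
    return start, end


# CG10018 (strand 1) and CG10023 (strand -1) from the example annotation.
print(upstream_interval(1, 1380812, 1383514, 1000))
# (1379812, 1380811)
print(utr3_interval(-1, 14945969, 14951459, 14945513, 14952105, 20))
# (14945513, 14945968)
```

For genes on strand -1 the extracted sequence would still need to be reverse-complemented, a step this sketch leaves out.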
Usage: perl match_micrornas_to_kmers.pl micrornas.fa kmers.txt