fastcompare

Supplementary material for "Fastcompare: a non-alignment approach for genome-scale discovery of DNA and mRNA regulatory elements using network-level conservation"

Authors are Olivier Elemento and Saeed Tavazoie. They can be contacted at elemento AT princeton.edu and tavazoie AT genomics.princeton.edu

Distributed programs and scripts :

- download the full distribution (binaries and source) for both Linux and Windows (contains all the files described below).

- download individual platform-specific programs (written in the C language)

	Linux	Windows	Source
fastcompare	fastcompare_LINUX.bin	fastcompare_WIN.exe	fastcompare.c
do_fastcompare_conserved_set	do_fastcompare_conserved_set_LINUX.bin	do_fastcompare_conserved_set_WIN.exe	do_fastcompare_conserved_set.c
do_fastcompare_coconservation	do_fastcompare_coconservation_LINUX.bin	do_fastcompare_coconservation_WIN.exe	do_fastcompare_coconservation.c

- k-mer lists

5-mers    DNA    RNA
6-mers    DNA    RNA
7-mers    DNA    RNA
8-mers    DNA    RNA
9-mers    DNA    RNA
10-mers  DNA    RNA

- download pcre.dll for the above Windows binaries (not used in fastcompare.exe)

- download individual platform-independent Perl scripts

do_fastcompare_alignment

do_fastcompare_randomization

- download the following Perl modules (required for the above Perl scripts):

Sets.pm
Table.pm
ClustalW.pm
Fasta.pm
Sequence.pm

- download the following external packages:

ClustalW: EBI
Perl: ActivePerl for Windows

Additional scripts (may require a few minor modifications to run on your platform)

annotate_genome_using_orthologous_proteins.pl : basic genome annotation using orthologous protein sequences (this script is not meant to replace a full genome annotation pipeline). Starting from a protein, the script uses BLAST to find clusters of HSPs (separated by < 100kb) in the related genome, extracts the corresponding genomic region, and uses GeneWise to find accurate gene boundaries. The candidate gene is then BLASTed back against the set of proteins, and it is reported only if its best hit is the protein the analysis started with.

This script requires standalone BLAST and GeneWise to be installed locally, along with the above Perl modules and the following ones: MyBlast.pm and GeneWise.pm. Usage is:
perl annotate_genome_using_orthologous_proteins.pl proteins.fa genome.fa

proteins.fa contains one protein per gene (usually the longest splice form). proteins.fa and genome.fa must first be formatted using:
formatdb -i proteins.fa -p T -o T
formatdb -i genome.fa -p F -o T
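
The HSP clustering step described above (HSPs separated by < 100kb are grouped together) can be illustrated with a short Python sketch. This is not the code of the Perl script. The tuple layout (sequence name, start, end) and the function name cluster_hsps are assumptions made for illustration only.

```python
def cluster_hsps(hsps, max_gap=100_000):
    """Group HSPs into clusters of neighbours lying less than max_gap apart.

    hsps is a list of (seq_id, start, end) tuples in genome coordinates.
    This layout is hypothetical; it is not the output format of MyBlast.pm.
    """
    clusters = []
    # Sorting by sequence name, then start, puts neighbouring HSPs side by side.
    for seq_id, start, end in sorted(hsps):
        last = clusters[-1] if clusters else None
        if last and last["seq_id"] == seq_id and start - last["end"] < max_gap:
            # Close enough to the current cluster: extend it.
            last["end"] = max(last["end"], end)
            last["n_hsps"] += 1
        else:
            # Different sequence or gap too large: open a new cluster.
            clusters.append(
                {"seq_id": seq_id, "start": start, "end": end, "n_hsps": 1}
            )
    return clusters


if __name__ == "__main__":
    example = [
        ("2L", 1000, 1200),
        ("2L", 50_000, 50_300),
        ("2L", 400_000, 400_200),
        ("3R", 10, 90),
    ]
    for cluster in cluster_hsps(example):
        print(cluster)
```

Each resulting cluster gives the span of the genomic region that would then be passed to GeneWise.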

best_reciprocal_blast_hits_protein_protein.pl : orthology determination using best reciprocal BLAST hits

This script requires standalone BLAST to be installed, the above Perl modules and MyBlast.pm. Usage is:
perl best_reciprocal_blast_hits_protein_protein.pl proteins1.fa proteins2.fa

proteins1.fa and proteins2.fa must first be formatted using:
formatdb -i proteins1.fa -p T -o T
formatdb -i proteins2.fa -p T -o T
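
As an illustration of the best-reciprocal-hit criterion, here is a minimal Python sketch. It assumes the BLAST results have already been reduced to (query, subject, score) rows. That row layout and the function names are hypothetical and do not reflect the internals of the Perl script or of MyBlast.pm.

```python
def best_hits(blast_rows):
    """Map each query to its highest-scoring subject.

    blast_rows is an iterable of (query, subject, score) tuples
    (an assumed, simplified representation of BLAST output).
    """
    best = {}
    for query, subject, score in blast_rows:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(rows_1_vs_2, rows_2_vs_1):
    """Return sorted (protein1, protein2) pairs that are each other's best hit."""
    forward = best_hits(rows_1_vs_2)
    backward = best_hits(rows_2_vs_1)
    return sorted(
        (a, b) for a, b in forward.items() if backward.get(b) == a
    )


if __name__ == "__main__":
    rows_1_vs_2 = [("A1", "B1", 200), ("A1", "B2", 150), ("A2", "B2", 90)]
    rows_2_vs_1 = [("B1", "A1", 210), ("B2", "A1", 140), ("B2", "A2", 80)]
    print(reciprocal_best_hits(rows_1_vs_2, rows_2_vs_1))
```

In this toy input, A2's best hit is B2, but B2's best hit is A1, so only the pair (A1, B1) is kept.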

extract_upstream_sequences_from_genome.pl : upstream region extraction from genome and gene coordinate files

Usage : perl extract_upstream_sequences_from_genome.pl annotation.txt genome.fa lengthU lengthD minlen

annotation.txt contains gene coordinates, with one line per gene (usually the longest splice form), in the following format:
GENE [tab] CHROMOSOME/SCAFFOLD [tab] START_PROTEIN [tab] END_PROTEIN [tab] STRAND [tab] START_TRANSCRIPT [tab] END_TRANSCRIPT
Ex:
CG10018    3R    1380857    1383262    1    1380812    1383514
CG10019    2L    3745692    3752190    1    3745692    3752190
CG10021    2L    3779726    3783284    1    3779693    3784129
CG10023    2R    14945969   14951459   -1    14945513   14952105
CG10026    2L    19452110   19453522   1    19451607   19453655
CG10029    3R    3141555    3142891    -1    3141555    3142891
CG10031    2L    3698586    3699040    -1    3698555    3699091
CG10032    3R    3097314    3098231    1    3097314    3098231

lengthU is the length of the region upstream of the TSS. lengthD is the length downstream of the TSS (usually 0). genome.fa needs to be formatted using: formatdb -i genome.fa -p F -o T
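
The following Python sketch shows one way the annotation format and the lengthU/lengthD parameters could translate into an upstream sequence. It rests on assumptions that are not stated above: coordinates are taken to be 1-based and inclusive, the TSS is taken to be START_TRANSCRIPT on strand 1 and END_TRANSCRIPT on strand -1, and minus-strand regions are reverse-complemented. The minlen argument is left out because its meaning for this script is not described here. All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_annotation_line(line):
    """Parse one line of annotation.txt (fields separated by tabs or spaces)."""
    gene, chrom, start_p, end_p, strand, start_t, end_t = line.split()
    return {
        "gene": gene,
        "chrom": chrom,
        "start_protein": int(start_p),
        "end_protein": int(end_p),
        "strand": int(strand),
        "start_transcript": int(start_t),
        "end_transcript": int(end_t),
    }


def upstream_region(chrom_seq, gene, length_u, length_d=0):
    """Extract length_u bases upstream and length_d bases downstream of the TSS.

    Assumes 1-based inclusive coordinates; regions are clipped at the
    ends of the chromosome or scaffold.
    """
    if gene["strand"] == 1:
        tss = gene["start_transcript"]
        lo = max(1, tss - length_u)
        hi = min(len(chrom_seq), tss - 1 + length_d)
        return chrom_seq[lo - 1:hi]
    # Minus strand: upstream lies at higher coordinates.
    tss = gene["end_transcript"]
    lo = max(1, tss + 1 - length_d)
    hi = min(len(chrom_seq), tss + length_u)
    return reverse_complement(chrom_seq[lo - 1:hi])


if __name__ == "__main__":
    gene = parse_annotation_line(
        "CG10023\t2R\t14945969\t14951459\t-1\t14945513\t14952105"
    )
    print(gene["gene"], gene["strand"], gene["end_transcript"])
```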

extract_3utr_sequences_from_genome.pl : 3'UTR extraction from genome and gene coordinate files

Usage : perl extract_3utr_sequences_from_genome.pl annotation.txt genome.fa minlen

annotation.txt contains gene coordinates as described above, with one line per gene (usually the longest splice form). minlen is the minimal length of a 3'UTR. genome.fa needs to be formatted using: formatdb -i genome.fa -p F -o T
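
A possible reading of the 3'UTR extraction, sketched in Python under explicit assumptions: coordinates are 1-based and inclusive, the 3'UTR is the stretch between the end of the protein-coding region and the end of the transcript (END_PROTEIN+1 to END_TRANSCRIPT on strand 1, START_TRANSCRIPT to START_PROTEIN-1 on strand -1), and UTRs shorter than minlen are discarded. Whether the Perl script treats the stop codon or the coordinate system in exactly this way is not stated above. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def three_prime_utr(chrom_seq, gene, minlen=0):
    """Return the 3'UTR of a gene, or None if it is shorter than minlen.

    gene is a dict with the integer keys strand, start_protein,
    end_protein, start_transcript and end_transcript (assumed
    1-based inclusive coordinates).
    """
    if gene["strand"] == 1:
        lo, hi = gene["end_protein"] + 1, gene["end_transcript"]
        utr = chrom_seq[lo - 1:hi]
    else:
        # Minus strand: the 3' end of the gene lies at lower coordinates.
        lo, hi = gene["start_transcript"], gene["start_protein"] - 1
        utr = reverse_complement(chrom_seq[lo - 1:hi])
    return utr if len(utr) >= minlen else None


if __name__ == "__main__":
    chrom = "ACGTTGCAAGGCTTAA"
    gene = {
        "strand": 1,
        "start_protein": 3,
        "end_protein": 10,
        "start_transcript": 1,
        "end_transcript": 14,
    }
    print(three_prime_utr(chrom, gene, minlen=3))
```

Genes whose protein and transcript coordinates coincide (such as CG10019 in the example above) would yield an empty 3'UTR and be filtered out by any minlen of 1 or more.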

match_micrornas_to_kmers.pl : determines which k-mers match the 5' extremity of a list of miRNAs

Usage : perl match_micrornas_to_kmers.pl micrornas.fa kmers.txt
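
The matching rule used by the script is not spelled out above. The Python sketch below assumes one plausible rule: a k-mer matches if it is perfectly Watson-Crick complementary to a stretch within the first 9 nucleotides of the miRNA, i.e. if it occurs in the reverse complement of that 5' window. The window size, the function names and the rule itself are assumptions for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_dna(seq):
    """Upper-case a sequence and write it in the DNA alphabet (U -> T)."""
    return seq.upper().replace("U", "T")


def matches_five_prime_end(mirna, kmer, window=9):
    """True if kmer is complementary to part of the miRNA's first `window` nt."""
    five_prime = to_dna(mirna)[:window]
    target_site = five_prime.translate(COMPLEMENT)[::-1]
    return to_dna(kmer) in target_site


def match_micrornas_to_kmers(mirnas, kmers, window=9):
    """Map each k-mer to the names of the miRNAs whose 5' end it matches.

    mirnas is a dict of name -> sequence; kmers is a list of strings.
    """
    matches = {}
    for kmer in kmers:
        names = [
            name for name, seq in mirnas.items()
            if matches_five_prime_end(seq, kmer, window)
        ]
        if names:
            matches[kmer] = names
    return matches


if __name__ == "__main__":
    mirnas = {"let-7a": "UGAGGUAGUAGGUUGUAUAGUU"}
    print(match_micrornas_to_kmers(mirnas, ["CTACCTC", "GGGGGGG"]))
```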