Supplementary material for "Fastcompare: a non-alignment approach for genome-scale
discovery of DNA and mRNA regulatory elements using network-level conservation"
Authors: Olivier Elemento and Saeed Tavazoie. They can be contacted at elemento AT princeton.edu and tavazoie AT genomics.princeton.edu
Distributed programs and scripts :
- download the full distribution (binaries and source) for both Linux and Windows (contains all the files described below).
- download individual platform-specific programs (written in the C language)
- k-mer lists
5-mers DNA RNA
6-mers DNA RNA
7-mers DNA RNA
8-mers DNA RNA
9-mers DNA RNA
10-mers DNA RNA
- download pcre.dll for the above Windows binaries (not used in fastcompare.exe)
- download individual platform-independent Perl scripts
do_fastcompare_alignment
do_fastcompare_randomization
- download the following Perl modules (required for the above Perl scripts):
Sets.pm
Table.pm
ClustalW.pm
Fasta.pm
Sequence.pm
- download the following external packages:
ClustalW: EBI
Perl: ActivePerl for Windows
Additional scripts (may require a few minor modifications to run on your platform)
- annotate_genome_using_orthologous_proteins.pl: basic genome annotation
using orthologous protein sequences (this script is not meant to replace
a full genome annotation pipeline).
Starting from a protein, the script uses BLAST to find clusters
of HSPs (separated by < 100 kb) in the related genome, then extracts
the corresponding genomic region and uses GeneWise to find accurate
gene boundaries. The candidate gene is then BLASTed back against the set of
proteins, and if the best hit is the protein the analysis was started
with, the candidate gene is reported.
This script requires standalone BLAST and GeneWise to be installed
locally, as well as the above Perl modules and the following ones: MyBlast.pm, GeneWise.pm. Usage is:
perl annotate_genome_using_orthologous_proteins.pl proteins.fa genome.fa
proteins.fa contains one protein per gene (usually the longest splice form). proteins.fa and genome.fa need to have been formatted using:
formatdb -i proteins.fa -p T -o T
formatdb -i genome.fa -p F -o T
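The HSP clustering step described for annotate_genome_using_orthologous_proteins.pl can be illustrated with a minimal Python sketch. This is not the script's actual code: the function names, the (start, end) tuple representation, and the choice to measure the gap from the end of the current cluster to the start of the next HSP are all assumptions, and the sketch handles a single scaffold and strand only.

```python
from typing import List, Tuple

HSP = Tuple[int, int]  # (start, end) on one scaffold, with start <= end


def cluster_hsps(hsps: List[HSP], max_gap: int = 100_000) -> List[List[HSP]]:
    """Group HSPs so that each one lies less than max_gap from its cluster."""
    clusters: List[List[HSP]] = []
    cluster_end = 0  # rightmost coordinate covered by the current cluster
    for start, end in sorted(hsps):
        if clusters and start - cluster_end < max_gap:
            # Close enough to the current cluster: extend it.
            clusters[-1].append((start, end))
            cluster_end = max(cluster_end, end)
        else:
            # Too far away (or first HSP): open a new cluster.
            clusters.append([(start, end)])
            cluster_end = end
    return clusters


def cluster_region(cluster: List[HSP]) -> HSP:
    """Genomic span covered by a cluster (the region one would extract)."""
    return min(s for s, _ in cluster), max(e for _, e in cluster)


# Hypothetical HSP coordinates, for illustration only.
hsps = [(5_000, 5_300), (40_000, 40_250), (400_000, 400_200), (139_000, 139_400)]
clusters = cluster_hsps(hsps)
regions = [cluster_region(c) for c in clusters]
print(regions)
```

With these made-up coordinates, the first three HSPs fall into one cluster and the last one forms its own, so two candidate regions would be passed on to the gene-boundary step.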
best_reciprocal_blast_hits_protein_protein.pl requires standalone BLAST to be installed, as well as the above Perl modules and MyBlast.pm. Usage is:
perl best_reciprocal_blast_hits_protein_protein.pl proteins1.fa proteins2.fa
proteins1.fa and proteins2.fa need to have been formatted using:
formatdb -i proteins1.fa -p T -o T
formatdb -i proteins2.fa -p T -o T
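As the script name indicates, the output is the set of best reciprocal BLAST hits between the two protein sets. The Python sketch below shows the general idea under stated assumptions: hits are given as hypothetical (query, subject, score) triples, the "best" hit is the one with the highest score, and the function names are invented for illustration. It does not reproduce the Perl script or MyBlast.pm.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (query, subject, score); higher score is better


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward: Iterable[Hit],
                         backward: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    fwd = best_hits(forward)   # set 1 searched against set 2
    bwd = best_hits(backward)  # set 2 searched against set 1
    return sorted((a, b) for a, b in fwd.items() if bwd.get(b) == a)


# Hypothetical hits, for illustration only.
forward = [("A1", "B1", 250.0), ("A1", "B2", 90.0),
           ("A2", "B2", 180.0), ("A3", "B1", 120.0)]
backward = [("B1", "A1", 245.0), ("B1", "A3", 110.0), ("B2", "A2", 175.0)]
pairs = reciprocal_best_hits(forward, backward)
print(pairs)
```

Here A3 is dropped because its best hit B1 prefers A1, which is the behavior one expects from a reciprocal filter.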
Usage : perl extract_upstream_sequences_from_genome.pl annotation.txt genome.fa lengthU lengthD minlen
annotation.txt contains gene coordinates, with one line per gene (usually the longest splice form), in the following format:
GENE [tab] CHROMOSOME/SCAFFOLD [tab] START_PROTEIN [tab] END_PROTEIN [tab] STRAND [tab] START_TRANSCRIPT [tab] END_TRANSCRIPT
Ex:
CG10018	3R	1380857	1383262	1	1380812	1383514
CG10019	2L	3745692	3752190	1	3745692	3752190
CG10021	2L	3779726	3783284	1	3779693	3784129
CG10023	2R	14945969	14951459	-1	14945513	14952105
CG10026	2L	19452110	19453522	1	19451607	19453655
CG10029	3R	3141555	3142891	-1	3141555	3142891
CG10031	2L	3698586	3699040	-1	3698555	3699091
CG10032	3R	3097314	3098231	1	3097314	3098231
lengthU is the length of the region upstream of the TSS (transcription
start site). lengthD is the length of the region downstream of the TSS
(usually 0). genome.fa needs to have been formatted using: formatdb -i genome.fa -p F -o T
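To make the roles of lengthU, lengthD and the strand column concrete, here is a minimal Python sketch that parses one annotation line and computes the coordinates of the upstream region. Several points are assumptions rather than documented behavior of extract_upstream_sequences_from_genome.pl: coordinates are taken to be 1-based and inclusive, the TSS is taken to be START_TRANSCRIPT on strand 1 and END_TRANSCRIPT on strand -1, and the minlen argument is not modeled. The names Gene, parse_annotation_line and upstream_interval are hypothetical.

```python
from typing import NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    scaffold: str
    start_protein: int
    end_protein: int
    strand: int
    start_transcript: int
    end_transcript: int


def parse_annotation_line(line: str) -> Gene:
    """Parse one annotation.txt line (seven tab-separated fields)."""
    name, scaffold, *numbers = line.split()
    return Gene(name, scaffold, *map(int, numbers))


def upstream_interval(gene: Gene, length_u: int, length_d: int = 0) -> Tuple[int, int]:
    """Return (start, end) of the upstream region in genome coordinates."""
    if gene.strand == 1:
        tss = gene.start_transcript  # assumption: TSS on the forward strand
        start, end = tss - length_u, tss - 1 + length_d
    else:
        tss = gene.end_transcript    # assumption: TSS on the reverse strand
        start, end = tss + 1 - length_d, tss + length_u
        # The extracted sequence would still need to be reverse-complemented.
    return max(1, start), end        # do not run off the start of the scaffold


plus = parse_annotation_line("CG10018\t3R\t1380857\t1383262\t1\t1380812\t1383514")
minus = parse_annotation_line("CG10023\t2R\t14945969\t14951459\t-1\t14945513\t14952105")
print(upstream_interval(plus, 1000))
print(upstream_interval(minus, 1000))
```

Under these assumptions, a 1000 bp upstream region lies at lower coordinates than the transcript for CG10018 and at higher coordinates for CG10023.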
Usage : perl extract_3utr_sequences_from_genome.pl annotation.txt genome.fa minlen
annotation.txt contains gene coordinates as described above, with one line per gene (usually the longest splice form). minlen is the minimal length of a 3'UTR. genome.fa needs to have been formatted using: formatdb -i genome.fa -p F -o T
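The way a 3'UTR can be derived from the annotation columns, and how minlen filters short ones, can be sketched in Python as follows. This is an illustration under assumptions, not the logic of extract_3utr_sequences_from_genome.pl: coordinates are taken to be 1-based and inclusive, the 3'UTR is taken to run from the base after END_PROTEIN to END_TRANSCRIPT on strand 1 (and from START_TRANSCRIPT to the base before START_PROTEIN on strand -1), and the function name three_prime_utr is hypothetical.

```python
from typing import Optional, Tuple


def three_prime_utr(line: str, minlen: int) -> Optional[Tuple[str, int, int]]:
    """Return (scaffold, start, end) of the 3'UTR, or None if shorter than minlen."""
    fields = line.split()
    scaffold = fields[1]
    (start_protein, end_protein, strand,
     start_transcript, end_transcript) = map(int, fields[2:7])
    if strand == 1:
        # Forward strand: UTR follows the coding region.
        start, end = end_protein + 1, end_transcript
    else:
        # Reverse strand: UTR sits at the low-coordinate end of the transcript.
        start, end = start_transcript, start_protein - 1
    if end - start + 1 < minlen:
        return None
    return scaffold, start, end


utr_plus = three_prime_utr("CG10018\t3R\t1380857\t1383262\t1\t1380812\t1383514", 50)
utr_none = three_prime_utr("CG10019\t2L\t3745692\t3752190\t1\t3745692\t3752190", 50)
utr_minus = three_prime_utr("CG10023\t2R\t14945969\t14951459\t-1\t14945513\t14952105", 50)
print(utr_plus, utr_none, utr_minus)
```

In the example, CG10019 has identical protein and transcript coordinates, so no 3'UTR of at least minlen bases exists and the gene is skipped.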
Usage : perl match_micrornas_to_kmers.pl micrornas.fa kmers.txt