Analysis tools for GLINT

Introduction

GLINT (Global Linkage based Investigation of Naturally occurring Traits) is a workflow for determining which genetic loci contribute to traits that differ between moderately related strains of bacteria. The experimental requirements for the procedure are detailed in a recent publication; this page describes the usage of an accompanying set of software tools for postprocessing GLINT data. These tools may also be used to analyze ADAM data, in which case the same strain should be given as both the donor and the recipient.

System Requirements

The GLINT software component consists of a series of python scripts, and thus requires a system installation of python and a number of python modules:

Biopython
Numpy
Scipy
Matplotlib
rpy2 (and a functional R installation including the zoo package)

These should be installed through a package manager whenever possible. GLINT is written for pre-3.0 versions of python, and was developed and tested with python 2.7.3.

Getting GLINT

A snapshot of the most recent stable version of GLINT may be downloaded here. Updates will also be pushed to a git repository between changes to the main release.

Technical Summary

A complete description of the steps in GLINT postprocessing is given in the GLINT paper. In brief, the experimental component of GLINT provides data on the relative fitness of different portions of the donor strain genome when moved into the recipient strain genome. These data take the form of relative abundances of transferred genomic regions in libraries selected under either baseline conditions or the condition of interest (the test condition). Genomic regions where donor-derived insertions are over-represented in the test condition indicate that the donor strain genetic content at that site is more favorable under the test condition; the opposite holds where donor-derived insertions are under-represented.

The GLINT postprocessing toolkit takes as input probe-level data on the log ratios of donor-derived DNA under the test condition to that in the reference condition; thus, initial analysis and normalization of raw transposon footprinting data are left to the user. Log ratios are used because they tend to yield approximately normal background distributions of selection scores. The raw data are then corrected for the presence of large indels between the donor and recipient strains by collapsing regions of the donor genome that are not present in the recipient strain, and by normalizing the selection scores based on the distance of each probe from the nearest large indel. GLINT requires as input an alignment of the donor and recipient genomes, preferably produced using Mauve (at present, only .xmfa files in Mauve's output format can be used).
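Since computing the log ratios is left to the user, the following minimal sketch shows one way this could be done with Numpy. It is not part of GLINT: the function name, the use of base 2, and the pseudocount are all assumptions made for illustration, and the appropriate upstream normalization depends on the platform used.

```python
import numpy as np

def log_ratio_scores(test_signal, ref_signal, pseudocount=1.0):
    """Return per-probe log2(test / reference).

    The pseudocount (an assumption, not a GLINT parameter) avoids
    taking the log of zero for probes with no signal.
    """
    test = np.asarray(test_signal, dtype=float) + pseudocount
    ref = np.asarray(ref_signal, dtype=float) + pseudocount
    return np.log2(test / ref)

# Toy signals for three probes; pseudocount disabled to keep the numbers simple.
scores = log_ratio_scores([100, 400, 50], [100, 100, 200], pseudocount=0.0)
print(scores)
```

Positive values indicate probes where donor-derived DNA is enriched under the test condition, negative values probes where it is depleted.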

After normalization and remapping of low homology regions, GLINT identifies significant peaks by parameterizing a background distribution using an autoregressive model, calculating a selection score at each location based on a rolling median of nearby probe-level data, producing p-values for all selection scores based on simulations from the background distribution, and then flagging all scores that are significant at a user-specified false discovery rate. Adjacent or nearby significant probe calls are lumped into "runs", which must have a particular minimum length to contribute to peak calls. Peaks are then defined as contiguous regions of a specified minimum width in which one or more runs are present within a sufficiently small area. All parameters used during peak calling can be modified, as described below; the defaults (used in the GLINT paper) are an FDR of 0.01, a requirement of a run of ten adjacent significant probes to contribute to peak calling, a minimum peak width of 5 kb, and a maximum distance of 5 kb between runs to be combined into a peak. Because P1-based transduction transfers large genomic regions (up to approximately 90 kb), true peaks in GLINT or ADAM fitness score profiles tend to involve a broad region (spanning up to a few tens of kb) near the locus actually making the fitness contribution; this informed the design of the peak breadth-based filter used here.
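The significance-calling steps described above can be illustrated with a short Numpy sketch. This is not the GLINT implementation: it assumes a Yule-Walker fit for the autoregressive model, two-sided empirical p-values, truncated windows at the ends of the rolling median, and the Benjamini-Hochberg procedure for FDR control, none of which are specified on this page. All function names are hypothetical.

```python
import numpy as np

def fit_ar_yule_walker(x, order):
    """Estimate AR coefficients and innovation s.d. via the Yule-Walker equations."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    # Biased sample autocovariance at lags 0..order.
    acov = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    toeplitz = np.array([[acov[abs(i - j)] for j in range(order)]
                         for i in range(order)])
    phi = np.linalg.solve(toeplitz, acov[1:])
    sigma = np.sqrt(acov[0] - np.dot(phi, acov[1:]))
    return phi, sigma

def simulate_ar(phi, sigma, n, rng, burn_in=200):
    """Simulate n values from the fitted AR model (burn_in must exceed the order)."""
    order = len(phi)
    noise = rng.normal(0.0, sigma, size=n + burn_in)
    x = np.zeros(n + burn_in)
    for t in range(order, n + burn_in):
        # x[t-order:t][::-1] is x[t-1], x[t-2], ..., x[t-order].
        x[t] = np.dot(phi, x[t - order:t][::-1]) + noise[t]
    return x[burn_in:]

def rolling_median(values, halfwidth):
    """Median over +/- halfwidth probes; windows are truncated at the ends."""
    values = np.asarray(values, dtype=float)
    return np.array([np.median(values[max(0, i - halfwidth):i + halfwidth + 1])
                     for i in range(len(values))])

def empirical_pvalues(observed, null_scores):
    """Two-sided p-values: fraction of null scores at least as extreme."""
    null_abs = np.sort(np.abs(null_scores))
    obs_abs = np.abs(np.asarray(observed, dtype=float))
    n_ge = len(null_abs) - np.searchsorted(null_abs, obs_abs, side="left")
    return (n_ge + 1.0) / (len(null_abs) + 1.0)

def benjamini_hochberg(pvals, fdr):
    """Boolean mask of p-values significant at the given FDR."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= fdr * np.arange(1, m + 1) / m
    significant = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.max(np.nonzero(passed)[0])
        significant[order[:cutoff + 1]] = True
    return significant

# Toy data standing in for normalized probe-level scores.
rng = np.random.RandomState(0)
probe_scores = simulate_ar(np.array([0.5]), 1.0, 2000, rng)
phi, sigma = fit_ar_yule_walker(probe_scores, order=3)
null = rolling_median(simulate_ar(phi, sigma, 5000, rng), halfwidth=10)
observed = rolling_median(probe_scores, halfwidth=10)
pvals = empirical_pvalues(observed, null)
flags = benjamini_hochberg(pvals, fdr=0.01)
```

The simulated background is smoothed with the same rolling median as the observed scores so that the two are compared on the same footing. The toy example uses far fewer simulated values than a real analysis would.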

The GLINT postprocessing tools comprise a series of python scripts carrying out specific tasks required for the analysis described above, along with the driver script glint.py, which runs these other scripts to carry out the appropriate analysis. All user interaction should occur through glint.py unless problems are encountered during the analysis.

While the set of tools used here has only been tested on microarray-based transposon footprinting data, and some of the terminology (e.g., "probes") is microarray-specific, we expect that it would work equally well using a next-gen sequencing readout instead. Investigators may wish to experiment with providing the data at each genomic position as a "probe", or with lumping the raw bp-level abundance data into pseudo-probes tiling the genome at intervals of a few tens of bp. The authors would greatly appreciate hearing from anyone who attempts such an application.
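As an illustration of the pseudo-probe idea, the sketch below sums bp-level abundances into fixed-width bins. It is untested against GLINT itself: the function name, the 50 bp default bin size, and the use of 0-based coordinates are all assumptions. Binning would be applied separately to the test and reference data before log ratios are computed.

```python
import numpy as np

def bin_into_pseudo_probes(per_bp_values, bin_size=50):
    """Sum per-bp abundances into consecutive bins of bin_size bp.

    Returns (centers, sums), where centers are the 0-based midpoints of
    the bins. The last bin may be shorter than bin_size.
    """
    values = np.asarray(per_bp_values, dtype=float)
    n_bins = int(np.ceil(len(values) / float(bin_size)))
    starts = np.arange(n_bins) * bin_size
    ends = np.minimum(starts + bin_size, len(values))
    sums = np.add.reduceat(values, starts)
    centers = (starts + ends - 1) / 2.0
    return centers, sums

# Ten positions with abundances 0..9, binned at 4 bp for readability.
centers, sums = bin_into_pseudo_probes(np.arange(10), bin_size=4)
print(centers, sums)
```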

Running GLINT

After obtaining normalized selection scores as described above, a configuration file (see below) should be written for each selection of interest, indicating what strains and selections contributed to a given experiment and providing values for optional parameters involved in peak calling. GLINT can then be run with

            
              python glint.py CONFIG_FILE

Configuration files

All required inputs for GLINT must be specified in a single configuration file, which is given on the command line for glint.py. The configuration file consists of a series of sections denoted in [square brackets], each section containing one or more "key : value" pairs specifying the actual parameters. Lines beginning with '#' are treated as comments. An example section containing three parameters would read:

            

              [Environment]
              # set environment variables needed by GLINT
              glint_bin_dir : /Users/petefred/src/glint/scripts
              temp_prefix : tmp_crooks_vs_mg1655_kdg
              output_prefix : crooks_vs_mg1655_kdg

A complete example is given in the sample usage case below.
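The format described above resembles an INI file, so a configuration can be checked before a run with Python's standard configparser module. The sketch below is an assumption-laden convenience, not a description of how glint.py parses its input, and it is written for Python 3 (the module is named ConfigParser in Python 2). The function name is hypothetical.

```python
import configparser

EXAMPLE = """
[Environment]
# set environment variables needed by GLINT
glint_bin_dir : /Users/petefred/src/glint/scripts
temp_prefix : tmp_crooks_vs_mg1655_kdg
output_prefix : crooks_vs_mg1655_kdg

[tlpeakfinder]
PEAK_FDR : 0.01
"""

def read_glint_style_config(text):
    """Parse 'key : value' sections into a dict of dicts of strings."""
    parser = configparser.ConfigParser(interpolation=None)
    # Keep keys such as PEAK_FDR in their original case
    # (configparser lowercases keys by default).
    parser.optionxform = str
    parser.read_string(text)
    return {section: dict(parser.items(section))
            for section in parser.sections()}

settings = read_glint_style_config(EXAMPLE)
print(settings["Environment"]["output_prefix"])
print(float(settings["tlpeakfinder"]["PEAK_FDR"]))
```

All values are returned as strings, so numeric parameters need an explicit conversion as in the last line.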

The following headers and parameters are recognized in GLINT configuration files. Sections and parameters given in bold are required; others are optional.

Environment -- Information on file naming and location

glint_bin_dir -- Path to the GLINT scripts
temp_prefix -- Prefix for temporary files produced by glint.py
output_prefix -- Prefix for final output files from GLINT analysis

Genome1 -- Information on the donor strain

name -- Name to be used in output to refer to the donor strain
seqfile -- Path to a fasta file containing the donor strain genome

Genome2 -- Information on the recipient strain

name -- Name to be used in output to refer to the recipient strain
seqfile -- Path to a fasta file containing the recipient strain genome

Alignment -- Information on alignment of the donor and recipient strain genomes

xmfa_file -- Path to an xmfa file (output by Mauve) containing an alignment of the donor and recipient strain genomes

Readout -- Transposon footprinting data from selection experiments

ori_file -- Table of all probe positions and orientations (see files included with the example for formatting)
intensity_file -- Table containing the normalized log ratios of transposon abundances under test and reference conditions (see files included with the example for formatting)

tlpeakfinder -- Parameters for peak calling. Defaults are provided for all parameters in this section, and are given in parentheses.

RAVG_WINDOW_HALFWIDTH_PROBES -- Distance (in probes) in each direction to extend the window used in calculating running medians. (10)
MIN_PEAKWIDTH_PROBES -- Minimum number of significant scores in a row to give rise to a "run"/prospective peak (10)
SIG_COMBINE_WINDOWSIZE_PROBES -- Maximum distance, in probes, between two significant values for them to be combined into a single run. (1, indicating no gaps allowed)
MIN_PEAKWIDTH_BP -- Minimum width, in bp, for a set of significant probes to be called as a peak once merging is complete (5000)
SIG_COMBINE_WINDOWSIZE -- Maximum distance, in bp, between runs of significant probes for them to be lumped into a single peak (5000)
EXTREME_TAIL_QUANTILE_RAW -- Percentile boundary beyond which data are not included in normalization; for a given value x, cutoffs are placed at x and 100-x (2)
HOMOLOGY_LOESS_ALPHA -- Width parameter for loess smoothing during homology normalization (0.3)
PEAK_FDR -- False discovery rate for calling of significant probes (0.01)
N_BG_SIM -- Number of simulations used to approximate the background distribution for significance calling (500000)
MINDIST_FROM_GAP -- Minimum distance from a donor-specific region for a probe to be considered (55)
ACF_WINDOWSIZE -- Order of autoregressive model used to approximate the background distribution (this many probes will be considered) (41)
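To make the roles of MIN_PEAKWIDTH_PROBES, SIG_COMBINE_WINDOWSIZE_PROBES, SIG_COMBINE_WINDOWSIZE and MIN_PEAKWIDTH_BP concrete, the following sketch groups significant probes into runs and then merges runs into peaks. It is one plausible reading of the parameter descriptions above, not the code used by GLINT. The function names are hypothetical, run length is measured as the span from first to last probe, and probe positions are assumed to be sorted in ascending order.

```python
def find_runs(significant, min_probes=10, max_gap_probes=1):
    """Return (first, last) probe index pairs for runs of significant probes.

    Two significant probes join the same run if their indices differ by at
    most max_gap_probes (1 means no gaps are allowed). Runs spanning fewer
    than min_probes probes are discarded.
    """
    runs = []
    start = last = None
    for i, flag in enumerate(significant):
        if not flag:
            continue
        if start is not None and i - last <= max_gap_probes:
            last = i
        else:
            if start is not None:
                runs.append((start, last))
            start = last = i
    if start is not None:
        runs.append((start, last))
    return [(s, e) for s, e in runs if e - s + 1 >= min_probes]

def merge_runs_into_peaks(runs, positions, max_gap_bp=5000, min_width_bp=5000):
    """Merge runs separated by at most max_gap_bp; keep peaks >= min_width_bp."""
    peaks = []
    for s, e in runs:
        start_bp, end_bp = positions[s], positions[e]
        if peaks and start_bp - peaks[-1][1] <= max_gap_bp:
            peaks[-1][1] = end_bp
        else:
            peaks.append([start_bp, end_bp])
    return [(a, b) for a, b in peaks if b - a >= min_width_bp]

# Toy layout: 300 probes spaced every 100 bp (spacing is hypothetical).
positions = [i * 100 for i in range(300)]
significant = [False] * 300
for i in list(range(20, 50)) + list(range(60, 90)) + list(range(200, 205)):
    significant[i] = True

runs = find_runs(significant)
peaks = merge_runs_into_peaks(runs, positions)
print(runs)
print(peaks)
```

In this toy case the five-probe stretch starting at probe 200 is too short to form a run. The two 30-probe runs are each narrower than 5 kb, but they lie 1.1 kb apart and so merge into a single peak that passes the width filter.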

Input files

Aside from the configuration file described above, several data files are required for GLINT analysis:

ori_file -- File giving the locations and orientations of all probes used in transposon footprinting (see files given with the example for formatting)
intensity_file -- File containing the normalized log ratio selection scores from the test and reference condition at each probe (see files given with the example for formatting)
xmfa_file -- Multiple alignment file produced by Mauve or an equivalent program. xmfa files output by Mauve may be used directly; anything else may require substantial reformatting
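When checking whether an alignment file is usable, it can help to list the coordinates of each alignment block. The sketch below reads only the block headers and assumes the usual XMFA layout: header lines of the form "> seq:start-end strand filename", blocks terminated by a line beginning with "=", and comment lines beginning with "#". It is independent of GLINT's own reader, and the file names in the toy input are hypothetical.

```python
import re

HEADER = re.compile(r"^>\s*(\d+):(\d+)-(\d+)\s+([+-])")

def xmfa_block_intervals(lines):
    """Return a list of blocks; each block is a list of
    (sequence_number, start, end, strand) tuples taken from its headers."""
    blocks, current = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("="):
            if current:
                blocks.append(current)
            current = []
            continue
        match = HEADER.match(line)
        if match:
            seq, start, end, strand = match.groups()
            current.append((int(seq), int(start), int(end), strand))
    if current:
        blocks.append(current)
    return blocks

TOY_XMFA = """#FormatVersion Mauve1
> 1:1-8 + donor.fasta
ACGTACGT
> 2:11-18 + recipient.fasta
ACGTACGT
=
> 1:9-12 + donor.fasta
GGCC
=
"""

blocks = xmfa_block_intervals(TOY_XMFA.splitlines())
# Blocks in this toy input that contain only sequence 1 (the donor).
donor_only = [(b[0][1], b[0][2]) for b in blocks
              if len(b) == 1 and b[0][0] == 1]
print(blocks)
print(donor_only)
```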

Output files

During normal operation, glint.py will print a few diagnostic messages but otherwise produces no terminal output. However, intermediate files are saved at each step (with a prefix set in the config file); they may be deleted by the user if analysis is successful, or examined for sources of error if it is not. The final output from the GLINT peak finder consists of the following files:

{PREFIX}_tlpeakfinder_peaks.txt -- locations and average intensities of all significant peaks in the selection score profiles
{PREFIX}_tlpeakfinder_peaks_unmerged.txt -- locations and average intensities of all significant runs of probes in the selection score profiles, prior to merging into peaks
{PREFIX}_acfplot.pdf -- Plot of the empirical autocorrelation function used to parameterize the autoregressive background model
{PREFIX}_tlpeakfinder_normed_scores.txt -- normalized selection scores at all probes after homology corrections
{PREFIX}_tlpeakfiner_normed_ma_scores.txt -- rolling median of the normalized, corrected selection scores; used for peak calling

Example usage case

The GLINT software distribution includes an examples directory containing all data needed to reproduce the KDG case from the original GLINT paper. The input files contained here may also be useful in setting up other GLINT applications. To run the sample case, copy the complete contents of examples/crooks_kdg to a fresh directory, edit glint_kdg.conf to give the correct value of glint_bin_dir, and then run

            
              glint.py glint_kdg.conf

Contact

Please direct questions and bug reports to Peter Freddolino.

License

The GLINT postprocessing tools are distributed under the University of Illinois/NCSA Open Source License (see license.txt for details).