Columbia University in the City of New York

Tavazoie Lab

Analysis tools for GLINT

Introduction

GLINT (Global Linkage based Investigation of Naturally occurring Traits) is a workflow for determining which genetic loci contribute to traits that differ between moderately related strains of bacteria. The experimental requirements for the procedure are detailed in a recent publication; this page describes the usage of an accompanying set of software tools for postprocessing GLINT data. These tools may also be used in the analysis of ADAM data, in which case the same strain should be given as both the donor and recipient.

System Requirements

The GLINT software component consists of a series of python scripts, and thus requires a system installation of python and a number of python modules:

These should be installed through a package manager whenever possible. GLINT is written for pre-3.0 versions of python, and was developed and tested with python 2.7.3.

Getting GLINT

A snapshot of the most recent stable version of GLINT may be downloaded here. Updates will also be pushed to a git repository between changes to the main release.

Technical Summary

A complete description of the steps in GLINT postprocessing is given in the GLINT paper. In brief, the experimental component of GLINT provides data on the relative fitness of different portions of the donor strain genome when moved into the recipient strain genome, in the form of relative abundances of transferred genomic regions in libraries selected under either baseline conditions or the condition of interest (the test condition). Genomic regions where donor-derived insertions are over-represented in the test condition indicate that the donor strain genetic content at that site is more favorable under the test condition; conversely, regions where donor-derived insertions are under-represented indicate that the donor content at that site is less favorable.

The GLINT postprocessing toolkit takes as input probe-level data on the log ratios of donor-derived DNA under the test condition to that in the reference (baseline) condition; thus, initial analysis and normalization of raw transposon footprinting data is left to the user. Log ratios are used because they tend to yield approximately normal background distributions of selection scores. The raw data are then corrected for the presence of large indels between the donor and recipient strains by collapsing regions of the donor genome that are not present in the recipient strain, and by normalizing the selection scores based on the distance of each probe from the nearest large indel. GLINT requires as input an alignment of the donor and recipient genomes (preferably produced using Mauve; only Mauve .xmfa output files can be used at present).
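As an illustration of the kind of input described above, probe-level log ratios might be computed from already-normalized abundances roughly as follows. This is only a sketch: the function name log_ratios, the pseudocount, and the choice of base 2 are assumptions made for the example and are not part of the GLINT distribution.

```python
import math


def log_ratios(test, reference, pseudocount=1.0):
    """Return log2((test + p) / (reference + p)) for each probe.

    `test` and `reference` are equal-length sequences of normalized
    abundances. The pseudocount (an assumption of this sketch) avoids
    division by zero and log(0) for probes with no signal.
    """
    if len(test) != len(reference):
        raise ValueError("test and reference must have the same length")
    return [math.log((t + pseudocount) / (r + pseudocount), 2)
            for t, r in zip(test, reference)]
```

Positive values indicate probes enriched under the test condition and negative values indicate depletion, matching the interpretation given in the preceding paragraphs.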

After normalization and remapping of low homology regions, GLINT identifies significant peaks by parameterizing a background distribution using an autoregressive model, calculating a selection score at each location based on a rolling median of nearby probe-level data, producing p-values for all selection scores based on simulations from the background distribution, and then flagging all scores that are significant at a user-specified false discovery rate. Adjacent or nearby significant probe calls are lumped into "runs", which must have a particular minimum length to contribute to peak calls. Peaks are then defined as contiguous regions of a specified minimum width in which one or more runs are present within a sufficiently small area. All parameters used during peak calling can be modified, as described below; the defaults (used in the GLINT paper) are an FDR of 0.01, requirement of a run of ten adjacent significant probes to contribute to peak calling, minimum peak width of 5 kb, and maximum distance between runs to be combined into a peak of 5 kb. Due to the large genomic regions (up to approximately 90 kb) transferred by P1-based transduction, true peaks in GLINT or ADAM fitness score profiles tend to involve a broad region (spanning up to a few tens of kb) near the locus actually making the fitness contribution, informing the design of the peak breadth-based filter used here.
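The run and peak logic described above can be sketched as follows. This is a simplified illustration, not the code used by GLINT: the function names find_runs and merge_runs are hypothetical, only strictly adjacent significant probes are grouped (the real tool also handles "nearby" calls), and the per-probe significance flags are assumed to have been computed already. The default values mirror those quoted in the text.

```python
def find_runs(positions, significant, min_run_length=10):
    """Return (start, end) coordinates of runs of adjacent significant probes.

    `positions` are probe coordinates in genome order and `significant`
    is a parallel sequence of booleans. Runs with fewer than
    `min_run_length` probes are discarded.
    """
    runs = []
    start_idx = None
    for i, flag in enumerate(significant):
        if flag and start_idx is None:
            start_idx = i                      # a new run begins here
        elif not flag and start_idx is not None:
            if i - start_idx >= min_run_length:
                runs.append((positions[start_idx], positions[i - 1]))
            start_idx = None                   # the run has ended
    # Close a run that extends to the last probe.
    if start_idx is not None and len(significant) - start_idx >= min_run_length:
        runs.append((positions[start_idx], positions[-1]))
    return runs


def merge_runs(runs, max_gap=5000, min_width=5000):
    """Merge runs separated by at most `max_gap` bp into candidate peaks,
    then keep only peaks spanning at least `min_width` bp."""
    peaks = []
    for start, end in sorted(runs):
        if peaks and start - peaks[-1][1] <= max_gap:
            peaks[-1][1] = max(peaks[-1][1], end)   # extend previous peak
        else:
            peaks.append([start, end])              # start a new peak
    return [(s, e) for s, e in peaks if e - s >= min_width]
```

For example, with probes every 500 bp, two 12-probe runs separated by 2.5 kb would be merged into a single peak, while an isolated 5-probe run would be dropped before peak calling.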

The GLINT postprocessing tools comprise a series of python scripts carrying out specific tasks required for the analysis described above, along with the driver script glint.py, which runs these other scripts to carry out the appropriate analysis. All user interaction should occur through glint.py unless problems are encountered during the analysis.

While the set of tools used here has only been tested on microarray-based transposon footprinting data, and some of the terminology (e.g., "probes") is microarray-specific, we expect that it would work equally well using a next-gen sequencing readout instead. Investigators may wish to experiment with providing the data at each genomic position as a "probe", or lumping the raw bp-level abundance data into pseudo-probes tiling the genome at intervals of a few tens of bp. The authors would greatly appreciate hearing from anyone who attempts such an application.
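One way to form such pseudo-probes from sequencing data is sketched below. This is an untested suggestion in the same spirit as the paragraph above: the function name, the 50 bp default bin size, the use of summed counts, and the 0-based coordinates are all assumptions of the example rather than anything prescribed by GLINT.

```python
def bin_into_pseudo_probes(counts, bin_size=50):
    """Collapse per-bp abundance counts into pseudo-probes.

    `counts` holds one value per genomic position (0-based). Returns a
    list of (center_position, summed_count) tuples, one per bin; the
    final bin may be shorter than `bin_size`.
    """
    if bin_size < 1:
        raise ValueError("bin_size must be a positive integer")
    probes = []
    for start in range(0, len(counts), bin_size):
        chunk = counts[start:start + bin_size]
        center = start + len(chunk) // 2       # midpoint of this bin
        probes.append((center, sum(chunk)))
    return probes
```

The binned values for the test and reference libraries would still need to be normalized and converted to log ratios before being passed to GLINT.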

Running GLINT

After obtaining normalized selection scores as described above, a configuration file (see below) should be written for each selection of interest, indicating what strains and selections contributed to a given experiment and providing values for optional parameters involved in peak calling. GLINT can then be run with

            
              python glint.py CONFIG_FILE
            
            

Configuration files

All required inputs for GLINT must be specified in a single configuration file, which is given on the command line for glint.py. The configuration file consists of a series of sections denoted in [square brackets], each section containing one or more "key : value" pairs specifying the actual parameters. Lines beginning with '#' are treated as comments. An example section containing three parameters would read:

            

              [Environment]
              # set environment variables needed by GLINT
              glint_bin_dir : /Users/petefred/src/glint/scripts
              temp_prefix : tmp_crooks_vs_mg1655_kdg
              output_prefix : crooks_vs_mg1655_kdg

            
            

A complete example is given in the sample usage case below.
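The section and "key : value" layout described above happens to be readable by Python's standard configuration parser, which can be convenient for checking a file before starting a run. The sketch below assumes only that layout; whether glint.py itself uses this module is not stated here, and the helper name read_config is hypothetical. The import shim lets the same code run under Python 2.7 and Python 3.

```python
import io

try:
    import configparser                      # Python 3
except ImportError:
    import ConfigParser as configparser      # Python 2.7

# The example section shown earlier on this page.
EXAMPLE = u"""[Environment]
# set environment variables needed by GLINT
glint_bin_dir : /Users/petefred/src/glint/scripts
temp_prefix : tmp_crooks_vs_mg1655_kdg
output_prefix : crooks_vs_mg1655_kdg
"""


def read_config(text):
    """Parse configuration text into {section: {key: value}}."""
    parser = configparser.RawConfigParser()
    # read_file exists in Python 3; readfp is the Python 2.7 equivalent.
    reader = getattr(parser, "read_file", None) or parser.readfp
    reader(io.StringIO(text))
    return {section: dict(parser.items(section))
            for section in parser.sections()}


config = read_config(EXAMPLE)
print(config["Environment"]["output_prefix"])
```

A check of this kind can catch a misspelled section header or a missing key before any analysis time is spent.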

The following headers and parameters are recognized in GLINT configuration files. Sections and parameters given in bold are required; others are optional.

Environment -- Information on file naming and location

Genome1 -- Information on the donor strain

Genome2 -- Information on the recipient strain

Alignment -- Information on alignment of the donor and recipient strain genomes

Readout -- Transposon footprinting data from selection experiments

tlpeakfinder -- Parameters for peak calling. Defaults are provided for all parameters in this section, and are given in parentheses.

Input files

Aside from the configuration file described above, several data files are required for GLINT analysis:

Output files

During normal operation, glint.py will print a few diagnostic messages but otherwise produces no terminal output. However, intermediate files are saved at each step (with a prefix set in the config file); they may be deleted by the user if analysis is successful, or examined for sources of error if it is not. The final output from the GLINT peak finder consists of the following files:

Example usage case

The GLINT software distribution includes an examples directory containing all data needed to reproduce the KDG case from the original GLINT paper. The input files contained here may also be useful in setting up other GLINT applications. To run the sample case, copy the complete contents of examples/crooks_kdg to a fresh directory, edit glint_kdg.conf to give the correct value of glint_bin_dir, and then run

            
              glint.py glint_kdg.conf
            
            

Contact

Please direct questions and bug reports to Peter Freddolino.

License

The GLINT postprocessing tools are distributed under the University of Illinois/NCSA Open Source License (see license.txt for details).

© 2013 Columbia University