KBAS

Developed by:
Alberto Riva and Alireza Nazarian,
Department of Molecular Genetics and Microbiology
and UF Genetics Institute,
University of Florida

KBAS is a software package for performing genome-wide association studies in situations in which combinations of SNPs jointly contribute to the traits of interest, as is often the case in complex diseases with a genetic basis. It implements a hypothesis-based method to confirm or invalidate the association between user-provided sets of markers (typically SNPs) and the trait under investigation, in a case-control setting. Hypothesis testing and refinement are performed through a non-deterministic search procedure using a Genetic Algorithm (GA).

Availability

KBAS is distributed as a GNU/Linux command-line tool designed to be included in automated pipelines for genome-wide association analysis.

Download KBAS

How does KBAS work?

The KBAS workflow involves three steps: hypothesis generation, refinement and testing. The process begins with a set of hypotheses, each one containing a user-defined set of markers based on preexisting expert knowledge. The hypotheses are then evaluated, ranked, refined and selected according to how well they fit the available data, using a variation of the CHC class of genetic algorithms. The GA starts from a user-defined number (the population size) of randomly selected potential solutions to the problem under consideration. Each solution is a unique combination of markers with randomly assigned weights. Solutions are then evaluated according to a user-specified fitness function and evolved over a user-defined simulated time period (the number of generations). Over the generations, the GA engine refines the initial marker set by removing markers that contribute little to the trait, and produces the model that best fits the fitness function. The GA stops either at the last generation or as soon as, in any generation, the fitness of the best-fitting model exceeds a user-specified threshold.
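The loop described above can be sketched in a few lines of Python. This is only an illustration of the general idea (random weighted solutions, fitness ranking, a generation limit and an early-stop threshold). It is a plain elitist GA, not the CHC variant that KBAS implements, and the marker names, the toy fitness function and all parameter names are assumptions made for the example.

```python
import random

def run_ga(markers, fitness, pop_size=20, generations=50,
           threshold=None, mutation_rate=0.05, max_weight=1, seed=0):
    """Evolve marker-weight vectors; return (best_model, best_fitness)."""
    rng = random.Random(seed)
    # Each candidate solution assigns a random integer weight to every marker.
    population = [[rng.randint(0, max_weight) for _ in markers]
                  for _ in range(pop_size)]
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        top_fit = fitness(ranked[0])
        if top_fit > best_fit:
            best, best_fit = ranked[0][:], top_fit
        # Stop early once the best model exceeds the user threshold.
        if threshold is not None and best_fit > threshold:
            break
        # Keep the fitter half and refill with mutated crossovers of it.
        parents = ranked[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child = [rng.randint(0, max_weight) if rng.random() < mutation_rate else w
                     for w in child]
            children.append(child)
        population = parents + children
    return dict(zip(markers, best)), best_fit

# Hypothetical data: two "informative" markers out of five.
markers = ["rs1", "rs2", "rs3", "rs4", "rs5"]
informative = {"rs1", "rs3"}

def toy_fitness(weights):
    # Reward weight on informative markers, penalise weight elsewhere.
    return sum(w if m in informative else -w for m, w in zip(markers, weights))

model, fit = run_ga(markers, toy_fitness, threshold=1.5)
print(model, fit)
```

In this toy setting, markers whose weight ends at zero play the role of markers removed from the hypothesis during refinement.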

Installation

After downloading the compressed KBAS package (see Availability above), copy it to an appropriate directory and execute the following command:

tar -xvf kbas-1.0.tar.gz

This will create a directory called kbas-1.0/, which contains the ‘kbas’ and ‘demo’ executable scripts.

Getting started

KBAS includes an interactive demonstration that explains the basic steps in running the program. To run the demonstration, start the ‘demo’ script in the kbas-1.0 directory and follow the instructions. We highly recommend running the demonstration before starting to use KBAS, since it will provide you with an overview of the KBAS process and with a description of the most important command-line options.

Using KBAS

First Step: Converting genotype files to KBAS binary format
KBAS inputs consist of two sets of genotypes, one for the control population and one for the case population, and a file containing a list of SNP identifiers. The case and control datasets can be in either tab-delimited or VCF format, and must be converted into the binary format used by KBAS. This is done using the -convert command followed by the appropriate conv-… arguments, as follows:


-conv-in
Input genotypes file to be converted.

-conv-dir
Pathname of the directory in which the converted files should be stored. If not specified, the current directory is used.

-conv-prefix
Prefix used to name the converted files, i.e. P.bin, P-names.txt and P-snps.txt, where P is the specified prefix.

-conv-format
Format of input file. Possible values are: vcf, csv (both alleles contained in a single column), csv2 (alleles contained in separate columns).

-conv-delim
Delimiter separating the fields in each row when the input file is in csv or csv2 format. Possible values are: tab, comma, space, colon, semicolon.

-conv-marker-col
Column containing marker identifiers in genotypes file.

-conv-subj-col
Column containing subject names in genotypes file.

-conv-class-col
Column containing subject classification if the genotype file contains subjects of different classes (e.g. 0=unaffected, 1=affected).

-conv-allele-col
Column containing the allele(s). When using the csv format (both alleles contained in a single column), specify a single number. When using the csv2 format (alleles in separate columns), specify two column numbers separated by a comma (e.g. 3,4).

-conv-allele-delim
Character separating the alleles when both are in the same column, e.g. / for alleles represented as A/C. Not needed if the alleles are written without a separator, e.g. AC.

-conv-names-from
File containing subject names.

-conv-names-col
Column containing subject names in the file given by -conv-names-from.


Example:

./kbas -convert \
-conv-in databases/case/case-genotypes.csv \
-conv-dir databases/case/ \
-conv-prefix case \
-conv-format csv \
-conv-delim tab \
-conv-marker-col 1 \
-conv-subj-col 2 \
-conv-allele-col 3
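Because the conversion has to be run once for the case group and once for the control group, it can be convenient to generate both command lines from a single template. The Python sketch below only assembles and prints the commands, using the options documented above. The control input file name (ctrl-genotypes.csv) and the databases/<group>/ layout are assumptions modelled on the example; adjust them to your data.

```python
import shlex

def convert_command(group, genotype_file, fmt="csv", delim="tab",
                    marker_col=1, subj_col=2, allele_col="3"):
    """Return the argument list for one `kbas -convert` call."""
    return [
        "./kbas", "-convert",
        "-conv-in", genotype_file,
        "-conv-dir", f"databases/{group}/",   # assumed directory layout
        "-conv-prefix", group,
        "-conv-format", fmt,
        "-conv-delim", delim,
        "-conv-marker-col", str(marker_col),
        "-conv-subj-col", str(subj_col),
        "-conv-allele-col", str(allele_col),
    ]

commands = [
    convert_command("case", "databases/case/case-genotypes.csv"),
    convert_command("ctrl", "databases/ctrl/ctrl-genotypes.csv"),  # assumed file name
]
for cmd in commands:
    print(shlex.join(cmd))
    # To execute instead of printing: subprocess.run(cmd, check=True)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems if a path contains spaces.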

Second Step: Running GA on the prepared case-control files
Once the P.bin, P-names.txt and P-snps.txt files have been generated for both the case and the control group, the GA can be run using the -run command and its relevant arguments, as follows:

Required arguments:


-case
The local path of the stored P.bin file containing genotypes of the case group.

-ctrl
The local path of the stored P.bin file containing genotypes of the control group.

-markers
The local path of the file containing the names of the markers (e.g. SNPs) included in the study.

Optional arguments:


-ga-enc
Method used to numerically encode genotypes (e.g. AA, AB, BB). It takes a three-digit number between 000 and 999 (e.g. 012), where the first digit (here 0) specifies the numeric value of the AA genotype, the second digit (here 1) the numeric value of the AB genotype, and the third digit (here 2) the numeric value of the BB genotype (default = 012).

-ga-class
The GA class used in the analysis. Each of the predefined GA classes provides a different way to determine the score-variable and to measure the fitness. Use the -list-ga command to get a list of all available GA classes.

-ga-gen
The maximum number of generations the GA should run for if the desired level of fitness is not reached in any generation (default = 1000).

-ga-pop
Population size or number of chromosomes in each generation of GA (default = 500).

-ga-bits
Number of bits used for giving weight to each marker in the chromosomes (default = 1).

-ga-num
Number of markers having non-zero weight when GA is initialized (default = 20).

-ga-mut
Overall mutation rate at which the weight of individual markers may be mutated in each generation (default = 0.05). Its valid range is between 0 and 1.

-ga-0-to-1
Zero-to-one mutation rate: the rate at which a zero weight of an individual marker may be mutated to a non-zero value in each generation (default = 1). Its valid range is between 0 and 1.

-ga-1-to-0
Non-zero-to-zero mutation rate: the rate at which a non-zero weight of an individual marker may be mutated to zero in each generation (default = 1). Its valid range is between 0 and 1.

-ga-thld
Desired fitness value. If it is reached before the last generation, the GA stops (default = 10).

-ga-rnd
Number of rounds of the randomization test performed by the GA on the case and control datasets using the final successful model (default = 0).

-ga-out
Filename to write final model to. If specified, the program will write the results out to this file in tab-delimited format.


Example:

./kbas -run \
-case databases/case/case.bin \
-ctrl databases/ctrl/ctrl.bin \
-markers markers/SNP-set-1.csv \
-ga-enc "012" \
-ga-class 1 \
-ga-gen 500 \
-ga-pop 200 \
-ga-bits 3 \
-ga-num 50 \
-ga-thld 20 \
-ga-rnd 100000 \
-ga-out output-1
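To make the -ga-enc option more concrete, the following Python sketch shows how a three-digit code maps the genotypes AA, AB and BB to numbers. It mirrors the description of the option only and is not taken from the KBAS source. Reading 012, 011 and 001 as additive, dominant and recessive codings of allele B is our interpretation, and treating BA like AB is an assumption.

```python
def make_encoder(code="012"):
    """Map the genotypes AA, AB and BB to the three digits of `code`."""
    if len(code) != 3 or any(c not in "0123456789" for c in code):
        raise ValueError("encoding must be exactly three digits, e.g. '012'")
    table = dict(zip(("AA", "AB", "BB"), (int(d) for d in code)))
    table["BA"] = table["AB"]  # assumption: heterozygote order is irrelevant
    return table.__getitem__

additive = make_encoder("012")   # value = number of B alleles
dominant = make_encoder("011")   # one B allele is enough
recessive = make_encoder("001")  # both alleles must be B

genotypes = ["AA", "AB", "BB", "AB"]
print([additive(g) for g in genotypes])   # [0, 1, 2, 1]
print([dominant(g) for g in genotypes])   # [0, 1, 1, 1]
print([recessive(g) for g in genotypes])  # [0, 0, 1, 0]
```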

Help

Please use the -help convert or -help run commands to get a detailed description of all the options available for the -convert and -run commands respectively.

KBAS output

The output consists of a summary of the GA inputs and parameters, followed by the contents of the successful model, represented as two columns containing the markers (e.g. SNP names) and their corresponding weights, respectively. If the -ga-out argument is provided, the output is also written to the specified file in machine-readable format. If a randomization test was requested, its final p-value is reported in the output as well.
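For readers unfamiliar with randomization tests, the Python sketch below shows how a p-value of this kind is commonly obtained: the case/control labels are shuffled a given number of rounds, and the p-value is the fraction of shuffles whose statistic is at least as extreme as the observed one. This is a generic permutation test, not the KBAS implementation. The per-subject scores, the difference-in-means statistic and the add-one correction are all assumptions chosen for illustration.

```python
import random

def permutation_p_value(case_scores, ctrl_scores, rounds=1000, seed=0):
    """One-sided p-value for mean(case) - mean(ctrl) under label shuffling."""
    rng = random.Random(seed)

    def mean(xs):
        return sum(xs) / len(xs)

    observed = mean(case_scores) - mean(ctrl_scores)
    pooled = list(case_scores) + list(ctrl_scores)
    n_case = len(case_scores)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)  # randomly reassign case/control labels
        stat = mean(pooled[:n_case]) - mean(pooled[n_case:])
        if stat >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (rounds + 1)

# Hypothetical per-subject model scores.
separated = permutation_p_value([6, 7, 8, 9, 10, 11], [0, 1, 2, 3, 4, 5])
identical = permutation_p_value([1, 1, 1], [1, 1, 1])
print(separated, identical)
```

More rounds give a finer-grained p-value: with this correction the smallest reportable value is 1 / (rounds + 1).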

Publications

  1. A knowledge-based method for association studies on complex diseases.
    Nazarian A, Sichtig H, Riva A.
    PLoS One. 2012;7(9):e44162.