Bioinformatics Lab @ VIPBG

 


SNPKS: A Statistical Method to Estimate the Effective SNP Size in Vertebrate Genomes

Citation: Seo D, Jiang C, Zhao Z (2006) A novel statistical method to estimate the effective SNP size in vertebrate genomes and categorized genomic regions. BMC Genomics 7:329.

 

Introduction

SNPKS is an integrated statistical method for estimating the effective SNP size: the minimum number of SNPs that can essentially represent the bias patterns of the whole SNP dataset. SNPKS considers both biological and statistical significance. It has two major steps:

    (1) to obtain an initial effective size by the Kolmogorov-Smirnov test
    (2) to find an intermediate effective size by interval evaluation.

SNPKS is implemented in C and Perl.
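
As a rough illustration of step (1), the Kolmogorov-Smirnov statistic is the largest absolute gap between two cumulative relative frequency curves. The sketch below is written in Python for brevity, not in the C and Perl of the actual program; the function names and the toy counts are assumptions made for illustration and are not taken from the SNPKS source.

```python
from itertools import accumulate


def cumulative_relative_frequencies(counts):
    """Turn per-position counts into a cumulative relative frequency curve."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one observation")
    return [running / total for running in accumulate(counts)]


def ks_statistic(counts_a, counts_b):
    """Largest absolute gap between the two cumulative curves (the KS D statistic)."""
    if len(counts_a) != len(counts_b):
        raise ValueError("both count vectors must cover the same positions")
    curve_a = cumulative_relative_frequencies(counts_a)
    curve_b = cumulative_relative_frequencies(counts_b)
    return max(abs(a - b) for a, b in zip(curve_a, curve_b))


# Toy example: counts of one nucleotide at four flanking positions.
whole_counts = [40, 30, 20, 10]   # hypothetical counts in the whole SNP set
sample_counts = [3, 4, 2, 1]      # hypothetical counts in a random subset
d = ks_statistic(whole_counts, sample_counts)
print(f"D = {d:.3f}")
```

A small D means the subset reproduces the positional pattern of the whole set closely; identical count vectors give D = 0.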

Design

    1. Data preprocessing
      SNP data may be obtained from various sources, including your own dataset. Here we demonstrate using SNP data from dbSNP, the largest public SNP database. SNP data are retrieved from dbSNP and preprocessed into the SNPKS format, which includes the alleles of each SNP and the 10 neighboring nucleotides on each of the 5' and 3' sides. We select only SNPs that are biallelic, mapped in non-repetitive sequences, and have flanking sequences at least 20 nucleotides long on each side.

    2. The effective SNP size (Ne) is estimated by taking 100 repeats of the following procedures:

      • Count the occurrences of each nucleotide at each position across the whole SNP set.

      • Calculate the corresponding cumulative relative probabilities and obtain the whole-data biases relative to the genome sequence average.

      • Generate random numbers to select a sample of the initial size (n0), and count the occurrences of each nucleotide at each position across the randomly chosen SNPs.

      • Calculate the corresponding cumulative relative probabilities and obtain the sample biases relative to the genome sequence average.

      • Compare the maximum difference between the whole-data and sample cumulative relative frequencies of each nucleotide among the 20 positions with the threshold value of biological significance (e.g., 0.2%), instead of with the test statistic given by the KS test. If the maximum difference is less than the given threshold, SNPKS reports an initial effective size (n). If not, SNPKS increases the sample size by 10 k.

      • If this initial effective size (n) satisfies an interval evaluation using 30 SNP subsets of size n randomly sampled from the whole SNP dataset, we have found an intermediate effective SNP size (Ne0). Otherwise, the initial effective size (n) is increased by 10 k and this step is run again.

    3. Estimate the effective SNP size

      • The mean of the 100 intermediate effective SNP sizes (Ne0) obtained above is taken as the estimate of the effective SNP size (Ne).

      • The 95% confidence interval of the effective SNP size is also calculated from these 100 Ne0 values.
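
The procedure above can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the SNPKS implementation: the flanking sequences are random toy data, the biases are not normalized to the genome sequence average, the interval evaluation with 30 subsets is left out because its criterion is not specified on this page, and the confidence interval uses a normal approximation (mean ± 1.96 standard errors). All function names, the toy threshold, and the toy step size are hypothetical.

```python
import math
import random
import statistics

BASES = "ACGT"


def position_counts(flanks):
    """Return counts[base][i]: how often `base` occurs at flanking position i."""
    width = len(flanks[0])
    counts = {base: [0] * width for base in BASES}
    for seq in flanks:
        for i, base in enumerate(seq):
            counts[base][i] += 1
    return counts


def cumulative(values):
    """Cumulative relative frequencies of a count vector (zeros if it is empty)."""
    total = sum(values)
    running, curve = 0, []
    for value in values:
        running += value
        curve.append(running / total if total else 0.0)
    return curve


def max_difference(whole, sample):
    """Largest cumulative-frequency gap over all bases and positions."""
    return max(
        abs(w - s)
        for base in BASES
        for w, s in zip(cumulative(whole[base]), cumulative(sample[base]))
    )


def initial_effective_size(flanks, whole, n0, step, threshold, rng):
    """Grow the random sample until it tracks the whole set within `threshold`."""
    n = n0
    while n < len(flanks):
        subset = rng.sample(flanks, n)  # sampling without replacement
        if max_difference(whole, position_counts(subset)) < threshold:
            return n
        n += step
    return len(flanks)  # the full set trivially matches itself


def summarize(sizes):
    """Mean and normal-approximation 95% confidence interval (an assumption)."""
    mean = statistics.mean(sizes)
    half_width = 1.96 * statistics.stdev(sizes) / math.sqrt(len(sizes))
    return mean, mean - half_width, mean + half_width


rng = random.Random(1)
# Toy data: 2000 random 20-base flanking windows instead of real SNP data.
flanks = ["".join(rng.choice(BASES) for _ in range(20)) for _ in range(2000)]
whole = position_counts(flanks)
# 10 repeats keep the toy run short; the procedure above uses 100.
sizes = [initial_effective_size(flanks, whole, 100, 100, 0.03, rng) for _ in range(10)]
ne, low, high = summarize(sizes)
print(f"estimated size: {ne:.0f} (95% CI {low:.0f} to {high:.0f})")
```

The loop always terminates, because a sample equal to the whole dataset has a maximum difference of zero.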

 

The software design is illustrated in this flowchart. The annotation of the flanking sites of a SNP is shown here.
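
To make the flanking-site layout concrete, here is a minimal Python sketch of the selection rules from the preprocessing step: keep biallelic SNPs whose flanking sequences are at least 20 nucleotides long on each side, then take the 10 nearest neighbors on each side, giving 20 positions. The record layout (a 5' flank string, a list of alleles, a 3' flank string) and the function name are assumptions for illustration and do not reflect the actual SNPKS file format. The non-repetitive-sequence filter is omitted because this page does not say how repeats are marked.

```python
FLANK = 10      # neighbors used on each side of the SNP
MIN_FLANK = 20  # minimum flanking length required on each side


def flank_window(five_prime, alleles, three_prime):
    """Return the 20 neighboring bases of a SNP, or None if it is filtered out."""
    if len(alleles) != 2:
        return None  # keep biallelic SNPs only
    if len(five_prime) < MIN_FLANK or len(three_prime) < MIN_FLANK:
        return None  # flanking sequence too short
    # Last 10 bases before the SNP, then first 10 bases after it.
    return five_prime[-FLANK:] + three_prime[:FLANK]


# Hypothetical record: 20 bases on each side of an A/G SNP.
window = flank_window("A" * 10 + "C" * 10, ["A", "G"], "G" * 10 + "T" * 10)
print(window)
```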

 

System Requirements

Operating system: Windows or Linux. It should also work on Unix.
Memory: 1 GB minimum (2 GB is recommended for large datasets, e.g. human).
Hard disk: free space of at least 2 times the size of the source data.
Running environment: a C compiler and Perl 5.6 or higher (Perl 5.8 is recommended).
(Installation instructions for Perl 5.6/5.8 are available here)

 

Download

Program: Windows: snpks.zip; Linux or Unix: snpks.tgz

Processed test data (caution: the files are very large!):
    Human dbSNP build 125
    Chimpanzee dbSNP build 125
    Dog dbSNP build 125
    Mouse dbSNP build 126
    Mouse dbSNP build 123
    Human HapMap Phase I
    Human HapMap Phase II
    Human intergenic SNPs
    Human genic SNPs
    Human intronic SNPs
    Human SNPs in CpG islands

 

Instruction

  1. Download snpks.zip (for Windows) or snpks.tgz (for Linux or Unix)

 

  2. Uncompress the downloaded file
    Windows: use WinZip or other software to uncompress it.
    Linux or Unix: type "tar -xzvf snpks.tgz"

  3. Compile the C program (Note: do not change the executable file name, "rand_ks", because the Perl script calls it automatically)
    Windows: use a C compiler to compile.
    Linux or Unix: type "gcc random_ks.c -o rand_ks -lm"

 

  4. Data preparation (in Linux/Unix)

          4.1.  Preparation of SNP files and a list file
              - Make a data directory (% mkdir SNP_Data)
              - Change to the data directory (% cd SNP_Data)
              - Download SNP files from the dbSNP ftp site and save them in a specific directory
                (% wget -r -A "*.gz" -np -nd [url])

                For example, to retrieve dog SNPs, we type:
                wget -r -A "*.gz" -np -nd ftp://ftp.ncbi.nih.gov/snp/organisms/dog_9615/rs_fasta/
              - Unzip downloaded SNP data (% gzip -d *.gz)

              - Make a list file that gives the location of each downloaded SNP file, one per line,
                using the vi or pico editor
                 /SNP_Data/rs_ch1.fas
                 /SNP_Data/rs_ch2.fas
                                  .
                                  .
                                  .
                 /SNP_Data/rs_chX.fas


          4.2.  Preparation of the genome average file, with values in the order A->C->G->T (for example,
                 we may save the following average nucleotide frequencies of the dog genome
                 in the file dog_genome_average.txt)
                     29.481
                     20.512
                     20.509
                     29.498


  5. Run the SNPKS program
    % perl snpks.pl SNP_list genome_average_file output_file

    For example: % perl snpks.pl dog_list.txt dog_genome_average.txt dog.out
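
The two input files from step 4 can also be prepared and checked with a short script instead of an editor. The Python sketch below is a convenience under stated assumptions, not part of SNPKS: it assumes the list file holds one file path per line (as in the example above), that the SNP files end in ".fas", and that the four genome average values are percentages that should sum to about 100. The function names are hypothetical, and the demonstration runs in a temporary directory with empty placeholder files.

```python
import tempfile
from pathlib import Path


def write_snp_list(data_dir, list_path, pattern="*.fas"):
    """Write one absolute file path per line and return the paths written."""
    files = sorted(Path(data_dir).resolve().glob(pattern))
    Path(list_path).write_text("".join(f"{path}\n" for path in files))
    return files


def read_genome_average(path):
    """Read four percentages in A, C, G, T order and sanity-check them."""
    values = [float(token) for token in Path(path).read_text().split()]
    if len(values) != 4:
        raise ValueError(f"expected 4 values (A, C, G, T), found {len(values)}")
    if abs(sum(values) - 100.0) > 0.5:
        raise ValueError(f"frequencies sum to {sum(values):.3f}, expected about 100")
    return dict(zip("ACGT", values))


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "SNP_Data"
    data_dir.mkdir()
    for name in ("rs_ch1.fas", "rs_ch2.fas"):  # empty placeholder files
        (data_dir / name).touch()
    listed = write_snp_list(data_dir, Path(tmp) / "dog_list.txt")
    list_text = (Path(tmp) / "dog_list.txt").read_text()

    average_file = Path(tmp) / "dog_genome_average.txt"
    average_file.write_text("29.481\n20.512\n20.509\n29.498\n")
    dog_average = read_genome_average(average_file)

print(list_text, end="")
print(dog_average)
```

Note that `sorted` orders the paths as text, so rs_ch10.fas would be listed before rs_ch2.fas.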

 

Performance

We tested the performance of SNPKS using the human and mouse dbSNP data on a Dell Workstation (CPU 2 x 3.0 GHz, Memory 4 GB, Redhat Linux WS, gcc 3.2.3 and Perl 5.8.0).

SNP data   Process               SNPNB                                          SNPKS
                                 1 round        5 rounds       10 rounds

Human      Preprocessing data    2h 50m 25s                                     2h 24m 49s
           Estimation of Ne      24h 56m 1s     82h 48m 1s     147h 39m 40s     5h 7m 35s
           Total elapsed time    27h 46m 26s    85h 38m 26s    151h 8m 56s      7h 32m 24s

Mouse      Preprocessing data    0h 2m 51s                                      0h 2m 52s
           Estimation of Ne      7h 27m 55s     37h 53m 53s    75h 18m 28s      2h 4m 50s
           Total elapsed time    7h 30m 46s     37h 56m 44s    75h 21m 19s      2h 7m 42s

The SNPNB preprocessing time is a single value per dataset; it is not broken down by number of rounds.
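
To compare the timings numerically, the elapsed times can be converted to seconds. The Python sketch below does this for the total elapsed times of SNPNB with 1 round and of SNPKS, copied from the table above; the helper name `to_seconds` is an assumption for illustration.

```python
import re


def to_seconds(text):
    """Convert a string such as '27h 46m 26s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)h (\d+)m (\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    hours, minutes, seconds = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Total elapsed time: (SNPNB with 1 round, SNPKS), copied from the table.
totals = {
    "Human": ("27h 46m 26s", "7h 32m 24s"),
    "Mouse": ("7h 30m 46s", "2h 7m 42s"),
}

ratios = {}
for species, (snpnb, snpks) in totals.items():
    ratios[species] = to_seconds(snpnb) / to_seconds(snpks)
    print(f"{species}: SNPNB (1 round) took {ratios[species]:.2f} times as long as SNPKS")
```

On these figures, one round of SNPNB takes roughly 3.7 times as long as SNPKS for human and roughly 3.5 times as long for mouse.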

 

Discussion

  1. Random number generation
    To evaluate how many SNPs are sufficient to represent the bias patterns of the whole data (e.g. genome data), random numbers are generated in a C program. If we wrote a Perl program to generate the random numbers, the process would not only be extremely slow but would also produce many redundant numbers when the sample size is large (e.g. 20,000).

 

  2. Operating systems
    SNPKS was tested on Redhat Enterprise Linux and Microsoft Windows. It should also run on Unix, provided that a C compiler and Perl are installed.
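
The redundancy issue mentioned under random number generation can be illustrated with a short Python sketch. Drawing indices independently (with replacement) repeats some of them once the sample is large, whereas sampling without replacement never does. The population size of 1,000,000 is a hypothetical figure, and this page does not state how rand_ks handles duplicates, so the second function is only one common remedy and not a description of the C program.

```python
import random


def draw_with_replacement(population_size, sample_size, rng):
    """Independent draws: the same index can come up more than once."""
    return [rng.randrange(population_size) for _ in range(sample_size)]


def draw_without_replacement(population_size, sample_size, rng):
    """Every index is drawn at most once."""
    return rng.sample(range(population_size), sample_size)


rng = random.Random(0)
population_size = 1_000_000  # hypothetical number of SNPs
sample_size = 20_000         # sample size mentioned in the discussion

naive = draw_with_replacement(population_size, sample_size, rng)
unique = draw_without_replacement(population_size, sample_size, rng)

duplicates = sample_size - len(set(naive))
print(f"independent draws: {duplicates} redundant indices")
print(f"without replacement: {sample_size - len(set(unique))} redundant indices")
```

For independent draws the expected number of redundant indices is about n²/(2N), which is about 200 for n = 20,000 and N = 1,000,000.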

 

Contact

Zhongming Zhao <zzhao at vcu.edu> Daekwan Seo <dseo at vcu.edu>

 

 

 

Copyright (c) 2004, Bioinformatics Lab @ VIPBG, VCU. All Rights Reserved.
