SNP DISCOVERY FOR GENERATION OF OPAS FOR SNP ANALYSIS USING THE ILLUMINA BEADARRAY AND BEADXPRESS PLATFORMS

Background

SNPs for development of OPAs for the Illumina GoldenGate assay were identified using two methods. Initially, SNPs between Lactuca sativa cv. Salinas and L. serriola acc. UC96US23 were identified by resequencing PCR-amplified genes of interest using Sanger sequencing (McHale et al., 2009; Lavelle, 2009).
Subsequently, SNPs between closely related cultivars were mined from Illumina sequencing data aligned to a reference EST assembly (http://cgpdb.ucdavis.edu/cgpdb2/est_info_assembly.php). Several populations for genetic analysis are derived from crosses between closely related cultivars (Figure 1). To identify SNPs in these populations, cDNA libraries from the parental lines were sequenced with an Illumina Genome Analyzer (IGA) II. The pipeline described below was implemented to identify SNPs for the parental pairs indicated in Figure 1. Similar methods were also used to identify SNPs between cv. Salinas and cv. Valmaine in resistance and developmental candidate genes.

Figure 1. Neighbor-joining tree for 70 cultivars generated from 384 SNPs assayed using the Illumina GoldenGate Assay with OPA3. Parents of mapping populations used in this study are indicated. The number of SNPs listed for each parental combination is the total number identified from the SNP mining experiment.

Methods for SNP mining from IGA sequence

  1. For each genotype, ~17 million 60 bp IGA reads were aligned to reference EST sequences generated from Sanger sequencing (CLS_S3_ESTs_Sat.assembly available at http://cgpdb.ucdavis.edu/cgpdb2/est_info_assembly.php).

    To do this: 1) Reference sequences were converted from fasta to binary fasta format (fasta2bfa; http://maq.sourceforge.net/index.shtml). 2) Fastq sequences were extracted from SRF files (srf2fastq). 3) Each fastq file was converted to binary fastq format and split into multiple files of 2 million reads each (fastq2bfq, maq). 4) Reads were aligned to the reference EST sequences using the maq program; any read mapping equally well to more than one position in the reference sequences was removed to avoid confusion from paralogs. 5) Alignments were merged for each genotype using maq.
  2. SNPs were identified between each parental pair.

    To do this: 1) A consensus file was generated for each genotype. The expected allelic frequency was set to a low value to remove interference from false alignment of lowly expressed genes that may not be present in our EST reference sequence (maq options -r 0.01 -m 3 -q 30 -M 0.01). 2) Results were uploaded to a MySQL relational database to identify robust SNPs. 3) The resulting SNPs were filtered to require a minimum consensus quality score of 10, a minimum coverage of 10, a mapping quality score of 60, and a minimum consensus quality score of 10 for neighboring bases.
  3. SNPs were chosen for assays to maximize the following criteria:
    1) The SNP had previously been assayed and shown to be robust. 2) The SNP was polymorphic in more than one of the three parental pair combinations (Figure 2; Table 1). 3) The sequence surrounding the SNP was suitable for oligonucleotide design for the Illumina GoldenGate assay. 4) The SNP avoided intron/exon splice sites (http://www.genome.uga.edu/tools/intron/). 5) Selection was limited to one SNP per contig. 6) The SNP was in a candidate gene of interest. 7) SNPs were selected for an even genome distribution based on previous mapping work and the ultra-dense lettuce chip map.
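
The filtering in step 2 and the comparison between a parental pair can be illustrated with a minimal Python sketch. It is not the MySQL implementation used for this work. The record layout (Call), the function names, and the example values are hypothetical. The thresholds are those quoted in step 2; treating the mapping quality score of 60 as a minimum is an assumption. Consensus bases are compared as plain strings, so ambiguity codes would need separate handling.

```python
from typing import List, NamedTuple, Tuple


class Call(NamedTuple):
    """One consensus call for one genotype (hypothetical record layout)."""
    contig: str
    pos: int
    base: str       # consensus base at this position
    cns_qual: int   # consensus quality score
    depth: int      # read coverage
    map_qual: int   # mapping quality score
    nbr_qual: int   # lowest consensus quality among neighboring bases


# Thresholds quoted in step 2. Using 60 as a minimum mapping quality
# (rather than an exact value) is an assumption of this sketch.
MIN_CNS_QUAL = 10
MIN_DEPTH = 10
MIN_MAP_QUAL = 60
MIN_NBR_QUAL = 10


def passes(call: Call) -> bool:
    """Return True if a consensus call meets every quality threshold."""
    return (call.cns_qual >= MIN_CNS_QUAL
            and call.depth >= MIN_DEPTH
            and call.map_qual >= MIN_MAP_QUAL
            and call.nbr_qual >= MIN_NBR_QUAL)


def snps_between(parent_a: List[Call],
                 parent_b: List[Call]) -> List[Tuple[str, int, str, str]]:
    """List sites where both parents pass the filters and differ in base."""
    b_by_site = {(c.contig, c.pos): c for c in parent_b if passes(c)}
    snps = []
    for a in parent_a:
        if not passes(a):
            continue
        b = b_by_site.get((a.contig, a.pos))
        if b is not None and a.base != b.base:
            snps.append((a.contig, a.pos, a.base, b.base))
    return sorted(snps)


# Made-up example data for two parents of one mapping population.
parent_a = [
    Call("contig1", 100, "A", 40, 25, 60, 30),
    Call("contig1", 200, "G", 40, 8, 60, 30),   # fails the coverage filter
    Call("contig2", 50, "C", 40, 25, 60, 30),
]
parent_b = [
    Call("contig1", 100, "G", 35, 30, 60, 20),
    Call("contig1", 200, "A", 40, 30, 60, 30),
    Call("contig2", 50, "C", 40, 25, 60, 30),   # same base, not a SNP
]

result = snps_between(parent_a, parent_b)
print(result)
```

In this example only contig1 position 100 is reported: position 200 is dropped because one parent has coverage below 10, and contig2 position 50 is monomorphic.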

Figure 2. Venn diagram representing the number of SNPs identified between each parental pair.
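
Counts of the kind shown in a three-set Venn diagram can be derived from per-pair SNP lists with simple set arithmetic. The sketch below is illustrative only: the pair labels, site identifiers, and function names are hypothetical, and a SNP site is assumed to be identified by its contig name and position.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Set, Tuple

Site = Tuple[str, int]  # (contig, position); an assumed site identifier


def venn_regions(snps_by_pair: Dict[str, Set[Site]]) -> Counter:
    """Count sites by the exact set of parental pairs in which they are polymorphic."""
    membership: Dict[Site, Set[str]] = {}
    for pair, sites in snps_by_pair.items():
        for site in sites:
            membership.setdefault(site, set()).add(pair)
    return Counter(frozenset(pairs) for pairs in membership.values())


def shared_sites(snps_by_pair: Dict[str, Set[Site]],
                 min_pairs: int = 2) -> List[Site]:
    """Return sites polymorphic in at least min_pairs parental pairs."""
    counts = Counter(site for sites in snps_by_pair.values() for site in sites)
    return sorted(site for site, n in counts.items() if n >= min_pairs)


# Made-up SNP sets for three hypothetical parental pairs.
snps_by_pair = {
    "pair1": {("contig1", 100), ("contig2", 50), ("contig3", 7)},
    "pair2": {("contig1", 100), ("contig2", 50)},
    "pair3": {("contig1", 100), ("contig4", 9)},
}

regions = venn_regions(snps_by_pair)
shared = shared_sites(snps_by_pair)
print(regions[frozenset({"pair1", "pair2", "pair3"})])
print(shared)
```

Sites returned by shared_sites correspond to the overlapping regions of the diagram, which is the set favored by the second selection criterion.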


Table 1. Total numbers of SNPs and the number of contigs represented by these SNPs.