Experimental Procedures

 

Computational identification and analysis of AS events from cDNA and EST databases

AS identification and filtering

To identify cassette and mutually exclusive AS events, we first mapped mouse cDNA/EST sequences (ftp://ftp.ncbi.nih.gov/repository/UniGene/) (Build #119) to the mouse genome (ftp://ftp.ncbi.nih.gov/genomes/M_musculus/MGSCv3_Release3/). The expressed sequences were repeat-masked using RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html). Exons were identified by aligning each cDNA/EST sequence with the genomic supercontig sequences using Sim4 (Florea et al., 1998). Alignments of expressed sequences were discarded if having a sequence identity less than 80% and if not encompassing at least two contiguous splice sites. The expressed sequences were mapped to the corresponding contigs of exons extracted by Sim4. Cassette and mutually exclusive alternative exons were identified if a middle exon is skipped in one expressed sequence and included in another. To exclude possible false-positive detections, ¡°AS¡± events were not counted if they occurred at a frequency of less than one in 100 expressed sequences for cassette alternative exons, since these events could be attributed to splicing errors (Kan et al., 2002; Modrek and Lee, 2003). This resulted in the identification of a total of 14,106 exon skipping events in 13,801 mouse genes with 2,101,724 mouse cDNA/EST sequences. Of these AS events, 7,024 were removed because the flanking exons were not always constitutively spliced, leaving 7,082 AS events.

 

We next selected AS events based on the maximal number of supporting cDNA/EST sequences. To identify the number of expressed sequences that represent each isoform, two ¡°probe¡± sequences were designed: (1) a 100-mer ¡°skipped exon¡± probe consisting of 50 nucleotides upstream of the alternative exon fused to 50 nucleotides downstream of the alternative exon; and (2), a longer sequence spanning the same region but with the middle alternative exon sequence included. These two probe sequences were used to Blast-search the mouse expressed sequences using an e-value of 1e-4. An ¡°excluded¡± isoform was identified if there was a match to the middle 40 or more nucleotides of the ¡°skipped exon¡± probe. An ¡°included¡± isoform was identified if the alignment spanned the two splice sites flanking the alternative exon. 2,190 cassette alternative exons that co-exist with alternative 5´/3´ splice sites in the flanking upstream/downstream exons were detected and excluded from the analysis. Of the remaining 4,892 AS events, 3,126 with the highest number of supporting cDNA/ESTs for both isoforms were selected for representation on a single Agilent microarray.

 

Identification of conserved and mouse-specific alternative exons

AS events were classified as conserved or mouse-specific as described in the main text. Of the top 1,600 ranking GenASAP AS events, 328 were identified as conserved AS, 668 were identified as mouse-specific AS of conserved exons, and 604 were identified as mouse-genome-specific AS. The ¡°high confidence¡± cases of mouse-specific AS of conserved exons were identified if an AS event had a score equal to or greater than 8 (S >= 8) using a stringent, experimentally-validated scoring scheme (Pan et al., 2005). Of 668 mouse-specific single cassette alternative exons, 246 (36.8%) have scores in the ¡°high confidence¡± range.

 

Distribution of AS genes in Gene Ontology Biological Process (GO-BP) categories

            All Gene Ontology Biological Process (GO-BP) annotations for 10,361 Mouse Genome Informatics (MGI) markers were downloaded on June 2, 2004 from ftp://ftp.informatics.jax.org/pub/reports/gene_association.mgi. We also downloaded the GO-BP category hierarchy on May 31, 2004 from http://www.geneontology.org/ontology/gene_ontology.obo and used it to ¡°up-propagate¡± the MGI annotations, such that each category was augmented by the annotations present in any of its descendants (eg. King et al., 2003). This resulted in 120,338 annotations in 2361 GO-BP categories.  We used this annotation set to assign GO-BP annotations to the genes associated with the measured alternative splicing (AS) events. The GO-BP annotations associated with AS genes monitored on the microarray can be found in Supp. Info. Table S4.

 

We mapped AS events to MGI markers using a BLAST search.  Specifically, we used ungapped blastn with an E-value cutoff of 1e-5 to find matches for each C1 and C2 probes among the sequence for MGI markers labeled ¡°Gene¡± in the MRK_Sequence.rpt file (downloaded from MGI on May 28, 2004).  This procedure generated multiple matches for some probes; we discarded any matches having bit scores less than 90% of the highest bit score for that probe.  Probes with remaining multiple matches were excluded, and MGI marker IDs were assigned to all AS events whose C1 and C2 probe map to the same marker.  Using this procedure we mapped 2773 (of 3126) AS events to 2296 unique MGI markers, 1257 of which had at least one GO-BP annotation.

 

To test for enrichment of GO-BP annotations within the set of 1257 markers, we performed a series of hypergeometric tests (i.e. Fisher¡¯s exact test).  After the Bonferroni correction, only 22 of the 1118 represented categories were either under- or over-represented (14 and 8 categories, respectively) at P < 0.05.

 

AS microarray data analysis

Algorithm development

In the GenASAP model, each of the two isoforms is described by an isoform profile. The profile for isoform 1 is {l1,1¡­l1,6} and the profile for isoform 2 is {l2,1¡­l2,6}, where the 6 elements for each isoform correspond to the probes, {C1, C2, A, C1-A, A-C2, C1-C2} (see Figure 1). These profiles can be set by hand (e.g., the profile for isoform 1 can be set to {1,1,0,0,0,1} corresponding to the exclusion event). However, we found the data contains unanticipated correlations between the probes, and therefore we designed the algorithm to automatically estimate the isoform profiles, as described below. The preprocessed expression level xjtk for probe k in gene j and tissue t is given by

 

xjtk = rjt(sjt1 l1k + sjt2 l2k + ( (1-ojtk) yk + ojtk f) ejtk),

 

where rjt is a scaling factor specific to gene j and tissue t; sjt1 and sjt2 are the amounts of isoforms 1 and 2 respectively (constrained to be positive); ojtk indicates whether probe k is an outlier (ojtk =1) or not (ojtk =0); ejtk is a normally distributed random variable; yk is the std dev of probe k assuming it fits the model; and f is the std dev of probe k under an outlier process. While yk accounts for small levels of noise introduced by transcription, splicing, hybridization and scanning, f accounts for large levels of noise introduced by faulty probes, missing data, particulate matter on the microarray, among other sources.

 

Given the training set (x¡¯s), GenASAP carries out a Bayesian analysis to infer the values of the other variables in the above equation, for all genes and all tissues. The l¡¯s, y¡¯s and f variables are shared across all genes and tissues and so GenASAP has available a large amount of data to determine each of these variables precisely. We refer to these as parameters. However, the number of r¡¯s, s¡¯s, o¡¯s and e¡¯s is very large compared to the size of the data set, so GenASAP computes a distribution over each of these variables, which takes into account the uncertainties in their values. Inference is performed using a variational form of the expectation maximization algorithm (Jordan et al., 1999; Neal and Hinton, 1998). The algorithm begins by initializing the parameters to random values and then alternates between holding the parameters constant while updating the distributions over the remaining variables, and then holding these distributions constant while updating the parameters. After convergence, the distributions over the variables are used to compute the expected values of the isoform levels, E[sjt1] and E[sjt2], which are then combined to estimate the %ASex: %ASex = 100 E[sjt1] / (E[sjt1]+E[sjt2]). A more detailed description of GenASAP, and evidence that it performs better then a variety of supervised methods for AS detection, is presented in a technical report (www.psi.utoronto.ca/~ofer/AS-supp.pdf) (Shai et al., In preparation).

 

Comparison of gene sets defining tissue-specific AS and transcription profiles (Table 1)

            The greedy approach used for the analysis in Table 1 involves iteratively removing an event from the pool of 820 genes, such that the removal increases the profile similarity for the selected tissue pair, while reducing the average profile similarity for all of the other tissue pairs. At each step, the algorithm removes the event such that y is maximized, where y is given by y = Dsi - 0.5DS, where Dsi is the change in tissue similarity between the pool of events, with and without the event, for the tissue-pair selected, and DS is the average change in tissue similarity between the pool of events with and without the event for all other tissue pairs. This procedure was carried out twice for each tissue pair: once for %ASex profile similarity, and once for transcription profile similarity, yielding two groups of 100 genes. The similarity measures used for these sets of genes were the same as those used in Figure 4A. To ensure that the greedy algorithm operated correctly, the tissue-pair profile similarities for the sets of 100 genes were plotted as shown in Figure 4A; in each case, points representing the selected tissue-pairs were located in the upper-right-most quadrant of the scatter plot, and were well separated from the points representing the other tissue-pair combinations (data not shown).

 

References

 

Florea, L., Hartzell, G., Zhang, Z., Rubin, G. M., and Miller, W. (1998). A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res 8, 967-974.

 

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning 37, 183-233.

 

Kan, Z., States, D., and Gish, W. (2002). Selecting for functional alternative splices in ESTs. Genome Res 12, 1837-1845.

 

King, O. D., Foulger, R. E., Dwight, S. S., White, J. V., and Roth, F. P. (2003). Predicting gene function from patterns of annotation. Genome Res 13, 896-904.

 

Modrek, B., and Lee, C. J. (2003). Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat Genet 34, 177-180.

 

Neal, R. M., and Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models (Cambridge, MIT Press), pp. 355-368.

 

Pan, Q., Bakowski, M. A., Morris, Q., Zhang, W., Frey, B. J., Hughes, T. R., and Blencowe, B. J. (2005). Alternative splicing of conserved exons is frequently species-specific in human and mouse. Trends Genet. In Press.