
The feature matrix F was obtained by multiplying the transposed matrix Ascore by the matrix W, giving a matrix of size N × K in which each of the K columns represents the coefficients for the subjects. To determine the association level of each factor with each cancer type, elastic net regression models were fitted using the cancer type as the output variable and the columns of the F matrix as input variables. To discover pathways specific to a cancer type, subjects with the cancer type of interest were treated as cases and the remaining subjects as controls. The regularization parameter λ was selected using ten-fold cross-validation. The association effect of each feature was represented by its beta value, and the vector of beta values was denoted β. We selected the factor corresponding to the largest β and designated it the cancer-relevant factor. After selecting the factor, we used the matrix W to rank genes. Each factor in the matrix W is composed of genes with different weights. We assume that the genes with the largest coefficients interact with each other cumulatively and linearly and are associated with cancer development. For each cancer, we repeated the experiment, selected the top genes in each factor for enrichment analysis, and reported the most significant pathways (adjusted P-value < 0.05). To ensure stable results, we ran the pathway analysis on the top 100, 200, 300, 400, 500, and 600 genes and found that the results became stable from 300 genes onward, as illustrated with BRCA (Table S2).
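The factor selection and gene ranking described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a logistic elastic net (the outcome is case versus control), an L1/L2 mixing ratio of 0.5, and a W matrix with one row per gene and one column per factor; none of these are stated in the text. The function names and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV


def select_cancer_factor(F, labels, cancer_type, seed=0):
    """Fit a one-vs-rest elastic net logistic model; return (best_factor, beta)."""
    # Cases are subjects with the cancer type of interest; the rest are controls.
    y = (np.asarray(labels) == cancer_type).astype(int)
    model = LogisticRegressionCV(
        penalty="elasticnet",
        solver="saga",      # the scikit-learn solver that supports elastic net
        l1_ratios=[0.5],    # assumed L1/L2 mix; the source does not state it
        Cs=10,              # grid over C, the inverse of the penalty strength
        cv=10,              # ten-fold cross-validation, as in the text
        max_iter=5000,
        random_state=seed,
    )
    model.fit(F, y)
    beta = model.coef_.ravel()          # one coefficient per factor
    return int(np.argmax(beta)), beta   # factor with the largest beta


def top_genes(W, factor, gene_names, n_top=300):
    """Rank genes by their weight in one column of W; keep the n_top largest."""
    order = np.argsort(W[:, factor])[::-1][:n_top]
    return [gene_names[i] for i in order]


# Synthetic demo (not data from the study): factor 2 drives the "BRCA" label.
rng = np.random.default_rng(0)
F_demo = rng.random((200, 5))                  # 200 subjects, K = 5 factors
labels_demo = np.where(F_demo[:, 2] > 0.5, "BRCA", "other")
W_demo = rng.random((50, 5))                   # 50 genes, K = 5 factors (assumed layout)
genes_demo = [f"gene_{i}" for i in range(50)]

factor_demo, beta_demo = select_cancer_factor(F_demo, labels_demo, "BRCA")
ranked_demo = top_genes(W_demo, factor_demo, genes_demo, n_top=10)
```

The ranked gene list for the selected factor is what would then be passed to an enrichment tool; the default of 300 genes mirrors the stability threshold reported in the text.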

3. Experiment results

After collapsing the mutation counts and scores within each gene, a matrix Ascore was formed with the 2431 subjects as rows. Entry Aij denotes the jth gene’s collapsed score for the ith subject. For gene pre-

measurement. The experiment was repeated using the Number (collapsed number of mutations), SIFT (collapsed SIFT scores), PP2 (collapsed PP2 scores), and CADD (collapsed CADD scores) matrices. Factor numbers ranged from 2 to 15 (Fig. 2). The Number matrix performed better than the other matrices. Using the Number

Fig. 2. The accuracy of cancer type classification using different P-value cutoffs. (A) Sum of the count of mutations (B) Sum of the SIFT scores (C) Sum of the PP2 scores (D) Sum of the CADD scores.
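As a rough illustration of the collapsing step used to build these matrices, the sketch below sums per-mutation values into a subjects × genes matrix. The record format, the function name, and the example subjects and genes are hypothetical; the only assumption taken from the text is that collapsing means summing within a gene.

```python
def collapse_by_gene(mutations, subjects, genes):
    """Sum per-mutation values into a subjects x genes matrix (list of lists)."""
    row = {s: i for i, s in enumerate(subjects)}
    col = {g: j for j, g in enumerate(genes)}
    A = [[0.0] * len(genes) for _ in subjects]
    for subject, gene, value in mutations:
        A[row[subject]][col[gene]] += value
    return A


# Hypothetical records: (subject, gene, value). Use value = 1 to build the
# Number matrix, or a per-mutation SIFT / PP2 / CADD score for the others.
records = [
    ("s1", "TP53", 1), ("s1", "TP53", 1), ("s1", "KRAS", 1),
    ("s2", "KRAS", 1),
]
A_number = collapse_by_gene(records, ["s1", "s2"], ["TP53", "KRAS"])
```

Here entry `A_number[i][j]` plays the role of Aij: the collapsed value of the jth gene for the ith subject.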

We then compared these four matrices: Number, SIFT, PP2, and CADD. The number of factors K ranged from 2 to 15, a range that satisfies the constraint (N + M)K < NM. For each factor number K, nsNMF was applied to the matrices, and a corresponding classifier was trained. The Number matrix significantly outperformed the other matrices (p < 0.01 for all comparisons) (Fig. 3). Precision, recall, and F-measure were also computed and showed similar patterns and trends. Based on performance, the Number matrix was used for subsequent analyses. Using the Number matrix, the maximum accuracy was 80.0% (standard error of the mean, SEM = 0.1%) when the factor number equaled 12 (Fig. 3). The accuracy became stable when the factor number was larger than 12. To prevent potential overfitting, we chose a factor number of 12 for our analysis.
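The rank constraint (N + M)K < NM mentioned above is easy to check numerically. The sketch below uses N = 2431 subjects from the text; the gene count M is not given in this section, so the value used here is purely hypothetical.

```python
def max_factor_number(n_subjects, n_genes):
    """Largest integer K satisfying (N + M) * K < N * M."""
    n, m = n_subjects, n_genes
    k = (n * m) // (n + m)
    # Floor division can land exactly on equality; the rule is a strict inequality.
    if k * (n + m) == n * m:
        k -= 1
    return k


N = 2431   # subjects, from the text
M = 1000   # hypothetical gene count, for illustration only
k_max = max_factor_number(N, M)
# Factor numbers from the tested range 2..15 that satisfy the constraint.
candidate_ks = [k for k in range(2, 16) if (N + M) * k < N * M]
```

With any gene count in the hundreds or more, the tested range of 2 to 15 sits far below the bound, so the constraint does not restrict the choice of K in practice.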

Fig. 3. The accuracy of cancer type predictions using different numbers of factors from the matrices of Number (blue), SIFT (red), PP2 (green), and CADD (purple) scores. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Our proposed model (80.0%, SEM = 0.1%) significantly outperformed the other four baselines (Fig. 4). Comparing our proposed model with the second-ranked model (73.9%, SEM = 0.8%), which applies penalized logistic regression to the aggregated Number matrix, Student’s t-test gave a P-value of 0.0001. Among the baselines, aggregating the mutations in a gene also improved performance significantly (p < 0.01 in both comparisons). We also compared our proposed model with previously applied methods for cancer classification. Our method achieved significantly better performance (Fig. 5) than methods that use variations of CADD scores (71.6%) [29], L1-regularized logistic regression (74.0%) [30], and SVM-RFE (55.9%) [30].
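For readers who want to reproduce this kind of comparison, the sketch below computes the SEM of repeated accuracy measurements and the pooled-variance Student's t statistic for two models. The accuracy values are made up for illustration and are not results from the study; the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(values) / sqrt(len(values))


def student_t(a, b):
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


# Made-up accuracies from repeated runs of two hypothetical models.
model_a = [0.80, 0.81, 0.79, 0.80, 0.80]
model_b = [0.74, 0.72, 0.75, 0.73, 0.76]
t_stat, dof = student_t(model_a, model_b)
```

The P-value follows from the t distribution with `dof` degrees of freedom; a statistics library such as SciPy (`scipy.stats.ttest_ind`) returns it directly.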

Using regularized logistic regression, we assessed each gene’s association effect with each cancer type. The association score was defined as the sum of feature weights multiplied by the coefficient of each

Fig. 4. Comparison of our proposed model (nsNMF + SVM) with baselines. LR is penalized logistic regression. SVM is support vector machine. "Aggregated" denotes models whose input matrices sum the mutations within the same gene. "Mutations" denotes models that use every single mutation as an input variable.