vignettes/QualityControlAndSampleSizeEstimation.Rmd
QualityControlAndSampleSizeEstimation.Rmd
This vignette demonstrates how the package can be used to create descriptive statistics or quality control plots for a dataset. First, we read a “proteinGroups.txt” file from the MaxQuant software, and next, we specify a prolfqua configuration. Afterward, the data is transformed and normalized, and plots describing the data are generated. Finally, we estimate how many samples are necessary to properly quantify a two-fold change for \(90%\) of all the proteins with a power of \(0.8\).
Generate Simulated data. To learn how to create prolfqua LFQData objects see the vignette: creating configurations.
simdata <- prolfqua::sim_lfq_data_peptide_config()
lfqdata <- prolfqua::LFQData$new(simdata$data,simdata$config)
lfqdata$remove_small_intensities()
You can convert the data into a data frame in a wide format, where the intensities of each sample occupy their columns.
lfqdata$to_wide()$data[1:3,1:8]
## # A tibble: 3 × 8
## protein_Id peptide_Id isotopeLabel A_V1 A_V2 A_V3 A_V4 B_V1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0EfVhX~0087 ahQLlQY7 light 26.9 25.2 26.1 25.2 26.1
## 2 0EfVhX~0087 dJkdz7so light NA 16.1 14.7 NA 25.7
## 3 0EfVhX~0087 ITLb4x1q light 19.4 NA 18.2 18.9 28.9
Next we show how the data is distributed before transformation and normalization.
lfqplotter <- lfqdata$get_Plotter()
lfqplotter$intensity_distribution_density()
Often it also makes sense to look into the part of the data that is
not quantified in all samples. Therefore several functions for the
missingess are available. In the NA_heatmap
function it is
possible to see if some missing proteins are specific for one particular
dilution or if they appear randomly in all samples.
nah <- lfqplotter$NA_heatmap()
Prints missing heatmap:
nah
Also in the next two figures we show how many missing values are found in the respective conditions and how many proteins do not have missing data and how complete the measurements are for each group.
lfqdata$get_Summariser()$plot_missingness_per_group()
lfqdata$get_Summariser()$plot_missingness_per_group_cumsum()
In the missigness_histogram
we can check if there is a
dependency between NAs (and the number of NAs for earch protein) with
respect to the protein intensity.
lfqplotter$missigness_histogram()
Other important statistics that can be easily calculated from the object are coefficient of variation, means and standard deviations.
stats <- lfqdata$get_Stats()
## [1] "group_"
## NULL
prolfqua::table_facade( stats$stats_quantiles()$wide, paste0("quantile of ",stats$stat ))
probs | A | All | B | Ctrl |
---|---|---|---|---|
0.10 | 1.215060 | 7.185163 | 1.236542 | 1.129780 |
0.25 | 2.442726 | 12.277918 | 1.999453 | 1.607515 |
0.50 | 3.418211 | 15.169313 | 3.026062 | 3.349758 |
0.75 | 4.599101 | 19.467593 | 4.155821 | 4.885600 |
0.90 | 6.349093 | 21.451325 | 5.322302 | 6.008022 |
stats$density_median()
In the next figure we show the dependency of the standard deviation with respect to the mean intensitiy of the proteins.
stdm_raw <- stats$stdv_vs_mean(size = 10000) +
ggplot2::scale_x_log10() +
ggplot2::scale_y_log10()
stdm_raw
Next, we want to normalize the data by first \(\log_2\) transforming it and then z-scaling. The \(log_2\) stabilizes the variance, while the z-scaling removes systematic differences from the samples.
lt <- lfqdata$get_Transformer()
transformed <- lt$log2()$robscale()$lfq
transformed$config$table$is_response_transformed
## [1] TRUE
Again we want to look into the distribution of the intensities in our samples after normalization.
pl <- transformed$get_Plotter()
pl$intensity_distribution_density()
plots matrix scatter plot in the upper right matrix and some statistics in the lower left matrix.
pl$pairs_smooth()
## NULL
plots protein heatmap:
p <- pl$heatmap()
p
We can also look at the correlation among the samples or look at a PCA plot.
hc <- pl$heatmap_cor()
hc
pl$pca()
prints the standard deviations:
stats <- transformed$get_Stats()
## [1] "group_"
## NULL
prolfqua::table_facade(stats$stats_quantiles()$wide, "Standard deviations")
probs | A | All | B | Ctrl |
---|---|---|---|---|
0.10 | 0.0355196 | 0.1152932 | 0.0325012 | 0.0294799 |
0.25 | 0.0587524 | 0.1677154 | 0.0582281 | 0.0422670 |
0.50 | 0.0717438 | 0.2345489 | 0.0853025 | 0.0566645 |
0.75 | 0.1240478 | 0.3004859 | 0.1209993 | 0.0720238 |
0.90 | 0.1687990 | 0.3253740 | 0.1388753 | 0.1087199 |
plots density of the standard deviation
stats$density_median()
Check for heteroskedasticity. After transformation, the standard deviation should be independent of the mean intensity.
stdm_trans <- stats$stdv_vs_mean(size = 10000) + ggplot2::scale_x_log10() + ggplot2::scale_y_log10()
stdm_trans
#gridExtra::grid.arrange(stdm_raw, stdm_trans, nrow=1)
For the sample size estimation we will now focus on only one condition and the different bio-reps from the a group In this condition all samples have the same concentration and we want to check variability within this group. Like this we can estimate how many samples are necessary per group to quantify properly a two-fold change (delta of 1) for 90% of all the proteins with a power of 0.8 and significance level of 0.05.
First, we need to filter our data for group_
A
only.
transformedA <- transformed$get_copy()
transformedA$data <- transformedA$data |> dplyr::filter(group_ == "A")
stats <- transformedA$get_Stats()
## [1] "group_"
## NULL
stats$density(ggstat = "ecdf")
The table indicates that for this kind of data - if a two-fold change for 90% of all the proteins should be accurately quantified with a power of 0.8 (at 0.05 significance level) one would need to have 10 samples in each group.
sampleSize <- stats$power_t_test_quantiles() |>
dplyr::filter(group_ != "All")
prolfqua::table_facade(sampleSize, "Sample sizes. delta - Effect size, N - samplesize")
group_ | probs | quantiles | sdtrimmed | N_exact | N | delta |
---|---|---|---|---|---|---|
A | 0.10 | 0.0355196 | 0.0355196 | 1.526948 | 2 | 0.59 |
A | 0.25 | 0.0587524 | 0.0587524 | 1.673109 | 2 | 0.59 |
A | 0.50 | 0.0717438 | 0.0717438 | 1.758418 | 2 | 0.59 |
A | 0.75 | 0.1240478 | 0.1240478 | 2.172950 | 3 | 0.59 |
A | 0.90 | 0.1687990 | 0.1687990 | 2.671229 | 3 | 0.59 |
A | 0.10 | 0.0355196 | 0.0530265 | 1.500024 | 2 | 1.00 |
A | 0.25 | 0.0587524 | 0.0587524 | 1.521541 | 2 | 1.00 |
A | 0.50 | 0.0717438 | 0.0717438 | 1.569649 | 2 | 1.00 |
A | 0.75 | 0.1240478 | 0.1240478 | 1.768204 | 2 | 1.00 |
A | 0.90 | 0.1687990 | 0.1687990 | 1.961551 | 2 | 1.00 |
A | 0.10 | 0.0355196 | 0.1060530 | 1.500024 | 2 | 2.00 |
A | 0.25 | 0.0587524 | 0.1060530 | 1.500024 | 2 | 2.00 |
A | 0.50 | 0.0717438 | 0.1060530 | 1.500024 | 2 | 2.00 |
A | 0.75 | 0.1240478 | 0.1240478 | 1.533718 | 2 | 2.00 |
A | 0.90 | 0.1687990 | 0.1687990 | 1.616344 | 2 | 2.00 |
The table summarises visually how many samples are needed for different expected differences (in log2-scale) and different portions of all proteins.
sampleSize |>
ggplot2::ggplot(ggplot2::aes(x = factor(probs) , y = N)) +
ggplot2::facet_wrap(~delta) +
ggplot2::geom_bar(stat = "identity")
It is also possible to get the statistics for each protein in each group. The information includes group size, number of observations, standard deviation, variance, mean.
stats$stats() |> head()
## # A tibble: 6 × 11
## protein_Id peptide_Id group_ nrReplicates nrMeasured nrNAs sd var
## <chr> <chr> <chr> <int> <int> <int> <dbl> <dbl>
## 1 0EfVhX~0087 ITLb4x1q A 4 3 1 0.154 0.0236
## 2 0EfVhX~0087 ahQLlQY7 A 4 4 0 0.0620 0.00385
## 3 0EfVhX~0087 dJkdz7so A 4 2 2 0.182 0.0330
## 4 7cbcrd~5725 D5dQ4nKk A 4 4 0 0.0509 0.00259
## 5 9VUkAq~4703 eIC06D7g A 4 4 0 0.0946 0.00895
## 6 BEJI92~5282 HBkZvdhT A 4 2 2 0.00305 0.00000929
## # ℹ 3 more variables: meanAbundance <dbl>, medianAbundance <dbl>,
## # interaction <chr>
You can also estimate the sample size needed to perform differential
expression analysis for each protein, given the power
,
delta
, and sig.level
. The next figure shows
the distribution of needed sample sizes for all proteins with a power of
0.8 at signifiance level 0.05 for a two fold change (delta = 1).
x <- stats$power_t_test(delta = 1,power = 0.8, sig.level = 0.05)
x <- x |> dplyr::filter(group_ == "A") |> dplyr::arrange(desc(N),protein_Id)
prolfqua::table_facade(x[1:7,], caption = "Sample size for each protein")
protein_Id | peptide_Id | group_ | nrReplicates | nrMeasured | nrNAs | sd | var | meanAbundance | medianAbundance | interaction | delta | N_exact | N |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0EfVhX~0087 | dJkdz7so | A | 4 | 2 | 2 | 0.1816806 | 0.0330078 | 3.971469 | 3.971469 | group_A | 1 | 2.023422 | 3 |
HvIpHG~9079 | opjydeWJ | A | 4 | 3 | 1 | 0.2392614 | 0.0572460 | 4.094835 | 3.997946 | group_A | 1 | 2.344670 | 3 |
JcKVfU~9653 | 9LMQBevj | A | 4 | 4 | 0 | 0.1871014 | 0.0350069 | 5.482917 | 5.406550 | group_A | 1 | 2.050429 | 3 |
0EfVhX~0087 | ITLb4x1q | A | 4 | 3 | 1 | 0.1535287 | 0.0235710 | 4.190351 | 4.178250 | group_A | 1 | 1.892113 | 2 |
0EfVhX~0087 | ahQLlQY7 | A | 4 | 4 | 0 | 0.0620097 | 0.0038452 | 4.742744 | 4.735435 | group_A | 1 | 1.533666 | 2 |
7cbcrd~5725 | D5dQ4nKk | A | 4 | 4 | 0 | 0.0508679 | 0.0025875 | 4.956022 | 4.964554 | group_A | 1 | 1.500024 | 2 |
9VUkAq~4703 | eIC06D7g | A | 4 | 4 | 0 | 0.0945821 | 0.0089458 | 4.159728 | 4.182873 | group_A | 1 | 1.654299 | 2 |
x |> ggplot2::ggplot(ggplot2::aes(x = N)) +
ggplot2::geom_histogram() +
ggplot2::facet_wrap(~delta) +
ggplot2::xlim(0,100)
The prolfqua
package is described in (Wolski et al. 2022).
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Zurich
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] tidyr_1.3.1 plotly_4.10.4 sass_0.4.9 utf8_1.2.4
## [5] generics_0.1.3 KernSmooth_2.23-22 lattice_0.22-5 stringi_1.8.3
## [9] digest_0.6.35 magrittr_2.0.3 evaluate_0.23 grid_4.3.2
## [13] RColorBrewer_1.1-3 fastmap_1.1.1 Matrix_1.6-5 jsonlite_1.8.8
## [17] ggrepel_0.9.5 prolfqua_1.2.5 conflicted_1.2.0 gridExtra_2.3
## [21] mgcv_1.9-1 httr_1.4.7 purrr_1.0.2 fansi_1.0.6
## [25] viridisLite_0.4.2 scales_1.3.0 lazyeval_0.2.2 textshaping_0.3.7
## [29] jquerylib_0.1.4 cli_3.6.2 rlang_1.1.3 splines_4.3.2
## [33] munsell_0.5.1 withr_3.0.0 cachem_1.0.8 yaml_2.3.8
## [37] tools_4.3.2 memoise_2.0.1 dplyr_1.1.4 colorspace_2.1-0
## [41] ggplot2_3.5.1 forcats_1.0.0 vctrs_0.6.5 R6_2.5.1
## [45] lifecycle_1.0.4 stringr_1.5.1 fs_1.6.4 htmlwidgets_1.6.4
## [49] MASS_7.3-60.0.1 ragg_1.3.0 pkgconfig_2.0.3 desc_1.4.3
## [53] pkgdown_2.0.9 pillar_1.9.0 bslib_0.7.0 gtable_0.3.5
## [57] glue_1.7.0 data.table_1.15.4 Rcpp_1.0.12 systemfonts_1.0.6
## [61] highr_0.10 xfun_0.43 tibble_3.2.1 tidyselect_1.2.1
## [65] knitr_1.46 farver_2.1.1 nlme_3.1-164 htmltools_0.5.8.1
## [69] labeling_0.4.3 rmarkdown_2.26 pheatmap_1.0.12 compiler_4.3.2