vignettes/CreatingConfigurations.Rmd
Frequently, the data are available in tables where the rows represent proteins and the columns represent samples. The example shows such a table: the first column contains the protein ID, while the other columns store the intensities for samples A, B, C, and D.
df <- data.frame(protein_Id = c("tr|A|HUMAN","tr|B|HUMAN","tr|C|HUMAN","tr|D|HUMAN"),
Intensity_A = c(100,10000,10,NA),
Intensity_B = c(NA, 9000, 20, 100),
Intensity_C = c(200,8000,NA,150),
Intensity_D = c(130,11000, 50, 50))
df
## protein_Id Intensity_A Intensity_B Intensity_C Intensity_D
## 1 tr|A|HUMAN 100 NA 200 130
## 2 tr|B|HUMAN 10000 9000 8000 11000
## 3 tr|C|HUMAN 10 20 NA 50
## 4 tr|D|HUMAN NA 100 150 50
This table can be converted into a table in the long format using:
table_long <- tidyr::pivot_longer(df, starts_with("Intensity_"),names_to = "Sample", values_to = "Intensity")
table_long
## # A tibble: 16 × 3
## protein_Id Sample Intensity
## <chr> <chr> <dbl>
## 1 tr|A|HUMAN Intensity_A 100
## 2 tr|A|HUMAN Intensity_B NA
## 3 tr|A|HUMAN Intensity_C 200
## 4 tr|A|HUMAN Intensity_D 130
## 5 tr|B|HUMAN Intensity_A 10000
## 6 tr|B|HUMAN Intensity_B 9000
## 7 tr|B|HUMAN Intensity_C 8000
## 8 tr|B|HUMAN Intensity_D 11000
## 9 tr|C|HUMAN Intensity_A 10
## 10 tr|C|HUMAN Intensity_B 20
## 11 tr|C|HUMAN Intensity_C NA
## 12 tr|C|HUMAN Intensity_D 50
## 13 tr|D|HUMAN Intensity_A NA
## 14 tr|D|HUMAN Intensity_B 100
## 15 tr|D|HUMAN Intensity_C 150
## 16 tr|D|HUMAN Intensity_D 50
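For readers who want to see what the long format contains without tidyr, the same reshaping can be sketched in base R. This is only an illustration: the names `intensity_cols`, `long_base`, and `na_per_sample` are made up here, and the row order differs from the `pivot_longer` output (samples are stacked one after the other).

```r
# Same example table as above.
df <- data.frame(
  protein_Id = c("tr|A|HUMAN", "tr|B|HUMAN", "tr|C|HUMAN", "tr|D|HUMAN"),
  Intensity_A = c(100, 10000, 10, NA),
  Intensity_B = c(NA, 9000, 20, 100),
  Intensity_C = c(200, 8000, NA, 150),
  Intensity_D = c(130, 11000, 50, 50)
)

# Pick the intensity columns by their common prefix.
intensity_cols <- grep("^Intensity_", colnames(df), value = TRUE)

# stack() concatenates the selected columns and records the source column in `ind`.
stacked <- stack(df[intensity_cols])

long_base <- data.frame(
  protein_Id = rep(df$protein_Id, times = length(intensity_cols)),
  Sample = as.character(stacked$ind),
  Intensity = stacked$values
)

# Number of missing intensities per sample in the long table.
na_per_sample <- vapply(
  split(is.na(long_base$Intensity), long_base$Sample),
  sum, integer(1)
)
na_per_sample
```

A long table always has one row per protein and sample, here 4 x 4 = 16 rows, and missing values stay visible as explicit `NA` rows.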
In addition, you will need a table with the sample annotations. In this example, the samples are assigned to three groups: A, B, and C.
annot <- data.frame(Sample = c("Intensity_A", "Intensity_B", "Intensity_C", "Intensity_D"), Group = c("A","A","B","C"))
Now you can add the sample annotation to the long table of intensities.
table_long <- dplyr::inner_join(annot, table_long, by = "Sample")
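Because an inner join keeps only rows that have a match in both tables, a sample that is missing from the annotation is dropped without an error. The following base R sketch shows a check that could be run before joining. The annotation deliberately omits one sample, and the names `annot_incomplete`, `measured`, and `unannotated` are made up for illustration.

```r
# Hypothetical annotation that forgets Intensity_D.
annot_incomplete <- data.frame(
  Sample = c("Intensity_A", "Intensity_B", "Intensity_C"),
  Group = c("A", "A", "B")
)

# Sample names as they appear in the long table.
measured <- c("Intensity_A", "Intensity_B", "Intensity_C", "Intensity_D")

# Samples that would be lost in an inner join.
unannotated <- setdiff(measured, annot_incomplete$Sample)

if (length(unannotated) > 0) {
  message("Samples without annotation: ", paste(unannotated, collapse = ", "))
}
```

An empty `unannotated` vector means that the join keeps every measured sample.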
We create an AnalysisTableAnnotation object and start annotating
the data frame, that is, specifying which column contains which type of
information.
atable <- prolfqua::AnalysisTableAnnotation$new()
atable$fileName = "Sample"
atable$workIntensity = "Intensity"
The columns identifying the measured features (proteins, peptides, or
precursors) are described using the named list hierarchy. The values of
the list are the column names, while the names are arbitrary as long as
they are valid R column names. Here we use the same names as the column
names. The list factors is used to point to the columns containing the
factors of your analysis (Group).
atable$hierarchy[["protein_Id"]] <- "protein_Id"
atable$factors[["Group"]] <- "Group"
config <- prolfqua::AnalysisConfiguration$new(atable)
analysis_data <- prolfqua::setup_analysis(table_long, config)
lfqdata <- prolfqua::LFQData$new(analysis_data, config)
lfqdata$hierarchy_counts()
## # A tibble: 1 × 2
## isotopeLabel protein_Id
## <chr> <int>
## 1 light 4
smrz <- lfqdata$get_Summariser()
smrz$plot_hierarchy_counts_sample()
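A configuration is only useful if every column it names exists in the data. As a sketch, the check below uses a plain list as a stand-in for the fields set on the AnalysisTableAnnotation above. The helper `missing_columns` and the list `annotation` are hypothetical and not part of prolfqua.

```r
# Small excerpt of an annotated long table (values taken from the example above).
table_long <- data.frame(
  Sample = c("Intensity_A", "Intensity_B"),
  Group = c("A", "A"),
  protein_Id = c("tr|A|HUMAN", "tr|A|HUMAN"),
  Intensity = c(100, NA)
)

# Plain-list stand-in for the annotation fields; not a prolfqua object.
annotation <- list(
  fileName = "Sample",
  workIntensity = "Intensity",
  hierarchy = list(protein_Id = "protein_Id"),
  factors = list(Group = "Group")
)

# Hypothetical helper: configured column names that are absent from `data`.
missing_columns <- function(data, annotation) {
  required <- c(
    annotation$fileName,
    annotation$workIntensity,
    unlist(annotation$hierarchy),
    unlist(annotation$factors)
  )
  setdiff(unname(required), colnames(data))
}

missing_columns(table_long, annotation)
```

An empty result means that all configured columns are present. A misspelled column name, for example "group" instead of "Group", would be returned by the helper.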
Given, for example, a Peptide Quantification Report generated by
Spectronaut (a table in long format), we demonstrate how to create the
configuration that is required to use it with prolfqua. To do this, an
AnalysisTableAnnotation has to be configured and some fields (fileName,
hierarchy, factors, workIntensity) need to be defined. The configuration
object describes the columns in the long table so that prolfqua
functions know which columns to use. Here we use simulated peptide-level
data in the same long format.
dataLongFormat <- prolfqua::sim_lfq_data(Nprot = 20, PEPTIDE = TRUE)
head(dataLongFormat)
## # A tibble: 6 × 12
## proteinID idtype2 average_prot_abundance sd peptideID sample group mean
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 5pb90S 1460 20.5 1 mpxNAbn2 Ctrl_V1 Ctrl 0
## 2 5pb90S 1460 20.5 1 mpxNAbn2 Ctrl_V2 Ctrl 0
## 3 5pb90S 1460 20.5 1 mpxNAbn2 Ctrl_V3 Ctrl 0
## 4 5pb90S 1460 20.5 1 mpxNAbn2 Ctrl_V4 Ctrl 0
## 5 5pb90S 1460 22.5 1 mpxNAbn2 A_V1 A -2
## 6 5pb90S 1460 22.5 1 mpxNAbn2 A_V2 A -2
## # ℹ 4 more variables: N <dbl>, avg_peptide_abd <dbl>, Replicate <chr>,
## # abundance <dbl>
We create a table annotation object and start annotating the data. If
you later want to filter on data quality, you can also define
ident_qValue in this AnalysisTableAnnotation. The simulated data used
here has no q-value column, so it is not set.
atable <- prolfqua::AnalysisTableAnnotation$new()
atable$fileName = "sample"
atable$workIntensity = "abundance"
The columns identifying the measured features (proteins, peptides, or
precursors) are described using the named list hierarchy. The values of
the list are the column names, while the names are arbitrary as long as
they are valid R column names. Here we use the same names as the column
names.
The list factors is used to point to the columns containing the factors of your analysis (group). The names of this list are the names used in figures and legends generated by prolfqua. For example, an entry named “Marker” that points to a column “R.Condition” renames that column, so “Marker” will be shown and not “R.Condition”. Here we keep the column name “group”. The data.frame can also contain more than one factor.
atable$hierarchy[["proteinID"]] <- "proteinID"
atable$hierarchy[["peptideID"]] <- "peptideID"
atable$factors[["group"]] <- "group"
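The name/value mapping used for renaming a factor can be illustrated in base R. This is only a sketch of the idea, not prolfqua's implementation. The table `raw`, its column names, and the list `factor_map` are assumptions chosen to resemble a Spectronaut-style export.

```r
# Hypothetical input with Spectronaut-like column names.
raw <- data.frame(
  R.FileName = c("run1", "run2"),
  R.Condition = c("treated", "control")
)

# Names are the labels used downstream, values are the existing column names.
factor_map <- list(Marker = "R.Condition")

# Rename each mapped column to its label.
renamed <- raw
idx <- match(unlist(factor_map), colnames(renamed))
colnames(renamed)[idx] <- names(factor_map)

colnames(renamed)
```

With prolfqua, the corresponding entry would be set as `atable$factors[["Marker"]] <- "R.Condition"`, following the pattern of the assignments above.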
Lastly, we create the AnalysisConfiguration from the table annotation.
The function setup_analysis creates, from the data frame in long format,
a data.frame compatible with your configuration. We can now run most of
the functions in the package using the data and the configuration.
config <- prolfqua::AnalysisConfiguration$new(atable)
analysis_data <- prolfqua::setup_analysis(dataLongFormat, config)
prolfqua::summarize_hierarchy(analysis_data, config)
## # A tibble: 20 × 3
## proteinID isotopeLabel_n peptideID_n
## <chr> <int> <int>
## 1 1DIdTQ 1 6
## 2 1SO1NY 1 2
## 3 2JOCP3 1 6
## 4 5pb90S 1 2
## 5 OyHgjC 1 2
## 6 PDPfVe 1 2
## 7 TvODf3 1 2
## 8 UHjlsA 1 2
## 9 UmCi4L 1 1
## 10 Vphs0t 1 6
## 11 Xp1Tno 1 1
## 12 de5llz 1 2
## 13 hwBDgU 1 2
## 14 pGZKaB 1 1
## 15 rNWbGH 1 1
## 16 wXXHRN 1 6
## 17 yOSJAW 1 6
## 18 yj3Whg 1 10
## 19 z9WqiE 1 2
## 20 zVMmad 1 2
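Assuming that the peptideID_n column reports the number of distinct peptide identifiers per protein, the same kind of count can be sketched in base R on a toy table. The identifiers below are made up for illustration.

```r
# Toy long-format data: two proteins measured in two samples.
toy <- data.frame(
  proteinID = c("P1", "P1", "P1", "P1", "P2", "P2"),
  peptideID = c("pepA", "pepA", "pepB", "pepB", "pepC", "pepC"),
  sample = c("S1", "S2", "S1", "S2", "S1", "S2")
)

# Count distinct peptides within each protein.
peptides_per_protein <- vapply(
  split(toy$peptideID, toy$proteinID),
  function(x) length(unique(x)),
  integer(1)
)
peptides_per_protein
```

Each peptide is counted once per protein, no matter in how many samples it was measured.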
Now the analysis_data object is ready to be used to create the LFQData
class instance. This object is the starting point for further analysis.
lfqdata <- prolfqua::LFQData$new(analysis_data, config)
With this, it is possible, for example, to use the get_Summariser
method to visualize and summarise the data.
smrz <- lfqdata$get_Summariser()
smrz$plot_hierarchy_counts_sample()
The prolfqua package is described in (Wolski et al. 2022).
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.0.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Zurich
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] plotly_4.10.4 sass_0.4.9 utf8_1.2.4 generics_0.1.3
## [5] tidyr_1.3.1 stringi_1.8.4 digest_0.6.37 magrittr_2.0.3
## [9] evaluate_1.0.1 grid_4.4.1 RColorBrewer_1.1-3 fastmap_1.2.0
## [13] jsonlite_1.8.9 ggrepel_0.9.6 prolfqua_1.3.6 conflicted_1.2.0
## [17] gridExtra_2.3 httr_1.4.7 purrr_1.0.2 viridisLite_0.4.2
## [21] scales_1.3.0 lazyeval_0.2.2 textshaping_0.4.1 jquerylib_0.1.4
## [25] cli_3.6.3 rlang_1.1.4 munsell_0.5.1 withr_3.0.2
## [29] cachem_1.1.0 yaml_2.3.10 tools_4.4.1 memoise_2.0.1
## [33] dplyr_1.1.4 colorspace_2.1-1 ggplot2_3.5.1 forcats_1.0.0
## [37] vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4 stringr_1.5.1
## [41] fs_1.6.5 htmlwidgets_1.6.4 MASS_7.3-64 ragg_1.3.3
## [45] pkgconfig_2.0.3 desc_1.4.3 pkgdown_2.1.1 pillar_1.10.1
## [49] bslib_0.8.0 gtable_0.3.6 glue_1.8.0 data.table_1.16.4
## [53] Rcpp_1.0.13-1 systemfonts_1.1.0 xfun_0.50 tibble_3.2.1
## [57] tidyselect_1.2.1 knitr_1.49 farver_2.1.2 htmltools_0.5.8.1
## [61] labeling_0.4.3 rmarkdown_2.29 pheatmap_1.0.12 compiler_4.4.1