Title: Marine data mining with CMAP using R
Authors: Aditya Mishra - Flatiron Institute, Simons Foundation (United States) [presenting]
Christian L. Mueller - Simons Foundation (United States)
Jacob Bien - University of Southern California (United States)
Sangwon Hyun - University of Southern California (United States)
Abstract: Recent advances in experimental techniques and scientific instruments have enabled the collection of biological, biogeochemical, and imaging data of the ocean on a global scale. The Simons CMAP, a large-scale open-access marine database currently under development, hosts a multitude of such marine datasets, including remote-sensing satellite observations, large-scale integrated in-situ biogeochemical cruise measurements, amplicon sequencing data, and complex synthetic ocean simulation data. To make these rich datasets easily accessible to statisticians and data scientists, we have developed cmap4r, an R package that enables downloading, analyzing, and visualizing datasets from the Simons CMAP in a fast and structured manner. Integrated analysis of marine data is challenging due to several factors, including the presence of outliers, missing entries, different spatial and temporal resolutions, spatiotemporal dependencies, high dimensionality, and, for amplicon sequencing data, the absence of absolute species abundance measurements due to experimental limitations. These challenges present a unique opportunity for both the development and the application of novel statistical methods for marine data analysis. Using cmap4r as the primary access point to the database, we highlight two statistical analysis examples in which we have developed high-dimensional statistical techniques to relate microbial species abundances, marine environmental factors, and primary productivity in the ocean.