Nephelae: a platform for data-intensive science - an application to the ocean
Abstract
During the last decade, the ocean community witnessed the launch of over 30 new ocean-related satellite missions by 13 contributing space agencies representing around 36 countries (http://eohandbook.com). Plans for new satellites already extend well into the foreseeable future, and today there are already petabytes of data to download, analyze, and transform into accessible information. Over the next decade, past, current and future satellite Earth Observation (EO) missions, extended in situ networks and supercomputer simulations will continue to pave the way for this new era of understanding the Earth system as a whole, serving both research and operational interests. Coping with the streams of data pouring in from space and from simulations requires tools and methods that exploit this wealth and better link the past and/or near real-time complementary elements of the observing and modeling systems. This can only be achieved with dedicated infrastructures and methodologies for dynamically processing massive amounts of information and performing retrospective analyses, which will be key instruments for breakthroughs and for stimulating multidisciplinary Earth system research and applications in marine and climate sciences.
Without advanced strategies, the gap between the continuous acquisition of data and the capability to analyze them will grow, leaving specialists and end users overwhelmed. This leads to a dramatic underuse of the total data archive (acquired at a cost of several hundred million euros). These issues have already guided recent developments in data-intensive architectures for the design of mass science data centers. Following the impetus already given at national level by governmental organizations, there is now a significant shift in the academic domain towards emerging technologies that offer shared storage and processing capabilities. New concepts inherited from big data and cloud technology now make it possible to provide online archiving and remote processing capabilities that could increase how often existing historical archives are revisited and enhanced, and sustain more dynamic science.
In the last few years, Ifremer/Cersat has dedicated a large effort, with the support of ESA, to providing the experimental Nephelae platform. From its inception, the aim has been to capitalize on recognized expertise by bringing together thematic, observation and validation EO data (satellite, model, in situ), hardware, and information technology, in order to first define and design a flexible and educational new platform. This platform aims to facilitate the systematic provision of reliable advanced modeling and data information, to stimulate, strengthen and assist multidisciplinary advanced research and applications, and to sustain a continuous feedback loop on, and revisiting of, historical archives (for reprocessing, quality assessment, sensor synergy) at a speed and level of flexibility that was not achievable before. The platform is based on emerging big data and cloud computing technologies in order to:
- provide quick access to a large collection of ocean observations and modeling outputs spanning decades, for cross-comparison, merging, combining or data mining
- be fully and transparently scalable, so that new collections of data can be added and new fields of investigation covered
- offer a fast processing capability that overcomes the bottleneck or latency issues affecting current computing architectures under large I/O throughput, which is critical for large-scale applications that frequently revisit several years of data
- ease access and minimize the overhead time for users running an application or processing over large amounts of data
One objective is to assess and combine the available technologies more carefully, and to build upon this first demonstrator to design and build an original and optimized solution: a data-centric, data-intensive scientific cloud. This is a highly innovative endeavor: while several private cloud computing environments for generic science applications have been started, very few of them have explicitly aimed at optimizing I/O throughput and focusing on mass data processing by designing an intelligent synergy between distributed storage and processing.
The platform also benefits the user community by providing co-located large multi-source data collections (previously archived at various locations), which encourages the development of imaginative, multidisciplinary and advanced applications that merge or inter-relate these data. Like a library, the facility will provide a handy set of tools and resources as well as an ever-increasing pool of knowledge, attracting growing interest from a wider community. It has been demonstrated in several reprocessing and science projects, allowing a level of flexibility and experimentation that was not achievable before.
Speaker
From 1996 to 1999, Jean-François worked as a software engineer in industry, contributing to the development of several processing and analysis tools for marine data. Since 1999 he has been working as a data manager at CERSAT/IFREMER (Brest). His main achievements include the development of various optimally analysed gridded fields of sea-surface parameters (wind, fluxes, gas exchange coefficient and sea surface temperature), the design of several advanced tools for the search, extraction, and visualization of ocean data (Nausicaa, Naiad,...), the implementation of a wide multi-sensor intercomparison/colocation capability and matchup databases, and the evolution of the CERSAT facility toward an operational multi-mission real-time satellite archiving and production center.
He has been responsible for the data management and dissemination of satellite ocean products at CERSAT for several years. He is deeply involved in various national and international projects with EUMETSAT (O&SI SAF), ESA (Brest operation center for ocean EO data, Medspiration, GlobWave, GlobCurrent, OceanFlux, Felyx), CNES (SMOS, CFOSAT), EU (MerSea, MyOcean),... and has developed strong skills in data processing, product and metadata specification, user-oriented tools and the design of satellite ground segment elements. He is also a member of the GHRSST (Group for High Resolution Sea Surface Temperature) science team and has played a leading role in the definition of the GHRSST Data Specification (GDS).
He now also focuses on the design and demonstration of thematic EO data exploitation platforms that make use of cloud computing and big data technologies.