National Aeronautics and Space Administration

Living With A Star

Targeted Research and Technology

Solar Dynamics Observatory Machine Learning Dataset (SDOML) Improvements

ROSES ID: NNH21ZDA001N-LWSTM      Selection Year: 2021      

Program Element: Data, Tools, & Methods

Principal Investigator: James Paul Mason

Affiliation(s): Johns Hopkins University

Project Member(s):
Jin, Meng Co-I SETI Institute
Cheung, Chun Ming Mark Collaborator Lockheed Martin Inc.

Summary:

Since the publication of "A Machine-learning Data Set Prepared from the NASA Solar Dynamics Observatory Mission" (herein called SDOML) in 2019, a new publication using this dataset has appeared every other month. The utility of having the data from all three instruments onboard SDO wrapped into one uniform package, with the data already cleaned, is clear. However, as key members of the team that originated this dataset, we have identified improvements that would further increase its utility, ease of access, and therefore science return. We will focus primarily on four improvements, resulting in the release of SDOMLv2. First, we will repackage all of the data into the .zarr format. This is a new, well-supported format specifically designed for handling large, multi-dimensional arrays, especially for use in cloud computing. It is the preferred format for platforms such as Pangeo, and it will enable faster manipulation of the data for users. Second, we will generate the synthetic SDO/EVE emission lines data product for the entire timespan of the dataset (2010-2021) at full cadence (6 minutes, the SDO/AIA cadence). The method for generating this product accepts SDO/AIA data as input and has already been demonstrated in case studies. SDO/EVE's 60-360 … channel ceased functioning in 2014. Restoring it with synthetic data will re-enable all of the scientific analyses that this channel previously afforded, for example the irradiance coronal dimming studies in which this proposal's PI is heavily invested. Third, we will include the full, cleaned SDO/EVE spectra in the dataset. SDOMLv1 includes only the extracted emission lines product. The full spectra will enable more detailed scientific analyses, such as the study of Doppler shifts during eruptions. Finally, we will build open source tools and an example gallery to demonstrate access to and manipulation of SDOMLv2, along with some use cases, with emphasis on cloud computing.
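
To give a sense of scale for the second improvement, the sketch below counts how many time steps a 6-minute cadence implies over the 2010-2021 span. The summary gives only the years, so the exact start and end dates used here are assumptions, and the function name is hypothetical and not part of any SDOML tool.

```python
from datetime import datetime, timedelta

# Assumed bounds: the summary states only the years 2010-2021,
# so these exact dates are illustrative placeholders.
START = datetime(2010, 1, 1)
END = datetime(2022, 1, 1)  # exclusive upper bound
CADENCE = timedelta(minutes=6)  # cadence stated in the summary


def count_time_steps(start, end, cadence):
    """Count samples at start, start + cadence, ... strictly before end."""
    span = end - start
    # Ceiling division on timedeltas: floor-divide the negated span, then negate.
    return -(-span // cadence)


n_steps = count_time_steps(START, END, CADENCE)
print(f"{n_steps:,} time steps at a 6-minute cadence")
```

Under these assumed dates the count is on the order of one million time steps, which helps explain why a chunked, cloud-friendly array format matters for a dataset of this size.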

We will deliver the dataset itself to the Stanford Digital Repository where SDOMLv1 is stored. We will also deliver the dataset to NASA's Solar Data Analysis Center (SDAC), leveraging our team's existing relationship with the team at Goddard Space Flight Center that maintains that resource. Based on SDOMLv1 and the changes described above, we anticipate the dataset will be approximately 10 TB. Delivery preparation will begin in Project Year 2 and be completed by the conclusion of the period of performance (2023-08-31). The example gallery and open source tools will be hosted on GitHub and be publicly accessible through the duration of the proposed effort and afterwards.

SDO is an exemplar of big data and as such is a perfect fit for many existing AI tools. Because SDO is a flagship mission, its high profile means that large scientific returns can be expected from a modest effort to make the data more easily accessible to those AI tools. Moreover, its flagship status means that it has a large community impact when it sets the example with datasets and tools like SDOMLv2, which helps establish such practices as the norm for all missions in heliophysics.