Metadata loader

Load dataset dictionaries and access all datasets and fields in a flexible manner.

MetaLoader

 MetaLoader (base_path:str='/home/ec2-user/studies/hpp/', cohort:str=None,
             flexible_field_search:bool=False, errors:str='raise',
             **kwargs)

Class that loads multiple data dictionaries and provides easy access to the relevant fields.

Args:

base_path (str, optional): The base path where the data is stored. Defaults to DATASETS_PATH.
cohort (str, optional): The name of the cohort within the dataset. Defaults to COHORT.
flexible_field_search (bool, optional): Whether to allow regex field search. Defaults to False.
errors (str, optional): Whether to raise an error or issue a warning if missing data is encountered.
    Possible values are 'raise', 'warn' and 'ignore'. Defaults to 'raise'.
**kwargs: Additional keyword arguments to pass to a DataLoader class.

Attributes:

dicts (pd.DataFrame): A dictionary of data dictionaries (dataframes) of all available datasets in the base_path.
fields (list): A list of all fields.
cohort (str): The name of the cohort being used.
base_path (str): The base path where the data is stored.
flexible_field_search (bool): Whether to allow regex field search.
errors (str): Whether to raise an error or issue a warning if missing data is encountered.
kwargs (dict): Additional keyword arguments to pass to a DataLoader class.
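
The three values of the errors argument describe a common policy pattern. The sketch below illustrates that pattern only; it is not MetaLoader's actual implementation, and the helper name handle_missing is hypothetical.

```python
import warnings


def handle_missing(field, errors="raise"):
    """Apply an errors policy to a field that could not be found.

    Hypothetical helper, for illustration only: it mirrors the documented
    'raise' / 'warn' / 'ignore' options, not MetaLoader's internal code.
    """
    message = f"Field not found: {field!r}"
    if errors == "raise":
        # Stop immediately so the caller notices the missing data.
        raise KeyError(message)
    if errors == "warn":
        # Report the problem but let the caller continue.
        warnings.warn(message)
        return None
    if errors == "ignore":
        # Continue silently.
        return None
    raise ValueError(
        f"errors must be 'raise', 'warn' or 'ignore', got {errors!r}"
    )
```

With errors='warn' or errors='ignore' the caller receives None and decides how to proceed; any other value is rejected so that a typo in the policy name does not go unnoticed.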

The MetaLoader can be used to query all available fields across all datasets. In the following example, 4 datasets are available.

ml = MetaLoader()
ml
MetaLoader for: examples/*
with 81 fields
4 datasets:
['cgm'
 'diet_logging'
 'fundus'
 'sleep']

The object holds only the data dictionaries (metadata) of these datasets. Each column of a data dictionary corresponds to a column in the data tables of that dataset (e.g., fundus).

ml.dicts['fundus']
tabular_field_name  fundus_image_left       fundus_image_right      collection_date
dataset             fundus                  fundus                  fundus
field_string        Fundus image (left)     Fundus image (right)    Collection date (YYYY-MM-DD)
description_string  Fundus image (left)     Fundus image (right)    Collection date (YYYY-MM-DD)
parent_dataframe    NaN                     NaN                     NaN
relative_location   /fundus/fundus.parquet  /fundus/fundus.parquet  /fundus/fundus.parquet
value_type          Text                    Text                    Date
units               None                    None                    Time
sampling_rate       NaN                     NaN                     NaN
item_type           Bulk                    Bulk                    Data
array               Single                  Single                  Single
cohorts             10K                     10K                     10K
data_type           image                   image                   tabular
debut               2021-02-17              2021-02-17              2021-02-17
pandas_dtype        string                  string                  datetime64[ns]

You can query fields from multiple datasets directly:

ml[['glucose', 'fundus_image_left']]
tabular_field_name  cgm/glucose                  fundus/fundus_image_left
dataset             cgm                          fundus
field_string        Glucose                      Fundus image (left)
description_string  cgm temporal glucose values  Fundus image (left)
parent_dataframe    NaN                          NaN
relative_location   /cgm/cgm.parquet             /fundus/fundus.parquet
value_type          Series data, continous       Text
units               mg/dl                        None
sampling_rate       15min                        NaN
item_type           Data                         Bulk
array               Single                       Single
cohorts             10K                          10K
data_type           time series                  image
debut               2018-12-27                   2021-02-17
pandas_dtype        float                        string
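
In the output above, each matched field is reported as dataset/field. A minimal sketch of that kind of lookup, using plain dictionaries rather than dataframes, could look as follows. The DICTS structure and the find_fields helper are assumptions made for illustration and are not part of MetaLoader.

```python
# Toy stand-in for the data dictionaries:
# dataset name -> {field name -> field properties}.
DICTS = {
    "cgm": {
        "glucose": {"units": "mg/dl", "item_type": "Data"},
    },
    "fundus": {
        "fundus_image_left": {"units": None, "item_type": "Bulk"},
        "collection_date": {"units": "Time", "item_type": "Data"},
    },
}


def find_fields(dicts, names):
    """Return {'dataset/field': properties} for every requested field name.

    Hypothetical helper: each dataset is searched for an exact name match,
    and the dataset name is prepended so the origin of each field is clear.
    """
    found = {}
    for name in names:
        matches = {
            f"{dataset}/{name}": fields[name]
            for dataset, fields in dicts.items()
            if name in fields
        }
        if not matches:
            raise KeyError(f"Field not found in any dataset: {name!r}")
        found.update(matches)
    return found


result = find_fields(DICTS, ["glucose", "fundus_image_left"])
print(list(result))
```

Prefixing each field with its dataset keeps the result unambiguous when two datasets define a field with the same name.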

You can then use the MetaLoader to load the actual data of fields from multiple datasets. Here we load glucose from the CGM dataset, and fundus_image_left from the fundus dataset.

ml.load(['glucose', 'fundus_image_left']).head()
                                                                                               glucose  fundus_image_left
participant_id  collection_timestamp       connection_id  cohort  research_stage  array_index
0               2020-05-25 10:48:00+03:00  1000001        10k     00_00_visit     0            111.6    /path/to/file
                2020-05-25 11:03:00+03:00  1000001        10k     00_00_visit     0            79.2     /path/to/file
                2020-05-25 11:18:00+03:00  1000001        10k     00_00_visit     0            84.6     /path/to/file
                2020-05-25 11:33:00+03:00  1000001        10k     00_00_visit     0            106.2    /path/to/file
                2020-05-25 11:48:00+03:00  1000001        10k     00_00_visit     0            102.6    /path/to/file

You can run more flexible searches by matching a regex against a chosen property of the fields. Both the get() method and the load() method support the same syntax.

  1. Example: get all bulk data fields.
ml.get('bulk', flexible=True, prop='item_type')
tabular_field_name cgm/cgm_filename fundus/fundus_image_left fundus/fundus_image_right
dataset cgm fundus fundus
field_string CGM timeseries Fundus image (left) Fundus image (right)
description_string Name of the file containing the participants' ... Fundus image (left) Fundus image (right)
parent_dataframe NaN NaN NaN
relative_location /cgm/cgm.parquet /fundus/fundus.parquet /fundus/fundus.parquet
value_type Text Text Text
units NaN None None
sampling_rate NaN NaN NaN
item_type Bulk Bulk Bulk
array Single Single Single
cohorts 10K 10K 10K
data_type text image image
debut 2018-12-27 2021-02-17 2021-02-17
pandas_dtype string string string
  2. Example: get all fields that include “mg” in their units.
ml.get('mg', flexible=True, prop='units')
tabular_field_name cgm/1st qu_ cgm/3rd qu_ cgm/auc cgm/ea1c cgm/glucose cgm/gmi cgm/iqr cgm/mad cgm/mag cgm/mage ... cgm/modd cgm/range cgm/sd cgm/sdb cgm/sdbdm cgm/sddm cgm/sdhhmm cgm/sdw cgm/sdwsh diet_logging/sodium_mg
dataset cgm cgm cgm cgm cgm cgm cgm cgm cgm cgm ... cgm cgm cgm cgm cgm cgm cgm cgm cgm diet_logging
field_string 1st quantile 3rd quantile AUC eA1C Glucose GMI IQR MAD MAG MAGE ... MODD Range SD SDb SDbdm SDdm SDhhmm SDw SDwsh Sodium intake per food logged
description_string First quantile of all glucose values. Third quantile of all glucose values. Hourly average AUC. This measure integrates, t... A linear transformation of the mean glucose va... cgm temporal glucose values A linear transformation of the mean glucose va... Interquartile range (IQR), calculated as the d... Median Absolute Deviation (MAD). This is a mea... Mean Absolute Glucose (MAG). This is a measure... Mean Amplitude of Glycemic Excursions (MAGE), ... ... Mean difference between glucose values obtaine... Difference between the maximum and minimum glu... Standard deviation of all glucose values. SD between days, within time points. Mean valu... SD between days, within time points, corrected... Horizontal SD. SD of the mean glucose values, ... SD between time points. Standard deviation of ... Vertical SD within days. Average value of the ... SD within series. Taking hour-long intervals t... Sodium intake per food logged
parent_dataframe NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
relative_location /cgm/cgm.parquet /cgm/cgm.parquet /cgm/cgm.parquet /cgm/cgm.parquet /cgm/cgm.parquet /cgm/cgm.parquet /cgm/cgm.parquet /cgm/cgm.parquet /cgm/cgm.parquet /cgm/cgm.parquet ... /cgm/cgm.parquet /cgm/cgm.parquet /cgm/cgm.parquet /cgm/cgm.parquet /cgm/cgm.parquet /cgm/cgm.parquet /cgm/cgm.parquet /cgm/cgm.parquet /cgm/cgm.parquet diet_logging/diet_logging.parquet
value_type Continuous Continuous Continuous Continuous Series data, continous Continuous Continuous Continuous Continuous Continuous ... Continuous Continuous Continuous Continuous Continuous Continuous Continuous Continuous Continuous Continuous
units mg/dl mg/dl mg/dl*h mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl ... mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl mg/dl mg
sampling_rate NaN NaN NaN NaN 15min NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
item_type Data Data Data Data Data Data Data Data Data Data ... Data Data Data Data Data Data Data Data Data Data
array Single Single Single Single Single Single Single Single Single Single ... Single Single Single Single Single Single Single Single Single Single
cohorts 10K 10K 10K 10K 10K 10K 10K 10K 10K 10K ... 10K 10K 10K 10K 10K 10K 10K 10K 10K 10K
data_type tabular tabular tabular tabular time series tabular tabular tabular tabular tabular ... tabular tabular tabular tabular tabular tabular tabular tabular tabular Time Series
debut 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 ... 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2018-12-27 2019-09-01
pandas_dtype float float float float float float float float float float ... float float float float float float float float float float

14 rows × 24 columns
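
The two examples above match a pattern against a single property of every field. The sketch below shows one way such a search could work, assuming a case-insensitive regex search (suggested by 'bulk' matching 'Bulk' and 'mg' matching 'mg/dl'). The FIELDS mapping and the search_fields helper are hypothetical and only reuse a few values from the outputs above.

```python
import re

# Toy stand-in for the combined data dictionaries:
# 'dataset/field' -> field properties.
FIELDS = {
    "cgm/glucose": {"units": "mg/dl", "item_type": "Data"},
    "cgm/auc": {"units": "mg/dl*h", "item_type": "Data"},
    "cgm/cgm_filename": {"units": None, "item_type": "Bulk"},
    "fundus/fundus_image_left": {"units": None, "item_type": "Bulk"},
    "diet_logging/sodium_mg": {"units": "mg", "item_type": "Data"},
}


def search_fields(fields, pattern, prop):
    """Return the names of fields whose property `prop` matches `pattern`.

    Assumption: matching is a case-insensitive regex search, and fields
    with a missing (None) property value never match.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    return [
        name
        for name, props in fields.items()
        if props.get(prop) is not None and regex.search(str(props[prop]))
    ]


print(search_fields(FIELDS, "bulk", "item_type"))
print(search_fields(FIELDS, "mg", "units"))
```

Because the pattern is a regex, anchors narrow the match: search_fields(FIELDS, "^mg$", "units") would keep only fields whose units are exactly "mg".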