Metadata loader

Load dataset dictionaries and access all datasets and fields in a flexible manner.

MetaLoader

 MetaLoader (base_path:str='/home/ec2-user/studies/hpp/', cohort:str=None,
             flexible_field_search:bool=False, errors:str='raise',
             **kwargs)

Class that loads multiple data dictionaries and provides easy access to the relevant fields.

Args:

base_path (str, optional): The base path where the data is stored. Defaults to DATASETS_PATH.
cohort (str, optional): The name of the cohort within the dataset. Defaults to COHORT.
flexible_field_search (bool, optional): Whether to allow regex field search. Defaults to False.
errors (str, optional): Whether to raise an error, issue a warning, or do nothing when missing data is encountered.
    Possible values are 'raise', 'warn' and 'ignore'. Defaults to 'raise'.
**kwargs: Additional keyword arguments to pass to a DataLoader class.

Attributes:

dicts (dict): A dictionary of data dictionaries (pd.DataFrame) for all available datasets in the base_path, keyed by dataset name.
fields (list): A list of all fields.
cohort (str): The name of the cohort being used.
base_path (str): The base path where the data is stored.
flexible_field_search (bool): Whether to allow regex field search.
errors (str): Whether to raise an error, issue a warning, or do nothing when missing data is encountered.
kwargs (dict): Additional keyword arguments to pass to a DataLoader class.
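
The three values of errors can be pictured with a small helper. This is a minimal sketch, not the MetaLoader implementation: the function name handle_missing and the choice of KeyError are assumptions made for illustration.

```python
import warnings


def handle_missing(message: str, errors: str = "raise") -> None:
    """Apply a 'raise' / 'warn' / 'ignore' policy to a missing-data message.

    Hypothetical helper: the name and the exception type are illustrative only.
    """
    if errors == "raise":
        raise KeyError(message)
    if errors == "warn":
        warnings.warn(message)
    elif errors != "ignore":
        # Reject anything other than the three documented values.
        raise ValueError(f"errors must be 'raise', 'warn' or 'ignore', got {errors!r}")
```

With errors='warn' the caller keeps going after a warning, while errors='ignore' skips the problem silently.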

The MetaLoader can be used to query all available fields across all datasets. In the following example, 4 datasets are available.

ml = MetaLoader()
ml

MetaLoader for: examples/*
with 81 fields
4 datasets:
['cgm'
 'diet_logging'
 'fundus'
 'sleep']

The object contains only the data dictionaries (metadata) of these datasets, where the columns correspond to columns in the data tables of the dataset (e.g., fundus).

ml.dicts['fundus']

tabular_field_name	fundus_image_left	fundus_image_right	collection_date
dataset	fundus	fundus	fundus
field_string	Fundus image (left)	Fundus image (right)	Collection date (YYYY-MM-DD)
description_string	Fundus image (left)	Fundus image (right)	Collection date (YYYY-MM-DD)
parent_dataframe	NaN	NaN	NaN
relative_location	/fundus/fundus.parquet	/fundus/fundus.parquet	/fundus/fundus.parquet
value_type	Text	Text	Date
units	None	None	Time
sampling_rate	NaN	NaN	NaN
item_type	Bulk	Bulk	Data
array	Single	Single	Single
cohorts	10K	10K	10K
data_type	image	image	tabular
debut	2021-02-17	2021-02-17	2021-02-17
pandas_dtype	string	string	datetime64[ns]

You can query fields from multiple datasets directly:

ml[['glucose', 'fundus_image_left']]

tabular_field_name	cgm/glucose	fundus/fundus_image_left
dataset	cgm	fundus
field_string	Glucose	Fundus image (left)
description_string	cgm temporal glucose values	Fundus image (left)
parent_dataframe	NaN	NaN
relative_location	/cgm/cgm.parquet	/fundus/fundus.parquet
value_type	Series data, continous	Text
units	mg/dl	None
sampling_rate	15min	NaN
item_type	Data	Bulk
array	Single	Single
cohorts	10K	10K
data_type	time series	image
debut	2018-12-27	2021-02-17
pandas_dtype	float	string
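
The cross-dataset lookup above can be approximated with plain pandas. The sketch below is an illustration under stated assumptions, not the library's code: the toy dicts mapping and the find_fields helper are hypothetical, and only a few property rows are copied from the tables above.

```python
import pandas as pd

# Toy data dictionaries: fields as columns, properties as rows.
dicts = {
    "cgm": pd.DataFrame(
        {"glucose": {"dataset": "cgm", "units": "mg/dl", "item_type": "Data"}}
    ),
    "fundus": pd.DataFrame(
        {"fundus_image_left": {"dataset": "fundus", "units": None, "item_type": "Bulk"}}
    ),
}


def find_fields(dicts, names):
    """Collect the named fields from every dictionary, labelled 'dataset/field'."""
    found = []
    for dataset, table in dicts.items():
        cols = [c for c in table.columns if c in names]
        # Prefix with the dataset name so equal field names stay distinguishable.
        found.append(table[cols].add_prefix(f"{dataset}/"))
    return pd.concat(found, axis=1)


result = find_fields(dicts, ["glucose", "fundus_image_left"])
print(list(result.columns))
```

Prefixing each column with its dataset mirrors the cgm/glucose and fundus/fundus_image_left labels in the output above.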

You can then use the MetaLoader to load the actual data of fields from multiple datasets. Here we load glucose from the CGM dataset, and fundus_image_left from the fundus dataset.

ml.load(['glucose', 'fundus_image_left']).head()

						glucose	fundus_image_left
participant_id	collection_timestamp	connection_id	cohort	research_stage	array_index
0	2020-05-25 10:48:00+03:00	1000001	10k	00_00_visit	0	111.6	/path/to/file
	2020-05-25 11:03:00+03:00	1000001	10k	00_00_visit	0	79.2	/path/to/file
	2020-05-25 11:18:00+03:00	1000001	10k	00_00_visit	0	84.6	/path/to/file
	2020-05-25 11:33:00+03:00	1000001	10k	00_00_visit	0	106.2	/path/to/file
	2020-05-25 11:48:00+03:00	1000001	10k	00_00_visit	0	102.6	/path/to/file
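
One way to picture how columns from two datasets end up side by side is a left merge on a shared index level. The source does not say how load() aligns the tables, so the sketch below is only an assumption: the toy tables, the reduced set of index levels, and the merge on participant_id are all illustrative.

```python
import pandas as pd

# Hypothetical toy tables; the real data is indexed by more levels than shown here.
cgm = pd.DataFrame(
    {"glucose": [111.6, 79.2]},
    index=pd.MultiIndex.from_tuples(
        [(0, "2020-05-25 10:48"), (0, "2020-05-25 11:03")],
        names=["participant_id", "collection_timestamp"],
    ),
)
fundus = pd.DataFrame(
    {"fundus_image_left": ["/path/to/file"]},
    index=pd.Index([0], name="participant_id"),
)

# Left-merge on the shared level so every glucose reading keeps its row
# and picks up the participant's image path.
combined = (
    cgm.reset_index()
    .merge(fundus.reset_index(), on="participant_id", how="left")
    .set_index(["participant_id", "collection_timestamp"])
)
print(combined)
```

A left merge keeps every time-series row, which matches the shape of the output above, where each glucose reading carries a fundus path.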

Search queries can be made more flexible by using regex and by matching against different properties of the fields. The get() and load() methods support the same syntax.

Example: get all bulk data fields.

ml.get('bulk', flexible=True, prop='item_type')

tabular_field_name	cgm/cgm_filename	fundus/fundus_image_left	fundus/fundus_image_right
dataset	cgm	fundus	fundus
field_string	CGM timeseries	Fundus image (left)	Fundus image (right)
description_string	Name of the file containing the participants' ...	Fundus image (left)	Fundus image (right)
parent_dataframe	NaN	NaN	NaN
relative_location	/cgm/cgm.parquet	/fundus/fundus.parquet	/fundus/fundus.parquet
value_type	Text	Text	Text
units	NaN	None	None
sampling_rate	NaN	NaN	NaN
item_type	Bulk	Bulk	Bulk
array	Single	Single	Single
cohorts	10K	10K	10K
data_type	text	image	image
debut	2018-12-27	2021-02-17	2021-02-17
pandas_dtype	string	string	string

Example: get all fields that include “mg” in their units.

ml.get('mg', flexible=True, prop='units')

tabular_field_name	cgm/1st qu_	cgm/3rd qu_	cgm/auc	cgm/ea1c	cgm/glucose	cgm/gmi	cgm/iqr	cgm/mad	cgm/mag	cgm/mage	...	cgm/modd	cgm/range	cgm/sd	cgm/sdb	cgm/sdbdm	cgm/sddm	cgm/sdhhmm	cgm/sdw	cgm/sdwsh	diet_logging/sodium_mg
dataset	cgm	cgm	cgm	cgm	cgm	cgm	cgm	cgm	cgm	cgm	...	cgm	cgm	cgm	cgm	cgm	cgm	cgm	cgm	cgm	diet_logging
field_string	1st quantile	3rd quantile	AUC	eA1C	Glucose	GMI	IQR	MAD	MAG	MAGE	...	MODD	Range	SD	SDb	SDbdm	SDdm	SDhhmm	SDw	SDwsh	Sodium intake per food logged
description_string	First quantile of all glucose values.	Third quantile of all glucose values.	Hourly average AUC. This measure integrates, t...	A linear transformation of the mean glucose va...	cgm temporal glucose values	A linear transformation of the mean glucose va...	Interquartile range (IQR), calculated as the d...	Median Absolute Deviation (MAD). This is a mea...	Mean Absolute Glucose (MAG). This is a measure...	Mean Amplitude of Glycemic Excursions (MAGE), ...	...	Mean difference between glucose values obtaine...	Difference between the maximum and minimum glu...	Standard deviation of all glucose values.	SD between days, within time points. Mean valu...	SD between days, within time points, corrected...	Horizontal SD. SD of the mean glucose values, ...	SD between time points. Standard deviation of ...	Vertical SD within days. Average value of the ...	SD within series. Taking hour-long intervals t...	Sodium intake per food logged
parent_dataframe	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
relative_location	/cgm/cgm.parquet	/cgm/cgm.parquet	/cgm/cgm.parquet	/cgm/cgm.parquet	/cgm/cgm.parquet	/cgm/cgm.parquet	/cgm/cgm.parquet	/cgm/cgm.parquet	/cgm/cgm.parquet	/cgm/cgm.parquet	...	/cgm/cgm.parquet	/cgm/cgm.parquet	/cgm/cgm.parquet	/cgm/cgm.parquet	/cgm/cgm.parquet	/cgm/cgm.parquet	/cgm/cgm.parquet	/cgm/cgm.parquet	/cgm/cgm.parquet	diet_logging/diet_logging.parquet
value_type	Continuous	Continuous	Continuous	Continuous	Series data, continous	Continuous	Continuous	Continuous	Continuous	Continuous	...	Continuous	Continuous	Continuous	Continuous	Continuous	Continuous	Continuous	Continuous	Continuous	Continuous
units	mg/dl	mg/dl	mg/dl*h	mg/dl	mg/dl	mg/dl	mg/dl	mg/dl	mg/dl	mg/dl	...	mg/dl	mg/dl	mg/dl	mg/dl	mg/dl	mg/dl	mg/dl	mg/dl	mg/dl	mg
sampling_rate	NaN	NaN	NaN	NaN	15min	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
item_type	Data	Data	Data	Data	Data	Data	Data	Data	Data	Data	...	Data	Data	Data	Data	Data	Data	Data	Data	Data	Data
array	Single	Single	Single	Single	Single	Single	Single	Single	Single	Single	...	Single	Single	Single	Single	Single	Single	Single	Single	Single	Single
cohorts	10K	10K	10K	10K	10K	10K	10K	10K	10K	10K	...	10K	10K	10K	10K	10K	10K	10K	10K	10K	10K
data_type	tabular	tabular	tabular	tabular	time series	tabular	tabular	tabular	tabular	tabular	...	tabular	tabular	tabular	tabular	tabular	tabular	tabular	tabular	tabular	Time Series
debut	2018-12-27	2018-12-27	2018-12-27	2018-12-27	2018-12-27	2018-12-27	2018-12-27	2018-12-27	2018-12-27	2018-12-27	...	2018-12-27	2018-12-27	2018-12-27	2018-12-27	2018-12-27	2018-12-27	2018-12-27	2018-12-27	2018-12-27	2019-09-01
pandas_dtype	float	float	float	float	float	float	float	float	float	float	...	float	float	float	float	float	float	float	float	float	float

14 rows × 24 columns
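
The property-based searches shown above can be imitated with a regex match over one row of a combined dictionary. This is a minimal sketch, not the library's implementation: search_by_property and the toy fields table are hypothetical, and case-insensitive matching is an assumption suggested by 'bulk' matching Bulk in the earlier example.

```python
import re

import pandas as pd

# Toy combined dictionary: fields as columns, properties as rows.
fields = pd.DataFrame(
    {
        "cgm/glucose": {"units": "mg/dl", "item_type": "Data"},
        "cgm/cgm_filename": {"units": None, "item_type": "Bulk"},
        "diet_logging/sodium_mg": {"units": "mg", "item_type": "Data"},
    }
)


def search_by_property(fields, pattern, prop):
    """Keep the fields whose value for `prop` matches `pattern` (regex, case-insensitive)."""
    values = fields.loc[prop]
    # na=False treats missing property values as non-matching.
    mask = values.str.contains(pattern, flags=re.IGNORECASE, regex=True, na=False)
    return fields.loc[:, mask.to_numpy()]


print(list(search_by_property(fields, "mg", "units").columns))
print(list(search_by_property(fields, "bulk", "item_type").columns))
```

Because the pattern is a regex, a query such as 'mg' matches both mg and mg/dl, which is why the units example above returns CGM fields alongside diet_logging/sodium_mg.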