ml = MetaLoader()
ml
MetaLoader for: examples/*
with 81 fields
4 datasets:
['cgm' 'diet_logging' 'fundus' 'sleep']
MetaLoader (base_path:str='/home/ec2-user/studies/hpp/', cohort:str=None, flexible_field_search:bool=False, errors:str='raise', **kwargs)
Class that loads multiple data dictionaries and provides easy access to the relevant fields.
Args:
base_path (str, optional): The base path where the data is stored. Defaults to DATASETS_PATH.
cohort (str, optional): The name of the cohort within the dataset. Defaults to COHORT.
flexible_field_search (bool, optional): Whether to allow regex field search. Defaults to False.
errors (str, optional): Whether to raise an error or issue a warning if missing data is encountered.
Possible values are 'raise', 'warn' and 'ignore'. Defaults to 'raise'.
**kwargs: Additional keyword arguments to pass to a DataLoader class.
Attributes:
dicts (pd.DataFrame): A dictionary of data dictionaries (dataframes) of all available datasets in the base_path.
fields (list): A list of all fields.
cohort (str): The name of the cohort being used.
base_path (str): The base path where the data is stored.
flexible_field_search (bool): Whether to allow regex field search.
errors (str): Whether to raise an error or issue a warning if missing data is encountered.
kwargs (dict): Additional keyword arguments to pass to a DataLoader class.
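The three values of errors can be read as a simple policy for handling missing data. The following is a minimal sketch of such a policy, not the library's implementation: the helper name handle_missing and the choice of KeyError as the raised exception are assumptions made for illustration.

```python
import warnings


def handle_missing(message: str, errors: str = "raise") -> None:
    """Apply a 'raise' / 'warn' / 'ignore' policy to a missing-data event.

    Hypothetical helper: the real MetaLoader may raise a different
    exception type or word its warnings differently.
    """
    if errors == "raise":
        # Stop immediately so the caller notices the missing data.
        raise KeyError(message)
    if errors == "warn":
        # Report the problem but let execution continue.
        warnings.warn(message)
    elif errors != "ignore":
        # Reject anything outside the three documented values.
        raise ValueError(
            f"errors must be 'raise', 'warn' or 'ignore', got {errors!r}"
        )
```

With errors='ignore' the call returns silently, which is convenient for exploratory work but can hide incomplete data, so 'raise' is the safer default.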
The MetaLoader can be used to query all available fields across all datasets. In the following example, 4 datasets are available.
MetaLoader for: examples/*
with 81 fields
4 datasets:
['cgm' 'diet_logging' 'fundus' 'sleep']
The object contains only the data dictionaries (metadata) of these datasets. Each column of a data dictionary describes one column in the data tables of its dataset (e.g., fundus).
tabular_field_name | fundus_image_left | fundus_image_right | collection_date |
---|---|---|---|
dataset | fundus | fundus | fundus |
field_string | Fundus image (left) | Fundus image (right) | Collection date (YYYY-MM-DD) |
description_string | Fundus image (left) | Fundus image (right) | Collection date (YYYY-MM-DD) |
parent_dataframe | NaN | NaN | NaN |
relative_location | /fundus/fundus.parquet | /fundus/fundus.parquet | /fundus/fundus.parquet |
value_type | Text | Text | Date |
units | None | None | Time |
sampling_rate | NaN | NaN | NaN |
item_type | Bulk | Bulk | Data |
array | Single | Single | Single |
cohorts | 10K | 10K | 10K |
data_type | image | image | tabular |
debut | 2021-02-17 | 2021-02-17 | 2021-02-17 |
pandas_dtype | string | string | datetime64[ns] |
You can query fields from multiple datasets directly:
tabular_field_name | cgm/glucose | fundus/fundus_image_left |
---|---|---|
dataset | cgm | fundus |
field_string | Glucose | Fundus image (left) |
description_string | cgm temporal glucose values | Fundus image (left) |
parent_dataframe | NaN | NaN |
relative_location | /cgm/cgm.parquet | /fundus/fundus.parquet |
value_type | Series data, continous | Text |
units | mg/dl | None |
sampling_rate | 15min | NaN |
item_type | Data | Bulk |
array | Single | Single |
cohorts | 10K | 10K |
data_type | time series | image |
debut | 2018-12-27 | 2021-02-17 |
pandas_dtype | float | string |
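The idea behind a cross-dataset query can be sketched without the library. The example below is an illustrative stand-in, not the MetaLoader API: DICTS and get_fields are hypothetical names, and the data dictionaries are reduced to plain Python dicts holding a few property values taken from the tables on this page.

```python
# Stand-in for the data dictionaries: dataset -> field name -> properties.
# Only a few properties are kept to keep the sketch short.
DICTS = {
    "cgm": {
        "glucose": {"item_type": "Data", "data_type": "time series"},
        "cgm_filename": {"item_type": "Bulk", "data_type": "text"},
    },
    "fundus": {
        "fundus_image_left": {"item_type": "Bulk", "data_type": "image"},
        "fundus_image_right": {"item_type": "Bulk", "data_type": "image"},
    },
}


def get_fields(dicts, names):
    """Return {'dataset/field': properties} for every requested field name."""
    found = {}
    for name in names:
        # Look for an exact field-name match in every dataset.
        matches = {
            f"{dataset}/{name}": fields[name]
            for dataset, fields in dicts.items()
            if name in fields
        }
        if not matches:
            raise KeyError(f"field {name!r} not found in any dataset")
        found.update(matches)
    return found


result = get_fields(DICTS, ["glucose", "fundus_image_left"])
print(list(result))  # ['cgm/glucose', 'fundus/fundus_image_left']
```

Prefixing each match with its dataset name mirrors the cgm/glucose and fundus/fundus_image_left column labels above, and keeps fields distinguishable when two datasets share a field name.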
You can then use the MetaLoader to load the actual data of fields from multiple datasets. Here we load glucose from the CGM dataset and fundus_image_left from the fundus dataset.
participant_id | collection_timestamp | connection_id | cohort | research_stage | array_index | glucose | fundus_image_left |
---|---|---|---|---|---|---|---|
0 | 2020-05-25 10:48:00+03:00 | 1000001 | 10k | 00_00_visit | 0 | 111.6 | /path/to/file |
0 | 2020-05-25 11:03:00+03:00 | 1000001 | 10k | 00_00_visit | 0 | 79.2 | /path/to/file |
0 | 2020-05-25 11:18:00+03:00 | 1000001 | 10k | 00_00_visit | 0 | 84.6 | /path/to/file |
0 | 2020-05-25 11:33:00+03:00 | 1000001 | 10k | 00_00_visit | 0 | 106.2 | /path/to/file |
0 | 2020-05-25 11:48:00+03:00 | 1000001 | 10k | 00_00_visit | 0 | 102.6 | /path/to/file |
You can also run more flexible search queries that use regex and match against various properties of the fields. The get() and load() methods support the same syntax.
tabular_field_name | cgm/cgm_filename | fundus/fundus_image_left | fundus/fundus_image_right |
---|---|---|---|
dataset | cgm | fundus | fundus |
field_string | CGM timeseries | Fundus image (left) | Fundus image (right) |
description_string | Name of the file containing the participants' ... | Fundus image (left) | Fundus image (right) |
parent_dataframe | NaN | NaN | NaN |
relative_location | /cgm/cgm.parquet | /fundus/fundus.parquet | /fundus/fundus.parquet |
value_type | Text | Text | Text |
units | NaN | None | None |
sampling_rate | NaN | NaN | NaN |
item_type | Bulk | Bulk | Bulk |
array | Single | Single | Single |
cohorts | 10K | 10K | 10K |
data_type | text | image | image |
debut | 2018-12-27 | 2021-02-17 | 2021-02-17 |
pandas_dtype | string | string | string |
tabular_field_name | cgm/1st qu_ | cgm/3rd qu_ | cgm/auc | cgm/ea1c | cgm/glucose | cgm/gmi | cgm/iqr | cgm/mad | cgm/mag | cgm/mage | ... | cgm/modd | cgm/range | cgm/sd | cgm/sdb | cgm/sdbdm | cgm/sddm | cgm/sdhhmm | cgm/sdw | cgm/sdwsh | diet_logging/sodium_mg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dataset | cgm | cgm | cgm | cgm | cgm | cgm | cgm | cgm | cgm | cgm | ... | cgm | cgm | cgm | cgm | cgm | cgm | cgm | cgm | cgm | diet_logging |
field_string | 1st quantile | 3rd quantile | AUC | eA1C | Glucose | GMI | IQR | MAD | MAG | MAGE | ... | MODD | Range | SD | SDb | SDbdm | SDdm | SDhhmm | SDw | SDwsh | Sodium intake per food logged |
description_string | First quantile of all glucose values. | Third quantile of all glucose values. | Hourly average AUC. This measure integrates, t... | A linear transformation of the mean glucose va... | cgm temporal glucose values | A linear transformation of the mean glucose va... | Interquartile range (IQR), calculated as the d... | Median Absolute Deviation (MAD). This is a mea... | Mean Absolute Glucose (MAG). This is a measure... | Mean Amplitude of Glycemic Excursions (MAGE), ... | ... | Mean difference between glucose values obtaine... | Difference between the maximum and minimum glu... | Standard deviation of all glucose values. | SD between days, within time points. Mean valu... | SD between days, within time points, corrected... | Horizontal SD. SD of the mean glucose values, ... | SD between time points. Standard deviation of ... | Vertical SD within days. Average value of the ... | SD within series. Taking hour-long intervals t... | Sodium intake per food logged |
parent_dataframe | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
relative_location | /cgm/cgm.parquet | /cgm/cgm.parquet | /cgm/cgm.parquet | /cgm/cgm.parquet | /cgm/cgm.parquet | /cgm/cgm.parquet | /cgm/cgm.parquet | /cgm/cgm.parquet | /cgm/cgm.parquet | /cgm/cgm.parquet | ... | /cgm/cgm.parquet | /cgm/cgm.parquet | /cgm/cgm.parquet | /cgm/cgm.parquet | /cgm/cgm.parquet | /cgm/cgm.parquet | /cgm/cgm.parquet | /cgm/cgm.parquet | /cgm/cgm.parquet | diet_logging/diet_logging.parquet |
value_type | Continuous | Continuous | Continuous | Continuous | Series data, continous | Continuous | Continuous | Continuous | Continuous | Continuous | ... | Continuous | Continuous | Continuous | Continuous | Continuous | Continuous | Continuous | Continuous | Continuous | Continuous |
units | mg/dl | mg/dl | mg/dl*h | mg/dl | mg/dl | mg/dl | mg/dl | mg/dl | mg/dl | mg/dl | ... | mg/dl | mg/dl | mg/dl | mg/dl | mg/dl | mg/dl | mg/dl | mg/dl | mg/dl | mg |
sampling_rate | NaN | NaN | NaN | NaN | 15min | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
item_type | Data | Data | Data | Data | Data | Data | Data | Data | Data | Data | ... | Data | Data | Data | Data | Data | Data | Data | Data | Data | Data |
array | Single | Single | Single | Single | Single | Single | Single | Single | Single | Single | ... | Single | Single | Single | Single | Single | Single | Single | Single | Single | Single |
cohorts | 10K | 10K | 10K | 10K | 10K | 10K | 10K | 10K | 10K | 10K | ... | 10K | 10K | 10K | 10K | 10K | 10K | 10K | 10K | 10K | 10K |
data_type | tabular | tabular | tabular | tabular | time series | tabular | tabular | tabular | tabular | tabular | ... | tabular | tabular | tabular | tabular | tabular | tabular | tabular | tabular | tabular | Time Series |
debut | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | ... | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2018-12-27 | 2019-09-01 |
pandas_dtype | float | float | float | float | float | float | float | float | float | float | ... | float | float | float | float | float | float | float | float | float | float |
14 rows × 24 columns
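A regex search over field properties can be sketched as follows. This is an illustration of the general technique under stated assumptions, not the library's code: FIELDS, search_fields, and its prop parameter are hypothetical names, the property values are copied from the tables above, and the real get() and load() methods may differ in their keyword arguments and in case sensitivity.

```python
import re

# Stand-in metadata: 'dataset/field' -> selected properties.
FIELDS = {
    "cgm/glucose": {"units": "mg/dl", "item_type": "Data"},
    "cgm/auc": {"units": "mg/dl*h", "item_type": "Data"},
    "diet_logging/sodium_mg": {"units": "mg", "item_type": "Data"},
    "fundus/fundus_image_left": {"units": None, "item_type": "Bulk"},
}


def search_fields(fields, pattern, prop=None):
    """Return field names whose name (or given property) matches a regex."""
    regex = re.compile(pattern)
    hits = []
    for name, props in fields.items():
        # Search the field name by default, or one property if requested.
        target = name if prop is None else props.get(prop)
        # Skip missing or non-text values such as None.
        if isinstance(target, str) and regex.search(target):
            hits.append(name)
    return hits


by_units = search_fields(FIELDS, r"^mg", prop="units")
by_name = search_fields(FIELDS, r"fundus_image")
print(by_units)  # ['cgm/glucose', 'cgm/auc', 'diet_logging/sodium_mg']
print(by_name)   # ['fundus/fundus_image_left']
```

Searching on a property such as units rather than on the field name is what allows a single query to return fields from unrelated datasets, as in the last table, where cgm fields and diet_logging/sodium_mg appear together.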