dl = PhenoLoader('fundus')
dl
DataLoader for fundus with
78 fields
2 tables: ['fundus', 'age_sex']
PhenoLoader (dataset:str, base_path:str='/home/ec2-user/studies/hpp/', cohort:str=None, age_sex_dataset:str='events', skip_dfs:List[str]=[], unique_index:bool=False, valid_dates:bool=False, valid_stage:bool=False, flexible_field_search:bool=False, errors:str='raise', read_parquet_kwargs:Dict[str,Any]={})
Class that loads multiple tables from a dataset and provides easy access to their fields.
Args:
dataset (str): The name of the dataset to load.
base_path (str, optional): The base path where the data is stored. Defaults to DATASETS_PATH.
cohort (str, optional): The name of the cohort within the dataset. Defaults to COHORT.
age_sex_dataset (str, optional): The name of the dataset to use for computing age and sex. Defaults to EVENTS_DATASET.
skip_dfs (list, optional): A list of tables (or substrings that match table names) to skip when loading the data. Defaults to [].
unique_index (bool, optional): Whether to ensure the index of the data is unique. Defaults to False.
valid_dates (bool, optional): Whether to ensure that all timestamps in the data are valid dates. Defaults to False.
valid_stage (bool, optional): Whether to ensure that all research stages in the data are valid. Defaults to False.
flexible_field_search (bool, optional): Whether to allow regex field search. Defaults to False.
errors (str, optional): Whether to raise an error or issue a warning if missing data is encountered.
Possible values are 'raise', 'warn' and 'ignore'. Defaults to 'raise'.
Attributes:
dict (pd.DataFrame): The data dictionary for the dataset, containing information about each field.
dfs (dict): A dictionary of dataframes, one for each table in the dataset.
fields (list): A list of all fields in the dataset.
dataset (str): The name of the dataset being used.
cohort (str): The name of the cohort being used.
base_path (str): The base path where the data is stored.
dataset_path (str): The full path to the dataset being used.
age_sex_dataset (str): The name of the dataset being used to compute age and sex.
skip_dfs (list): A list of tables to skip when loading the data.
unique_index (bool): Whether to ensure the index of the data is unique.
valid_dates (bool): Whether to ensure that all timestamps in the data are valid dates.
valid_stage (bool): Whether to ensure that all research stages in the data are valid.
flexible_field_search (bool): Whether to allow regex field search.
errors (str): Whether to raise an error or issue a warning if missing data is encountered.
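The three values of the errors argument can be pictured with a small sketch. This is not the library's implementation: report_missing and select_fields are hypothetical helpers, and the choice of KeyError is an assumption made only for illustration.

```python
import warnings


def report_missing(message: str, errors: str = "raise") -> None:
    """Handle a missing-data condition according to an errors policy."""
    if errors == "raise":
        # Assumption: a KeyError is used here purely for illustration.
        raise KeyError(message)
    if errors == "warn":
        warnings.warn(message)
        return
    if errors == "ignore":
        return
    raise ValueError(f"errors must be 'raise', 'warn' or 'ignore', got {errors!r}")


def select_fields(available, requested, errors="raise"):
    """Return the requested fields that exist and report the missing ones."""
    found = [f for f in requested if f in available]
    for f in requested:
        if f not in available:
            report_missing(f"Field {f!r} not found", errors)
    return found


# With errors='ignore', the missing field is silently dropped.
kept = select_fields(["age", "sex"], ["age", "bmi"], errors="ignore")
```

With errors='warn' the same call would emit a warning and still return the fields that were found, while the default 'raise' stops at the first missing field.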
Use the dataset name to load the dataset. It may contain multiple tables. Age and sex are added to the data by default. The default base_path is set to work on the research platform.
The PhenoLoader class contains several useful attributes.
The data dictionary of the dataset, available as the dict attribute, displays the description of each field.
tabular_field_name | field_string | description_string | parent_dataframe | relative_location | value_type | units | sampling_rate | item_type | array | cohorts | data_type | debut | pandas_dtype |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fundus_image_left | Fundus image (left) | Fundus image (left) | NaN | /fundus/fundus.parquet | Text | None | NaN | Bulk | Single | 10K | image | 2021-02-17 | string |
fundus_image_right | Fundus image (right) | Fundus image (right) | NaN | /fundus/fundus.parquet | Text | None | NaN | Bulk | Single | 10K | image | 2021-02-17 | string |
collection_date | Collection date (YYYY-MM-DD) | Collection date (YYYY-MM-DD) | NaN | /fundus/fundus.parquet | Date | Time | NaN | Data | Single | 10K | tabular | 2021-02-17 | datetime64[ns] |
The loaded tables are stored as DataFrames in the dfs attribute. The first rows of the fundus table are shown here.
participant_id | cohort | research_stage | array_index | fundus_image_left | fundus_image_right | collection_date | artery_average_width_left | artery_average_width_right | artery_distance_tortuosity_left | artery_distance_tortuosity_right | artery_fractal_dimension_left | artery_fractal_dimension_right | artery_squared_curvature_tortuosity_left | ... | vein_fractal_dimension_left | vein_fractal_dimension_right | vein_squared_curvature_tortuosity_left | vein_squared_curvature_tortuosity_right | vein_tortuosity_density_left | vein_tortuosity_density_right | vein_vessel_density_left | vein_vessel_density_right | vessel_density_left | vessel_density_right |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10k | 00_00_visit | 0 | /path/to/file | /path/to/file | 2022-11-16 | 18430.284751 | 19038.547771 | 3.668175 | 3.271147 | 1.355673 | 1.343602 | 40.648267 | ... | 1.410553 | 1.403108 | 14.208195 | 6.098432 | 0.700187 | 0.698546 | 0.046645 | 0.045864 | 0.080377 | 0.078671 |
1 | 10k | 00_00_visit | 0 | /path/to/file | /path/to/file | 2022-06-30 | 17315.398780 | 19099.489575 | 2.095461 | 1.634782 | 1.368933 | 1.363413 | 24.253169 | ... | 1.387527 | 1.332864 | 8.999069 | 8.702682 | 0.740806 | 0.708911 | 0.037896 | 0.046853 | 0.074197 | 0.064578 |
2 | 10k | 00_00_visit | 0 | /path/to/file | /path/to/file | 2021-10-05 | 15375.866993 | 19855.576862 | 2.776472 | 2.747015 | 1.360404 | 1.362699 | 9.742353 | ... | 1.411881 | 1.408791 | 13.119227 | 9.936669 | 0.627281 | 0.675100 | 0.053022 | 0.048063 | 0.079515 | 0.082102 |
3 rows × 76 columns
All available fields (columns) across all tables can be listed through the fields attribute.
['artery_average_width_left',
'artery_average_width_right',
'artery_distance_tortuosity_left',
'artery_distance_tortuosity_right',
'artery_fractal_dimension_left']
Access any of the fields (e.g., vein_average_width_right, age) or indices (e.g., research_stage) from any of the tables via the data loader API.
participant_id | cohort | research_stage | array_index | research_stage | vein_average_width_right | age | sex |
---|---|---|---|---|---|---|---|
0 | 10k | 00_00_visit | 0 | 00_00_visit | 18436.428634 | 43.5 | 0 |
1 | 10k | 00_00_visit | 0 | 00_00_visit | 18888.160314 | 53.7 | 1 |
2 | 10k | 00_00_visit | 0 | 00_00_visit | 19013.865043 | 26.2 | 0 |
3 | 10k | 00_00_visit | 0 | 00_00_visit | 18809.012493 | 44.6 | 1 |
4 | 10k | 00_00_visit | 0 | 00_00_visit | 19428.986690 | 50.3 | 0 |
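The idea of reaching a field without knowing which table holds it can be sketched with plain dictionaries. TinyLoader is a hypothetical stand-in written for this illustration, not the PhenoLoader implementation, and the values are rounded toy numbers.

```python
class TinyLoader:
    """Hypothetical stand-in that looks up fields across several tables."""

    def __init__(self, tables):
        # tables maps a table name to {field_name: [values, ...]}
        self.tables = tables

    @property
    def fields(self):
        # Sorted, de-duplicated list of every field in every table.
        return sorted({f for cols in self.tables.values() for f in cols})

    def __getitem__(self, names):
        if isinstance(names, str):
            names = [names]
        out = {}
        for name in names:
            # Take the field from the first table that contains it.
            for cols in self.tables.values():
                if name in cols:
                    out[name] = cols[name]
                    break
            else:
                raise KeyError(name)
        return out


loader = TinyLoader({
    "fundus": {"vein_average_width_right": [18436.4, 18888.2]},
    "age_sex": {"age": [43.5, 53.7], "sex": [0, 1]},
})

# One request spans two tables; the caller never names a table.
selected = loader[["vein_average_width_right", "age"]]
all_fields = loader.fields
```

The real loader returns DataFrames that share a common index; the sketch only shows the lookup across tables.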
Access time series or bulk data that is stored separately for each sample via the data loader API. In the following example, the data loader retrieves the relative path of each sample's bulk file from the main table (where it is stored in the field fundus_image_left), converts it to an absolute path, and loads the file. This is repeated for 2 samples and the results are returned as a list. In the case of parquet DataFrames, there is no need to define the load_func, and multiple DataFrames are concatenated by default.
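The three steps described above (read the relative path, resolve it to an absolute path, load the file) can be sketched with the standard library. load_bulk and read_text are hypothetical helpers, and the temporary text files merely stand in for real bulk files; none of this is the library's own code.

```python
import os
import tempfile


def load_bulk(relative_paths, base_path, load_func):
    """Resolve each relative path against base_path and load the file."""
    loaded = []
    for rel in relative_paths:
        # Strip the leading separator so os.path.join keeps base_path.
        abs_path = os.path.join(base_path, rel.lstrip("/"))
        loaded.append(load_func(abs_path))
    return loaded


def read_text(path):
    """Toy load_func: read the whole file as text."""
    with open(path, encoding="utf-8") as fh:
        return fh.read()


with tempfile.TemporaryDirectory() as base:
    # Relative paths as they might appear in a column of the main table.
    rel_paths = ["/bulk/sample_0.txt", "/bulk/sample_1.txt"]
    os.makedirs(os.path.join(base, "bulk"))
    for i, rel in enumerate(rel_paths):
        target = os.path.join(base, rel.lstrip("/"))
        with open(target, "w", encoding="utf-8") as fh:
            fh.write(f"sample {i}")

    # Two samples in, a list of two loaded objects out.
    samples = load_bulk(rel_paths, base, read_text)
```

Passing the loader function as an argument keeps the path handling independent of the file format, which is the role load_func plays in the description above.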
To enable flexible field search (with regex support), set flexible_field_search=True when initializing the PhenoLoader.
For example, the following command searches for any field starting with "fractal".
participant_id | cohort | research_stage | array_index | fractal_dimension_left | fractal_dimension_right |
---|---|---|---|---|---|
0 | 10k | 00_00_visit | 0 | 1.564989 | 1.520885 |
1 | 10k | 00_00_visit | 0 | 1.542311 | 1.534158 |
2 | 10k | 00_00_visit | 0 | 1.482051 | 1.545097 |
3 | 10k | 00_00_visit | 0 | 1.548773 | 1.539352 |
4 | 10k | 00_00_visit | 0 | 1.554922 | 1.557029 |
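Regex matching against field names can be sketched with Python's re module. search_fields is a hypothetical helper for illustration and says nothing about how the loader performs the search internally.

```python
import re


def search_fields(pattern, fields):
    """Return the fields whose names match a regular expression."""
    regex = re.compile(pattern)
    return [f for f in fields if regex.search(f)]


fields = [
    "fractal_dimension_left",
    "fractal_dimension_right",
    "artery_fractal_dimension_left",
    "vessel_density_left",
]

# '^' anchors the pattern to the start of the field name.
starts_with = search_fields("^fractal", fields)

# Without the anchor, the pattern matches anywhere in the name.
contains = search_fields("fractal", fields)
```

Here starts_with holds only the two fields that begin with "fractal", while contains also picks up artery_fractal_dimension_left.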
You can summarize a field or a set of fields. The summary combines the data dictionary entry of each field with basic statistics of its values:
fundus_image_right | collection_date | |
---|---|---|
field_string | Fundus image (right) | Collection date (YYYY-MM-DD) |
description_string | Fundus image (right) | Collection date (YYYY-MM-DD) |
parent_dataframe | NaN | NaN |
relative_location | /fundus/fundus.parquet | /fundus/fundus.parquet |
value_type | Text | Date |
units | None | Time |
sampling_rate | NaN | NaN |
item_type | Bulk | Data |
array | Single | Single |
cohorts | 10K | 10K |
data_type | image | tabular |
debut | 2021-02-17 | 2021-02-17 |
pandas_dtype | string | datetime64[ns] |
count | 5 | 5 |
unique | 1 | 5 |
most_frequent | /path/to/file | 2021-10-05 |
min | NaN | NaN |
max | NaN | NaN |
mean | NaN | NaN |
median | NaN | NaN |
std | NaN | NaN |
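The count, unique and most_frequent rows of such a summary can be reproduced for a non-numeric field with a short sketch. summarize is a hypothetical helper, and skipping None values is an assumption made for this example.

```python
from collections import Counter


def summarize(values):
    """Compute count, unique and most_frequent, skipping None values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    most_frequent = counts.most_common(1)[0][0] if counts else None
    return {
        "count": len(present),
        "unique": len(counts),
        "most_frequent": most_frequent,
    }


# Five samples that all point at the same placeholder path.
paths = ["/path/to/file"] * 5
summary = summarize(paths)
```

For these inputs the sketch yields a count of 5, a single unique value, and /path/to/file as the most frequent entry, in line with the fundus_image_right column above.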