Basic analysis

Basic analysis tools

custom_describe

 custom_describe (df:pandas.core.frame.DataFrame)

Generates a custom summary statistics dataframe for mixed data types.

Args:
    df: The input pandas DataFrame.

Returns:
    A pandas DataFrame containing the summary statistics.

data = generate_synthetic_data(n=100)

custom_describe(data[["date_of_research_stage", "sex", "val2"]])
| | date_of_research_stage | sex | val2 |
|---|---|---|---|
| count | 100 | 100.0 | 100.0 |
| unique | 99 | 2.0 | 100.0 |
| most_frequent | NaN | 0.0 | -3.137486 |
| min | 2020-01-04 00:00:00 | 0.0 | -3.137486 |
| max | 2023-06-25 00:00:00 | 1.0 | 41.432775 |
| mean | NaN | 0.46 | 19.091836 |
| median | NaN | 0.0 | 19.601733 |
| std | NaN | 0.500908 | 10.401328 |
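
For readers curious how a summary like this can be assembled for mixed column types, here is a minimal sketch. It is not the library's implementation: the function name `describe_mixed_sketch` is hypothetical, and details such as how `most_frequent` is handled for date columns may differ from `custom_describe`.

```python
import numpy as np
import pandas as pd

# Row labels of the summary, in the order they should appear.
STATS = ["count", "unique", "most_frequent", "min", "max", "mean", "median", "std"]


def describe_mixed_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of a describe() variant for mixed dtypes."""
    columns = {}
    for col in df.columns:
        s = df[col].dropna()
        # Mean, median and std are only computed for numeric columns.
        is_num = pd.api.types.is_numeric_dtype(s)
        columns[col] = pd.Series(
            {
                "count": s.count(),
                "unique": s.nunique(),
                # mode() returns sorted values, so ties resolve to the smallest.
                "most_frequent": s.mode().iloc[0] if len(s) else np.nan,
                "min": s.min(),
                "max": s.max(),
                "mean": s.mean() if is_num else np.nan,
                "median": s.median() if is_num else np.nan,
                "std": s.std() if is_num else np.nan,
            }
        )
    # One column per input column, one row per statistic.
    return pd.DataFrame(columns).reindex(STATS)
```

Dates and strings still get `count`, `unique`, `min`, and `max`, while the moment-based statistics are left as `NaN` for them, which matches the shape of the output shown above.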

assign_nearest_research_stage

 assign_nearest_research_stage (dataset:pandas.core.frame.DataFrame,
                                population:pandas.core.frame.DataFrame,
                                max_days:int=60,
                                stages:List[str]=['visit'],
                                agg:Optional[str]='first')

Assign the nearest research stage to each record in a dataset.

Args:
    dataset (pd.DataFrame): The dataset containing records to be assigned research stages.
    population (pd.DataFrame): The population data with participant_id, cohort, research_stage, and research_stage_date.
    max_days (int, optional): The maximum number of days allowed between the collection date and the research stage date. Defaults to 60.
    stages (List[str], optional): The types of research stages to consider. Defaults to ['visit'].
    agg (Union[str, None], optional): The aggregation function to use when (optionally) aggregating multiple rows from the same research stage. The rows are already sorted by distance from the date of the research stage. Can be 'first' (closest), 'last' (farthest), 'mean', 'min', 'max', or None. Defaults to 'first'.

Returns:
    pd.DataFrame: The dataset with the nearest research stage assigned to each record.
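
The matching logic described above can be illustrated with a small sketch. This is not the library's code: the function name `nearest_stage_sketch`, the `collection_date` column name, the helper columns `record_id` and `days_apart`, and the assumption that `research_stage` holds the stage type directly (for example `'visit'`) are all hypothetical choices made for the example.

```python
from typing import List, Optional

import pandas as pd


def nearest_stage_sketch(
    dataset: pd.DataFrame,
    population: pd.DataFrame,
    max_days: int = 60,
    stages: Optional[List[str]] = None,
    agg: Optional[str] = "first",
    date_col: str = "collection_date",  # assumed name of the record date column
) -> pd.DataFrame:
    """Hypothetical sketch of nearest-stage assignment."""
    stages = ["visit"] if stages is None else stages
    # Keep only the requested stage types.
    pop = population[population["research_stage"].isin(stages)]
    # Give each record an id so it can be tracked through the merge.
    records = dataset.reset_index(drop=True).rename_axis("record_id").reset_index()
    # Pair every record with every candidate stage of the same participant.
    merged = records.merge(pop, on="participant_id", how="inner")
    merged["days_apart"] = (
        (merged[date_col] - merged["research_stage_date"]).abs().dt.days
    )
    # Drop pairs farther apart than max_days, then sort closest first.
    merged = merged[merged["days_apart"] <= max_days]
    merged = merged.sort_values(["days_apart", "record_id"])
    # Each record keeps only its nearest stage.
    nearest = merged.drop_duplicates("record_id", keep="first")
    if agg is None:
        return nearest.sort_values("record_id").reset_index(drop=True)
    # Rows are sorted by distance, so 'first' is the closest record per stage.
    keys = ["participant_id", "research_stage", "research_stage_date"]
    return nearest.groupby(keys).agg(agg).reset_index()
```

With `agg=None` the sketch returns one row per record that has a stage within `max_days`; with `agg='first'` it collapses those rows to one per participant and stage. Aggregations such as `'mean'` may need non-numeric columns dropped first.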