Basic analysis

Basic analysis tools

custom_describe

 custom_describe (df:pandas.core.frame.DataFrame)

Generates a custom summary statistics dataframe for mixed data types.

Args: df: The input pandas DataFrame

Returns: A pandas DataFrame containing the summary statistics

data = generate_synthetic_data(n=100)

custom_describe(data[["date_of_research_stage", "sex", "val2"]])

	date_of_research_stage	sex	val2
count	100	100.0	100.0
unique	99	2.0	100.0
most_frequent	NaN	0.0	-3.137486
min	2020-01-04 00:00:00	0.0	-3.137486
max	2023-06-25 00:00:00	1.0	41.432775
mean	NaN	0.46	19.091836
median	NaN	0.0	19.601733
std	NaN	0.500908	10.401328
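
The idea of summarising mixed column types in one table can be illustrated with a minimal sketch. This is not the library's implementation of custom_describe: the function name describe_mixed_sketch and the ROWS constant are hypothetical, and the most_frequent row is left out because its exact semantics are not stated here. It assumes that order statistics (min, max) apply to any sortable column, while mean, median, and std apply only to numeric columns.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

# Hypothetical row order for the sketch (a subset of the rows shown above).
ROWS = ["count", "unique", "min", "max", "mean", "median", "std"]


def describe_mixed_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative stand-in, not the library's custom_describe."""
    summary = {}
    for col in df.columns:
        s = df[col].dropna()
        # Counts are defined for every dtype.
        stats = {"count": s.count(), "unique": s.nunique()}
        # min/max work for any sortable dtype (numbers, dates, strings).
        if len(s):
            stats["min"], stats["max"] = s.min(), s.max()
        # Moments are only computed for numeric, non-boolean columns.
        if is_numeric_dtype(s) and not is_bool_dtype(s):
            stats.update(mean=s.mean(), median=s.median(), std=s.std())
        summary[col] = stats
    # Statistics that were not computed for a column show up as NaN.
    return pd.DataFrame(summary).reindex(ROWS)
```

Under these assumptions, a datetime column gets count, unique, min, and max, and NaN for the moment rows, which matches the pattern in the output above.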


assign_nearest_research_stage

 assign_nearest_research_stage (dataset:pandas.core.frame.DataFrame,
                                population:pandas.core.frame.DataFrame,
                                max_days:int=60,
                                stages:List[str]=['visit'],
                                agg:Optional[str]='first')

Assign the nearest research stage to each record in a dataset.

Args:
    dataset (pd.DataFrame): The dataset containing records to be assigned research stages.
    population (pd.DataFrame): The population data with participant_id, cohort, research_stage, and research_stage_date.
    max_days (int, optional): The maximum number of days allowed between the collection date and research stage date. Defaults to 60.
    stages (List[str], optional): The list of types of research stages to consider. Defaults to ['visit'].
    agg (Union[str, None], optional): The aggregation function to be used when (optionally) aggregating multiple rows from the same research stage. The rows are already sorted by distance from the date of the research stage. Can be 'first' (closest), 'last' (farthest), 'mean', 'min', 'max', or None. Defaults to 'first'.

Returns: pd.DataFrame: The dataset with the nearest research stage assigned to each record.
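
The matching logic described above can be sketched in plain pandas. This is a minimal illustration under stated assumptions, not the library's implementation: the function name nearest_stage_sketch, the date_col parameter, the collection_date column, and the helper columns _record and days_from_stage are all hypothetical. It also assumes that a stage type such as 'visit' appears as a substring of the research_stage label, and that the dataset has a participant_id column but no cohort column.

```python
import pandas as pd


def nearest_stage_sketch(dataset, population, max_days=60,
                         stages=("visit",), agg="first",
                         date_col="collection_date"):
    """Illustrative stand-in, not the library's assign_nearest_research_stage."""
    keys = ["participant_id", "cohort", "research_stage", "research_stage_date"]

    # Keep only stages whose label contains one of the requested types
    # (substring matching is an assumption of this sketch).
    wanted = population["research_stage"].apply(
        lambda label: any(t in label for t in stages))
    pop = population.loc[wanted, keys]

    # Tag each record so it can be traced through the merge.
    records = dataset.reset_index(drop=True)
    records["_record"] = records.index

    # Pair every record with every candidate stage of the same participant,
    # then keep pairs that are at most max_days apart.
    merged = records.merge(pop, on="participant_id", how="inner")
    delta = merged[date_col] - merged["research_stage_date"]
    merged["days_from_stage"] = delta.abs().dt.days
    merged = merged[merged["days_from_stage"] <= max_days]

    # Sort by distance so the first pair per record is its nearest stage.
    merged = merged.sort_values("days_from_stage", kind="stable")
    nearest = merged.drop_duplicates("_record").drop(columns="_record")

    if agg is None:
        return nearest.reset_index(drop=True)
    # Rows stay sorted by distance within each group, so 'first' is the
    # closest record and 'last' the farthest. Aggregators such as 'mean'
    # assume the remaining columns support them.
    return nearest.groupby(keys, as_index=False).agg(agg)
```

With agg=None the sketch returns one row per matched record. With an aggregator it returns one row per participant and research stage, and records with no stage within max_days are dropped in both cases.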