Load and Prepare Data for Prognostic Models — load_and_prepare_data

Reads patient data from a CSV file, separates the feature, outcome, and time columns, and returns them in a form suitable for survival analysis models. Also performs basic data cleaning: removal of missing values (NA) and column type conversion.

Usage

load_and_prepare_data_pro(
  data_path,
  outcome_col_name,
  time_col_name,
  time_unit = c("day", "month", "year")
)

Arguments

data_path: A character string, the file path to the input CSV data. The first column is assumed to be a sample ID.
outcome_col_name: A character string, the name of the column containing event status (0 for censored, 1 for event).
time_col_name: A character string, the name of the column containing event or censoring time.
time_unit: A character string, the unit of time used in time_col_name. One of "day", "month", or "year". Times are converted to days internally.
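
The conversion to days described for time_unit could look roughly like the sketch below. The helper name convert_to_days and the conversion factors (365.25 / 12 days per month, 365.25 days per year) are assumptions made for illustration; this page does not state which factors the function actually uses.

```r
# Hypothetical helper, not part of the package: converts times to days.
# The conversion factors below are assumed, not taken from the package source.
convert_to_days <- function(time, time_unit = c("day", "month", "year")) {
  # match.arg() validates the choice and falls back to "day" when omitted
  time_unit <- match.arg(time_unit)
  days_per_unit <- c(day = 1, month = 365.25 / 12, year = 365.25)
  time * days_per_unit[[time_unit]]
}

convert_to_days(c(1, 2), "year")
#> [1] 365.25 730.50
```

Using match.arg() with a vector default mirrors the time_unit = c("day", "month", "year") signature shown under Usage, where the first element acts as the default.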

Value

A list containing:

X: A data frame of features (all columns except ID, outcome, and time).
Y_surv: A survival::Surv object created from time and outcome.
sample_ids: A vector of sample IDs (the first column of the input data).
outcome_numeric: A numeric vector of outcome status.
time_numeric: A numeric vector of time, converted to days.
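
To make the shape of the returned list concrete, here is a minimal sketch that assembles the same five elements from an in-memory data frame. The function prepare_sketch and the toy data are hypothetical, and the cleaning rules (dropping any row with an NA, turning character columns into factors) are assumptions about what the "basic data cleaning" involves. Time unit conversion is left out for brevity.

```r
# Hypothetical stand-in for the documented function; works on a data frame
# instead of a CSV path and skips time unit conversion.
prepare_sketch <- function(df, outcome_col_name, time_col_name) {
  # Assumed cleaning step: drop every row that contains an NA
  df <- df[stats::complete.cases(df), , drop = FALSE]
  status <- as.numeric(df[[outcome_col_name]])
  time <- as.numeric(df[[time_col_name]])
  # Features: everything except the first (ID) column, outcome, and time
  feature_cols <- setdiff(names(df)[-1], c(outcome_col_name, time_col_name))
  X <- df[, feature_cols, drop = FALSE]
  # Assumed type conversion: character columns become factors
  is_chr <- vapply(X, is.character, logical(1))
  if (any(is_chr)) X[is_chr] <- lapply(X[is_chr], factor)
  list(
    X = X,
    Y_surv = survival::Surv(time, status),
    sample_ids = df[[1]],
    outcome_numeric = status,
    time_numeric = time
  )
}

toy <- data.frame(
  ID = c("P1", "P2", "P3"),
  Age = c(61, 54, NA),
  Stage = c("I", "II", "I"),
  Status = c(1, 0, 1),
  Time = c(120, 300, 45)
)
res <- prepare_sketch(toy, "Status", "Time")
names(res)
#> [1] "X"               "Y_surv"          "sample_ids"      "outcome_numeric" "time_numeric"
```

In this toy input the third row has a missing Age, so it is dropped and every element of res describes the remaining two patients.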

Examples

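# Create a temporary CSV file with simulated patient data
# (no seed is set, so the values printed below differ between runs)
temp_csv_path <- tempfile(fileext = ".csv")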
dummy_data <- data.frame(
  ID = paste0("Patient", 1:50),
  FeatureA = rnorm(50),
  FeatureB = runif(50, 0, 100),
  CategoricalFeature = sample(c("A", "B", "C"), 50, replace = TRUE),
  Outcome_Status = sample(c(0, 1), 50, replace = TRUE),
  Followup_Time_Months = runif(50, 10, 60)
)
write.csv(dummy_data, temp_csv_path, row.names = FALSE)

# Load and prepare data (follow-up times are in months and are converted to days)
prepared_data <- load_and_prepare_data_pro(
  data_path = temp_csv_path,
  outcome_col_name = "Outcome_Status",
  time_col_name = "Followup_Time_Months",
  time_unit = "month"
)

# Check prepared data structure
str(prepared_data$X)
#> 'data.frame':	50 obs. of  3 variables:
#>  $ FeatureA          : num  1.802 -0.946 -0.456 0.126 -2.691 ...
#>  $ FeatureB          : num  43.7 51.7 46.5 33.4 12.3 ...
#>  $ CategoricalFeature: Factor w/ 3 levels "A","B","C": 2 3 3 1 1 1 1 2 1 3 ...
print(prepared_data$Y_surv[1:5])
#> [1] 1808.3198+ 1124.3409+  710.2230  1600.5451   352.0851+

# Clean up dummy file
unlink(temp_csv_path)