Loads a CSV file of patient data, extracts the feature, outcome, and time columns, and prepares them in a format suitable for survival analysis models. Also performs basic data cleaning, such as NA removal and column type conversion.
Usage
load_and_prepare_data_pro(
data_path,
outcome_col_name,
time_col_name,
time_unit = c("day", "month", "year")
)
Arguments
- data_path
A character string, the file path to the input CSV data. The first column is assumed to be a sample ID.
- outcome_col_name
A character string, the name of the column containing event status (0 for censored, 1 for event).
- time_col_name
A character string, the name of the column containing event or censoring time.
- time_unit
A character string, the unit of time in time_col_name. Can be "day", "month", or "year". Times will be converted to days internally.
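The conversion of times to days can be pictured with a small sketch. The helper name to_days and the conversion factors (30.44 days per month, 365.25 days per year) are assumptions made for illustration; this page does not state which factors the function uses internally.

```r
# Hypothetical helper, not part of the package.
# The factors below are assumed average lengths, not documented values.
to_days <- function(time, time_unit = c("day", "month", "year")) {
  # match.arg() validates the unit and falls back to the first choice ("day")
  time_unit <- match.arg(time_unit)
  days_per_unit <- c(day = 1, month = 30.44, year = 365.25)
  as.numeric(time) * days_per_unit[[time_unit]]
}

to_days(c(1, 12), "month")  # months scaled to days
to_days(2, "year")          # years scaled to days
to_days(5)                  # unit defaults to "day", so the value is unchanged
```

Passing the choices as a vector default and resolving them with match.arg() mirrors the signature shown under Usage, and rejects an unsupported unit with an error instead of silently returning unscaled times.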
Value
A list containing:
- X: A data frame of features (all columns except ID, outcome, and time).
- Y_surv: A survival::Surv object created from time and outcome.
- sample_ids: A vector of sample IDs (the first column of the input data).
- outcome_numeric: A numeric vector of outcome status.
- time_numeric: A numeric vector of time, converted to days.
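As a rough illustration of how a list with these five elements could be assembled from a data frame, here is a minimal sketch. The function sketch_prepare and the toy data are hypothetical, dropping incomplete rows is only one possible reading of "NA removal", and times are assumed to be in days already; none of this is taken from the package's implementation.

```r
library(survival)

# Hypothetical sketch, not the package's implementation.
# Assumes the first column is the sample ID and times are already in days.
sketch_prepare <- function(df, outcome_col_name, time_col_name) {
  # Drop rows with any missing value (an assumed cleaning rule)
  df <- df[stats::complete.cases(df), , drop = FALSE]

  outcome <- as.numeric(df[[outcome_col_name]])
  time <- as.numeric(df[[time_col_name]])

  # Features are every column except the ID, outcome, and time columns
  feature_cols <- setdiff(names(df)[-1], c(outcome_col_name, time_col_name))

  list(
    X = df[, feature_cols, drop = FALSE],
    Y_surv = Surv(time, outcome),
    sample_ids = df[[1]],
    outcome_numeric = outcome,
    time_numeric = time
  )
}

# Toy input with one incomplete row (P3)
toy <- data.frame(
  ID = c("P1", "P2", "P3"),
  Age = c(61, 54, NA),
  Status = c(1, 0, 1),
  Days = c(120, 400, 90)
)

res <- sketch_prepare(toy, "Status", "Days")
str(res$X)
print(res$Y_surv)
```

Keeping drop = FALSE when subsetting means X stays a data frame even if only one feature column remains.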
Examples
temp_csv_path <- tempfile(fileext = ".csv")
dummy_data <- data.frame(
ID = paste0("Patient", 1:50),
FeatureA = rnorm(50),
FeatureB = runif(50, 0, 100),
CategoricalFeature = sample(c("A", "B", "C"), 50, replace = TRUE),
Outcome_Status = sample(c(0, 1), 50, replace = TRUE),
Followup_Time_Months = runif(50, 10, 60)
)
write.csv(dummy_data, temp_csv_path, row.names = FALSE)
# Load and prepare data
prepared_data <- load_and_prepare_data_pro(
data_path = temp_csv_path,
outcome_col_name = "Outcome_Status",
time_col_name = "Followup_Time_Months",
time_unit = "month"
)
# Check prepared data structure
str(prepared_data$X)
#> 'data.frame': 50 obs. of 3 variables:
#> $ FeatureA : num 1.802 -0.946 -0.456 0.126 -2.691 ...
#> $ FeatureB : num 43.7 51.7 46.5 33.4 12.3 ...
#> $ CategoricalFeature: Factor w/ 3 levels "A","B","C": 2 3 3 1 1 1 1 2 1 3 ...
print(prepared_data$Y_surv[1:5])
#> [1] 1808.3198+ 1124.3409+ 710.2230 1600.5451 352.0851+
# Clean up dummy file
unlink(temp_csv_path)
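One way the returned elements might feed a survival model is sketched below. So that the snippet runs on its own, it builds a stand-in list with the same element names as the documented return value instead of calling load_and_prepare_data_pro(); the stand-in data, the seed, and the choice of a Cox model are assumptions for illustration only.

```r
library(survival)

set.seed(1)
n <- 50

# Stand-in for the documented return value (simulated, not real output)
prepared <- list(
  X = data.frame(FeatureA = rnorm(n), FeatureB = runif(n, 0, 100)),
  outcome_numeric = sample(c(0, 1), n, replace = TRUE),
  time_numeric = runif(n, 300, 1800)
)

# Combine features with time and status so the model formula can refer to columns
model_data <- cbind(
  prepared$X,
  time = prepared$time_numeric,
  status = prepared$outcome_numeric
)

# Fit a Cox proportional hazards model on the numeric features
fit <- coxph(Surv(time, status) ~ FeatureA + FeatureB, data = model_data)
print(fit)
```

Rebuilding the Surv object inside the formula from time_numeric and outcome_numeric keeps every model variable in a single data frame, which tends to make later prediction on new data simpler.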