Train a classification model using CpGs as features for the given target data.
Usage
CimpleG(
train_data,
train_targets = NULL,
target_columns = NULL,
test_data = NULL,
test_targets = NULL,
method = c("CimpleG", "CimpleG_parab", "brute_force", "logistic_reg", "decision_tree",
"boost_tree", "mlp", "rand_forest"),
pred_type = c("both", "hypo", "hyper"),
engine = c("glmnet", "xgboost", "nnet", "ranger"),
rank_method = c("ac_rank", "a_rank", "c_rank"),
k_folds = 10,
grid_n = 10,
param_p = 2,
n_sigs = 1,
quantile_threshold = 0.005,
train_only = FALSE,
split_data = FALSE,
run_parallel = FALSE,
deconvolution_reference = TRUE,
save_dir = NULL,
save_format = c("zstd", "lz4", "gzip", "bzip2", "xz", "nocomp"),
verbose = 1,
targets = NULL
)
cimpleg(
train_data,
train_targets = NULL,
target_columns = NULL,
test_data = NULL,
test_targets = NULL,
method = c("CimpleG", "CimpleG_parab", "brute_force", "logistic_reg", "decision_tree",
"boost_tree", "mlp", "rand_forest"),
pred_type = c("both", "hypo", "hyper"),
engine = c("glmnet", "xgboost", "nnet", "ranger"),
rank_method = c("ac_rank", "a_rank", "c_rank"),
k_folds = 10,
grid_n = 10,
param_p = 2,
n_sigs = 1,
quantile_threshold = 0.005,
train_only = FALSE,
split_data = FALSE,
run_parallel = FALSE,
deconvolution_reference = TRUE,
save_dir = NULL,
save_format = c("zstd", "lz4", "gzip", "bzip2", "xz", "nocomp"),
verbose = 1,
targets = NULL
)
cpg(
train_data,
train_targets = NULL,
target_columns = NULL,
test_data = NULL,
test_targets = NULL,
method = c("CimpleG", "CimpleG_parab", "brute_force", "logistic_reg", "decision_tree",
"boost_tree", "mlp", "rand_forest"),
pred_type = c("both", "hypo", "hyper"),
engine = c("glmnet", "xgboost", "nnet", "ranger"),
rank_method = c("ac_rank", "a_rank", "c_rank"),
k_folds = 10,
grid_n = 10,
param_p = 2,
n_sigs = 1,
quantile_threshold = 0.005,
train_only = FALSE,
split_data = FALSE,
run_parallel = FALSE,
deconvolution_reference = TRUE,
save_dir = NULL,
save_format = c("zstd", "lz4", "gzip", "bzip2", "xz", "nocomp"),
verbose = 1,
targets = NULL
)
Arguments
- train_data
Training dataset. A matrix (s x f) with methylation data (Beta values) that will be used to train/find the predictors. Samples (s) must be in rows while features/CpGs (f) must be in columns.
- train_targets
A data frame with the training target samples one-hot encoded. It must have at least 1 column and as many rows as `train_data`, in the same order. Target columns need to be one-hot encoded: within a column, the target samples are encoded as `1` and every other sample as `0`.
- target_columns
A string specifying the name of the column in `train_targets` to be used for training. Can be a character vector if several columns in `train_targets` are to be used for training. In that case, CimpleG will search for the best predictors for each target sequentially or in parallel, depending on the value of `run_parallel`.
- test_data
Testing dataset. A matrix (s x f) with methylation data (Beta values) that will be used to test the performance of the found predictors. Samples (s) must be in rows while features/CpGs (f) must be in columns. If either `test_data` or `test_targets` is `NULL`, CimpleG will generate a stratified test dataset based on `train_targets` by removing 25 samples from `train_data` and `train_targets`.
- test_targets
A data frame with the testing target samples one-hot encoded. It must have at least 1 column and as many rows as `test_data`, in the same order. Target columns need to be one-hot encoded: within a column, the target samples are encoded as `1` and every other sample as `0`. If either `test_data` or `test_targets` is `NULL`, CimpleG will generate a stratified test dataset based on `train_targets` by removing 25 samples from `train_data` and `train_targets`.
- method
A string specifying the method or type of machine learning model/algorithm to be used for training. The methods are divided into two main groups: the simple models (classifiers that use a single feature), `CimpleG` (default), `brute_force`, `CimpleG_unscaled` or `oner`; and the complex models (classifiers that use several features), `logistic_reg`, `decision_tree`, `boost_tree`, `mlp` or `rand_forest`.
- pred_type
A string specifying the type of predictor/CpG to be searched for during training. Only used for simple models. One of `both` (default), `hypo` or `hyper`. If `hypo`, only hypomethylated predictors will be considered. If `hyper`, only hypermethylated predictors will be considered.
- engine
A string specifying the machine learning engine behind `method`. Only used for complex models. Currently not in use.
- rank_method
A string specifying the ranking strategy to rank the features during training.
- k_folds
An integer specifying the number of folds (K) to be used in training for the stratified K-fold cross-validation procedure.
- grid_n
An integer specifying the number of hyperparameter combinations to train for.
- param_p
An even number in `sigma / (delta^param_p)`. Tunes how much weight will be given to delta when doing feature selection. Default is `2`.
- n_sigs
Number of signatures to be saved for classification and used in deconvolution. Default is `1`.
- quantile_threshold
A number between 0 and 1. Determines how many features will be kept. Default is `0.005`.
- train_only
A boolean. If `TRUE`, CimpleG will only train (find predictors) and will not test them against a test dataset. Default is `FALSE`.
- split_data
A boolean, if `TRUE`, it will subset the train data provided, creating a smaller test set that will be used to test the models after training. This parameter is experimental. Default is `FALSE`.
- run_parallel
A boolean. If `FALSE` (the default), it will search for predictors for multiple targets sequentially. If `TRUE`, it will search for predictors for multiple targets at the same time (parallel processing) to save computation time. You need to set up `future::plan()` before running this function.
- deconvolution_reference
A boolean. If `TRUE`, it will create a deconvolution reference matrix based on the training data. This can later be used to perform deconvolution. Default is `TRUE`.
- save_dir
If defined, it will save the resulting model to the given directory. Default is `NULL`.
- save_format
Only used if `save_dir` is not `NULL`. One of "zstd", "lz4", "gzip", "bzip2", "xz", "nocomp". `zstd` is the best option: fast compression and loading times, low space usage.
- verbose
How verbose you want CimpleG to be while it is running. At 0, no message is displayed; at 3, every message is displayed. Default is `1`.
- targets
DEPRECATED. Use `target_columns` instead.
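The layout expected for `train_data` and `train_targets` can be illustrated with a small base R sketch. The sample names, CpG identifiers, cell-type labels and Beta values below are made up for illustration and are not part of the package's example data.

```r
# Toy Beta-value matrix: samples in rows, CpGs in columns (values are made up).
set.seed(1)
toy_data <- matrix(
  runif(6 * 3),
  nrow = 6,
  dimnames = list(paste0("sample", 1:6), c("cg001", "cg002", "cg003"))
)

# Hypothetical cell-type label for each sample, in the same order as the rows.
cell_type <- c("glia", "neurons", "glia", "other", "neurons", "other")

# One-hot encode: one column per target, 1 for target samples and 0 otherwise.
toy_targets <- data.frame(
  glia = as.integer(cell_type == "glia"),
  neurons = as.integer(cell_type == "neurons"),
  row.names = rownames(toy_data)
)

# Sanity checks: same number of rows, same order, only 0/1 values.
stopifnot(
  nrow(toy_targets) == nrow(toy_data),
  identical(rownames(toy_targets), rownames(toy_data)),
  all(unlist(toy_targets) %in% c(0L, 1L))
)
```

Columns of such a data frame, here `glia` and `neurons`, would then be passed by name through `target_columns`.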
Examples
library("CimpleG")
# read data
data(train_data)
data(train_targets)
data(test_data)
data(test_targets)
# run CimpleG
cimpleg_result <- CimpleG(
  train_data = train_data,
  train_targets = train_targets,
  test_data = test_data,
  test_targets = test_targets,
  method = "CimpleG",
  target_columns = c("glia", "neurons")
)
#> Training for target 'glia' with 'CimpleG' has finished.: 2.464 sec elapsed
#> Training for target 'neurons' with 'CimpleG' has finished.: 0.705 sec elapsed
# check signatures
cimpleg_result$signatures
#>         glia      neurons
#> "cg14501977" "cg24548498"
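When several targets are requested, the same call can be run in parallel, as described for `run_parallel`. The sketch below assumes the `future` package is installed; the `multisession` backend and the worker count of 2 are illustrative choices, not recommendations from the package. The code is guarded so that it only runs when both packages are available.

```r
# Only run when both packages are installed.
has_pkgs <- requireNamespace("CimpleG", quietly = TRUE) &&
  requireNamespace("future", quietly = TRUE)

if (has_pkgs) {
  library("CimpleG")
  data(train_data)
  data(train_targets)
  data(test_data)
  data(test_targets)

  # Set up the parallel backend before calling CimpleG
  # (backend and worker count are assumptions made for this sketch).
  old_plan <- future::plan(future::multisession, workers = 2)

  parallel_result <- CimpleG(
    train_data = train_data,
    train_targets = train_targets,
    test_data = test_data,
    test_targets = test_targets,
    method = "CimpleG",
    target_columns = c("glia", "neurons"),
    run_parallel = TRUE
  )

  # Restore whatever plan was active before.
  future::plan(old_plan)
}
```

With only two targets, the overhead of starting workers may outweigh the gain; parallel processing is more likely to pay off when `target_columns` lists many targets.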