Skip to contents

Train a classification model using (CpGs) as features for the given target data.

Usage

CimpleG(
  train_data,
  train_targets = NULL,
  target_columns = NULL,
  test_data = NULL,
  test_targets = NULL,
  method = c("CimpleG", "CimpleG_parab", "brute_force", "logistic_reg", "decision_tree",
    "boost_tree", "mlp", "rand_forest"),
  pred_type = c("both", "hypo", "hyper"),
  engine = c("glmnet", "xgboost", "nnet", "ranger"),
  rank_method = c("ac_rank", "a_rank", "c_rank"),
  k_folds = 10,
  grid_n = 10,
  param_p = 2,
  n_sigs = 1,
  quantile_threshold = 0.005,
  train_only = FALSE,
  split_data = FALSE,
  run_parallel = FALSE,
  deconvolution_reference = TRUE,
  save_dir = NULL,
  save_format = c("zstd", "lz4", "gzip", "bzip2", "xz", "nocomp"),
  verbose = 1,
  targets = NULL
)

cimpleg(
  train_data,
  train_targets = NULL,
  target_columns = NULL,
  test_data = NULL,
  test_targets = NULL,
  method = c("CimpleG", "CimpleG_parab", "brute_force", "logistic_reg", "decision_tree",
    "boost_tree", "mlp", "rand_forest"),
  pred_type = c("both", "hypo", "hyper"),
  engine = c("glmnet", "xgboost", "nnet", "ranger"),
  rank_method = c("ac_rank", "a_rank", "c_rank"),
  k_folds = 10,
  grid_n = 10,
  param_p = 2,
  n_sigs = 1,
  quantile_threshold = 0.005,
  train_only = FALSE,
  split_data = FALSE,
  run_parallel = FALSE,
  deconvolution_reference = TRUE,
  save_dir = NULL,
  save_format = c("zstd", "lz4", "gzip", "bzip2", "xz", "nocomp"),
  verbose = 1,
  targets = NULL
)

cpg(
  train_data,
  train_targets = NULL,
  target_columns = NULL,
  test_data = NULL,
  test_targets = NULL,
  method = c("CimpleG", "CimpleG_parab", "brute_force", "logistic_reg", "decision_tree",
    "boost_tree", "mlp", "rand_forest"),
  pred_type = c("both", "hypo", "hyper"),
  engine = c("glmnet", "xgboost", "nnet", "ranger"),
  rank_method = c("ac_rank", "a_rank", "c_rank"),
  k_folds = 10,
  grid_n = 10,
  param_p = 2,
  n_sigs = 1,
  quantile_threshold = 0.005,
  train_only = FALSE,
  split_data = FALSE,
  run_parallel = FALSE,
  deconvolution_reference = TRUE,
  save_dir = NULL,
  save_format = c("zstd", "lz4", "gzip", "bzip2", "xz", "nocomp"),
  verbose = 1,
  targets = NULL
)

Arguments

train_data

Training dataset. A matrix (s x f) with methylation data (Beta values) that will be used to train/find the predictors. Samples (s) must be in rows while features/CpGs (f) must be in columns.

train_targets

A data frame with the training target samples one-hot encoded. A data frame with at least 1 column, with as many rows and in the same order as `train_data`. Target columns need to be one-hot encoded, meaning that, for that column the target samples should be encoded as `1` while every other sample should be encoded as `0`.

target_columns

A string specifying the name of the column in `train_targets` to be used for training. Can be a character vector if there are several columns in `train_targets` to be used for training. If this argument is a character vector, CimpleG will search for the best predictors for each target sequentially or in parallel depending on the value of `run_parallel`

test_data

Testing dataset. A matrix (s x f) with methylation data (Beta values) that will be used to test the performance of the found predictors. Samples (s) must be in rows while features/CpGs (f) must be in columns. If `test_data` *OR* `test_targets` are NULL, CimpleG will generate a stratified test dataset based on `train_targets` by removing 25 samples from `train_data` and `train_targets`.

test_targets

A data frame with the testing target samples one-hot encoded. A data frame with at least 1 column, with as many rows and in the same order as `test_data`. Target columns need to be one-hot encoded, meaning that, for that column the target samples should be encoded as `1` while every other sample should be encoded as `0`. If `test_data` *OR* `test_targets` are NULL, CimpleG will generate a stratified test dataset based on `train_targets` by removing 25 samples from `train_data` and `train_targets`.

method

A string specifying the method or type of machine learning model/algorithm to be used for training. These are divided in two main groups. * The simple models (classifiers that use a single feature), `CimpleG` (default), `brute_force`, `CimpleG_unscaled` or `oner`; * the complex models (classifiers that use several features), `logistic_reg`, `decision_tree`, `boost_tree`, `mlp` or `rand_forest`.

pred_type

A string specifying the type of predictor/CpG to be searched for during training. Only used for simple models. One of `both` (default), `hypo` or `hyper`. If `hypo`, only hypomethylated predictors will be considered. If `hyper`, only hypermethylated predictors will be considered.

engine

A string specifying the machine learning engine behind `method`. Only used for complex models. Currently not in use.

rank_method

A string specifying the ranking strategy to rank the features during training.

k_folds

An integer specifying the number of folds (K) to be used in training for the stratified K-fold cross-validation procedure.

grid_n

An integer specifying the number of hyperparameter combinations to train for.

param_p

An even number in `sigma / (delta^param_p)`. Tunes how much weight will be given to delta when doing feature selection. Default is 2.

n_sigs

Number of signatures to be saved for classification and used in deconvolution. Default is 1.

quantile_threshold

A number between 0 and 1. Determines how many features will be kept. Default is 0.005.

train_only

A boolean, if TRUE, CimpleG will only train (find predictors) but not test them against a test dataset.

split_data

A boolean, if `TRUE`, it will subset the train data provided, creating a smaller test set that will be used to test the models after training. This parameter is experimental. Default is `FALSE`.

run_parallel

A boolean, if `FALSE`, the default, it will search for predictors for multiple targets sequentially. If `TRUE` it will search for predictors for multiple targets at the same time (parallel processing) in order to save in computational time. You need to set up `future::plan()` before running this function.

deconvolution_reference

A boolean, if `TRUE`, it will create a deconvolution reference matrix based on the training data. This can later be used to perform deconvolution. Default is `FALSE`.

save_dir

If defined it will save the resulting model to the given directory. Default is NULL.

save_format

Only used if save_dir is not NULL. One of "zstd", "lz4", "gzip", "bzip2","xz", "nocomp". zstd is the best option, fast compression and loading times, low space usage.

verbose

How verbose you want CimpleG to be while it is running. At 0, no message is displayed, at 3 every message is displayed. Default is 1.

targets

DEPRECATED use `target_columns`.

Value

A CimpleG object with the results per target class.

Examples

library("CimpleG")

# read data
data(train_data)
data(train_targets)
data(test_data)
data(test_targets)

# run CimpleG
cimpleg_result <- CimpleG(
  train_data = train_data,
  train_targets = train_targets,
  test_data = test_data,
  test_targets = test_targets,
  method = "CimpleG",
  target_columns = c("glia","neurons")
)
#> Training for target 'glia' with 'CimpleG' has finished.: 2.464 sec elapsed
#> Training for target 'neurons' with 'CimpleG' has finished.: 0.705 sec elapsed

# check signatures
cimpleg_result$signatures
#>         glia      neurons 
#> "cg14501977" "cg24548498"