Suppose you have a for loop that accumulates results in a vector, result:
result <- numeric(10)
for (i in 1:10) {
    ## do a lot of work, represented by...
    Sys.sleep(5)
    ## after the calculation add a value to the result
    result[i] <- sqrt(i)
}
Convert the body of the loop into a function that returns a value:
my_fun <- function(i) {
    ## do a lot of work, represented by...
    Sys.sleep(5)
    ## return the result
    sqrt(i)
}
and instead of using a for loop, use lapply():
result <- lapply(1:10, my_fun)
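Note that lapply() returns a list rather than the numeric vector the original loop produced; if the vector form is needed downstream, it is easy to recover. A minimal sketch (unlist() and vapply() are standard base R; the vapply() call re-runs my_fun(), so it is shown only for illustration):
## collapse the list back to a numeric vector
result_vector <- unlist(result)
## or enforce the type and length of each element directly
result_vector <- vapply(1:10, my_fun, numeric(1))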
So far so good, but it still takes 5 seconds x 10 tasks = 50 seconds:
system.time(result <- lapply(1:10, my_fun))
Now use BiocParallel to do the computation in parallel; if you have 10 cores, then this will take only about 5 seconds:
library(BiocParallel)
system.time(bpresult <- bplapply(1:10, my_fun))
identical(result, bpresult)
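By default, bplapply() uses whatever back-end is currently registered; you can also pass an explicit back-end and number of workers via the BPPARAM argument. A minimal sketch (the choice of 4 workers is arbitrary; SnowParam() is an alternative that also works on Windows, where MulticoreParam() is not supported):
## forked workers on Linux / macOS; adjust 'workers' to your machine
param <- MulticoreParam(workers = 4)
system.time(bpresult <- bplapply(1:10, my_fun, BPPARAM = param))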
Currently the easiest way to get a performance improvement is simply to request a machine with more ‘cores’. The amount of memory also needs to be enough to support all cores working simultaneously, so if one iteration of my_fun() takes 4 GB, and you request a machine with 16 cores, you would need 16 x 4 GB = 64 GB. Performing parallel computations like this on a single machine is much easier than mastering Spark or workflow systems, although in the long run these might be ‘better’ solutions.
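A back-of-the-envelope version of this calculation can be scripted when choosing how many workers to use; the values below are hypothetical place-holders for your own measurements:
mem_per_task_gb <- 4    ## hypothetical: memory used by one call to my_fun()
total_mem_gb <- 64      ## hypothetical: memory available on the machine
## use no more workers than either the cores or the memory allow
n_workers <- min(parallel::detectCores(), floor(total_mem_gb / mem_per_task_gb))
param <- MulticoreParam(workers = n_workers)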
One ‘lesson learned’ is that R code, like any computer code, can be written in a way that is very inefficient, e.g., try this:
x <- integer(); for (i in 1:1000) x <- c(x, i)
x <- integer(); for (i in 1:100000) x <- c(x, i)
This is just making the sequence 1, 2, …; the second version takes a phenomenally long time! But check out:
x <- integer(); for (i in 1:100000) x[i] <- i
This executes almost instantly! So if you have code that takes a long time, and intuitively it seems like a modern computer should be doing much better than that, then perhaps it would pay to speak with an R ‘expert’ to see if there are obvious inefficiencies.
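The reason for the difference is that x <- c(x, i) creates a fresh copy of the entire vector on every iteration, so the total work grows quadratically, whereas the second form fills the vector in place (and, in recent versions of R, growing a vector one element at a time with x[i] <- i is far cheaper than re-creating it with c()); the fully vectorized form is the idiomatic solution. A minimal sketch comparing the alternatives with system.time():
## quadratic: c() copies the whole vector on each iteration
system.time({ x <- integer(); for (i in 1:100000) x <- c(x, i) })
## pre-allocate once, then fill in place
system.time({ x <- integer(100000); for (i in 1:100000) x[i] <- i })
## fully vectorized -- usually the idiomatic R solution
system.time(x <- seq_len(100000))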