AnVIL Office Hours 16JUN2022 @ 11 AM ET

Suppose you have a for loop that accumulates results in a vector result

result <- numeric(10)
for (i in 1:10) {
    ## do a lot of work, represented by...
    Sys.sleep(5)
    ## after the calculation add a value to the result
    result[i] <- sqrt(i)
}

Convert the body of the loop into a function that returns a value

my_fun <- function(i) {
    ## do a lot of work, represented by...
    Sys.sleep(5)
    ## return the result
    sqrt(i)
}

and instead of using a for loop, use lapply()

result <- lapply(1:10, my_fun)
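
Note that lapply() returns a list rather than a numeric vector; if a plain vector is wanted, unlist() (or vapply(), which also checks the type of each element) recovers it. A small aside, not part of the original notes:

result_vector <- unlist(result)
## or combine the iteration and the simplification in one step
result_vector <- vapply(1:10, my_fun, numeric(1))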

So far so good, but it still takes 5 seconds x 10 tasks = 50 seconds

system.time(result <- lapply(1:10, my_fun))

Now use BiocParallel to do the computation in parallel; if you have 10 cores, this will take only about 5 seconds

library(BiocParallel)
system.time(bpresult <- bplapply(1:10, my_fun))
identical(result, bpresult)

Currently the easiest way to get a performance improvement is simply to request a machine with more ‘cores’. The machine also needs enough memory to support all cores working simultaneously, so if one iteration of my_fun() uses 4 GB and you request a machine with 16 cores, you would need 16 x 4 = 64 GB. Performing parallel computations like this on a single machine is much easier than mastering Spark or workflows, although in the long run those might be ‘better’ solutions.
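
As a rough sketch of how the number of workers might be chosen and passed to bplapply() (the specific backend here is an assumption; MulticoreParam, for instance, is not available on Windows, where SnowParam would be used instead):

library(BiocParallel)
## how many cores does the current machine report?
parallel::detectCores()
## cap the number of workers explicitly, e.g., to respect memory limits
param <- MulticoreParam(workers = 10)
system.time(bpresult <- bplapply(1:10, my_fun, BPPARAM = param))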

One ‘lesson learned’ is that R code, like any computer code, can be written in a way that is very inefficient, e.g., try this

x <- integer(); for (i in 1:1000) x <- c(x, i)
x <- integer(); for (i in 1:100000) x <- c(x, i)

This is just making the sequence 1, 2, …; the second one takes a phenomenally long time! But check out

x <- integer(); for (i in 1:100000) x[i] <- i

This executes almost instantly! So if you have code that takes a long time, and intuitively it seems like a modern computer should be doing much better, it may pay to speak with an R ‘expert’ to see whether there are obvious inefficiencies.
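
As a rough way to see the difference yourself, wrap each version in system.time(); the exact numbers are illustrative and will vary between machines:

## growing the vector with c() copies it on each iteration
system.time({ x <- integer(); for (i in 1:100000) x <- c(x, i) })
## the indexed-assignment version runs far faster
system.time({ x <- integer(); for (i in 1:100000) x[i] <- i })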
