Why rayon-based parallel processing takes more time than serial processing?

Learning Rayon, I wanted to compare the performace of parallel calculation and serial calculation of Fibonacci series. Here's my code:

``````use rayon;
use std::time::Instant;

fn main() {
let nth = 30;
let now = Instant::now();
let fib = fibonacci_serial(nth);
println!(
"[s] The {}th number in the fibonacci sequence is {}, elapsed: {}",
nth,
fib,
now.elapsed().as_micros()
);

let now = Instant::now();
let fib = fibonacci_parallel(nth);
println!(
"[p] The {}th number in the fibonacci sequence is {}, elapsed: {}",
nth,
fib,
now.elapsed().as_micros()
);
}

fn fibonacci_parallel(n: u64) -> u64 {
if n <= 1 {
return n;
}

let (a, b) = rayon::join(|| fibonacci_parallel(n - 2), || fibonacci_parallel(n - 1));
a + b
}

fn fibonacci_serial(n: u64) -> u64 {
if n <= 1 {
return n;
}

fibonacci_serial(n - 2) + fibonacci_serial(n - 1)
}
``````

Run in Rust Playground

I expected the elapsed time of parallel calculation would be smaller than the elapsed time of serial caculation, but the result was opposite:

``````# `s` stands for serial calculation and `p` for parallel
[s] The 30th number in the fibonacci sequence is 832040, elapsed: 12127
[p] The 30th number in the fibonacci sequence is 832040, elapsed: 990379
``````

My implementation for serial/parallel calculation would have flaws. But if not, why am I seeing these results?

• If you are doing speed comparision, always use `--release` or the button Release in the upper left corner of the playground (under Debug)! – hellow Jun 7 at 9:17

I think the real reason is, that you create `n²` threads which is not good. In every call of `fibonacci_parallel` you create another pair of threads for rayon and because you call `fibonacci_parallel` again in the closure you create yet another pair of threads.
This is utterly terrible for the OS/rayon.

An approach to solve this problem could be this:

``````fn fibonacci_parallel(n: u64) -> u64 {
fn inner(n: u64) -> u64 {
if n <= 1 {
return n;
}

inner(n - 2) + inner(n - 1)
}

if n <= 1 {
return n;
}

let (a, b) = rayon::join(|| inner(n - 2), || inner(n - 1));
a + b
}
``````

You create two threads which both execute the inner function. With this addition I get

``````op@VBOX /t/t/foo> cargo run --release 40
Finished release [optimized] target(s) in 0.03s
Running `target/release/foo 40`
[s] The 40th number in the fibonacci sequence is 102334155, elapsed: 1373741
[p] The 40th number in the fibonacci sequence is 102334155, elapsed: 847343
``````

But as said, for low numbers parallel execution is not worth it:

``````op@VBOX /t/t/foo> cargo run --release 20
Finished release [optimized] target(s) in 0.02s
Running `target/release/foo 20`
[s] The 10th number in the fibonacci sequence is 6765, elapsed: 82
[p] The 10th number in the fibonacci sequence is 6765, elapsed: 241
``````
• those are internal tasks, not real threads, but pretty much the same result here. – chpio Jun 7 at 10:48
• The parallelization overhead is higher than the gains. – chpio Jun 7 at 10:50
• @chpio thanks for the particularization. Please feel free to edit my answer, I will happily accept it. – hellow Jun 7 at 10:50
• @chpio: Indeed, it's an overhead issue. Each task that is created needs to be scheduled, sent to a thread, etc... In general, parallelization is only worth it for coarse-grained parallelism. – Matthieu M. Jun 7 at 15:30