In [None]:
using Distributed
nprocs()

In [None]:
addprocs(3)
nprocs()

In [None]:
myid()
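
As a minimal sketch of what "another process" means here, `remotecall_fetch` runs a function on a given worker and returns the result. The `nprocs() == 1 && addprocs(2)` guard is an assumption added so the snippet also runs in a fresh session; it does nothing if workers already exist.

```julia
using Distributed

# Assumption: start two local workers only if none have been added yet.
nprocs() == 1 && addprocs(2)

# `myid()` is evaluated on each worker, not on the calling process,
# so every call reports the id of the worker that ran it.
ids = [remotecall_fetch(myid, w) for w in workers()]
```

The calling process always has id 1; worker ids start at 2.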

Code can be executed on different processes in parallel:

In [None]:
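# Partial sum of the Leibniz series for π over the index range `r`:
# 4 * Σ (-1)^i / (2i + 1)
@everywhere function partial_pi(r)
    series = 0.0
    for i in r
        # Even-indexed terms are added, odd-indexed terms are subtracted.
        series += (isodd(i) ? -1 : 1) / (i*2+1)
    end
    return 4*series
end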

In [None]:
a = partial_pi(0:999)
a, a-pi

In [None]:
b = partial_pi(1000:9999)
(a + b), (a+b) - pi
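
For reference, `partial_pi` sums terms of the Leibniz series for π. Because the series splits over disjoint index ranges, partial sums computed independently can simply be added, which is what makes this problem easy to distribute:

$$
\pi = 4\sum_{i=0}^{\infty}\frac{(-1)^i}{2i+1},
\qquad
4\sum_{i=0}^{9999}\frac{(-1)^i}{2i+1}
= \underbrace{4\sum_{i=0}^{999}\frac{(-1)^i}{2i+1}}_{a}
+ \underbrace{4\sum_{i=1000}^{9999}\frac{(-1)^i}{2i+1}}_{b}
$$

Since this is an alternating series with decreasing terms, the error after summing indices $0,\dots,N-1$ is at most the first omitted term, $4/(2N+1)$, so convergence is slow and many terms are needed.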

In [None]:
r = 0:10_000_000_000

In [None]:
# Parallel code
@time p = @distributed (+) for offset in 0:10000:r[end]-1
    q = (0:9999) .+ offset
    partial_pi(q)
end
p - pi

In [None]:
# Serial code
p = 0.0
@time for offset in 0:10000:r[end]-1
    q = (0:9999) .+ offset
    p += partial_pi(q)
end
p - pi

## Data movement

Remember: Moving data is _expensive_!

| System Event                   | Actual Latency | Scaled Latency |
| ------------------------------ | -------------- | -------------- |
| One CPU cycle                  |     0.4 ns     |     1 s        |
| Level 1 cache access           |     0.9 ns     |     2 s        |
| Level 2 cache access           |     2.8 ns     |     7 s        |
| Level 3 cache access           |      28 ns     |     1 min      |
| Main memory access (DDR DIMM)  |    ~100 ns     |     4 min      |
| Intel Optane memory access     |     <10 μs     |     7 hrs      |
| NVMe SSD I/O                   |     ~25 μs     |    17 hrs      |
| SSD I/O                        |  50–150 μs     | 1.5–4 days     |
| Rotational disk I/O            |    1–10 ms     |   1–9 months   |
| Internet call: SF to NYC       |      65 ms     |     5 years    |
| Internet call: SF to Hong Kong |     141 ms     |    11 years    |

You really don't want to be taking a trip to the moon very frequently.
Communication between processes can indeed be as expensive as hitting a disk;
sometimes it is even implemented that way.
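
A minimal sketch of the difference between shipping data and shipping work, using `remotecall_fetch`. The array size and the worker guard are illustrative assumptions, not part of the original notebook.

```julia
using Distributed

# Assumption: start two local workers only if none have been added yet.
nprocs() == 1 && addprocs(2)
w = first(workers())

# Option 1: build the array here, then send all one million values to the worker.
data = ones(1_000_000)
shipped = remotecall_fetch(sum, w, data)

# Option 2: send only a small function; the worker builds the array itself
# and returns a single Float64.
computed = remotecall_fetch(() -> sum(ones(1_000_000)), w)
```

Both calls produce the same number, but the second moves far fewer bytes between processes.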

That's why Julia has special support for reductions built into the
`@distributed` macro: each worker does its own (intermediate) reduction
and returns just one value to the master process.
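
A small sketch of that reduction on a problem whose answer is easy to check by hand (the sum of squares from 1 to 1000). The worker guard is an assumption added so the snippet runs on its own.

```julia
using Distributed

# Assumption: start two local workers only if none have been added yet.
nprocs() == 1 && addprocs(2)

# Each worker sums the squares in its own chunk of 1:1000; only the
# per-worker subtotals travel back, where they are combined with (+).
total = @distributed (+) for i in 1:1000
    i^2
end
```

With a reducer, `@distributed` waits for the result. Without one, it returns immediately, and you would prefix it with `@sync` to wait for the loop to finish.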

But sometimes you need to see those intermediate values. If you have a
very expensive computation relative to the communication overhead, there are
several ways to do this. The easiest is `pmap`:

In [None]:
ranges = [(0:99999) .+ offset for offset in 0:100000:r[end]-1]

In [None]:
# Parallel code
@time res_p = pmap(partial_pi, ranges)

In [None]:
sum(res_p) - pi

In [None]:
# Serial code
@time res_s = map(partial_pi, ranges);

In [None]:
sum(res_s) - pi
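
When individual calls are cheap, `pmap` can be told to send work in batches through its `batch_size` keyword, which reduces the number of round trips. The following is a sketch on a toy function; the batch size of 10 is an arbitrary illustrative choice.

```julia
using Distributed

# Assumption: start two local workers only if none have been added yet.
nprocs() == 1 && addprocs(2)

# Results come back in input order, one per element, as with `map`.
# `batch_size=10` groups ten inputs into each message sent to a worker.
squares = pmap(x -> x^2, 1:100; batch_size=10)
```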

## Code movement

Each worker is _completely_ independent; it's like starting brand-new, separate
Julia processes yourself. By default, `addprocs()` launches one worker per
logical CPU core on the current machine, but you can just as easily launch
workers on remote machines via SSH or through a cluster manager.
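
A sketch of launching and removing local workers explicitly. The `exeflags` value is only an illustration of passing command-line flags to the worker processes, and the host name in the commented-out SSH form is hypothetical.

```julia
using Distributed

before = nprocs()

# Launch two extra local workers; `exeflags` is passed to each worker's `julia`.
new_ids = addprocs(2; exeflags="--startup-file=no")
after = nprocs()

# Remote form over SSH (hypothetical host, not run here):
# addprocs([("user@node1.example.org", 4)])

# Shut the two workers down again.
rmprocs(new_ids)
```

`addprocs` returns the ids of the workers it started, which is convenient for removing exactly those workers later.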

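Those `@everywhere`s above are very important! They run the given expression
on all processes so that every worker has the same definitions. Without
`@everywhere`, a worker asked to call a function it has never seen throws an
error, because the definition exists only on the process that created it.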

Note that this applies to packages, too!
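
A minimal sketch of both cases, assuming the code runs at top level in a script or notebook. The function names `local_only` and `everywhere_fn` are made up for illustration, and `LinearAlgebra` stands in for any package.

```julia
using Distributed

# Assumption: start two local workers only if none have been added yet.
nprocs() == 1 && addprocs(2)
w = first(workers())

# Defined on the calling process only (no @everywhere).
local_only(x) = 2x

# The worker has no definition of `local_only`, so the remote call throws.
failed = try
    remotecall_fetch(local_only, w, 21)
    false
catch
    true
end

# Defined on every process, so the same call now succeeds.
@everywhere everywhere_fn(x) = 2x
ok = remotecall_fetch(everywhere_fn, w, 21)

# Packages must be loaded on every process as well.
@everywhere using LinearAlgebra
len = remotecall_fetch(norm, w, [3.0, 4.0])
```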

# Multi-process parallelism is the heavy-duty workhorse in Julia

It can tackle very large problems and distribute work across a very large
number of workers. Key things to remember:

* Each worker is a completely independent Julia process
    * Data must move to them
    * Code must move to them