Jeff Fessler, University of Michigan
2019-01-31
2020-01-18 Julia 1.3.1, LinearMapsAA
The first-order finite difference operation is important for regularization.
It is especially important for this operation to be memory- and compute-efficient
for 2D and higher-dimensional problems,
but for simplicity this notebook first explores it for 1D problems.
If x
is a 1D array of length $N$ and y
is its first-order finite difference
then
y[n] = x[n+1] - x[n]
for $n=1,\ldots,N-1$.
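For example, if x = [1, 4, 9, 16] then y = [3, 5, 7].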
There are surprisingly many ways to compute y
from x
in a high-level language like Julia.
After loading the key packages,
the next cell gives a table of some possible methods.
using BenchmarkTools
# using LinearAlgebra: diagm - far too slow to even illustrate
using Printf
# using DSP # for conv() - also too slow to bother
# using FastConv - currently not loading properly so excluded. Will be slower anyway.
# using DirectConvolution - ditto.
do_time = get(ENV, "HOST", "") == "jf28.local" # avoids a KeyError when HOST is not set
do_time = false; # set to true to run the timing tests
# here it is written as a loop as some Fortran or C programmers might do
function myloop(x::AbstractVector{<:Number})
N = length(x)
y = similar(x, N-1)
for n=1:(N-1)
@inbounds y[n] = x[n+1] - x[n]
end
return y
end
# the remaining methods all use built-in Julia functions
tryfun = [
x -> myloop(x) "myloop()";
x -> diff(x) "diff()";
x -> [x[n+1]-x[n] for n=1:(length(x)-1)] "comprehension";
x -> (@inbounds [x[n+1]-x[n] for n=1:(length(x)-1)]) "compreh. inbounds";
x -> (circshift(x, -1) - x)[1:(length(x)-1)] "circshift";
x -> x[2:end] - x[1:(end-1)] "index";
# x -> (@inbounds x[2:end] - x[1:(end-1)]) "index inbound";
# x -> (x[2:N] - x[1:(N-1)]) "index N";
# x -> (@inbounds x[2:N] - x[1:(N-1)]) "index inbound N";
x -> (@views x[2:end] - x[1:(end-1)]) "index views";
# x -> conv(x, eltype(x).([1, -1]))[2:end-1] "conv"; # slow!
]
nfun = size(tryfun,1)
tryfun
The next cell creates some test data
and then verifies that all of the nfun
methods above
produce the same output
to within numerical precision.
We use Float32
rather than the default 64-bit Float64
because single precision suffices
for most imaging applications
and runs faster because it uses less memory.
N = 2^15 # adjust this based on your computer speed
x = randn(Float32, N) # faster than Float64
y = diff(x)
for it = 1:size(tryfun,1) # warm-up
fun = tryfun[it,1]
@assert isapprox(y, fun(x))
end
label = tryfun[:,2];
Use the @btime macro from the BenchmarkTools package to time these nfun methods.
if do_time
for it = 1:size(tryfun,1)
@printf("%20s ", tryfun[it,2])
@btime y = $tryfun[$it,1]($x)
end
end
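As a side check of the earlier claim that single precision helps, one could also time the Float64 version; this is just a sketch (the variable x64 is not used elsewhere) and the timings are machine dependent:
if do_time # optional comparison: Float64 uses twice the memory per element of Float32
    x64 = randn(Float64, N)
    @btime diff($x64)
end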
The results of course depend on the computer and Julia version.
Here are the results for a 2017 iMac with a 4.2 GHz Intel Core i7 (Kaby Lake i7-7700K), using Julia 1.3:
myloop() 7.637 μs (2 allocations: 128.08 KiB)
diff() 7.500 μs (4 allocations: 128.17 KiB)
comprehension 22.183 μs (4 allocations: 128.13 KiB)
compreh. inbounds 29.439 μs (4 allocations: 128.13 KiB)
circshift 21.929 μs (6 allocations: 384.23 KiB)
index 20.490 μs (6 allocations: 384.23 KiB)
index views 7.771 μs (4 allocations: 128.17 KiB)
A. myloop(x)
B. x[2:end] - x[1:(end-1)] (index)
C. circshift(x ...
D. (@views x[2:end] - x[1:(end-1)]) (index views)
E. [x[n+1]-x[n] for n=1:(length(x)-1)] (comprehension)
These function calls are all linear, but they do not "look" like matrix operations.
In contrast, if we create a sparse matrix C
using spdiagm,
then we can make the code look a lot like the math
by typing y = C * x.
using SparseArrays: spdiagm
# note: the extra [1:(N-1),:] indexing at the end eliminates a useless (?) row of zeros
Spn = (N,T) -> spdiagm(0 => -ones(T, N-1), 1 => ones(T, N-1))[1:(N-1),:] # sparse
C = Spn(length(x),eltype(x))
@assert y == C*x
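To see why that last row is dropped, here is a tiny example: because both diagonals have only N-1 entries, spdiagm returns an N×N square matrix whose last row is entirely zero, and the trailing indexing trims it to (N-1)×N. (Stiny below is just for illustration.)
Stiny = spdiagm(0 => -ones(Int, 4), 1 => ones(Int, 4)) # 5×5; its last row is all zeros
@assert all(iszero, Stiny[end, :])
Matrix(Spn(5, Int)) # the trimmed 4×5 version used above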
But using sparse matrix multiplication is much slower than using diff:
if do_time
@btime $C * $x
end
On the same computer:
63.513 μs (2 allocations: 128.08 KiB)
The LinearMapsAA
package lets us do the same thing with any (linear) function!
using LinearMapsAA
Cdiff = LinearMapAA(x -> diff(x), (N-1, N))
# it works as you would hope:
@assert y == Cdiff*x
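Since any linear function can be wrapped this way, here is another small example using cumsum, just to illustrate the point; Csum is hypothetical and is not part of the finite-difference comparison:
Csum = LinearMapAA(x -> cumsum(x), (N, N)) # cumulative sum is also a linear operation
@assert Csum * x ≈ cumsum(x)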
# we can examine the elements for small sizes
Ctiny = LinearMapAA(x -> diff(x), (4, 5))
Int.(Matrix(Ctiny))
# How does Matrix() work here? Think about this:
Ctiny * [0, 1, 0, 0, 0]
# It acts like a matrix for multiplication:
Ctiny * [1, 3, 4, 7, 0]
Ctiny * (1:5)
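To answer the question in the comment above: conceptually, Matrix() materializes a linear map by applying it to each standard basis vector, so column j of the result is the map applied to e_j. A sketch of that idea (the helper unit below is just for this illustration):
unit = j -> Float32.(1:5 .== j) # j-th standard basis vector of length 5
@assert Matrix(Ctiny) == hcat([Ctiny * unit(j) for j in 1:5]...)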
# make a LinearMapAA from each of the functions above
DD = similar(Array{Any}, nfun+1)
for it = 1:size(tryfun,1) # warm-up
fun = tryfun[it,1]
DD[it] = LinearMapAA(fun, (N-1, N))
@assert isapprox(y, DD[it] * x)
end
# include sparse array in the collection
if true # spdiagm
label = [tryfun[:,2]..., "spdiagm"]
DD[length(label)] = C
@assert isapprox(y, DD[length(label)] * x)
end
if do_time
for it = 1:length(label)
@printf("%20s ", label[it])
@btime y = $DD[$it] * $x
end
end
Results on the same computer:
myloop() 7.757 μs (2 allocations: 128.08 KiB)
diff() 7.647 μs (4 allocations: 128.17 KiB)
comprehension 22.295 μs (4 allocations: 128.13 KiB)
compreh. inbounds 29.454 μs (4 allocations: 128.13 KiB)
circshift 21.520 μs (6 allocations: 384.23 KiB)
index 20.608 μs (6 allocations: 384.23 KiB)
index views 7.730 μs (4 allocations: 128.17 KiB)
spdiagm 64.011 μs (2 allocations: 128.08 KiB)
Using spdiagm to perform C*x
is $\approx 10 \times$ slower
than using a LinearMapAA built from the fastest function!
Also, the LinearMapAA wrapper adds very little overhead
compared to calling those functions directly (compare the two timing tables above).
But we are not done yet.
Our regularizer gradient needs multiply operations
with both C and C'.
We have C' for the sparse array,
but not yet for the LinearMapAA version.
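For example, for a quadratic regularizer of the form $R(x) = \frac{1}{2} \|C x\|_2^2$ (one common choice; other regularizers lead to similar expressions), the gradient is $\nabla R(x) = C' (C x)$, so evaluating it requires one multiply by $C$ followed by one multiply by $C'$.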
First look at the transpose to see what it should be:
Int.(Matrix(Ctiny))'
Note that the variable type above is Adjoint,
which is a generalization of transpose:
https://en.wikipedia.org/wiki/Adjoint
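The adjoint is defined by the property $\langle C x, y \rangle = \langle x, C' y \rangle$ for all $x$ and $y$; for real matrices it is simply the transpose. From the transpose shown above, $C'$ maps a vector $y$ of length $N-1$ to a vector of length $N$ with entries $(C'y)[1] = -y[1]$, $(C'y)[n] = y[n-1] - y[n]$ for $n = 2,\ldots,N-1$, and $(C'y)[N] = y[N-1]$.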
We can define many Julia functions to perform C' * y.
Here is a hand-coded version, analogous to myloop above:
function myadj(y::AbstractVector{<:Number})
N = length(y)
x = similar(y, N+1)
x[1] = -y[1]
@views x[2:end-1] .= y[1:(end-1)] - y[2:end]
x[N+1] = y[N]
return x
end
# the remaining methods all use built-in Julia functions
adjfun = [
y -> myadj(y) "myadj()";
y -> [-y[1]; -diff(y); y[end]] "diff()";
y -> [-y[1]; [y[n]-y[n+1] for n=1:(length(y)-1)]; y[end]] "comprehension";
y -> [-y[1]; (@views y[1:end-1] - y[2:end]); y[end]] "index views";
# y -> conv(y, eltype(y).([1, -1]))[???] "conv"; # slow, not done!
]
nadj = size(adjfun,1)
z = C' * y
for it = 1:size(adjfun,1) # warm-up and check
fun = adjfun[it,1]
@assert isapprox(z, fun(y))
end
adjlabel = adjfun[:,2];
# time these methods
if do_time
for it = 1:size(adjfun,1)
@printf("%20s ", adjfun[it,2])
@btime z = $adjfun[$it,1]($y)
end
end
On the same computer the adjoint timing results were:
myadj() 15.089 μs (7 allocations: 256.30 KiB)
diff() 25.894 μs (39 allocations: 385.31 KiB)
comprehension 31.864 μs (37 allocations: 257.19 KiB)
index views 17.372 μs (37 allocations: 257.23 KiB)
Here, index views and myadj()
are the two fastest.
Now make a new LinearMapAA
that has both forward and adjoint multiplication.
For simplicity, use the @views versions for both.
ClmN = N -> LinearMapAA(
x -> (@views x[2:end] - x[1:(end-1)]), # forward function
y -> [-y[1]; (@views y[1:end-1] - y[2:end]); y[end]], (N-1, N)) # adjoint function
Clm5 = ClmN(5)
Clm5'
# check for adjoint consistency
display(Int.(Matrix(Clm5)))
display(Int.(Matrix(Clm5')))
@assert Matrix(Clm5)' == Matrix(Clm5')
LinearMapAA versus SparseArray
Clm = ClmN(N)
@assert Clm * x == C * x
@assert Clm' * y == C' * y
if do_time
@printf("%22s ", "linearmap C")
@btime $Clm * $x;
@printf("%22s ", "linearmap C'")
@btime $Clm' * $y
@printf("%22s ", "sparse C")
@btime $C * $x
@printf("%22s ", "sparse C'")
@btime $C' * $y
end
Timing results on the same computer:
linearmap C 7.848 μs (4 allocations: 128.17 KiB)
linearmap C' 18.617 μs (42 allocations: 257.39 KiB)
sparse C 63.818 μs (2 allocations: 128.08 KiB)
sparse C' 66.526 μs (3 allocations: 128.09 KiB)
Using a LinearMapAA
is much faster than using a SparseArray
in both directions.
# final check of the adjoint (dot-product) property: u'*(C*v) == (C'*u)'*v
u = randn(size(C,1))
v = randn(size(C,2))
@assert u'*(C*v) ≈ (C' * u)' * v
# Aside: some Julia array-construction examples
a = ones(5); b = 2*ones(5)
[a b] # hcat of two length-5 vectors gives a 5×2 Matrix
arr = [a, b, "hi", x -> x^2] # a heterogeneous Vector{Any}
[2 "hi"] # hcat with mixed types gives a 1×2 Matrix{Any}
[[a] "hi"] # likewise, with a nested array as the first element
# code from xijia quan in w19 - code for prof. fessler to study later.
# N-N first and second order
#=
using Distributed
a = randn(100,100)
@time begin
tmp_C1 = diff(a,dims=1)
tmp_C0 = a[1,:] - a[end,:]
C1 = vcat(tmp_C1, tmp_C0')
end # 1st order
@time begin
tmp_C2 = -diff(C1,dims=1)
tmp_C20 = C1[end,:] - C1[1,:]
C2 = vcat(tmp_C2, tmp_C20')
end # 2nd order
# slow
C3 = Matrix{Float64}(undef,size(a))
@time begin
@sync @distributed for i=1:size(a,2)
tmp_C3 = diff(a[:,i])
tmp_C30 = a[1,i] - a[end,i]
C3[:,i] = vcat(tmp_C3, tmp_C30)
end # 1st order
end
C4 = Matrix{Float64}(undef,size(a))
@time begin
@sync @distributed for i=1:size(a,2)
tmp_C4 = -diff(C3[:,i])
tmp_C40 = C3[end,i] - C3[1,i]
C4[:,i] = vcat(tmp_C4, tmp_C40)
end # 2nd order
end
=#