1D finite differences

Jeff Fessler, University of Michigan 2019-01-31
2020-01-18 Julia 1.3.1, LinearMapsAA

The first-order finite difference operation is important for regularization.
It is especially important for it to be memory and compute efficient for 2D and higher dimensional problems, but for simplicity this notebook first explores it for 1D problems.

If x is a 1D array of length $N$ and y is its first-order finite difference then
y[n] = x[n+1] - x[n] for $n=1,\ldots,N-1$.

There are surprisingly many ways to compute y from x in a high-level language like Julia. After loading the key packages, the next cell gives a table of some possible methods.

In [1]:
using BenchmarkTools
# using LinearAlgebra: diagm - far too slow to even illustrate
using Printf
# using DSP # for conv() - also too slow to bother
# using FastConv - currently not loading properly so excluded.  Will be slower anyway.
# using DirectConvolution - ditto.

do_time = ENV["HOST"] == "jf28.local" # set to true to run timing tests
do_time = false;
In [2]:
# here it is written as a loop as some Fortran or C programmers might do
function myloop(x::AbstractVector{<:Number})
    N = length(x)
    y = similar(x, N-1)
    for n=1:(N-1)
        @inbounds y[n] = x[n+1] - x[n]
    end
    return y
end

# the remaining methods all use built-in Julia functions
tryfun = [
    x -> myloop(x) "myloop()";
    x -> diff(x) "diff()";
    x -> [x[n+1]-x[n] for n=1:(length(x)-1)]  "comprehension";
    x -> (@inbounds [x[n+1]-x[n] for n=1:(length(x)-1)]) "compreh. inbounds";
    x -> (circshift(x, -1) - x)[1:(length(x)-1)] "circshift";
    x -> x[2:end] - x[1:(end-1)] "index";
#   x -> (@inbounds x[2:end] - x[1:(end-1)]) "index inbound";
#   x -> (x[2:N] - x[1:(N-1)]) "index N";
#   x -> (@inbounds x[2:N] - x[1:(N-1)]) "index inbound N";
    x -> (@views x[2:end] - x[1:(end-1)]) "index views";
#   x -> conv(x, eltype(x).([1, -1]))[2:end-1]  "conv"; # slow!
]
nfun = size(tryfun,1)
tryfun
Out[2]:
7×2 Array{Any,2}:
 #3   "myloop()"         
 #4   "diff()"           
 #5   "comprehension"    
 #7   "compreh. inbounds"
 #9   "circshift"        
 #10  "index"            
 #11  "index views"      

The next cell creates some test data and then verifies that all of the nfun methods above produce the same output to within numerical precision. We use Float32 rather than the default 64-bit float because single precision suffices for most imaging applications and runs faster because of less memory.

In [3]:
N = 2^15 # adjust this based on your computer speed
x = randn(Float32, N) # faster than Float64

y = diff(x)
for it = 1:size(tryfun,1) # warm-up
    fun = tryfun[it,1]
    @assert isapprox(y, fun(x))
end
label = tryfun[:,2];

Use @btime macro in BenchmarkTools package to time these nfun methods.

In [4]:
if do_time
    for it = 1:size(tryfun,1)
        @printf("%20s ", tryfun[it,2])
        @btime y = $tryfun[$it,1]($x)
    end
end

The results of course depend on the computer and Julia version.

Here are the results for a 2017 iMac with 4.2GHz Intel Core i7 (Kaby Lake I7-7700K) with Julia 1.3

            myloop()   7.637 μs (2 allocations: 128.08 KiB)
              diff()   7.500 μs (4 allocations: 128.17 KiB)
       comprehension   22.183 μs (4 allocations: 128.13 KiB)
   compreh. inbounds   29.439 μs (4 allocations: 128.13 KiB)
           circshift   21.929 μs (6 allocations: 384.23 KiB)
               index   20.490 μs (6 allocations: 384.23 KiB)
         index views   7.771 μs (4 allocations: 128.17 KiB)

Which approach is the most like well-written single-thread C-code?

A. myloop(x)
B. x[2:end] - x[1:(end-1)] (index)
C. circshift(x ...
D. (@views x[2:end] - x[1:(end-1)]) (index views)
E. [x[n+1]-x[n] for n=1:(length(x)-1)] (comprehension)

Sparse matrices

These function calls are all linear, but they do not "look" like matrix operations. In contrast, if we create a sparse matrix C using spdiagm then we can make the code look a lot like the math by typing y = C * x

In [5]:
using SparseArrays: spdiagm
# note: the extra [1:(N-1),:] indexing at the end eliminates a useless (?) row of zeros
Spn = (N,T) -> spdiagm(0 => -ones(T, N-1), 1 => ones(T, N-1))[1:(N-1),:] # sparse
C = Spn(length(x),eltype(x))
@assert y == C*x

But using sparse matrix multiplication is much slower than using diff:

In [6]:
if do_time
    @btime $C * $x
end

On same computer:

63.513 μs (2 allocations: 128.08 KiB)

LinearMaps

The LinearMapsAA package lets us do the same thing with any (linear) function!

In [7]:
using LinearMapsAA
Cdiff = LinearMapAA(x -> diff(x), (N-1, N))
Out[7]:
LinearMapAA: 32767×32768
(none = nothing,)
LinearMaps.FunctionMap{Float32}(#23, 32767, 32768; ismutating=false, issymmetric=false, ishermitian=false, isposdef=false)
In [8]:
# it works as you would hope:
@assert y == Cdiff*x
In [9]:
# we can examine the elements for small sizes
Ctiny = LinearMapAA(x -> diff(x), (4, 5))
Out[9]:
LinearMapAA: 4×5
(none = nothing,)
LinearMaps.FunctionMap{Float32}(#25, 4, 5; ismutating=false, issymmetric=false, ishermitian=false, isposdef=false)
In [10]:
Int.(Matrix(Ctiny))
Out[10]:
4×5 Array{Int64,2}:
 -1   1   0   0  0
  0  -1   1   0  0
  0   0  -1   1  0
  0   0   0  -1  1
In [11]:
# How does Matrix() work here?  Think about this:
Ctiny * [0, 1, 0, 0, 0]
Out[11]:
4-element Array{Int64,1}:
  1
 -1
  0
  0
In [12]:
# It acts like a matrix for multiplication:
Ctiny * [1, 3, 4, 7, 0]
Out[12]:
4-element Array{Int64,1}:
  2
  1
  3
 -7
In [13]:
Ctiny * (1:5)
Out[13]:
4-element Array{Int64,1}:
 1
 1
 1
 1

Timing the LinearMap (and sparse) versions

In [14]:
# make LinearMap from each of the functions

DD = similar(Array{Any}, nfun+1)

for it = 1:size(tryfun,1) # warm-up
    fun = tryfun[it,1]
    DD[it] = LinearMapAA(fun, (N-1, N))
    @assert isapprox(y, DD[it] * x)
end

# include sparse array in the collection
if true # spdiagm
    label = [tryfun[:,2]..., "spdiagm"]
    DD[length(label)] = C
    @assert isapprox(y, DD[length(label)] * x)
end

if do_time
    for it = 1:length(label)
        @printf("%20s ", label[it])
        @btime y = $DD[$it] * $x
    end
end

Results on the same computer:

            myloop()   7.757 μs (2 allocations: 128.08 KiB)
              diff()   7.647 μs (4 allocations: 128.17 KiB)
       comprehension   22.295 μs (4 allocations: 128.13 KiB)
   compreh. inbounds   29.454 μs (4 allocations: 128.13 KiB)
           circshift   21.520 μs (6 allocations: 384.23 KiB)
               index   20.608 μs (6 allocations: 384.23 KiB)
         index views   7.730 μs (4 allocations: 128.17 KiB)
             spdiagm   64.011 μs (2 allocations: 128.08 KiB)

Take away:

Using spdiagm to perform C*x is $\approx 10 \times$ slower than using the LinearMapAA with the fastest function!
Also, LinearMapAA requires very little overhead.

Transpose

But we are not done yet. Our regularizer gradient needs multiply operations with both C and C' and we have C' for the sparse array, but not yet for the LinearMapAA version. First look at the transpose to see what it should be:

In [15]:
Int.(Matrix(Ctiny))'
Out[15]:
5×4 LinearAlgebra.Adjoint{Int64,Array{Int64,2}}:
 -1   0   0   0
  1  -1   0   0
  0   1  -1   0
  0   0   1  -1
  0   0   0   1

Note that the variable type above is Adjoint that is a generalization of transpose:
https://en.wikipedia.org/wiki/Adjoint

We can define many Julia functions to perform C' * y. Here is a C/Fortran style loop:

In [16]:
function myadj(y::AbstractVector{<:Number})
    N = length(y)
    x = similar(y, N+1)
    x[1] = -y[1]
    @views x[2:end-1] .= y[1:(end-1)] - y[2:end]
    x[N+1] = y[N]
    return x
end
Out[16]:
myadj (generic function with 1 method)
In [17]:
# the remaining methods all use built-in Julia functions
adjfun = [
    y -> myadj(y) "myadj()";
    y -> [-y[1]; -diff(y); y[end]] "diff()";
    y -> [-y[1]; [y[n]-y[n+1] for n=1:(length(y)-1)]; y[end]]  "comprehension";
    y -> [-y[1]; (@views y[1:end-1] - y[2:end]); y[end]] "index views";
#   y -> conv(y, eltype(y).([1, -1]))[???]  "conv"; # slow, not done!
]
nadj = size(adjfun,1)

z = C' * y
for it = 1:size(adjfun,1) # warm-up and check
    fun = adjfun[it,1]
    @assert isapprox(z, fun(y))
end
adjlabel = tryfun[:,2];
In [18]:
# time these methods
if do_time
    for it = 1:size(adjfun,1)
        @printf("%20s ", adjfun[it,2])
        @btime z = $adjfun[$it,1]($y)
    end
end

On the same computer the adjoint timing results were:

             myadj()   15.089 μs (7 allocations: 256.30 KiB)
              diff()   25.894 μs (39 allocations: 385.31 KiB)
       comprehension   31.864 μs (37 allocations: 257.19 KiB)
         index views   17.372 μs (37 allocations: 257.23 KiB)

Here, index views and myadj() are the two fastest.

Now make a new LinearMapAA that has both forward and adjoint multiplication.
Use the @views functions for both for simplicity.

In [19]:
ClmN = N -> LinearMapAA(
    x -> (@views x[2:end] - x[1:(end-1)]), # forward function
    y -> [-y[1]; (@views y[1:end-1] - y[2:end]); y[end]], (N-1, N)) # adjoint function
Clm5 = ClmN(5)
Out[19]:
LinearMapAA: 4×5
(none = nothing,)
LinearMaps.FunctionMap{Float32}(#38, #39, 4, 5; ismutating=false, issymmetric=false, ishermitian=false, isposdef=false)
In [20]:
Clm5'
Out[20]:
LinearMapAA: 5×4
(none = nothing,)
LinearMaps.TransposeMap{Float32,LinearMaps.FunctionMap{Float32,var"#38#41",var"#39#42"}}(LinearMaps.FunctionMap{Float32}(#38, #39, 4, 5; ismutating=false, issymmetric=false, ishermitian=false, isposdef=false))
In [21]:
# check for adjoint consistency
display(Int.(Matrix(Clm5)))
display(Int.(Matrix(Clm5')))
@assert Matrix(Clm5)' == Matrix(Clm5')
4×5 Array{Int64,2}:
 -1   1   0   0  0
  0  -1   1   0  0
  0   0  -1   1  0
  0   0   0  -1  1
5×4 Array{Int64,2}:
 -1   0   0   0
  1  -1   0   0
  0   1  -1   0
  0   0   1  -1
  0   0   0   1

Timing test to compare LinearMapAA versus SparseArray

In [22]:
Clm = ClmN(N)
@assert Clm * x == C * x
@assert Clm' * y == C' * y
if do_time
    @printf("%22s ", "linearmap C")
    @btime $Clm * $x;
    @printf("%22s ", "linearmap C'")
    @btime $Clm' * $y
    @printf("%22s ", "sparse C")
    @btime $C * $x
    @printf("%22s ", "sparse C'")
    @btime $C' * $y
end

Timing results on same computer:

          linearmap C    7.848 μs (4 allocations: 128.17 KiB)
          linearmap C'  18.617 μs (42 allocations: 257.39 KiB)
             sparse C   63.818 μs (2 allocations: 128.08 KiB)
             sparse C'  66.526 μs (3 allocations: 128.09 KiB)

Using a LinearMapAA is much faster than using a SparseArray in both directions.

Check adjoint consistency for large LinearMaps

In [23]:
u = randn(size(C,1))
v = randn(size(C,2))
@assert u'*(C*v)  (C' * u)' * v

Illustrate Array{Any}

In [24]:
a = ones(5); b = 2*ones(5)
Out[24]:
5-element Array{Float64,1}:
 2.0
 2.0
 2.0
 2.0
 2.0
In [25]:
[a b]
Out[25]:
5×2 Array{Float64,2}:
 1.0  2.0
 1.0  2.0
 1.0  2.0
 1.0  2.0
 1.0  2.0
In [26]:
arr = [a, b, "hi", x -> x^2]
Out[26]:
4-element Array{Any,1}:
 [1.0, 1.0, 1.0, 1.0, 1.0]
 [2.0, 2.0, 2.0, 2.0, 2.0]
 "hi"                     
 var"#43#44"()            
In [27]:
[2 "hi"]
Out[27]:
1×2 Array{Any,2}:
 2  "hi"
In [28]:
[[a] "hi"]
Out[28]:
1×2 Array{Any,2}:
 [1.0, 1.0, 1.0, 1.0, 1.0]  "hi"
In [29]:
# code from xijia quan in w19 - code for prof. fessler to study later.
# N-N first and second order
#=
using Distributed
a = randn(100,100)
@time begin
    tmp_C1 = diff(a,dims=1)
    tmp_C0 = a[1,:] - a[end,:]
    C1 = vcat(tmp_C1, tmp_C0')
end # 1st order

@time begin
    tmp_C2 = -diff(C1,dims=1)
    tmp_C20 = C1[end,:] - C1[1,:]
    C2 = vcat(tmp_C2, tmp_C20')
end # 2rd order

# slow
C3 = Matrix{Float64}(undef,size(a))
@time begin
    @sync @distributed for i=1:size(a,2)
        tmp_C3 = diff(a[:,i])
        tmp_C30 = a[1,i] - a[end,i]
        C3[:,i] = vcat(tmp_C3, tmp_C30)
    end # 1st order
end

C4 = Matrix{Float64}(undef,size(a))
@time begin
    @sync @distributed for i=1:size(a,2)
        tmp_C4 = -diff(C3[:,i])
        tmp_C40 = C3[end,i] - C3[1,i]
        C4[:,i] = vcat(tmp_C4, tmp_C40)
    end # 2rd order
end
=#