Note on convolutions

For simplicity, we consider a 1d convolution on a 1d input array as an example. The most important parameters of a convolution are the kernel size, stride, and padding, denoted by $K$, $S$, and $P$, respectively. Let $I$ be the size of the 1d input array; the indices of the padded array then range from $-P$ to $I-1+P$ (inclusive), with the original elements at indices $0$ to $I-1$.

The convolution can be interpreted in a sliding-window picture. At the start (step 0), the first element of the window is aligned with index $-P$ of the input array; in each step, the window slides $S$ elements along the input array. In the $i$-th step, the first element of the window is therefore aligned with index $-P+iS$ and the last element with index $-P+iS+K-1$, which must satisfy $-P+iS+K-1 \leq I-1+P$. As a result, $0 \leq i \leq (I + 2P - K)/S$, and the size of the output array is \begin{equation}\left\lfloor\frac{I + 2P - K}{S}\right\rfloor + 1\,.\end{equation} The above process can be summarized in the following Python code:

def conv1d(in_list, weights_list, bias, stride, padding):
    in_size = len(in_list)
    kernel_size = len(weights_list)
    # Output size: floor((I + 2P - K) / S) + 1
    out_size = (in_size + 2 * padding - kernel_size) // stride + 1
    out_list = [0] * out_size
    for i in range(out_size):
        # Index of the first window element, relative to the unpadded input
        start = -padding + i * stride
        for j in range(kernel_size):
            # Indices outside the input fall in the zero padding
            elem = in_list[start + j] if 0 <= start + j < in_size else 0
            out_list[i] += weights_list[j] * elem
        out_list[i] += bias
    return out_list
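As a quick sanity check of the derivation, here is a worked example (pure Python, independent of the function above) with $I=4$, $K=2$, $S=1$, $P=1$:

```python
# Worked check of the output-size formula for I=4, K=2, S=1, P=1.
I, K, S, P = 4, 2, 1, 1
out_size = (I + 2 * P - K) // S + 1
print(out_size)  # 5

# First-element index of each window: -P + i*S, for i = 0 .. out_size-1
starts = [-P + i * S for i in range(out_size)]
print(starts)  # [-1, 0, 1, 2, 3]
# The last window ends at -P + (out_size-1)*S + K - 1 = 4 = I - 1 + P,
# so it exactly reaches the end of the padded array.
```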

In cuDNN and deep learning packages, the underlying implementation of convolution uses general matrix multiplication (GEMM) instead of the naive sliding window. Both the kernels and the input array are transformed into 2d matrices, as in the figure:
Figure: An example of two 2x2x3 kernels on a 3x3 RGB image in GEMM. Taken from the cuDNN paper.
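For intuition, here is a minimal 1d analogue of the GEMM lowering (a sketch, not cuDNN's actual implementation; the function name conv1d_gemm is illustrative): stack the $K$-element windows into an (out_size, K) matrix, so that a single matrix product computes all outputs at once.

```python
import numpy as np

def conv1d_gemm(x, w, stride=1, padding=0):
    # Lower the input into a (out_size, K) matrix of windows ("im2col"
    # in 1d), then one matrix-vector product computes all outputs.
    K = len(w)
    xp = np.pad(x, padding)  # zero-pad both ends
    out_size = (len(x) + 2 * padding - K) // stride + 1
    cols = np.stack([xp[i * stride : i * stride + K] for i in range(out_size)])
    return cols @ w          # the GEMM step (here a matrix-vector product)

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 1.0])
print(conv1d_gemm(x, w))  # [3. 5. 7.]
```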

im2col (the transformation of the input into a matrix of windows) can be implemented efficiently using strides in memory. A detailed description can be found in this zhihu post. Here are some highlights:

  1. Assume each element of a 2d array occupies 8 bytes. The 2d array is stored contiguously in memory with strides (8m, 8n): viewed as a matrix, an element's right neighbor is the n-th element after it in memory, and the element below it is the m-th element after it in memory.
  2. By changing the strides and specifying the numbers of rows and columns, one obtains a different matrix view of the same underlying data.
  3. One can call 'as_strided' in numpy to transform an array into the 2d matrix required for convolution by specifying a shape and strides. The input strides (8r, 8c, 8m, 8n) can be understood in the convolution's sliding-window picture:
    • Inside one sliding window: m is the distance to the element below and n is the distance to the element to the right;
    • c is the distance between the first elements of two consecutive windows in the same row, and r is the distance between the first elements of two consecutive windows in the same column.
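The points above can be sketched in a minimal example, assuming a contiguous float64 array (itemsize 8 bytes), square kernels, and no padding; the helper name im2col_view is illustrative:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def im2col_view(x, K, S):
    """Return a (out_h, out_w, K, K) view of all KxK windows with stride S."""
    H, W = x.shape
    out_h = (H - K) // S + 1
    out_w = (W - K) // S + 1
    sz = x.itemsize  # 8 bytes for float64
    # Strides (r, c, m, n) in the sliding-window picture:
    # r, c move between windows (down / right);
    # m, n move inside one window (down / right).
    return as_strided(x, shape=(out_h, out_w, K, K),
                      strides=(S * W * sz, S * sz, W * sz, sz))

x = np.arange(16.0).reshape(4, 4)
cols = im2col_view(x, K=2, S=2)
print(cols[0, 0])  # top-left 2x2 window: [[0., 1.], [4., 5.]]
```

Reshaping this view to (out_h * out_w, K * K) and multiplying by the flattened kernels reproduces the GEMM formulation, without ever copying the input until the matrix product itself.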

