c - Multiplying large matrices is much slower with contiguous memory allocation
While implementing a neural network, I noticed that if I allocate memory as a single contiguous block for the data set arrays, execution time increases several times.
Compare these 2 methods of memory allocation:
float** alloc_2d_float(int rows, int cols, int contiguous)
{
    int i;
    float** array = malloc(rows * sizeof(float*));

    if(contiguous)
    {
        float* data = malloc(rows*cols*sizeof(float));
        assert(data && "can't allocate contiguous memory");

        for(i=0; i<rows; i++)
            array[i] = &(data[cols * i]);
    }
    else
        for(i=0; i<rows; i++)
        {
            array[i] = malloc(cols * sizeof(float));
            assert(array[i] && "can't allocate memory");
        }

    return array;
}
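The two layouts also have to be released differently, which makes the difference between them concrete. The sketch below repeats the allocator and adds a hypothetical `free_2d_float` helper (not part of the original code): in the contiguous case `array[0]` is the start of the single data block, so one `free` releases all rows.

```c
#include <assert.h>
#include <stdlib.h>

/* Same allocation scheme as in the question. */
float **alloc_2d_float(int rows, int cols, int contiguous)
{
    float **array = malloc(rows * sizeof(float *));
    assert(array && "can't allocate row pointers");

    if (contiguous) {
        /* One block for all rows; row i starts at offset cols * i. */
        float *data = malloc((size_t)rows * cols * sizeof(float));
        assert(data && "can't allocate contiguous memory");
        for (int i = 0; i < rows; i++)
            array[i] = &data[(size_t)cols * i];
    } else {
        /* One separate block per row. */
        for (int i = 0; i < rows; i++) {
            array[i] = malloc(cols * sizeof(float));
            assert(array[i] && "can't allocate memory");
        }
    }
    return array;
}

/* Hypothetical helper: releases an array returned by alloc_2d_float. */
void free_2d_float(float **array, int rows, int contiguous)
{
    if (array == NULL)
        return;
    if (contiguous) {
        if (rows > 0)
            free(array[0]);      /* the single data block */
    } else {
        for (int i = 0; i < rows; i++)
            free(array[i]);      /* each row was allocated on its own */
    }
    free(array);                 /* the row-pointer table */
}
```

In both cases the caller indexes the result as `array[i][j]`; only the placement of the rows in memory differs.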
Here are the results when compiling with -march=native -Ofast
(I tried both gcc and clang):
michael@pascal:~/nn$ time ./test 300 1 0
multiplying (100000, 1000) , (300, 1000) arrays 1 times, noncontiguous memory allocation.
allocating memory: 0.2 seconds
initializing arrays: 0.8 seconds
dot product: 3.3 seconds

real 0m4.296s
user 0m4.108s
sys 0m0.188s

michael@pascal:~/nn$ time ./test 300 1 1
multiplying (100000, 1000) , (300, 1000) arrays 1 times, contiguous memory allocation.
allocating memory: 0.0 seconds
initializing arrays: 40.3 seconds
dot product: 13.5 seconds

real 0m53.817s
user 0m4.204s
sys 0m49.664s
Here's the code: https://github.com/michaelklachko/nn/blob/master/test.c
Note that both the initialization and the dot product are slower with contiguous memory.
I expected the opposite: a contiguous block of memory should be more cache friendly than a large number of separate small blocks. Or at least the two should be similar in performance (this machine has 64 GB of RAM, and 90% of it is unused).
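In the contiguous run above, `sys` accounts for about 49.7 of the 53.8 seconds, so most of the time is spent in the kernel rather than in the program's own loops. One way to narrow this down is to measure user and kernel CPU time around the first write to a freshly allocated block. The following is only a rough sketch, assuming a POSIX system with `getrusage`; the helper names `cpu_seconds` and `time_first_touch` are made up for illustration and do not appear in the original code.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical helper: CPU seconds used so far by this process,
   split into user-mode and kernel-mode time. */
static void cpu_seconds(double *user, double *sys)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    *user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    *sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

/* Hypothetical helper: allocates n floats (n > 0), writes every element
   once, and reports the user and kernel CPU time spent on the writes.
   Returns 0 on success, -1 if the allocation fails. */
int time_first_touch(size_t n, double *user, double *sys)
{
    float *data = malloc(n * sizeof(float));
    if (data == NULL)
        return -1;

    double u0, s0, u1, s1;
    cpu_seconds(&u0, &s0);
    for (size_t i = 0; i < n; i++)
        data[i] = 1.0f;
    cpu_seconds(&u1, &s1);

    /* Read one element back so the writes cannot be discarded. */
    volatile float sink = data[n / 2];
    (void)sink;

    *user = u1 - u0;
    *sys  = s1 - s0;
    free(data);
    return 0;
}
```

Calling `time_first_touch` once with the size of the whole data set and once per row-sized block would show whether the kernel time appears only for the large allocation.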
Edit: here's the compressed self-contained code (I still recommend using the GitHub version instead, as it has the measuring and formatting statements):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

float** alloc_2d_float(int rows, int cols, int contiguous){
    int i;
    float** array = malloc(rows * sizeof(float*));
    if(contiguous){
        /* one block for all rows; row i starts at offset cols * i */
        float* data = malloc(rows*cols*sizeof(float));
        for(i=0; i<rows; i++)
            array[i] = &(data[cols * i]);
    }
    else
        /* one separate block per row */
        for(i=0; i<rows; i++)
            array[i] = malloc(cols * sizeof(float));
    return array;
}

void initialize(float** array, int dim1, int dim2){
    srand(time(NULL));
    int i, j;
    for(i=0; i<dim1; i++)
        for(j=0; j<dim2; j++)
            /* the cast avoids integer division, which would almost always give 0 */
            array[i][j] = (float)rand()/RAND_MAX;
}

int main(){
    int i,j,k, dim1=100000, dim2=1000, dim3=300;
    int contiguous=0;
    float temp;

    float** array1 = alloc_2d_float(dim1, dim2, contiguous);
    float** array2 = alloc_2d_float(dim3, dim2, contiguous);
    float** result = alloc_2d_float(dim1, dim3, contiguous);

    initialize(array1, dim1, dim2);
    initialize(array2, dim3, dim2);

    /* result[i][k] = dot product of row i of array1 and row k of array2 */
    for(i=0; i<dim1; i++)
        for(k=0; k<dim3; k++){
            temp = 0;
            for(j=0; j<dim2; j++)
                temp += array1[i][j] * array2[k][j];
            result[i][k] = temp;
        }
}
It looks like you've run into the compiler's ability (or inability) to vectorise the code. I tried to repeat your experiment, but could not reproduce the difference:
mick@mick-laptop:~/Загрузки$ ./a.out 100 1 0
multiplying (100000, 1000) , (100, 1000) arrays 1 times, noncontiguous memory allocation.
initializing arrays...
multiplying arrays...
execution time:
allocating memory: 0.1 seconds
initializing arrays: 0.9 seconds
dot product: 44.8 seconds
mick@mick-laptop:~/Загрузки$ ./a.out 100 1 1
multiplying (100000, 1000) , (100, 1000) arrays 1 times, contiguous memory allocation.
initializing arrays...
multiplying arrays...
execution time:
allocating memory: 0.0 seconds
initializing arrays: 1.0 seconds
dot product: 46.3 seconds
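To check whether the inner loop is actually being vectorised, it can help to isolate it in a small function and ask the compiler to report on it. This is a minimal sketch, not the code from the question: the function name `dot` and the use of `restrict` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-alone version of the inner loop: the dot product of
   two rows of length n. restrict tells the compiler that the two rows do
   not overlap. */
float dot(const float *restrict a, const float *restrict b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

With the file saved as, say, dot.c, `gcc -Ofast -march=native -fopt-info-vec -c dot.c` makes GCC print the loops it vectorised, and `clang -Ofast -march=native -Rpass=loop-vectorize -c dot.c` does the same for Clang. Note that a floating-point sum like this one is normally vectorised only when reassociation is allowed, which -Ofast enables through -ffast-math.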