c - Multiplying large matrices is much slower with contiguous memory allocation


While implementing a neural network, I noticed that if I allocate memory as a single contiguous block for my data set arrays, the execution time increases several times.

Compare these two methods of memory allocation:

float** alloc_2d_float(int rows, int cols, int contiguous)
{
    int i;
    float** array = malloc(rows * sizeof(float*));

    if(contiguous)
    {
        float* data = malloc(rows*cols*sizeof(float));
        assert(data && "can't allocate contiguous memory");

        for(i=0; i<rows; i++)
            array[i] = &(data[cols * i]);
    }
    else
        for(i=0; i<rows; i++)
        {
            array[i] = malloc(cols * sizeof(float));
            assert(array[i] && "can't allocate memory");
        }

    return array;
}
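To make the difference between the two layouts concrete, here is a minimal sketch that repeats the allocation function and adds two helpers. `free_2d_float` and `rows_are_adjacent` are hypothetical names that do not appear in the original code; they only illustrate that in the contiguous case every row pointer aims into one block, so row `i` starts exactly `cols` floats after row `i - 1`.

```c
#include <assert.h>
#include <stdlib.h>

/* Same two layouts as alloc_2d_float in the question. */
float **alloc_2d_float(int rows, int cols, int contiguous)
{
    float **array = malloc(rows * sizeof(float *));
    assert(array && "can't allocate row pointers");

    if (contiguous) {
        /* One block holds all rows; array[i] points into it. */
        float *data = malloc((size_t)rows * cols * sizeof(float));
        assert(data && "can't allocate contiguous memory");
        for (int i = 0; i < rows; i++)
            array[i] = data + (size_t)cols * i;
    } else {
        /* Each row is its own allocation. */
        for (int i = 0; i < rows; i++) {
            array[i] = malloc(cols * sizeof(float));
            assert(array[i] && "can't allocate memory");
        }
    }
    return array;
}

/* Hypothetical helper: release whichever layout was allocated. */
void free_2d_float(float **array, int rows, int contiguous)
{
    if (contiguous) {
        free(array[0]);            /* array[0] is the base of the single block */
    } else {
        for (int i = 0; i < rows; i++)
            free(array[i]);
    }
    free(array);
}

/* Hypothetical helper: returns 1 if every row starts exactly
   cols floats after the previous one, 0 otherwise. */
int rows_are_adjacent(float **array, int rows, int cols)
{
    for (int i = 1; i < rows; i++)
        if (array[i] != array[i - 1] + cols)
            return 0;
    return 1;
}
```

For the contiguous layout `rows_are_adjacent` always returns 1. For separate allocations the result depends on the allocator, which usually places bookkeeping data or padding between blocks.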

Here are the results when compiling with -march=native -Ofast (I tried both gcc and clang):

michael@pascal:~/nn$ time ./test 300 1 0

multiplying (100000, 1000) and (300, 1000) arrays 1 times, noncontiguous memory allocation.

allocating memory:    0.2 seconds
initializing arrays:  0.8 seconds
dot product:          3.3 seconds

real    0m4.296s
user    0m4.108s
sys     0m0.188s

michael@pascal:~/nn$ time ./test 300 1 1

multiplying (100000, 1000) and (300, 1000) arrays 1 times, contiguous memory allocation.

allocating memory:    0.0 seconds
initializing arrays:  40.3 seconds
dot product:          13.5 seconds

real    0m53.817s
user    0m4.204s
sys     0m49.664s
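In the contiguous run above, almost all of the elapsed time is reported as `sys` rather than `user`. One way to see which phase accumulates that system time is to take a resource-usage snapshot before and after each phase. The following is only a sketch, assuming a POSIX system with `getrusage`; the `usage_snapshot` type and the `take_snapshot` and `print_usage_delta` helpers are hypothetical names, not part of the original test program.

```c
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical record of the counters we care about. */
typedef struct {
    double user_s;       /* CPU time spent in user mode, seconds */
    double sys_s;        /* CPU time spent in the kernel, seconds */
    long minor_faults;   /* page faults served without disk I/O */
} usage_snapshot;

/* Read the current process totals; fields stay 0 if getrusage fails. */
usage_snapshot take_snapshot(void)
{
    usage_snapshot s = {0.0, 0.0, 0};
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        s.user_s = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
        s.sys_s = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
        s.minor_faults = ru.ru_minflt;
    }
    return s;
}

/* Print how much each counter grew between two snapshots. */
void print_usage_delta(const char *label, usage_snapshot before,
                       usage_snapshot after)
{
    printf("%s: user %.3f s, sys %.3f s, minor faults %ld\n",
           label,
           after.user_s - before.user_s,
           after.sys_s - before.sys_s,
           after.minor_faults - before.minor_faults);
}
```

Wrapping the allocation, initialization, and multiplication phases in `take_snapshot()` pairs would show whether the extra system time comes with a large number of page faults, which is one possible explanation worth ruling in or out.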

Here's the code: https://github.com/michaelklachko/nn/blob/master/test.c

Note that both the initialization and the dot product are slower with contiguous memory.

I expected the opposite: a contiguous block of memory should be more cache-friendly than a large number of separate small blocks. Or at least the two should be similar in performance (this machine has 64 GB of RAM, and 90% of it is unused).

Edit: here's a compressed, self-contained version of the code (I still recommend using the GitHub version instead, since it has the measuring and formatting statements):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

float** alloc_2d_float(int rows, int cols, int contiguous){
    int i;
    float** array = malloc(rows * sizeof(float*));
    if(contiguous){
        /* one block for all rows; array[i] points into it */
        float* data = malloc(rows*cols*sizeof(float));
        for(i=0; i<rows; i++)
            array[i] = &(data[cols * i]);
    }
    else
        /* one separate block per row */
        for(i=0; i<rows; i++)
            array[i] = malloc(cols * sizeof(float));
    return array;
}

void initialize(float** array, int dim1, int dim2){
    srand(time(NULL));
    int i, j;
    for(i=0; i<dim1; i++)
        for(j=0; j<dim2; j++)
            /* integer division: 0 unless rand() returns RAND_MAX */
            array[i][j] = rand()/RAND_MAX;
}

int main(){
    int i,j,k, dim1=100000, dim2=1000, dim3=300;
    int contiguous=0;
    float temp;

    float** array1 = alloc_2d_float(dim1, dim2, contiguous);
    float** array2 = alloc_2d_float(dim3, dim2, contiguous);
    float** result = alloc_2d_float(dim1, dim3, contiguous);

    initialize(array1, dim1, dim2);
    initialize(array2, dim3, dim2);

    /* result[i][k] = dot product of row i of array1 and row k of array2 */
    for(i=0; i<dim1; i++)
        for(k=0; k<dim3; k++){
            temp = 0;
            for(j=0; j<dim2; j++)
                temp += array1[i][j] * array2[k][j];
            result[i][k] = temp;
        }
}

It looks like you've run into the compiler's ability (or inability) to vectorize the code. I tried to repeat your experiment, without success:

mick@mick-laptop:~/Загрузки$ ./a.out 100 1 0

multiplying (100000, 1000) and (100, 1000) arrays 1 times, noncontiguous memory allocation.

initializing arrays...

multiplying arrays...

execution time:
allocating memory: 0.1 seconds
initializing arrays: 0.9 seconds
dot product: 44.8 seconds

mick@mick-laptop:~/Загрузки$ ./a.out 100 1 1

multiplying (100000, 1000) and (100, 1000) arrays 1 times, contiguous memory allocation.

initializing arrays...

multiplying arrays...

execution time:
allocating memory: 0.0 seconds
initializing arrays: 1.0 seconds
dot product: 46.3 seconds
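One way to test the vectorization hypothesis is to isolate the inner loop and ask the compiler to report what it did with it. The sketch below is an illustration, not the code from the question: `dot_row` is a hypothetical function name and `dot.c` a hypothetical file name. The flags in the comment are existing GCC and Clang options for printing vectorization reports.

```c
/* Compile only, and print which loops were vectorized:
 *
 *   gcc   -Ofast -march=native -fopt-info-vec       -c dot.c
 *   clang -Ofast -march=native -Rpass=loop-vectorize -c dot.c
 *
 * A float sum like this one is normally vectorized only when the
 * compiler may reorder the additions (-Ofast or -ffast-math).
 */

/* Dot product of two rows of n floats; same loop body as the
 * innermost loop of the matrix multiplication in the question.
 * restrict tells the compiler the rows do not overlap. */
float dot_row(const float *restrict a, const float *restrict b, int n)
{
    float temp = 0.0f;
    for (int j = 0; j < n; j++)
        temp += a[j] * b[j];
    return temp;
}
```

If the report shows the loop vectorized in one build and not in another, that difference alone could account for a several-fold gap in the dot product time between machines or compilers.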

