c++ - Halide optimal scheduling


I'm trying to work out the best schedule for a benchmark Halide code, and I might be missing something, because the timing results don't make sense to me.

I'm using AOT compilation, and here's the algorithm part of the code:

ImageParam input1(type_of<float>(), 3);
ImageParam input2(type_of<float>(), 3);

Func in1 = BoundaryConditions::constant_exterior(input1, 0.0f);
Func in2 = BoundaryConditions::constant_exterior(input2, 0.0f);

f1(x, y, z) = (in1(x + 1, y, z) + in1(x, y, z) + in1(x - 1, y, z));
f2(x, y, z) = (in2(x + 2, y, z) + in2(x + 1, y, z) + in2(x, y, z) + in2(x - 1, y, z) + in2(x - 2, y, z));
res(x, y, z) = f1(x, y, z) + f1(x - 1, y, z) + f2(x - 1, y, z) + f2(x, y, z);
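For checking the AOT output on small inputs, a plain C++ reference of the same algorithm can be handy. This is only a sketch: the `Volume` struct and `reference_res` are hypothetical names (not part of Halide), and it assumes a dense layout with x as the innermost dimension. Out-of-bounds reads return 0, mirroring `constant_exterior(input, 0.0f)`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: a dense 3-D float volume, x innermost.
struct Volume {
    int w, h, d;
    std::vector<float> data;

    Volume(int w_, int h_, int d_)
        : w(w_), h(h_), d(d_),
          data(static_cast<std::size_t>(w_) * h_ * d_, 0.0f) {}

    std::size_t index(int x, int y, int z) const {
        return (static_cast<std::size_t>(z) * h + y) * w + x;
    }

    float& at(int x, int y, int z) { return data[index(x, y, z)]; }

    // Read with a constant exterior of 0 outside the volume.
    float padded(int x, int y, int z) const {
        if (x < 0 || x >= w || y < 0 || y >= h || z < 0 || z >= d) return 0.0f;
        return data[index(x, y, z)];
    }
};

// Straightforward, unscheduled evaluation of res = f1 + f1 + f2 + f2.
Volume reference_res(const Volume& a, const Volume& b) {
    assert(a.w == b.w && a.h == b.h && a.d == b.d);
    Volume out(a.w, a.h, a.d);

    // 3-tap stencil on the first input.
    auto f1 = [&](int x, int y, int z) {
        return a.padded(x + 1, y, z) + a.padded(x, y, z) + a.padded(x - 1, y, z);
    };
    // 5-tap stencil on the second input.
    auto f2 = [&](int x, int y, int z) {
        return b.padded(x + 2, y, z) + b.padded(x + 1, y, z) + b.padded(x, y, z) +
               b.padded(x - 1, y, z) + b.padded(x - 2, y, z);
    };

    for (int z = 0; z < a.d; ++z)
        for (int y = 0; y < a.h; ++y)
            for (int x = 0; x < a.w; ++x)
                out.at(x, y, z) = f1(x, y, z) + f1(x - 1, y, z) +
                                  f2(x - 1, y, z) + f2(x, y, z);
    return out;
}
```

Comparing this against the scheduled pipeline on a small volume is a cheap way to confirm that a schedule change did not alter the result.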

For the schedule I have:

f1.store_at(res, y).compute_at(res, yi).vectorize(x, 8);
f2.store_at(res, y).compute_at(res, yi).vectorize(x, 8);
res.split(y, y, yi, 8).vectorize(x, 8).parallel(y);
res.print_loop_nest();

I use the current_time function to time the execution of my code. When I use the schedule above for both f1 and f2, the execution time is longer than when I use the schedule on only one of these Funcs. Considering the structure of the stencils, I'd have thought that scheduling both of them would improve performance. What am I missing here? When I print the loop nest, I see the following generated code:

  k:
    parallel j.j:
      store f1:
        store f2:
          j.in_y in [0, 7]:
            produce f1:
              k:
                j:
                  i.i:
                    vectorized i.v122 in [0, 7]:
                      f1(...) = ...
            consume f1:
              produce f2:
                k:
                  j:
                    i.i:
                      vectorized i.v126 in [0, 7]:
                        f2(...) = ...
              consume f2:
                i.i:
                  vectorized i.v133 in [0, 7]:
                    result(...) = ...
consume result:

Is it just the indentation, or is produce f2 nested within produce f1? Any suggestions for a better schedule?

I think this is just memory-bandwidth limited. The few adds implied by inlining f1 or f2 into res don't matter much. Indeed, I get the best performance with the following schedule:

in1.compute_at(res, yi).vectorize(in1.args()[0], 8);
in2.compute_at(res, yi).vectorize(in2.args()[0], 8);
res.split(y, y, yi, 8).vectorize(x, 8).parallel(y);

i.e. pulling in a padded scanline of each of the inputs and then doing all the math inlined.

But it's barely faster than yours. The difference might be noise. Full experiment:
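Roughly, that schedule corresponds to loops like the ones below. This is a hand-written plain C++ sketch, not Halide output: `res_scanline_schedule` and `load_padded_scanline` are hypothetical names, the layout is assumed dense with x innermost, and the parallelism, the vectorization and Halide's own handling of a partial last tile are left out (the tail is simply clamped here).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy row (y, z) of src into line, covering x in [x_min, x_min + line.size()),
// with zeros wherever x falls outside [0, w).
static void load_padded_scanline(const std::vector<float>& src, int w, int h,
                                 int y, int z, int x_min,
                                 std::vector<float>& line) {
    const std::size_t row = (static_cast<std::size_t>(z) * h + y) * w;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const int x = x_min + static_cast<int>(i);
        line[i] = (x >= 0 && x < w) ? src[row + x] : 0.0f;
    }
}

std::vector<float> res_scanline_schedule(const std::vector<float>& in1,
                                         const std::vector<float>& in2,
                                         int w, int h, int d) {
    assert(in1.size() == static_cast<std::size_t>(w) * h * d);
    assert(in2.size() == in1.size());
    std::vector<float> out(in1.size());

    // res(x) reads in1 over [x - 2, x + 1] and in2 over [x - 3, x + 2],
    // so a full row needs in1 on [-2, w] and in2 on [-3, w + 1].
    std::vector<float> a(w + 3), b(w + 5);
    auto A = [&](int x) { return a[x + 2]; };
    auto B = [&](int x) { return b[x + 3]; };

    for (int z = 0; z < d; ++z) {
        for (int y_outer = 0; y_outer < h; y_outer += 8) {  // parallel(y) in Halide
            for (int yi = 0; yi < 8 && y_outer + yi < h; ++yi) {
                const int y = y_outer + yi;
                // in1.compute_at(res, yi), in2.compute_at(res, yi)
                load_padded_scanline(in1, w, h, y, z, -2, a);
                load_padded_scanline(in2, w, h, y, z, -3, b);
                const std::size_t row = (static_cast<std::size_t>(z) * h + y) * w;
                for (int x = 0; x < w; ++x) {  // vectorize(x, 8) in Halide
                    // f1 and f2 are inlined at both call sites.
                    const float f1_x  = A(x + 1) + A(x) + A(x - 1);
                    const float f1_xm = A(x) + A(x - 1) + A(x - 2);
                    const float f2_xm = B(x + 1) + B(x) + B(x - 1) + B(x - 2) + B(x - 3);
                    const float f2_x  = B(x + 2) + B(x + 1) + B(x) + B(x - 1) + B(x - 2);
                    out[row + x] = f1_x + f1_xm + f2_xm + f2_x;
                }
            }
        }
    }
    return out;
}
```

The only intermediate storage is two short scanline buffers, which is why the working set stays small.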

https://gist.github.com/abadams/c2e6f67d79e1768af6db5afcabb1caab
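When two schedules are this close, one common way to reduce timing noise is to run the pipeline several times and keep the best sample. The helper below is a minimal standalone sketch of that idea; `best_time_ms` is a hypothetical name, not a Halide API, and the sample and iteration counts are whatever the caller picks.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Runs fn() `iterations` times per sample, takes `samples` samples, and
// returns the lowest average time per call, in milliseconds.
template <typename Fn>
double best_time_ms(int samples, int iterations, Fn&& fn) {
    assert(samples > 0 && iterations > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int s = 0; s < samples; ++s) {
        const auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i) {
            fn();
        }
        const auto t1 = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count() / iterations;
        best = std::min(best, ms);
    }
    return best;
}
```

A call would look like `best_time_ms(10, 10, [&] { /* run the AOT-compiled pipeline here */ })`. Taking the minimum rather than the mean discards samples inflated by other activity on the machine.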

The produce of f2 is nested inside the consume of f1. That's normal: it doesn't use f1, but it's used by something that does use f1, so that's a reasonable place for it to end up.
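Going back to the bandwidth point, the cost of inlining can be counted directly from the definitions. The sketch below is just that arithmetic; the function names are made up for illustration, and the byte count assumes ideal reuse (each input float read from memory once, each output written once), which a real run will not quite reach.

```cpp
#include <cassert>

// Summing n values takes n - 1 additions.
constexpr int adds_for_terms(int n) { return n - 1; }

constexpr int kF1Adds  = adds_for_terms(3);  // f1: 3 taps of in1
constexpr int kF2Adds  = adds_for_terms(5);  // f2: 5 taps of in2
constexpr int kResAdds = adds_for_terms(4);  // res: f1 + f1 + f2 + f2

// f1 and f2 stored: each value computed once per pixel
// (ignoring any redundant work at tile edges).
constexpr int adds_per_pixel_stored() { return kF1Adds + kF2Adds + kResAdds; }

// f1 and f2 inlined: each is recomputed at both of its call sites in res.
constexpr int adds_per_pixel_inlined() {
    return 2 * kF1Adds + 2 * kF2Adds + kResAdds;
}

// Assumed lower bound on memory traffic per output pixel:
// one float from each input plus one float written.
constexpr int min_bytes_per_pixel() {
    return 3 * static_cast<int>(sizeof(float));
}

double adds_per_byte(int adds, int bytes) {
    assert(bytes > 0);
    return static_cast<double>(adds) / bytes;
}

static_assert(adds_per_pixel_stored() == 9, "2 + 4 + 3 adds");
static_assert(adds_per_pixel_inlined() == 15, "4 + 8 + 3 adds");
```

So inlining costs 6 extra adds per output pixel, against at least 12 bytes moved per pixel with 4-byte floats. Whether that ratio is low enough to be bandwidth-bound depends on the machine.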

