C++ - Halide optimal scheduling
I'm trying to work out the best schedule for a benchmark in Halide, and I might be missing something, because the timing results don't make sense to me.
I'm using AOT compilation, and here's the algorithm part of the code:
    ImageParam input1(type_of<float>(), 3);
    ImageParam input2(type_of<float>(), 3);
    Func in1 = BoundaryConditions::constant_exterior(input1, 0.0f);
    Func in2 = BoundaryConditions::constant_exterior(input2, 0.0f);

    f1(x, y, z) = in1(x + 1, y, z) + in1(x, y, z) + in1(x - 1, y, z);
    f2(x, y, z) = in2(x + 2, y, z) + in2(x + 1, y, z) + in2(x, y, z) +
                  in2(x - 1, y, z) + in2(x - 2, y, z);
    res(x, y, z) = f1(x, y, z) + f1(x - 1, y, z) + f2(x - 1, y, z) + f2(x, y, z);
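When comparing schedules it helps to have a schedule-independent reference to check the output against. The following is a minimal sketch in plain C++ (no Halide) of the same arithmetic, assuming a dense layout with x varying fastest. The `Volume` type and the `f1_ref`, `f2_ref`, and `res_ref` names are hypothetical helpers introduced here for illustration; they are not part of the original code or of Halide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense 3D float volume, x fastest-varying. Not a Halide type.
struct Volume {
    int w, h, d;
    std::vector<float> data;
    Volume(int w_, int h_, int d_)
        : w(w_), h(h_), d(d_),
          data(static_cast<std::size_t>(w_) * h_ * d_, 0.0f) {}
    float &at(int x, int y, int z) {
        return data[(static_cast<std::size_t>(z) * h + y) * w + x];
    }
    // Reads outside the volume return 0, mimicking
    // BoundaryConditions::constant_exterior(input, 0.0f).
    float padded(int x, int y, int z) const {
        if (x < 0 || x >= w || y < 0 || y >= h || z < 0 || z >= d) return 0.0f;
        return data[(static_cast<std::size_t>(z) * h + y) * w + x];
    }
};

// 3-tap sum along x, matching f1 in the question.
inline float f1_ref(const Volume &in1, int x, int y, int z) {
    return in1.padded(x + 1, y, z) + in1.padded(x, y, z) + in1.padded(x - 1, y, z);
}

// 5-tap sum along x, matching f2 in the question.
inline float f2_ref(const Volume &in2, int x, int y, int z) {
    return in2.padded(x + 2, y, z) + in2.padded(x + 1, y, z) + in2.padded(x, y, z) +
           in2.padded(x - 1, y, z) + in2.padded(x - 2, y, z);
}

// Evaluates res over the whole volume with f1 and f2 fully inlined.
Volume res_ref(const Volume &in1, const Volume &in2) {
    assert(in1.w == in2.w && in1.h == in2.h && in1.d == in2.d);
    Volume out(in1.w, in1.h, in1.d);
    for (int z = 0; z < out.d; z++)
        for (int y = 0; y < out.h; y++)
            for (int x = 0; x < out.w; x++)
                out.at(x, y, z) = f1_ref(in1, x, y, z) + f1_ref(in1, x - 1, y, z) +
                                  f2_ref(in2, x - 1, y, z) + f2_ref(in2, x, y, z);
    return out;
}
```

The output of any scheduled pipeline can then be compared element by element against `res_ref`, since a schedule should change only the order of evaluation and not the values.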
For the schedule I have:
    f1.store_at(res, y).compute_at(res, yi).vectorize(x, 8);
    f2.store_at(res, y).compute_at(res, yi).vectorize(x, 8);
    res.split(y, y, yi, 8).vectorize(x, 8).parallel(y);
    res.print_loop_nest();
I use the current_time function to time the execution of my code. When I use the schedule above for both f1 and f2, the execution time is longer than when I apply the schedule to only one of these Funcs. Considering the structure of the stencils, I'd have thought that scheduling both of them would improve performance. What am I missing here? When I print the loop nest, I see the following generated code:
    for k:
      parallel j.j:
        store f1:
          store f2:
            for j.in_y in [0, 7]:
              produce f1:
                for k:
                  for j:
                    for i.i:
                      vectorized i.v122 in [0, 7]:
                        f1(...) = ...
              consume f1:
                produce f2:
                  for k:
                    for j:
                      for i.i:
                        vectorized i.v126 in [0, 7]:
                          f2(...) = ...
                consume f2:
                  for i.i:
                    vectorized i.v133 in [0, 7]:
                      result(...) = ...
    consume result:
Is it just the indentation, or is the produce of f2 nested within the produce of f1? Any suggestions for a better schedule?
I think you're memory-bandwidth limited. The few extra adds implied by inlining f1 or f2 into res don't matter. Indeed, I get the best performance with the following schedule:
    in1.compute_at(res, yi).vectorize(in1.args()[0], 8);
    in2.compute_at(res, yi).vectorize(in2.args()[0], 8);
    res.split(y, y, yi, 8).vectorize(x, 8).parallel(y);
I.e. pulling in a padded scanline of each of the inputs and then doing all the math inlined.
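A rough operation count illustrates why the extra adds from inlining are cheap next to the memory traffic. The sketch below counts additions per output point for the stencils in the question, with f1 and f2 either staged (computed once and stored) or inlined (each evaluated twice per output). The byte count is an assumption: it supposes ideal caching, so that each output costs one float loaded from each input and one float stored. The function names are illustrative only.

```cpp
#include <cassert>

// Additions needed to evaluate one point of each stage in the question.
constexpr int kAddsF1 = 2;   // 3 taps summed
constexpr int kAddsF2 = 4;   // 5 taps summed
constexpr int kAddsRes = 3;  // 4 terms summed

// f1 and f2 computed once per point and stored
// (ignoring the one extra halo column per scanline).
constexpr int adds_staged() { return kAddsF1 + kAddsF2 + kAddsRes; }

// f1 and f2 inlined into res: each is evaluated at x and at x - 1.
constexpr int adds_inlined() { return 2 * kAddsF1 + 2 * kAddsF2 + kAddsRes; }

// ASSUMPTION: ideal caching, i.e. two float loads and one float store
// reach main memory per output point.
constexpr int bytes_per_output() { return 3 * static_cast<int>(sizeof(float)); }

// Additions performed per byte of main-memory traffic.
inline double adds_per_byte(int adds) {
    assert(adds >= 0);
    return static_cast<double>(adds) / bytes_per_output();
}
```

Under these assumptions inlining raises the work from 9 to 15 additions per output, which with 4-byte floats is still only around one addition per byte moved. On a machine that can do many more arithmetic operations than that per byte of memory bandwidth, the extra adds would be hidden behind the loads and stores.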
But it's barely faster than yours. The difference might just be noise. Full experiment:
https://gist.github.com/abadams/c2e6f67d79e1768af6db5afcabb1caab
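When two schedules differ by an amount that could be noise, a common approach is to run each one several times and compare the fastest runs. The following is a minimal sketch of such a harness using only the C++ standard library; `best_of` is a hypothetical name, and this is not the timing code used in the question or in the linked experiment.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Runs fn `trials` times and returns the fastest wall-clock time in
// milliseconds. Taking the minimum filters out most one-off disturbances
// such as cold caches or other processes being scheduled.
template <typename Fn>
double best_of(int trials, Fn &&fn) {
    assert(trials > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < trials; i++) {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        auto t1 = std::chrono::steady_clock::now();
        std::chrono::duration<double, std::milli> ms = t1 - t0;
        best = std::min(best, ms.count());
    }
    return best;
}
```

It could be called as `best_of(10, [&] { run_pipeline(); })`, where `run_pipeline` stands in for whatever AOT-compiled function is being measured. If the best times of two schedules still differ by only a few percent, treating them as equivalent is probably the safer conclusion.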
The produce of f2 is nested inside the consume of f1. That's normal: f2 doesn't use f1, but it's only used by things that also use f1, so that's a reasonable place for it to end up.
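The nesting is easier to read when written out as ordinary loops. The sketch below is hand-written C++ that mirrors the shape of the printed loop nest for a single 2D slice. It is not Halide's generated code: it ignores vectorization and parallelism, handles the tail of the row split with a simple bounds check, and the names `load` and `res_by_rows` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Zero outside [0, w), like constant_exterior(input, 0.0f) along x.
static float load(const std::vector<float> &in, int w, int x, int y) {
    if (x < 0 || x >= w) return 0.0f;
    return in[static_cast<std::size_t>(y) * w + x];
}

std::vector<float> res_by_rows(const std::vector<float> &in1,
                               const std::vector<float> &in2, int w, int h) {
    assert(w > 0 && h > 0);
    assert(in1.size() == static_cast<std::size_t>(w) * h);
    assert(in2.size() == in1.size());
    std::vector<float> res(in1.size());
    for (int y0 = 0; y0 < h; y0 += 8) {  // outer row loop (parallel in Halide)
        // "store f1" / "store f2": allocated once per strip of rows.
        // Element i holds the value at x = i - 1, because res reads x - 1.
        std::vector<float> f1(w + 1), f2(w + 1);
        for (int yi = 0; yi < 8 && y0 + yi < h; yi++) {  // inner row loop
            const int y = y0 + yi;
            // produce f1
            for (int x = -1; x < w; x++)
                f1[x + 1] = load(in1, w, x + 1, y) + load(in1, w, x, y) +
                            load(in1, w, x - 1, y);
            {   // consume f1
                // produce f2 (does not read f1, but lives in its consume scope)
                for (int x = -1; x < w; x++)
                    f2[x + 1] = load(in2, w, x + 2, y) + load(in2, w, x + 1, y) +
                                load(in2, w, x, y) + load(in2, w, x - 1, y) +
                                load(in2, w, x - 2, y);
                {   // consume f2: res reads both f1 and f2
                    for (int x = 0; x < w; x++)
                        res[static_cast<std::size_t>(y) * w + x] =
                            f1[x + 1] + f1[x] + f2[x] + f2[x + 1];
                }
            }
        }
    }
    return res;
}
```

Seen this way, "consume f1" is just the region in which f1's buffer is valid. The loop that fills f2 sits in that region only because the one consumer of both buffers, the loop computing res, has to come after both producers.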