Excel - Match Forecasted Values to Actuals
I receive a daily file that forecasts values for given categories. Where FileDate = FcstDate, the forecast value is real, i.e. it is the actual value. Right now I'm using Excel Power Query (Excel 2016: Get & Transform) to pull dozens of files into a table resembling the one below (400k+ rows and 18 levels in reality).
I need to be able to say: on 1-1 and 1-2, category AA|AC|null forecast 60 and 44 respectively for 1-3, and on 1-3 the actual value was 43. Ditto for every other row. Most, but not all, unique row combinations are common between the files. I'll also have to worry about dealing with renamed levels...
The Table.Partition, Table.FillUp, and Table.FromPartitions Power Query functions express the logic perfectly, but Power Query is far too slow, because it seems to read each large .xlsx file multiple times (+1x per row?!), made worse because I'd need to add an index table of the distinct category levels & forecast dates to partition on.
Right now I'm reduced to using a formula in an Excel table:

    =SUMIFS([ActualVal], [Lvl1],[@Lvl1], [Lvl2],[@Lvl2], [Lvl3],[@Lvl3], [FileDt],[@FcstDt], [Eq],"y")
However, this requires setting blanks to 'null', changing values that start with "=" or ">" (which SUMIFS reads as comparison operators), etc., and it takes hours to calculate.
I've been trying to learn PowerPivot/DAX because I understand it can efficiently filter & calculate over large data sets. I'm hoping the solution is to set the 'context' of a DAX calculation to the same row I reference via the old-fashioned Excel formula, and move that value into the 'Wanted' column, but I haven't figured it out.
I'd prefer a PowerPivot solution if possible; if not, I can make sense of Python/pandas. However, we're stuck with Excel input from the 3rd-party provider.
    Lvl1 | Lvl2 | Lvl3   | FileDt | FcstDt | Eq | FcstVal | ActualVal | Wanted!
    1-1 file: ______________________________________________________________
    aa   | ab   | ad     | 1-1    | 1-1    | y  | 100     | 100       | 100
    aa   | ac   | ae     | 1-1    | 1-1    | y  | 50      | 50        | 50
    aa   | ab   | (null) | 1-1    | 1-2    |    | 110     |           | 105
    aa   | ac   | (null) | 1-1    | 1-2    |    | (null)  |           | 45
    aa   | ab   | (null) | 1-1    | 1-3    |    | 120     |           | 105
    aa   | ac   | (null) | 1-1    | 1-3    |    | 70      |           | 43
    1-2 file: ______________________________________________________________
    aa   | ab   | (null) | 1-2    | 1-2    | y  | 105     | 105       | 105
    aa   | ac   | (null) | 1-2    | 1-2    | y  | 45      | 45        | 45
    aa   | ab   | (null) | 1-2    | 1-3    |    | 113     |           | (null)
    aa   | ac   | (null) | 1-2    | 1-3    |    | 44      |           | 43
    1-3 file: ______________________________________________________________
    (missing row aa|ab!)  | 1-3    | 1-3    | y  | (null)  | (null)    | (null)
    aa   | ac   | (null) | 1-3    | 1-3    | y  | 43      | 43        | 43
    aa   | ab   | (null) | 1-3    | 1-4    |    | 108     |           | (null)
    aa   | ac   | (null) | 1-3    | 1-4    |    | 42      |           | (null)
Edit:

I'll share my code because parts of it might be useful to others, and the problem might be in other parts.
My strategy is to load the set of workbooks listed in a table in the open Excel file (the Sources table, below), apply a simple function that extracts the table I want from each workbook's contents, and then apply a function that does as much processing as possible on the tables while they are still separate, thinking multithreading may be better leveraged because they're still independent (is that right?).
This ends the first query (Workbooks, below). I'd prefer to stop here and use PowerPivot if it can do the rest (with a final Table.Combine if necessary).
To do it in Power Query, I have to combine the tables, twice. The first combination has all the fields, while the second is the distinct set of grouping fields across the tables (without the value or as-of date fields). A single (i.e. the first) table cannot be used because grouping combinations may exist in later tables that aren't in the first, & vice versa. The distinct table gets an index.
I join the second to the first via Table.NestedJoin & extract the index from the joined column. This allows me to divide the data into partitions that have the same forecast date & groups. Here I can Table.FillDown, because the tables are pre-sorted in descending order of date in the Prep_Data_Table function, so the actual value, if it exists, flows down to the others of the same group, and no further.
After this, I recombine the tables. (A compact sketch of the same idea follows.)
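For illustration, the same group-sort-filldown logic can be expressed more compactly with Table.Group; a minimal sketch (not my production query), assuming the CategoryFields list and the prepped tables from the Workbooks query defined below:

    let
        CategoryFields = ImportParameters[CategoryFields],
        Combined = Table.Combine(Workbooks[Data]),
        // One group per category combination + forecast period; newest file first,
        // so the known actual (if any) fills down through the group's forecasts.
        Grouped = Table.Group(Combined, List.Combine({CategoryFields, {"Period"}}),
            {{"Filled", each Table.FillDown(Table.Sort(_, {{"As-Of Date", Order.Descending}}), {"Actual"}), type table}}),
        Output = Table.Combine(Grouped[Filled])
    in
        Output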
Code:

FieldMetadata holds data type & ordering information for the fields. Sources holds the pathnames & whether or not to load each specified file.
ImportParameters:

    [
        ThisWB = Excel.CurrentWorkbook(),
        Sources = ThisWB{[Name="Sources"]}[Content],
        FieldMetadata = ThisWB{[Name="FieldMetadata"]}[Content],
        FieldTypes = Table.ToRows(GetCfg({"Type"})),
        CategoryFields = List.Buffer(
            List.Transform(
                List.Select(
                    List.Transform(FieldTypes, each {List.First(_), TypeFromString(List.Last(_))}),
                    each List.Last(_) = type text),
                each List.First(_))),
        CategoryFieldTypes = List.Buffer(
            List.Transform(FieldTypes, (_) => {List.First(_), TypeFromString(List.Last(_))}))
    ]
GetCfg:

    let
        Cfg = (Columns as list) as table =>
            let
                FilterList = List.Transform(Columns, each "[" & _ & "] <> null"),
                ExpressionText = Text.Combine(FilterList, " and "),
                Source = Excel.CurrentWorkbook(){[Name="FieldMetadata"]}[Content],
                #"Changed Type" = Table.TransformColumnTypes(Source,
                    {{"Field", type text}, {"Type", type text}, {"Grouping", Int32.Type}, {"Presentation", Int32.Type}}),
                Custom1 = Table.SelectColumns(#"Changed Type", List.Combine({{"Field"}, Columns})),
                #"Filtered Rows" = Table.SelectRows(Custom1, each Expression.Evaluate(ExpressionText, [_=_]))
                /* The line above is a bit of a mind bender. It lets me apply filters
                   without hard-coding column names. Very useful. Credit:
                   http://www.thebiccountant.com/2016/03/08/select-rows-that-have-no-empty-fields-using-expression-evaluate-in-power-bi-and-power-query/ */
            in
                #"Filtered Rows"
    in
        Cfg
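For example, a hypothetical call with two columns:

    // Keeps the FieldMetadata rows where both [Type] and [Grouping] are populated;
    // the generated ExpressionText is "[Type] <> null and [Grouping] <> null".
    TypedAndGrouped = GetCfg({"Type", "Grouping"})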
FieldSortOrder:

    let
        SortOn = (SortOn as text) as list =>
            let
                Source = ImportParameters[FieldMetadata],
                #"Changed Type" = Table.TransformColumnTypes(Source, {{"Field", type text}, {"Grouping", type number}}),
                SelectedSort = Table.SelectColumns(Source, {"Field", SortOn}),
                RenamedSortColumn = Table.RenameColumns(SelectedSort, {{SortOn, "Sort"}}),
                NoNulls = Table.SelectRows(RenamedSortColumn, each ([Sort] <> null)),
                SortedFields = Table.Sort(NoNulls, {{"Sort", Order.Ascending}})[Field]
            in
                SortedFields
    in
        SortOn
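Illustrative usage, assuming the Grouping column of FieldMetadata is populated:

    // The field names, in ascending order of their Grouping values.
    GroupingOrder = FieldSortOrder("Grouping")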
TypeFromString:

    let
        TypeFromString = (TypeName as text) as type =>
            let
                TypeNameFix = if TypeName = "table" then "_table" else TypeName,  // because "table" is a reserved word
                Typr = [any = Any.Type, binary = Binary.Type,
                        // ... whole list of all the types I could find ...
                        _table = Table.Type,
                        // ...
                        webmethod = WebMethod.Type],
                TheType = try Record.Field(Typr, TypeNameFix)
                    otherwise error [Reason = "TypeName not found",
                                     Message = "Parameter not found among the list of types defined within the TypeFromString function."]
            in
                TheType
    in
        TypeFromString
Extract_Data_Table:

    let
        Source = (Workbook as table) as table =>
            let
                #"Filtered Rows" = Table.SelectRows(Workbook, each ([Kind] = "Table" and ([Item] = "report data" or [Item] = "report_data"))),
                #"Select Columns" = Table.SelectColumns(#"Filtered Rows", "Data"),
                DataTable = #"Select Columns"[Data]{0}
            in
                DataTable
    in
        Source
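Illustrative usage against a single workbook (the path is hypothetical):

    // Excel.Workbook returns the navigation table; Extract_Data_Table picks out
    // the report table from it.
    Raw    = Excel.Workbook(File.Contents("C:\data\fcst 1-1.xlsx")),
    Report = Extract_Data_Table(Raw)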
Prep_Data_Table:

    let
        PrepParams = (HorizonEnd as date, CategoryFieldTypes as list) as function =>
            let
                BufferedFieldTypes = List.Buffer(CategoryFieldTypes),
                Source = (InputTable as table, FileDate as date) as table =>
                    let
                        EndFields = {"As-Of Date", "Period", "Actual", "Forecast"},
                        PeriodsAsDates = Table.TransformColumnTypes(InputTable, {{"Period", type date}}),
                        #"Remove Errors" = Table.RemoveRowsWithErrors(PeriodsAsDates, {"Period"}),
                        WithinHorizon = Table.SelectRows(#"Remove Errors", each ([Period] <= HorizonEnd)),
                        RenamedVal = Table.RenameColumns(WithinHorizon, {"Val", "Forecast"}),  // the forecast is named "Val" in the source files
                        MovedActual = Table.AddColumn(RenamedVal, "Actual",
                            each if [Period] = FileDate then (if [Forecast] = null then 0 else [Forecast]) else null),
                        IncludeAsOfDate = Table.AddColumn(MovedActual, "As-Of Date", each FileDate, Date.Type),
                        AppliedCategoryFieldTypes = Table.TransformColumnTypes(IncludeAsOfDate, BufferedFieldTypes),
                        TransformedColumns = Table.TransformColumns(AppliedCategoryFieldTypes,
                            {{"{values}", Text.Trim, type text},
                             {"Actual", Number.Abs, Currency.Type},
                             {"Forecast", Number.Abs, Currency.Type}}),
                        // Descending order is important because Table.FillDown is more efficient than Table.FillUp.
                        Sorted = Table.Sort(TransformedColumns, {{"Actual", Order.Descending}}),
                        OutputTable = Table.SelectColumns(Sorted,
                            List.Distinct(List.Combine({List.Transform(BufferedFieldTypes, each List.First(_)), EndFields}))),
                        Output = OutputTable
                    in
                        Output
            in
                Source
    in
        PrepParams
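Prep_Data_Table is applied in two stages: the outer call captures the horizon end and field types once, and the returned function is applied per file. A hypothetical illustration (the dates and field types are made up):

    PrepFunc = Prep_Data_Table(#date(2017, 1, 3),
        {{"Lvl1", type text}, {"Lvl2", type text}, {"Lvl3", type text}}),
    Prepped  = PrepFunc(SomeRawTable, #date(2017, 1, 1))  // SomeRawTable stands in for one file's report table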
Workbooks:

    let
        // Import data
        Source = ImportParameters[Sources],
        #"Changed Type" = Table.TransformColumnTypes(Source,
            {{"As-Of Date", type date}, {"Folder Path", type text}, {"Tab", type text}, {"Load", type logical}}),
        #"Filtered Rows" = Table.SelectRows(#"Changed Type", each ([Load] = true)),
        WorkbookPaths = Table.AddColumn(#"Filtered Rows", "File Path", each [Folder Path] & [File], type text),
        LoadWorkbooks = Table.AddColumn(WorkbookPaths, "Data",
            each Excel.Workbook(File.Contents([File Path])) meta [#"As-Of Date" = [#"As-Of Date"]]),
        LoadDataTables = Table.TransformColumns(LoadWorkbooks,
            {"Data", each Extract_Data_Table(_) meta [#"As-Of Date" = Value.Metadata(_)[#"As-Of Date"]]}),
        PrepFunc = Prep_Data_Table(List.Max(LoadDataTables[#"As-Of Date"]), ImportParameters[CategoryFieldTypes]),
        // The TransformColumns step references the column's list, not the table, so the As-Of Date
        // field of the row is out of scope. Metadata brings the As-Of Date value into the same scope.
        PrepDataTables = Table.TransformColumns(LoadDataTables,
            {"Data", each Table.Buffer(PrepFunc(_, Value.Metadata(_)[#"As-Of Date"]))}),
        Output = Table.SelectColumns(PrepDataTables, {"Data", "As-Of Date"})
    in
        Output
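The meta trick in isolation, with hypothetical names throughout (T, DoWork, UseTag, SomeField, TagField), since the cell-level function passed to Table.TransformColumns never sees the rest of the row:

    // Attach a row-level value to the new cell as metadata...
    Tagged   = Table.AddColumn(T, "Data", each DoWork([SomeField]) meta [Tag = [TagField]]),
    // ...so a later cell-level transform can read it back off the cell.
    Consumed = Table.TransformColumns(Tagged, {"Data", each UseTag(_, Value.Metadata(_)[Tag])})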
MakeComparison:

    let
        CategoryFields = ImportParameters[CategoryFields],
        DataTableList = Workbooks[Data],
        CategoryIndex = Table.AddIndexColumn(
            Table.Distinct(Table.Combine(List.Transform(DataTableList, each Table.SelectColumns(_, CategoryFields)))),
            "Index"),
        ListOfDataTablesWithNestedIndexTable = List.Transform(DataTableList,
            each Table.NestedJoin(_, CategoryFields, CategoryIndex, CategoryFields, "Index", JoinKind.Inner)),
        ListOfIndexedDataTables = List.Transform(ListOfDataTablesWithNestedIndexTable,
            each Table.TransformColumns(_, {"Index", each List.Single(Table.Column(_, "Index")) as number, type number})),
        Appended = Table.Combine(ListOfIndexedDataTables),
        Merged = Table.Join(CategoryIndex, "Index", Table.SelectColumns(Appended, {"As-Of Date", "Actual", "Forecast", "Index"}), "Index"),
        // One partition per index value (the hash is the index itself).
        Partitioned = Table.Partition(Merged, "Index", Table.RowCount(CategoryIndex), each _),
        CopiedActuals = List.Transform(Partitioned, each Table.FillDown(_, {"Actual"})),
        ToUnpartition = List.Transform(CopiedActuals, each {List.First(_[Index]), Table.RemoveColumns(_, {"Index"})}),
        Unpartitioned = Table.FromPartitions("Index", ToUnpartition, type number),
        Output = Unpartitioned
    in
        Output
Question: Does this qualify as a closure?
Question: Does it matter whether I use Table.FromPartitions or Table.Combine to recombine the tables? What's the difference?
Question: What does the Fast Data Load option do? When does it / doesn't it make a difference?
Question: Is there a performance benefit to specifying the types of function parameters (x as table, y as list, z as number, etc.)?
Question: I read in the documentation that let..in is syntactic sugar around records. I've begun to prefer records because the intermediate values remain available. Are there performance implications?
Question: Is there a difference between the number types, e.g. Int32.Type vs Int64.Type?
How large are the large .xlsx files? I agree with the idea that we're opening the file once per row. Given that .xlsx is an archive format and each sheet is one big file inside it, seeking within the file is going to be slow.
Especially if the total is less than half your RAM and you're running 64-bit Office, you may improve Power Query performance dramatically by calling Table.Buffer on the tables coming from .xlsx.
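For example, a minimal sketch (the path and table name are assumptions):

    let
        Source   = Excel.Workbook(File.Contents("C:\data\fcst.xlsx")),
        // Buffer the extracted table once so downstream steps don't re-read the archive.
        Buffered = Table.Buffer(Source{[Item = "report_data", Kind = "Table"]}[Data])
    in
        Buffered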
Alternatively, if you can somehow convert the .xlsx data to a CSV source, you won't pay the price to uncrack the .xlsx files each time. Or if you load the data into a SQL Server source with column indexes, that should speed up the query. (We "query fold" operations to SQL Server, which has more powerful performance heuristics than the query engine we've created in Power Query.) It's possible to use the Power Pivot engine instead, but I'm not familiar with that.
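A sketch of the CSV variant, assuming the data has already been converted to a file at a hypothetical path:

    let
        // Csv.Document streams a flat file instead of unzipping an .xlsx archive each time.
        Source   = Csv.Document(File.Contents("C:\data\fcst.csv"), [Delimiter = ",", Encoding = 65001]),
        Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
    in
        Promoted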