python - Fast reading of specified columns in a DataFrame using pandas.to_hdf
I have a 2 GB write-once, read-many DataFrame. Since I use it in pandas, I rely on pd.read_hdf and df.to_hdf; the fixed format works pretty well for both reading and writing.
However, the DataFrame keeps growing as more columns are added, so I would like to use the table format instead, which lets me select only the columns I need when reading the data. I thought this would give me a speed advantage, but testing suggests that is not the case.
This example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000000, 9), columns=list('abcdefghi'))
%time df.to_hdf("temp.h5", "temp", format="fixed", mode="w")
%time df.to_hdf("temp2.h5", "temp2", format="table", mode="w")
shows that writing the fixed format is faster (5.9 seconds vs 6.8 seconds on my machine).
Then reading the data (after a short break to make sure the files have been fully saved):
%time x = pd.read_hdf("temp.h5", "temp")
%time y = pd.read_hdf("temp2.h5", "temp2")
%time z = pd.read_hdf("temp2.h5", "temp2", columns=list("abc"))
yields:
Wall time: 420 ms (fixed)
Wall time: 557 ms (table)
Wall time: 671 ms (table, specified columns)
I understand that the fixed format is faster at reading data, but why is reading specified columns slower than reading the whole DataFrame? What is the benefit of using the table format (with or without specified columns) over the fixed format?
Is there perhaps a memory advantage as the DataFrame grows bigger?
IMO the main advantage of using format='table'
in conjunction with data_columns=[list_of_indexed_columns]
is the ability to conditionally (see the where="where clause"
parameter) read huge HDF5 files. You can filter the data while reading and process it in chunks to avoid a MemoryError.
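For example, a conditional read with indexed columns might look like the following sketch (file name, key, and chunk size are placeholders, not from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 3), columns=list('abc'))

# Index columns 'a' and 'b' so they can appear in a where clause.
df.to_hdf('indexed.h5', key='temp', format='table',
          data_columns=['a', 'b'], mode='w')

# Only rows matching the condition are read; the filter is applied
# inside PyTables, so the full frame never has to be loaded at once.
subset = pd.read_hdf('indexed.h5', 'temp', where='a > 0 & b < 0')

# Alternatively, iterate in chunks to keep memory usage bounded.
total_rows = 0
for chunk in pd.read_hdf('indexed.h5', 'temp', chunksize=10_000):
    total_rows += len(chunk)
```

Note that only columns listed in data_columns can be used in a where clause, and indexing them adds some write-time and file-size overhead.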
You can also try saving single columns or column groups (those that you usually read together) in different HDF files, or in the same file under different keys.
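A sketch of the same-file, different-keys variant (the key names 'group1' and 'group2' are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000, 6), columns=list('abcdef'))

# Write each column group under its own key in one HDF5 file:
# mode='w' creates the file, mode='a' appends the second key.
df[list('abc')].to_hdf('split.h5', key='group1', format='table', mode='w')
df[list('def')].to_hdf('split.h5', key='group2', format='table', mode='a')

# Later, load only the group that is actually needed.
part = pd.read_hdf('split.h5', 'group1')
```

Each key is read independently, so a query for one group never touches the bytes of the other.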
I'd also consider using the "cutting-edge" technology: feather-format.
Tests and timing:
import feather
Writing to disk in three formats (HDF5 fixed, HDF5 table, Feather):
df = pd.DataFrame(np.random.randn(10000000, 9), columns=list('abcdefghi'))
df.to_hdf('c:/temp/fixed.h5', 'temp', format='f', mode='w')
df.to_hdf('c:/temp/tab.h5', 'temp', format='t', mode='w')
feather.write_dataframe(df, 'c:/temp/df.feather')
Reading from disk:
In [122]: %timeit pd.read_hdf(r'c:\temp\fixed.h5', "temp")
1 loop, best of 3: 409 ms per loop

In [123]: %timeit pd.read_hdf(r'c:\temp\tab.h5', "temp")
1 loop, best of 3: 558 ms per loop

In [124]: %timeit pd.read_hdf(r'c:\temp\tab.h5', "temp", columns=list('bdf'))
The slowest run took 4.60 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 689 ms per loop

In [125]: %timeit feather.read_dataframe('c:/temp/df.feather')
The slowest run took 6.92 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 644 ms per loop

In [126]: %timeit feather.read_dataframe('c:/temp/df.feather', columns=list('bdf'))
1 loop, best of 3: 218 ms per loop  # winner!!!
PS: if you encounter the following error when using feather.write_dataframe(...)
:
FeatherError: Invalid: no support for strided data yet
here is a workaround:
df = df.copy()
After that, feather.write_dataframe(df, path)
should work properly...