python - Fast reading of specified columns in a DataFrame using pandas.to_hdf
I have a 2 GB write-once, read-many DataFrame. Since I use it in pandas, I rely on pd.read_hdf and df.to_hdf; the fixed format works pretty well for both reading and writing.
However, the DataFrame keeps growing as more columns are added, so I would like to use the table format instead, which lets me select only the columns I need when reading the data. I thought this would give me a speed advantage, but testing suggests that is not the case.
This example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000000, 9), columns=list('abcdefghi'))
%time df.to_hdf("temp.h5", "temp", format="fixed", mode="w")
%time df.to_hdf("temp2.h5", "temp2", format="table", mode="w")
shows that writing the fixed format is faster (5.9 seconds vs 6.8 seconds on my machine).
Then reading the data (after a short break to make sure the files have been fully saved):
%time x = pd.read_hdf("temp.h5", "temp")
%time y = pd.read_hdf("temp2.h5", "temp2")
%time z = pd.read_hdf("temp2.h5", "temp2", columns=list("abc"))
yields:
Wall time: 420 ms (fixed)
Wall time: 557 ms (table)
Wall time: 671 ms (table, specified columns)
I understand that the fixed format is faster at reading data, but why is reading specified columns slower than reading the whole DataFrame? What is the benefit of using the table format (with or without specified columns) over the fixed format?
Is there perhaps a memory advantage as the DataFrame grows bigger?
IMO the main advantage of using format='table'
in conjunction with data_columns=[list_of_indexed_columns]
is the ability to conditionally (see the where="where clause"
parameter) read huge HDF5 files. You can filter the data while reading and process it in chunks to avoid a MemoryError.
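For example, a conditional read with indexed columns might look like the following sketch (file name, key, and chunk size are placeholders, not from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 3), columns=list('abc'))

# Index columns 'a' and 'b' so they can appear in a where clause.
df.to_hdf('indexed.h5', key='temp', format='table',
          data_columns=['a', 'b'], mode='w')

# Only rows matching the condition are read; the filter is applied
# inside PyTables, so the full frame never has to be loaded at once.
subset = pd.read_hdf('indexed.h5', 'temp', where='a > 0 & b < 0')

# Alternatively, iterate in chunks to keep memory usage bounded.
total_rows = 0
for chunk in pd.read_hdf('indexed.h5', 'temp', chunksize=10_000):
    total_rows += len(chunk)
```

Note that only columns listed in data_columns can be used in a where clause, and indexing them adds some write-time and file-size overhead.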
You can also try saving single columns or column groups (those that you usually read together) in different HDF files, or in the same file under different keys.
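A sketch of the same-file, different-keys variant (the key names 'group1' and 'group2' are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000, 6), columns=list('abcdef'))

# Write each column group under its own key in one HDF5 file:
# mode='w' creates the file, mode='a' appends the second key.
df[list('abc')].to_hdf('split.h5', key='group1', format='table', mode='w')
df[list('def')].to_hdf('split.h5', key='group2', format='table', mode='a')

# Later, load only the group that is actually needed.
part = pd.read_hdf('split.h5', 'group1')
```

Each key is read independently, so a query for one group never touches the bytes of the other.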
I'd also consider using the "cutting-edge" technology: feather-format.
Tests and timing:
import feather
Writing to disk in three formats (HDF5 fixed, HDF5 table, Feather):
df = pd.DataFrame(np.random.randn(10000000, 9), columns=list('abcdefghi'))
df.to_hdf('c:/temp/fixed.h5', 'temp', format='f', mode='w')
df.to_hdf('c:/temp/tab.h5', 'temp', format='t', mode='w')
feather.write_dataframe(df, 'c:/temp/df.feather')
Reading from disk:
In [122]: %timeit pd.read_hdf(r'c:\temp\fixed.h5', "temp")
1 loop, best of 3: 409 ms per loop

In [123]: %timeit pd.read_hdf(r'c:\temp\tab.h5', "temp")
1 loop, best of 3: 558 ms per loop

In [124]: %timeit pd.read_hdf(r'c:\temp\tab.h5', "temp", columns=list('bdf'))
The slowest run took 4.60 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 689 ms per loop

In [125]: %timeit feather.read_dataframe('c:/temp/df.feather')
The slowest run took 6.92 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 644 ms per loop

In [126]: %timeit feather.read_dataframe('c:/temp/df.feather', columns=list('bdf'))
1 loop, best of 3: 218 ms per loop  # winner!!!
PS: if you encounter the following error when using feather.write_dataframe(...)
:
FeatherError: Invalid: no support for strided data yet
here is a workaround:
df = df.copy()
After that, feather.write_dataframe(df, path)
should work properly...