How to use the Statistics DataFrame =================================== Introduction ------------ The statistical data of your post-processed load cases are saved in the HDF format. You can use Pandas to retrieve and organize that data. Pandas organizes the data in a DataFrame, and the library is powerful, comprehensive and requires some learning. There are extensive resources out in the wild that can will help you getting started: * [list](http://pandas.pydata.org/pandas-docs/stable/tutorials.html) of good tutorials can be found in the Pandas [documentation](http://pandas.pydata.org/pandas-docs/version/0.16.2/). * short and simple [tutorial](https://github.com/DTUWindEnergy/Python4WindEnergy/blob/master/lesson%204/pandas.ipynb) as used for the Python 4 Wind Energy course The data is organized in simple 2-dimensional table. However, since the statistics of each channel is included for multiple simulations, the data set is actually 3-dimensional. As an example, this is how a table could like: ``` [case_id] [channel name] [mean] [std] [windspeed] sim_1 pitch 0 1 8 sim_1 rpm 1 7 8 sim_2 pitch 2 9 9 sim_2 rpm 3 2 9 sim_3 pitch 0 1 7 ``` Each row is a channel of a certain simulation, and the columns represent the following: * a tag from the master file and the corresponding value for the given simulation * the channel name, description, units and unique identifier * the statistical parameters of the given channel Load the statistics as a pandas DataFrame ----------------------------------------- Pandas has some very powerful functions that will help analysing large and complex DataFrames. The documentation is extensive and is supplemented with various tutorials. You can use [10 Minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html) as a first introduction. Loading the pandas DataFrame table works as follows: ```python import pandas as pd df = pd.read_hdf('path/to/file/name.h5', 'table') ``` Some tips for inspecting the data: ```python import numpy as np # Check the available data columns: for colname in sorted(df.keys()): print colname # list all available channels: print np.sort(df['channel'].unique()) # list the different load cases df['[Case folder]'].unique() ``` Reduce memory footprint using categoricals ------------------------------------------ When the DataFrame is consuming too much memory, you can try to reduce its size by using categoricals. A more extensive introduction to categoricals can be found [here](http://pandas.pydata.org/pandas-docs/stable/faq.html#dataframe-memory-usage) and [here](http://matthewrocklin.com/blog/work/2015/06/18/Categoricals/). The basic idea is to replace all string values with an integer, and have an index that maps the string value to the index. This trick only works when you have long strings that occur multiple times throughout your data set. The following example shows how you can use categoricals to reduce the memory usage of a pandas DataFrame: ```python # load a certain DataFrame df = pd.read_hdf(fname, 'table') # Return the total estimated memory usage print '%10.02f MB' % (df.memory_usage(index=True).sum()/1024.0/1024.0) # the data type of column that contains strings is called object # convert objects to categories to reduce memory consumption for column_name, column_dtype in df.dtypes.iteritems(): # applying categoricals mostly makes sense for objects, we ignore all others if column_dtype.name == 'object': df[column_name] = df[column_name].astype('category') print '%10.02f MB' % (df.memory_usage(index=True).sum()/1024.0/1024.0) ``` Python has a garbage collector working in the background that deletes un-referenced objects. In some cases it might help to actively trigger the garbage collector as follows, in an attempt to free up memory during a run of a script that is almost flooding the memory: ```python import gc gc.collect() ``` Load a DataFrame that is too big for memory in chunks ----------------------------------------------------- When a DataFrame is too big to load into memory at once, and you already compressed your data using categoricals (as explained above), you can read the DataFrame one chunk at the time. A chunk is a selection of rows. For example, you can read 1000 rows at the time by setting ```chunksize=1000``` when calling ```pd.read_hdf()```. For example: ```python # load a large DataFrame in chunks of 1000 rows for df_chunk in pd.read_hdf(fname, 'table', chunksize=1000): print 'DataFrame chunk contains %i rows' % (len(df_chunk)) ``` We will read a large DataFrame as chunks into memory, and select only those rows who belong to dlc12: ```python # only select one DLC, and place them in one DataFrame. If the data # containing one DLC is still to big for memory, this approach will fail # create an empty DataFrame, here we collect the results we want df_dlc12 = pd.DataFrame() for df_chunk in pd.read_hdf(fname, 'table', chunksize=1000): # organize the chunk: all rows for which [Case folder] is the same # in a single group. Each group is now a DataFrame for which # [Case folder] has the same value. for group_name, group_df in df_chunk.groupby(df_chunk['[Case folder]']): # if we have the group with dlc12, save them for later if group_name == 'dlc12_iec61400-1ed3': df_dlc12 = pd.concat([df_dlc12, group_df])#, ignore_index=True) ``` Plot wind speed vs rotor speed ------------------------------ ```python # select the channels of interest for group_name, group_df in df_dlc12.groupby(df_dlc12['channel']): # iterate over all channel groups, but only do something with the channels # we are interested in if group_name == 'Omega': # we save the case_id tag, the mean value of channel Omega df_rpm = group_df[['[case_id]', 'mean']].copy() # note we made a copy because we will change the DataFrame in the next line # rename the column mean to something more useful df_rpm.rename(columns={'mean': 'Omega-mean'}, inplace=True) elif group_name == 'windspeed-global-Vy-0.00-0.00--127.00': # we save the case_id tag, the mean value of channel wind, and the # value of the Windspeed tag df_wind = group_df[['[case_id]', 'mean', '[Windspeed]']].copy() # note we made a copy because we will change the DataFrame in the next line # rename the mean of the wind channel to something more useful df_wind.rename(columns={'mean': 'wind-mean'}, inplace=True) # join both results on the case_id value so the mean RPM and mean wind speed # are referring to the same simulation/case_id. df_res = pd.merge(df_wind, df_rpm, on='[case_id]', how='inner') # now we can plot RPM vs wind speed from matplotlib import pyplot as plt plt.plot(df_res['wind-mean'].values, df_res['Omega-mean'].values, '*') ```