diff --git a/docs/howto-make-dlcs.md b/docs/howto-make-dlcs.md
new file mode 100644
index 0000000000000000000000000000000000000000..220633453ede9e65e5b594dc7a20dd3ec6dd47f7
--- /dev/null
+++ b/docs/howto-make-dlcs.md
@@ -0,0 +1,542 @@
+Auto-generation of Design Load Cases
+====================================
+
+
+<!---
+TODO, improvements:
+putty reference and instructions (fill in username in the address username@gorm
+how to mount gorm home on windows
+do as on Arch Linux wiki: top line is the file name where you need to add stuff
+point to the gorm/jess wiki's
+explain the difference in the paths seen from a windows computer and the cluster
+-->
+
+
+Introduction
+------------
+
+For the auto-generation of load cases and their corresponding execution on the
+cluster, the following steps take place:
+* Create an htc master file, and define the various tags in the exchange files
+(spreadsheets).
+* Generate the htc files for all the corresponding load cases based on the
+master file and the tags defined in the exchange files. Besides the HAWC2 htc
+input file, a corresponding pbs script is created that includes the instructions
+to execute the relevant HAWC2 simulation on a cluster node. This includes copying
+the model to the node scratch disc, executing HAWC2, and copying the results from
+the node scratch disc back to the network drive.
+* Submit all the load cases (or the pbs launch scripts) to the cluster queueing
+system. This is also referred to as launching the jobs.
+
+Important note regarding file names: on Linux, file names and paths are case
+sensitive, but on Windows they are not. Additionally, HAWC2 will always generate
+result and log files with lower case file names, regardless of the user input.
+Hence, in order to avoid ambiguities, make sure that there
+are no upper case symbols defined in the value of the following tags (as defined
+in the Excel spreadsheets): ```[Case folder]```, ```[Case id.]```, and
+```[Turb base name]```.
+
+The system will always force the values of these tags to be lower case anyway, and
+when working on Windows, this might cause some confusing and unexpected behaviour.
+The tag names themselves can have lower and upper case characters, as can be
+seen in the examples above.
+
+Notice that throughout the document ```$USER``` refers to your user name. You can
+either let the system fill that in for you (by using the variable ```$USER```),
+or explicitly use your user name instead. This user name is the same as your
+DTU account name (or student account/number).
+
+This document refers to commands to be entered in the terminal on Gorm when the
+line starts with ```g-000 $```. The command that needs to be entered starts
+after the ```$```.
+
+
+Connecting to the cluster
+-------------------------
+
+You connect to the cluster via an SSH terminal. SSH is supported out of the box
+for Linux and Mac OSX terminals (such as bash), but requires a separate
+terminal client under Windows. Windows users are advised to use PuTTY, which can
+be downloaded at:
+[http://www.chiark.greenend.org.uk/~sgtatham/putty/](http://www.chiark.greenend.org.uk/~sgtatham/putty/).
+Here's a random
+[tutorial](http://www.ghacks.net/2008/02/09/about-putty-and-tutorials-including-a-putty-tutorial/);
+you can use your favourite search engine if you need more or different instructions.
+More answers regarding PuTTY can also be found in the online
+[documentation](http://the.earth.li/~sgtatham/putty/latest/htmldoc/).
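+
+On Linux/Mac, an entry in your local ```~/.ssh/config``` file is an optional
+convenience that shortens the login command. This is a minimal sketch: the
+cluster address it uses is introduced in the next paragraph, and the user name
+is a hypothetical placeholder you should replace with your own DTU account name:
+
+```
+Host gorm
+    HostName gorm.risoe.dk
+    # replace abcd with your DTU user name
+    User abcd
+```
+
+With such an entry in place, ```ssh gorm``` is equivalent to the full command
+shown below.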
+
+The cluster that is set up for using the pre- and post-processing tools for HAWC2
+has the following address: ```gorm.risoe.dk```.
+
+On Linux/Mac, connecting to the cluster is as simple as running the following
+command in the terminal:
+
+```
+g-000 $ ssh $USER@gorm.risoe.dk
+```
+
+Use your DTU password when asked. This will give you terminal access to the
+cluster called Gorm.
+
+The cluster can only be reached when on the DTU network (wired, or wireless
+only from a DTU computer), when connected to the DTU VPN,
+or from one of the DTU [databars](http://www.databar.dtu.dk/).
+
+More information about the cluster can be found on the
+[Gorm-wiki](http://gorm.risoe.dk/gormwiki).
+
+
+Mounting the cluster discs
+--------------------------
+
+You need to be connected to the DTU network in order for this to work. You can
+also connect to the DTU network over VPN.
+
+When doing the HAWC2 simulations, you will interact regularly with the cluster
+file system and discs. It is convenient to map these discs as network
+drives (in Windows terms). Map the following network drives (replace ```$USER```
+with your user name):
+
+```
+\\mimer\hawc2sim
+\\gorm\$USER # this is your Gorm home directory
+```
+
+Alternatively, on Windows you can use [WinSCP](http://winscp.net) to interact
+with the cluster discs.
+
+Note that by default Windows Explorer will hide some of the files you will need
+to edit. In order to show all files on your Gorm home drive, you need to un-hide
+system files: Explorer > Organize > Folder and search options > select tab
+"View" > select the option to show hidden files and folders.
+
+From Linux/Mac, you should be able to mount using either of the following
+addresses:
+```
+//mimer.risoe.dk/hawc2sim
+//mimer.risoe.dk/well/hawc2sim
+//gorm.risoe.dk/$USER
+```
+You can use either ```sshfs``` or ```mount -t cifs``` to mount the discs.
+
+
+Preparation
+-----------
+
+Add the cluster-tools scripts to the PATH of your Gorm environment by editing
+the ```.bash_profile``` file in your Gorm home directory
+(```/home/$USER/.bash_profile```), and add the following lines (at the end,
+or create a new file with this file name in case it doesn't exist):
+
+```
+PATH=$PATH:/home/MET/STABCON/repositories/cluster-tools/
+export PATH
+```
+
+After modifying ```.bash_profile```, save and close it. Then, in the terminal, run the command:
+```
+g-000 $ source ~/.bash_profile
+```
+In order for any changes made in ```.bash_profile``` to take effect, you need to either ```source``` it (as shown above), or log out and in again.
+
+You will also need to configure wine and place the HAWC2 executables in a
+directory that wine knows about. First, initialize the correct wine environment
+by typing the following in a shell on Gorm (connect with ssh (Linux, Mac) or
+PuTTY (MS Windows)):
+
+```
+g-000 $ WINEARCH=win32 WINEPREFIX=~/.wine32 wine test.exe
+```
+
+Optionally, you can also make an alias (a short format for a longer, more complex
+command). In the ```.bashrc``` file in your home directory
+(```/home/$USER/.bashrc```), add at the bottom of the file:
+
+```
+alias wine32='WINEARCH=win32 WINEPREFIX=~/.wine32 wine'
+```
+
+Now copy all the HAWC2 executables and DLLs (including the license manager)
+to your wine directory. All the required executables, DLLs and
+the license manager are located at ```/home/MET/hawc2exe```.
+The following command will do this copying:
+
+```
+g-000 $ cp /home/MET/hawc2exe/* /home/$USER/.wine32/drive_c/windows/system32
+```
+
+Notice that the HAWC2 executable names are ```hawc2-latest.exe```,
+```hawc2-118.exe```, etc. By default the latest version will be used and the user
+does not need to specify this. However, when you need to compare different versions
+you can easily do so by specifying which case should be run with which
+executable. The file ```hawc2-latest.exe``` will always be the latest HAWC2
+version at ```/home/MET/hawc2exe/```. When a new HAWC2 version is released you
+can simply copy all the files from there again to update.
+
+Log out and in again from the cluster (close and restart PuTTY).
+
+At this stage you can run HAWC2 as follows:
+
+```
+g-000 $ wine32 hawc2-latest htc/some-input-file.htc
+```
+
+
+Method A: Generating htc input files on the cluster
+---------------------------------------------------
+
+Use ssh (Linux, Mac) or PuTTY (MS Windows) to connect to the cluster.
+
+With qsub-wrap.py the user can wrap a PBS launch script around any executable or
+Python/Matlab/... script. In doing so, the executable/script is
+immediately submitted to the cluster for execution. By default, the Anaconda
+Python environment in ```/home/MET/STABCON/miniconda``` will be activated. The
+Anaconda Python environment is not relevant, and can be safely ignored, if the
+executable does not have anything to do with Python.
+
+In order to see the different options of this qsub-wrap utility, do:
+
+```
+g-000 $ qsub-wrap.py --help
+```
+
+For example, in order to generate the default IEC DLCs:
+
+```
+g-000 $ cd path/to/HAWC2/model # folder where the hawc2 model is located
+g-000 $ qsub-wrap.py -f /home/MET/STABCON/repositories/prepost/dlctemplate.py -c python --prep
+```
+
+Note that the following folder structure for the HAWC2 model is assumed:
+
+```
+|-- control
+| |-- ...
+|-- data
+| |-- ...
+|-- htc
+| |-- DLCs
+| | |-- dlc12_iec61400-1ed3.xlsx
+| | |-- dlc13_iec61400-1ed3.xlsx
+| | |-- ...
+| |-- _master
+| | `-- dtu10mw_master_C0013.htc
+```
+
+The load case definitions should be placed in Excel spreadsheets with a
+```*.xlsx``` extension. The above example shows one possible scenario whereby
+all the load case definitions are placed in ```htc/DLCs``` (all folder names
+are case sensitive). Alternatively, one can also place the spreadsheets in
+separate sub folders, for example:
+
+```
+|-- control
+| |-- ...
+|-- data
+| |-- ...
+|-- htc
+| |-- dlc12_iec61400-1ed3
+| | |-- dlc12_iec61400-1ed3.xlsx
+| |-- dlc13_iec61400-1ed3
+| | |-- dlc13_iec61400-1ed3.xlsx
+```
+
+In order to use this auto-configuration mode, there can only be one master file
+in ```_master``` that contains ```_master_``` in its file name.
+
+For the NREL5MW and the DTU10MW HAWC2 models, you can find their respective
+master files and DLC definition spreadsheet files on Mimer. When connected
+to Gorm over SSH/PuTTY, you will find these files at:
+```
+/mnt/mimer/hawc2sim # (when on Gorm)
+```
+
+
+Method B: Generating htc input files interactively on the cluster
+-----------------------------------------------------------------
+
+Use ssh (Linux, Mac) or PuTTY (MS Windows) to connect to the cluster.
+
+This approach gives you more flexibility, but requires more commands, and is
+hence more involved than method A.
+
+First activate the Anaconda Python environment by typing:
+
+```bash
+# add the Anaconda Python environment paths to the system PATH
+g-000 $ export PATH=/home/MET/STABCON/miniconda/bin:$PATH
+# activate the custom python environment:
+g-000 $ source activate anaconda
+# add the Python libraries to the PYTHONPATH
+g-000 $ export PYTHONPATH=/home/MET/STABCON/repositories/prepost:$PYTHONPATH
+g-000 $ export PYTHONPATH=/home/MET/STABCON/repositories/pythontoolbox/fatigue_tools:$PYTHONPATH
+g-000 $ export PYTHONPATH=/home/MET/STABCON/repositories/pythontoolbox:$PYTHONPATH
+g-000 $ export PYTHONPATH=/home/MET/STABCON/repositories/MMPE:$PYTHONPATH
+```
+For example, to launch the auto-generation of the DLC input files:
+
+```
+g-000 $ cd path/to/HAWC2/model # folder where the hawc2 model is located
+g-000 $ python /home/MET/STABCON/repositories/prepost/dlctemplate.py --prep
+```
+
+Or start an interactive IPython shell:
+
+```
+g-000 $ ipython
+```
+
+Users should be aware that running computationally heavy loads on the login node
+is strictly discouraged. By overloading the login node, other users will
+experience slow login procedures, and the whole cluster could potentially be
+jammed.
+
+
+Method C: Generating htc input files locally
+--------------------------------------------
+
+This approach gives you total freedom, but is also more difficult since you
+will have to have a fully configured Python environment installed locally.
+Additionally, you need access to the cluster discs from your local workstation.
+Method C is not documented yet.
+
+
+Optional configuration
+----------------------
+
+Optional tags that can be set in the Excel spreadsheets, and their corresponding
+default values, are given below. Besides providing a replacement value in the
+master htc file, special actions are also connected to these tags. Consequently,
+these tags have to be present: when removed, the system will stop working
+properly.
+
+Relevant for the generation of the PBS launch scripts (```*.p``` files):
+* ```[walltime] = '04:00:00' (format: HH:MM:SS)```
+* ```[hawc2_exe] = 'hawc2-latest'```
+
+The following directories have to be defined, and their default values are used
+when they are not set explicitly in the spreadsheets:
+* ```[animation_dir] = 'animation/'```
+* ```[control_dir] = 'control/'```, all files and sub-folders copied to node
+* ```[data_dir] = 'data/'```, all files and sub-folders copied to node
+* ```[eigenfreq_dir] = False```
+* ```[htc_dir] = 'htc/'```
+* ```[log_dir] = 'logfiles/'```
+* ```[res_dir] = 'res/'```
+* ```[turb_dir] = 'turb/'```
+* ```[turb_db_dir] = '../turb/'```
+* ```[turb_base_name] = 'turb_'```
+
+Required, and used for the PBS output and post-processing:
+* ```[pbs_out_dir] = 'pbs_out/'```
+* ```[iter_dir] = 'iter/'```
+
+Optional:
+* ```[wake_dir] = False```
+* ```[wake_db_dir] = False```
+* ```[wake_base_name] = 'turb_'```
+* ```[meander_dir] = False```
+* ```[meand_db_dir] = False```
+* ```[meand_base_name] = 'turb_'```
+* ```[mooring_dir] = False```, all files and sub-folders copied to node
+* ```[hydro_dir] = False```, all files and sub-folders copied to node
+
+A zip file will be created which contains all files in the model root directory,
+and all the contents (files and folders) of the following directories:
+```[control_dir], [mooring_dir], [hydro_dir], 'externalforce/', [data_dir]```.
+This zip file will be extracted into the execution directory (```[run_dir]```).
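+
+To make this copy-run-copy-back workflow more concrete, the sketch below shows
+roughly what a generated pbs launch script does. This is an illustration only,
+not the exact output of the tools: the job name, the zip file name and the htc
+file name are hypothetical, and the generated ```*.p``` files in ```pbs_in/```
+are the actual reference.
+
+```bash
+#PBS -N dlc12_demo_case
+#PBS -l walltime=04:00:00
+#PBS -l nodes=1:ppn=1
+
+# create and enter the scratch directory of this job on the compute node
+mkdir -p /scratch/$USER/$PBS_JOBID
+cd /scratch/$USER/$PBS_JOBID
+
+# copy the model zip file from the network drive and extract it
+# (model.zip is a hypothetical name)
+cp $PBS_O_WORKDIR/model.zip . && unzip -o model.zip
+
+# run the simulation with the win32 wine environment configured earlier
+WINEARCH=win32 WINEPREFIX=~/.wine32 wine hawc2-latest.exe htc/dlc12_demo_case.htc
+
+# copy results and log files back to the network drive
+cp -R res/. $PBS_O_WORKDIR/res/
+cp -R logfiles/. $PBS_O_WORKDIR/logfiles/
+```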
+
+After the model has run on the node, only the files that have been created
+during simulation time in the ```[log_dir]```, ```[res_dir]```,
+```[animation_dir]```, and ```[eigenfreq_dir]``` will be copied back.
+Optionally, one can also copy back the turbulence files, and other explicitly
+defined files [TODO: expand manual here].
+
+
+Launching the jobs on the cluster
+---------------------------------
+
+Use ssh (Linux, Mac) or PuTTY (MS Windows) to connect to the cluster.
+
+The ```launch.py``` script is a generic tool that helps with launching an
+arbitrary number of pbs launch scripts on a PBS Torque cluster. Launch scripts
+here are defined as files with a ```.p``` extension. The script will look for any
+```.p``` files in a specified folder (```pbs_in/``` by default, which the user
+can change using the ```-p``` or ```--path_pbs_files``` flag) and save them in a
+file list called ```pbs_in_file_cache.txt```. When using the option ```-c``` or
+```--cache```, the script will not look for pbs files, but instead read them
+directly from the ```pbs_in_file_cache.txt``` file.
+
+The launch script has a simple built-in scheduler that has been successfully
+used to launch 50,000 jobs. This scheduler is configured by two parameters: the
+number of cpus requested (using ```-n``` or ```--nr_cpus```) and the minimum
+number of cpus that should remain free on the cluster (using ```--cpu_free```,
+48 by default). Jobs will be launched after a predefined sleep time (as set by
+the ```--tsleep``` option, and set to 5 seconds by default). After the initial
+sleep time a new job will be launched every 0.1 second. If the launch condition
+is not met (```nr_cpus > cpus used by user AND cpus free on cluster > cpu_free```),
+the program will wait 5 seconds before trying to launch a new job again.
+
+Depending on the number of jobs and the required computation time, it could
+take a while before all jobs are launched. When running the launch script from
+the login node, this might be a problem if you have to close your ssh/putty
+session before all jobs are launched. In that case the user should use a
+dedicated compute node for launching jobs. To run the launch script on a
+compute node instead of the login node, use the ```--node``` option. You can
+inspect the progress in the ```launch_scheduler_log.txt``` file.
+
+The ```launch.py``` script has some different options, and you can read about
+them by using the help function (the output is included for your convenience):
+
+```bash
+g-000 $ launch.py --help
+
+usage: launch.py -n nr_cpus
+
+options:
+  -h, --help            show this help message and exit
+  --depend              Switch on for launch depend method
+  -n NR_CPUS, --nr_cpus=NR_CPUS
+                        number of cpus to be used
+  -p PATH_PBS_FILES, --path_pbs_files=PATH_PBS_FILES
+                        optionally specify location of pbs files
+  --re=SEARCH_CRIT_RE   regular expression search criterium applied on the
+                        full pbs file path. Escape backslashes! By default it
+                        will select all *.p files in pbs_in/.
+  --dry                 dry run: do not alter pbs files, do not launch
+  --tsleep=TSLEEP       Sleep time [s] after qsub command. Default=5 seconds
+  --logfile=LOGFILE     Save output to file.
+  -c, --cache           If on, files are read from cache
+  --cpu_free=CPU_FREE   No more jobs will be launched when the cluster does
+                        not have the specified amount of cpus free. This will
+                        make sure there is room for others on the cluster, but
+                        might mean less cpus available for you. Default=48.
+  --qsub_cmd=QSUB_CMD   Is set automatically by --node flag
+  --node                If executed on dedicated node.
+```
+
+Then launch the actual jobs (each job is a ```*.p``` file in ```pbs_in```) using
+100 cpus, and using a compute node instead of the login node (so you can exit
+the ssh/putty session without interrupting the launching process):
+
+```bash
+g-000 $ cd path/to/HAWC2/model
+g-000 $ launch.py -n 100 --node
+```
+
+
+Inspecting running jobs
+-----------------------
+
+There are a few tools you can use from the command line to see what is going on
+on the cluster: how many nodes are free, how many nodes you use as a user, etc.
+
+* ```cluster-status.py``` overview dashboard of the cluster: nodes free, running,
+length of the queue, etc.
+* ```qstat -u $USER``` list all the running and queued jobs of the user
+* ```nnsqdel $USER all``` delete all the jobs of the user
+* ```qdel_range JOBID_FROM JOBID_TIL``` delete a range of job id's
+
+Notice that the pbs output files in ```pbs_out``` are only created when the job
+has ended (or failed). When you want to inspect a running job, you can ssh from
+the Gorm login node to the node that runs the job. First, find the job id by
+listing all your current jobs (```qstat -u $USER```). The job id can be found in
+the first column, and you only need to consider the number, not the domain name
+attached to it. Now find out on which node it runs (replace 123456 with the
+relevant job id):
+```
+g-000 $ qstat -f 123456 | grep exec_host
+```
+
+From here you can log into the node as follows (replace g-078 with the relevant
+node):
+```
+g-000 $ ssh g-078
+```
+
+And browse to the scratch directory, which lands you in the root directory of
+your running HAWC2 model (replace 123456 with the relevant job id):
+```
+g-000 $ cd /scratch/$USER/123456.g-000.risoe.dk
+```
+
+
+Re-launching failed jobs
+------------------------
+
+In case you want to re-launch only a subset of a previously generated set of
+load cases, there are several methods:
+
+1. Copy the PBS launch scripts (they have the ```*.p``` extension and can be
+found in the ```pbs_in``` folder) of the failed cases to a new folder (for
+example ```pbs_in_failed```). Now run ```launch.py``` again, but instead point
+to the folder that contains the ```*.p``` files of the failed cases, for example:
+```
+g-000 $ launch.py -n 100 --node -p pbs_in_failed
+```
+
+2. Use the ```--cache``` option, and edit the PBS file list in the file
+```pbs_in_file_cache.txt``` so that only the simulations remain that have to be
+run again. Note that the ```pbs_in_file_cache.txt``` file is created every time
+you run ```launch.py```. You can use the option ```--dry``` to make a practice
+launch run: it will create a ```pbs_in_file_cache.txt``` file, but not launch a
+single job.
+
+3. Each pbs file can also be launched manually (see the sketch after this list)
+as follows:
+```
+g-000 $ qsub path/to/pbs_file.p
+```
+
+Alternatively, one can use the following options in ```launch.py```:
+
+* ```-p some/other/folder```: specify from which folder the pbs files should be taken
+* ```--re=SEARCH_CRIT_RE```: advanced filtering based on the pbs file names. It
+requires some notion of regular expressions (some random tutorials:
+[1](http://www.codeproject.com/Articles/9099/The-Minute-Regex-Tutorial),
+[2](http://regexone.com/))
+    * ```launch.py -n 10 --re=.SOMENAME.``` will launch all pbs files that
+    contain ```SOMENAME```. Notice the leading and trailing dot: in a regular
+    expression a dot matches any single character (the regex equivalent of the
+    bash wild card ```*``` is ```.*```).
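+
+As referenced in method 3 above, manual submission can also be combined with a
+simple shell loop. This is a minimal sketch, assuming the failed cases were
+collected in a hypothetical ```pbs_in_failed``` folder (as in method 1):
+
+```bash
+# submit every pbs launch script in pbs_in_failed one by one
+for f in pbs_in_failed/*.p; do
+    qsub "$f"
+done
+```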
+
+
+Post-processing
+---------------
+
+The post-processing happens through the same script as used for generating the
+htc files, but now we set different flags. For example, to check the log
+files and to calculate the statistics, the AEP and the lifetime equivalent loads:
+
+```
+g-000 $ qsub-wrap.py -f /home/MET/STABCON/repositories/prepost/dlctemplate.py -c python --years=25 --neq=1e7 --stats --check_logs --fatigue
+```
+
+Other options for the ```dlctemplate.py``` script:
+
+```
+usage: dlctemplate.py [-h] [--prep] [--check_logs] [--stats] [--fatigue]
+                      [--csv] [--years YEARS] [--no_bins NO_BINS] [--neq NEQ]
+
+pre- or post-processes DLC's
+
+optional arguments:
+  -h, --help         show this help message and exit
+  --prep             create htc, pbs, files (default=False)
+  --check_logs       check the log files (default=False)
+  --stats            calculate statistics (default=False)
+  --fatigue          calculate Leq for a full DLC (default=False)
+  --csv              Save data also as csv file (default=False)
+  --years YEARS      Total life time in years (default=20)
+  --no_bins NO_BINS  Number of bins for fatigue loads (default=46)
+  --neq NEQ          Equivalent cycles neq (default=1e6)
+```
+
+Debugging
+---------
+
+Any output (everything that involves print statements) generated during the
+post-processing of the simulations using ```dlctemplate.py``` is captured in
+the ```pbs_out/qsub-wrap_dlctemplate.py.out``` file, while exceptions and errors
+are redirected to the ```pbs_out/qsub-wrap_dlctemplate.py.err``` text file.
+
+The output and errors of the HAWC2 simulations can also be found in the ```pbs_out```
+directory. The ```.err``` and ```.out``` files will be named exactly the same
+as the ```.htc``` input files and the ```.sel```/```.dat``` output files.
+
diff --git a/docs/using-statistics-df.md b/docs/using-statistics-df.md
new file mode 100644
index 0000000000000000000000000000000000000000..aa78af08fc46afc3a1dce6da912be074f32a54c0
--- /dev/null
+++ b/docs/using-statistics-df.md
@@ -0,0 +1,179 @@
+How to use the Statistics DataFrame
+===================================
+
+
+Introduction
+------------
+
+The statistical data of your post-processed load cases are saved in the HDF5
+format. You can use Pandas to retrieve and organize that data. Pandas organizes
+the data in a DataFrame; the library is powerful and comprehensive, but requires
+some learning. There are extensive resources out in the wild that will help
+you get started:
+
+* A [list](http://pandas.pydata.org/pandas-docs/stable/tutorials.html)
+of good tutorials can be found in the Pandas
+[documentation](http://pandas.pydata.org/pandas-docs/version/0.16.2/).
+* A short and simple
+[tutorial](https://github.com/DTUWindEnergy/Python4WindEnergy/blob/master/lesson%204/pandas.ipynb)
+as used for the Python 4 Wind Energy course.
+
+
+The data is organized in a simple 2-dimensional table. However, since the
+statistics of each channel are included for multiple simulations, the data set
+is actually 3-dimensional.
+As an example, this is how a table could look like:
+
+```
+ [case_id]  [channel name]  [mean]  [std]  [windspeed]
+ sim_1      pitch            0       1      8
+ sim_1      rpm              1       7      8
+ sim_2      pitch            2       9      9
+ sim_2      rpm              3       2      9
+ sim_3      pitch            0       1      7
+```
+
+Each row is a channel of a certain simulation, and the columns represent the
+following:
+
+* a tag from the master file and the corresponding value for the given simulation
+* the channel name, description, units and unique identifier
+* the statistical parameters of the given channel
+
+
+Load the statistics as a pandas DataFrame
+-----------------------------------------
+
+Pandas has some very powerful functions that will help with analysing large and
+complex DataFrames. The documentation is extensive and is supplemented with
+various tutorials. You can use
+[10 Minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)
+as a first introduction.
+
+Loading the pandas DataFrame table works as follows:
+
+```python
+import pandas as pd
+df = pd.read_hdf('path/to/file/name.h5', 'table')
+```
+
+Some tips for inspecting the data:
+
+```python
+import numpy as np
+
+# check the available data columns:
+for colname in sorted(df.keys()):
+    print(colname)
+
+# list all available channels:
+print(np.sort(df['channel'].unique()))
+
+# list the different load cases
+df['[Case folder]'].unique()
+```
+
+
+Reduce memory footprint using categoricals
+------------------------------------------
+
+When the DataFrame is consuming too much memory, you can try to reduce its
+size by using categoricals. A more extensive introduction to categoricals can be
+found
+[here](http://pandas.pydata.org/pandas-docs/stable/faq.html#dataframe-memory-usage)
+and [here](http://matthewrocklin.com/blog/work/2015/06/18/Categoricals/).
+The basic idea is to replace each string value with an integer, and keep a
+separate index that maps each integer back to the string value. This trick only
+works when you have long strings that occur multiple times throughout your
+data set.
+
+The following example shows how you can use categoricals to reduce the memory
+usage of a pandas DataFrame:
+
+```python
+# load a certain DataFrame
+df = pd.read_hdf(fname, 'table')
+# return the total estimated memory usage
+print('%10.02f MB' % (df.memory_usage(index=True).sum()/1024.0/1024.0))
+# the data type of a column that contains strings is called object
+# convert objects to categories to reduce memory consumption
+for column_name, column_dtype in df.dtypes.iteritems():
+    # applying categoricals mostly makes sense for objects, we ignore all others
+    if column_dtype.name == 'object':
+        df[column_name] = df[column_name].astype('category')
+print('%10.02f MB' % (df.memory_usage(index=True).sum()/1024.0/1024.0))
+```
+
+Python has a garbage collector working in the background that deletes
+un-referenced objects. In some cases it might help to actively trigger the
+garbage collector as follows, in an attempt to free up memory during a run of
+a script that is almost flooding the memory:
+
+```python
+import gc
+gc.collect()
+```
+
+Load a DataFrame that is too big for memory in chunks
+-----------------------------------------------------
+
+When a DataFrame is too big to load into memory at once, and you have already
+compressed your data using categoricals (as explained above), you can read
+the DataFrame one chunk at a time. A chunk is a selection of rows. For
+example, you can read 1000 rows at a time by setting ```chunksize=1000```
+when calling ```pd.read_hdf()```.
+For example:
+
+```python
+# load a large DataFrame in chunks of 1000 rows
+for df_chunk in pd.read_hdf(fname, 'table', chunksize=1000):
+    print('DataFrame chunk contains %i rows' % (len(df_chunk)))
+```
+
+We will read a large DataFrame as chunks into memory, and select only those
+rows that belong to dlc12:
+
+```python
+# only select one DLC, and place it in one DataFrame. If the data
+# containing one DLC is still too big for memory, this approach will fail
+
+# create an empty DataFrame, here we collect the results we want
+df_dlc12 = pd.DataFrame()
+for df_chunk in pd.read_hdf(fname, 'table', chunksize=1000):
+    # organize the chunk: all rows for which [Case folder] is the same
+    # in a single group. Each group is now a DataFrame for which
+    # [Case folder] has the same value.
+    for group_name, group_df in df_chunk.groupby(df_chunk['[Case folder]']):
+        # if we have the group with dlc12, save it for later
+        if group_name == 'dlc12_iec61400-1ed3':
+            df_dlc12 = pd.concat([df_dlc12, group_df])
+```
+
+Plot wind speed vs rotor speed
+------------------------------
+
+```python
+# select the channels of interest
+for group_name, group_df in df_dlc12.groupby(df_dlc12['channel']):
+    # iterate over all channel groups, but only do something with the channels
+    # we are interested in
+    if group_name == 'Omega':
+        # we save the case_id tag and the mean value of channel Omega
+        df_rpm = group_df[['[case_id]', 'mean']].copy()
+        # note we made a copy because we will change the DataFrame in the next line
+        # rename the column mean to something more useful
+        df_rpm.rename(columns={'mean': 'Omega-mean'}, inplace=True)
+    elif group_name == 'windspeed-global-Vy-0.00-0.00--127.00':
+        # we save the case_id tag, the mean value of the wind channel, and the
+        # value of the Windspeed tag
+        df_wind = group_df[['[case_id]', 'mean', '[Windspeed]']].copy()
+        # note we made a copy because we will change the DataFrame in the next line
+        # rename the mean of the wind channel to something more useful
+        df_wind.rename(columns={'mean': 'wind-mean'}, inplace=True)
+
+# join both results on the case_id value so the mean RPM and mean wind speed
+# are referring to the same simulation/case_id
+df_res = pd.merge(df_wind, df_rpm, on='[case_id]', how='inner')
+
+# now we can plot RPM vs wind speed
+from matplotlib import pyplot as plt
+plt.plot(df_res['wind-mean'].values, df_res['Omega-mean'].values, '*')
+```
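+
+To finish the figure, the standard matplotlib calls below add axis labels and
+display the plot. This is a minimal sketch: the label texts are assumptions,
+and the units depend on the channel definitions of your own model.
+
+```python
+# label the axes (the units of Omega depend on your model's channels)
+plt.xlabel('mean wind speed [m/s]')
+plt.ylabel('mean rotor speed Omega')
+plt.grid(True)
+plt.show()
+```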