From 4da858d0f2e187d52575cd15fb42995ef64dd3a6 Mon Sep 17 00:00:00 2001
From: dave <dave@dtu.dk>
Date: Fri, 15 Jan 2016 10:03:32 +0100
Subject: [PATCH] add documentation on how to use the prepost DLC generator and
 cluster

---
 docs/howto-make-dlcs.md     | 542 ++++++++++++++++++++++++++++++++++++
 docs/using-statistics-df.md | 179 ++++++++++++
 2 files changed, 721 insertions(+)
 create mode 100644 docs/howto-make-dlcs.md
 create mode 100644 docs/using-statistics-df.md

diff --git a/docs/howto-make-dlcs.md b/docs/howto-make-dlcs.md
new file mode 100644
index 00000000..22063345
--- /dev/null
+++ b/docs/howto-make-dlcs.md
@@ -0,0 +1,542 @@
+Auto-generation of Design Load Cases
+====================================
+
+
+<!---
+TODO, improvements:
+putty reference and instructions (fill in username in the address username@gorm
+how to mount gorm home on windows
+do as on Arch Linux wiki: top line is the file name where you need to add stuff
+point to the gorm/jess wiki's
+explain the difference in the paths seen from a windows computer and the cluster
+-->
+
+
+Introduction
+------------
+
+For the automatic generation of load cases and their corresponding execution on
+the cluster, the following steps take place:
+* Create an htc master file, and define the various tags in the exchange files
+(spreadsheets).
+* Generate the htc files for all the corresponding load cases based on the
+master file and the tags defined in the exchange files. Besides the HAWC2 htc
+input file, a corresponding pbs script is created that includes the instructions
+to execute the relevant HAWC2 simulation on a cluster node. This includes copying
+the model to the node scratch disc, executing HAWC2, and copying the results from
+the node scratch disc back to the network drive.
+* Submit all the load cases (or the pbs launch scripts) to the cluster queueing
+system. This is also referred to as launching the jobs.
+
+Important note regarding file names. On Linux, file names and paths are case
+sensitive, but on Windows they are not. Additionally, HAWC2 will always generate
+result and log files with lower case file names, regardless of the user input.
+Hence, in order to avoid possible ambiguities at all times, make sure that there
+are no upper case symbols defined in the value of the following tags (as defined
+in the Excel spreadsheets): ```[Case folder]```,  ```[Case id.]```, and
+```[Turb base name]```.
+
+The system will always force the values of the tags to be lower case anyway, and
+when working on Windows, this might cause some confusing and unexpected behaviour.
+The tags themselves can contain both lower and upper case characters, as can be
+seen in the tags listed above.
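+
+As an illustration, hypothetical tag values (not taken from any shipped
+spreadsheet) could look like this:
+
+```
+[Case folder]    = dlc12_iec61400-1ed3   # ok: all lower case
+[Case id.]       = dlc12_wsp10_s1001     # ok: all lower case
+[Turb base name] = Turb_s1001            # avoid: contains an upper case character
+```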
+
+Notice that throughout this document ```$USER``` refers to your user name. You can
+either let the system fill that in for you (by using the variable ```$USER```),
+or explicitly use your user name instead. This user name is the same as your
+DTU account name (or student account/number).
+
+This document refers to commands to be entered in the terminal on Gorm when the
+line starts with ```g-000 $```. The command that needs to be entered starts
+after the ```$```.
+
+
+Connecting to the cluster
+-------------------------
+
+You connect to the cluster via an SSH terminal. SSH is supported out of the box
+for Linux and Mac OSX terminals (such as bash), but requires a separate
+terminal client under Windows. Windows users are advised to use PuTTY, which can
+be downloaded at:
+[http://www.chiark.greenend.org.uk/~sgtatham/putty/](http://www.chiark.greenend.org.uk/~sgtatham/putty/).
+Here is a
+[tutorial](http://www.ghacks.net/2008/02/09/about-putty-and-tutorials-including-a-putty-tutorial/);
+use your favourite search engine if you need more or different instructions.
+More answers regarding PuTTY can also be found in the online
+[documentation](http://the.earth.li/~sgtatham/putty/latest/htmldoc/).
+
+The cluster that is set up for using the pre- and post-processing tools for HAWC2
+has the following address: ```gorm.risoe.dk```.
+
+On Linux/Mac, connecting to the cluster is as simple as running the following
+command in your local terminal:
+
+```
+ssh $USER@gorm.risoe.dk
+```
+
+Use your DTU password when asked. This will give you terminal access to the
+cluster called Gorm.
+
+The cluster can only be reached when on the DTU network (wired, or wirelessly
+from a DTU computer only), when connected to the DTU VPN, or from one of the
+DTU [databars](http://www.databar.dtu.dk/).
+
+More information about the cluster can be found on the
+[Gorm-wiki](http://gorm.risoe.dk/gormwiki)
+
+
+Mounting the cluster discs
+--------------------------
+
+You need to be connected to the DTU network in order for this to work. You can
+also connect to the DTU network over VPN.
+
+When doing the HAWC2 simulations, you will interact regularly with the cluster
+file system and discs. It is convenient to map these discs as network
+drives (in Windows terms). Map the following network drives (replace ```$USER```
+with your user name):
+
+```
+\\mimer\hawc2sim
+\\gorm\$USER # this is your Gorm home directory
+```
+
+Alternatively, on Windows you can use [WinSCP](http://winscp.net) to interact
+with the cluster discs.
+
+Note that by default Windows Explorer will hide some of the files you will need to edit.
+In order to show all files on your Gorm home drive, you need to un-hide system files:
+Explorer > Organize > Folder and search options > select tab "view" > select the option to show hidden files and folders.
+
+From Linux/Mac, you should be able to mount using either of the following
+addresses:
+```
+//mimer.risoe.dk/hawc2sim
+//mimer.risoe.dk/well/hawc2sim
+//gorm.risoe.dk/$USER
+```
+You can use either ```sshfs``` or ```mount -t cifs``` to mount the discs.
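+
+For example, a minimal sketch of how mounting could look from a Linux/Mac
+terminal (mount points and options are only illustrative and may need to be
+adapted to your system):
+
+```
+# sshfs: mount your Gorm home directory onto a local folder
+mkdir -p ~/gorm_home
+sshfs $USER@gorm.risoe.dk:/home/$USER ~/gorm_home
+
+# cifs: mount the hawc2sim share (may require root and the cifs utilities)
+sudo mkdir -p /mnt/hawc2sim
+sudo mount -t cifs //mimer.risoe.dk/hawc2sim /mnt/hawc2sim -o username=$USER
+```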
+
+
+Preparation
+-----------
+
+Add the cluster-tools scripts to the PATH of your Gorm environment by editing
+the ```.bash_profile``` file in your Gorm home directory
+(```/home/$USER/.bash_profile```). Add the following lines at the end (or create
+a new file with this name in case it doesn't exist):
+
+```
+PATH=$PATH:/home/MET/STABCON/repositories/cluster-tools/
+export PATH
+```
+
+After modifying ```.bash_profile```, save and close it. Then, in the terminal, run the command:
+```
+g-000 $ source ~/.bash_profile
+```
+In order for any changes made in ```.bash_profile``` to take effect, you need to either ```source``` it (as shown above), or log out and in again.  
+
+You will also need to configure wine and place the HAWC2 executables in a
+directory that wine knows about. First, set up the correct wine environment by
+typing the following in a shell in your Gorm home directory (connect with
+ssh (Linux, Mac) or PuTTY (MS Windows)):
+
+```
+g-000 $ WINEARCH=win32 WINEPREFIX=~/.wine32 wine test.exe
+```
+
+Optionally, you can also make an alias (a short format for a longer, more complex
+command). In the ```.bashrc``` file in your home directory
+(```/home/$USER/.bashrc```), add at the bottom of the file:
+
+```
+alias wine32='WINEARCH=win32 WINEPREFIX=~/.wine32 wine'
+```
+
+Now copy all the HAWC2 executables and DLL's (including the license manager)
+to your wine directory. All the required executables, DLL's and the license
+manager are located at ```/home/MET/hawc2exe```. The following command will do
+this copying:
+
+```
+g-000 $ cp /home/MET/hawc2exe/* /home/$USER/.wine32/drive_c/windows/system32
+```
+
+Notice that the HAWC2 executable names are ```hawc2-latest.exe```,
+```hawc2-118.exe```, etc. By default the latest version will be used and the user
+does not need to specify this. However, when you need to compare different
+versions you can easily do so by specifying which executable should be used for
+which case, as shown in the example below. The file ```hawc2-latest.exe``` will
+always be the latest HAWC2 version at ```/home/MET/hawc2exe/```. When a new HAWC2
+is released you can simply copy all the files from there again to update.
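+
+For example, a case could be assigned to a specific executable by setting the
+```[hawc2_exe]``` tag in the spreadsheet (the version number below is only an
+illustration):
+
+```
+[hawc2_exe] = 'hawc2-118'
+```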
+
+Log out and in again from the cluster (close and restart PuTTY).
+
+At this stage you can run HAWC2 as follows:
+
+```
+g-000 $ wine32 hawc2-latest htc/some-input-file.htc
+```
+
+
+Method A: Generating htc input files on the cluster
+---------------------------------------------------
+
+Use ssh (Linux, Mac) or putty (MS Windows) to connect to the cluster.
+
+With qsub-wrap.py the user can wrap a PBS launch script around any executable or
+Python/Matlab/... script. In doing so, the executable/Python script will be
+immediately submitted to the cluster for execution. By default, the Anaconda
+Python environment in ```/home/MET/STABCON/miniconda``` will be activated. The
+Anaconda Python environment is not relevant, and can be safely ignored, if the
+executable does not have anything to do with Python.
+
+In order to see the different options of this qsub-wrap utility, do:
+
+```
+g-000 $ qsub-wrap.py --help
+```
+
+For example, in order to generate the default IEC DLCs:
+
+```
+g-000 $ cd path/to/HAWC2/model # folder where the hawc2 model is located
+g-000 $ qsub-wrap.py -f /home/MET/STABCON/repositories/prepost/dlctemplate.py -c python --prep
+```
+
+Note that the following folder structure for the HAWC2 model is assumed:
+
+```
+|-- control
+|   |-- ...
+|-- data
+|   |-- ...
+|-- htc
+|   |-- DLCs
+|   |   |-- dlc12_iec61400-1ed3.xlsx
+|   |   |-- dlc13_iec61400-1ed3.xlsx
+|   |   |-- ...
+|   |-- _master
+|   |   `-- dtu10mw_master_C0013.htc
+```
+
+The load case definitions should be placed in Excel spreadsheets with a
+```*.xlsx``` extension. The above example shows one possible scenario whereby
+all the load case definitions are placed in ```htc/DLCs``` (all folder names
+are case sensitive). Alternatively, one can also place the spreadsheets in
+separate sub folders, for example:
+
+```
+|-- control
+|   |-- ...
+|-- data
+|   |-- ...
+|-- htc
+|   |-- dlc12_iec61400-1ed3
+|   |   |-- dlc12_iec61400-1ed3.xlsx
+|   |-- dlc13_iec61400-1ed3
+|   |   |-- dlc13_iec61400-1ed3.xlsx
+```
+
+In order to use this auto-configuration mode, there can only be one master file
+in ```_master``` that contains ```_master_``` in its file name.
+
+For the NREL5MW and the DTU10MW HAWC2 models, you can find their respective
+master files and DLC definition spreadsheet files on Mimer. When connected
+to Gorm over SSH/PuTTY, you will find these files at:
+```
+/mnt/mimer/hawc2sim # (when on Gorm)
+```
+
+
+Method B: Generating htc input files interactively on the cluster
+-----------------------------------------------------------------
+
+Use ssh (Linux, Mac) or putty (MS Windows) to connect to the cluster.
+
+This approach gives you more flexibility, but requires more commands, and is hence
+considered more difficult compared to method A.
+
+First activate the Anaconda Python environment by typing:
+
+```bash
+# add the Anaconda Python environment paths to the system PATH
+g-000 $ export PATH=/home/MET/STABCON/miniconda/bin:$PATH
+# activate the custom python environment:
+g-000 $ source activate anaconda
+# add the Python libraries to the PYTHONPATH
+g-000 $ export PYTHONPATH=/home/MET/STABCON/repositories/prepost:$PYTHONPATH
+g-000 $ export PYTHONPATH=/home/MET/STABCON/repositories/pythontoolbox/fatigue_tools:$PYTHONPATH
+g-000 $ export PYTHONPATH=/home/MET/STABCON/repositories/pythontoolbox:$PYTHONPATH
+g-000 $ export PYTHONPATH=/home/MET/STABCON/repositories/MMPE:$PYTHONPATH
+```
+For example, to launch the auto-generation of the DLC input files:
+
+```
+g-000 $ cd path/to/HAWC2/model # folder where the hawc2 model is located
+g-000 $ python /home/MET/STABCON/repositories/prepost/dlctemplate.py --prep
+```
+
+Or start an interactive IPython shell:
+
+```
+g-000 $ ipython
+```
+
+Users should be aware that running computationally heavy loads on the login node
+is strictly discouraged. By overloading the login node, other users will
+experience slow login procedures, and the whole cluster could potentially be
+jammed.
+
+
+Method C: Generating htc input files locally
+--------------------------------------------
+
+This approach gives you total freedom, but is also more difficult since you
+will have to have a fully configured Python environment installed locally.
+Additionally, you need access to the cluster discs from your local workstation.
+Method C is not documented yet.
+
+
+Optional configuration
+----------------------
+
+Optional tags that can be set in the Excel spreadsheets, and their corresponding
+default values, are given below. Besides providing a replacement value in the
+master htc file, special actions are also connected to these tags. Consequently,
+these tags have to be present: when removed, the system will stop working properly.
+
+Relevant for the generation of the PBS launch scripts (```*.p``` files):
+* ```[walltime] = '04:00:00' (format: HH:MM:SS)```
+* ```[hawc2_exe] = 'hawc2-latest'```
+
+The following directories have to be defined; their default values are used when
+they are not set explicitly in the spreadsheets:
+* ```[animation_dir] = 'animation/'```
+* ```[control_dir] = 'control/'```, all files and sub-folders copied to node
+* ```[data_dir] = 'data/'```, all files and sub-folders copied to node
+* ```[eigenfreq_dir] = False```
+* ```[htc_dir] = 'htc/'```
+* ```[log_dir] = 'logfiles/'```
+* ```[res_dir] = 'res/'```
+* ```[turb_dir] = 'turb/'```
+* ```[turb_db_dir] = '../turb/'```
+* ```[turb_base_name] = 'turb_'```
+
+Required, and used for the PBS output and post-processing:
+* ```[pbs_out_dir] = 'pbs_out/'```
+* ```[iter_dir] = 'iter/'```
+
+Optional:
+* ```[turb_db_dir] = '../turb/'```
+* ```[wake_dir] = False```
+* ```[wake_db_dir] = False```
+* ```[wake_base_name] = 'turb_'```
+* ```[meander_dir] = False```
+* ```[meand_db_dir] = False```
+* ```[meand_base_name] = 'turb_'```
+* ```[mooring_dir] = False```, all files and sub-folders copied to node
+* ```[hydro_dir] = False```, all files and sub-folders copied to node
+
+A zip file will be created which contains all files in the model root directory,
+and all the contents (files and folders) of the following directories:
+```[control_dir], [mooring_dir], [hydro_dir], 'externalforce/', [data_dir]```.
+This zip file will be extracted into the execution directory (```[run_dir]```).
+After the model has run on the node, only the files that have been created
+during simulation time in the ```[log_dir]```, ```[res_dir]```,
+```[animation_dir]```, and ```[eigenfreq_dir]``` will be copied back.
+Optionally, one can also copy back the turbulence files and other explicitly
+defined files [TODO: expand manual here].
+
+
+Launching the jobs on the cluster
+---------------------------------
+
+Use ssh (Linux, Mac) or putty (MS Windows) to connect to the cluster.
+
+The ```launch.py``` script is a generic tool that helps with launching an
+arbitrary number of pbs launch scripts on a PBS Torque cluster. Launch scripts
+here are defined as files with a ```.p``` extension. The script will look for any
+```.p``` files in a specified folder (```pbs_in/``` by default, which the user
+can change using the ```-p``` or ```--path_pbs_files``` flag) and save them in a
+file list called ```pbs_in_file_cache.txt```. When using the option ```-c``` or
+```--cache```, the script will not look for pbs files, but instead read them
+directly from the ```pbs_in_file_cache.txt``` file.
+
+The launch script has a simple built-in scheduler that has been successfully
+used to launch 50,000 jobs. This scheduler is configured by two parameters: the
+number of cpu's requested (using ```-n``` or ```--nr_cpus```) and the minimum
+number of free cpu's required on the cluster (using ```--cpu_free```, 48 by
+default). Jobs will be launched after a predefined sleep time (as set by the
+```--tsleep``` option, 5 seconds by default). After the initial sleep time a new
+job will be launched every 0.1 second. If the launch condition is not met
+(```nr_cpus > cpu's used by user AND cpu's free on cluster > cpu_free```),
+the program will wait 5 seconds before trying to launch a new job again.
+
+Depending on the number of jobs and the required computation time, it could
+take a while before all jobs are launched. When running the launch script from
+the login node, this might be a problem if you have to close your ssh/putty
+session before all jobs are launched. In that case the user should use a
+dedicated compute node for launching jobs. To run the launch script on a
+compute node instead of the login node, use the ```--node``` option. You can
+inspect the progress in the ```launch_scheduler_log.txt``` file.
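+
+For example, to follow the scheduler progress (assuming the log file is written
+in the directory from which ```launch.py``` was started):
+
+```
+g-000 $ tail -f launch_scheduler_log.txt
+```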
+
+The ```launch.py``` script has several options; you can read about them by
+using the help function (the output is included below for your convenience):
+
+```bash
+g-000 $ launch.py --help
+
+usage: launch.py -n nr_cpus
+
+options:
+  -h, --help            show this help message and exit
+  --depend              Switch on for launch depend method
+  -n NR_CPUS, --nr_cpus=NR_CPUS
+                        number of cpus to be used
+  -p PATH_PBS_FILES, --path_pbs_files=PATH_PBS_FILES
+                        optionally specify location of pbs files
+  --re=SEARCH_CRIT_RE   regular expression search criterium applied on the
+                        full pbs file path. Escape backslashes! By default it
+                        will select all *.p files in pbs_in/.
+  --dry                 dry run: do not alter pbs files, do not launch
+  --tsleep=TSLEEP       Sleep time [s] after qsub command. Default=5 seconds
+  --logfile=LOGFILE     Save output to file.
+  -c, --cache           If on, files are read from cache
+  --cpu_free=CPU_FREE   No more jobs will be launched when the cluster does
+                        not have the specified amount of cpus free. This will
+                        make sure there is room for others on the cluster, but
+                        might mean less cpus available for you. Default=48.
+  --qsub_cmd=QSUB_CMD   Is set automatically by --node flag
+  --node                If executed on dedicated node.
+```
+
+Then launch the actual jobs (each job is a ```*.p``` file in ```pbs_in```) using
+100 cpu's, and using a compute node instead of the login node (so you can exit
+the ssh/putty session without interrupting the launching process):
+
+```bash
+g-000 $ cd path/to/HAWC2/model
+g-000 $ launch.py -n 100 --node
+```
+
+
+Inspecting running jobs
+-----------------------
+
+There are a few tools you can use from the command line to see what is going on
+on the cluster: how many nodes are free, how many nodes you are using, etc.
+
+* ```cluster-status.py``` overview dashboard of the cluster: nodes free, running,
+length of the queue, etc
+* ```qstat -u $USER``` list all the running and queued jobs of the user
+* ```nnsqdel $USER all``` delete all the jobs of the user
+* ```qdel_range JOBID_FROM JOBID_TIL``` delete a range of job id's
+
+Notice that the pbs output files in ```pbs_out``` are only created when the job
+has ended (or failed). When you want to inspect a running job, you can ssh from
+the Gorm login node to the node that runs the job. First, find the job id by
+listing all your current jobs (```qstat -u $USER```). The job id can be found in
+the first column, and you only need to consider the number, not the domain name
+attached to it. Now find out on which node it runs (replace 123456 with the
+relevant job id):
+```
+g-000 $ qstat -f 123456 | grep exec_host
+```
+
+From here you login into the node as follows (replace g-078 with the relevant
+node):
+```
+g-000 $ ssh g-078
+```
+
+And browse to the scratch directory which lands you in the root directory of
+your running HAWC2 model (replace 123456 with the relevant job id):
+```
+g-000 $ cd /scratch/$USER/123456.g-000.risoe.dk
+```
+
+
+Re-launching failed jobs
+------------------------
+
+In case you want to re-launch only a subset of a previously generated set of
+load cases, there are several methods:
+
+1. Copy the PBS launch scripts (they have the ```*.p``` extension and can be
+found in the ```pbs_in``` folder) of the failed cases to a new folder (for
+example ```pbs_in_failed```). Now run ```launch.py``` again, but instead point
+to the folder that contains the ```*.p``` files of the failed cases, for example:
+```
+g-000 $ launch.py -n 100 --node -p pbs_in_failed
+```
+
+2. Use the ```--cache``` option, and edit the PBS file list in the file
+```pbs_in_file_cache.txt``` so that only the simulations that have to be run
+again remain. Note that the ```pbs_in_file_cache.txt``` file is created every
+time you run ```launch.py```. You can use the ```--dry``` option to make a
+practice launch run: it will create a ```pbs_in_file_cache.txt``` file, but not
+a single job will be launched, as sketched below.
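+
+A sketch of this workflow (the ```--dry``` run only creates the cache file and
+launches nothing):
+
+```
+g-000 $ launch.py -n 100 --dry
+# edit pbs_in_file_cache.txt so that only the failed cases remain, then:
+g-000 $ launch.py -n 100 --node --cache
+```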
+
+3. Each pbs file can be launched manually as follows:
+```
+g-000 $ qsub path/to/pbs_file.p
+```
+
+Alternatively, one can use the following options in ```launch.py```:
+
+* ```-p some/other/folder```: specify from which folder the pbs files should be taken
+* ```--re=SEARCH_CRIT_RE```: advanced filtering based on the pbs file names. It
+requires some notion of regular expressions (some random tutorials:
+[1](http://www.codeproject.com/Articles/9099/The-Minute-Regex-Tutorial),
+[2](http://regexone.com/))
+    * ```launch.py -n 10 --re=.SOMENAME.``` will launch all pbs files that
+    contain ```SOMENAME```. Notice the leading and trailing dot, which in a
+    regular expression matches any character (comparable to the bash wild
+    card *); see also the example below.
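+
+For instance, to launch only the DLC12 cases (assuming their pbs file names
+contain ```dlc12```):
+
+```
+g-000 $ launch.py -n 100 --node --re=.dlc12.
+```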
+
+
+Post-processing
+---------------
+
+The post-processing happens through the same script as used for generating the
+htc files, but now we set different flags. For example, for checking the log
+files, calculating the statistics, the AEP and the life time equivalent loads:
+
+```
+g-000 $ qsub-wrap.py -f /home/MET/STABCON/repositories/prepost/dlctemplate.py -c python --years=25 --neq=1e7 --stats --check_logs --fatigue
+```
+
+Other options for the ```dlctemplate.py``` script:
+
+```
+usage: dlctemplate.py [-h] [--prep] [--check_logs] [--stats] [--fatigue]
+                      [--csv] [--years YEARS] [--no_bins NO_BINS] [--neq NEQ]
+
+pre- or post-processes DLC's
+
+optional arguments:
+  -h, --help         show this help message and exit
+  --prep             create htc, pbs, files (default=False)
+  --check_logs       check the log files (default=False)
+  --stats            calculate statistics (default=False)
+  --fatigue          calculate Leq for a full DLC (default=False)
+  --csv              Save data also as csv file (default=False)
+  --years YEARS      Total life time in years (default=20)
+  --no_bins NO_BINS  Number of bins for fatigue loads (default=46)
+  --neq NEQ          Equivalent cycles neq (default=1e6)
+```
+
+Debugging
+---------
+
+Any output (everything that involves print statements) generated during the
+post-processing of the simulations using ```dlctemplate.py``` is captured in
+the ```pbs_out/qsub-wrap_dlctemplate.py.out``` file, while exceptions and errors
+are redirected to the ```pbs_out/qsub-wrap_dlctemplate.py.err``` text file.
+
+The output and errors of HAWC2 simulations can also be found in the ```pbs_out```
+directory. The ```.err``` and ```.out``` files will be named exactly the same
+as the ```.htc``` input files, and the ```.sel```/```.dat``` output files.
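+
+For example, a quick way to list the HAWC2 error files that actually contain
+something (this is a generic shell one-liner, not part of the toolbox):
+
+```
+g-000 $ find pbs_out -name '*.err' -size +0
+```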
+
diff --git a/docs/using-statistics-df.md b/docs/using-statistics-df.md
new file mode 100644
index 00000000..aa78af08
--- /dev/null
+++ b/docs/using-statistics-df.md
@@ -0,0 +1,179 @@
+How to use the Statistics DataFrame
+===================================
+
+
+Introduction
+------------
+
+The statistical data of your post-processed load cases are saved in the HDF
+format. You can use Pandas to retrieve and organize that data. Pandas organizes
+the data in a DataFrame, and the library is powerful, comprehensive and requires
+some learning. There are extensive resources out in the wild that will help you
+get started:
+
+* a [list](http://pandas.pydata.org/pandas-docs/stable/tutorials.html)
+of good tutorials can be found in the Pandas
+[documentation](http://pandas.pydata.org/pandas-docs/version/0.16.2/).
+* a short and simple
+[tutorial](https://github.com/DTUWindEnergy/Python4WindEnergy/blob/master/lesson%204/pandas.ipynb)
+as used for the Python 4 Wind Energy course
+
+
+The data is organized in a simple 2-dimensional table. However, since the
+statistics of each channel are included for multiple simulations, the data set
+is actually 3-dimensional. As an example, this is how a table could look:
+
+```
+   [case_id]  [channel name]  [mean]  [std]    [windspeed]
+       sim_1          pitch        0      1              8
+       sim_1            rpm        1      7              8
+       sim_2          pitch        2      9              9
+       sim_2            rpm        3      2              9
+       sim_3          pitch        0      1              7
+```
+
+Each row is a channel of a certain simulation, and the columns represent the
+following:
+
+* a tag from the master file and the corresponding value for the given simulation
+* the channel name, description, units and unique identifier
+* the statistical parameters of the given channel
+
+
+Load the statistics as a pandas DataFrame
+-----------------------------------------
+
+Pandas has some very powerful functions that will help you analyse large and
+complex DataFrames. The documentation is extensive and is supplemented with
+various tutorials. You can use
+[10 Minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)
+as a first introduction.
+
+Loading the pandas DataFrame table works as follows:
+
+```python
+import pandas as pd
+df = pd.read_hdf('path/to/file/name.h5', 'table')
+```
+
+Some tips for inspecting the data:
+
+```python
+import numpy as np
+
+# Check the available data columns:
+for colname in sorted(df.keys()):
+    print colname
+
+# list all available channels:
+print np.sort(df['channel'].unique())
+
+# list the different load cases
+df['[Case folder]'].unique()
+```
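+
+You can also select all rows that belong to a single channel. A minimal sketch
+(the channel name 'pitch' is taken from the example table above; your channel
+names will differ):
+
+```python
+# all simulations (rows) for one channel
+df_pitch = df[df['channel'] == 'pitch']
+# show the case id and the mean value for the first few rows
+print(df_pitch[['[case_id]', 'mean']].head())
+```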
+
+
+Reduce memory footprint using categoricals
+------------------------------------------
+
+When the DataFrame is consuming too much memory, you can try to reduce its
+size by using categoricals. A more extensive introduction to categoricals can be
+found
+[here](http://pandas.pydata.org/pandas-docs/stable/faq.html#dataframe-memory-usage)
+and [here](http://matthewrocklin.com/blog/work/2015/06/18/Categoricals/).
+The basic idea is to replace each string value with an integer, and keep an
+index that maps the integers back to the original string values. This trick only
+helps when you have long strings that occur multiple times throughout your data set.
+
+The following example shows how you can use categoricals to reduce the memory
+usage of a pandas DataFrame:
+
+```python
+# load a certain DataFrame
+df = pd.read_hdf(fname, 'table')
+# Return the total estimated memory usage
+print '%10.02f MB' % (df.memory_usage(index=True).sum()/1024.0/1024.0)
+# the data type of column that contains strings is called object
+# convert objects to categories to reduce memory consumption
+for column_name, column_dtype in df.dtypes.iteritems():
+    # applying categoricals mostly makes sense for objects, we ignore all others
+    if column_dtype.name == 'object':
+        df[column_name] = df[column_name].astype('category')
+print '%10.02f MB' % (df.memory_usage(index=True).sum()/1024.0/1024.0)
+```
+
+Python has a garbage collector working in the background that deletes
+un-referenced objects. In some cases it might help to actively trigger the
+garbage collector as follows, in an attempt to free up memory during a run of
+a script that is almost flooding the memory:
+
+```python
+import gc
+gc.collect()
+```
+
+Load a DataFrame that is too big for memory in chunks
+-----------------------------------------------------
+
+When a DataFrame is too big to load into memory at once, and you have already
+compressed your data using categoricals (as explained above), you can read
+the DataFrame one chunk at a time. A chunk is a selection of rows. For
+example, you can read 1000 rows at a time by setting ```chunksize=1000```
+when calling ```pd.read_hdf()```:
+
+```python
+# load a large DataFrame in chunks of 1000 rows
+for df_chunk in pd.read_hdf(fname, 'table', chunksize=1000):
+    print 'DataFrame chunk contains %i rows' % (len(df_chunk))
+```
+
+We will read a large DataFrame in chunks into memory, and select only those
+rows that belong to dlc12:
+
+```python
+# only select one DLC, and place it in one DataFrame. If the data
+# containing one DLC is still too big for memory, this approach will fail
+
+# create an empty DataFrame, here we collect the results we want
+df_dlc12 = pd.DataFrame()
+for df_chunk in pd.read_hdf(fname, 'table', chunksize=1000):
+    # organize the chunk: all rows for which [Case folder] is the same
+    # in a single group. Each group is now a DataFrame for which
+    # [Case folder] has the same value.
+    for group_name, group_df in df_chunk.groupby(df_chunk['[Case folder]']):
+        # if we have the group with dlc12, save them for later
+        if group_name == 'dlc12_iec61400-1ed3':
+            df_dlc12 = pd.concat([df_dlc12, group_df])
+```
+
+Plot wind speed vs rotor speed
+------------------------------
+
+```python
+# select the channels of interest
+for group_name, group_df in df_dlc12.groupby(df_dlc12['channel']):
+    # iterate over all channel groups, but only do something with the channels
+    # we are interested in
+    if group_name == 'Omega':
+        # we save the case_id tag, the mean value of channel Omega
+        df_rpm = group_df[['[case_id]', 'mean']].copy()
+        # note we made a copy because we will change the DataFrame in the next line
+        # rename the column mean to something more useful
+        df_rpm.rename(columns={'mean': 'Omega-mean'}, inplace=True)
+    elif group_name == 'windspeed-global-Vy-0.00-0.00--127.00':
+        # we save the case_id tag, the mean value of channel wind, and the 
+        # value of the Windspeed tag
+        df_wind = group_df[['[case_id]', 'mean', '[Windspeed]']].copy()
+        # note we made a copy because we will change the DataFrame in the next line
+        # rename the mean of the wind channel to something more useful
+        df_wind.rename(columns={'mean': 'wind-mean'}, inplace=True)
+
+# join both results on the case_id value so the mean RPM and mean wind speed
+# are referring to the same simulation/case_id.
+df_res = pd.merge(df_wind, df_rpm, on='[case_id]', how='inner')
+
+# now we can plot RPM vs wind speed
+from matplotlib import pyplot as plt
+plt.plot(df_res['wind-mean'].values, df_res['Omega-mean'].values, '*')
+```
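+
+The figure can then be labelled and saved to a file (labels and file name are
+only an illustration):
+
+```python
+plt.xlabel('mean wind speed [m/s]')
+plt.ylabel('mean rotor speed')
+plt.savefig('rpm_vs_windspeed.png')
+```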
+
-- 
GitLab