From 501f1263e3dc0e0e7105aad25dfe96fcaaf4825c Mon Sep 17 00:00:00 2001
From: David Robert Verelst <dave@dtu.dk>
Date: Thu, 1 Mar 2018 15:40:20 +0100
Subject: [PATCH] add docs for launch.py as taken from wetb

---
 docs/launch.md | 139 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 139 insertions(+)
 create mode 100644 docs/launch.md

diff --git a/docs/launch.md b/docs/launch.md
new file mode 100644
index 0000000..9869f8b
--- /dev/null
+++ b/docs/launch.md
@@ -0,0 +1,139 @@
+Launching the jobs on the cluster
+---------------------------------
+
+Use ssh (Linux, Mac) or PuTTY (MS Windows) to connect to the cluster.
+
+The ```launch.py``` script is a generic tool that helps with launching an
+arbitrary number of PBS launch scripts on a PBS Torque cluster. Launch scripts
+are defined here as files with a ```.p``` extension. The script will look for any
+```.p``` files in a specified folder (```pbs_in/``` by default, which the user
+can change using the ```-p``` or ```--path_pbs_files``` flag) and save them in a
+file list called ```pbs_in_file_cache.txt```. When using the option ```-c``` or
+```--cache```, the script will not look for PBS files, but instead read them
+directly from the ```pbs_in_file_cache.txt``` file.
+
+The launch script has a simple built-in scheduler that has been successfully
+used to launch 50,000 jobs. This scheduler is configured by three parameters:
+number of CPUs requested (```-n``` or ```--nr_cpus```), minimum number of free
+CPUs on the cluster (```--cpu_free```, 48 by default), and maximum number of
+queued jobs per user (```--cpu_user_queue```, 5 by default). After an initial
+sleep time (```--tsleep```, 5 seconds by default), a new job will be launched
+every 0.5 seconds (```--tsleep_short```). If the launch condition is not
+met:
+
+```
+nr_cpus > CPUs used by user
+AND CPUs free on cluster > cpu_free
+AND jobs queued by user < cpu_user_queue
+```
+
+the program will sleep 5 seconds (```--tsleep```) before trying to launch a new job again.
+
+Depending on the number of jobs and the required computation time, it could
+take a while before all jobs are launched. When running the launch script from
+the login node, this might be a problem when you have to close your SSH/PuTTY
+session before all jobs are launched. In that case the user can use the
+```--crontab``` argument: it will trigger the ```launch.py``` script every 5
+minutes to check if more jobs can be launched until all jobs have been
+executed. The user does not need to have an active SSH/PuTTY session for this to
+work. You can follow the progress and configuration of ```launch.py``` in
+crontab mode in the following files:
+
+* ```launch_scheduler_log.txt```
+* ```launch_scheduler_config.txt```: you can change your launch settings on the fly
+* ```launch_scheduler_state.txt```
+* ```launch_pbs_filelist.txt```: remaining jobs; when a job is launched it is
+removed from this list
+
+You can check if ```launch.py``` is actually active as a crontab job with:
+
+```
+crontab -l
+```
+
+```launch.py``` will clean up the crontab after all jobs are launched, but if
+you need to prevent it from launching new jobs before that, you can clean up your
+crontab with (note that this removes all crontab jobs of the current user):
+
+```
+crontab -r
+```
+
+The ```launch.py``` script has various options, and you can read about
+them by using the help function (the output is included for your convenience):
+
+```bash
+g-000 $ launch.py --help
+Usage:
+
+launch.py -n nr_cpus
+
+launch.py --crontab when running a single iteration of launch.py as a crontab job every 5 minutes.
+File list is read from "launch_pbs_filelist.txt", and the configuration can be changed on the fly
+by editing the file "launch_scheduler_config.txt".
+
+Options:
+  -h, --help            show this help message and exit
+  --depend              Switch on for launch depend method
+  -n NR_CPUS, --nr_cpus=NR_CPUS
+                        number of cpus to be used
+  -p PATH_PBS_FILES, --path_pbs_files=PATH_PBS_FILES
+                        optionally specify location of pbs files
+  --re=SEARCH_CRIT_RE   regular expression search criterium applied on the
+                        full pbs file path. Escape backslashes! By default it
+                        will select all *.p files in pbs_in/.
+  --dry                 dry run: do not alter pbs files, do not launch
+  --tsleep=TSLEEP       Sleep time [s] when cluster is too bussy to launch new
+                        jobs. Default=5 seconds
+  --tsleep_short=TSLEEP_SHORT
+                        Sleep time [s] between between successive job
+                        launches. Default=0.5 seconds.
+  --logfile=LOGFILE     Save output to file.
+  -c, --cache           If on, files are read from cache
+  --cpu_free=CPU_FREE   No more jobs will be launched when the cluster does
+                        not have the specified amount of cpus free. This will
+                        make sure there is room for others on the cluster, but
+                        might mean less cpus available for you. Default=48
+  --cpu_user_queue=CPU_USER_QUEUE
+                        No more jobs will be launched after having
+                        cpu_user_queue number of jobs in the queue. This
+                        prevents users from filling the queue, while still
+                        allowing to aim for a high cpu_free target. Default=5
+  --qsub_cmd=QSUB_CMD   Is set automatically by --node flag
+  --node                If executed on dedicated node. Although this works,
+                        consider using --crontab instead. Default=False
+  --sort                Sort pbs file list. Default=False
+  --crontab             Crontab mode: %prog will check every 5 (default)
+                        minutes if more jobs can be launched. Not compatible
+                        with --node. When all jobs are done, crontab -r will
+                        remove all existing crontab jobs of the current user.
+                        Use crontab -l to inspect current crontab jobs, and
+                        edit them with crontab -e. Default=False
+  --every_min=EVERY_MIN
+                        Crontab update interval in minutes. Default=5
+  --debug               Debug print statements. Default=False
+
+```
+
+Then launch the actual jobs (each job is a ```*.p``` file in ```pbs_in```) using
+100 CPUs:
+
+```bash
+g-000 $ cd /mnt/mimer/hawc2sim/demo/A0001
+g-000 $ launch.py -n 100 -p pbs_in/
+```
+
+If the launching process requires hours, and you have to close your SSH/PuTTY
+session before it reaches the end, you can either use the ```--node``` or the
+```--crontab``` argument. When using ```--node```, ```launch.py``` will run on
+a dedicated cluster node, submitted as a PBS job. When using ```--crontab```,
+```launch.py``` will be run once every 5 minutes as a ```crontab``` job on the
+login node. This is preferred since you are not occupying a node with a very
+simple and light job. ```launch.py``` will remove all the user's crontab jobs
+at the end with ```crontab -r```.
+
+```bash
+g-000 $ cd /mnt/mimer/hawc2sim/demo/A0001
+g-000 $ launch.py -n 100 -p pbs_in/ --crontab
+```
+
-- 
GitLab