TRAINING AND DOCUMENTATION MATERIALS:
• User Meeting Slides
• GAEA Wiki page
• FAQs
• System Details
• GAEA Transfers (view in ppt or word)
• Cheatsheet - 10/7/11
• Hands on Training - 1/25/11
• Information Session - 9/7/11
• Quickstart Guide
• Cray Application Developer's Environment User's Guide
|
NEWS AND UPDATES:
| 08/30/12 |
NCEP user directory moves. See details below. |
| 08/8/12 |
View the Gaea user meeting slides from 8/6/12 here |
| 07/24/12 |
(from MH IBM notes) The old Gaea c1ms machine was shut down on 07/09 so it can be upgraded (h/w & s/w); the upgraded machine will be called c1 once it is available to users again. Stability testing for c1 is scheduled to begin 07/25, with ‘advance’ users allowed on by 08/06 and all users allowed on by 08/20. For 30 days after that, users are required to run dual jobs on c1. C2 issue: jobs were running slower than usual, so the system was rebooted on 07/03; no further problems have been reported but no root cause has been found (updates on the issue are on the Gaea Known Issues page).
Cron will be allowed on Gaea once testing on Zeus is complete & a certificate is issued. The wiki will be updated once cron is in place.
Gaea access to HPSS on Zeus is being worked - hsi & htar commands are available on remote data transfer nodes (rdtn). Issues found during testing are being worked between ORNL & CSC & the wiki will be updated once things are resolved. |
| 02/22/12 |
View the GAEA wiki page here. |
| 11/29/11 |
(from MH IBM notes) Data transfer to/from Gaea. Only George & Kate are currently authorized to do data transfer from Gaea to Vapor; DTN (Nwave) has set up for all users but wants one user to test the transfers before allowing the rest to start. Restricted data on Gaea: working on agreement with ORNL, but ORNL security says we can’t call it ‘restricted’ data. NOAA needs to make a restricted data agreement for Zeus too. |
| 11/22/11 |
(from MH IBM notes) Data transfer to/from Gaea. George & Kate H are the only users currently authorized to do data transfer from Gaea to Vapor; DTN (Nwave) will have to authorize the rest of the users. Vapor admins installed GridFTP on Vapor (11/16) so users on Vapor can transfer to/from Gaea. Now Gaea RDTN has to fix a certificate issue so Vapor users can have their certificates accepted & finally transfer files to/from Gaea. Restricted data on Gaea: NOAA is working on an agreement with ORNL, but we'll need to call it something else because 'restricted' has a special meaning to ORNL security & we don't want them to upgrade Gaea to a higher security level with more user restrictions. |
| 11/15/11 |
Gaea admins changed their firewall to allow Vapor to access Gaea's RDTN nodes, so George is getting good transfer rates now, but he's the only user currently authorized to do data transfers. NCEP will request that other users (starting with Kate) be authorized to transfer data. Under the current set-up users have to be on Gaea to transfer data to/from Vapor; Vapor admins need to install GridFTP on Vapor so users on Vapor can transfer to/from Gaea (scheduled for 11/16).
|
| 11/8/11 |
(from MH IBM notes) Gaea firewall issues interfering with Vapor data transfers still not resolved. Vapor will also need to replace a daemon to get transfers working properly, which will require a brief outage (length & date TBD).
|
| 11/2/11 |
(from MH IBM notes) Gaea data transfer. George is very close to being able to transfer data to/from Gaea; the Gaea admins have to change their firewall to allow Vapor to access Gaea's RDTN nodes and Vapor admins have to modify their firewall to allow Gaea users to write as well as read files on Vapor. Gaea firewall change date is TBD but Vapor firewall change will happen ASAP.
|
| 10/27/11 |
(from MH IBM notes) Data transfer to/from Gaea. CSC modified the Vapor firewall so data can move directly to/from Gaea but there is some sort of issue on the Gaea side that keeps it from working. In the meantime, George V created a special directory on Vapor where users can put non-restricted data files that will be sent to Gaea via the Jet computer at Boulder. No restricted data can be put on Gaea until NOAA has an agreement with ORNL. There will be no direct link from the CCS to Gaea, all data transfers will have to be via Vapor & Zeus.
Also, Kate Howard is the EMC POC for Gaea user support.
|
| 10/11/11 |
(from MH IBM notes) Data transfer to/from Gaea. CSC is working on an RFC to modify the Vapor firewall so data can move directly between Vapor & Gaea; there will be no direct link between Gaea and CCS. In the meantime, George V created a special directory on Vapor where users can put non-restricted data files that will be sent to Gaea via the Jet computer at Boulder. No restricted data can be put on Gaea until NOAA has an agreement with ORNL.
|
|
CURRENT GFDL ACTION ITEMS:
| ITEM | EMC WORKED | ELEVATED TO GFDL | STATUS | RESOLUTION |
| NCEP access to savaged fs content | Yes | 11/11/11 | Closed | Closed 11/28 - All requested access to savaged fs content has been given |
| Emacs issue | Yes | 11/10/11 | Closed | Resolved in EMC |
| Set $NETCDF Variable | Yes | 11/3/11 | Closed | EMC working |
| Thread stack overflow | Yes | 10/31/11 | Closed | From GFDL - In order to increase the OpenMP stacksize for the threads, use KMP_STACKSIZE with a suggested value of 512MB. |
| Large mpi_gather fails | Yes | 10/28/11 | Closed | Closed 11/20 - Work around found |
| MPMD - Ksh executable issue on cluster nodes | Yes | 11/28/11 | Closed | Being worked by NCO and EMC |
Message: The ksh executable will not run on cluster nodes because it tries to load nonexistent libraries. Using the AT&T official x86-64 ksh binary is a workaround for this, but that led to discovering several more problems.
1. /bin/sort does not exist on cluster nodes,
2. gunzip wants to write to /tmp, but cannot
3. aprun is unable to run more programs than the number of machines. The aprun program should be able to do this:
aprun -n 1 prog1 : -n 1 prog2 : -n 1 prog3 : -n 1 prog4 : ... : -n 1 prog144
when there are 144 CPUs available. Unfortunately, it is unable to do that -- it can only run up to one program per node.
Effects of the problem: HWRF system will not be able to run in coupled mode on GAEA. Also, HWRF's prep_hybrid program will be unusable.
Path to associated scripts: /lustre/fs/scratch/Samuel.Trahan/mpmd.testcase
Path to Std Out: stdout: /lustre/fs/scratch/Samuel.Trahan/mpmd.testcase/test.out
stderr: /lustre/fs/scratch/Samuel.Trahan/mpmd.testcase/test.err
Occurrence: 100%
Reproducible?: Yes. Use this job script:
/lustre/fs/scratch/Samuel.Trahan/mpmd.testcase/testjob.ksh
Software/tools: /lustre/fs/scratch/Samuel.Trahan/mpmd.testcase> module list
Currently Loaded Modulefiles:
1) modules/3.2.6.6 7) pmi/1.0-1.0000.7901.22.1.ss 13) moab/5.4.1
2) xt-sysroot/3.1.29.securitypatch.20100707 8) xt-sysroot/3.1.29 14) torque/2.4.9-snap.201006181312
3) xtpe-network-seastar 9) portals/2.2.0-1.0301.22039.18.1.ss 15) xtpe-mc12
4) pgi/10.6.0 10) xt-asyncpe/4.4 16) TimeZoneEDT
5) xt-libsci/10.4.6 11) PrgEnv-pgi/3.1.29 17) CmrsEnv
6) xt-mpt/5.0.1 12) eswrap/1.0.9
|
| Upgrade svn client to v1.6.X | Yes | 11/30/11 | Open | Requested GFDL make the upgrade |
| New short queue | Yes | 12/15/11 | Open | |
|
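The MPMD item above describes an aprun command line with one "-n 1 prog" segment per program, joined by colons. As a minimal sketch of how such a line could be assembled in a script, the helper below (build_mpmd_cmd is a hypothetical name, not part of aprun or of the test case) only prints the command and runs nothing. Whether aprun accepts more segments than there are nodes is exactly the open problem described in the item.

```shell
# Build an MPMD aprun command line of the form quoted in the item:
#   aprun -n 1 prog1 : -n 1 prog2 : ... : -n 1 progN
# Program names are passed as arguments. Nothing is executed here;
# the command is only printed so it can be inspected first.
build_mpmd_cmd() {
    cmd="aprun"
    sep=""
    for prog in "$@"; do
        cmd="$cmd$sep -n 1 $prog"
        sep=" :"
    done
    printf '%s\n' "$cmd"
}

# Example with three placeholder program names.
build_mpmd_cmd prog1 prog2 prog3
```

The printed line can be pasted into a job script or passed to eval once it has been checked.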
CURRENT EMC ACTION ITEMS:
| ITEM | STATUS | RESOLUTION |
| How do we run numerous CPU-intensive scripts? | Open | |
Effects of the problem: NCEP pre and post processing jobs include scripts which run many commands which are not individually expensive but in aggregate are cpu intensive and not suitable for batch nodes. Because many of the commands are themselves scripts, they also cannot be run with aprun. Attempts to run them on batch nodes with multiple concurrent executions are believed to be very dangerous. Even single serial executions if done by many concurrent independent jobs are unsafe on batch nodes. But there is nowhere else for a batch job to run them.
Path to associated scripts: A testcase is in /lustre/fs/scratch/George.Vandenberghe/mpmd.testcase. It consists of 146 separate scripts which are individually fairly expensive. We would like to run 10 or so of these per node in the background in a batch job. How do we do this? A sample script is:
cat cmd.c10n8.gz | gunzip | sort -n -k8 | tail >top10.cmd.c10n8.gz
This testcase was designed to illustrate the problem. Real cases have more complex logic, which a reader would first have to understand before analyzing the problem; that is why we have submitted this synthetic testcase. This testcase could potentially be repackaged into an MPI job. A more complex script would be much more difficult or impossible to transform this way.
Script function: Generally preprocessing or postprocessing multiple files with a multistep script algorithm.
Reproducible?: Yes. Run the scripts from 1-99 serially and then consider what happens when they are run in many concurrent batch jobs. |
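For reference, the general shell pattern for "run about 10 of these at a time in the background" looks like the sketch below. It is not a confirmed answer to the open item: it says nothing about where the commands may safely run on Gaea, which is the actual question. The function name run_in_batches is hypothetical.

```shell
# Run each command string given after the first argument in the
# background, at most $1 at a time. Each full batch is waited for
# before the next one starts. Returns non-zero if any command failed.
run_in_batches() {
    max=$1
    shift
    running=0
    failed=0
    pids=""
    for script in "$@"; do
        # sh -c lets a command string contain pipes and redirections.
        sh -c "$script" &
        pids="$pids $!"
        running=$((running + 1))
        if [ "$running" -ge "$max" ]; then
            for pid in $pids; do
                wait "$pid" || failed=1
            done
            pids=""
            running=0
        fi
    done
    # Wait for the final, possibly partial, batch.
    for pid in $pids; do
        wait "$pid" || failed=1
    done
    return "$failed"
}
```

A call such as run_in_batches 10 "cmd1" "cmd2" ... would start ten commands, wait for all ten, then start the next ten. Waiting per batch is the simplest form; it leaves slots idle while the slowest command in a batch finishes.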
USER TRAINING - 11/1/11:
Quick start guide, FAQs, System Details
GFDL GAEA Presentation (ppt)
Cray XE6 Architecture and Performance Top 10
User Workshop Outcome
IMPORTANT INFORMATION:
NCEP User Directory Move:
At 9:00AM EST on Wednesday, August 29, NCEP users had their FS and LTFS user directories moved into an NCEP specific folder within each of the file systems. A soft-link (symlink) was left in the place of each of the users' folders. These soft-links point to the new LTFS and FS locations. Our plan remains to remove these symlinks after a 30 day grace period, tentatively scheduled for September 26th. Users are highly encouraged to use this grace period to update their scripts to reflect the new location of their FS and LTFS scratch spaces.
Old directories for NCEP users:
/lustre/ltfs/scratch/$USER
/lustre/fs/scratch/$USER
New directories for NCEP users:
/lustre/ltfs/scratch/ncep/$USER
/lustre/fs/scratch/ncep/$USER
NCEP non-scratch directories on fs (unchanged):
/lustre/fs/dev/ncep
/lustre/fs/pdata/ncep
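One way to update scripts for the move is a sed filter that inserts ncep/ after the scratch directory. The sketch below is an illustration under the path layout listed above; update_scratch_paths is a hypothetical helper name. It rewrites every FS and LTFS scratch path it sees, so review the result before replacing a script.

```shell
# Rewrite old NCEP scratch paths to the new ncep/ locations.
# Reads text on stdin and writes the updated text to stdout.
# The last expression collapses a doubled "ncep/" so paths that
# were already updated pass through unchanged.
update_scratch_paths() {
    sed -e 's#/lustre/ltfs/scratch/#/lustre/ltfs/scratch/ncep/#g' \
        -e 's#/lustre/fs/scratch/#/lustre/fs/scratch/ncep/#g' \
        -e 's#/scratch/ncep/ncep/#/scratch/ncep/#g'
}

# Example: a line from a job script before and after.
printf '%s\n' 'cd /lustre/fs/scratch/$USER' | update_scratch_paths
```

Typical use would be update_scratch_paths < myjob.sh > myjob.sh.new, where myjob.sh is a placeholder name, followed by a diff of the two files.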
Gaea Update - 7/23/12
There has been a lot going on recently and I wanted to provide an update on Gaea and a status on the transition to c1. General updates are normally provided at the User Meeting - the next one is scheduled on August 6 at 10am. For the next User Meeting, we are planning on starting with Gaea to accommodate off-site users who are not interested in the other GFDL-specific updates.
C1MS to C1 transition:
- Gaea has two compute partitions (c1ms and c2)
- The two partitions had different high speed networks and processors
- The performance was different and the runs on one partition would not match the answers from runs on the other partition
- Currently, the c1ms is being upgraded to match the c2 architecture
- If needed, users can re-run legacy experiments from the c1ms on the t1ms (a small test machine matching the c1ms architecture)
- The t1ms will be made available for only a few months
- When the t1ms is available, details will be sent out on how to access it
- More information about Gaea’s compute partitions can be found at:
- The c1ms to c1 hardware upgrade is complete
- Hardware updates included c1 and gaea1-4
- Important dates:
- July 25th - Stability testing scheduled to begin
- Aug 6th - Initial “Advance” users introduced to c1
- Aug 20th - Scheduled general access and end of stability testing
- Users are reminded that for the first 30 days of production, they are required to dual run jobs on c1
Scheduling on Gaea:
- Until dual running completes, c1 and c2 will remain separate systems
- Instructions on how to access c1 will be sent out before the general access date of the system
- The following changes to the scheduler policy are expected to be put in place during the first Preventative Maintenance (PM) window following the end of dual running:
- You will be able to submit to either c1 or c2 from any login node
- In addition to the current scheduling policies, the following changes will be made to enhance job performance and job packing efficiency:
- Jobs of 896 processors or fewer can run on c1
- Jobs of 896 processors or more can run on c2
- Small jobs can backfill c2
- To take advantage of this, users will only need to add both partitions to their msub statements: -l partition=c1,c2
- Additional details on the current queue policy and on these updates will be available on the gaeadocs scheduling policy page:
Variability on c2:
- There have been no reports of longer than expected run times following the reboot on July 3rd
- A root cause has not yet been identified
- If further reports are seen, we will gather system and job data for analysis and follow our current procedure of rebooting the system 24 hours after analysis stats have been collected and analyzed
- Updates about the variability issue are posted on the Gaea Known Issues page:
HPC Reporting (compute allocation reporting)
- A new version of hpcrpt is available on Gaea
- The new version has two additional reports: user and summary
- The user report reports all usage of a user across all projects on Gaea
- The summary report provides a summary of all project usage on Gaea
- The project report now provides windfall usage by user
- CSC is working with Adaptive Computing to generate the final versions of these scripts. We expect that the current versions will be very similar
- To access the new version, run the following command on a gaea login node(gaea1-8):
- module load hpcrpt
- hpcrpt.new
- Documentation of the old and new hpcrpt formats is available on the gaeadocs wiki at:
Upcoming Changes:
- A “cron” scheduling service will be made available for Gaea
- The current solution is being tested by NCEP
- ORNL is evaluating the solution that is on Zeus
- Will require a valid certificate
- Users should not rely on cron as a scheduler. It is being provided as a convenience. If performance or reliability issues are encountered, the service will be suspended.
- The wiki will be updated once a final solution is in place.
- NESCC Archive access on Zeus
- hsi and htar commands are available on the remote data transfer nodes (rdtn)
- During testing a problem with data transfers has been encountered. CSC will be working with ORNL to resolve the issue. Due to competing priorities by CSC, it is not anticipated that this will be worked on before next week.
- The wiki will be updated once a final solution is in place.
General Updates:
- All of the most current information regarding gaea can be found at gaeadocs:
- Help for all gaea issues can be reached at oar.gfdl.help@noaa.gov
- Detailed instructions for getting help are available on gaeadocs (also known as the Gaea wiki):
- User Meetings provide a general update and an opportunity to have a dialog about current issues.
- The next upcoming User Meeting is on August 6 at 10am. An announcement will be sent out with the dial-in information.
- We are changing the format and starting with Gaea to accommodate off-site users. If users are not interested in updates about GFDL-specific issues, they can drop off after the Gaea updates.
- Recent User Meeting slides are posted:
- Software requests and configuration changes requested through the help desk must be validated and go through the Configuration Management (CM) process. The CM board meets once a week to determine when and how these changes will take place. Generally a timeline for installation/change is provided in the help desk ticket after the change has been approved.
How to request unattended transfer setup: CCS -> Zeus & Gaea -> Zeus
Click here for document detailing requests for unattended data transfers to Zeus
GridFTP
The command provided to us to copy data to/from Vapor is globus-url-copy
The following example copies a file to vapor. Its name should be unique to your ID.
This is all one line and MUST run from an initial Gaea login session, the one you used your token for.
globus-url-copy -vb -p 1 -bs 6M gsiftp://rdtn01.ncrc.gov/lustre/ltfs/scratch/$YOUR.NAME/$FILENAME gsiftp://dtn-001.gaithersburg.rdhpcs.noaa.gov/gpfs/v/scratch/stmp/$FILENAME
Once you have verified it works you can configure your environment for it to work more generally.
The key thing you need is an X509_USER_PROXY file, which by default is a volatile file in /tmp.
To capture it, run the following FROM YOUR INITIAL GAEA LOGIN SESSION:
mkdir $HOME/.globus
cp $X509_USER_PROXY $HOME/.globus/cert
Then in subsequent jobs
setenv X509_USER_PROXY $HOME/.globus/cert
globus-url-copy should then work in eslogin and rdtn batch jobs.
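The commands above are in csh form. For sh, ksh, or bash users, the same capture step might be sketched as below. The function name save_proxy is hypothetical, and the chmod 600 is an added precaution (credential files are normally kept owner-only), not something the steps above require.

```shell
# Copy the short-lived proxy named by $X509_USER_PROXY to a
# permanent location so later sessions and batch jobs can use it.
# The destination defaults to $HOME/.globus/cert, as in the steps above.
save_proxy() {
    dest=${1:-$HOME/.globus/cert}
    if [ -z "${X509_USER_PROXY:-}" ] || [ ! -f "$X509_USER_PROXY" ]; then
        echo "X509_USER_PROXY is not set or does not name a file" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dest")" || return 1
    cp "$X509_USER_PROXY" "$dest" || return 1
    # Assumption: owner-only permissions are wanted for a credential.
    chmod 600 "$dest"
}

# In later sessions or jobs, the sh equivalent of the setenv line is:
#   export X509_USER_PROXY=$HOME/.globus/cert
```

As with the csh version, save_proxy has to be run from the initial Gaea login session, while the proxy in /tmp still exists.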
A sample batch job to do a transfer is
#msub
#Options
#PBS -l partition=es,size=1,walltime=04:11:09
#PBS -q rdtn
#PBS -S /bin/csh
setenv X509_USER_PROXY $HOME/.globus/cert
globus-url-copy -vb -p 1 -bs 6M gsiftp://rdtn01.ncrc.gov/lustre/ltfs/stage/$YOUR.NAME/$FILENAME \
    gsiftp://dtn-001.gaithersburg.rdhpcs.noaa.gov/gpfs/v/scratch/stmp/$FILENAME
However this job will not work until you have an X509 certificate in a permanent location such as the one suggested above.
The name of the location is not important but it must be consistent. The user $HOME/.ssh directory is a good
place to put such things because it is user read only.
|
Cronjob-like Capability
Moab has an option that allows a user to submit a job to run at a specific time. The job will be submitted to a hold state until that time is reached. Below is an example of a .csh code snippet that is used in a script that runs every day at 20:00.
For more information you can do man msub.
You will need to add something like the snippet below to each script you want resubmitted on a schedule. The issues you will need to plan for are Moab errors and job failures. Unlike with cron, the resubmitted jobs may need to be monitored by another script or include error handling code. We have some examples of Moab error handling if you need guidance.
set now = `date +%s`
@ oneday = 24 * 3600
@ next = $now + $oneday
#date -d "1970-01-01 00:00 UTC $next seconds" +%Y%m%d
@ go = `date -d "1970-01-01 00:00 UTC $next seconds" +%Y%m%d `
rm -rf /home/tlm/pptest/$delete
rm -rf /archive/Tara.McQueen/pptest/$delete
##############################################################
# Submit script for tomorrow
echo "Tomorrow's date is $go"
msub -a ${go}2000 /home/tlm/testing/freppTest.csh
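For scripts written in sh rather than csh, the same idea can be sketched as two small helpers: one that builds the start-time stamp for 20:00 tomorrow, and one that retries a failed submission. Both names (next_run_stamp, retry) are hypothetical, the retry counts are arbitrary, and GNU date is assumed, as it is in the csh snippet. This is not the Moab error handling mentioned above, only a generic pattern.

```shell
# Print the msub -a start time for 20:00 tomorrow as YYYYMMDDhhmm.
# Assumes GNU date, which the csh snippet also relies on ("date -d").
next_run_stamp() {
    printf '%s2000\n' "$(date -d tomorrow +%Y%m%d)"
}

# Run a command, retrying up to $1 times with a pause of $2 seconds
# between attempts. Returns 0 on the first success, 1 if all fail.
retry() {
    tries=$1
    pause=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$tries" ]; do
        "$@" && return 0
        attempt=$((attempt + 1))
        [ "$attempt" -le "$tries" ] && sleep "$pause"
    done
    return 1
}

# Usage (commented out so the sketch has no side effects):
#   retry 3 60 msub -a "$(next_run_stamp)" /path/to/myjob.csh
```

Retrying only covers a failed msub call. A job that is accepted and later fails still needs its own check, for example a follow-up script that looks for the expected output.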
|
Vapor to GAEA to Vapor:
from GAEA Transfers document
1) Find values for $NUM1 and $NUM2, see document on how to find it
2) First GAEA login: ssh -L$NUM1:localhost:$NUM1 -R$NUM2:localhost:22 $yourname@140.208.145.9
3) While keeping your first GAEA login active, you can access GAEA from Vapor: ssh -p $NUM1 $yourname@localhost
4) From GAEA back to Vapor: ssh -p $NUM2 $your.vapor.id@localhost
|
Batch Job Test
Run the following job:
#PBS -S /bin/sh
#PBS -o output
#PBS -N JWtest
#PBS -q batch
#PBS -l partition=c1ms
#PBS -l size=24
#PBS -l walltime=0:01:00
#PBS -j oe
cd /lustre/fs/scratch/$USER
echo hello world >hello.out
cp /lustre/fs/scratch/George.Vandenberghe/SIG flatfile
rm flatfile.gz
aprun -n 1 gzip flatfile
# check out mpi trivial job
aprun -n 4 /lustre/fs/scratch/George.Vandenberghe/bin/mpi.test.x
# use 1.9gb of memory per task
aprun -n 24 /lustre/fs/scratch/George.Vandenberghe/bin/mpi.1.9gbyte.x
This job verifies minimal node functionality. It runs a serial aprun command, then
a four-task trivial MPI job and a 24-task job that grabs 1.9 GB of memory
per task.
If submitted from your home directory or something in /lustre/fs/scratch/$yourid it should run. The submit command is: msub $job
msub /lustre/fs/scratch/George.Vandenberghe/jfirst will also work.
The job produces three outputs.
1. An "output" file in the directory you submitted from.
2. A "hello.out" file in /lustre/fs/scratch/$USER
3. A "flatfile.gz" file in /lustre/fs/scratch/$USER
The "output" file should include a line with:
size and rank 24 23
indicating that the last task (rank 23 of 24) of the second MPI executable ran.
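The checks above can be scripted. The sketch below is one possible form; check_test_job is a hypothetical name. It assumes flatfile.gz lands in the scratch directory, since the job runs gzip after changing into it, and it allows for variable spacing in the "size and rank" line.

```shell
# Check the three outputs of the test job.
#   $1: directory the job was submitted from (holds "output")
#   $2: scratch directory the job changed into
# Prints one line per problem and returns non-zero if any check fails.
check_test_job() {
    submit_dir=$1
    scratch_dir=$2
    status=0
    for f in "$submit_dir/output" "$scratch_dir/hello.out" "$scratch_dir/flatfile.gz"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    # The last rank of the 24-task job should have reported in.
    if ! grep -q 'size and rank *24 *23' "$submit_dir/output" 2>/dev/null; then
        echo "no 'size and rank 24 23' line in $submit_dir/output"
        status=1
    fi
    return "$status"
}
```

After the job finishes, a call such as check_test_job "$PWD" "/lustre/fs/scratch/$USER" from the submit directory prints nothing when all outputs are in place.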
|
|