Model Transition Team
TRAINING AND DOCUMENTATION MATERIALS:

User Meeting Slides
GAEA Wiki page
FAQs
System Details
GAEA Transfers (view in ppt or word)
Cheatsheet - 10/7/11
Hands on Training - 1/25/11
Information Session - 9/7/11
Quickstart Guide
Cray Application Developer's Environment User's Guide

NEWS AND UPDATES:

08/30/12 NCEP user directory moves. See details below.
08/8/12 View the Gaea user meeting slides from 8/6/12 here
07/24/12 (from MH IBM notes) The old Gaea c1ms machine was shut down on 07/09 so it can be upgraded (h/w & s/w); the upgraded machine will be called c1 once it is available to users again. Stability testing for c1 is scheduled to begin 07/25, with 'advance' users allowed on by 08/06 and all users allowed on by 08/20. For 30 days after that, users are required to run dual jobs on c1. C2 issue: jobs were running slower than usual, so the system was rebooted on 07/03; no further problems have been reported, but no root cause has been found (updates on the issue are on the Gaea known issues page).

Cron will be allowed on Gaea once testing on Zeus is complete & a certificate is issued. The wiki will be updated once cron is in place.

Gaea access to HPSS on Zeus is being worked - hsi & htar commands are available on remote data transfer nodes (rdtn). Issues found during testing are being worked between ORNL & CSC & the wiki will be updated once things are resolved.
02/22/12 View the GAEA wiki page here.
11/29/11 (from MH IBM notes) Data transfer to/from Gaea. Only George & Kate are currently authorized to do data transfer from Gaea to Vapor; DTN (Nwave) has set up for all users but wants one user to test the transfers before allowing the rest to start. Restricted data on Gaea: working on agreement with ORNL, but ORNL security says we can’t call it ‘restricted’ data. NOAA needs to make a restricted data agreement for Zeus too.
11/22/11 (from MH IBM notes) Data transfer to/from Gaea. George & Kate H are the only users currently authorized to do data transfer from Gaea to Vapor; DTN (Nwave) will have to authorize the rest of the users. Vapor admins installed GridFTP on Vapor (11/16) so users on Vapor can transfer to/from Gaea. Now the Gaea RDTN has to fix a certificate issue so Vapor users can have their certification accepted & finally transfer files to/from Gaea. Restricted data on Gaea: NOAA is working on an agreement with ORNL, but we'll need to call it something else because 'restricted' has a special meaning to ORNL security & we don't want them to upgrade Gaea to a higher security level with more user restrictions.
11/15/11 Gaea admins changed their firewall to allow Vapor to access Gaea's RDTN nodes, so George is getting good transfer rates now, but he's the only user currently authorized to do data transfers. NCEP will request that other users (starting with Kate) be authorized to transfer data. Under the current set-up users have to be on Gaea to transfer data to/from Vapor; Vapor admins need to install GridFTP on Vapor so users on Vapor can transfer to/from Gaea (scheduled for 11/16).
11/8/11 (from MH IBM notes) Gaea firewall issues interfering with Vapor data transfers still not resolved. Vapor will also need to replace a daemon to get transfers working properly, which will require a brief outage (length & date TBD).
11/2/11 (from MH IBM notes) Gaea data transfer. George is very close to being able to transfer data to/from Gaea; the Gaea admins have to change their firewall to allow Vapor to access Gaea's RDTN nodes, and Vapor admins have to modify their firewall to allow Gaea users to write as well as read files on Vapor. The Gaea firewall change date is TBD, but the Vapor firewall change will happen ASAP.
10/27/11 (from MH IBM notes) Data transfer to/from Gaea. CSC modified the Vapor firewall so data can move directly to/from Gaea but there is some sort of issue on the Gaea side that keeps it from working. In the meantime, George V created a special directory on Vapor where users can put non-restricted data files that will be sent to Gaea via the Jet computer at Boulder. No restricted data can be put on Gaea until NOAA has an agreement with ORNL. There will be no direct link from the CCS to Gaea, all data transfers will have to be via Vapor & Zeus.

Also, Kate Howard is the EMC POC for Gaea user support.

10/11/11 (from MH IBM notes) Data transfer to/from Gaea. CSC is working on an RFC to modify the Vapor firewall so data can move directly between Vapor & Gaea; there will be no direct link between Gaea and CCS. In the meantime, George V created a special directory on Vapor where users can put non-restricted data files that will be sent to Gaea via the Jet computer at Boulder. No restricted data can be put on Gaea until NOAA has an agreement with ORNL.
CURRENT GFDL ACTION ITEMS:

ITEM | EMC WORKED | ELEVATED TO GFDL | STATUS | RESOLUTION
NCEP access to savaged fs content | Yes | 11/11/11 | Closed | Closed 11/28 - All requested access to savaged fs content has been given
Emacs issue | Yes | 11/10/11 | Closed | Resolved in EMC
Set $NETCDF Variable | Yes | 11/3/11 | Closed | EMC working
Thread stack overflow | Yes | 10/31/11 | Closed | From GFDL - In order to increase the OpenMP stacksize for the threads, use KMP_STACKSIZE with a suggested value of 512MB.
Large mpi_gather fails | Yes | 10/28/11 | Closed | Closed 11/20 - Work around found
MPMD - Ksh executable issue on cluster nodes (click for more info) | Yes | 11/28/11 | Closed | Being worked by NCO and EMC
Upgrade svn client to v1.6.X | Yes | 11/30/11 | Open | Requested GFDL make the upgrade
New short queue | Yes | 12/15/11 | Open |
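
The thread stack overflow resolution in the table above can be applied in a job script before the executable is launched. This is a minimal sketch, assuming the executable is linked against the Intel OpenMP runtime (which reads KMP_STACKSIZE). The `m` suffix for megabytes is an assumption about how the suggested 512MB would be written, not a setting quoted from GFDL.

```shell
# Raise the per-thread OpenMP stack size to the 512 MB suggested by GFDL.
# Assumption: the Intel OpenMP runtime is in use; other runtimes may ignore KMP_STACKSIZE.
export KMP_STACKSIZE=512m

# Confirm the setting is visible to child processes (such as the model executable).
sh -c 'echo "KMP_STACKSIZE=$KMP_STACKSIZE"'
```

The export has to appear before the line that launches the threaded executable, since the runtime reads the variable at start-up.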
CURRENT EMC ACTION ITEMS:

ITEM | STATUS | RESOLUTION
How do we run numerous cpu intensive scripts? (click here to view more information) | Open |

USER TRAINING - 11/1/11:

Quick start guide, FAQs, System Details
GFDL GAEA Presentation (ppt)
Cray XE6 Architecture and Performance Top 10
User Workshop Outcome

IMPORTANT INFORMATION:

NCEP User Directory Move:

At 9:00AM EST on Wednesday, August 29, NCEP users had their FS and LTFS user directories moved into an NCEP specific folder within each of the file systems. A soft-link (symlink) was left in the place of each of the users' folders. These soft-links point to the new LTFS and FS locations. Our plan remains to remove these symlinks after a 30 day grace period, tentatively scheduled for September 26th. Users are highly encouraged to use this grace period to update their scripts to reflect the new location of their FS and LTFS scratch spaces.

Old directories for NCEP users:
    /lustre/ltfs/scratch/$USER
    /lustre/fs/scratch/$USER

New directories for NCEP users:
    /lustre/ltfs/scratch/ncep/$USER
    /lustre/fs/scratch/ncep/$USER

NCEP non-scratch directories on fs (unchanged):
    /lustre/fs/dev/ncep
    /lustre/fs/pdata/ncep
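
One way to update scripts during the grace period is a text substitution over the old scratch paths. The following is only a sketch: the function name `update_scratch_paths` and the file names in the usage note are hypothetical, and the pattern assumes scripts spell the paths out literally as listed above. Paths that already contain `ncep/` are left as they are, so running it twice is harmless, and the non-scratch `dev` and `pdata` directories are not touched.

```shell
# Rewrite old NCEP scratch paths to the new ncep/ locations.
# Reads script text on stdin and writes the updated text to stdout.
update_scratch_paths() {
    # (ncep/)? absorbs an existing "ncep/" so already-updated paths are not doubled.
    sed -E 's#/lustre/(ltfs|fs)/scratch/(ncep/)?#/lustre/\1/scratch/ncep/#g'
}

# Demonstration on two sample lines; single quotes keep $USER literal.
printf '%s\n' 'cd /lustre/fs/scratch/$USER/run' \
              'cp out.nc /lustre/ltfs/scratch/$USER/' | update_scratch_paths
```

To apply it to a script, redirect to a new file and review the result before replacing the original, for example `update_scratch_paths < myjob.sh > myjob.sh.new`.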

Gaea Update - 7/23/12

There has been a lot going on recently and I wanted to provide an update on Gaea and a status on the transition to c1.  General updates are normally provided at the User Meeting - the next one is scheduled on August 6 at 10am.  For the next User Meeting, we are planning on starting with Gaea to accommodate off-site users who are not interested in the other GFDL-specific updates.

C1MS to C1 transition:
  • Gaea has two compute partitions (c1ms and c2)
    • The two partitions had different high speed networks and processors
      • The performance was different and the runs on one partition would not match the answers from runs on the other partition
    • Currently, the c1ms is being upgraded to match the c2 architecture
      • Will then be named c1
    • If needed, users can re-run legacy experiments from the c1ms on the t1ms (a small test machine matching the c1ms architecture)
      • The t1ms will be made available for only a few months
      • When the t1ms is available, details will be sent out on how to access it
    • More information about Gaea’s compute partitions can be found at:
  • The c1ms to c1 hardware upgrade is complete
    • Hardware updates included c1 and gaea1-4
    • Important dates:
      • July 25th - Stability testing scheduled to begin
      • Aug 6th - Initial “Advance” users introduced to c1
      • Aug 20th - Scheduled general access and end of stability testing
  • Users are reminded that for the first 30 days of production, they are required to dual run jobs on c1
Scheduling on Gaea:
  • Until dual running completes, c1 and c2 will remain separate systems
    • Instructions on how to access c1 will be sent out before the general access date of the system
  • The following changes to the scheduler policy are expected to be put in place during the first Preventative Maintenance (PM) window following the end of dual running:
    • You will be able to submit to either c1 or c2 from any login node
    • In addition to the current scheduling policies, the following changes will be made to enhance job performance and job packing efficiency:
      • Jobs of 896 processors or fewer can run on c1
      • Jobs of 896 processors or more can run on c2
      • Small jobs can backfill c2
      • To take advantage of this, users will only need to add both partitions to their msub statements:  -l partition=c1,c2
  • Additional detail on the current queue policy and on these updates will be available on the gaeadocs scheduling policy page:
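
As an illustration of the partition option mentioned above, the sketch below builds the submit command and prints it rather than running it, so it can be tried off the system. The function name `msub_dry_run` and the script name `my_job.sh` are hypothetical; only `-l partition=c1,c2` comes from the policy description, and the change is not expected to take effect until after dual running ends.

```shell
# Partitions to request, as given in the expected scheduler policy.
PARTITIONS="c1,c2"

# Print the msub command line for a job script without submitting it.
msub_dry_run() {
    echo "msub -l partition=${PARTITIONS} $1"
}

msub_dry_run my_job.sh
```

On a Gaea login node the printed line could be run directly once the policy is in place; the scheduler, not the user, then decides which partition the job lands on.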
Variability on c2:
  • There have been no reports of longer than expected run times following the reboot on July 3rd
  • A root cause has not yet been identified
  • If further reports are seen, we will gather system and job data for analysis and follow our current procedure of rebooting the system 24 hours after analysis stats have been collected and analyzed
  • Updates about the variability issue are posted on the Gaea Known issues page:
HPC Reporting (compute allocation reporting)
  • A new version of hpcrpt is available on Gaea
    • The new version has two new reports: user and summary
    • The user report reports all usage of a user across all projects on Gaea
    • The summary report provides a summary of all project usage on Gaea
    • The project report now provides windfall usage by user
    • CSC is working with Adaptive Computing to generate the final versions of these scripts.  We expect that the current versions will be very similar
  • To access the new version, run the following commands on a gaea login node (gaea1-8):
    • module load hpcrpt
    • hpcrpt.new
  • Documentation of the old and new hpcrpt formats is available on the gaeadocs wiki at:
Upcoming Changes:
  • A “cron” scheduling service will be made available for Gaea
    • The current solution is being tested by NCEP
    • ORNL is evaluating the solution that is on Zeus
    • Will require a valid certificate
    • Users should not rely on cron as a scheduler.  It is being provided as a convenience.  If performance or reliability issues are encountered, the service will be suspended.
    • The wiki will be updated once a final solution is in place.
  • NESCC Archive access on Zeus
    • hsi and htar commands are available on the remote data transfer nodes (rdtn)
    • During testing, a problem with data transfers was encountered.  CSC will be working with ORNL to resolve the issue.  Due to competing priorities at CSC, it is not anticipated that this will be worked on before next week.
    • The wiki will be updated once a final solution is in place.
General Updates:
  • All of the most current information regarding gaea can be found at gaeadocs:
  • Help for all gaea issues can be reached at oar.gfdl.help@noaa.gov
  • User Meetings provide a general update and an opportunity to have a dialog about current issues.
    • The next User Meeting is on August 6 at 10am.  An announcement will be sent out with the dial-in information.
    • We are changing the format and starting with Gaea to accommodate off-site users.  If users are not interested in updates about GFDL-specific issues, they can drop off after the Gaea updates.
    • Recent User Meeting slides are posted:
  • Software requests and configuration changes requested through the help desk must be validated and go through the Configuration Management (CM) process.  The CM board meets once a week to determine when and how these changes will take place.  Generally a timeline for installation/change is provided in the help desk ticket after the change has been approved.

How to request unattended transfer setup: CCS -> Zeus & Gaea -> Zeus

    Click here for document detailing requests for unattended data transfers to Zeus

Gridftp

Cronjob-like Capability

Vapor to GAEA to Vapor:

Batch Job Test