ARC: A Root Cluster for Research into Scalable Computer Systems

Official Announcement of the ARC Cluster (local copy)
NCSU write-up on the ARC Cluster (local copy)
TechNewsDaily story (local copy)


Main system funded in part by NSF through CRI grant #0958311

Cooling door equipment and installation funded by NCSU CSC; GPUs funded in part by a grant from NCSU ETF funds, and by NVIDIA and HP donations


Old ARC Cluster

Looking for the documentation of the old ARC Cluster?

READ THIS BEFORE YOU RUN ANYTHING: Running Jobs (via Slurm)

  • Once logged in to ARC, immediately obtain access to a compute node (interactively) or schedule batch jobs as shown below. Do not execute any other commands on the entry node!
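    Typical Slurm commands look like the following sketch; node counts, host names, and the job script name are placeholders, so adjust them to your needs.

      srun -N 1 -n 16 --pty bash        # interactive shell on one compute node
      srun -N 2 -w cXXX,cYYY --pty bash # interactive shell, requesting specific nodes
      sbatch myjob.sh                   # submit a batch script (myjob.sh is your own script)
      squeue -u $USER                   # check your pending/running jobs
      scancel JOBID                     # cancel a job by its job ID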


  • Hardware

    1728 cores on 108 compute nodes integrated by Advanced HPC. All machines are 2-way SMPs with either AMD or Intel processors (see below).
    Nodes:







    Networking, Power and Cooling:

    Pictures


    System Status


    Software

    All software is 64 bit unless marked otherwise.

    Obtaining an Account


    Accessing the Cluster


    Using OpenMP (via Gcc)
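
    A minimal sketch of compiling and running an OpenMP program with gcc; the source file name and thread count are placeholders.

      gcc -fopenmp -O2 -o omp_hello omp_hello.c   # omp_hello.c is a placeholder source file
      export OMP_NUM_THREADS=16                   # pick the number of threads
      srun -N 1 --pty ./omp_hello                 # run on a compute node, never on the entry node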


    Running CUDA Programs (Version 8.0)
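
    A hedged sketch of compiling and running a CUDA program; the module name matches the module list further below, and the source file name is a placeholder.

      module load cuda              # if nvcc is not already in your PATH
      nvcc -O2 -o vecadd vecadd.cu  # vecadd.cu is a placeholder source file
      srun -N 1 --pty ./vecadd      # run on a GPU-equipped compute node
      nvidia-smi                    # inspect the GPUs on that node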


    Running MPI Programs with MVAPICH2 and Gcc (Default)
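
    A hedged sketch with the default MVAPICH2 stack; module and file names are examples, and whether ranks are launched via srun or mpirun depends on how the library was built, so consult the linked instructions.

      module load gnu mvapich2            # load the GNU + MVAPICH2 toolchain if not already active
      mpicc -O2 -o mpi_hello mpi_hello.c  # mpi_hello.c is a placeholder source file
      srun -N 2 -n 4 ./mpi_hello          # 4 ranks across 2 nodes (if MVAPICH2 was built with Slurm support)
      # alternatively, launch with mpirun and a host file as described in the linked instructions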


    Running MPI Programs with Open MPI and Gcc (Alternative)
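
    To use Open MPI instead, the mutually exclusive MPI modules can be swapped as in the modules recipe further below; a hedged sketch (file and host file names are placeholders):

      module switch mvapich2 openmpi               # swap the MPI stack
      mpicc -O2 -o mpi_hello mpi_hello.c           # mpi_hello.c is a placeholder source file
      mpirun -np 4 -hostfile myhosts ./mpi_hello   # myhosts lists the nodes obtained via srun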


    Using the PGI compilers V16.7

    (includes OpenMP and CUDA support via pragmas, even for Fortran)
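
    The pragma-based GPU support is typically used through OpenACC directives; a hedged sketch with PGI 16.7 (source file names are placeholders):

      module load pgi                          # PGI 16.7 compilers
      pgcc -acc -Minfo=accel -o saxpy saxpy.c  # saxpy.c is a placeholder containing OpenACC pragmas
      pgfortran -mp -o omp_prog omp_prog.f90   # -mp enables OpenMP for Fortran
      srun -N 1 --pty ./saxpy                  # run on a GPU-equipped compute node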


    Dynamic Voltage and Frequency Scaling (DVFS)
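
    A hedged sketch of inspecting per-core frequency settings through the standard Linux cpufreq sysfs interface; changing governors or frequencies typically requires root privileges.

      cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor               # current governor on core 0
      cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies  # selectable frequencies
      # setting a fixed frequency usually requires root and the userspace governor, e.g.:
      # echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
      # echo 1600000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed  # value in kHz, example only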


    Power monitoring

    Sets of three compute nodes share a power meter; in such a set, the lowest-numbered node has the meter attached (either on the serial port or via USB). In addition, two individual compute nodes (with different GPUs) have their own power meters. See this power wiring diagram to identify which nodes belong to a set; the diagram also indicates whether a meter uses serial or USB for a given node. We recommend explicitly requesting a reservation for all nodes in a monitored set (see the srun commands with the host name option; an example follows below). Monitoring at 1 Hz is accomplished with the following software tools (on the respective nodes where meters are attached):
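
    For example, to reserve a complete monitored set of hypothetical nodes cXXX, cYYY, and cZZZ (take the real names from the wiring diagram):

      srun -N 3 -w cXXX,cYYY,cZZZ --pty bash   # reserve all three nodes sharing one meter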

    Virtualization with LXD (optionally with X11, VirtualBox, Docker inside)

    Container virtualization support is realized via LXD. Please try to use CentOS images: they take much less space than other images because only the differences from the host image need to be stored in the container.

    Notice: Images are installed locally on the node you are running on. If you need identical images on multiple nodes, write a script that creates the image from scratch; you cannot simply copy images because they reside in a protected directory.
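
    A hedged sketch of typical LXD usage from a compute node; the container name and image alias are placeholders, and the image aliases actually available on ARC may differ.

      lxc launch images:centos/7 mybox     # create and start a CentOS container
      lxc list                             # show your containers and their addresses
      lxc exec mybox -- /bin/bash          # open a shell inside the container
      lxc stop mybox && lxc delete mybox   # clean up when done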

    X11 inside LXD:

    VirtualBox inside LXD (requires X11, see above):

    Docker inside of LXD:


    Virtualization with KVM (pending)

    Virtualization support is realized via KVM.

    Follow instructions for VM creation and see the MAC guidelines for network connectivity.

    Lustre


    PVFS2


    PAPI
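
    A hedged sketch of checking available hardware counters and linking against PAPI; the PAPI_INC/PAPI_LIB variable names are assumed to follow the generic module recipe further below, so verify them with export | grep PAPI.

      module load papi
      papi_avail | less                                       # list hardware events supported on this node
      gcc -I${PAPI_INC} -L${PAPI_LIB} -o prof prof.c -lpapi   # prof.c is a placeholder using the PAPI API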


    likwid (pending)


    Hadoop Map-Reduce and Spark

    Simple setup of multi-node Hadoop map-reduce with HDFS; see also the free AWS setup as an alternative and the original single-node and cluster setup, but follow the instructions below for ARC. Other components, e.g., YARN, can be added to the setup below as well (not covered). We'll set up a Hadoop instance with nodes cXXX and cYYY (optionally more), so you should have obtained at least 2 nodes with srun. To get rid of ssh errors, you can add a secondary node server and other optional services; this is optional, not required.
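
    Once HDFS is configured on your nodes per those instructions, a typical sanity check looks like the following sketch (standard Hadoop commands; the file name is a placeholder):

      hdfs namenode -format                 # format HDFS once, on the name node only
      start-dfs.sh                          # start the HDFS daemons
      hdfs dfsadmin -report                 # verify that the data nodes registered
      hdfs dfs -mkdir -p /user/$USER        # create your HDFS home directory
      hdfs dfs -put input.txt /user/$USER/  # copy a local file into HDFS (input.txt is a placeholder)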

    You can also run Spark on top of Hadoop as follows, which will also default to the HDFS file system:

    export SPARK_DIST_CLASSPATH=$(hadoop classpath)  # make the Hadoop/HDFS jars visible to Spark
    export SPARK_HOME=/usr/local/spark               # Spark installation directory
    export PATH="$PATH:$SPARK_HOME/bin"              # add the Spark tools to the PATH
    run-example SparkPi 10                           # run the bundled SparkPi example with 10 slices
    

    Tensorflow (1.0.1)
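
    A quick, hedged sanity check that the installed TensorFlow 1.0.1 can be imported and run; execute it on a compute node via srun, not on the entry node, and substitute python3 if that is how TensorFlow was installed.

      python -c "import tensorflow as tf; print(tf.__version__)"
      python -c "import tensorflow as tf; s = tf.Session(); print(s.run(tf.constant('hello from tensorflow')))"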


    Other Packages

    A number of packages have been installed; please check their location (via rpm -ql pkg-name) and documentation (see the URLs) in this PDF if you need them. (Note that only the mvapich2/openmpi/gnu variants are installed.) Typically, you can get access to them via:
      module avail # show which modules are available
      module load X
      export |grep X #shows what has been defined
      gcc/mpicc -I${X_INC} -L${X_LIB} -lX #for a library (the actual library name may vary)
      ./X #for a tool/program, may be some variant of 'X' depending on toolkit
      module switch X Y #for mutually exclusive modules if X is already loaded
      module unload X
      module info #learn how to use modules
    
    Current list of available modules:
      -------------------------- /opt/ohpc/pub/moduledeps/gnu-mvapich2 --------------------------
       adios/1.10.0    mpiP/3.4.1              petsc/3.7.0        scorep/3.0
       boost/1.61.0    mumps/5.0.2             phdf5/1.8.17       sionlib/1.7.0
       fftw/3.3.4      netcdf/4.4.1            scalapack/2.0.2    superlu_dist/4.2
       hypre/2.10.1    netcdf-cxx/4.2.1        scalasca/2.3.1     tau/2.26
       imb/4.1         netcdf-fortran/4.4.4    scipy/0.18.0       trilinos/12.6.4
    
    ------------------------------ /opt/ohpc/pub/moduledeps/gnu -------------------------------
       R_base/3.3.1    metis/5.1.0         ocr/1.0.1          pdtoolkit/3.22
       gsl/2.2.1       mvapich2/2.2        openblas/0.2.19    superlu/5.2.1
       hdf5/1.8.17     numpy/1.11.1        openmpi/1.10.4
    
    -------------------------------- /opt/ohpc/pub/modulefiles --------------------------------
       EasyBuild/2.9.0        java                       pgi/2016
       PrgEnv-pgi/16.7        ohpc                       pgi64/16.7           
       autotools              openmpi/1.10.2/2016        pgi64/2016
       cuda                   papi/5.4.3                 prun/1.1           
       gnu/5.4.0              pgi/16.7                   valgrind/3.11.0
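
    As a concrete instance of the recipe above, a hedged example using the fftw module from the gnu-mvapich2 stack; the FFTW_INC/FFTW_LIB variable names are assumed to follow the generic ${X_INC}/${X_LIB} pattern, so verify them with export | grep FFTW.

      module load fftw                                                 # requires the gnu and mvapich2 modules to be loaded
      export | grep FFTW                                               # see which variables the module defines
      gcc -I${FFTW_INC} -L${FFTW_LIB} -o fft_test fft_test.c -lfftw3   # fft_test.c is a placeholder source file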
    

    Advanced topics (pending)

    For all other topics, access is restricted. Request a root password. Also, read this documentation, which is only accessible from selected NCSU labs.

    This applies to:


    Known Problems

    Consult the FAQ. If this does not help, then please report your problem.

    References:

    Additional references: