Introduction

We are developing a cluster for local ATLAS computing using the TACC Rodeo system to boot virtual machines.  If you just want to use the system, see the next few sections and ignore the rest (the later sections describe the virtual machine setup and are a bit out of date as of Sep 2015).

Transferring data from external sources

The Tier-3 nodes do not directly connect to any storage space.  We can access files via the xrootd protocol from the /data disk that is mounted by all the workstations and utatlas.its.utexas.edu (see below).  So files must first be transferred to the tau workstations or to utatlas.its.utexas.edu.  Methods include the following (example commands are sketched after the list):

  • Rucio download for Grid datasets
  • xrootd copy for files on CERN EOS/ATLAS Connect FaxBox/ATLAS FAX (Federated XrootD)
  • Globus Connect for files on ATLAS Connect FaxBox, TACC, or CERN
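
For example, the first two methods look roughly like this when run on a tau workstation or on utatlas.its.utexas.edu (a sketch only: the dataset and file names are placeholders, and you need a valid grid proxy plus the relevant clients set up, e.g. via lsetup):

# Rucio download of a Grid dataset (dataset name is a placeholder)
rucio download user.someone:user.someone.mydataset_v1

# xrootd copy of a single file from CERN EOS (endpoint and paths are illustrative)
xrdcp root://eosatlas.cern.ch//eos/atlas/user/s/someone/myfile.root /data/someone/myfile.root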

 

Getting started with Bosco

The Tier-3 uses utatlas.its.utexas.edu as its submission host - this is where the Condor scheduler lives.  However, you do not normally submit jobs on that machine directly; instead you submit from the tau* workstations through Bosco.

Bosco is a job submission manager designed to manage job submissions across different resources.  It is needed to submit jobs from our workstations to the Tier-3.

Make sure you have an account on our local machine utatlas.its.utexas.edu, and that you have passwordless ssh set up to it from the tau* machines.

To do this, create an RSA key pair and copy your .ssh folder onto the tau machine using scp.
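
One common way to set this up from a tau* workstation is sketched below (USERNAME is a placeholder for your account name on utatlas.its.utexas.edu):

# create an RSA key pair if you do not already have one
# (an empty passphrase, or ssh-agent, is needed for truly passwordless logins)
ssh-keygen -t rsa
# install the public key on the submission host (you will type your password one last time)
ssh-copy-id USERNAME@utatlas.its.utexas.edu
# verify that no password is requested
ssh USERNAME@utatlas.its.utexas.edu hostname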

Then run the following commands on any of the tau* workstations:

cd ~
curl -o bosco_quickstart.tar.gz ftp://ftp.cs.wisc.edu/condor/bosco/1.2/bosco_quickstart.tar.gz
tar xvzf ./bosco_quickstart.tar.gz
./bosco_quickstart

This will ask if you would like to install.  Select y and continue.

Bosco Quickstart
Detailed logging of this run is in ./bosco_quickstart.log

************** Starting Bosco: ***********
BOSCO Started
************** Connect one cluster (resource) to BOSCO: ***********
At any time hit [CTRL+C] to interrupt.

Type the submit host name for the BOSCO resource and press [ENTER]: 
No default, please type the name and press [ENTER]: utatlas.its.utexas.edu
Type your username on utatlas.its.utexas.edu (default USERNAME) and press [ENTER]: 
Type the queue manager for utatlas.its.utexas.edu (pbs, condor, lsf, sge, slurm) and press [ENTER]: condor
Connecting utatlas.its.utexas.edu, user: USERNAME, queue manager: condor

This may take some time to configure and test.  When it finishes, run:

source ~/bosco/bosco_setenv

Then you will be able to submit jobs as if you were running Condor!
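
As a quick check that the setup works, you can submit a simple test job and watch the queue (mytest.sub here is a placeholder; the Tier-3 Tips section below describes what a submission file needs to contain):

source ~/bosco/bosco_setenv
condor_submit mytest.sub
condor_q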

To allow more than ten jobs to be submitted at once (through ATLAS Connect you have access to hundreds of job slots at other institutions), edit the file ~/bosco/local.bosco/condor_config.local, changing the last line to the maximum number of simultaneously submitted jobs:

GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE=200
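
If Bosco is already running when you edit this file, the new limit may not be picked up until the daemons re-read their configuration; one way to force this (a suggestion rather than a required step) is simply to restart Bosco:

bosco_stop
bosco_start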

Lastly, here is a more detailed guide to Bosco.

Tier-3 Tips

Here are some useful tips for running with the Cloud Tier-3:

  • The worker nodes do not mount any of our network disks. This is partly for simplicity and robustness, and partly for security reasons.  Because of this your job must transfer files either using the Condor file transfer mechanism (recommended for code and output data) or using the xrootd door on utatlas.its.utexas.edu (which gives read access to /data, through the URL root://utatlas.its.utexas.edu://data/...; recommended for input datasets).  Although this may seem somewhat unfortunate, it's actually a benefit, because any submitted job that runs properly on the Cloud Tier-3 can therefore be flocked to other sites, which obviously don't mount our filesystem, without being rewritten (see Overflowing to ATLAS Connect below).  
  • You must make your data world-readable to be visible through the xrootd door, because the server daemon runs as a very unprivileged user. The command is "chmod -R o+rX ." in the top-level directory above your data (this will fix subdirectories to be world-listable and the files to be world-readable).
  • You must submit jobs in the "grid" universe (again, to enable proper flocking).  In other words, 

    grid_resource = batch condor ${USER}@utatlas.its.utexas.edu

    in your Condor submission file (replace ${USER} with your username).  A fuller example submission file is sketched after this list.

  • The worker nodes have full ATLAS CVMFS.
  • One common problem is having jobs go into the Held state with no logfiles or other explanation of what's going on. Running condor_q -long <jobid> will give "Job held remotely with no hold reason."  By far (>99.9%) the most common cause of this is that a file is requested to be transferred back through the file transfer mechanism in the submission file, but is not produced in the job. That is usually caused by the job failing (unable to read input data, crash of code, etc.).  Unfortunately you won't have the logfile, so the easiest way to debug this is to resubmit the job but without the output file transfer specified in the submission script. (This is a very unfortunate and nasty feature of Bosco.)
  • You can request multiple cores for your job, by specifying

    +remote_SMPGranularity = 8
    +remote_NodeNumber = 8

    (for example, if you want 8 cores) in your submission script.
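
Putting these tips together, a submission file might look roughly like the sketch below.  All of the names (myuser, run_analysis.sh, analysis.tar.gz, the input URL, output.root) are placeholders, not a tested template:

# sketch of a Cloud Tier-3 submission file; all file names and the username are placeholders
universe                = grid
grid_resource           = batch condor myuser@utatlas.its.utexas.edu

executable              = run_analysis.sh
# read the input directly through the xrootd door rather than from a mounted disk
arguments               = root://utatlas.its.utexas.edu://data/myuser/input.root

# ship code in and results back with the Condor file transfer mechanism
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = analysis.tar.gz
transfer_output_files   = output.root

# optional: request 8 cores
+remote_SMPGranularity  = 8
+remote_NodeNumber      = 8

output = job.$(Cluster).$(Process).out
error  = job.$(Cluster).$(Process).err
log    = job.$(Cluster).$(Process).log

queue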

Overflowing to ATLAS Connect

The Cloud Tier 3 is enabled to overflow user jobs to the ATLAS Connect system when not enough slots are available locally. ATLAS Connect allows multiple sites (Chicago, Illinois, Indiana, Fresno State, and us) to "flock" jobs to each other when necessary, achieving better CPU utilization and job throughput.

Due to the "disconnected" nature of our worker nodes (no mounting of our filesystems), essentially all jobs can flock from our Tier-3 to other sites. In fact the default behavior is to flock the jobs if necessary, and this is usually completely transparent. If you need to forbid jobs from flocking (pinning them to our systems), you can do this by adding the following to your Condor submission script: 

Requirements = ( IS_RCC =?= undefined )

A snapshot of ATLAS Connect status can be seen at this link. "UTexas" shows the number of outside jobs executing in our Tier-3, while "Tier3Connect UTexas" shows activity we induce on other sites.

VM configuration

Our virtual machines are CentOS 6 instances configured with CVMFS for importing the ATLAS software stack from CERN.  They also run their own instances of the Condor job scheduling system.  They access the same Squid HTTP caching server that our local workstations use (on utatlas.its.utexas.edu), which helps reduce the network traffic required for CVMFS and for database access via the Frontier system.
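
For reference, the client side of that setup corresponds to CVMFS configuration roughly along these lines (a sketch only; the real values are baked into the images, and the proxy port shown is an assumed default):

# /etc/cvmfs/default.local (sketch; port 3128 is an assumption)
CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch
CVMFS_HTTP_PROXY="http://utatlas.its.utexas.edu:3128"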

Booting a VM on Nimbus with scratch disk

We need to use tools other than the standard cloud-client.sh provided by Nimbus; instead we use a slightly modified vm-helpers.  Untar the file vm-helpers.tgz in your nimbus-cloud-client directory (it will place four files in bin/).  Now you can run, e.g.:

cd nimbus-cloud-client-21

grid-proxy-init -cert conf/usercert.pem -key conf/userkey.pem

bin/vm-run --cloud master1.futuregrid.tacc.utexas.edu:8443 --image "cumulus://master1.futuregrid.tacc.utexas.edu:8888/Repo/VMS/71a4ea3e-07e4-11e2-a3b7-02215ecdcdaf/centos-5.7-x64-clusterbase.kvm" --blank-space 500000

which will give you a 500 GB scratch space. Our images will mount this under /scratch.

Building Image Using Boxgrinder

Ensure that Boxgrinder is present on the VM from which you are building.  If this is a temporary image, you will likely need to copy over (or fetch with git) the conf files and the appliance definition (.appl).  Boxgrinder can then be run with:

boxgrinder-build definition.appl -d local

Boxgrinder options include:

-f                          #Remove previous build for this image
--os-config format:qcow2    #Build image with a qcow2 disk
-p                          #Specify the target platform (VMware, KVM, Player, etc.)
-d local                    #Deliver to a local directory
--debug                     #Print debug info while building
--trace                     #Print trace info while building
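
For orientation, an appliance definition is a short YAML file roughly along these lines (a sketch; the name, OS version, partition size, and package list are illustrative and not our actual definition):

name: example-worker              # placeholder appliance name
summary: Example CentOS worker-node appliance
os:
  name: centos
  version: 6
hardware:
  partitions:
    "/":
      size: 10                    # root partition size in GB
packages:
  - wget
  - openssh-clients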

Creating OpenStack Nodes

cp ......./openstack
cp ......./openstack-share
source openstack/ec2rc.sh
cd openstack-share 
./boot-instances

Accessing OpenStack Nodes

ssh username@alamo.futuregrid.org

Visit the list of instances to see which nodes are running, then simply

ssh root@10.XXX.X.XX

and you are now accessing a node!
