
Tuesday - March 23, 2004

Compilers, Message Passing and Optimization for POWER4 and HPS (Federation) Systems

Roch Archambault, IBM
Charles Grassl, IBM
Pascal Vezolle, IBM

Preliminary Outline:
- Macro architecture and its resources
- Compiler and its strategies
- POWER4-specific optimization techniques
- Message passing usage and techniques

The pSeries models 630, 655, 670, and 690 ("Regatta") use the POWER4 and POWER4+ processors. Though the processor architecture and instruction set have changed little from the POWER3, the overall system design is quite different from Winterhawk- and Nighthawk-based systems.
The memory hierarchy of the POWER4 systems has changed: there is now a third level of cache, and memory itself is no longer "flat", or "uniform". Additionally, the POWER4 systems offer two memory page sizes. We now have the concepts of "memory affinity" and of large versus small memory pages. These features have ramifications for performance programming.
The POWER4 processors are architecturally compatible with POWER3 processors, but some features differ. The main features of concern are the depths of the functional units, the number of rename registers, the number of pre-fetch queues, and the size of the level 2 cache. These features are largely managed by the compilers, but it is useful for programmers to be aware of the compiler strategies and techniques.
pSeries systems now offer a new network switch called the High Performance Switch (HPS). (This product was previously labeled "Federation".) This switch has nominally four times the bandwidth of the previous switch. The MPI message passing latency for this switch is less than 10 microseconds.
In this tutorial, we will describe the latest C and Fortran compilers and their features relevant to the POWER4 processor. We will also describe the performance optimization facilities available in the C and Fortran compilers and the most effective tactics for leveraging them.
We will follow this with an overview of optimization techniques which will exploit the features of the POWER4 processors. We will also discuss the use, exploitation and ramifications of memory affinity and of large memory pages.
We will also discuss message passing using the new HPS switch. The new switch has slightly different tuning characteristics from previous pSeries switches, and this tuning involves several new environment variables.
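
As a sketch of what such tuning looks like, the variable names below follow IBM Parallel Environment (POE) conventions; the exact names, defaults, and suitable values should be verified against the PE documentation for your release, and the numbers shown are illustrative only:

```shell
# Illustrative HPS/Federation tuning via POE environment variables.
# Verify variable names and defaults against your Parallel Environment release.
export MP_EAGER_LIMIT=65536        # eager/rendezvous protocol crossover (bytes)
export MP_USE_BULK_XFER=yes        # enable bulk (RDMA) transfer on the switch
export MP_BULK_MIN_MSG_SIZE=65536  # minimum message size for bulk transfer
```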

High Performance Computing in Computational Chemistry

- Introduction & Overview, Sigismondo Boschi (1 hour)

    - Introduction to the tutorial
    - Performance in computational chemistry codes
    - Computational Chemistry Research at CINECA on IBM Systems

- Electronic structure applications (2 hours)

    - Gaussian, C.Sosa and GAMESS, A.Rossi
    - NWChem, S.Boschi

- Molecular Dynamics applications (1 hour)

    - CHARMM and GROMACS, A.Rossi
    - NAMD, A.Rossi, J.Hein
    - AMBER, C.Sosa

- Car-Parrinello applications (2 hours)

    - CPMD, A.Curioni
    - PWscf, C.Cavazzoni

Research laboratories and academic institutions routinely use computational chemistry applications to help solve important problems in all areas of chemistry, including the discovery of new and important drugs for current and future generations.

This tutorial provides an overview of some of the most popular Computational Chemistry applications from the point of view of running on IBM systems. Detailed discussions of electronic structure, molecular dynamics, Car-Parrinello applications as well as other Computational Chemistry applications used by researchers will be presented.

The discussion will touch upon many well-known computational chemistry applications, including Gaussian, NWChem, GAMESS, CHARMM, GROMACS, NAMD and AMBER, as well as Car-Parrinello applications (PWscf, CPMD).

Wednesday - March 24, 2004

Product Overview and Hardware Update

Barry Bolding, Manager, WW Deep Computing Technical Support

This presentation will describe the current HPC products from IBM running AIX and Linux. It will include positioning of the products in the marketplace and a discussion of future HPC directions at IBM.

The PERCS Project at IBM

Balaram Sinharoy, IBM

The PERCS project is IBM's response to the U.S. government HPCS initiative. The intent of this effort is to stimulate research into computing systems having overall productivity as their primary goal. The IBM vision for this project involves architectural features that are able to dynamically adapt to the needs of applications, yielding better performance for a broader range of applications. This talk will discuss IBM's plans for such an architecture and the impact it will have on the productivity problems facing the HPC community.


Analysis of AIX traces with Paraver

Judit Gimenez, Jesus Labarta, CEPBA-UPC
Terry Jones, LLNL

The AIX trace facility from IBM allows one to collect very low-level information on OS scheduling, system calls, resource allocation, system daemon activity, and more.

Under a collaboration contract between LLNL and CEPBA-UPC, we have developed a translator from the AIX trace format to Paraver. We are now able to use the flexibility and analysis power of Paraver to analyze the low-level detail captured by the AIX trace facility.

This talk will present the aix2prv translator, the kinds of information and views that can be obtained, and some example studies, such as analyzing the influence and impact of system interrupts on fine-grain parallel applications or discovering details of the MPI internals.

DPOMP: An Infrastructure for Performance Monitoring of OpenMP Applications

Bernd Mohr, Forschungszentrum Juelich, Germany
Luiz DeRose, IBM Research, ACTC

OpenMP provides a higher level specification for users to write threaded programs, and has emerged as the standard for shared memory parallel programming, allowing users to write applications that are portable across most shared memory multiprocessors. However, application developers still face a large number of application performance problems, which make it harder to achieve high performance on these systems. Moreover, these problems are difficult to detect without the help of performance tools.

Unlike MPI, which includes a standard monitoring interface (PMPI), OpenMP does not yet provide a standardized performance monitoring interface. In order to simplify the design and implementation of portable OpenMP performance tools, Mohr et al. [1] proposed POMP, a performance monitoring interface for OpenMP. This proposal extends experiences from previous implementations of monitoring interfaces for OpenMP [2][3][4].

In this talk we present DPOMP, a POMP instrumentation infrastructure based on dynamic probes. This implementation, which is built on top of DPCL, is the first implementation based on binary modification, instead of a compiler- or pre-processor-based one. The advantage of this approach lies in its ability to modify the binary with performance instrumentation, without requiring access to the source code or re-compilation, whenever a new set of instrumentation is required. This is in contrast to the most common instrumentation approach, which augments source code statically with calls to specific instrumentation libraries. In addition, since it relies only on the binary, this POMP implementation is programming-language independent. DPOMP takes as input an OpenMP application binary and a POMP-compliant performance monitoring library. It reads the application binary, as well as the binary of the POMP library, and instruments the application binary so that, at locations which represent events in the POMP execution model, the corresponding POMP monitoring routines are called. From the user's point of view, the amount of instrumentation can be controlled through environment variables which describe the level of instrumentation for each group of OpenMP events, as proposed by the POMP specification. From the tool builder's point of view, instrumentation can also be controlled by the set of POMP routines provided by the library, i.e., instrumentation is only applied to those events that have a corresponding POMP routine in the library. In addition, DPOMP supports instrumentation of user functions, as well as MPI functions.

In the presentation, we will first briefly describe the main DPCL features, as well as the IBM compiler and run-time library issues that make our dynamic instrumentation tool for POMP possible. Then, we will show how users can build their own performance monitoring libraries, and we will present two POMP-compliant libraries, POMPROF and the KOJAK POMP library, which provide, respectively, profiling and tracing functionality for OpenMP applications.

[1] B. Mohr, A. Malony, H-C. Hoppe, F. Schlimbach, G. Haab, and S. Shah. A Performance Monitoring Interface for OpenMP. In Proceedings of the Fourth European Workshop on OpenMP - EWOMP'02, September 2002.

[2] E. Ayguadé, M. Brorsson, H. Brunst, H.-C. Hoppe, S. Karlsson, X. Martorell, W. E. Nagel, F. Schlimbach, G. Utrera, and M. Winkler. OpenMP Performance Analysis Approach in the INTONE Project. In Proceedings of the Third European Workshop on OpenMP - EWOMP'01, September 2001.

[3] S. W. Kim, B. Kuhn, M. Voss, H.-Ch. Hoppe, and W. Nagel. VGV: Supporting Performance Analysis of Object-Oriented Mixed MPI/OpenMP Parallel Applications. In Proceedings of the International Parallel and Distributed Processing Symposium, April 2002.

[4] B. Mohr, A. Malony, S. Shende, and F. Wolf. Towards a Performance Tool Interface for OpenMP: An Approach Based on Directive Rewriting. In Proceedings of the Third European Workshop on OpenMP - EWOMP'01, September 2001.

Mixed mode programming on a clustered p690 system

Lorna Smith, Mark Bull, Jake Duthie, EPCC, The University of Edinburgh

Clustered SMP systems such as HPCx have become prominent in the HPC market, with manufacturers clustering SMP systems together to go beyond the limits of a single SMP system.

Message passing codes written in MPI are obviously portable and should transfer easily to these systems. However, while message passing is required to communicate between boxes, it is not immediately clear that this is the most efficient parallelisation technique within an SMP box. In theory a shared memory model such as OpenMP should offer a more efficient parallelisation strategy within an SMP box. Hence a combination of shared memory and message passing parallelisation paradigms within the same application (mixed mode programming) may provide a more efficient parallelisation strategy than pure MPI.

In this talk we investigate the potential benefits of mixed mode programming for our large p690 cluster. We consider a range of pure MPI, OpenMP and mixed MPI/OpenMP implementations of a simple Jacobi algorithm and a subset of the ASCI Purple benchmarks, and compare and contrast their performance.
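
For illustration, the model problem can be sketched serially (a minimal Python sketch of a plain Jacobi sweep; the talk's implementations are Fortran/C with MPI and OpenMP, which this sketch does not attempt to reproduce):

```python
# Minimal serial sketch of the Jacobi iteration used as a model problem.
# In the hybrid codes compared in the talk, the grid would be
# domain-decomposed with MPI and the sweeps threaded with OpenMP.
def jacobi_step(grid):
    """One Jacobi sweep: each interior point becomes the average of its
    four neighbours. Returns the new grid and the maximum change."""
    n = len(grid)
    new = [row[:] for row in grid]   # boundaries are preserved
    diff = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                grid[i][j-1] + grid[i][j+1])
            diff = max(diff, abs(new[i][j] - grid[i][j]))
    return new, diff

def solve(n=16, tol=1e-6, max_iter=10000):
    # Boundary condition: top edge held at 1.0, everything else at 0.0.
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [1.0] * n
    for _ in range(max_iter):
        grid, diff = jacobi_step(grid)
        if diff < tol:
            break
    return grid
```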

Nested parallelism in the drift-diffusion model for semiconductor devices

Sergio Rovida, G. Gazzaniga, P. Pietra, G. Sacchi, IMATI - CNR PAVIA
P. Lanucara, CASPUR Roma

We present the parallelization of a serial numerical code modeling semiconductor devices, such as the pn-diode and the pnp-transistor.

The implemented drift-diffusion model, in the stationary 2D case and after a suitable scaling, consists of a coupled system of three partial differential equations in the electrostatic potential and the charge densities p and n.

For its numerical solution, a modification of the Gummel method has been chosen. It can be regarded as an approximate Newton method and has the advantage of decoupling, at each step, the originally strictly coupled system. To guarantee stability, the procedure is coupled with a continuation in the Debye length parameter.

The sparse linear systems arising from the discretization of the differential equations are solved by means of the restarted GMRES method.

The code is written in Fortran, using the BLAS and NAG libraries.

The Gummel iterative scheme exhibits an intrinsic 2-way parallelism, due to the independence of the equations for p and n.
The natural approach to improving the performance of the procedure is therefore to distribute the computation of these steps among different processors.
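
The 2-way task parallelism can be sketched generically (a minimal Python illustration; `update_p` and `update_n` are hypothetical stand-ins for the real PDE solves, which the abstract does not detail):

```python
# Sketch of the Gummel scheme's intrinsic 2-way parallelism: within one
# step, the updates for the hole density p and the electron density n
# are independent, so they can be dispatched to different workers.
# update_p/update_n are hypothetical stand-ins for the real solves.
from concurrent.futures import ThreadPoolExecutor

def update_p(potential, p):
    return [0.5 * (x + v) for x, v in zip(p, potential)]

def update_n(potential, n):
    return [0.5 * (x - v) for x, v in zip(n, potential)]

def gummel_step(potential, p, n):
    with ThreadPoolExecutor(max_workers=2) as pool:
        fp = pool.submit(update_p, potential, p)  # worker 1
        fn = pool.submit(update_n, potential, n)  # worker 2
        return fp.result(), fn.result()
```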

The OpenMP parallel programming model could be used, without any expensive re-engineering of the software, by adding the parallel sections directive to the original code. The key strengths of OpenMP are its portability and its good efficiency on most shared memory platforms and compilers.

We suggest another parallelization strategy, based on the MPI message passing paradigm to manage interprocessor communication, in order to guarantee portability on distributed memory architectures.

The test problem considered is the simulation of a 2D silicon pn-diode.

Some preliminary experiments have been carried out on a single SMP node of the IBM SP3, using the NAG SMP Library.

The performance obtained can be considered satisfactory, taking into account the simplicity of the parallelization, which is based solely on the intrinsic parallelism of the numerical procedure.

On the other hand, profiling of the code shows that more than 90% of the execution time is spent in the NAG routines implementing the GMRES solver, so it is reasonable to experiment with the multi-threaded version of these routines.

The first experiments with this nested parallelism, based on mixing MPI with the shared memory parallelism of the NAG SMP Library, give quite satisfactory results and could be even more efficient for larger problems.

Further experiments, using the IBM SP4 installed at CINECA and an HP cluster of ES45 machines located at CASPUR, are in progress.

HPC-Europa and DEISA: Advanced Infrastructures for Computational Sciences in Europe

Giovanni Erbacci, CINECA

This talk will present the structure and objectives of two projects recently funded by the European Commission under the Sixth Framework Programme - Support for Research Infrastructures action.
HPC-Europa is based on a consortium of leading HPC Infrastructures in Europe aiming at the integrated provision of advanced computational services to the European research community working at the forefront of sciences.
The services will be delivered across a wide spectrum, both in terms of access to HPC infrastructures and in the provision of a suitable computational environment, to allow European researchers to remain competitive with teams elsewhere in the world. Moreover, specific Research and Networking Activities will contribute to driving new advances in HPC.
DEISA is a consortium of leading national supercomputing centres in Europe aiming to jointly build and operate a distributed tera-scale supercomputing facility. This objective will be attained by a deep integration - using modern grid technologies - of the high end national HPC infrastructures that are currently being deployed.
Both HPC-Europa and DEISA will better integrate and structure the way HPC infrastructures operate in Europe, and thus lead to the creation of a true European Research Area, without frontiers for the mobility of scientists, knowledge and technologies.

JUMP: Europe's new most powerful supercomputer

Bernd Mohr, Forschungszentrum Juelich, NIC/ZAM, Germany

The John von Neumann Institute for Computing (NIC) of Research Centre Juelich, Germany's largest national laboratory, is currently installing Europe's most powerful supercomputer. By mid-March 2004, the system will consist of 41 IBM p690 nodes, each with 32 POWER4+ 1.7 GHz CPUs and 128 GB of main memory. This amounts to a peak performance of 8.9 Tflops and a total main memory of 5.2 TByte. The nodes are connected by the new IBM High Performance Switch ('Federation'). NIC has operated a 6-node system since June 2003 and a 30-node system since January 2004.

The talk will give an overview of the system and will report on experiences in installing such a large machine, as well as running it in production mode with several hundred users spread all over Germany.

More details about the system can be found at http://jumpdoc.fz-juelich.de/

Thursday - March 25, 2004

HPC Software Update and Directions

Dr Rama Govindaraju, HPC Software Architect

This talk will focus on some future HPC software technologies from IBM. In particular, it will include an overview of enhancements in GPFS for Grid computing, scaling and performance; enhancements in protocols with respect to RDMA on Federation; and enhancements to LoadLeveler to support heterogeneous AIX and Linux clusters. The talk will also highlight insights into future interconnect technologies and the challenges that lie ahead.

The IBM High Performance Computing Toolkit

David Klepacki, IBM

The software tools developed by the ACTC in IBM Research (e.g., HPM, TurboMPI/SHMEM, etc.) have been re-architected to work together in a more uniform environment. For example, MPI or SHMEM programs can now instrument message-passing performance simultaneously with the HPM metrics, and visualize both sets of statistics in a common GUI called PeekPerf. Additional tools for OpenMP as well as for memory simulation are also being integrated into this visualizer. The resulting software suite is now referred to as the IBM High Performance Computing Toolkit. This talk will discuss the new tools, with application examples illustrating their benefits for the performance programmer.

Emerging Research Tools

Luiz DeRose, IBM

In this talk I will provide an update on the ongoing ACTC work on tools and infrastructure for performance measurement of scientific applications. The presentation will cover the new developments, which include support for analysis and understanding of the memory subsystem with the Simulator Guided Memory Analyzer (SiGMA); support for performance measurement and visualization of OpenMP applications with the Dynamic Performance monitoring interface for OpenMP (DPOMP) and the Practical OpenMP Profiler (POMPROF); and the new features in the Hardware Performance Monitor (HPM) Toolkit: hpmstat and CATCH (the Call-path based Automatic Tool for Capture of Hardware events).


Recent Advances in Parallel Approaches to Large-Scale SVMs

Gaetano Zanghirati, University of Ferrara, Department of Mathematics
Thomas Serafini, Luca Zanni, University of Modena and Reggio-Emilia, Department of Mathematics

We present parallel decomposition algorithms for effectively solving large-scale machine learning problems via the 'Support Vector Machine' (SVM) approach [Vapnik, 1998]. For binary classification, this methodology involves the numerical solution of a large-scale optimization problem: a convex quadratic programming (QP) problem subject to box constraints and a single linear equality constraint. Since its Hessian matrix is dense, and in many real-world applications its size is very large (often much greater than 10^4), ad hoc approaches are required. Among recent ideas, decomposition techniques are the most investigated: they are iterative schemes that split the original large problem into a sequence of QP subproblems, much smaller than the original one but of the same form.
Various packages implementing decomposition strategies are available. Here we deal with decomposition algorithms that use numerical QP solvers for the inner subproblems. In these cases, a crucial question is what kind of QP solver is most convenient.
Some recent gradient projection methods [Birgin et al., 2000; Ruggiero and Zanni, 2000; Serafini et al., 2003] based on the Barzilai-Borwein steplength selection rules have demonstrated their numerical merits; most importantly, they are well suited to parallel implementation. By using these iterative methods as inner QP solvers within a decomposition approach similar to Joachims' SVMlight algorithm, we designed an easily parallelizable technique for training large-scale SVMs. One of the key features of our approach is to efficiently manage subproblems large enough to produce few decomposition steps.
Since the most expensive tasks of each iteration can be efficiently faced in parallel, this feature allows an immediate parallelization of the decomposition scheme.

Our projection-based decomposition technique was further improved by adding a completely parallel initialization step, consisting of independent training on subsets of the original training set. This step provides the algorithm with a good initial approximation and yields significant benefits in those cases where an unsatisfactory convergence rate is observed.
In order to show the effectiveness of our strategy, an extensive computational study was carried out on different multiprocessor platforms at the CINECA Supercomputing Center.
Real-world large-scale SVM problems were considered and encouraging results were obtained in both scalar and parallel environments.
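
The inner-solver idea, gradient projection with a Barzilai-Borwein steplength on a box-constrained QP, can be illustrated on a toy problem (a minimal Python sketch; the real SVM subproblem also carries the linear equality constraint, which is omitted here for simplicity):

```python
# Toy sketch of gradient projection with a Barzilai-Borwein (BB1)
# steplength for:  min 0.5 x'Qx - b'x  subject to  0 <= x <= C.
# The equality constraint of the actual SVM subproblem is omitted.
def project(x, C):
    # projection onto the box [0, C]^n
    return [min(max(xi, 0.0), C) for xi in x]

def grad(Q, b, x):
    # gradient Qx - b
    return [sum(Q[i][j] * x[j] for j in range(len(x))) - b[i]
            for i in range(len(x))]

def gp_bb(Q, b, C, iters=200):
    x = [0.0] * len(b)
    g = grad(Q, b, x)
    step = 1.0
    for _ in range(iters):
        x_new = project([xi - step * gi for xi, gi in zip(x, g)], C)
        g_new = grad(Q, b, x_new)
        s = [a - c for a, c in zip(x_new, x)]   # step in x
        y = [a - c for a, c in zip(g_new, g)]   # step in gradient
        sy = sum(a * c for a, c in zip(s, y))
        ss = sum(a * a for a in s)
        if sy > 1e-12:
            step = ss / sy                      # BB1 steplength s's / s'y
        x, g = x_new, g_new
    return x
```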

Parallel Eigensolver Performance on the HPCx System

Andrew Sunderland, Daresbury Laboratory
Elena Breitmoser, Edinburgh Parallel Computing Centre

The symmetric eigenvalue problem is central to many scientific and engineering application codes, particularly in areas such as computational chemistry, materials science, and atomic and molecular physics. The computational intensity associated with the eigensolver stage often results in this calculation dominating parallel run-times. 'Grand challenge' and other very large computations based on eigensolvers are often limited by both time and memory resources, even on the largest supercomputers currently available. There are several parallel algorithms available to solve the eigenvalue problem, where the effectiveness of the chosen approach often relates to the characteristics of the particular matrices in question. The standard method involves reduction to tri-diagonal form, followed by bisection and inverse iteration or the QR algorithm, and finally back-transformation to find the eigenvectors of the full problem. Parallel library routines that use these methods have been available in the public-domain ScaLAPACK library for several years. Recently, new approaches have been developed, such as the Divide & Conquer algorithm [ScaLAPACK 1.7], the Berkeley algorithm [Peigs 3.0], Multiple Relatively Robust Representations (MR^3) [PLAPACK], and a new parallel implementation of the one-sided Block-Factored Jacobi eigensolver (BFG). This presentation investigates the performance of these methods on the IBM p690 HPCx system based at Daresbury Laboratory (UK). Results from a range of applications and problem sizes will be presented. Approaches that can be used to enable parallel scaling to very large processor counts for eigensolver-based application codes will also be discussed.
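
The rotation idea behind the Jacobi family of eigensolvers can be sketched serially (a minimal Python illustration of the classical method; the parallel block-factored variant discussed in the talk operates on matrix blocks, which this sketch does not attempt):

```python
# Serial sketch of the classical Jacobi eigensolver: repeatedly apply a
# plane rotation that zeroes the largest off-diagonal entry of a
# symmetric matrix until it is numerically diagonal.
import math

def jacobi_eigenvalues(A, max_rot=200, tol=1e-12):
    A = [row[:] for row in A]
    n = len(A)
    for _ in range(max_rot):
        # locate the largest off-diagonal element (upper triangle)
        p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: abs(A[ij[0]][ij[1]]))
        if abs(A[p][q]) < tol:
            break
        # rotation angle chosen so that the (p, q) entry becomes zero
        theta = 0.5 * math.atan2(2.0 * A[p][q], A[q][q] - A[p][p])
        c, s = math.cos(theta), math.sin(theta)
        # column update: A <- A J
        for k in range(n):
            akp, akq = A[k][p], A[k][q]
            A[k][p] = c * akp - s * akq
            A[k][q] = s * akp + c * akq
        # row update: A <- J^T A
        for k in range(n):
            apk, aqk = A[p][k], A[q][k]
            A[p][k] = c * apk - s * aqk
            A[q][k] = s * apk + c * aqk
    return sorted(A[i][i] for i in range(n))
```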

Using LAPI and MPI-2 in an N-body cosmological code on the IBM SP

Marco Comparato, U. Becciani, V. Antonuccio, INAF-Osservatorio Astrofisico di Catania
Claudio Gheller, CINECA

In the last few years, cosmological simulations of structure and galaxy formation have assumed a fundamental role in the study of the origin, formation and evolution of the universe. These studies have improved enormously with the use of supercomputers and parallel systems, allowing more accurate simulations than traditional serial systems. In this paper, we present FLY, a numerical code based on the tree N-body method to model the evolution of three-dimensional self-gravitating collisionless systems.
The latest FLY release (version 2.1) is open source, distributed at http://www.ct.astro.it/fly. It is also available in the Computer Physics Communications Program Library.
FLY is a fully parallel code based on the Barnes-Hut tree algorithm. It uses the one-sided communication paradigm to share data among the processors, enabling access to remote private data while avoiding any kind of synchronization. The code was originally developed on the CRAY T3E system using the logically SHared MEMory access routines (SHMEM), but it also runs on the IBM SP using the Low-Level Application Programming Interface routines (LAPI).
However, both SHMEM and LAPI are proprietary parallel programming libraries. In order to improve the portability of FLY, we have recently developed a new version of the code based on MPI-2; this version has been extensively optimized and tested on the IBM SP POWER4 supercomputing system. We describe the details of the new implementation and present tests and benchmarks, comparing them with the corresponding results obtained with the LAPI (IBM proprietary) version of the code.

The use of HPC in Astrophysics: an experience report

Paolo Miocchi, Dipartimento di Fisica, Universita' di Roma 'La Sapienza'

An application of High Performance Computing in Astrophysics is presented and discussed. A numerical tool for the simulation of the gravitational dynamics of stellar systems (the treeATD code) has been parallelized with MPI routines and implemented on the IBM SP4 platform.
The numerical problem handled is by itself a difficult task as far as efficient domain decomposition and load balancing are concerned. An overview of a first parallelization approach and of specific methods for minimizing inter-node communication is presented.
The overall performance and some scientific results will also be shown.

3-D Hydrodynamics of SNR Shock Interaction with Interstellar Bubbles: The OAPa/UniPa Key Project

Salvatore Orlando, INAF - Osservatorio Astronomico di Palermo
G. Peres, F. Reale, DipSFA Università di Palermo
R. Rosner, FLASH center, The University of Chicago

We report on our key-project experience on the IBM SP4 machine at CINECA. We describe the main objectives of our project, aimed at studying the interaction of a supernova shock wave with interstellar clouds. We describe the numerical code used, namely FLASH, a 3-D astrophysical hydrodynamics code for parallel computers developed at the FLASH center (in Chicago); our team collaborates with, and contributes to, the FLASH project. We discuss the resources required for the whole project, the I/O management, and the performance and scalability of the code on the IBM SP4. The project has required significant data storage, and we discuss here the various strategies adopted. Finally, we present a selection of preliminary results.

Nature of DE from VPF evaluation

Silvio Bonometto, Un.MI-Bicocca/Fisica, INFN-MI
Paola Solevi, Andrea Macciò, Anatoly Klypin

We compare the behavior of the void probability function (VPF) in models where Dark Energy is due to false vacuum (LCDM models) and models with dynamical Dark Energy (dDE), due to a scalar field self-interacting through RP or SUGRA potentials. Using high resolution ART simulations in boxes of 100h^-1 Mpc, we show that the VPF is a good discriminator between different dDE's. A comparison between simulations and data requires a clear definition of galaxy number density, after resolving halos above a given size. In turn, this requires using an HOD algorithm, whose sample variance unfortunately exceeds the difference between the VPFs for different DE's. This is not a numerical uncertainty, but has a physical origin. We propose a method to overcome this impasse, based on a different treatment of observational samples.

Parallelization of a code for large-eddy simulation of environmental turbulent flows

Stefano Salon, Vincenzo Armenio, Univ. Trieste - Dip. Ing. Civile, Sez. Idraulica
Alessandro Crise, Ist. Naz. Oceanografia e Geofisica Sperimentale - OGS, Sgonico (TS)

The parallelization, by means of the MPI paradigm, of a Large-Eddy Simulation (LES) code resolving the three-dimensional Navier-Stokes equations is described here. LES directly resolves the large, energy-carrying scales, whereas the small, dissipative scales of turbulence are parameterized by a subgrid-scale (SGS) model. Recently, this technique has been successfully used for the investigation of a variety of environmental, geophysical and engineering problems.
The parallelization is achieved by one-dimensional domain decomposition along the transverse direction of the computational box. The most computationally expensive portions of the code are the SGS model subroutine, the multigrid subroutine solving the pressure field, and the resolution of a tridiagonal system solving the velocity and density equations. The parallelized subroutine resolving the pressure field represents the best result of the work, with a speed-up very close to the ideal one. The collective communications employed in the resolution of the tridiagonal system limit the performance of the code. The reduction of the mean time per iteration on a 64x128x64 computational grid to one sixth of the serial version (in a parallel configuration with 8 processors) represents a satisfactory result.
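
For reference, the serial kernel in question, a tridiagonal solve, is the classic Thomas algorithm (a minimal Python sketch; the production code is Fortran/MPI, and it is the recurrences below that force the communication mentioned above when the system is distributed):

```python
# Thomas algorithm: O(n) solve of a tridiagonal system
#   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]
# with a[0] and c[n-1] unused. The forward/backward recurrences are
# inherently sequential, which is why distributing this solve requires
# communication across the decomposed domain.
def thomas(a, b, c, d):
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # forward elimination
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    # back substitution
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```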
More information about the issues encountered in the parallel implementation, together with evaluations of performance and scalability, will be discussed at the workshop.
The parallel solver has been used extensively in the resolution of a Stokes boundary layer in the turbulent regime. Due to the burdensome computational effort needed to accurately simulate the fully developed turbulent regime, past studies of oscillating boundary layers have investigated, by means of direct numerical simulation (DNS), the so-called disturbed-laminar and intermittently turbulent regimes. Our investigation represents the first numerical study of the turbulent field in a purely oscillating flow at a Reynolds number (Re = 1.6e+06) such that most of the oscillation cycle is characterized by fully developed turbulence. Our results are in good agreement with experimental measurements (Jensen et al., 1989, J. Fluid Mech. 206) and extend the findings of previous studies. All the numerical computations were carried out on the IBM SP4 facility at CINECA.

High performance computing for a family of smooth trajectories using parallel environments

Gianluca Argentini, RIELLO GROUP / New Technologies & Models

In this work I present a technique for the construction and fast evaluation of a family of cubic polynomials for the analytic smoothing and graphical rendering of particle trajectories for flows in a generic geometry. The principal aims of the work are:

1. the interpolation of 3D points by regular parametric curves; the improved technique permits obtaining smooth geometric lines even in situations where there are few data points or where the flow is turbulent;
2. a fast and efficient evaluation of these polynomials at a set of suitable parameter values, for good graphic rendering resolution; the method is based on parallel computing in a multiprocessor environment;
3. the measurement of speedup and efficiency for scientific and technical applications using cluster computing techniques.

The numerical approach is based on a cellular automaton evolving on a three-dimensional grid. This mechanism simulates in an adaptive manner the behavior of the flow, to obtain the discrete set of data points for every particle.
The smoothed curves are then computed by interpolating the points using a combination of the Bezier method and piecewise cubic splines, imposing adequate conditions on slope and curvature. The functions so computed have the regularity properties of Bezier curves and the simple algebraic expression of cubic polynomials, and they avoid the possible appearance of spurious wiggles and other unrealistic effects such as the Gibbs phenomenon.
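
One common way to obtain a C1 piecewise cubic through data points can be sketched as follows (a minimal Python illustration of a Catmull-Rom-style Hermite spline on one coordinate, applied per coordinate for 3D; this is a generic stand-in, not the author's exact Bezier/spline combination or slope and curvature conditions):

```python
# Catmull-Rom-style cubic Hermite interpolation: passes through the data
# points, with slopes from central differences (one-sided at the ends),
# giving a smooth C1 curve without high-degree oscillations.
def hermite_eval(p0, p1, m0, m1, t):
    # cubic Hermite basis on [0, 1]
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1

def catmull_rom(points, samples_per_seg=10):
    n = len(points)
    m = []
    for i in range(n):
        if i == 0:
            m.append(points[1] - points[0])
        elif i == n - 1:
            m.append(points[-1] - points[-2])
        else:
            m.append((points[i + 1] - points[i - 1]) / 2.0)
    curve = []
    for i in range(n - 1):
        for k in range(samples_per_seg):
            t = k / samples_per_seg
            curve.append(hermite_eval(points[i], points[i + 1],
                                      m[i], m[i + 1], t))
    curve.append(points[-1])
    return curve
```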
For an appropriate visualization of the flow, we use a computational method based on a suitable distribution of the polynomials among the available processors.
The efficiency of the method is good, mainly from reducing the number of floating-point computations by caching the numerical values of the powers of the polynomial parameter, and from reducing the need for communication among processes. The computation is performed using a customized parallel environment for the Matlab package on a multiprocessor IBM x440 server with Intel Hyper-Threading technology, and using MPI on a Linux x330 node cluster at CINECA.

The work supports the following conclusions:
a. smooth and realistic renderings of a flow can be obtained even when the geometry of the region of interest is not easily described by standard mathematical methods such as finite elements;
b. the parallel method used has a good level of computational efficiency (about 0.8-0.9 in our experiments).

This work has been developed for the Research & Development Department of our company for planning advanced customized models of industrial burners.

Performance portability of a Lattice Boltzmann code

Federico Massaioli, Giorgio Amati, CASPUR

We report on recent developments of a simple but real production code, implementing the well known Lattice Boltzmann Method (in the BGK approximation) to solve Navier-Stokes equations on a regular lattice.
The fluid is represented by 19 different particle species, each one moving on the 3D lattice with fixed, preassigned velocity (streaming). At every lattice site, particle species interact, adjusting their values according to the physics of the flow (collision). Every particle species is represented with a separate three-index array.
'Natural' LBM implementations, widely adopted in practice, separate the streaming and collision phases. This approach stresses the memory subsystem of a computer, even more so for parallel implementations, severely affecting both serial and parallel performance.
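The memory-pressure argument can be illustrated on a deliberately simplified one-dimensional, two-species toy lattice (a sketch only: the toy equilibrium and names are ours, not the 19-species production code). The fused update reads and writes each site once per time step, yet produces the same result as the separated two-phase version:

```python
def stream_then_collide(fp, fm, omega):
    """'Natural' two-phase update: stream into temporaries, then
    collide.  Every array is swept twice, which is what stresses
    the memory subsystem."""
    n = len(fp)
    sp = [fp[(i - 1) % n] for i in range(n)]   # + species moves right
    sm = [fm[(i + 1) % n] for i in range(n)]   # - species moves left
    out_p, out_m = [], []
    for i in range(n):
        rho = sp[i] + sm[i]
        eq = 0.5 * rho                         # toy equilibrium
        out_p.append(sp[i] + omega * (eq - sp[i]))
        out_m.append(sm[i] + omega * (eq - sm[i]))
    return out_p, out_m

def fused_update(fp, fm, omega):
    """Fused update: gather the streamed values and collide in a
    single sweep, so each site is touched once per time step."""
    n = len(fp)
    out_p, out_m = [0.0] * n, [0.0] * n
    for i in range(n):
        sp = fp[(i - 1) % n]
        sm = fm[(i + 1) % n]
        eq = 0.5 * (sp + sm)
        out_p[i] = sp + omega * (eq - sp)
        out_m[i] = sm + omega * (eq - sm)
    return out_p, out_m
```

Both routines are algebraically identical; only the number of passes over memory differs.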
At ScicomP 6 we presented a previous version of this code, and discussed several techniques to relieve the burden on memory subsystems and CPU internal tables on Nighthawk II, Power3 based nodes. Preliminary results for Power4 systems were presented as well.
In this talk we'll present Power4 performance data for the latest version of the code, which is running in production simulating two-phase flows on 512^3 grids.
Most of the improvements were aimed at obtaining a single implementation giving the highest performance on a wide range of architectures and implementations (Alpha, IA-32, IA-64, NEC, PowerPC) without sacrificing the significant efficiency gains realized on Power4. Not only did we achieve performance portability for this application, but further speed improvements on Power4 systems were obtained as well.
In the process, we found that several issues (CPU implementation, memory subsystem, compilers) affected performance differently on different architectures. Serial and parallel performance data will be presented in the talk to illustrate the point.


Performance of IFS on p690+ with Federation switch at ECMWF

Deborah Salmond, ECMWF
John Hague, ECMWF/IBM consultant

The IFS is the weather forecasting application run at the European Centre for Medium-Range Weather Forecasts (ECMWF). It has been running operationally for 2 years on the 960-processor p690 clusters installed at ECMWF. IFS is now being benchmarked on a 256-processor p690+ cluster with the Federation switch. The talk covers performance comparisons between the p690 and p690+ systems, focusing on communications and CPU aspects. The effects of large pages and memory affinity are also discussed.

HPC parallelization of oceanographic models via high-level techniques

Piero Lanucara, Vittorio Ruggiero, CASPUR Roma
Vincenzo Artale, Gianmaria Sannino, Adriana Carillo, ENEA Casaccia

The strong cooperation between CASPUR and GEM-CLIM group at ENEA Casaccia is motivated by the wide production of HPC codes to study the role of the oceans in climate change using numerical models.
A parallel approach is necessary because of the higher spatial resolution used in the simulations and the enormous elapsed time such jobs require.

Although very efficient on distributed-memory architectures, low-level parallelization tools (like MPI) are not used in this study because, in our experience, too much effort (and time) must be spent to convert a serial code into an MPI-parallel one and to maintain it. As it was not our aim to rearrange the codes in such a way, we oriented our choice towards tools based on high-level techniques, in which all the low-level serial and parallel machinery (data partitioning, communications, I/O, ...) is realised through compiler directives and hidden from the user.
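What such high-level tools hide can be sketched in miniature: behind the directives, they generate the data partitioning and ghost-cell (halo) exchange that a low-level MPI code would spell out by hand. The following Python sketch is purely illustrative (periodic boundaries, in-memory "exchange", and all names are ours, not SMS output):

```python
def partition_with_halo(field, nprocs):
    """Split a 1D field into equal local chunks, each padded with one
    ghost cell per side: the layout that directive-based tools set up
    automatically."""
    n = len(field)
    size = n // nprocs
    chunks = []
    for p in range(nprocs):
        lo, hi = p * size, (p + 1) * size
        left = field[(lo - 1) % n]     # periodic neighbours stand in
        right = field[hi % n]          # for inter-process messages
        chunks.append([left] + field[lo:hi] + [right])
    return chunks

def exchange_halos(chunks):
    """Refresh ghost cells from neighbouring chunks: the communication
    step an MPI code would implement with explicit sends/receives."""
    p = len(chunks)
    for i, c in enumerate(chunks):
        c[0] = chunks[(i - 1) % p][-2]   # left ghost = neighbour's last interior cell
        c[-1] = chunks[(i + 1) % p][1]   # right ghost = neighbour's first interior cell
    return chunks
```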

SMS (Scalable Modeling System) proved to be the right answer to our needs, and we used it to build a prototype of a modified version of the Princeton Ocean Model (POM) for studying the Mediterranean Sea circulation.
The parallel code runs on one pSeries 690 server equipped with 32 Power4 CPUs at 1.3 GHz and 64 GB of RAM.
Results are shown for climatic runs (>>1 year of simulation).

Wind driven circulation in the Gulf of Trieste: a numerical study in stratified conditions

Stefano Querin, Alessandro Crise, Istituto Nazionale di Oceanografia e di Geofisica Sperimentale - O.G.S. (Trieste - Italy)

The aim of this work is the analysis of the dynamics of the Gulf of Trieste (GoT, NE Adriatic Sea): understanding the general circulation of the basin is an essential step towards reliable short-term (few-day) oceanographic forecasts. For this purpose, the numerical model MITgcm was used to set up a nested, high-resolution coastal model.
This paper shows the results obtained from wind-driven circulation case studies, with realistic thermohaline initial conditions and external forcings. Particular attention was paid to the influence on the basin's hydrodynamics of the Isonzo river freshwater input and of proper conditioning on the open boundary: to address the latter problem, a nested approach with larger coarse-scale models was adopted to permit longer integrations.
The MITgcm is a non-hydrostatic general circulation model (GCM) created to study both oceanic and atmospheric problems on a wide range of scales. The model was discretized on a Cartesian Arakawa C-grid on the f-plane, with horizontal and vertical resolutions of 250 m and 1 m, respectively. The domain was rotated to best fit the grid space.
MITgcm partial cells capability was employed to properly reproduce the GoT bathymetry. Eddy diffusion processes were parametrized with biharmonic operators and KPP turbulence parametrization for horizontal and vertical components, respectively.
The MITgcm code structure includes a series of packages containing additional routines that can be added to study specific hydrodynamic features. The infrastructure in which kernel and package operates is called WRAPPER (Wrappable Application Parallel Programming Environment Resource).
This allows the numerical code to be easily adjusted for different architectures.
The experiments were run in a parallel MPI environment using a 4 x 4 domain decomposition (16 processors) on the SP4 supercomputer at CINECA: with this computational setup, high speed-up values (around 10) were reached.
Several tests were carried out and led to results in good agreement with experimental data, especially regarding the upwelling phenomena that occur during severe Bora wind events.
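For reference, the figures quoted above relate as follows: a speed-up of about 10 on 16 processors corresponds to a parallel efficiency of roughly 0.6. A minimal helper (with hypothetical timing numbers in the usage) makes the relation explicit:

```python
def speedup(serial_time, parallel_time):
    """Speed-up: serial elapsed time over parallel elapsed time."""
    return serial_time / parallel_time

def efficiency(serial_time, parallel_time, nprocs):
    """Parallel efficiency: speed-up divided by processor count."""
    return speedup(serial_time, parallel_time) / nprocs
```

For example, a hypothetical run taking 160 hours serially and 16 hours on 16 processors gives a speed-up of 10 and an efficiency of 0.625.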

Large Scale Photoemission Calculations: Algorithm, Performance and Applications

Piero Decleva, Department of Chemical Sciences, University of Trieste
G. Fronzoni, M.Stener

Accurate solutions of the Schroedinger equation in the electronic continuum are obtained employing a multicenter B-spline basis set expansion and a least-squares algorithm. Full use of symmetry is implemented. Excellent parallelization is achieved for the most computationally intensive parts of the program, namely the numerical integrations for the matrix elements and the linear algebra (employing ScaLAPACK) for the relevant eigenvectors. MPI-2 is required for the block-cyclic decomposition used by ScaLAPACK. Timings will be reported for selected examples. Results for photoabsorption and photoionization in C60 and C70 will be presented and compared with recent experimental data.
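The block-cyclic distribution in question is the standard ScaLAPACK mapping applied along each matrix dimension; along one dimension it can be sketched as follows (illustrative Python; the function name is ours):

```python
def block_cyclic_owner(global_idx, block_size, nprocs):
    """Map a global index to (owning process, local index) under the
    1D block-cyclic distribution ScaLAPACK uses along each matrix
    dimension."""
    block = global_idx // block_size       # which block the index falls in
    proc = block % nprocs                  # blocks are dealt out round-robin
    local_block = block // nprocs          # blocks this process holds before it
    local_idx = local_block * block_size + global_idx % block_size
    return proc, local_idx
```

With block size 2 on 2 processes, global indices 0..7 land on processes 0, 0, 1, 1, 0, 0, 1, 1, balancing work across the triangular factorization sweeps.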

An Overview of Running Applications on IBM pSeries Linux Systems

Angelo Rossi, IBM Research Division/Department of Computer Science
Carlos P. Sosa, IBM and University of Minnesota Supercomputing Institute
Francois Thomas, EMEA Deep Computing, IBM France
Balaji V. Atyam, IBM Austin
Raj Panda, IBM Life Sciences Solutions Technology

In this presentation we shall compare the performance of several leading life-sciences applications on AIX® 5.1 and SuSE Linux® Enterprise V8 for PowerPC (PPC). In addition, we shall provide a brief overview of our current efforts in porting and optimizing life-sciences applications on PowerPC running SuSE (pLinux). Benchmarks were run on IBM® eServer and pSeries systems with the latest IBM VisualAge C++ V6.0 and XL Fortran V8.1 compilers for both AIX and Linux. A performance comparison was also made between the IBM compilers and the GNU compilers on some of these applications. Finally, we present some preliminary results on parallel scalability for some of these applications running on pLinux.

Many of the tables presented here can be obtained from: Balaji V. Atyam, Raj Panda, and Carlos P. Sosa, 'Performance Comparison of Scientific Applications on AIX and Linux for PowerPC' and Prabhakar Attaluri, Tomas Baublys, Xinghong He, Chin Yau Lee, and Francois Thomas, 'Deploying Linux on IBM e-server pSeries Clusters'.

Theoretical modeling of catalysts: a real challenge for performing industrial processes

Ignazio Fragalà, Alessandro Motta, Dipartimento di Scienze Chimiche, Università di Catania
Giuseppe Lanza, Dip. Chim. Università della Basilicata
Tobin J. Marks, Dep. Chem. Northwestern University Evanston (IL) USA

Cyclopentadienyl-amido based catalysts effect the homopolymerization of long-chain α-olefins (1-butene and 1-pentene) and ethylene, copolymerization with sterically encumbered comonomers, and, depending on the catalyst symmetry, moderately stereoselective enchainment, isotactic or syndiotactic, of propylene. The polymerization of propene and higher 1-olefins introduces the problems of stereoselectivity (enantioface selectivity, or enantioselectivity) and regioselectivity. Constrained-geometry metallocenes represent the best-performing catalysts for olefin polymerization, and the associated stereospecific properties are of relevance for qualified products. In this perspective, both the structural and the energetic factors that drive the prochiral orientation of the olefin during insertion processes undoubtedly represent key issues.
Focus is on stereospecific olefin polymerization processes with (CH3)2Si(ind)(tBuN)TiR2 precursor catalysts, which are used to produce partially isotactic polypropylene. The intrinsic ability of the catalyst to select the best-suited insertion route has been modeled for the following system (eq. 1):

H2Si(ind)(tBuN)Ti(CH3)+ + CH2=CHCH3 first insertion (1)

The effects of the growing polymeric chain on the stereo- and regioselective catalytic properties have been investigated through the following reaction (eq. 2):

H2Si(ind)(tBuN)Ti(iBu)+ + CH2=CHCH3 second insertion (2)

Of course, the naked isobutyl-metal cation mimics the effect of the growing chain on the active-site environment.

The use of electrophilic early transition metal and f-element complexes to effect synthetically useful organic transformations is rapidly becoming an important interfacial boundary between traditional organometallic and synthetic organic chemistry. Carbon-nitrogen bond-forming processes are of fundamental importance in organic chemistry, and hydroamination (catalytic N-H bond addition to an unsaturated carbon-carbon multiple bond) represents both a challenging and a highly desirable, atom-efficient transformation for the synthesis of nitrogen-containing molecules.
Olefin hydroamination, which in a formal sense consists of the addition of a N-H bond across a C=C bond, is a transformation of seemingly fundamental simplicity and atom economy, and would appear to offer an attractive route to numerous classes of organonitrogen molecules.
Organolanthanide complexes of the type Cp′2LnR (Cp′ = Me5C5; R = H, CH(TMS)2; Ln = La, Nd, Sm, Y, Lu) have been shown to be highly reactive with respect to hydroamination/cyclization transformations involving aminoalkenes, aminoalkynes, aminoallenes, and aminodienes. Experimental kinetic and mechanistic data strongly argue that catalytic hydroamination/cyclization of amino-olefins by organolanthanide complexes involves i) the turnover-limiting insertion of olefinic functionalities into Ln-N bonds within the framework of a bis(cyclopentadienyl)lanthanide environment, coupled with ii) subsequent rapid Ln-C protonolysis to effect efficient catalytic N-C bond-forming processes.
The present study represents the first theoretical analysis of the salient mechanistic aspects associated with amino-olefin hydroamination/cyclization mediated by Cp′2LnR complexes (Cp′ = Me5C5; R = H, CH(TMS)2; Ln = La, Nd, Sm, Y, Lu). Here (C5H5)2LaCH(TMS)2 has been adopted as a model catalyst, while the substrate is represented by 1-aminopent-4-ene (CH2=CH(CH2)3NH2).

A Parallel Programming Tool for SAR Processors

Davide Guerri, Marco Lettere, Riccardo Fontanelli, Synapsis Srl

Earth observation techniques are based on the application of computationally heavy image-processing algorithms to satellite sensor image data. Images are acquired at fine geometric resolutions, and the raw data is quite large (26500 * 5600 double-precision complex pixel values for a single raw image).
This well-known fact, along with requirements related to real-time industrial production, led to a quantitative and qualitative study of a set of image-processing algorithms for SAR processors in the context of the ASI (Italian Space Agency) COSMO-SkyMed project.
The considered algorithms showed some interesting patterns in terms of algorithm structure and parallelism exploitation. During the activity of prototyping and analysis, an abstraction (SPE Chain Model) of the algorithmic behaviour has been defined. This model simplifies performance modeling, design and implementation of parallel image processing algorithms.
According to the defined abstraction, a parallel programming tool (SPE - Sar Parallel Executor) has been developed. The SPE tool enables a programmer to implement efficient, structured, object-oriented parallel image-processing algorithms conforming to the SPE Chain Model, while reusing pre-existing sequential code where possible.
SPE is written in C++ and is based on two class libraries. SPEAPI is a programmer interface for writing parallel image processing algorithms according to the SPE Chain Model. SPEENG is a set of classes that implements the runtime support for executing algorithms written with SPEAPI. The main purpose of SPEENG is wrapping communication, file system access and other implementation specific choices, isolating the algorithm definition from its execution. In particular, the current version of SPE uses MPI for communication and standard POSIX calls for IO.
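The chain abstraction can be sketched in miniature as follows. This is an illustrative Python toy of the concept only, with names of our own invention; the real SPEAPI/SPEENG are C++ class libraries in which the runtime distributes image tiles across processes with MPI behind the same interface:

```python
class Stage:
    """One step of an image-processing chain: wraps a sequential
    kernel so a runtime could place it and manage its data movement
    (illustrative only; not the actual SPEAPI)."""
    def __init__(self, name, kernel):
        self.name = name
        self.kernel = kernel

class Chain:
    """Compose stages into a pipeline.  A real runtime (like SPEENG)
    would execute each stage over distributed tiles instead of
    looping locally."""
    def __init__(self, stages):
        self.stages = stages

    def run(self, image):
        for stage in self.stages:
            image = stage.kernel(image)
        return image
```

Usage: `Chain([Stage("scale", f), Stage("filter", g)]).run(image)` applies the sequential kernels in order; the point of the abstraction is that the same chain definition can be re-executed by a parallel runtime without changing the kernels.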
Two image processing algorithms (CSA - Chirp Scaling and P-FLOOD) belonging to different classes of applications have been tested to validate both the SPE Chain Model and the SPE programming tool.
Performance tests of the two sample algorithms have been executed on the IBM Linux Cluster 1350 at CINECA. The targets of the tests were adherence to analytical performance models and validation of the programming tool in terms of efficiency and overhead introduced by the high-level programming constructs.
Both algorithms could exploit the high-bandwidth communication infrastructure and the powerful SMP computation nodes, yielding excellent performance and very good efficiency, closely approximating 80% on medium-sized datasets.
The CSA test case also confirmed that no significant overhead was introduced by the high-level programming constructs, since its completion times on a varying number of computation nodes closely matched those of a specialized low-level implementation.
Moreover, CSA showed nearly exact adherence to its analytical performance model based on the quantitative study.
These very positive results make the SPE Chain Model and the SPE programming tool a feasible solution for industrial environments, because they can be used to predict the effects of a parallel solution on application performance and to design and implement it quickly.

Friday - March 26, 2004

Compiler Roadmap

Roch Archambault, IBM

This presentation will describe the features of the latest versions of the IBM C, C++ and Fortran compilers. Some emphasis will be given to the latest Fortran release, which was made generally available in June. I will also discuss compiler directions for 2004 and beyond, including Fortran 200x, ongoing developments in the C and C++ languages, and new techniques for code optimization.

MPI Developments

Dick Treumann, IBM

This talk will address status and directions for MPI. It will include:
1) the current experience and future expectations for use of LAPI as the MPI transport on the High Performance Switch (HPS);
2) environment variables for tuning MPI performance in the LAPI-based release;
3) the roadmap for performance improvements on the HPS; and
4) remarks about using AIX Large Pages with MPI.


Pyroclastic Density Current Simulations by Parallel Computing

Carlo Cavazzoni, Giovanni Erbacci, CINECA
Tomaso Esposti Ongaro, Augusto Neri, Silvia Baseggio, INGV

The parallelization and optimization of a computer code for 3D pyroclastic flow simulations will be presented, and it will be shown that supercomputers are required to run sufficiently accurate simulations with realistic flow conditions.
The development of the 3D parallel code started from a 2D Fortran77 serial code already used for 2D simulations in previous works. Since the time and memory taken by a 3D simulation are approximately two orders of magnitude larger than for a 2D simulation, the only practical way to perform 3D runs is to use parallel computing.
In the first part of the work we parallelized the 2D code and, at the same time, rewrote it using Fortran90 and a modular programming approach. The modular approach and Fortran90 were adopted to better manage the increase in code complexity that comes with parallelization.
Before adding the third dimension, we performed a series of parallel benchmarks with different data sizes and distributions, to validate the new code and evaluate its efficiency.
Thanks to the modular structure, the addition of the third dimension to the code's data structures was quite smooth, and in many cases we could use automatic programming techniques. Finally, the numerical kernel of the code was optimized, obtaining a factor-of-two speed-up.

Seismic waves numerical modeling with a parallel algorithm

Peter Klinc, Geza Seriani, Enrico Priolo, Istituto Nazionale di Oceanografia e Geofisica Sperimentale - OGS

The accurate simulation of seismic waves propagation through large portions of heterogeneous media has several geophysical applications, ranging from natural resources exploration to earthquake and volcano hazard assessment, but it requires high performance computers and the exploitation of parallel computing.
We present the development of a parallelized algorithm for the computation of the seismic wave field in three-dimensional detailed models of the upper Earth's crust. The algorithm is based on the solution of the velocity-stress formulation of the elasto-dynamic equation using the Fourier pseudo-spectral method. The complexity of the basic algorithm is, however, increased by several additional features: the evaluation of spatial derivatives on staggered grids, the description of the medium's intrinsic attenuation by means of the generalized standard linear solid model, the implementation of the free-surface condition on the top of the model, and the setup of perfectly matched layers as absorbing boundaries to prevent the wrap-around effect.
The code has been built using Fortran90 modules, and the parallelism has been implemented using MPI. Since the algorithm treats a huge amount of data, it has been parallelized following a data-decomposition approach. A large amount of communication between processors proved unavoidable; therefore, tuning tests were required to optimize the efficiency of the scheme and make the best use of the computing power of the multiprocessor. We tuned the developed software on the IBM SP4 at CINECA, where we currently run it for a number of research projects, such as the investigation of the structure of the Campi Flegrei caldera (Italy).
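The core of the Fourier pseudo-spectral method is differentiation in wavenumber space: transform, multiply by i*k, transform back. A minimal periodic 1D sketch (Python with NumPy; the production code adds staggered-grid phase shifts, attenuation, and absorbing boundaries on top of this):

```python
import numpy as np

def spectral_derivative(f, length):
    """Differentiate a periodic field sampled on a uniform grid with
    the Fourier pseudo-spectral method: FFT, multiply by i*k, inverse
    FFT.  Exact for band-limited fields, which is why coarse grids
    suffice compared with finite differences."""
    n = f.size
    # Angular wavenumbers matching numpy's FFT ordering.
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    return np.real(np.fft.ifft(1j * k * np.fft.fft(f)))
```

On a staggered grid, the same multiplication picks up an extra half-cell phase factor exp(i*k*dx/2), which is the feature mentioned above.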

Achieving High Performance on BlueGene/L: Preliminary Experience

Manish Gupta, IBM

The BlueGene/L (BG/L) supercomputer proposes to deliver new levels of computational performance with a combination of single-node performance and high scalability. To help achieve good single-node performance, the BG/L design includes a variety of architectural features, like two processors per node and dual floating-point units per processor. BG/L also relies on regular interconnection networks, like a torus and a tree, to achieve almost unlimited scalability. We demonstrate how applications can take advantage of these architectural features to get the most out of BG/L. We show that code can be structured in a way to better utilize BG/L's dual floating-point units and its 128-bit wide memory bus. We also show how to take advantage of the two processors in every node, either by offloading computations from the main processor to an auxiliary processor or by operating each processor independently. Finally, we show that further gains are possible by properly mapping applications to the machine topology. We can get significant performance boosts sometimes simply by changing the assignment of tasks to nodes. We present experimental results obtained on the 512 node prototype BG/L system to demonstrate our approach.
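The effect of task-to-node mapping can be made concrete with a toy torus model (illustrative Python; the rank ordering shown is one of many possible assignments, not BG/L's actual scheme). Tasks that communicate heavily should receive ranks that land few hops apart on the torus:

```python
def rank_to_torus(rank, dims):
    """Map a task rank to (x, y, z) coordinates on a 3D torus,
    x varying fastest (one possible ordering)."""
    X, Y, Z = dims
    return rank % X, (rank // X) % Y, rank // (X * Y)

def torus_hops(a, b, dims):
    """Hop count between two torus coordinates: per-axis distance
    with wrap-around, summed (torus links close each ring)."""
    return sum(min((ai - bi) % d, (bi - ai) % d)
               for ai, bi, d in zip(a, b, dims))
```

For example, on an 8 x 8 x 8 torus, nodes at opposite ends of an axis are a single hop apart thanks to the wrap-around link, so simply renumbering tasks so that nearest-neighbour communication follows the torus axes can cut the average message distance substantially.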