

Hardware

CESCA has at its disposal seven high performance computers:

Hewlett-Packard N4000: 8 processors PA8500 (440 MHz), 4 GB of main memory, 227 GB of hard disk, with a peak performance (Rpeak) of 14.08 Gflop/s and a maximum performance (Rmax) of 10.22 Gflop/s.

Compaq AlphaServer HPC320: 8 nodes ES40 (4 EV68, 833 MHz, 64 KB/8 MB), 28 GB of main memory, 892 GB of hard disk, with Rpeak of 53.31 Gflop/s and Rmax of 40.84 Gflop/s, interconnected by a Memory Channel II of 100 MB/s.

Compaq Beowulf: 8 nodes DS10 (1 EV67, 600 MHz, 64 KB/2 MB), 4 GB of main memory, 291 GB of hard disk, with Rpeak of 9.60 Gflop/s and an estimated Rmax of 7.68 Gflop/s, interconnected by a Myrinet of 2 Gbps.

HP AlphaServer GS1280: 16 processors 21364 EV7 (1,150 MHz, 64 KB/1.75 MB), 32 GB of main memory, 655 GB of hard disk, with Rpeak of 36.80 Gflop/s and Rmax of 31.28 Gflop/s.

HP rx2600: 2 processors Itanium2 (1,000 MHz, 32 KB/256 KB/3 MB), 2 GB of main memory, 146 GB of hard disk, with Rpeak of 8.00 Gflop/s and an estimated Rmax of 7.20 Gflop/s.

SGI Altix 3700 Bx2: 128 processors Itanium2 (1,600 MHz, 16 KB/256 KB/6 MB), 384 GB of main memory, 6.13 TB of hard disk, with Rpeak of 819.20 Gflop/s and an estimated Rmax of 720.60 Gflop/s.

HP CP4000: 16 nodes DL145 G2 (2 AMD64 Opteron 275 dual core, 2.2 GHz, 64 KB/1 MB each core), 256 GB of main memory, 4.56 TB of hard disk, with Rpeak of 281.60 Gflop/s and an estimated Rmax of 177.41 Gflop/s, interconnected by 3 GigabitEthernet networks (one external and two internal; of the latter, one is for computing and the other for management).

All the computers have superscalar processors but differ in memory access: the Compaq systems and the HP CP4000 provide distributed memory, while the rest use shared memory.

Processor-to-memory interconnection in the N4000 is accomplished via two buses that reach a total joint speed of 3.8 GB/s, with a memory access latency of 130 ns.

Processor-to-memory interconnection in ES40 is also done by means of two buses with a total joint speed of 2.67 GB/s.

In the GS1280, each processor has two memory controllers that jointly deliver 12.8 GB/s; the processors are interconnected by a toroidal network, and access latencies are asymmetrical, from 75 ns for local memory to 270 ns for the most remote memory.

In the Altix 3700, two processors form a node, and the processor-to-memory connection within the node goes through an SHub 1.2 ASIC. Each SHub provides a peak memory bandwidth of 10.2 GB/s for local memory. Access to memory on other nodes goes through two 6.4 GB/s NUMAlink 4 channels. These connections give latencies from 129 ns (local memory) to 559 ns in the worst case (remote memory).

In the HP CP4000, each AMD64 Opteron 275 chip has its own embedded memory controller, connected within the chip through an internal crossbar in order to build the system structure. The memory access bandwidth of each controller is 6.4 GB/s, so a DL145 G2 node, which has two chips, reaches a total joint bandwidth of 12.8 GB/s. The latency from a core to its own chip's memory is 60 ns, and to the other chip's memory 90 ns.
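
As a rough illustration of why local memory references matter on the systems above, the local and worst-case remote latencies quoted in this section can be combined into an average access latency. This is only a sketch under a simplifying assumption that the text does not make: every access is either local or worst-case remote. The parameter `local_fraction` and the function name are hypothetical.

```python
# Latencies in nanoseconds quoted in the text: (local, worst-case remote).
LATENCY_NS = {
    "GS1280": (75.0, 270.0),
    "Altix 3700": (129.0, 559.0),
    "CP4000 node": (60.0, 90.0),
}


def average_latency_ns(system: str, local_fraction: float) -> float:
    """Weighted mean latency, assuming accesses are only local or worst-case remote."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    local, remote = LATENCY_NS[system]
    return local_fraction * local + (1.0 - local_fraction) * remote


if __name__ == "__main__":
    for name in LATENCY_NS:
        for fraction in (0.5, 0.9, 1.0):
            latency = average_latency_ns(name, fraction)
            print(f"{name}: {fraction:.0%} local -> {latency:.1f} ns")
```

Under this model, the wider the gap between local and remote latency (as in the Altix 3700), the more a program gains from keeping its data close to the processor that uses it.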


Technical features and performance of processors


                           HP         CPQ        CPQ        HP         HP         SGI        HP
                           N4000      beowulf    HPC320     GS1280     rx2600     Altix      CP4000
                           PA8500     EV67       EV68       EV7        Itanium2   Itanium2   Opteron 275
Frequency (MHz)            440        600        833        1,150      1,000      1,600      2,200
Data bus width             64         64         64         128        128        128        128
Cache (L1 KB/L2 MB/L3 MB)  1,024/-/-  64/2/-     64/8/-     64/1.75/-  32/0.25/3  16/0.25/6  128/2/-
Rtop (Mflop/s)             1,760      1,200      1,666      2,300      4,000      6,400      8,800
LINPACK TPP                1,290      877.5      1,277      1,900      3,528      5,937      7,153
LINPACK 100x100            375        470.8      639        950        1,102      1,765      1,589
SPECint2000                n/a        355        565        900        n/a        1,441      1,515
SPECfp2000                 n/a        400        777        1,450      1,427      2,647      1,830

The performance of the EV67 was evaluated at 616 MHz.
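
The system-wide Rpeak figures quoted in the Hardware section appear to follow from the per-processor Rtop values in the table: Rpeak is the number of processors multiplied by Rtop. The sketch below checks this against the quoted numbers. Processor counts come from the hardware list; for the CP4000 the 32 Opteron chips are counted, on the assumption that the table's Rtop of 8,800 Mflop/s is per dual-core chip. The names used are illustrative only.

```python
# name: (number of processors, Rtop per processor in Mflop/s, quoted Rpeak in Gflop/s)
SYSTEMS = {
    "N4000": (8, 1760, 14.08),
    "Beowulf": (8, 1200, 9.60),
    "HPC320": (8 * 4, 1666, 53.31),        # 8 ES40 nodes x 4 EV68
    "GS1280": (16, 2300, 36.80),
    "rx2600": (2, 4000, 8.00),
    "Altix 3700 Bx2": (128, 6400, 819.20),
    "CP4000": (16 * 2, 8800, 281.60),      # 16 nodes x 2 chips (assumed unit for Rtop)
}


def rpeak_gflops(processors: int, rtop_mflops: float) -> float:
    """Theoretical peak of the whole system in Gflop/s."""
    return processors * rtop_mflops / 1000.0


if __name__ == "__main__":
    for name, (count, rtop, quoted) in SYSTEMS.items():
        computed = rpeak_gflops(count, rtop)
        print(f"{name}: computed {computed:.2f}, quoted {quoted:.2f} Gflop/s")
```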

We also have the following:
  • An automated StorageTek TimberWolf 9740 tape library containing 302 tapes of type 9840, each with a native capacity of 20 GB, and 2 transfer devices 9840, each with a transfer speed of 10 MB/s and a cartridge exchange rate of 350 per hour.
  • StorageWorks MA6000 modular and multibuild disk subsystem, with 985 GB available and 2 RAID FiberChannel HSG60 controllers performing at 1 Gbps.
  • Enterprise Virtual Array V.2 (EVA) disk subsystem, model 2C6D-B, with 10.15 TB (gross capacity) available and 2 FiberChannel HSV110 controllers performing at 2 Gbps.
  • AlphaServer DS25 file server with 2 processors EV68 21264C at 1,000 MHz, 4 GB of main memory, 72.8 GB of disk, 2 GigabitEthernet controllers, 1 ATM controller at 155 Mbps, 1 Fast Ethernet controller at 100 Mbps, 2 Ultra SCSI Wide adapters used to connect with the StorageTek TimberWolf 9740 robot and, finally, 2 PCI FiberChannel adapters at 2 Gbps.
  • HP rp5430 database server with 2 processors PA8700 at 750 MHz, 2.25 MB of L1 cache, 8 GB of main memory, 146 GB of hard disk at 15,000 rpm, 1 GigabitEthernet adapter and 1 PCI FiberChannel adapter at 2 Gbps.
  • HP Workstation xw8000 Pharmacophore Search Server with 2 processors Intel Xeon at 3.06 GHz, 8 KB of L1 cache and 512 KB of L2 cache, 4 GB of main memory, 73 GB of hard disk at 10,000 rpm and 1 GigabitEthernet adapter.
  • A Linux cluster of ten 2-way SMP nodes, model Proliant DL360 G4p, devoted to information resources, with 100 GB of main memory and 360 GB in Ultra320 hard disks. Every node has 2 Intel Xeon processors at 3.0 GHz, 2 MB of L2 cache and an internal 36 GB disk at 15,000 rpm, assigned to the operating system, swap and temporary files. All these nodes are connected to the SAN (EVA 2C6D-B) by means of a FiberChannel adapter at 2 Gbps. Moreover, every node has three GigabitEthernet ports: two are PCI ports with RJ45 connectors and the third has an SC connector.
Glossary

Superscalar processors can start the execution of several scalar instructions in parallel, so that different vector elements can be handled during a single clock cycle. In our case, the PA8500 and Compaq processors are able to start four instructions at once.

When processors share memory, that is, when only a single address space exists, programming becomes easier because data can be stored in any segment of memory and all processors can access it uniformly.

When memory is distributed among processors, that is, when each processor has access only to its own local memory, programming becomes more complex: when the data a processor needs are located in another processor's address space, they must be requested and transferred via a messaging protocol. In order to minimize inter-processor communication, and thus achieve better performance, we must increase the proportion of references to local memory. An advantage of this architecture is its scalability: the system can exceed the number of processors used in a shared memory system and is therefore more suitable for parallel computers.

A third type of memory hierarchy exists, known as distributed shared memory (DSM), which combines the advantages of the two hierarchies mentioned above. Memory is physically distributed, so scalability is not lost; at the same time there is only a single memory address space, which is easy to program.

In order to optimize a supercomputer's performance, one of the factors that should be considered is the cache memory size available to the processor:
  • N4000 (PA8500 processor): 1 MB.
  • AlphaServer HPC320 (EV68 processors in ES40 nodes): 64 KB of L1 and 8 MB of L2.
  • Beowulf (EV67 processors in DS10 nodes): 64 KB of L1 and 2 MB of L2.
  • GS1280 (EV7 processor): 64 KB of L1 and 1.75 MB of L2.
  • rx2600 (Itanium2 processor): 32 KB of L1, 256 KB of L2 and 3 MB of L3.
  • Altix 3700 (Itanium2 processor): 16 KB of L1, 256 KB of L2 and 6 MB of L3.
  • CP4000 (Opteron 275 processor): 64 KB of L1 for data and 1 MB of L2, per core.
A supercomputer's performance is measured in Gflop/s: 1 Gflop/s means that the processor performs 10^9 arithmetic operations (additions or multiplications) per second on real numbers in 64-bit floating point format.

Rmax is the best result obtained with the parallel "Linpack" benchmark (which solves a dense system of linear equations) over different problem sizes. Nmax is the problem size at which Rmax is achieved.
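
To relate these definitions to numbers, the following sketch computes the ratio Rmax/Rpeak for figures quoted in the Hardware section and converts a timed Linpack run into Gflop/s. It assumes the conventional Linpack operation count of 2/3·n^3 + 2·n^2 for a system of order n, which is not stated in this page. The function names and the sample run are hypothetical.

```python
def linpack_operations(n: int) -> float:
    """Conventional floating point operation count for solving a dense n x n system."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def gflops(n: int, seconds: float) -> float:
    """Rate in Gflop/s (1e9 operations per second) for a run of order n."""
    return linpack_operations(n) / seconds / 1e9


def efficiency(rmax: float, rpeak: float) -> float:
    """Fraction of the theoretical peak reached by the benchmark."""
    return rmax / rpeak


if __name__ == "__main__":
    # Rmax and Rpeak values in Gflop/s, as quoted in the Hardware section.
    print(f"N4000 efficiency:  {efficiency(10.22, 14.08):.1%}")
    print(f"HPC320 efficiency: {efficiency(40.84, 53.31):.1%}")
    # Hypothetical run: a system of order 10,000 solved in 100 seconds.
    print(f"Sample run: {gflops(10_000, 100.0):.2f} Gflop/s")
```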

Last update: AG, 15/09/06