Hardware
The three high-performance computers available at CESCA are:
SGI Altix 3700 Bx2: 128 Itanium2 processors (1.6 GHz, 16 KB/256 KB/6 MB), 384 GB of main memory, 5.99 TB of hard disk, with an Rpeak of 819.20 Gflop/s and an estimated Rmax of 720.60 Gflop/s.
HP CP4000: 33 DL145 G2 nodes (2 dual-core AMD64 Opteron 275, 2.2 GHz, 64 KB/1 MB per core), 528 GB of main memory, 9.41 TB of hard disk, with an Rpeak of 580.80 Gflop/s and an estimated Rmax of 365.91 Gflop/s, interconnected by 3 Gigabit Ethernet networks (one external and two internal; of the latter, one is for computing and the other for management).
Bull NovaScale: 28 R422E1 nodes (2 quad-core Xeon E5472, 3.0 GHz, 64 KB/3 MB per core), 896 GB of main memory, 31.72 TB of hard disk, with an Rpeak of 2.68 Tflop/s and an estimated Rmax of 2.24 Tflop/s, interconnected by 3 networks: 2 Gigabit Ethernet (one for management and one for services) and one InfiniBand for computing.
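The Rpeak figures above can be reproduced as cores × clock frequency × floating-point operations per cycle. The following is a minimal sketch of that arithmetic. The flops-per-cycle values (4 for Itanium2, 2 for Opteron 275, 4 for Xeon E5472) are assumptions that do not appear in the text; they were chosen because they match the quoted Rpeak values. The function names are illustrative only.

```python
# Sketch: reproduce the quoted Rpeak values and derive Rmax/Rpeak efficiency.
# "flops_per_cycle" is an ASSUMPTION (not stated in the text), chosen so that
# the computed Rpeak agrees with the figures quoted for each machine.

MACHINES = {
    "SGI Altix 3700 Bx2": {"cores": 128, "ghz": 1.6,
                           "flops_per_cycle": 4, "rmax_gflops": 720.60},
    # 33 nodes x 2 chips x 2 cores
    "HP CP4000": {"cores": 33 * 2 * 2, "ghz": 2.2,
                  "flops_per_cycle": 2, "rmax_gflops": 365.91},
    # 28 nodes x 2 chips x 4 cores; Rmax quoted as 2.24 Tflop/s
    "Bull NovaScale": {"cores": 28 * 2 * 4, "ghz": 3.0,
                       "flops_per_cycle": 4, "rmax_gflops": 2240.0},
}


def rpeak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak in Gflop/s: every core retires its maximum each cycle."""
    return cores * ghz * flops_per_cycle


def efficiency(rmax_gflops, rpeak):
    """Fraction of the theoretical peak reached by the Linpack result."""
    return rmax_gflops / rpeak


if __name__ == "__main__":
    for name, m in MACHINES.items():
        peak = rpeak_gflops(m["cores"], m["ghz"], m["flops_per_cycle"])
        eff = efficiency(m["rmax_gflops"], peak)
        print(f"{name}: Rpeak = {peak:.2f} Gflop/s, Rmax/Rpeak = {eff:.1%}")
```

Under these assumptions the Bull NovaScale peak comes out as 2688 Gflop/s, which agrees with the quoted 2.68 Tflop/s if that figure was truncated rather than rounded.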
All the computers have superscalar processors but differ in memory access: Bull NovaScale and HP CP4000 provide distributed memory, while SGI Altix 3700 uses shared memory.
In the Altix 3700, two processors form a node, and the processor-to-memory connection within the node is made through a SHub 1.2 ASIC. Each SHub provides a peak memory bandwidth of 10.2 GB/s for local memory. When memory on other nodes must be accessed, the connection is made via two 6.4 GB/s NUMAlink 4 channels. These connections give latencies ranging from 129 ns (local memory) to 559 ns in the worst case (remote memory).
In the HP CP4000, each AMD64 Opteron 275 chip has its own embedded memory controller. This controller is connected within the chip through an internal crossbar, which is used to build the system structure. The memory access bandwidth of each controller is 6.4 GB/s, so a DL145 G2 node, which is composed of two chips, can reach a total combined bandwidth of 12.8 GB/s. The latency of a core accessing its own chip's memory is 60 ns, and 90 ns when accessing the other chip's memory.
In the Bull NovaScale, each node has two Xeon E5472 sockets, with one dedicated bus per socket. This gives every quad-core socket its own dedicated bandwidth to the rest of the system, avoiding interference with the other socket. The bandwidth achieved per socket is 10.5 GB/s, as the bus operates at 1,600 MHz. Memory read latency is 98 ns.
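To illustrate how the local and remote latencies quoted above interact, the sketch below computes an average memory latency as a weighted mix of the two. This is a deliberately simplified model and an assumption of this example: it ignores caches, contention and intermediate distances, and it treats the Altix worst-case figure (559 ns) as the latency of every remote access. The function name is hypothetical.

```python
# Sketch: average memory latency as a function of the fraction of accesses
# that hit local memory. Simplified linear model (ASSUMPTION): no caches,
# no contention, and a single "remote" latency per machine.

def average_latency_ns(local_ns, remote_ns, local_fraction):
    """Weighted mean of local and remote latency, in nanoseconds."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Latencies quoted in the text: (local, remote) in ns.
ALTIX_3700 = (129.0, 559.0)   # remote value is the quoted worst case
CP4000_NODE = (60.0, 90.0)    # own chip's memory vs. the other chip's memory

if __name__ == "__main__":
    for fraction in (1.0, 0.9, 0.5):
        altix = average_latency_ns(*ALTIX_3700, fraction)
        cp4000 = average_latency_ns(*CP4000_NODE, fraction)
        print(f"local fraction {fraction:.0%}: "
              f"Altix {altix:.1f} ns, CP4000 node {cp4000:.1f} ns")
```

Even in this crude model, moving from 100% to 90% local accesses on the Altix raises the average from 129 ns to 172 ns, which is why data placement matters on such systems.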
Technical features and performance of the processors

| | SGI Altix Itanium2 | HP CP4000 Opteron 275 | Bull NovaScale Xeon E5472 |
|---|---|---|---|
| Frequency (GHz) | 1.6 | 2.2 | 3.0 |
| Data bus width | 128 | 128 | 128 |
| Cache (L1 KB/L2 MB/L3 MB) | 16/0.25/6 | 128/2/- | 256/12/- |
| Rpeak (Gflop/s) | 6.4 | 8.8 | 48.0 |
| LINPACK TPP (Gflop/s) | 5.94 | 7.15 | 4.60 |
| LINPACK 100x100 (Gflop/s) | 1.77 | 1.60 | 1.30 |
| SPECint2000/2006 | 1441/- | 1515/- | -/26.50 |
| SPECfp2000/2006 | 2647/- | 1830/- | -/23.40 |

Data per processor. Note that the Opteron 275 is a dual-core processor and the Xeon E5472 is a quad-core processor.
All these systems are supported by the Data Storage Service hardware.
Glossary
Superscalar processors can start the execution of several scalar instructions in parallel, so that several data items (for example, different elements of a vector) can be handled during a single clock cycle.
With shared memory access, only a single address space exists for all processors. Programming becomes easier, since data can be stored in any segment of memory and all processors can access it uniformly.
With distributed memory access, each processor has access only to its own local memory. Programming becomes more complex because, when the data a processor needs reside in another processor's address space, they must be requested and transferred via a message-passing protocol. To minimize inter-processor communication, and thus achieve better performance, the proportion of references to local memory must be increased. An advantage of this architecture is its scalability: the system can exceed the number of processors practical in a shared memory system and is therefore better suited to large parallel computers.
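The effect of data placement on the proportion of local references can be sketched with a small counting model. The example below is an illustration under stated assumptions, not a description of any CESCA system: an array of `n` elements is spread over `nprocs` processors, and the owner of each element reads that element and its immediate neighbours. The function names and the two distributions (block and cyclic) are choices made for this example.

```python
# Sketch: count local vs. remote references for two ways of distributing an
# array over processors. Every remote reference would require a message in a
# distributed memory system. All names here are hypothetical.

def block_owner(index, n, nprocs):
    """Block distribution: contiguous chunks of ceil(n / nprocs) elements."""
    block = -(-n // nprocs)  # ceiling division
    return index // block


def cyclic_owner(index, n, nprocs):
    """Cyclic distribution: element i lives on processor i mod nprocs."""
    return index % nprocs


def count_references(n, nprocs, offsets, owner):
    """Return (local, remote) reference counts.

    The owner of element i reads elements i + d for every d in offsets,
    skipping indices that fall outside the array.
    """
    local = remote = 0
    for i in range(n):
        me = owner(i, n, nprocs)
        for d in offsets:
            j = i + d
            if 0 <= j < n:
                if owner(j, n, nprocs) == me:
                    local += 1
                else:
                    remote += 1
    return local, remote


if __name__ == "__main__":
    stencil = (-1, 0, 1)  # each element plus its left and right neighbours
    print("block :", count_references(16, 4, stencil, block_owner))
    print("cyclic:", count_references(16, 4, stencil, cyclic_owner))
```

With 16 elements on 4 processors, the block distribution leaves only the 6 references that cross chunk boundaries remote, whereas the cyclic distribution makes every neighbour reference remote.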
A third type of memory organization exists, known as distributed shared memory (DSM), which combines the advantages of the two mentioned above. Memory is physically distributed, so scalability is not lost, yet there is only a single memory address space, which makes programming easier.
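As a rough illustration of the DSM idea, the sketch below models a single global address space laid over physically separate memory banks. It is a toy model under assumptions of this example: addresses are split evenly and contiguously across nodes, and a counter records how many accesses leave the requesting node. The class and method names are hypothetical and do not correspond to any real DSM implementation.

```python
# Sketch: one global address space mapped onto per-node memory banks.
# The caller uses plain addresses; the model resolves which node holds the
# data and counts the accesses that are remote. Toy model, names hypothetical.

class DistributedSharedMemory:
    def __init__(self, nodes, words_per_node):
        self.nodes = nodes
        self.words_per_node = words_per_node
        # One physically separate bank per node.
        self.banks = [[0] * words_per_node for _ in range(nodes)]
        self.remote_accesses = 0

    def locate(self, address):
        """Translate a global address into (node, offset within that node)."""
        if not 0 <= address < self.nodes * self.words_per_node:
            raise IndexError("address outside the global address space")
        return divmod(address, self.words_per_node)

    def _touch(self, address, from_node):
        node, offset = self.locate(address)
        if node != from_node:
            self.remote_accesses += 1  # data lives in another node's bank
        return node, offset

    def store(self, address, value, from_node):
        node, offset = self._touch(address, from_node)
        self.banks[node][offset] = value

    def load(self, address, from_node):
        node, offset = self._touch(address, from_node)
        return self.banks[node][offset]


if __name__ == "__main__":
    dsm = DistributedSharedMemory(nodes=4, words_per_node=8)
    dsm.store(19, 42, from_node=0)      # address 19 is on node 2: remote
    print(dsm.load(19, from_node=2))    # same address, now a local access
    print("remote accesses:", dsm.remote_accesses)
```

The program sees one flat range of addresses, which is the programming convenience of shared memory, while the remote-access counter shows that the cost of an access still depends on where the data physically resides.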
In order to optimize a supercomputer's performance, one of the factors that should be considered is the cache memory size available to the processor:
- Altix 3700: the Itanium2 processor has 16 KB of L1, 256 KB of L2 and 6 MB of L3.
- CP4000: the Opteron 275 processor has 64 KB of L1 for data and 1 MB of L2 per core.
- Bull NovaScale: the Xeon E5472 processor has 256 KB of L1 and 12 MB of L2.
Rmax is the best performance obtained with the parallel "Linpack" benchmark, which solves a dense system of linear equations, over runs with different problem sizes. Nmax is the problem size at which Rmax is achieved.
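The relationship between Rmax and Nmax can be sketched in a few lines. The example below is a toy stand-in for the real benchmark: it times a plain Gaussian elimination written in pure Python for a few small problem sizes, converts each timing to Gflop/s using the operation count conventionally credited to Linpack (2/3·n³ + 2·n²), and reports the best rate and the size that produced it. The solver, the problem sizes and all function names are assumptions of this example; the rates it prints say nothing about the machines described here.

```python
import random
import time


def linpack_flops(n):
    """Operation count conventionally credited for an n x n dense solve."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Modifies a and b in place; assumes the matrix is non-singular.
    """
    n = len(b)
    for k in range(n):
        # Partial pivoting: move the largest entry of column k into row k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        row_k = a[k]
        for i in range(k + 1, n):
            row_i = a[i]
            factor = row_i[k] / row_k[k]
            for j in range(k, n):
                row_i[j] -= factor * row_k[j]
            b[i] -= factor * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def measure(sizes, seed=0):
    """Return {n: Gflop/s} for random dense systems of each size."""
    rng = random.Random(seed)
    results = {}
    for n in sizes:
        a = [[rng.random() for _ in range(n)] for _ in range(n)]
        b = [rng.random() for _ in range(n)]
        start = time.perf_counter()
        solve_dense(a, b)
        elapsed = time.perf_counter() - start
        results[n] = linpack_flops(n) / elapsed / 1e9
    return results


def rmax_nmax(results):
    """Best rate over all sizes (Rmax) and the size that achieved it (Nmax)."""
    nmax = max(results, key=results.get)
    return results[nmax], nmax


if __name__ == "__main__":
    rates = measure([50, 100, 150])
    rmax, nmax = rmax_nmax(rates)
    print(f"Rmax = {rmax:.4f} Gflop/s at Nmax = {nmax}")
```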


