#### Parallelism in Mainstream Enterprise Platforms of the Future

#### **Dileep Bhandarkar**

Architect at Large, Enterprise Platforms Group, Intel Corporation

September 23<sup>rd</sup>, 2002







## Outline

Semiconductor Technology Evolution
Moore's Law Video
Parallelism in Microprocessors Today
Multiprocessor Systems
The Billion Transistor Chip
Summary





©2002, Intel Corporation Intel, the Intel logo, Pentium, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries \*Other names and brands may be claimed as the property of others

#### Birth of the Revolution: The Intel 4004





Introduced November 15, 1971

- 108 kHz
- 50 KIPS
- 2,300 transistors, 10 µm process



#### 2001 – Pentium® 4 Processor

Introduced November 20, 2000 @ 1.5 GHz core, 400 MT/s bus

August 27, 2001 @ 2 GHz, 400 MT/s bus

- 640 SPECint\_base2000
- 704 SPECfp\_base2000

42 million transistors, 0.18 µm process







#### **30 Years of Progress**

4004 to Pentium® 4 processor

- Transistor count: 20,000x increase
- Frequency: 20,000x increase
- 39% compound annual growth rate
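The 39% figure is what a roughly 20,000x gain compounds to over the 30 years between the two processors. A minimal Python sketch of that arithmetic follows; the function name `cagr` is illustrative, and the endpoints (2,300 transistors at 108 kHz in 1971, 42 million transistors at 2 GHz in 2001) are taken from the preceding slides.

```python
def cagr(growth_factor: float, years: float) -> float:
    """Compound annual growth rate for a total growth factor over a span of years."""
    return growth_factor ** (1.0 / years) - 1.0


# Endpoints from the 4004 (1971) and Pentium 4 (2001) slides.
transistor_growth = 42_000_000 / 2_300   # a little over 18,000x
frequency_growth = 2.0e9 / 108.0e3       # about 18,500x

if __name__ == "__main__":
    cases = [
        ("rounded 20,000x", 20_000.0),
        ("transistor count", transistor_growth),
        ("frequency", frequency_growth),
    ]
    for label, factor in cases:
        print(f"{label}: {factor:,.0f}x -> {cagr(factor, 30):.1%} per year")
```

Using the unrounded endpoints gives a rate slightly below 39%, which suggests the slide rounds the growth factor up to 20,000x.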





#### 2002 – Pentium® 4 Processor

August 26, 2002 @ 2.8 GHz, 533 MT/s bus

- 976 SPECint\_base2000
- 915 SPECfp\_base2000
- 55 million transistors, 130 nm process







### Itanium<sup>®</sup> 2 Processor Overview

- 0.18 µm bulk, 6-layer Al process
- 8-stage, fully stalled in-order pipeline
- Symmetric six-integer-issue design
- IA-32 execution engine integrated
- 3 levels of cache on-die totaling 3.3 MB
- 221 million transistors
- 130 W @ 1 GHz, 1.5 V




### **Continuing at this Rate by End of the Decade**





Billion Transistors by 2005



"If the automobile industry advanced as rapidly as the semiconductor industry, a Rolls Royce would get 1/2 million miles per gallon and it would be cheaper to throw it away than to park it."

> Gordon Moore, Intel Corporation





#### **Semiconductor Manufacturing Process Evolution**

|                 |         | Actual  |         |         | Forecast |        |        |
|-----------------|---------|---------|---------|---------|----------|--------|--------|
| Process name    | P852    | P854    | P856    | P858    | Px60     | P1262  | P1264  |
| Production      | 1993    | 1995    | 1997    | 1999    | 2001     | 2003   | 2005   |
| Generation      | 0.50 µm | 0.35 µm | 0.25 µm | 0.18 µm | 130 nm   | 90 nm  | 65 nm  |
| Gate length     | 0.50 µm | 0.35 µm | 0.20 µm | 0.13 µm | <70 nm   | <50 nm | <35 nm |
| Wafer size (mm) | 200     | 200     | 200     | 200     | 200/300  | 300    | 300    |

New generation every 2 years
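The cadence in the table can be checked numerically. The sketch below computes the linear shrink between successive generations from the table's own values; the idea that area scales with the square of the linear ratio is an idealized assumption of mine, not something the slide states.

```python
# Process generations in nanometres, read from the table (0.50 µm through 65 nm).
GENERATIONS_NM = [500, 350, 250, 180, 130, 90, 65]


def shrink_ratios(nodes):
    """Ratio of each generation's feature size to the previous generation's."""
    return [new / old for old, new in zip(nodes, nodes[1:])]


linear_ratios = shrink_ratios(GENERATIONS_NM)
# Idealized assumption: the same layout shrinks in both dimensions.
area_ratios = [r * r for r in linear_ratios]

if __name__ == "__main__":
    for (old, new), lin, area in zip(
        zip(GENERATIONS_NM, GENERATIONS_NM[1:]), linear_ratios, area_ratios
    ):
        print(f"{old} nm -> {new} nm: linear x{lin:.2f}, area x{area:.2f}")
```

Each step comes out close to a 0.7x linear shrink, so under the idealized assumption a given design would occupy about half the area every two years.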





## Outline

Semiconductor Technology Evolution
Moore's Law Video
Parallelism in Microprocessors Today
Multiprocessor Systems
The Billion Transistor Chip
Summary



PACT-2002 Keynote


#### Moore's Law



The experts look ahead

#### Cramming more components onto integrated circuits

With unit cost falling as the number of components per circuit rises, by 1975 economics may dictate squeezing as many as 65,000 components on a single silicon chip

By Gordon E. Moore, Director, Research and Development Laboratories, Fairchild Semiconductor division of Fairchild Camera and Instrument Corp.

The future of integrated electronics is the future of electronics itself. The advantages of integration will bring about a proliferation of electronics, pushing this science into many new areas.

Integrated circuits will lead to such wonders as home computers—or at least terminals connected to a central computer—automatic controls for automobiles, and personal portable communications equipment. The electronic wristwatch needs only a display to be feasible today.

But the biggest potential lies in the production of large systems. In telephone communications, integrated circuits in digital filters will separate channels on multiplex equipment. Integrated circuits will also switch telephone circuits and perform data processing.

Computers will be more powerful, and will be organized in completely different ways. For example, memories built of integrated electronics may be distributed throughout the machine instead of being concentrated in a central unit. In addition, the improved reliability made possible by integrated circuits will allow the construction of larger processing units. Machines similar to those in existence today will be built at lower costs and with faster turn-around.

#### Present and future

By integrated electronics, I mean all the various technologies which are referred to as microelectronics today as well as any additional ones that result in electronics functions supplied to the user as irreducible units. These technologies were first investigated in the late 1950's. The object was to miniaturize electronics equipment to include increasingly complex electronic functions in limited space with minimum weight. Several approaches evolved, including microassembly techniques for individual components, thin-film structures and semiconductor integrated circuits.

Each approach evolved rapidly and converged so that each borrowed techniques from another. Many researchers believe the way of the future to be a combination of the various approaches.

The advocates of semiconductor integrated circuitry are already using the improved characteristics of thin-film resistors by applying such films directly to an active semiconductor substrate. Those advocating a technology based upon films are developing sophisticated techniques for the attachment of active semiconductor devices to the passive film arrays.

Both approaches have worked well and are being used in equipment today.

#### The author

Dr. Gordon E. Moore is one of the new breed of electronic engineers, schooled in the physical sciences rather than in electronics. He earned a B.S. degree in chemistry from the University of California and a Ph.D. degree in physical chemistry from the California Institute of Technology. He was one of the founders of Fairchild Semiconductor and has been director of the research and development laboratories since 1959.

Electronics, Volume 38, Number 8, April 19, 1965





## Outline

Semiconductor Technology Evolution
Moore's Law Video
Parallelism in Microprocessors Today
Multiprocessor Systems
The Billion Transistor Chip
Summary






#### Parallelism at Multiple Levels

- Within a processor
  - Multiple-issue processors with lots of execution units
  - Wider superscalar
  - Explicit parallelism
- Multiple processors on a chip
  - Hardware multithreading
  - Multiple cores
- System-level multiprocessors



# Explicitly Parallel Instruction Computing

- Enable wide execution by providing processor implementations that the compiler can take advantage of
- Performance through parallelism
  - Multiple execution units and issue ports in parallel
  - 2 bundles (up to 6 instructions) dispatched every cycle
- Massive on-chip resources
  - 128 general registers, 128 floating-point registers
  - 64 predicate registers, 8 branch registers
  - Exploit parallelism
  - Efficient management engines (register stack engine)
- Provide features that enable the compiler to reschedule programs using advanced features (predication, speculation)

Enable, enhance, express, and exploit parallelism



### **Instruction Formats: Bundles**

| Bits 127-87                  | Bits 86-46                   | Bits 45-5                    | Bits 4-0          |
|------------------------------|------------------------------|------------------------------|-------------------|
| Instruction Slot 2 (41 bits) | Instruction Slot 1 (41 bits) | Instruction Slot 0 (41 bits) | Template (5 bits) |

Total bundle width: 128 bits

- Template identifies types of instructions in bundle and delineates independent operations (through "stops")
- Instruction types
  - M: Memory
  - I: Shifts and multimedia
  - A: ALU
  - B: Branch
  - F: Floating point
  - L+X: Long

#### Template encodes types

- MII, MLX, MMI, MFI, MMF, MI\_I, M\_MI
- Branch: MIB, MMB, MFB, MBB, BBB
- Template encodes parallelism
  - All come in two flavors: with and without stop at end
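The bundle layout above (three 41-bit slots plus a 5-bit template in the low bits) can be sketched as plain bit-field arithmetic. This is an illustrative Python sketch only: the function names are hypothetical, the bundle is modelled as a single 128-bit integer, and no attempt is made to map template values to the instruction-type combinations listed on the slide.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def unpack_bundle(bundle: int):
    """Split a 128-bit bundle into (template, slot0, slot1, slot2)."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & TEMPLATE_MASK       # bits 4..0
    slot0 = (bundle >> 5) & SLOT_MASK       # bits 45..5
    slot1 = (bundle >> 46) & SLOT_MASK      # bits 86..46
    slot2 = (bundle >> 87) & SLOT_MASK      # bits 127..87
    return template, slot0, slot1, slot2


def pack_bundle(template: int, slot0: int, slot1: int, slot2: int) -> int:
    """Inverse of unpack_bundle: assemble the four fields into one integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
    return template | (slot0 << 5) | (slot1 << 46) | (slot2 << 87)


if __name__ == "__main__":
    fields = (0x10, 1, 2, 3)  # arbitrary example values, not real encodings
    assert unpack_bundle(pack_bundle(*fields)) == fields
    print("field widths sum to", TEMPLATE_BITS + 3 * SLOT_BITS, "bits")
```

The widths add up exactly (5 + 3 x 41 = 128), so a bundle has no spare bits.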





### Itanium<sup>®</sup> 2 Processor Architecture



#### **Processor Structure**

#### McKinley Block Diagram



### Integer & FP Performance

#### SPECint\_base2000 and SPECfp\_base2000



### Long-Latency DRAM Accesses Need Memory-Level Parallelism (MLP)



### Multithreading



- Introduced on Intel<sup>®</sup> Xeon<sup>™</sup> Processor MP
- Two logical processors for < 5% additional die area
- Executes two tasks simultaneously
  - Two different applications
  - Two threads of same application
- CPU maintains architecture state for two processors
  - Two logical processors per physical processor
- Power-efficient performance gain
- 20-30% performance improvement on many throughput-oriented workloads





#### **IBM Power4 Dual Processor on a Chip**






#### HP PA-8800 Dual Processor on a Chip






## Outline

Semiconductor Technology Evolution
Moore's Law Video
Parallelism in Microprocessors Today
Multiprocessor Systems
The Billion Transistor Chip
Summary






#### Large Multiprocessor Systems

32 and 64 processor systems available today
300K to 400K transactions per minute
>100 Linpack Gigaflops





### **IBM eServer pSeries 690**

#### 4-module, 32-way SMP System



1.3 GHz Power4, 8 to 32 CPUs

- Starting at \$450,000
- 8-way MCM @ \$275,000\*\*
- 403,255 tpmC @ \$17.80 per tpmC
- 95 Linpack Gflops



\*\*\*http://www-132.ibm.com/content/home/store\_IBMPublicUSA/en\_US/eServer/pSeries/high\_end/pSeries\_highend.html

\*\*Source: http://www.tpc.org/results/individual\_results/IBM/IBMp690es\_08142002.pdf

### NEC TX7/i9510 SMP Server

- Up to 32 Itanium<sup>®</sup> 2 processors
- Up to 512 GB memory (with 2 GB DIMMs)
- Up to 112 PCI-X I/O slots
- Low-latency, high-bandwidth crossbar interconnect
- Inter-cell memory interleaving
- ECC-protected data transfer
- 308,620 tpmC @ \$14.96 per tpmC
- 101 Linpack Gigaflops
- 32 processors + 256 GB @ \$1,397,152\*\*



\*\*http://www.tpc.org/results/individual\_results/NEC/nec.tx7.i9510.c5.020909.es.pdf



### **HP Superdome**

Superdome is a cell-based, hierarchical crossbar system.



### **64P Performance**

875 MHz PA-RISC 8700

- 423,414 tpmC @ \$15.64 per tpmC
- 134 Linpack Gigaflops
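TPC-C price/performance is the total priced system cost divided by throughput, so multiplying the two quoted numbers gives an approximate total cost for each benchmark configuration. The sketch below does this for the three results quoted in this section. The dictionary layout and function names are illustrative, the processor counts assume each result was run on the configuration named on its slide, and the products are approximate because the published \$/tpmC values are rounded.

```python
# (tpmC, dollars per tpmC, processors) as quoted on the IBM, NEC and HP slides.
RESULTS = {
    "IBM eServer pSeries 690": (403_255, 17.80, 32),
    "NEC TX7/i9510": (308_620, 14.96, 32),
    "HP Superdome": (423_414, 15.64, 64),
}


def implied_system_cost(tpmc: float, dollars_per_tpmc: float) -> float:
    """Approximate total priced configuration: throughput times price/performance."""
    return tpmc * dollars_per_tpmc


def tpmc_per_processor(tpmc: float, processors: int) -> float:
    """Throughput divided evenly across the processors in the system."""
    return tpmc / processors


if __name__ == "__main__":
    for name, (tpmc, price, cpus) in RESULTS.items():
        cost = implied_system_cost(tpmc, price)
        per_cpu = tpmc_per_processor(tpmc, cpus)
        print(f"{name}: ~${cost:,.0f} total, ~{per_cpu:,.0f} tpmC per processor")
```

Seen this way, the 64-processor system posts the highest total throughput while the 32-processor systems deliver more throughput per processor.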



A cell consists of:

- 4 CPUs
- 2 to 16 GB of memory
- A link to 12 PCI I/O slots

Cell board with 4 PA-8700 875 MHz processors @ \$10.080\*\* (2 chassis @ \$424,275\*\*)

#### the crossbar mesh: interconnect fabric

#### fully-connected crossbar mesh

- four crossbars
- four cells per crossbar
- all links have equal bandwidth and latency
  - minimizes latency
  - maximizes usable bandwidth
- implements point-to-point packet filtering and routing network
  - allows hardware isolation of all faults
- interconnect 16 cells with 3 latency domains
  - cell local
  - crossbar local
  - remote crossbar
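The three latency domains follow directly from the hierarchy of four crossbars with four cells each. The sketch below classifies a memory access by the cells involved. The cell numbering (0 to 15, with consecutive groups of four sharing a crossbar) is a hypothetical convention chosen for illustration, not HP's actual addressing scheme.

```python
CROSSBARS = 4
CELLS_PER_CROSSBAR = 4
CPUS_PER_CELL = 4

TOTAL_CELLS = CROSSBARS * CELLS_PER_CROSSBAR   # 16 cells
TOTAL_CPUS = TOTAL_CELLS * CPUS_PER_CELL       # 64 processors


def latency_domain(requesting_cell: int, home_cell: int) -> str:
    """Name the latency domain for an access from one cell to memory on another.

    Assumes cells are numbered 0..15 and that cells 4k..4k+3 share crossbar k.
    """
    for cell in (requesting_cell, home_cell):
        if not 0 <= cell < TOTAL_CELLS:
            raise ValueError(f"cell index {cell} out of range")
    if requesting_cell == home_cell:
        return "cell local"
    if requesting_cell // CELLS_PER_CROSSBAR == home_cell // CELLS_PER_CROSSBAR:
        return "crossbar local"
    return "remote crossbar"


if __name__ == "__main__":
    print(TOTAL_CELLS, "cells,", TOTAL_CPUS, "CPUs")
    for pair in [(0, 0), (0, 3), (0, 4)]:
        print(pair, "->", latency_domain(*pair))
```

With four CPUs per cell, the 16 cells account for the 64 processors in the 64P configuration.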



### HPC Clusters



Commercial Off The Shelf (COTS) components

- Processors
- Packaging
- Interconnects
- Operating systems






Tru64 UNIX

## Outline

Semiconductor Technology Evolution
Moore's Law Video
Parallelism in Microprocessors Today
Multiprocessor Systems
The Billion Transistor Chip
Summary






#### Parallelism Design Space

#### With Each Process Generation

- Frequency increases by about 1.5X
- Vcc will scale by only ~0.8
- Active power will scale by ~0.9
- Active power density will increase by ~30-80%
- Leakage power will make it even worse
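One way to read the power-density bullet: density is power divided by area, so if active power only falls to about 0.9x while the die area of the same design shrinks, density rises. The sketch below works through that ratio. The area factors of 0.5x (an ideal shrink) and 0.7x (a less aggressive shrink) are my assumptions for illustration; the slide gives only the resulting 30-80% range.

```python
ACTIVE_POWER_SCALE = 0.9  # per-generation active power factor, from the slide


def power_density_scale(active_power_scale: float, area_scale: float) -> float:
    """Per-generation change in power density (power divided by area)."""
    if area_scale <= 0:
        raise ValueError("area_scale must be positive")
    return active_power_scale / area_scale


# Assumed area factors for the same design moved to the next process generation.
ideal_shrink = power_density_scale(ACTIVE_POWER_SCALE, 0.5)
partial_shrink = power_density_scale(ACTIVE_POWER_SCALE, 0.7)

if __name__ == "__main__":
    print(f"ideal 0.5x area:   density x{ideal_shrink:.2f}")
    print(f"partial 0.7x area: density x{partial_shrink:.2f}")
```

Under these assumed area factors the density increase spans roughly 30% to 80%, which is consistent with the range quoted on the slide.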

#### Doubling performance requires more than 4 times the transistors






#### **1999 Mainstream Microprocessor**



Pentium® III Processor
Integrated 256 KB L2 cache
106 mm² die size
0.18µ process
6 metal layer process
28 million transistors





## **Technology Projection**

|                         | 1999   | 2001   | 2003  | 2005  | 2007  |
|-------------------------|--------|--------|-------|-------|-------|
| Process                 | 180 nm | 130 nm | 90 nm | 65 nm | 50 nm |
| Core area (mm²)         | 50-100 | 25-50  | 12-25 | 6-12  | 3-6   |
| 1 MB cache area (mm²)   | 100    | 50     | 25    | 12    | 6     |
| Cores in ~200 mm²       | 2-4    | 4-8    | 8-16  | 16-32 | 32-64 |
| Cache (MB) in ~200 mm²  | ~2     | ~4     | ~8    | ~16   | ~24   |





#### Art of the Possible

Billion Transistors possible in 2005

- Large die sizes can be built
  - 4 to 6 square centimeters
- What can fit on a single die in 2005?
  - 12 mm<sup>2</sup> per processor
  - 12 mm<sup>2</sup> per MB


| Die size in mm² | 4 cores | 8 cores | 16 cores |
|-----------------|---------|---------|----------|
| 16 MB cache     | 240     | 288     | 384      |
| 32 MB cache     | 432     | 480     | 576      |
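The die sizes in the table are the sum of the two per-unit areas given above: 12 mm² per processor core plus 12 mm² per MB of cache. A short Python sketch reproduces them. It counts only core and cache area; ignoring I/O, interconnect and other overhead is a simplification that the table's numbers appear to share.

```python
MM2_PER_CORE = 12      # from the slide: 12 mm² per processor
MM2_PER_MB_CACHE = 12  # from the slide: 12 mm² per MB of cache


def die_area_mm2(cores: int, cache_mb: int) -> int:
    """Core area plus cache area, with no allowance for other die overhead."""
    return cores * MM2_PER_CORE + cache_mb * MM2_PER_MB_CACHE


if __name__ == "__main__":
    core_counts = (4, 8, 16)
    print("cache".ljust(8) + "".join(f"{c} cores".rjust(10) for c in core_counts))
    for cache_mb in (16, 32):
        row = "".join(str(die_area_mm2(c, cache_mb)).rjust(10) for c in core_counts)
        print(f"{cache_mb} MB".ljust(8) + row)
```

Every configuration lands at or below the 600 mm² upper bound implied by "4 to 6 square centimeters".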



### **CMP** Challenges

- How much thread-level parallelism is there in workloads that are not embarrassingly parallel?
- Ability to generate code with lots of threads
- Thread synchronization
- Operating systems for parallel machines
- Single thread performance
- Power limitations
- On-chip interconnect infrastructure





## Outline

Semiconductor Technology Evolution
Moore's Law Video
Parallelism in Microprocessors Today
Multiprocessor Systems
The Billion Transistor Chip
Summary






#### Summary

- Plenty of opportunities for "parallel programming" in Commercial Off The Shelf Server platforms
- Amount of parallelism in hardware will increase
- Need applications and tools that can exploit parallelism at all levels



