| Author |
| Series |
| Publisher |
| ISBN |
| Summary |
| Description |
| About the book: David E. Culler & Jaswinder Pal Singh with Anoop Gupta: Parallel Computer Architecture: A Hardware/Software Approach. Second Edition. Copyright © 1996 by Morgan Kaufmann Publishers, Inc. Harcourt Asia Pte Ltd, under special arrangement with Morgan Kaufmann, authorizes China Machine Press to print and exclusively distribute this edition, which is the only authorized, complete, and unabridged reproduction of the latest American edition, published and priced for sale in China only, not including Hong Kong SAR and Taiwan. Unauthorized export of this edition is a violation of the Copyright Act; violations of this law are subject to civil and criminal penalties. |
| Table of Contents |
Foreword vi
Preface xviii

1 Introduction 1
1.1 Why Parallel Architecture 4
1.1.1 Application Trends 6
1.1.2 Technology Trends 12
1.1.3 Architectural Trends 14
1.1.4 Supercomputers 21
1.1.5 Summary 23
1.2 Convergence of Parallel Architectures 25
1.2.1 Communication Architecture 25
1.2.2 Shared Address Space 28
1.2.3 Message Passing 37
1.2.4 Convergence 42
1.2.5 Data Parallel Processing 44
1.2.6 Other Parallel Architectures 47
1.2.7 A Generic Parallel Architecture 50
1.3 Fundamental Design Issues 52
1.3.1 Communication Abstraction 53
1.3.2 Programming Model Requirements 53
1.3.3 Communication and Replication 58
1.3.4 Performance 59
1.3.5 Summary 63
1.4 Concluding Remarks 63
1.5 Historical References 66
1.6 Exercises 70

2 Parallel Programs 75
2.1 Parallel Application Case Studies 76
2.1.1 Simulating Ocean Currents 77
2.1.2 Simulating the Evolution of Galaxies 78
2.1.3 Visualizing Complex Scenes Using Ray Tracing 79
2.1.4 Mining Data for Associations 80
2.2 The Parallelization Process 81
2.2.1 Steps in the Process 82
2.2.2 Parallelizing Computation versus Data 90
2.2.3 Goals of the Parallelization Process 91
2.3 Parallelization of an Example Program 92
2.3.1 The Equation Solver Kernel 92
2.3.2 Decomposition 93
2.3.3 Assignment 98
2.3.4 Orchestration under the Data Parallel Model 99
2.3.5 Orchestration under the Shared Address Space Model 101
2.3.6 Orchestration under the Message-Passing Model 108
2.4 Concluding Remarks 116
2.5 Exercises 117

3 Programming for Performance 121
3.1 Partitioning for Performance 123
3.1.1 Load Balance and Synchronization Wait Time 123
3.1.2 Reducing Inherent Communication 131
3.1.3 Reducing the Extra Work 135
3.1.4 Summary 136
3.2 Data Access and Communication in a Multimemory System 137
3.2.1 A Multiprocessor as an Extended Memory Hierarchy 138
3.2.2 Artifactual Communication in the Extended Memory Hierarchy 139
3.2.3 Artifactual Communication and Replication: The Working Set Perspective 140
3.3 Orchestration for Performance 142
3.3.1 Reducing Artifactual Communication 142
3.3.2 Structuring Communication to Reduce Cost 150
3.4 Performance Factors from the Processor's Perspective 156
3.5 The Parallel Application Case Studies: An In-Depth Look 160
3.5.1 Ocean 161
3.5.2 Barnes-Hut 166
3.5.3 Raytrace 174
3.5.4 Data Mining 178
3.6 Implications for Programming Models 182
3.6.1 Naming 184
3.6.2 Replication 184
3.6.3 Overhead and Granularity of Communication 186
3.6.4 Block Data Transfer 187
3.6.5 Synchronization 188
3.6.6 Hardware Cost and Design Complexity 188
3.6.7 Performance Model 189
3.6.8 Summary 189
3.7 Concluding Remarks 190
3.8 Exercises 192

4 Workload-Driven Evaluation 199
4.1 Scaling Workloads and Machines 202
4.1.1 Basic Measures of Multiprocessor Performance 202
4.1.2 Why Worry about Scaling? 204
4.1.3 Key Issues in Scaling 206
4.1.4 Scaling Models and Speedup Measures 207
4.1.5 Impact of Scaling Models on the Equation Solver Kernel 211
4.1.6 Scaling Workload Parameters 213
4.2 Evaluating a Real Machine 215
4.2.1 Performance Isolation Using Microbenchmarks 215
4.2.2 Choosing Workloads 216
4.2.3 Evaluating a Fixed-Size Machine 221
4.2.4 Varying Machine Size 226
4.2.5 Choosing Performance Metrics 228
4.3 Evaluating an Architectural Idea or Trade-off 231
4.3.1 Multiprocessor Simulation 233
4.3.2 Scaling Down Problem and Machine Parameters for Simulation 234
4.3.3 Dealing with the Parameter Space: An Example Evaluation 238
4.3.4 Summary 243
4.4 Illustrating Workload Characterization 243
4.4.1 Workload Case Studies 244
4.4.2 Workload Characteristics 253
4.5 Concluding Remarks 262
4.6 Exercises 263

5 Shared Memory Multiprocessors 269
5.1 Cache Coherence 273
5.1.1 The Cache Coherence Problem 273
5.1.2 Cache Coherence through Bus Snooping 277
5.2 Memory Consistency 283
5.2.1 Sequential Consistency 286
5.2.2 Sufficient Conditions for Preserving Sequential Consistency 289
5.3 Design Space for Snooping Protocols 291
5.3.1 A Three-State (MSI) Write-Back Invalidation Protocol 293
5.3.2 A Four-State (MESI) Write-Back Invalidation Protocol 299
5.3.3 A Four-State (Dragon) Write-Back Update Protocol 301
5.4 Assessing Protocol Design Trade-offs 305
5.4.1 Methodology 306
5.4.2 Bandwidth Requirement under the MESI Protocol 307
5.4.3 Impact of Protocol Optimizations 311
5.4.4 Trade-Offs in Cache Block Size 313
5.4.5 Update-Based versus Invalidation-Based Protocols 329
5.5 Synchronization 334
5.5.1 Components of a Synchronization Event 335
5.5.2 Role of the User and System 336
5.5.3 Mutual Exclusion 337
5.5.4 Point-to-Point Event Synchronization 352
5.5.5 Global (Barrier) Event Synchronization 353
5.5.6 Synchronization Summary 358
5.6 Implications for Software 359
5.7 Concluding Remarks 366
5.8 Exercises 367

6 Snoop-Based Multiprocessor Design 377
6.1 Correctness Requirements 378
6.2 Base Design: Single-Level Caches with an Atomic Bus 380
6.2.1 Cache Controller and Tag Design 381
6.2.2 Reporting Snoop Results 382
6.2.3 Dealing with Write Backs 384
6.2.4 Base Organization 385
6.2.5 Nonatomic State Transitions 385
6.2.6 Serialization 388
6.2.7 Deadlock 390
6.2.8 Livelock and Starvation 390
6.2.9 Implementing Atomic Operations 391
6.3 Multilevel Cache Hierarchies 393
6.3.1 Maintaining Inclusion 394
6.3.2 Propagating Transactions for Coherence in the Hierarchy 396
6.4 Split-Transaction Bus 398
6.4.1 An Example Split-Transaction Design 400
6.4.2 Bus Design and Request-Response Matching 400
6.4.3 Snoop Results and Conflicting Requests 402
6.4.4 Flow Control 404
6.4.5 Path of a Cache Miss 404
6.4.6 Serialization and Sequential Consistency 406
6.4.7 Alternative Design Choices 409
6.4.8 Split-Transaction Bus with Multilevel Caches 410
6.4.9 Supporting Multiple Outstanding Misses from a Processor 413
6.5 Case Studies: SGI Challenge and Sun Enterprise 6000 415
6.5.1 SGI Powerpath-2 System Bus 417
6.5.2 SGI Processor and Memory Subsystems 420
6.5.3 SGI I/O Subsystem 422
6.5.4 SGI Challenge Memory System Performance 424
6.5.5 Sun Gigaplane System Bus 424
6.5.6 Sun Processor and Memory Subsystem 427
6.5.7 Sun I/O Subsystem 429
6.5.8 Sun Enterprise Memory System Performance 429
6.5.9 Application Performance 429
6.6 Extending Cache Coherence 433
6.6.1 Shared Cache Designs 434
6.6.2 Coherence for Virtually Indexed Caches 437
6.6.3 Translation Lookaside Buffer Coherence 439
6.6.4 Snoop-Based Cache Coherence on Rings 441
6.6.5 Scaling Data and Snoop Bandwidth in Bus-Based Systems 445
6.7 Concluding Remarks 446
6.8 Exercises 446

7 Scalable Multiprocessors 453
7.1 Scalability 456
7.1.1 Bandwidth Scaling 457
7.1.2 Latency Scaling 460
7.1.3 Cost Scaling 461
7.1.4 Physical Scaling 462
7.1.5 Scaling in a Generic Parallel Architecture 467
7.2 Realizing Programming Models 468
7.2.1 Primitive Network Transactions 470
7.2.2 Shared Address Space 473
7.2.3 Message Passing 476
7.2.4 Active Messages 481
7.2.5 Common Challenges 482
7.2.6 Communication Architecture Design Space 485
7.3 Physical DMA 486
7.3.1 Node-to-Network Interface 486
7.3.2 Implementing Communication Abstractions 488
7.3.3 A Case Study: nCUBE/2 488
7.3.4 Typical LAN Interfaces 490
7.4 User-Level Access 491
7.4.1 Node-to-Network Interface 491
7.4.2 Case Study: Thinking Machines CM-5 493
7.4.3 User-Level Handlers 494
7.5 Dedicated Message Processing 496
7.5.1 Case Study: Intel Paragon 499
7.5.2 Case Study: Meiko CS-2 503
7.6 Shared Physical Address Space 506
7.6.1 Case Study: CRAY T3D 508
7.6.2 Case Study: CRAY T3E 512
7.6.3 Summary 513
7.7 Clusters and Networks of Workstations 513
7.7.1 Case Study: Myrinet SBUS Lanai 516
7.7.2 Case Study: PCI Memory Channel 518
7.8 Implications for Parallel Software 522
7.8.1 Network Transaction Performance 522
7.8.2 Shared Address Space Operations 527
7.8.3 Message-Passing Operations 528
7.8.4 Application-Level Performance 531
7.9 Synchronization 538
7.9.1 Algorithms for Locks 538
7.9.2 Algorithms for Barriers 542
7.10 Concluding Remarks 548
7.11 Exercises 548

8 Directory-Based Cache Coherence 553
8.1 Scalable Cache Coherence 558
8.2 Overview of Directory-Based Approaches 559
8.2.1 Operation of a Simple Directory Scheme 560
8.2.2 Scaling 564
8.2.3 Alternatives for Organizing Directories 565
8.3 Assessing Directory Protocols and Trade-Offs 571
8.3.1 Data Sharing Patterns for Directory Schemes 571
8.3.2 Local versus Remote Traffic 578
8.3.3 Cache Block Size Effects 579
8.4 Design Challenges for Directory Protocols 579
8.4.1 Performance 584
8.4.2 Correctness 589
8.5 Memory-Based Directory Protocols: The SGI Origin System 596
8.5.1 Cache Coherence Protocol 597
8.5.2 Dealing with Correctness Issues 604
8.5.3 Details of Directory Structure 609
8.5.4 Protocol Extensions 610
8.5.5 Overview of the Origin2000 Hardware 612
8.5.6 Hub Implementation 614
8.5.7 Performance Characteristics 618
8.6 Cache-Based Directory Protocols: The Sequent NUMA-Q 622
8.6.1 Cache Coherence Protocol 624
8.6.2 Dealing with Correctness Issues 632
8.6.3 Protocol Extensions 634
8.6.4 Overview of NUMA-Q Hardware 635
8.6.5 Protocol Interactions with SMP Node 637
8.6.6 IQ-Link Implementation 639
8.6.7 Performance Characteristics 641
8.6.8 Comparison Case Study: The HAL S1 Multiprocessor 643
8.7 Performance Parameters and Protocol Performance 645
8.8 Synchronization 648
8.8.1 Performance of Synchronization Algorithms 649
8.8.2 Implementing Atomic Primitives 651
8.9 Implications for Parallel Software 652
8.10 Advanced Topics 655
8.10.1 Reducing Directory Storage Overhead 655
8.10.2 Hierarchical Coherence 659
8.11 Concluding Remarks 669
8.12 Exercises 672

9 Hardware/Software Trade-Offs 679
9.1 Relaxed Memory Consistency Models 681
9.1.1 The System Specification 686
9.1.2 The Programmer's Interface 694
9.1.3 The Translation Mechanism 698
9.1.4 Consistency Models in Real Multiprocessor Systems 698
9.2 Overcoming Capacity Limitations 700
9.2.1 Tertiary Caches 700
9.2.2 Cache-Only Memory Architectures (COMA) 701
9.3 Reducing Hardware Cost 705
9.3.1 Hardware Access Control with a Decoupled Assist 707
9.3.2 Access Control through Code Instrumentation 707
9.3.3 Page-Based Access Control: Shared Virtual Memory 709
9.3.4 Access Control through Language and Compiler Support 721
9.4 Putting It All Together: A Taxonomy and Simple COMA 724
9.4.1 Putting It All Together: Simple COMA and Stache 726
9.5 Implications for Parallel Software 729
9.6 Advanced Topics 730
9.6.1 Flexibility and Address Constraints in CC-NUMA Systems 730
9.6.2 Implementing Relaxed Memory Consistency in Software 732
9.7 Concluding Remarks 739
9.8 Exercises 740

10 Interconnection Network Design 749
10.1 Basic Definitions 750
10.2 Basic Communication Performance 755
10.2.1 Latency 755
10.2.2 Bandwidth 761
10.3 Organizational Structure 764
10.3.1 Links 764
10.3.2 Switches 767
10.3.3 Network Interfaces 768
10.4 Interconnection Topologies 768
10.4.1 Fully Connected Network 768
10.4.2 Linear Arrays and Rings 769
10.4.3 Multidimensional Meshes and Tori 769
10.4.4 Trees 772
10.4.5 Butterflies 774
10.4.6 Hypercubes 778
10.5 Evaluating Design Trade-Offs in Network Topology 779
10.5.1 Unloaded Latency 780
10.5.2 Latency under Load 785
10.6 Routing 789
10.6.1 Routing Mechanisms 789
10.6.2 Deterministic Routing 790
10.6.3 Deadlock Freedom 791
10.6.4 Virtual Channels 795
10.6.5 Up*-Down* Routing 796
10.6.6 Turn-Model Routing 797
10.6.7 Adaptive Routing 799
10.7 Switch Design 801
10.7.1 Ports 802
10.7.2 Internal Datapath 802
10.7.3 Channel Buffers 804
10.7.4 Output Scheduling 808
10.7.5 Stacked Dimension Switches 810
10.8 Flow Control 811
10.8.1 Parallel Computer Networks versus LANs and WANs 811
10.8.2 Link-Level Flow Control 813
10.8.3 End-to-End Flow Control 816
10.9 Case Studies 818
10.9.1 CRAY T3D Network 818
10.9.2 IBM SP-1, SP-2 Network 820
10.9.3 Scalable Coherent Interface 822
10.9.4 SGI Origin Network 825
10.9.5 Myricom Network 826
10.10 Concluding Remarks 827
10.11 Exercises 828

11 Latency Tolerance 831
11.1 Overview of Latency Tolerance 834
11.1.1 Latency Tolerance and the Communication Pipeline 836
11.1.2 Approaches 837
11.1.3 Fundamental Requirements, Benefits, and Limitations 840
11.2 Latency Tolerance in Explicit Message Passing 847
11.2.1 Structure of Communication 848
11.2.2 Block Data Transfer 848
11.2.3 Precommunication 848
11.2.4 Proceeding Past Communication in the Same Thread 850
11.2.5 Multithreading 850
11.3 Latency Tolerance in a Shared Address Space 851
11.3.1 Structure of Communication 852
11.4 Block Data Transfer in a Shared Address Space 853
11.4.1 Techniques and Mechanisms 853
11.4.2 Policy Issues and Trade-Offs 854
11.4.3 Performance Benefits 856
11.5 Proceeding Past Long-Latency Events 863
11.5.1 Proceeding Past Writes 864
11.5.2 Proceeding Past Reads 868
11.5.3 Summary 876
11.6 Precommunication in a Shared Address Space 877
11.6.1 Shared Address Space without Caching of Shared Data 877
11.6.2 Cache-Coherent Shared Address Space 879
11.6.3 Performance Benefits 891
11.6.4 Summary 896
11.7 Multithreading in a Shared Address Space 896
11.7.1 Techniques and Mechanisms 898
11.7.2 Performance Benefits 910
11.7.3 Implementation Issues for the Blocked Scheme 914
11.7.4 Implementation Issues for the Interleaved Scheme 917
11.7.5 Integrating Multithreading with Multiple-Issue Processors 920
11.8 Lockup-Free Cache Design 922
11.9 Concluding Remarks 926
11.10 Exercises 927

12 Future Directions 935
12.1 Technology and Architecture 936
12.1.1 Evolutionary Scenario 937
12.1.2 Hitting a Wall 940
12.1.3 Potential Breakthroughs 944
12.2 Applications and System Software 955
12.2.1 Evolutionary Scenario 955
12.2.2 Hitting a Wall 960
12.2.3 Potential Breakthroughs 961

Appendix: Parallel Benchmark Suites 963
A.1 ScaLAPACK 963
A.2 TPC 963
A.3 SPLASH 965
A.4 NAS Parallel Benchmarks 966
A.5 PARKBENCH 967
A.6 Other Ongoing Efforts 968

References 969
Index 993