This book systematically and thoroughly explains the big data technology stack. It covers the principles, architecture, and practice of the core technologies in the big data ecosystem across six layers: data collection, data storage, resource management and service coordination, computing engines, data analysis, and data visualization. It gives readers both a macro-level view of the overall big data system and a micro-level understanding of the details of each technology. The book follows the life cycle of data in a big data system as its organizing thread. It contains 17 chapters, divided into seven parts:

Part I (Chapter 1): Overview. Introduces the enterprise-level big data technology framework, implementation solutions, and architecture, including Google's big data technology stack and the open source stack represented by Hadoop and Spark.

Part II (Chapters 2-4): Data Collection. Explains big data collection technologies, mainly the relational data collection tools Sqoop and Canal, the non-relational data collection system Flume, and the distributed message queue Kafka.

Part III (Chapters 5-7): Data Storage. Explains big data storage technologies, covering data storage formats, distributed file systems, and distributed databases, including Thrift, Protobuf, Avro, HDFS, and HBase.

Part IV (Chapters 8-9): Distributed Coordination and Resource Management. Explains resource management and service coordination technologies, covering the resource management and scheduling system YARN and the distributed coordination service ZooKeeper.

Part V (Chapters 10-13): Computing Engines. Explains computing engine technologies across three categories: batch processing, interactive processing, and real-time stream processing, including MapReduce, Spark, Impala/Presto, and Storm.

Part VI (Chapters 14-16): Data Analysis. Explains data analysis technologies, covering the data analysis languages HQL and SQL, unified big data programming models, and machine learning libraries.

Part VII (Chapter 17): Application Cases. Presents three enterprise-level big data application cases: the Lambda architecture, a data warehouse built on big data technologies, and a real-time user behavior statistics system.
Contents

Preface

Part I Overview
Chapter 1 Overview of Enterprise-Level Big Data Technology System 2
  1.1 Background and Application Scenarios of Big Data Systems 2
    1.1.1 Background 2
    1.1.2 Common Big Data Application Scenarios 3
  1.2 Enterprise-Level Big Data Technology Framework 5
    1.2.1 Data Collection Layer 6
    1.2.2 Data Storage Layer 7
    1.2.3 Resource Management and Service Coordination Layer 7
    1.2.4 Computing Engine Layer 8
    1.2.5 Data Analysis Layer 9
    1.2.6 Data Visualization Layer 9
  1.3 Enterprise-Level Big Data Technology Implementation Plan 9
    1.3.1 Google Big Data Technology Stack 10
    1.3.2 Hadoop and Spark Open Source Big Data Technology Stack 12
  1.4 Big Data Architecture: Lambda Architecture 15
  1.5 Hadoop and Spark Version Selection and Installation and Deployment 16
    1.5.1 Hadoop and Spark Version Selection 16
    1.5.2 Hadoop and Spark Installation and Deployment 17
  1.6
  2.4 Incremental Data Collection 31
    2.4.1 CDC Motivation and Application Scenarios 31
    2.4.2 CDC Open Source Implementation Canal 32
    2.4.3 Multi-Data Center Data Synchronization System Otter 33
  2.5 Summary 35
  2.6 Questions in This Chapter 35
Chapter 3 Collection of Non-Relational Data 36
  3.1 Overview 36
    3.1.1 Flume Design Motivation 36
    3.1.2 Flume Basic Ideas and Features 37
  3.2 Flume NG Basic Architecture 38
    3.2.1 Flume NG Basic Architecture 38
    3.2.2 Flume NG Advanced Components 41
  3.3 Flume NG Data Flow Topology Construction Method 42
    3.3.1 How to Build Data Flow Topology 42
    3.3.2 Data Flow Topology Example Analysis 46
  3.4 Summary 50
  3.5 Questions in This Chapter 50
Chapter 4 Distributed Message Queue Kafka 51
  4.1 Overview 51
    4.1.1 Kafka Design Motivation 51
    4.1.2 Kafka Features 53
  4.2 Kafka Design Architecture 53
    4.2.1 Kafka Basic Architecture 54
    4.2.2 Detailed Explanation of Kafka Components 54
    4.2.3 Kafka Key Technical Points 58
  4.3 Kafka Programming 60
    4.3.1 Producer Programming 61
    4.3.2 Consumer Programming 63
    4.3.3 Open Source Producer and Consumer Implementation 65
  4.4 Typical Application Scenarios of Kafka 65
  4.5 Summary 67
  4.6 Questions in This Chapter 67

Part III Data Storage
Chapter 5 Data Serialization and File Storage Format 70
  5.1 The Significance of Data Serialization 70
  5.2 Data Serialization Scheme 72
    5.2.1 Serialization Framework Thrift 72
    5.2.2 Serialization Framework Protobuf 74
    5.2.3 Serialization Framework Avro 76
    5.2.4 Comparison of Serialization Frameworks 78
  5.3 Analysis of File Storage Formats 79
    5.3.1 Row Storage and Column Storage 79
    5.3.2 Row Storage Format 80
    5.3.3 Column Storage Formats ORC, Parquet and CarbonData 82
  5.4 Summary 88
  5.5 Chapter Questions 89
Chapter 6 Distributed File Systems 90
  6.1 Background 90
  6.2 File-Level and Block-Level Distributed File Systems 91
    6.2.1 File-Level Distributed System 91
    6.2.2 Block-Level Distributed System 92
  6.3 HDFS Basic Architecture 93
  6.4 HDFS Key Technologies 94
    6.4.1 Fault-Tolerant Design 95
    6.4.2 Replica Placement Strategy 95
    6.4.3 Heterogeneous Storage Media 96
    6.4.4 Centralized Cache Management 97
  6.5 HDFS Access Methods 98
    6.5.1 HDFS Shell 98
    6.5.2 HDFS API 100
    6.5.3 Data Collection Components 101
    6.5.4 Computing Engine 102
  6.6 Summary 102
  6.7 Chapter Questions 103
Chapter 7 Distributed Structured Storage System 104
  7.1 Background 104
  7.2 HBase Data Model 105
    7.2.2 Physical Data Storage 107
  7.3 HBase Basic Architecture 108
    7.3.1 HBase Basic Architecture 108
    7.3.2 HBase Internal Principles 110
  7.4 HBase Access Methods 114
    7.4.1 HBase Shell 114
    7.4.2 HBase API 116
    7.4.3 Data Collection Components 118
    7.4.4 Computing Engine 119
    7.4.5 Apache Phoenix 119
  7.5 HBase Application Cases 120
    7.5.1 Social Relationship Data Storage 120
    7.5.2 OpenTSDB Time Series Database 122
  7.6 Distributed Columnar Storage System Kudu 125
    7.6.1 Kudu Basic Features 125
    7.6.2 Kudu Data Model and Architecture 126
    7.6.3 Comparison between HBase and Kudu 126
  7.7 Summary 127
  7.8 Chapter Questions 127

Part IV Distributed Coordination and Resource Management
Chapter 8 Distributed Coordination Service ZooKeeper 130
  8.1 The Existence of Distributed Coordination Services 130
    8.1.1 Leader Election 130
    8.1.2 Load Balancing 131
  8.2 ZooKeeper Data Model 132
  8.3 ZooKeeper Basic Architecture 133
  8.4 ZooKeeper Programming 134
    8.4.1 ZooKeeper API 135
    8.4.2 Apache Curator 139
  8.5 ZooKeeper Application Cases 142
    8.5.1 Leader Election 142
    8.5.2 Distributed Queues 143
    8.5.3 Load Balancing 143
  8.6 Summary 144
  8.7 Questions in This Chapter 145
Chapter 9 Resource Management and Scheduling System YARN 146
  9.1 Background of YARN 146
    9.1.1 Limitations of MRv1 146
    9.1.2 Motivation for YARN Design 147
  9.2 Design Ideas for YARN 148
  9.3
    9.3.1 YARN Basic Architecture 149
    9.3.2 YARN High Availability 152
    9.3.3 YARN Workflow 153
  9.4 YARN Resource Scheduler 155
    9.4.1 Hierarchical Queue Management Mechanism 155
    9.4.2 Background of Multi-tenant Resource Scheduler 156
    9.4.3 Capacity/Fair Scheduler 157
    9.4.4 Scheduling Based on Node Labels 160
    9.4.5 Resource Preemption