| Author |
| Russell Jurney (罗素·朱尼), United States |
| Series |
| Publisher |
| Publishing House of Electronics Industry (电子工业出版社) |
| ISBN |
| 9787121351662 |
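| Summary |
| Introduction |
| This book presents the author's agile data science methodology. Drawing on his years of hands-on industry experience, it gives data science teams a set of practices for conducting data science research in a way that resembles agile software development. The book performs full-stack data analysis built on Spark and demonstrates tools commonly used in industry, covering every stage from front-end display to back-end processing. It walks data scientists step by step through turning theory into real, user-facing applications, so that readers can keep refining their own research while creating real value from data. The book is suitable for beginners, and data scientists, engineers, and analysts will all find something useful in it. |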
| Table of Contents |
| Preface |
| Part I: Setup |
| Chapter 1: Theory |
| Introduction; Definition [Methodology; Agile Data Science Manifesto]; The Problem with the Waterfall Model [Research vs. Application Development]; The Problem with Agile Software Development [Eventual Quality: Paying Down Technical Debt; The Pull of the Waterfall Model]; The Data Science Process [Setting Expectations; Data Science Team Roles; Recognizing Opportunities and Challenges; Adapting to Change]; Notes on Process [Code Review and Pair Programming; Agile Environments: Improving Productivity; Realizing Ideas with Large-Format Printing] |
| Chapter 2: Agile Tools |
| Scalability = Ease of Use; Agile Data Science Data Processing; Setting Up a Local Environment [System Requirements; Setting Up Vagrant; Downloading the Data]; Setting Up an EC2 Environment [Downloading the Data]; Downloading and Running the Code [Downloading the Code; Running the Code; Jupyter Notebooks]; Toolset Overview [Agile Stack Requirements; Python 3; Serializing Events with JSON Lines and Parquet; Collecting Data; Data Processing with Spark; Publishing Data with MongoDB; Searching Data with Elasticsearch; Distributing Streaming Data with Apache Kafka; Processing Streaming Data with PySpark Streaming; Machine Learning with scikit-learn and Spark MLlib; Scheduling with Apache Airflow (Incubating); Reflecting on Our Workflow; Lightweight Web Applications; Presenting Data]; Chapter Summary |
| Chapter 3: Data |
| Air Travel Data [Flight On-Time Performance Data; OpenFlights Database]; Weather Data; Data Processing in Agile Data Science [Structured vs. Semistructured Data]; SQL vs. NoSQL [SQL; NoSQL and Dataflow Programming; Spark: SQL + NoSQL; Schemas in NoSQL; Data Serialization; Extracting and Exposing Features in Evolving Schemas]; Chapter Summary |
| Part II: Climbing the Pyramid |
| Chapter 4: Collecting and Displaying Records |
| Putting It All Together; Collecting and Serializing Flight Data; Processing and Publishing Flight Records [Publishing Flight Records to MongoDB]; Presenting Flight Records in a Browser [Serving Flight Information with Flask and pymongo; Rendering HTML5 Pages with Jinja2]; Agile Checkpoint; Listing Flight Records [Listing Flight Records with MongoDB; Paginating Data]; Searching Flight Data [Creating an Index; Publishing Flight Data to Elasticsearch; Searching Flight Data on the Web]; Chapter Summary |
| Chapter 5: Visualizing Data with Charts |
| Chart Quality: Iteration Is Essential; Scaling a Database with the Publish/Decorate Model [First-Order Form; Second-Order Form; Third-Order Form; Choosing a Form]; Exploring Seasonality [Querying and Presenting Total Flight Counts]; Extracting "Metal" (Airplanes (Entities)) [Extracting Tail Numbers; Assessing Airplane Records]; Data Enrichment [Reverse Engineering a Web Form; Gathering Tail Numbers; Automating Form Submission; Extracting Data from HTML; Evaluating the Enriched Data]; Chapter Summary |
| Chapter 6: Exploring Data with Reports |
| Extracting Airlines as Entities [Defining Airlines as Groups of Airplanes with PySpark; Querying Airline Data in MongoDB; Building an Airline Page in Flask; Adding Links Back to the Airline Page; Creating a Home Page Listing All Airlines]; Curating Ontologies of Semistructured Data; Improving the Airline Page [Adding Names to Airline Codes; Incorporating Wikipedia Content; Publishing the Enriched Airlines Table to MongoDB; Enriching Airline Information on the Web]; Investigating Airplanes (Entities) [SQL Subqueries vs. Dataflow Programming; Dataflow Programming Without Subqueries; Subqueries in Spark SQL; Creating an Airplanes Home Page; Adding Search to the Airplanes Page; Creating a Bar Chart of Airplane Manufacturers; Iterating on the Manufacturers Bar Chart; Entity Resolution: Another Chart Iteration]; Chapter Summary |
| Chapter 7: Making Predictions |
| The Role of Predictions; What to Predict; Introduction to Predictive Analytics [Making Predictions]; Exploring Flight Delays; Extracting Features with PySpark; Building a Regression Model with scikit-learn [Loading the Data; Sampling the Data; Vectorizing the Results; Preparing the Training Data; Vectorizing the Features; Sparse vs. Dense Matrices; Preparing an Experiment; Training the Model; Testing the Model; Summary]; Building a Classifier with Spark MLlib [Loading Training Data with a Specified Schema; Handling Nulls; Replacing FlightNum (Flight Number) with Route; Bucketizing a Continuous Variable for Classification; Vectorizing Features with pyspark.ml.feature; Classification with Spark ML]; Chapter Summary |
| Chapter 8: Deploying Predictive Systems |
| Deploying a scikit-learn Application as a Web Service [Saving and Loading scikit-learn Models; Groundwork for Serving Predictions; Creating an API for the Flight Delay Regression; Testing the API; Using the API in the Product]; Deploying Spark ML Applications in Batch Mode with Airflow [Gathering Training Data in Production; Training, Storing, and Loading Spark ML Models; Creating Prediction Requests in MongoDB; Fetching Prediction Requests from MongoDB; Making Batch Predictions with Spark ML; Storing Predictions in MongoDB; Displaying Batch Prediction Results in the Web Application; Automating the Workflow with Apache Airflow (Incubating); Summary]; Deploying Spark ML Applications in Streaming Mode with Spark Streaming [Gathering Training Data in Production; Training, Storing, and Loading Spark ML Models; Sending Prediction Requests to Kafka; Making Predictions with Spark Streaming; Testing the Entire System]; Chapter Summary |
| Chapter 9: Improving Predictions |
| Fixing the Prediction Problem; When to Improve Predictions; Improving Prediction Performance [Experimental Adhesion Method: See What Sticks; Establishing Rigorous Metrics for Experiments; Time of Day as a Feature; Incorporating Airplane Data; Extracting Airplane Features; Incorporating Airplane Features into the Classifier Model]; Incorporating Flight Time; Chapter Summary |
| Appendix A: Installation Guide |
| Installing Hadoop; Installing Spark; Installing MongoDB; Installing the MongoDB Java Driver; Installing mongo-hadoop [Building mongo-hadoop; Installing pymongo_spark]; Installing Elasticsearch; Installing the Elasticsearch Hadoop Support Library; Configuring Our Spark Environment; Installing Kafka; Installing scikit-learn; Installing Zeppelin |