Key Features
- An advanced guide combining instructions and practical examples to help you extend the most up-to-date Spark functionality.
- Extend your data processing capabilities to process huge chunks of data in minimal time using advanced Spark concepts.
- Master the art of real-time processing with the help of Apache Spark 2.0
Book Description
Apache Spark is an in-memory, cluster-based parallel processing system that provides a wide range of functionality, including graph processing, machine learning, stream processing, and SQL. This book aims to take your knowledge of Spark to the next level by teaching you how to expand its functionality.
The book commences with an overview of the Spark ecosystem. You will learn how Hive can be configured and used on Spark to provide real-time SQL processing. The book will introduce you to Project Tungsten, and you will understand how memory management and binary processing, cache-aware computation, and code generation are used to speed things up dramatically. The book goes on to show how to incorporate H2O and Deeplearning4j for machine learning, Titan for graph-based storage, and Databricks and Jupyter Notebooks for cloud-based Spark. Over the course of the book, you will learn about the latest enhancements in Apache Spark 2.0, such as interactive querying of live data and the unification of DataFrames and Datasets.
You will also learn about updates to the Accumulator API and the DataFrame-based ML API. You will learn to use Spark as a compiler, understand how to implement Structured Streaming, and thus explore how easy it is to use Spark in day-to-day tasks.
What you will learn
- Examine clustering and classification using MLlib
- Create a schema in Spark SQL, and learn how a Spark schema can be populated with data (see the sketch after this list)
- Study Spark based graph processing using Spark GraphX
- Combine Spark with H2O and Deeplearning4j, and understand why this combination is useful
- Evaluate how graph storage works with Apache Spark, Titan, HBase and Cassandra
- Use Apache Spark in the cloud with Databricks, Jupyter Notebooks, Docker and OpenStack
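To give a flavour of the Spark SQL schema topic listed above, here is a minimal sketch (not taken from the book) of defining an explicit schema and populating it with data; the column names and sample rows are illustrative assumptions.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("schema-sketch")
      .master("local[*]")          // local mode, just for this example
      .getOrCreate()

    // Explicit schema: two columns, name (string) and age (int)
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("age", IntegerType, nullable = true)
    ))

    // Populate the schema with a handful of illustrative rows
    val rows = spark.sparkContext.parallelize(Seq(
      Row("Alice", 34),
      Row("Bob", 29)
    ))
    val df = spark.createDataFrame(rows, schema)

    // Register the DataFrame and query it with Spark SQL
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```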
About the Author
Romeo Kienzler is the Chief Data Scientist of the IBM Watson IoT Division and works as an Advisory Architect, helping clients worldwide solve their data analysis problems.
https://www.linkedin.com/in/romeo-kienzler-089b4557
https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-r-video
He holds an M.Sc. in Information Systems, Bioinformatics, and Applied Statistics from the Swiss Federal Institute of Technology. He works as an Associate Professor for data mining at a Swiss university, and his current research focus is on cloud-scale data mining using open source technologies, including R, Apache Spark, SystemML, Apache Flink, and Deeplearning4j. He also contributes to various open source projects. Additionally, he is currently writing a chapter on Hyperledger for a book on blockchain technologies.
http://dataconomy.com/where-life-science-and-data-science-meet-interview-with-romeo-kienzler-of-ibm/
Romeo has spoken at O'Reilly's Velocity conference.
http://conferences.oreilly.com/velocity/devops-web-performance-eu-2015/public/schedule/speaker/219260
http://www.meetup.com/Big-Data-Developers-in-Berlin/events/227744512/