Spark has become the de facto standard for large-scale data analytics. I have been using and teaching Spark since its inception nine years ago, and I have seen tremendous improvements in Extract, Transform, Load (ETL) processes, distributed algorithm development, and large-scale data analytics. I started using Spark with Java, but I found that while the code is pretty stable, you have to write verbose code, which can become hard to read. For this book, I decided to use PySpark (a Python API for Spark) because it is easier to express the power of Spark in Python: the code is short, readable, and maintainable. PySpark is powerful yet simple to use, and you can express any ETL or distributed algorithm in it with a simple set of transformations and actions, as the sketch below illustrates.
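To make the transformation/action distinction concrete, here is a minimal PySpark sketch (not from the book; the application name and sample data are illustrative assumptions): transformations such as filter and map build a lazy computation plan, and actions such as collect and count trigger its execution.

```python
# Minimal sketch, assuming a local Spark installation; the app name
# "intro-sketch" and the sample numbers are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro-sketch").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy: they describe a computation without running it.
numbers = sc.parallelize([1, 2, 3, 4, 5, 6])   # create an RDD from a list
evens = numbers.filter(lambda n: n % 2 == 0)   # transformation: keep even values
squared = evens.map(lambda n: n * n)           # transformation: square each value

# Actions trigger execution and return results to the driver.
print(squared.collect())   # action: [4, 16, 36]
print(squared.count())     # action: 3

spark.stop()
```

The same few lines would take noticeably more ceremony in Java, which is the readability point the paragraph above is making.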