We held the last event of our 2016 Data Science Meetup speaker series on December 15th with Jim Crozier from IBM Data Science, who gave us a great, detailed introduction to Python with Spark 2.0 (PySpark) and showcased it by analyzing NFL data.
Apache Spark is an open-source cluster computing framework whose in-memory processing can speed up analytic applications by up to 100 times compared with disk-based engines such as Hadoop MapReduce. It is known for its ease of use in creating algorithms that harness insight from complex data.
Jim also spoke about Spark Core and the different languages you can use with it, such as R, Python, and Scala. He told us that Scala is strongly and statically typed, so type errors are raised at compile time, which makes development easier, especially in big projects. Scala also runs on the JVM, so it is native to the Hadoop ecosystem. Hadoop matters here because Spark is commonly deployed on top of Hadoop’s filesystem, HDFS.
Scala interacts with Hadoop through Hadoop’s native Java API, which is why it is very easy to write native Hadoop applications in Scala.
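To make the static typing point concrete from the Python side, here is a minimal sketch of the kind of mistake a Scala compiler would typically reject before a job ever starts, but which Python reports only when the offending line actually runs. The function name `total_yards` and the sample records are made up for illustration and are not taken from Jim's talk.

```python
def total_yards(plays):
    """Sum the 'yards' field across a list of play records."""
    return sum(play["yards"] for play in plays)


# Hypothetical sample records, for illustration only.
good_plays = [{"yards": 12}, {"yards": 7}]
bad_plays = [{"yards": 12}, {"yards": "7"}]  # a string slipped in by mistake

print(total_yards(good_plays))

# Python accepts bad_plays without complaint until the addition is attempted.
try:
    total_yards(bad_plays)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
    print("Failed at run time:", error_message)
```

In a statically typed language, declaring the field as an integer would surface this mistake at compile time, which is the benefit Jim described for larger projects.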
He also covered machine learning, which has come a long way from its early roots in classical math and statistics. Today’s machine learning uses analytic models and algorithms that iteratively learn from data, allowing computers to find hidden insights without being explicitly programmed where to look. This means data analysts and scientists can teach computers to solve problems without having to recode rules each time a new data set is presented.
Using algorithms that learn by looking at hundreds or thousands of data samples, computers can make predictions based on these learned experiences to solve the same problem in new situations. And they’re doing it with a level of accuracy that is beginning to mimic human intelligence.
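As a rough illustration of "learning from samples, then predicting on new input", here is a minimal sketch that fits a straight line to a handful of points using ordinary least squares and then predicts a value it has never seen. This is not code from the talk and does not use Spark's machine learning library; the function names `fit_line` and `predict` and the sample numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Learn slope and intercept from samples with ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned line to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training samples that happen to follow y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

model = fit_line(xs, ys)
print(predict(model, 10))  # a prediction for an input not in the samples
```

No rule such as "multiply by two and add one" is written anywhere in the code; the relationship is recovered from the samples, and a different data set would yield a different model without any change to the program.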
IBM is helping organizations apply Machine Learning through the power of Apache Spark, bringing significant benefits to the analytics industry as companies increasingly make space for machine learning in the enterprise.
The demand for machine learning is booming!
Jim also explained that pandas DataFrames are not part of the Spark library. pandas is a separate open source Python library for data analysis.
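The distinction can be sketched as follows, assuming pandas is installed (and, for the optional helper, PySpark as well). A pandas DataFrame lives in a single Python process, while a Spark DataFrame is a separate type that has to be created from it explicitly. The sample rows, the column names, and the helper `to_spark_and_back` are hypothetical and are not from Jim's NFL demo.

```python
import pandas as pd

# Made-up sample rows for illustration only; not real NFL results.
games = pd.DataFrame(
    {
        "team": ["A", "B", "A", "B"],
        "points": [21, 14, 10, 30],
    }
)

# pandas does this aggregation in one Python process on one machine.
points_per_team = games.groupby("team")["points"].sum()
print(points_per_team)


def to_spark_and_back(pandas_df):
    """Round-trip a pandas DataFrame through Spark (needs pyspark installed)."""
    from pyspark.sql import SparkSession  # lazy import: optional dependency

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("pandas-vs-spark-sketch")
        .getOrCreate()
    )
    spark_df = spark.createDataFrame(pandas_df)  # pandas -> Spark DataFrame
    return spark_df.toPandas()  # Spark DataFrame -> pandas
```

Calling `to_spark_and_back(games)` starts a local Spark session, so it is left out of the main flow here; collecting a Spark DataFrame back with `toPandas()` pulls all rows into the driver's memory and is only sensible for small results.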
Jim spoke about how IBM is mastering the art of data science via the IBM Data Science Experience, a new cloud-based, social workspace that helps data professionals consolidate, create, and collaborate across multiple open source tools such as R and Python.
Since its release, Apache Spark has seen rapid adoption by enterprises across a wide range of industries. Internet powerhouses such as Netflix, Yahoo, and eBay have deployed Spark at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes. It has quickly become the largest open source community in big data, with over 1,000 contributors from more than 250 organizations.