Developing Spark Applications Using Scala & Cloudera

  • Task: Spark, Scala, Cloudera
This course explains how to develop a Big Data solution with Spark, using Scala on Cloudera's Hadoop distribution.

In Developing Spark Applications Using Scala & Cloudera, you will learn to process data at scales you previously thought were out of reach. First, you will learn all the technical details of how Spark works. Next, you will explore the RDD API, Spark's original core abstraction. Then, you will discover how to become more proficient with Spark SQL and DataFrames. Finally, you will learn to work with Spark's typed API: Datasets. By the end of this course, you will have a foundational knowledge of Apache Spark with Scala and Cloudera that will help you move forward in developing large-scale data applications that work with Big Data efficiently and with good performance.



Hours dedicated to the course


Total classes


Hours dedicated to study



Why Spark with Scala and Cloudera?

  • But Why Apache Spark?
  • Brief History of Spark.
  • What We Will Cover in This Training.
  • Picking a Spark Supported Language: Scala, Python, Java, or R.
  • What Do You Need for This Course?

Getting an Environment and Data: CDH + StackOverflow

  • Getting an Environment & Data: CDH + StackOverflow.
  • Upgrading Cloudera Manager and CDH.
  • Installing or Upgrading to Java 8 (JDK 1.8)
  • Installing Spark 2 on Cloudera.
  • Data: StackOverflow & StackExchange Dumps + Demo Files.
  • Preparing Your Big Data.

Scala Fundamentals

  • Scala's History and Overview
  • Building and Running Scala Applications
  • Creating Self-contained Applications, Including scalac & sbt
  • The Scala Shell: REPL (Read Evaluate Print Loop)
  • Scala, the Language
  • More on Types, Functions, and Operations
  • Expressions, Functions, and Methods
  • Classes, Case Classes, and Traits
  • Flow Control
  • Functional Programming
  • Enter spark2-shell: Spark in the Scala Shell

Understanding Spark: An Overview

  • Spark, Word Count, Operations, and Transformations
  • A Few Words on Fine Grained Transformations and Scalability
  • How Word Count Works, Featuring Coarse Grained Transformations
  • Parallelism by Partitioning Data
  • Pipelining: One of the Secrets of Spark's Performance
  • Narrow and Wide Transformations
  • Lazy Execution, Lineage, Directed Acyclic Graph (DAG), and Fault Tolerance
  • Time for the Big Picture: Spark Libraries

Getting Technical: Spark Architecture

  • Storage in Spark and Supported Data Formats
  • Let's Talk APIs: Low Level and High Level Spark APIs
  • Performance Optimizations: Tungsten and Catalyst
  • SparkContext and SparkSession: Entry Points to Spark Apps
  • Spark Configuration + Client and Cluster Deployment Modes
  • Spark on Yarn: The Cluster Manager
  • Spark with Cloudera Manager and YARN UI
  • Visualizing Your Spark App: Web UI and History Server
  • Logging in with Spark and Cloudera
  • Navigating the Spark and Cloudera Documentation

Learning the Core of Spark: RDDs

  • SparkContext: The Entry Point to a Spark Application
  • RDD and PairRDD - Resilient Distributed Datasets
  • Creating RDDs with Parallelize
  • Returning Data to the Driver, i.e. collect(), take(), first()...
  • Partitions, Repartition, Coalesce, Saving as Text, and HUE
  • Creating RDDs from External Datasets
  • Saving Data as ObjectFile, NewAPIHadoopFile, SequenceFile, ...
  • Creating RDDs with Transformations
  • A Little Bit More on Lineage and Dependencies

Going Deeper into Spark Core

  • Going Deeper into Spark Core
  • Functional Programming: Anonymous Functions (Lambda) in Spark
  • A Quick Look at Map, FlatMap, Filter, and Sort
  • How Can I Tell It Is a Transformation
  • Why Do We Need Actions?
  • Partition Operations: MapPartitions and PartitionBy
  • Sampling Your Data
  • Set Operations: Join, Union, Full Right, Left Outer, and Cartesian
  • Combining, Aggregating, Reducing, and Grouping on PairRDDs
  • ReduceByKey vs. GroupByKey: Which One Is Better?
  • Grouping Data into Buckets with Histogram
  • Caching and Data Persistence
  • Shared Variables: Accumulators and Broadcast
  • What's Needed for Developing Self-contained Spark Applications
  • Disadvantages of RDDs - So What's Better?

Increasing Proficiency with Spark: DataFrames and Spark SQL

  • Increasing Proficiency with Spark: DataFrames & Spark SQL
  • Everyone Uses SQL and How It All Began
  • Hello DataFrames and Spark SQL
  • SparkSession: The Entry Point to the Spark SQL / DataFrame API
  • Creating DataFrames
  • DataFrames to RDDs and Vice Versa
  • Loading DataFrames: Text and CSV
  • Schemas: Inferred and Programmatically Specified + Option
  • More Data Loading: Parquet and JSON
  • Rows, Columns, Expressions, and Operators
  • Working with Columns
  • More Columns, Expressions, Cloning, Renaming, Casting, & Dropping
  • User Defined Functions (UDFs) on Spark SQL

Continuing the Journey on DataFrames and Spark SQL

  • User Defined Functions (UDFs) on Spark SQL
  • Querying, Sorting, and Filtering DataFrames: The DSL
  • What to Do with Missing or Corrupt Data
  • Saving DataFrames
  • Spark SQL: Querying Using Temporary Views
  • Loading Files and Views into DataFrames Using Spark SQL
  • Saving to Persistent Tables + Spark 2 Known Issue
  • Hive Support and External Databases
  • Aggregating, Grouping, and Joining
  • The Catalog API

Working with a Typed API: Datasets

  • Understanding a Typed API: Datasets
  • The Motivation Behind Datasets
  • What's a Dataset?
  • What Do You Need for Datasets?
  • Creating Datasets
  • Dataset Operations
  • RDDs vs. DataFrames vs. Datasets: A Few Final Thoughts


What Did I Learn?

  • Setting up an environment & data: CDH + StackOverflow
  • Scala Fundamentals
  • Spark
  • Spark RDD techniques
  • DataFrames & Spark SQL
  • API: Datasets