Developing Spark Applications Using Scala & Cloudera

This course explains how to develop a Big Data solution with Spark, using Scala on Cloudera's Hadoop distribution.

In Developing Spark Applications Using Scala & Cloudera, you will learn to process data at scales you previously thought were out of reach. First, you will learn the technical details of how Spark works. Next, you will explore the RDD API, Spark's original core abstraction. Then, you will discover how to become more proficient with Spark SQL and DataFrames. Finally, you will learn to work with Spark's typed API: Datasets. When you have finished this course, you will have a foundational knowledge of Apache Spark with Scala and Cloudera that will help you move forward in developing large-scale data applications that work with Big Data efficiently and with good performance.

Hours: 5.42

Level: Beginner

Why Spark with Scala and Cloudera?

But Why Apache Spark?
Brief History of Spark
What We Will Cover in This Training
Picking a Spark Supported Language: Scala, Python, Java, or R.
What Do You Need for This Course?

Getting an Environment and Data: CDH + StackOverflow

Getting an Environment & Data: CDH + StackOverflow
Upgrading Cloudera Manager and CDH
Installing or Upgrading to Java 8 (JDK 1.8)
Installing Spark 2 on Cloudera
Data: StackOverflow & StackExchange Dumps + Demo Files
Preparing Your Big Data

Scala Fundamentals

Scala's History and Overview
Building and Running Scala Applications
Creating Self-contained Applications, Including scalac & sbt
The Scala Shell: REPL (Read Evaluate Print Loop)
Scala, the Language
More on Types, Functions, and Operations
Expressions, Functions, and Methods
Classes, Case Classes, and Traits
Flow Control
Functional Programming
Enter spark2-shell: Spark in the Scala Shell

Understanding Spark: An Overview

Spark, Word Count, Operations, and Transformations
A Few Words on Fine Grained Transformations and Scalability
How Word Count Works, Featuring Coarse Grained Transformations
Parallelism by Partitioning Data
Pipelining: One of the Secrets of Spark's Performance
Narrow and Wide Transformations
Lazy Execution, Lineage, Directed Acyclic Graph (DAG), and Fault Tolerance
Time for the Big Picture: Spark Libraries

Getting Technical: Spark Architecture

Storage in Spark and Supported Data Formats
Let's Talk APIs: Low Level and High Level Spark APIs
Performance Optimizations: Tungsten and Catalyst
SparkContext and SparkSession: Entry Points to Spark Apps
Spark Configuration + Client and Cluster Deployment Modes
Spark on YARN: The Cluster Manager
Spark with Cloudera Manager and YARN UI
Visualizing Your Spark App: Web UI and History Server
Logging in Spark and Cloudera
Navigating the Spark and Cloudera Documentation

Learning the Core of Spark: RDDs

SparkContext: The Entry Point to a Spark Application
RDD and PairRDD - Resilient Distributed Datasets
Creating RDDs with Parallelize
Returning Data to the Driver, i.e. collect(), take(), first()...
Partitions, Repartition, Coalesce, Saving as Text, and HUE
Creating RDDs from External Datasets
Saving Data as ObjectFile, NewAPIHadoopFile, SequenceFile, ...
Creating RDDs with Transformations
A Little Bit More on Lineage and Dependencies

Going Deeper into Spark Core

Going Deeper into Spark Core
Functional Programming: Anonymous Functions (Lambda) in Spark
A Quick Look at Map, FlatMap, Filter, and Sort
How Can I Tell It Is a Transformation?
Why Do We Need Actions?
Partition Operations: MapPartitions and PartitionBy
Sampling Your Data
Set Operations: Join, Union, Full Right, Left Outer, and Cartesian
Combining, Aggregating, Reducing, and Grouping on PairRDDs
ReduceByKey vs. GroupByKey: Which One Is Better?
Grouping Data into Buckets with Histogram
Caching and Data Persistence
Shared Variables: Accumulators and Broadcast
What's Needed for Developing Self-contained Spark Applications
Disadvantages of RDDs - So What's Better?

Increasing Proficiency with Spark: DataFrames and Spark SQL

Increasing Proficiency with Spark: DataFrames & Spark SQL
Everyone Uses SQL and How It All Began
Hello DataFrames and Spark SQL
SparkSession: The Entry Point to the Spark SQL / DataFrame API
Creating DataFrames
DataFrames to RDDs and Vice Versa
Loading DataFrames: Text and CSV
Schemas: Inferred and Programmatically Specified + Option
More Data Loading: Parquet and JSON
Rows, Columns, Expressions, and Operators
Working with Columns
More Columns, Expressions, Cloning, Renaming, Casting, & Dropping
User Defined Functions (UDFs) on Spark SQL

Continuing the Journey on DataFrames and Spark SQL

User Defined Functions (UDFs) on Spark SQL
Querying, Sorting, and Filtering DataFrames: The DSL
What to Do with Missing or Corrupt Data
Saving DataFrames
Spark SQL: Querying Using Temporary Views
Loading Files and Views into DataFrames Using Spark SQL
Saving to Persistent Tables + Spark 2 Known Issue
Hive Support and External Databases
Aggregating, Grouping, and Joining
The Catalog API

Working with a Typed API: Datasets

Understanding a Typed API: Datasets
The Motivation Behind Datasets
What's a Dataset?
What Do You Need for Datasets?
Creating Datasets
Dataset Operations
RDDs vs. DataFrames vs. Datasets: A Few Final Thoughts

What did I learn?

Setting up an environment & data: CDH + StackOverflow
Scala Fundamentals
Spark
Spark RDD techniques
DataFrames & Spark SQL
Typed API: Datasets