This course explains how to develop a Big Data solution with Spark, using Scala on the Cloudera distribution of Hadoop (CDH).
In Spark Application Development with Scala and Cloudera, you will learn to process data at scales you previously thought were out of your reach. First, you will learn the technical details of how Spark works. Next, you will explore the RDD API, Spark's original core abstraction. Then, you will discover how to become more proficient with Spark SQL and DataFrames. Finally, you will learn to work with Spark's typed API: Datasets. When you finish this course, you will have a foundational knowledge of Apache Spark with Scala and Cloudera that will help you advance in developing large-scale data applications, letting you work with Big Data efficiently and performantly.
Why Spark with Scala and Cloudera?
- But Why Apache Spark?
- Brief History of Spark
- What We Will Cover in This Training
- Picking a Spark Supported Language: Scala, Python, Java, or R.
- What Do You Need for This Course?
Getting an Environment and Data: CDH + StackOverflow
- Getting an Environment & Data: CDH + StackOverflow
- Upgrading Cloudera Manager and CDH
- Installing or Upgrading to Java 8 (JDK 1.8)
- Installing Spark 2 on Cloudera
- Data: StackOverflow & StackExchange Dumps + Demo Files
- Preparing Your Big Data
Scala Fundamentals
- Scala's History and Overview
- Building and Running Scala Applications
- Creating Self-contained Applications, Including scalac & sbt
- The Scala Shell: REPL (Read Evaluate Print Loop)
- Scala, the Language
- More on Types, Functions, and Operations
- Expressions, Functions, and Methods
- Classes, Case Classes, and Traits (see the sketch after this list)
- Flow Control
- Functional Programming
- Enter spark2-shell: Spark in the Scala Shell
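To ground the module, here is a minimal sketch of a few constructs covered above: a case class, a function value, and functional collection operations. The `Post` class and its sample data are invented for illustration; any recent Scala 2.x compiles this.

```scala
// A case class: an immutable data holder with equals/hashCode/toString for free
case class Post(id: Int, title: String, score: Int)

object ScalaBasics {
  def main(args: Array[String]): Unit = {
    val posts = List(
      Post(1, "How do I use Spark?", 42),
      Post(2, "What is an RDD?", 17),
      Post(3, "Scala vs Java", 5)
    )

    // A function value (lambda) used with higher-order collection methods
    val isPopular: Post => Boolean = _.score > 10

    // Functional style: filter, sortBy, and map instead of loops and mutation
    val popularTitles = posts.filter(isPopular).sortBy(-_.score).map(_.title)

    popularTitles.foreach(println) // prints the two popular titles, highest score first
  }
}
```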
Understanding Spark: An Overview
- Spark, Word Count, Operations, and Transformations (see the word count sketch after this list)
- A Few Words on Fine Grained Transformations and Scalability
- How Word Count Works, Featuring Coarse Grained Transformations
- Parallelism by Partitioning Data
- Pipelining: One of the Secrets of Spark's Performance
- Narrow and Wide Transformations
- Lazy Execution, Lineage, Directed Acyclic Graph (DAG), and Fault Tolerance
- Time for the Big Picture: Spark Libraries
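As a concrete anchor for the word count discussion, here is a minimal RDD sketch runnable in spark2-shell, where `sc` (the SparkContext) is pre-created; the HDFS path is a hypothetical placeholder.

```scala
// Classic word count with the RDD API
val lines = sc.textFile("hdfs:///user/cloudera/sample.txt") // hypothetical path

val counts = lines
  .flatMap(_.split("\\s+"))  // narrow transformation: line -> words
  .map(word => (word, 1))    // narrow: word -> (word, 1) pair
  .reduceByKey(_ + _)        // wide transformation: shuffles data by key

counts.take(10).foreach(println) // action: triggers the lazy DAG
```

The narrow steps pipeline together within each partition; only reduceByKey forces a shuffle, which is where partitioning and lineage come into play.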
Getting Technical: Spark Architecture
- Storage in Spark and Supported Data Formats
- Let's Talk APIs: Low Level and High Level Spark APIs
- Performance Optimizations: Tungsten and Catalyst
- SparkContext and SparkSession: Entry Points to Spark Apps (see the sketch after this list)
- Spark Configuration + Client and Cluster Deployment Modes
- Spark on Yarn: The Cluster Manager
- Spark with Cloudera Manager and YARN UI
- Visualizing Your Spark App: Web UI and History Server
- Logging in with Spark and Cloudera
- Navigating the Spark and Cloudera Documentation
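A minimal sketch of what the entry points look like in a self-contained Spark 2 application; the app name is a placeholder, and the master would normally be supplied by spark-submit (e.g. --master yarn) rather than hardcoded. In spark2-shell, `spark` and `sc` already exist.

```scala
import org.apache.spark.sql.SparkSession

object EntryPoint {
  def main(args: Array[String]): Unit = {
    // SparkSession is the unified entry point in Spark 2
    val spark = SparkSession.builder()
      .appName("entry-point-demo") // hypothetical app name
      .getOrCreate()

    // The classic SparkContext is still available underneath
    val sc = spark.sparkContext
    println(s"Running Spark ${spark.version} with ${sc.defaultParallelism} default partitions")

    spark.stop()
  }
}
```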
Learning the Core of Spark: RDDs
- SparkContext: The Entry Point to a Spark Application
- RDD and PairRDD - Resilient Distributed Datasets
- Creating RDDs with Parallelize (see the sketch after this list)
- Returning Data to the Driver, e.g. collect(), take(), first(), ...
- Partitions, Repartition, Coalesce, Saving as Text, and HUE
- Creating RDDs from External Datasets
- Saving Data as ObjectFile, NewAPIHadoopFile, SequenceFile, ...
- Creating RDDs with Transformations
- A Little Bit More on Lineage and Dependencies
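A minimal sketch of the core RDD operations listed above, runnable in spark2-shell; the numbers are arbitrary sample data.

```scala
// Create an RDD from a local collection; 4 partitions chosen arbitrarily
val numbers = sc.parallelize(1 to 100, 4)

// Transformations are lazy: nothing runs until an action is called
val evens = numbers.filter(_ % 2 == 0)

// Actions return data to the driver
println(evens.count())        // 50
println(evens.take(3).toList) // List(2, 4, 6)
println(evens.first())        // 2

// coalesce reduces the number of partitions without a full shuffle
val single = evens.coalesce(1)
println(single.getNumPartitions) // 1
```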
Going Deeper into Spark Core
- Going Deeper into Spark Core
- Functional Programming: Anonymous Functions (Lambda) in Spark
- A Quick Look at Map, FlatMap, Filter, and Sort
- How Can I Tell It Is a Transformation?
- Why Do We Need Actions?
- Partition Operations: MapPartitions and PartitionBy
- Sampling Your Data
- Set Operations: Join, Union, Full/Right/Left Outer Joins, and Cartesian
- Combining, Aggregating, Reducing, and Grouping on PairRDDs
- ReduceByKey vs. GroupByKey: Which One Is Better? (see the sketch after this list)
- Grouping Data into Buckets with Histogram
- Caching and Data Persistence
- Shared Variables: Accumulators and Broadcast
- What's Needed for Developing Self-contained Spark Applications
- Disadvantages of RDDs - So What's Better?
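To make the ReduceByKey vs. GroupByKey comparison concrete, here is a minimal sketch with invented pair data: both yield the same counts, but reduceByKey combines values within each partition before shuffling, so it usually moves far less data across the network.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)))

// reduceByKey pre-aggregates on each partition (map-side combine)
val reduced = pairs.reduceByKey(_ + _)

// groupByKey ships every single value through the shuffle, then we sum;
// same result, but much more shuffle traffic on real data
val grouped = pairs.groupByKey().mapValues(_.sum)

println(reduced.collect().toMap) // Map(a -> 3, b -> 2)
println(grouped.collect().toMap) // Map(a -> 3, b -> 2)
```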
Increasing Proficiency with Spark: DataFrames and Spark SQL
- Increasing Proficiency with Spark: DataFrames & Spark SQL
- Everyone Uses SQL and How It All Began
- Hello DataFrames and Spark SQL
- SparkSession: The Entry Point to the Spark SQL / DataFrame API
- Creating DataFrames (see the sketch after this list)
- DataFrames to RDDs and Vice Versa
- Loading DataFrames: Text and CSV
- Schemas: Inferred and Programmatically Specified + Option
- More Data Loading: Parquet and JSON
- Rows, Columns, Expressions, and Operators
- Working with Columns
- More Columns, Expressions, Cloning, Renaming, Casting, & Dropping
- User Defined Functions (UDFs) on Spark SQL
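A minimal sketch of creating a DataFrame and applying a UDF, assuming spark2-shell with Spark 2; the `Question` case class, its data, and the `shout` UDF are invented for illustration.

```scala
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._ // enables .toDF() on local collections

case class Question(id: Int, title: String, score: Int)

val df = Seq(
  Question(1, "How do I use Spark?", 42),
  Question(2, "What is an RDD?", 17)
).toDF()

// Column expressions via the DataFrame DSL
df.select(col("title"), (col("score") * 2).as("doubled")).show()

// A user-defined function wrapped for use in column expressions
val shout = udf((s: String) => s.toUpperCase)
df.withColumn("loud_title", shout(col("title"))).show()
```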
Continuing the Journey on DataFrames and Spark SQL
- Querying, Sorting, and Filtering DataFrames: The DSL
- What to Do with Missing or Corrupt Data
- Saving DataFrames
- Spark SQL: Querying Using Temporary Views (see the sketch after this list)
- Loading Files and Views into DataFrames Using Spark SQL
- Saving to Persistent Tables + Spark 2 Known Issue
- Hive Support and External Databases
- Aggregating, Grouping, and Joining
- The Catalog API
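A minimal sketch of querying through a temporary view with plain SQL, reusing the hypothetical `df` from the previous sketch.

```scala
// Register the DataFrame as a temporary view, then query it with SQL
df.createOrReplaceTempView("questions")

val popular = spark.sql(
  "SELECT title, score FROM questions WHERE score > 20 ORDER BY score DESC")
popular.show()

// Aggregation works the same way through SQL
spark.sql("SELECT avg(score) AS avg_score FROM questions").show()
```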
Working with a Typed API: Datasets
- Understanding a Typed API: Datasets
- The Motivation Behind Datasets
- What's a Dataset?
- What Do You Need for Datasets?
- Creating Datasets (see the sketch after this list)
- Dataset Operations
- RDDs vs. DataFrames vs. Datasets: A Few Final Thoughts
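Finally, a minimal sketch of the typed Dataset API, reusing the hypothetical `Question` case class from the earlier sketches; `import spark.implicits._` supplies the encoders.

```scala
import spark.implicits._ // brings in encoders for case classes and primitives

// A Dataset is a typed DataFrame: fields are checked at compile time
val ds = Seq(
  Question(1, "How do I use Spark?", 42),
  Question(2, "What is an RDD?", 17)
).toDS()

// Typed transformations: q.score is checked by the compiler, unlike
// col("score") on a DataFrame, which fails only at runtime
val popular = ds.filter(q => q.score > 20).map(_.title)
popular.show()

// A DataFrame is just Dataset[Row]
val asDf = ds.toDF()
```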
Satisfaction
What did I learn?
- Setting up an environment and data: CDH + StackOverflow
- Scala fundamentals
- Spark
- Spark RDD techniques
- DataFrames & Spark SQL
- The typed API: Datasets