Sqoop, Hive and Impala for Data Analysts (Formerly CCA 159)

COURSE AUTHOR –
Durga Viswanatha Raju Gadiraju, Asasri Manthena

Last Updated on April 11, 2023 by GeeksGod

As part of Sqoop, Hive, and Impala for Data Analysts (Formerly CCA 159), you will learn key skills such as Sqoop, Hive, and Impala.

This comprehensive course covers all aspects of the certification with real-world examples and data sets.

Overview of Big Data ecosystem

- Overview of Distributions and Management Tools
- Properties and Properties Files – General Guidelines
- Hadoop Distributed File System
- YARN and Map Reduce 2
- Submitting Map Reduce Job
- Determining Number of Mappers and Reducers
- Understanding YARN and Map Reduce Configuration Properties
- Review and Override Job Properties
- Reviewing Map Reduce Job Logs
- Map Reduce Job Counters
- Overview of Hive
- Databases and Query Engines
- Overview of Data Ingestion in Big Data
- Data Processing using Spark

HDFS Commands to manage files

- Introduction to HDFS for Certification Exams
- Overview of HDFS and Properties Files
- Overview of Hadoop CLI
- Listing Files in HDFS
- User Spaces or Home Directories in HDFS
- Creating Directories in HDFS
- Copying Files and Directories into HDFS
- File and Directory Permissions Overview
- Getting Files and Directories from HDFS
- Previewing Text Files in HDFS
- Copying or Moving Files and Directories within HDFS
- Understanding Size of File System and Files
- Overview of Block Size and Replication Factor
- Getting File Metadata using hdfs fsck
- Resources and Exercises

Getting Started with Hive

- Overview of Hive Language Manual
- Launching and using Hive CLI
- Overview of Hive Properties
- Hive CLI History and hiverc
- Running HDFS Commands in Hive CLI
- Understanding Warehouse Directory
- Creating and Using Hive Databases
- Creating and Describing Hive Tables
- Retrieve Metadata of Tables using DESCRIBE
- Role of Hive Metastore Database
- Overview of beeline
- Running Hive Commands and Queries using beeline

Creating Tables in Hive using Hive QL

- Creating Tables in Hive – orders
- Overview of Basic Data Types in Hive
- Adding Comments to Columns and Tables
- Loading Data into Hive Tables from Local File System
- Loading Data into Hive Tables from HDFS
- Loading Data – Overwrite vs Append
- Creating External tables in Hive
- Specifying Location for Hive Tables
- Difference between Managed Table and External Table
- Default Delimiters in Hive Tables using Text File
- Overview of File Formats in Hive
- Differences between Hive and RDBMS
- Truncate and Drop tables in Hive
- Resources and Exercises

Loading/Inserting data into Hive tables using Hive QL

- Introduction to Partitioning and Bucketing
- Creating Tables using ORC Format – order_items
- Inserting Data into Tables using Stage Tables
- Load vs. Insert in Hive
- Creating Partitioned Tables in Hive
- Adding Partitions to Tables in Hive
- Loading into Partitions in Hive Tables
- Inserting Data Into Partitions in Hive Tables
- Insert Using Dynamic Partition Mode
- Creating Bucketed Tables in Hive
- Inserting Data into Bucketed Tables
- Bucketing with Sorting
- Overview of ACID Transactions
- Create Tables for Transactions
- Inserting Individual Records into Hive Tables
- Update and Delete Data in Hive Tables

Overview of functions in Hive

- Overview of Functions
- Validating Functions
- String Manipulation – Case Conversion and Length
- String Manipulation – substr and split
- String Manipulation – Trimming and Padding Functions
- String Manipulation – Reverse and Concatenating Multiple Strings
- Date Manipulation – Current Date and Timestamp
- Date Manipulation – Date Arithmetic
- Date Manipulation – trunc
- Date Manipulation – Using date format
- Date Manipulation – Extract Functions
- Date Manipulation – Dealing with Unix Timestamp
- Overview of Numeric Functions
- Data Type Conversion Using Cast
- Handling Null Values
- Query Example – Get Word Count

Writing Basic Queries in Hive

- Overview of SQL or Hive QL
- Execution Life Cycle of Hive Query
- Reviewing Logs of Hive Queries
- Projecting Data using Select and Overview of From
- Derive Conditional Values using CASE and WHEN
- Projecting Distinct Values
- Filtering Data using Where Clause
- Boolean Operations in Where Clause
- Boolean OR vs IN Operator
- Filtering Data using LIKE Operator
- Performing Basic Aggregations using Aggregate Functions
- Performing Aggregations using GROUP BY
- Filtering Aggregated Data Using HAVING
- Global Sorting using ORDER BY
- Overview of DISTRIBUTE BY
- Sorting Data within Groups using SORT BY
- Using CLUSTERED BY

Joining Data Sets and Set Operations in Hive

- Overview of Nested Sub Queries
- Nested Sub Queries – Using IN Operator
- Nested Sub Queries – Using EXISTS Operator
- Overview of Joins in Hive
- Performing Inner Joins using Hive
- Performing Outer Joins using Hive
- Performing Full Outer Joins using Hive
- Map Side Join and Reduce Side Join in Hive
- Joining in Hive using Legacy Syntax
- Cross Joins in Hive
- Overview of Set Operations in Hive
- Perform Set Union between two Hive Query Results
- Set Operations – Intersect and Minus Not Supported

Windowing or Analytics Functions in Hive

- Prepare HR Database in Hive with Employees Table
- Overview of Analytics or Windowing Functions in Hive
- Performing Aggregations using Hive Queries
- Create Tables to Get Daily Revenue using CTAS in Hive
- Getting Lead and Lag using Windowing Functions in Hive
- Getting First and Last Values using Windowing Functions in Hive
- Applying Rank using Windowing Functions in Hive
- Applying Dense Rank using Windowing Functions in Hive
- Applying Row Number using Windowing Functions in Hive
- Difference Between rank, dense_rank, and row_number in Hive
- Understanding the order of execution of Hive Queries
- Overview of Nested Sub Queries in Hive
- Filtering Data on Top of Window Functions in Hive
- Getting Top 5 Products by Revenue for Each Day using Windowing Functions in Hive – Recap

Running Queries using Impala

- Introduction to Impala
- Role of Impala Daemons
- Impala State Store and Catalog Server
- Overview of Impala Shell
- Relationship between Hive and Impala
- Overview of Creating Databases and Tables using Impala
- Loading and Inserting Data into Tables using Impala
- Running Queries using Impala Shell
- Reviewing Logs of Impala Queries
- Synching Hive and Impala – Using Invalidate Metadata
- Running Scripts using Impala Shell
- Assignment – Using NYSE Data
- Assignment – Solution

Getting Started with Sqoop

- Introduction to Sqoop
- Validate Source Database – MySQL
- Review JDBC Jar to Connect to MySQL
- Getting Help using Sqoop CLI
- Overview of Sqoop User Guide
- Validate Sqoop and MySQL Integration using Sqoop List Databases
- Listing Tables in Database using Sqoop
- Run Queries in MySQL using Sqoop Eval
- Understanding Logs in Sqoop
- Redirecting Sqoop Job Logs into Log Files

Importing data from MySQL to HDFS using Sqoop Import

- Overview of Sqoop Import Command
- Import Orders using target-dir
- Import Order Items using warehouse-dir
- Managing HDFS Directories
- Sqoop Import Execution Flow
- Reviewing Logs of Sqoop Import
- Sqoop Import Specifying Number of Mappers
- Review the Output Files generated by Sqoop Import
- Sqoop Import Supported File Formats
- Validating avro files using Avro Tools
- Sqoop Import Using Compression

Apache Sqoop – Importing Data into HDFS – Customizing

- Introduction to customizing Sqoop Import
- Sqoop Import by Specifying Columns
- Sqoop import Using Boundary Query
- Sqoop import while filtering Unnecessary Data
- Sqoop Import Using Split By to distribute import using non default column
- Getting Query Results using Sqoop eval
- Dealing with tables with Composite Keys while using Sqoop Import
- Dealing with tables with Non Numeric Key Fields while using Sqoop Import
- Dealing with tables with No Key Fields while using Sqoop Import
- Using autoreset-to-one-mapper to use only one mapper while importing data using Sqoop from tables with no key fields
- Default Delimiters used by Sqoop Import for Text File Format
- Specifying Delimiters for Sqoop Import using Text File Format
- Dealing with Null Values using Sqoop Import
- Import Multiple Tables from source database using Sqoop Import

Importing data from MySQL to Hive Tables using Sqoop Import

- Quick Overview of Hive
- Create Hive Database for Sqoop Import
- Create Empty Hive Table for Sqoop Import
- Import Data into Hive Table from source database table using Sqoop Import
- Managing Hive Tables while importing data using Sqoop Import using Overwrite
- Managing Hive Tables while importing data using Sqoop Import – Errors Out If Table Already Exists
- Understanding Execution Flow of Sqoop Import into Hive tables
- Review Files generated by Sqoop Import in Hive Tables
- Sqoop Delimiters vs Hive Delimiters
- Different File Formats supported by Sqoop Import while importing into Hive Tables
- Sqoop Import all Tables into Hive from source database

Exporting Data from HDFS/Hive to MySQL using Sqoop Export

- Introduction to Sqoop Export
- Prepare Data for Sqoop Export
- Create Table in MySQL for Sqoop Export
- Perform Simple Sqoop Export from HDFS to MySQL table
- Understanding Execution Flow of Sqoop Export
- Specifying Number of Mappers for Sqoop Export
- Troubleshooting the Issues related to Sqoop Export
- Merging or Upserting Data using Sqoop Export – Overview
- Quick Overview of MySQL – Upsert using Sqoop Export
- Update Data using Update Key using Sqoop Export
- Merging Data using allowInsert in Sqoop Export
- Specifying Columns using Sqoop Export
- Specifying Delimiters using Sqoop Export
- Using Stage Table for Sqoop Export

Submitting Sqoop Jobs and Incremental Sqoop Imports

- Introduction to Sqoop Jobs
- Adding Password File for Sqoop Jobs
- Creating Sqoop Job
- Run Sqoop Job
- Overview of Incremental Loads using Sqoop
- Incremental Sqoop Import – Using Where
- Incremental Sqoop Import – Using Append Mode
- Incremental Sqoop Import – Create Table
- Incremental Sqoop Import – Create Sqoop Job
- Incremental Sqoop Import – Execute Job
- Incremental Sqoop Import – Add Additional Data
- Incremental Sqoop Import – Rerun Job
- Incremental Sqoop Import – Using Last Modified

Here are the objectives for this course.

Provide Structure to the Data

Use Data Definition Language (DDL) statements to create or alter structures in the metastore for use by Hive and Impala.

- Create tables using a variety of data types, delimiters, and file formats
- Create new tables using existing tables to define the schema
- Improve query performance by creating partitioned tables in the metastore
- Alter tables to modify the existing schema
- Create views in order to simplify queries

Data Analysis

Use Query Language (QL) statements in Hive and Impala to analyze data on the cluster.

- Prepare reports using SELECT commands including unions and subqueries
- Calculate aggregate statistics, such as sums and averages, during a query
- Create queries against multiple data sources by using join commands
- Transform the output format of queries by using built-in functions
- Perform queries across a group of rows using windowing functions
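
To give a feel for the windowing objective, here is a minimal Python sketch of the difference between rank, dense_rank, and row_number. It is only an illustration under stated assumptions: it runs on Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in engine rather than on Hive or Impala, and the table name daily_product_revenue, its columns, and the sample rows are all hypothetical. The same three functions exist in Hive, but check the Hive documentation for exact syntax and behavior there.

```python
import sqlite3

# Hypothetical sample data: (order_date, product_id, revenue).
# Two products tie on revenue on the first day to expose the ranking differences.
ROWS = [
    ("2014-01-01", 1, 300.0),
    ("2014-01-01", 2, 300.0),
    ("2014-01-01", 3, 100.0),
    ("2014-01-02", 1, 50.0),
    ("2014-01-02", 2, 75.0),
]

# rank leaves gaps after ties, dense_rank does not, row_number is always unique.
# product_id is added to the row_number ordering only to make ties deterministic.
QUERY = """
SELECT order_date,
       product_id,
       revenue,
       rank()       OVER (PARTITION BY order_date ORDER BY revenue DESC) AS rnk,
       dense_rank() OVER (PARTITION BY order_date ORDER BY revenue DESC) AS drnk,
       row_number() OVER (PARTITION BY order_date
                          ORDER BY revenue DESC, product_id) AS rn
FROM daily_product_revenue
ORDER BY order_date, rn
"""


def ranked_rows():
    """Load the sample rows into an in-memory table and return the ranked result."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE daily_product_revenue "
            "(order_date TEXT, product_id INTEGER, revenue REAL)"
        )
        conn.executemany("INSERT INTO daily_product_revenue VALUES (?, ?, ?)", ROWS)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in ranked_rows():
        print(row)
```

On the first day the two tied products both get rank 1 and dense rank 1, while the third product gets rank 3 but dense rank 2; row_number simply counts 1, 2, 3.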

Exercises are provided so that you get enough practice with Sqoop as well as with writing queries in Hive and Impala.

All the demos are given on our state-of-the-art Big Data cluster. If you do not have a multi-node cluster, you can sign up for our labs and practice Sqoop and Hive on our multi-node cluster.

Udemy Coupon:

ITV20230401FREE

What you will learn:

1. Overview of the Big Data ecosystem, such as Hadoop HDFS, YARN, Map Reduce, Sqoop, Hive, etc.
2. Overview of HDFS commands such as put or copyFromLocal, get or copyToLocal, cat, etc., along with concepts such as block size and replication factor
3. Managing Tables in Hive Metastore using DDL Commands
4. Load or Insert data into Hive Metastore Tables using commands such as LOAD and INSERT
5. Overview of Functions in Hive to manipulate strings, dates, etc.
6. Writing Basic Hive QL Queries using WHERE, JOIN, GROUP BY, etc.
7. Analytical or Windowing Functions in Hive
8. Overview of Impala and understanding similarities and differences between Hive and Impala
9. Getting Started with Sqoop by reviewing the official documentation and exploring commands such as Sqoop eval
10. Importing Data from RDBMS tables into HDFS using Sqoop Import
11. Importing Data from RDBMS tables into Hive tables using Sqoop Import
12. Exporting Data from Hive or HDFS to RDBMS tables using Sqoop Export
13. Incremental Imports using Sqoop Import into HDFS or Hive Tables
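
As a rough sketch of what the HDFS and Sqoop commands in the list above look like, the following Python helpers assemble the argument lists for `hdfs dfs -put` and a basic `sqoop import`. The helper names, the JDBC URL, the user name, and every path are hypothetical placeholders rather than values from the course. Only standard flags are used (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--num-mappers`). The script only prints the commands; running them requires a cluster with Hadoop and Sqoop installed.

```python
import shlex


def hdfs_put(local_path, hdfs_dir):
    """Argument list for copying a local file or directory into HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]


def sqoop_import(jdbc_url, username, password_file, table, target_dir, num_mappers=4):
    """Argument list for importing one RDBMS table into an HDFS directory."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    put_cmd = hdfs_put("/data/retail_db/orders", "/user/analyst/retail_db")
    import_cmd = sqoop_import(
        jdbc_url="jdbc:mysql://db.example.com:3306/retail_db",
        username="retail_user",
        password_file="/user/analyst/.sqoop.password",
        table="orders",
        target_dir="/user/analyst/sqoop_import/orders",
        num_mappers=2,
    )
    # shlex.join (Python 3.8+) quotes arguments so the line can be pasted into a shell.
    print(shlex.join(put_cmd))
    print(shlex.join(import_cmd))
```

Building the command as a list rather than a single string means it can be handed to `subprocess.run` without shell quoting problems, should you choose to execute it from Python.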