Sqoop

Sqoop Tutorial

Sqoop is designed to transfer data efficiently between an RDBMS source and the Hadoop system. Sqoop is used both to import data into and export data from Hadoop. In this sqoop tutorial, we dive in and discuss sqoop.

It can transfer data from RDBMS systems such as MySQL, Oracle, Teradata, SQL Server, etc. Sqoop minimizes the overhead of writing custom applications or MapReduce programs to import or export data to and from these source systems. It uses mappers to import data; by default, sqoop uses 4 mappers to ingest.

Note: Sqoop is not designed to work with systems other than an RDBMS (e.g. flat files, streaming sources, etc.).

Sqoop Architecture

  • The Sqoop client submits the job.
  • The job is distributed across the cluster, and each mapper works on its part of the sqoop job.
  • Each mapper connects to the database and starts the import.
  • Sqoop does not require a reducer to import.
  • Each node in the cluster should have access to the database system.

[Figure: Sqoop architecture diagram]

In the above diagram, the Sqoop client submits the job and the job is distributed across three nodes. Each node runs its share of the mapper work and delivers its portion of the result, which speeds up the transfer.

Sqoop can perform a lot of other activities apart from import. A few are listed below, and we will discuss each in this sqoop tutorial.

Sqoop eval: Allows users to quickly run and validate SQL queries.
Sqoop import: Imports data from an RDBMS source to HDFS.
Sqoop export: Allows users to export data from HDFS to the RDBMS system.
Sqoop list-databases: Lists the databases present on a database server.
Sqoop list-tables: Lists the tables present in a database.
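
As a quick reference, sqoop's built-in help lists all available tools and the arguments each one accepts:

sqoop help
sqoop help import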

Basic Options

The table below lists commonly used sqoop generic arguments.

Command-line argument      Details
--connect                  Specify the JDBC connection string for the database.
--username                 Specify the username for the database.
--password                 Specify the password for the database.
--table                    Name of the table to be imported.
--target-dir               HDFS directory where the data is to be stored.
--columns                  Specify the columns of the table to import.
--split-by                 Specify the column used to split records to increase parallelism.
--fields-terminated-by     Field separator to use in HDFS.
--where                    Specify a filter condition to restrict the rows imported.
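
Putting several of these arguments together, a typical filtered import might look like the sketch below (the columns, filter, and target directory are placeholders, not from a real schema):

sqoop import --connect jdbc:mysql://localhost/testdb \
--username dbuser \
--password *** \
--table employee \
--columns "id,name,dept" \
--where "age > 30" \
--target-dir /user/hadoop/employee_filtered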

Import Statement

MySQL:

sqoop import --connect jdbc:mysql://localhost/testdb \
--username dbuser \
--password *** \
--table employee

SQL Server:

sqoop import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--connect "jdbc:sqlserver://localhost;databaseName=testdb" \
--username dbuser \
--password *** \
--table employee

Teradata:

sqoop import --driver com.teradata.jdbc.TeraDriver \
--connect "jdbc:teradata://localhost/DATABASE=testdb" \
--username dbuser \
--password *** \
--table employee

  • The user has to add the JDBC driver jar to the Sqoop library folder.
  • Users may have to pass the driver class explicitly with the '--driver' parameter so that sqoop knows which driver class to load.
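
For MySQL, for example, this usually means copying the connector jar into Sqoop's lib directory (the exact path depends on your installation; $SQOOP_HOME here is an assumption):

# Copy the vendor's JDBC driver jar onto Sqoop's classpath
cp mysql-connector-java-*.jar $SQOOP_HOME/lib/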

Parallelism in Sqoop

Since sqoop is used to import data from an RDBMS, in this sqoop tutorial we will also discuss parallelism. Users need to increase the parallelism of the import process when a table holds a large volume of data. Mappers are used to increase parallelism in a sqoop statement.

The '-m' (or '--num-mappers') argument is used to control the number of mappers in a sqoop statement.

Example:

One mapper

 
sqoop import --connect jdbc:mysql://localhost/test \
--table employee \
-m 1 \
--target-dir /user/hadoop/demo1

Multiple mappers

In the case of parallelism, the user needs to add the '--split-by' argument. The 'split-by' column is used to split the work and divide portions of it across multiple mappers.

For example, if the number of mappers is 2 and the split column is 'id', then sqoop will run 2 mappers and divide the range of 'id' values between them.

Sqoop first determines that range with a bounding values query: SELECT MIN(`id`), MAX(`id`) FROM `employee`.

sqoop import --connect jdbc:mysql://localhost/testdb \
--table employee \
-m 2 --split-by id \
--target-dir /user/hadoop/demo2

The number of output files depends on the number of mappers ('-m') passed in the sqoop query. In the above example, we will have 2 output files in the hdfs location. The same can be verified in the sqoop log.
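
You can also verify this directly in HDFS; with 2 mappers the target directory should contain one part file per mapper:

hdfs dfs -ls /user/hadoop/demo2
# Expect part-m-00000 and part-m-00001, plus a _SUCCESS marker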

Separator/Escape/Enclosed Character

Separator: You can set the field separator to a specific delimiter. The below example demonstrates how to separate fields with a colon (':').

sqoop import --connect jdbc:mysql://localhost/test \
--table employee \
--target-dir /ABCDE \
--fields-terminated-by ":"

Note: In sqoop the default field separator is a comma (,).

Enclosed-by: Sqoop can enclose each field of the data set with a specified character using '--enclosed-by'.

 
sqoop import --connect jdbc:mysql://localhost/test \
--table employee \
--target-dir /ABCDE \
--fields-terminated-by ":" \
--enclosed-by "'"
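
Two related options are '--optionally-enclosed-by', which encloses only the fields that actually contain the delimiter, and '--escaped-by', which sets the escape character. A sketch combining them:

sqoop import --connect jdbc:mysql://localhost/test \
--table employee \
--target-dir /ABCDE \
--fields-terminated-by ":" \
--optionally-enclosed-by '"' \
--escaped-by "\\"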

Sqoop can import data into multiple file formats.

 
sqoop import --connect jdbc:mysql://localhost/test \
--table employee \
--target-dir /user/admin \
--as-avrodatafile

sqoop import --connect jdbc:mysql://localhost/test \
--table employee \
--target-dir /user/admin \
--as-sequencefile

sqoop import --connect jdbc:mysql://localhost/test \
--table employee \
--target-dir /user/admin \
--as-textfile
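
Newer Sqoop releases (1.4.6 and later) also support Parquet through '--as-parquetfile':

sqoop import --connect jdbc:mysql://localhost/test \
--table employee \
--target-dir /user/admin \
--as-parquetfile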

Hive/HBase Import

Sqoop can also import data directly into Hive. Sqoop can generate the Hive CREATE TABLE statement and load the imported data into the table. The below example shows how to create the Hive table and import data in one step.

 
sqoop import --connect jdbc:mysql://localhost/test \
--table employee \
--target-dir /ABCDE \
--fields-terminated-by ":" \
--hive-import \
--hive-table employee \
--as-textfile
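
Two related flags are worth knowing: '--create-hive-table' makes the job fail if the Hive table already exists (a safety check), and '--hive-overwrite' replaces any existing data in the table. A sketch of a repeatable reload:

sqoop import --connect jdbc:mysql://localhost/test \
--table employee \
--hive-import \
--hive-table employee \
--hive-overwrite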

HBase Import:

 
sqoop import --connect jdbc:mysql://localhost/test \
--username root --password password \
--table employee \
--hbase-table emp_hbase \
--column-family empdetails \
--hbase-row-key empid
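
If the HBase table does not exist yet, adding '--hbase-create-table' asks Sqoop to create the table and column family before loading:

sqoop import --connect jdbc:mysql://localhost/test \
--username root --password password \
--table employee \
--hbase-table emp_hbase \
--column-family empdetails \
--hbase-row-key empid \
--hbase-create-table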

Free-Form Query:

Sqoop allows a free-form SQL query to be passed using the '--query' option.

sqoop import --connect jdbc:mysql://localhost/testdb \
-m 1 \
--query 'select id, name from student where $CONDITIONS' \
--target-dir /user/admin

Note: While importing using a free-form query, users need to specify '--target-dir' for the hdfs location, and the query must include the literal token $CONDITIONS (explained below).

Conditional expressions in free-form queries

Sqoop replaces the '$CONDITIONS' token with its own split conditions at run time, so every free-form query must contain it. When your query already has a 'WHERE' clause, append $CONDITIONS to it with AND, as in the example below.

 
sqoop import --connect jdbc:mysql://localhost/testdb \
--username root \
--password hadoop \
--query 'select id,name from employee WHERE name="John" AND $CONDITIONS' \
--target-dir /user/admin/query
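
When a free-form query runs with more than one mapper, Sqoop also requires '--split-by', because it cannot infer a split column from an arbitrary query (the target directory below is a placeholder):

sqoop import --connect jdbc:mysql://localhost/testdb \
--username root \
--password hadoop \
--query 'select id,name from employee where $CONDITIONS' \
--split-by id \
-m 2 \
--target-dir /user/admin/query2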

Import All Tables:

The 'import-all-tables' tool imports all the tables from the specified database in the RDBMS system to HDFS. Note that each table must have a single-column primary key (or the import must be forced to a single mapper) for this to work.

 
sqoop import-all-tables (generic-args) (import-args)
 
sqoop import-all-tables --connect jdbc:mysql://localhost/testdb 
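
Specific tables can be skipped with '--exclude-tables', which takes a comma-separated list (the table names below are placeholders):

sqoop import-all-tables --connect jdbc:mysql://localhost/testdb \
--exclude-tables audit_log,tmp_staging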

Export Table

The Sqoop export tool helps us transfer data quickly and efficiently from HDFS to the RDBMS system. Before exporting any table, users need to keep the below points in mind.

  • The target table should already exist in the RDBMS database.
  • The column order and data types of the HDFS files should match the table schema.

Example:

Let's take an example of an employee file whose records contain the attributes shown in the table below.

empno  name  dept  age
1      John  CS    35
2      Jack  IT    40
3      Sam   INST  25
4      Matt  INST  28
5      Joo   MS    45
6      Alex  TELC  34

Step 1: Create a table in MySQL. The following query is used to create the table.

CREATE TABLE employee(
  empno INT NOT NULL PRIMARY KEY,
  name VARCHAR(20),
  dept VARCHAR(10),
  age INT
);

Step 2: Now export the table from HDFS to MySQL.

 
sqoop export --connect jdbc:mysql://localhost/testdb \
--table employee \
--username root \
--password *** \
--export-dir /user/admin/DataSample.txt

Here the '--table' argument names the target table in MySQL, and 'testdb' is the database name.

Step 3: After successful completion of the sqoop job, you can verify the table in the RDBMS under the 'testdb' database.
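
By default, export issues INSERT statements. If some rows may already exist in the target table, '--update-key' switches the export to UPDATEs keyed on the given column, and '--update-mode allowinsert' additionally inserts rows that are missing (allowinsert support depends on the connector):

sqoop export --connect jdbc:mysql://localhost/testdb \
--table employee \
--username root \
--password *** \
--export-dir /user/admin/DataSample.txt \
--update-key empno \
--update-mode allowinsert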

Eval

Eval is used to preview and validate import queries to ensure the expected output. An eval query is not distributed across the Hadoop cluster; it executes locally on the node where the command is run. This command is generally used for testing the shape of the query output.

Example:

 
sqoop eval --connect jdbc:mysql://localhost/testdb --username root --password test --query "select * from employee limit 10"
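
Eval is not limited to SELECT statements; it can also run DML, which makes it handy for seeding test rows. '-e' is a shorthand for '--query' (the row values below are made up):

sqoop eval --connect jdbc:mysql://localhost/testdb --username root --password test \
-e "INSERT INTO employee VALUES (7, 'Tina', 'CS', 31)"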

List Databases and Tables

In this part of the sqoop tutorial, we will discuss listing databases and tables. Sqoop can list details of databases and tables in any RDBMS system. This is useful when users need to verify tables and databases.

sqoop-list-databases: This tool helps to list all the databases.

 
sqoop list-databases --connect jdbc:mysql://localhost/

sqoop-list-tables: This tool helps to list all the tables in a given database.

 
sqoop list-tables --connect jdbc:mysql://localhost/testdb

Conclusion:

In this sqoop tutorial blog, we discussed sqoop and its usage. Refer to the official Apache Sqoop documentation for more information. Comment below if you liked the blog.
