Sqoop Tutorial
Sqoop is designed to transfer data efficiently between an RDBMS source and the Hadoop system. Sqoop is used both to import data into and export data out of Hadoop. In this Sqoop tutorial, we dive in and discuss Sqoop.
It can transfer data from RDBMS systems such as MySQL, Oracle, Teradata, SQL Server, etc. Sqoop minimizes the overhead of writing custom applications or MapReduce programs to import or export data between source systems. It uses mappers to import data; by default, Sqoop uses 4 mappers to ingest.
Note: Sqoop is not designed to work with systems other than an RDBMS (e.g., flat files, streaming sources).
Sqoop Architecture
- The Sqoop client submits the job.
- The job is distributed across the cluster, and each mapper works on its part of the Sqoop job.
- Each mapper connects to the database and starts its import.
- Sqoop does not require a reducer to import.
- Each node in the cluster should have access to the database system.
For example, when the Sqoop client submits a job that is distributed across three nodes, each node works on its part through its own mapper and delivers the result faster.
Sqoop can perform several other activities apart from import. A few are listed below, and we will discuss each in this Sqoop tutorial.
Sqoop eval: Allows users to quickly run and validate SQL queries against the database.
Sqoop import: Imports data from an RDBMS source to HDFS.
Sqoop export: Allows users to export data from HDFS to the RDBMS system.
Sqoop list-databases: Lists the databases available on the database server.
Sqoop list-tables: Lists the tables present in a database.
Basic Options
The table below lists commonly used Sqoop generic command-line arguments.
| Command-line argument | Details |
| --- | --- |
| --connect | Specify the JDBC connection string of the database |
| --username | Specify the username of the database |
| --password | Specify the password of the database |
| --table | Name of the table to be imported |
| --target-dir | HDFS directory where the data is to be stored |
| --columns | Specify the columns of the table to import |
| --split-by | Specify the column used to split records to increase parallelism |
| --fields-terminated-by | Field separator in HDFS |
| --where | Specify the filter condition in a free-form import statement |
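To see how these arguments fit together, here is an illustrative import combining several of them (the database, credentials, column names, and paths are placeholders, not from a real system):
sqoop import --connect jdbc:mysql://localhost/testdb --username dbuser --password *** --table employee --columns "empno,name,age" --where "age > 30" --split-by empno --fields-terminated-by "," --target-dir /user/hadoop/employee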
Import Statement
MySQL:
sqoop import --connect jdbc:mysql://localhost/testdb --username dbuser --password *** --table employee
SQL Server:
sqoop import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver --connect "jdbc:sqlserver://localhost;databaseName=testdb" --username dbuser --password *** --table employee
Teradata:
sqoop import --driver com.teradata.jdbc.TeraDriver --connect "jdbc:teradata://localhost/DATABASE=testdb" --username dbuser --password *** --table employee
- Users have to add the JDBC driver jars to the Sqoop library folder (see the example below).
- Users may have to provide the driver class explicitly via the '--driver' parameter so that Sqoop knows which driver class to use.
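For example, a minimal setup for the SQL Server driver might look like the following (the jar file name and the $SQOOP_HOME location are assumptions; use the paths of your own installation):
cp sqljdbc42.jar $SQOOP_HOME/lib/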
Parallelism in Sqoop
Since Sqoop is used to import data from an RDBMS, in this Sqoop tutorial we will also discuss parallelism. Users need to increase the parallelism of the import process when a table holds a large volume of data. Mappers are used to increase parallelism in a Sqoop statement.
The '-m' or '--num-mappers' argument is used to control the number of mappers in a Sqoop statement.
Example:
One mapper
sqoop import --connect jdbc:mysql://localhost/test --table employee -m 1 --target-dir /user/hadoop/demo1
Multiple mappers
To use parallelism, the user needs to add the '--split-by' argument. The 'split-by' column is used to split the work and divide portions of it across multiple nodes.
For example, if the number of mappers is 2 and the split column is 'id', then Sqoop will generate 2 mappers and divide the 'id' column values between the 2 mappers.
BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `employee`.
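Sqoop first runs this bounding-values query and then assigns each mapper an even slice of the range. Assuming 'id' values run from 1 to 1000, the two mappers would execute queries roughly like the following (a sketch; the actual boundaries depend on your data):
SELECT * FROM employee WHERE id >= 1 AND id < 501;    -- mapper 1
SELECT * FROM employee WHERE id >= 501 AND id <= 1000; -- mapper 2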
sqoop import --connect jdbc:mysql://localhost/testdb --table employee -m 2 --split-by id --target-dir /user/hadoop/demo2
The number of output files depends upon the number of mappers ('-m') passed in the Sqoop query. In the above example, we will have 2 output files in the HDFS location. The same can be verified in the Sqoop log.
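You can also verify this by listing the target directory; each mapper writes one part file with the standard MapReduce naming scheme:
hdfs dfs -ls /user/hadoop/demo2
Expect two files, part-m-00000 and part-m-00001, one per mapper.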
Separator/Escape/Enclosed Character
Separator: You can set the field separator to a specific delimiter. The below example demonstrates how to separate fields with a colon (':').
sqoop import --connect jdbc:mysql://localhost/test --table employee --target-dir /ABCDE --fields-terminated-by ":"
Note: In Sqoop, the default field separator is a comma (,).
Enclosed: Sqoop can enclose each field of the data set with a specified character using the '--enclosed-by' argument.
sqoop import --connect jdbc:mysql://localhost/test --table employee --target-dir /ABCDE --fields-terminated-by ":" --enclosed-by "'";
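Escape: To escape delimiter characters that appear inside field values, use the '--escaped-by' argument, often together with '--optionally-enclosed-by'. A sketch (the backslash escape character here is an assumption; choose whatever your data requires):
sqoop import --connect jdbc:mysql://localhost/test --table employee --target-dir /ABCDE --fields-terminated-by ":" --optionally-enclosed-by "'" --escaped-by "\\";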
Sqoop can import data into multiple file formats.
sqoop import --connect jdbc:mysql://localhost/test --table employee --target-dir /user/admin --as-avrodatafile;
sqoop import --connect jdbc:mysql://localhost/test --table employee --target-dir /user/admin --as-sequencefile;
sqoop import --connect jdbc:mysql://localhost/test --table employee --target-dir /user/admin --as-textfile;
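Sqoop 1.4.6 and later can also write Parquet, if your version supports it:
sqoop import --connect jdbc:mysql://localhost/test --table employee --target-dir /user/admin --as-parquetfile;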
Hive/HBase Import
Sqoop can also import data into Hive. Sqoop can generate a CREATE TABLE statement and import data into the table. The below example shows how to create the table and import the data in one step.
sqoop import --connect jdbc:mysql://localhost/test --table employee --target-dir /ABCDE --fields-terminated-by ":" --hive-import --hive-table employee --as-textfile;
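Once the job finishes, a quick way to verify that the data landed in Hive (assuming the table was created in the default database):
hive -e "SELECT * FROM employee LIMIT 5;"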
HBase Import:
sqoop import --connect jdbc:mysql://localhost/test --username root --password password --table employee --hbase-table emp_hbase --column-family empdetails --hbase-row-key empid;
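If the HBase table does not already exist, add '--hbase-create-table' to the import. You can then inspect the imported rows from the HBase shell:
hbase shell
scan 'emp_hbase'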
Free-Form Query:
Sqoop allows a free-form SQL query to be passed using the '--query' option.
sqoop import --connect jdbc:mysql://localhost/testdb -m 1 --query 'select id,name from student WHERE $CONDITIONS' --target-dir /user/admin
Note: While importing using a free-form query, users need to specify '--target-dir' for the HDFS location.
Conditional expression in a free-form query
Every free-form query must include the literal token '$CONDITIONS' in its WHERE clause; Sqoop replaces it at runtime with the split condition for each mapper, as shown in the example below.
sqoop import --connect jdbc:mysql://localhost/testdb --username root --password hadoop --query 'select id,name from employee WHERE name="John" AND $CONDITIONS' --target-dir /user/admin/query
Import All Tables:
The 'import-all-tables' tool imports all the tables from the RDBMS system to HDFS, i.e., every table in the specified database.
sqoop import-all-tables (generic-args) (import-args)
sqoop import-all-tables --connect jdbc:mysql://localhost/testdb
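Two useful refinements of this tool: '--warehouse-dir' sets the parent HDFS directory under which each table gets its own subdirectory, and '--exclude-tables' skips tables you do not need (the directory and table names here are illustrative):
sqoop import-all-tables --connect jdbc:mysql://localhost/testdb --warehouse-dir /user/admin/testdb --exclude-tables audit_log,tmp_stage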
Export Table
The Sqoop export tool helps us transfer data quickly and efficiently from HDFS to the RDBMS system. Before exporting any table, users need to keep the below points in mind.
- The table should already exist in the target RDBMS database.
- The table schema and the data layout in HDFS should match.
Example:
Let's take the example of an employee file that contains attributes like the below table.
| empno | name | dept | age |
| --- | --- | --- | --- |
| 1 | John | CS | 35 |
| 2 | Jack | IT | 40 |
| 3 | Sam | INST | 25 |
| 4 | Matt | INST | 28 |
| 5 | Joo | MS | 45 |
| 6 | Alex | TELC | 34 |
Step 1: Create a table in MySQL. The following query is used to create the table (note that the column names must match the data layout above):
CREATE TABLE employee( empno INT NOT NULL PRIMARY KEY, name VARCHAR(20), dept VARCHAR(10), age INT );
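For reference, the HDFS input file (/user/admin/DataSample.txt in the export below) must contain rows matching this schema. With a comma as the field separator, it would look like this (a sketch built from the table above):
1,John,CS,35
2,Jack,IT,40
3,Sam,INST,25
4,Matt,INST,28
5,Joo,MS,45
6,Alex,TELC,34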
Step 2: Now export the table from HDFS to MySQL.
sqoop export --connect jdbc:mysql://localhost/testdb --table employee --username root --password *** --export-dir /user/admin/DataSample.txt;
Here the '--table' argument defines the target table in MySQL, and 'testdb' is the database name.
Step 3: After successful completion of the Sqoop job, you can verify the table in the RDBMS under the 'testdb' database.
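Sqoop export can also update existing rows instead of only inserting new ones. A sketch using '--update-key' (adding '--update-mode allowinsert' upserts rows that do not exist yet):
sqoop export --connect jdbc:mysql://localhost/testdb --table employee --username root --password *** --export-dir /user/admin/DataSample.txt --update-key empno --update-mode allowinsert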
Eval
Eval is used to preview and validate queries to ensure the expected output. An eval query is not distributed across the Hadoop cluster; it executes locally on the node where it is submitted. This command is generally used for testing the query output.
Example:
sqoop eval --connect jdbc:mysql://localhost/testdb --username root --password test --query "select * from employee limit 10";
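Eval is not limited to SELECT statements; it can execute DML as well via '--query' or its '-e' shorthand. For example, a quick test insert (the row values are illustrative):
sqoop eval --connect jdbc:mysql://localhost/testdb --username root --password test -e "INSERT INTO employee VALUES (7,'Ann','HR',30)";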
List Databases and Tables
In this part of the Sqoop tutorial, we will discuss listing databases and tables. Sqoop can list the databases and tables of any RDBMS system. This is useful when users need to verify which databases and tables exist.
sqoop-list-databases: This tool helps to list all the databases.
sqoop list-databases --connect jdbc:mysql://localhost/
sqoop-list-tables: This tool helps to list all the tables in a given database.
sqoop list-tables --connect jdbc:mysql://localhost/testdb
Conclusion:
In this Sqoop tutorial blog, we discussed Sqoop and its usage. Refer to the official Apache Sqoop documentation for more information about Sqoop. Comment here if you like the blog.