SQL - Handling Duplicates


SQL is a language used to manage and manipulate data in relational databases. One of the most common issues that arises while working with databases is the presence of duplicate records. Duplicates occur when the same data is entered into a table more than once, either accidentally or intentionally. Handling duplicates in SQL involves identifying, filtering, removing, or merging duplicate records in a table.

Why is Handling Duplicates in SQL Necessary?

There are several reasons why handling duplicates in a database becomes necessary. One of the main reasons is that duplicates in an organizational database lead to logical errors. In addition, redundant data has the following consequences −

  • Duplicate data occupies storage space, which reduces the efficiency of the database.

  • The increased use of resources raises the overall cost of maintaining the database.

  • Logical errors caused by duplicates make the conclusions drawn from data analysis erroneous as well.

Methods to Handle Duplicates

Several methods are available to handle duplicates when querying a database. They are listed below −

  • Using Distinct Keyword
  • Using Group By Clause
  • Using Union Clause

Let us learn more about these methods in detail below.

Using Distinct Keyword

We can handle duplicates in SQL by using the DISTINCT keyword. It is used with the SELECT statement to eliminate duplicate records from the result and retrieve only the unique ones.

Syntax

The basic syntax of the DISTINCT keyword to eliminate duplicate records is as follows.

SELECT DISTINCT column1, column2, ..., columnN
FROM table_name
WHERE [condition];

Example

Consider the CUSTOMERS table having the following records.

+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
|  1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
|  2 | Khilan   |  25 | Delhi     |  1500.00 |
|  3 | kaushik  |  23 | Kota      |  2000.00 |
|  4 | Chaitali |  25 | Mumbai    |  6500.00 |
|  5 | Hardik   |  27 | Bhopal    |  8500.00 |
|  6 | Komal    |  22 | MP        |  4500.00 |
|  7 | Muffy    |  24 | Indore    | 10000.00 |
+----+----------+-----+-----------+----------+

First, let us see how the following SELECT query returns duplicate salary records.

SQL> SELECT SALARY FROM CUSTOMERS
   ORDER BY SALARY;

This would produce the following result, where the salary of 2000.00 appears twice because two customers in the original table have that salary.

+----------+
| SALARY   |
+----------+
|  1500.00 |
|  2000.00 |
|  2000.00 |
|  4500.00 |
|  6500.00 |
|  8500.00 |
| 10000.00 |
+----------+

Now, let us use the DISTINCT keyword with the above SELECT query and see the result.

SQL> SELECT DISTINCT SALARY FROM CUSTOMERS
   ORDER BY SALARY;

Output

This would produce the following result, which contains no duplicate entries.

+----------+
| SALARY   |
+----------+
|  1500.00 |
|  2000.00 |
|  4500.00 |
|  6500.00 |
|  8500.00 |
| 10000.00 |
+----------+
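
DISTINCT applies to the whole combination of selected columns rather than to each column separately. As a small sketch against the CUSTOMERS table above (COUNT(DISTINCT ...) is standard SQL, but how it treats NULL values is worth checking on your own database), the following queries contrast these cases:

```sql
-- AGE 25 occurs twice in CUSTOMERS, so DISTINCT collapses it to one row.
SELECT DISTINCT AGE FROM CUSTOMERS
   ORDER BY AGE;

-- DISTINCT compares the (AGE, SALARY) pair as a whole; a row is dropped
-- only when both values match another row.
SELECT DISTINCT AGE, SALARY FROM CUSTOMERS
   ORDER BY AGE, SALARY;

-- Count how many different salary values exist.
SELECT COUNT(DISTINCT SALARY) AS UNIQUE_SALARIES FROM CUSTOMERS;
```

With the sample data, the first query should return six ages, the second should return all seven rows because no (AGE, SALARY) pair repeats, and the third should return a count of 6.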

Using Group By Clause

We can also merge duplicate records into one using the GROUP BY clause, which collapses all rows sharing the same values in the grouped columns into a single row. Following is the syntax to do so −

SELECT column_name(s) FROM table_name GROUP BY column_name(s);

Example

In this example, we create a new table named EMPLOYEE using the query below −

CREATE TABLE EMPLOYEE (
   EID INT NOT NULL,
   EMPLOYEE_NAME VARCHAR (30) NOT NULL,
   SALES_MADE DECIMAL (20)
);

Now, we can insert values into this empty table using the INSERT statement as follows −

INSERT INTO EMPLOYEE VALUES (102, 'SARIKA', 4500);
INSERT INTO EMPLOYEE VALUES (100, 'ALEKHYA', 3623);
INSERT INTO EMPLOYEE VALUES (101, 'REVATHI', 1291);
INSERT INTO EMPLOYEE VALUES (103, 'VIVEK', 3426);
INSERT INTO EMPLOYEE VALUES (100, 'ALEKHYA', 3623);

The Employee table consists of the details of employees in an organization and sales made by them.

+-----+---------------+------------+
| EID | EMPLOYEE_NAME | SALES_MADE |
+-----+---------------+------------+
| 102 | SARIKA        |       4500 |
| 100 | ALEKHYA       |       3623 |
| 101 | REVATHI       |       1291 |
| 103 | VIVEK         |       3426 |
| 100 | ALEKHYA       |       3623 |
+-----+---------------+------------+

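Using the following GROUP BY query, we merge the duplicate records present in the table into one record. Since GROUP BY alone does not guarantee any ordering, an ORDER BY clause is added to arrange the result in ascending order of EID.

SELECT EID, EMPLOYEE_NAME, SALES_MADE
FROM EMPLOYEE
GROUP BY EID, EMPLOYEE_NAME, SALES_MADE
ORDER BY EID;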

Output

The table displayed is as follows −

+-----+---------------+------------+
| EID | EMPLOYEE_NAME | SALES_MADE |
+-----+---------------+------------+
| 100 | ALEKHYA       |       3623 |
| 101 | REVATHI       |       1291 |
| 102 | SARIKA        |       4500 |
| 103 | VIVEK         |       3426 |
+-----+---------------+------------+
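
GROUP BY can also be used to identify which records are duplicated, by adding a HAVING clause that keeps only groups with more than one row. The sketch below shows this against the EMPLOYEE table, followed by one possible way to store a deduplicated copy. The table name EMPLOYEE_UNIQUE is hypothetical, and the CREATE TABLE ... AS SELECT form is an assumption about the database in use: it is accepted by systems such as MySQL, PostgreSQL, SQLite and Oracle, while SQL Server uses SELECT ... INTO instead.

```sql
-- List each record that occurs more than once, with its number of copies.
SELECT EID, EMPLOYEE_NAME, SALES_MADE, COUNT(*) AS COPIES
FROM EMPLOYEE
GROUP BY EID, EMPLOYEE_NAME, SALES_MADE
HAVING COUNT(*) > 1;

-- Assumption: EMPLOYEE_UNIQUE is a hypothetical table name, and the
-- CREATE TABLE ... AS SELECT syntax varies between database systems.
CREATE TABLE EMPLOYEE_UNIQUE AS
SELECT DISTINCT EID, EMPLOYEE_NAME, SALES_MADE
FROM EMPLOYEE;
```

With the sample data, the first query should return a single row for EID 100 (ALEKHYA, 3623) with COPIES equal to 2, and the new table should hold the four unique records.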

Using Union

UNION is an operator in SQL that works like the union operator in relational algebra. It combines the rows returned by multiple SELECT queries over tables that are union compatible, that is, tables with the same number of columns and compatible data types.

Only distinct rows from the tables are added to the result table, as UNION automatically eliminates all the duplicate records.

Syntax

Following is the syntax of UNION operator in SQL −

SELECT * FROM table1
UNION
SELECT * FROM table2;

Example

Let us first create two tables, COURSES_PICKED and EXTRA_COURSES_PICKED, with the same number of columns having the same data types.

Create table COURSES_PICKED using the following query −

CREATE TABLE COURSES_PICKED(
   STUDENT_ID INT NOT NULL, 
   STUDENT_NAME VARCHAR(30) NOT NULL, 
   COURSE_NAME VARCHAR(30) NOT NULL
);

Insert values into the COURSES_PICKED table with the help of the query given below −

INSERT INTO COURSES_PICKED VALUES(1, 'JOHN', 'ENGLISH');
INSERT INTO COURSES_PICKED VALUES(2, 'ROBERT', 'COMPUTER SCIENCE');
INSERT INTO COURSES_PICKED VALUES(3, 'SASHA', 'COMMUNICATIONS');
INSERT INTO COURSES_PICKED VALUES(4, 'JULIAN', 'MATHEMATICS');

The table will be displayed as −

+------------+--------------+------------------+
| STUDENT_ID | STUDENT_NAME | COURSE_NAME      |
+------------+--------------+------------------+
|          1 | JOHN         | ENGLISH          |
|          2 | ROBERT       | COMPUTER SCIENCE |
|          3 | SASHA        | COMMUNICATIONS   |
|          4 | JULIAN       | MATHEMATICS      |
+------------+--------------+------------------+

Create table EXTRA_COURSES_PICKED using the following query −

CREATE TABLE EXTRA_COURSES_PICKED(
   STUDENT_ID INT NOT NULL, 
   STUDENT_NAME VARCHAR(30) NOT NULL, 
   EXTRA_COURSE_NAME VARCHAR(30) NOT NULL
);

Following is the query to insert values into the EXTRA_COURSES_PICKED table −

INSERT INTO EXTRA_COURSES_PICKED VALUES(1, 'JOHN', 'PHYSICAL EDUCATION');
INSERT INTO EXTRA_COURSES_PICKED VALUES(2, 'ROBERT', 'GYM');
INSERT INTO EXTRA_COURSES_PICKED VALUES(3, 'SASHA', 'FILM');
INSERT INTO EXTRA_COURSES_PICKED VALUES(4, 'JULIAN', 'MATHEMATICS');

The table will be created as shown below −

+------------+--------------+--------------------+
| STUDENT_ID | STUDENT_NAME | EXTRA_COURSE_NAME  |
+------------+--------------+--------------------+
|          1 | JOHN         | PHYSICAL EDUCATION |
|          2 | ROBERT       | GYM                |
|          3 | SASHA        | FILM               |
|          4 | JULIAN       | MATHEMATICS        |
+------------+--------------+--------------------+

Now, let us try to combine both these tables using the UNION query as follows −

SELECT * FROM COURSES_PICKED
UNION
SELECT * FROM EXTRA_COURSES_PICKED;

Output

The resultant table obtained after performing the UNION operation is −

+------------+--------------+--------------------+
| STUDENT_ID | STUDENT_NAME | COURSE_NAME        |
+------------+--------------+--------------------+
|          1 | JOHN         | ENGLISH            |
|          1 | JOHN         | PHYSICAL EDUCATION |
|          2 | ROBERT       | COMPUTER SCIENCE   |
|          2 | ROBERT       | GYM                |
|          3 | SASHA        | COMMUNICATIONS     |
|          3 | SASHA        | FILM               |
|          4 | JULIAN       | MATHEMATICS        |
+------------+--------------+--------------------+

Since the record (4, JULIAN, MATHEMATICS) is present in both tables, the UNION operator eliminates the duplicate and returns it only once.
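
For contrast, the sketch below shows two related operators applied to the same two tables. UNION ALL keeps duplicates, and INTERSECT returns only the rows found in both tables, which can help locate the duplicates themselves. INTERSECT is standard SQL but is an assumption here, since it is not available in every database or version (older MySQL releases, for example, lack it).

```sql
-- UNION ALL skips the duplicate-elimination step, so the row
-- (4, 'JULIAN', 'MATHEMATICS') appears once per source table.
SELECT * FROM COURSES_PICKED
UNION ALL
SELECT * FROM EXTRA_COURSES_PICKED;

-- INTERSECT returns only the rows present in both tables.
SELECT * FROM COURSES_PICKED
INTERSECT
SELECT * FROM EXTRA_COURSES_PICKED;
```

With the sample data, the UNION ALL query should return eight rows instead of seven, and the INTERSECT query should return only the JULIAN record.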
