Data Theft and the Role of Big Data

How can Hadoop and Big Data help detect data theft?

By Kaushik Pal, Techalpine

Data theft has been a major problem for quite some time, and the long time it takes to identify a theft makes it worse. The longer a theft goes undetected, the harder it is to respond. Hadoop and Big Data can help organizations reduce the time needed to identify data theft and respond to it. A few organizations, as this article will show, already use Hadoop and Big Data to detect data theft quickly. Still, workable data theft solutions have only just started to appear, and it will be a long time before sound defenses against data theft are in place.

Data theft: some scary statistics

Well-known brands worldwide have suffered heavy losses of reputation and money because of data theft. Consider the following incidents:

  • In the US, over 8 years, a hacking group targeted banks, department stores and payment processors and stole more than 160 million credit and debit card numbers.
  • KT Corp, the Korean mobile carrier, suffered a huge loss of reputation when two suspects reportedly earned more than $850,000 by selling the plan details and contact information of more than 8.7 million KT subscribers.
  • Experian, one of the biggest data monitoring companies in the world, disclosed a huge breach of data belonging to customers who had applied for services at T-Mobile. The data included names, addresses, Social Security numbers, passport details and driver's license details.
  • JP Morgan Chase lost more than 76 million customer records when hackers stole customer account numbers, names and email IDs. To make matters worse, the theft was detected almost a month later.
  • Home Depot faced a massive loss of sensitive data when the credit card details of up to 56 million customers were stolen from its cash register systems. The breach was carried out with malware that Russian and Ukrainian hackers installed on those systems.

Many more such incidents happen every day. The following observations can be drawn from the examples above:

  • Data theft can breach even the strongest systems, because theft techniques evolve alongside the defenses against them.
  • Data theft cannot be eliminated, but it can be managed better.
  • If the systems of brands as well known as JP Morgan Chase and Experian can be breached, then almost nothing is safe.
  • Data theft protection systems need to do more than protect data. For example, they also need to identify a theft quickly and trace the footprints it leaves.

Role of Hadoop and Big Data in recovering stolen data

Data theft cannot be wiped out, and it can strike anytime, anywhere. The approach towards it, however, needs to change. While data security systems are being upgraded, early theft detection and recovery of lost data should also get attention. Hadoop and Big Data can play a role in quickly identifying an incident of data theft. A few companies have been working on data theft solutions. They are not even trying to prevent data theft, because they consider that impossible. They are working on the following two things:

  • Identifying data theft as quickly as possible so that the data can be tracked without wasting time.
  • Tracking stolen data on the Internet and the Dark Web.

The concept behind data theft solutions

The assumption behind data theft solutions is that it is almost impossible to stop data theft. The best way to approach a situation of data theft is to assume that it is inevitable and to quickly start looking for the data before it is lost.

There is a fundamental difference between stealing a tangible good and stealing data. A thief who takes a tangible good takes the original, but a data thief can only take a copy of the data. The original stays with its owner and can be used to track the copy on the web. The task comes down to comparing the original with its copy.

To match the original and its copy, you generate a hash code of the original and compare it with that of the copy. A hash code is a fixed-length value computed from a chunk of data that, for practical purposes, uniquely identifies that chunk. The technique used to generate the hash code is known as cryptographic hashing. According to one data intelligence company that specializes in data theft solutions, “It’s not code that’s embedded in the data so much as a computation done on the data itself”. You first divide the data into several chunks and then run each chunk through a mathematical function to generate a hash code. After that, you crawl the web, hash the data you find in the same way, and compare the results with the hash codes of the original. If a hash code of the original matches that of any other data, you have found your stolen data.
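The chunk-and-hash procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular vendor: the choice of SHA-256, the fixed 64-byte chunk size, and the names `fingerprints` and `match_ratio` are all assumptions made for the example.

```python
import hashlib

# Bytes per chunk. An illustrative value chosen for this sketch, not taken from the article.
CHUNK_SIZE = 64


def fingerprints(data: bytes, chunk_size: int = CHUNK_SIZE) -> set[str]:
    """Split data into fixed-size chunks and return the SHA-256 hex digest of each chunk."""
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }


def match_ratio(original: bytes, candidate: bytes) -> float:
    """Return the fraction of the original's chunk fingerprints that also occur in the candidate."""
    original_prints = fingerprints(original)
    if not original_prints:
        return 0.0
    shared = original_prints & fingerprints(candidate)
    return len(shared) / len(original_prints)


if __name__ == "__main__":
    original = b"A" * 64 + b"B" * 64   # stands in for the protected data
    crawled = b"C" * 64 + b"B" * 64    # stands in for a page found by a crawler
    print(match_ratio(original, crawled))
```

Only the fingerprints need to be compared, so the party doing the matching never has to hold the original data itself. Fixed-size chunks match only when the copy is aligned on the same boundaries as the original, so a real system would likely chunk per record or use content-defined boundaries instead.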

A few companies also call the entire process Fingerprint Matching. The hash codes of the data chunks are known as fingerprints of data and the action of matching hash codes is called fingerprint matching.

The data theft solutions are quite powerful because they are able to crawl even the Dark Web where the websites can hide their identity. In fact, crawling the Dark Web is claimed as one of the central characteristics of the data theft solutions.

Some data theft solutions also offer analytics and reporting capabilities for their clients. These solutions can be integrated with almost any security information and event management (SIEM) system, which can then receive their alerts.

Following is a typical work flow diagram for a standard security application.


Role of Hadoop and Big Data in finding stolen data

Obviously, matching data fingerprints requires handling an enormous volume of data.

The entire process of breaking data into chunks and generating hash codes involves enormous volumes of data. It is easy to imagine that the database of each data theft management company is overflowing with data. To process such a huge amount of data, these companies need a reliable Hadoop platform. Not just any Hadoop solution will do. It needs to be something like an enterprise-grade distribution of Hadoop that is implemented in native code rather than on the Java Virtual Machine, which makes Hadoop more resource-efficient.

The data theft solutions on the market depend completely on data chunks, or datasets. The more datasets there are, the higher the chance of matching fingerprints. So there is a need for a system that can handle large volumes of data, which is the kind of workload Hadoop and Big Data platforms are built for. According to Danny Rogers, “We are only as good as the data we collect, and our ability to collect more data depends on this key piece of technology.”

The role Hadoop plays here can serve as a template for tracking stolen data. You need large-scale, cloud-based automation on an enterprise-grade distribution to find stolen data. Hadoop plays two roles in this context: dataset manager and dataset processor. Any organization that attempts to match dataset fingerprints in order to find stolen data will have to store and process huge volumes of datasets. For that, it will need a sound data management and processing system.
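To make the "dataset processor" role more concrete, the sketch below imitates the map, shuffle and reduce phases of a MapReduce-style job in plain Python. It does not use the Hadoop API, and everything in it is hypothetical: the function names, the one-fingerprint-per-record scheme, and the `original` label for the protected dataset are assumptions made for illustration only.

```python
import hashlib
from collections import defaultdict


def map_phase(source: str, records: list[str]):
    """Map step: emit one (fingerprint, source) pair for every record in a dataset."""
    for record in records:
        digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
        yield digest, source


def shuffle(pairs) -> dict[str, set[str]]:
    """Shuffle step: group source names by fingerprint, as a framework does between map and reduce."""
    grouped = defaultdict(set)
    for digest, source in pairs:
        grouped[digest].add(source)
    return grouped


def reduce_phase(grouped: dict[str, set[str]], protected: str = "original") -> dict[str, list[str]]:
    """Reduce step: keep fingerprints seen in the protected dataset and in at least one other source."""
    return {
        digest: sorted(sources - {protected})
        for digest, sources in grouped.items()
        if protected in sources and len(sources) > 1
    }


def find_leaks(datasets: dict[str, list[str]]) -> dict[str, list[str]]:
    """Run the three phases over all datasets and return fingerprint -> sources where it leaked."""
    pairs = (
        pair
        for source, records in datasets.items()
        for pair in map_phase(source, records)
    )
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    datasets = {
        "original": ["alice,4111", "bob,5500"],   # made-up protected records
        "crawl-1": ["bob,5500", "zed,0001"],      # made-up crawled records
        "crawl-2": ["unrelated text"],
    }
    for digest, sources in find_leaks(datasets).items():
        print(digest[:12], sources)
```

In a real cluster the map and reduce steps would run in parallel across many machines and the grouped data would live in distributed storage, which is where the "dataset manager" role comes in.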


The development of data theft detection systems represents a change in the approach towards data theft. It is good that enterprises are realizing the potential of Hadoop in detecting stolen data. Hadoop complements data theft tracking systems, because fingerprint matching techniques need adequate data storage and processing capabilities behind them. However, as stated earlier in this article, these are early days for such developments. There is also another perspective to consider: the data storage systems themselves hold huge amounts of data and could become targets for future attacks. In that case, enterprise databases and Hadoop could equally face attacks from hackers.
