Text Mining with Simone #3

In the two previous blogs in this series, we dove into the concept of text mining as well as the possible market solutions to facilitate it. The goal of this blog is to go through a practical case where we applied text mining in an organizational context. We will start by giving an overview of the case, followed by the approach we took, and finally conclude with the lessons learned.

Why did the client start cleansing?

A professional services company had collected an enormous amount of unstructured data over the last 20 years. This data was stored on an array of file shares. They had the ambition to clean house. This is why they decided to implement a new document management system (DMS) to better manage their unstructured data.

The company works on projects for different clients. The document management system reflects this and is therefore organized into workspaces for clients and workspaces for projects. In the old environment, there was no standard way of working. This brought a challenge: how to find the right files? We began with a proof of concept to show the client the value of text mining when facing such a challenge. This proof of concept was carried out on a subset of about 3.5 terabytes. Tooling was employed to index all of the data and apply optical character recognition (OCR) to it.

How did we do it?

Our goal was to migrate all valuable files and cleanse files without value. Files that were neither cleansed nor migrated remained on the file shares for a limited period of time. End users could request specific files during this grace period, and these files would be moved to the new environment if required. Once files were migrated, the source file share was hidden from end users. Migrating files through the indexing tool is important to make sure that the new environment does not get polluted as well, which is a risk when end users migrate their own files. To maintain an overview of the status of the migration, we applied what we call a migration funnel. This is a 5-level funnel on top of a starting point (level 0), where every level has specific entry criteria.

[Infographic: Text Mining with Simone #3]

Level 0 contains all data, as it is present in the source. This is the starting point. There are no entry requirements.

Level 1 contains all approved media types. The media type of a file (e.g. word processing) must match one of the permitted types in order to enter this level.

Level 2 contains files that are not redundant, obsolete or trivial (ROT). Only files that have not been tagged as ROT may enter this level.

Level 3 contains all files that have business value. Only files that have been tagged as 'client' or 'project' data may enter this level. Of course, this logic was tailored specifically to this client.

Level 4 contains all files that have obtained a definite document class (e.g. contract). The document class field must be filled for files to enter this level.

Level 5 contains all files that are migration ready. All required metadata fields must be filled out to enter this level.
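To make the entry criteria concrete, the funnel can be sketched as a function that returns the highest level a file reaches. This is a minimal illustration, not the tooling used in the case: the `FileRecord` structure, the allow-list of media types and the required metadata fields are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical allow-list and required fields; the real lists were client-specific.
ALLOWED_MEDIA_TYPES = {"application/pdf", "application/msword", "text/plain"}
REQUIRED_FIELDS = ("owner", "received_on")


@dataclass
class FileRecord:
    media_type: str
    is_rot: bool = False
    tags: set = field(default_factory=set)
    document_class: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def funnel_level(f: FileRecord) -> int:
    """Return the highest funnel level (0-5) whose entry criteria the file meets."""
    if f.media_type not in ALLOWED_MEDIA_TYPES:
        return 0  # stays at the starting point
    if f.is_rot:
        return 1  # approved media type, but tagged as ROT
    if not ({"client", "project"} & f.tags):
        return 2  # not ROT, but no business value identified
    if not f.document_class:
        return 3  # business value, but no document class yet
    if not all(f.metadata.get(name) for name in REQUIRED_FIELDS):
        return 4  # document class known, metadata incomplete
    return 5  # migration ready
```

Because each level builds on the previous one, a file that fails an early check never reaches the later, more expensive classification steps.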

Level 1: define accepted media types

A media type (or content type) is a two-part identifier for file formats, consisting of a type and a subtype (e.g. application/pdf). Examples of types are application, image and text. Indexing technology reads the media types of source data so that it knows how to process each file. Some media types require text to be extracted, while others may not contain any text at all. Defining which media types you do and do not use within your organization will go a long way in cleaning up legacy data. The questions to ask here are: what applications do we use, and what media types are needed to run those applications? During this case, we came across Lotus Notes files and Apple II software, for instance. These media types were no longer used by anyone. We identified which media types were not expected in a DMS or were no longer used and tagged these as ROT. Excluding these files reduced the potential migration volume by 33%.
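As a rough illustration of this filter, the sketch below splits file names into permitted files and ROT candidates. It uses Python's standard `mimetypes` module, which guesses the media type from the file extension only, whereas an indexing tool typically inspects the file content. The allow-list is a hypothetical example, not the list used in this case.

```python
import mimetypes

# Hypothetical allow-list of media types that are expected in the DMS.
PERMITTED = {"application/pdf", "text/plain"}


def is_permitted(filename: str) -> bool:
    """Guess the media type from the extension and check it against the allow-list."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type in PERMITTED  # unknown types (None) are not permitted


def split_by_media_type(filenames):
    """Return (files to keep, files to tag as ROT)."""
    keep = [name for name in filenames if is_permitted(name)]
    rot = [name for name in filenames if not is_permitted(name)]
    return keep, rot
```

Treating unknown media types as not permitted is a deliberate choice here: it keeps unexpected formats out of the new environment unless someone explicitly adds them to the list.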

Level 2: remove redundant, obsolete and trivial (ROT) data

Besides media types, there are many other ways to find ROT data. ROT files are files that no longer have any business value. Indexing technology can identify duplicate files. A duplicate file is a file of any sort (e.g. Word, PDF, image) that is present on the file servers more than once. Only the most recent copy of each unique file was relevant for migration. The goal of an organized DMS is to reduce the number of duplicate files as much as possible, to make sure there is only one version of the truth. In this case, more than 4 million files were duplicates: 69% of all data. In addition to duplicates, we also looked at file size, which can indicate whether a file actually contains information. We filtered out extremely small files. We also filtered out temporary office files, which are used to restore an office file when it was not saved correctly. A special category we included as ROT was Personal Folders. These contain personal data that is not subject to migration, such as personal photos, personal invoices, notes and more. We also excluded folders with the name 'archive' or 'old'. All in all, only 15% of the original 3.5 terabytes would potentially be migrated.
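The duplicate and file-size checks could look roughly like the sketch below. It is an illustration under stated assumptions: duplicates are detected by hashing the full file content, the size threshold is arbitrary, and a file name starting with `~$` is used as a simple stand-in for temporary office files. The `Doc` structure is hypothetical.

```python
import hashlib
from typing import NamedTuple

MIN_SIZE_BYTES = 10  # hypothetical threshold for "extremely small" files


class Doc(NamedTuple):
    path: str
    content: bytes
    modified: float  # modification time as a POSIX timestamp


def is_trivial(doc: Doc) -> bool:
    """Flag extremely small files and (by assumption) '~$' temporary office files."""
    name = doc.path.rsplit("/", 1)[-1]
    return len(doc.content) < MIN_SIZE_BYTES or name.startswith("~$")


def select_unique(docs):
    """Keep the most recently modified copy of every distinct, non-trivial file."""
    newest = {}
    for doc in docs:
        if is_trivial(doc):
            continue
        digest = hashlib.sha256(doc.content).hexdigest()
        current = newest.get(digest)
        if current is None or doc.modified > current.modified:
            newest[digest] = doc
    return sorted(doc.path for doc in newest.values())
```

Hashing the content rather than comparing file names means that renamed copies are still recognized as duplicates.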

Level 3: create and apply business classification logic

We defined classification logic that was tailored specifically to the organization. The trick here is to find unique identifiers that may be present within a file. These unique identifiers are used to say something about the content of that file. If a file contains the name of a client, it is likely that it is client data. This is why we created client profiles using structured data from the company's enterprise resource planning system. For clients, we were able to create quite extensive profiles, looking at client names, chamber of commerce numbers, telephone numbers, etc. The next step was to distinguish client data from project data. For projects, we searched specifically for 'project codes': unique numbers that were given to all formal documentation shared with clients within a project. In total, this resulted in more than 400,000 different classification rules, each searching for specific information in a file. If the classifier found the requested information, it was added as metadata. A file would then get the tag 'client data' as well as the name of the client. If a project code was found, the file would get the tag 'project data' as well as the code of that project. Running the rules took about 4 days and as a result, only 5% of all files were potentially subject to migration.
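A much simplified version of such rules is sketched below. The client profile, its identifiers and the project code format (`PRJ-` followed by five digits) are invented for the example; in the case described here, the profiles were generated from ERP data.

```python
import re

# Hypothetical client profiles: client name -> identifiers that may appear in a file.
CLIENT_PROFILES = {
    "Acme B.V.": ["Acme B.V.", "12345678"],
}

# Hypothetical project code format.
PROJECT_CODE = re.compile(r"\bPRJ-\d{5}\b")


def classify(text: str) -> dict:
    """Return the tags (and their values) found in the text of a file."""
    tags = {}
    lowered = text.lower()
    clients = sorted(
        name
        for name, identifiers in CLIENT_PROFILES.items()
        if any(identifier.lower() in lowered for identifier in identifiers)
    )
    if clients:
        tags["client data"] = clients
    codes = sorted(set(PROJECT_CODE.findall(text)))
    if codes:
        tags["project data"] = codes
    return tags
```

For example, `classify("Invoice for Acme B.V. regarding PRJ-00042")` tags the file as both client data and project data. Plain substring matching on short numeric identifiers can produce false positives, so real rules would likely need stricter patterns.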

Level 4: define the document classes

The retention period of a file can only be assessed once its document class has been defined. A document class is a functional description of what a textual file contains. This goes further than the media type, as we are not looking at the format of the file, but rather at its content. Examples are CVs, passports and contracts. We created logic to define the document class of files by searching for key terms. Passports, for instance, always contain a person's name, the city of their birth and their social security number. The indexing tool then returned results, which we marked as either correct or incorrect. Artificial intelligence was used to automate this process where possible. Where templates or standard formats were used, this worked very effectively. It is easy to link a retention period to a file when its document class is known. In this case, all CVs created more than 4 weeks ago were classified as ROT. Only files for which the retention period had not yet passed were subject to migration, amounting to 2% of the total data landscape.
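The key-term approach and the retention check can be illustrated as follows. The key terms, the minimum number of hits and the structure of the rules are assumptions for the sake of the example; only the 4-week retention period for CVs comes from the case itself.

```python
from datetime import date, timedelta

# Hypothetical key terms per document class.
CLASS_KEY_TERMS = {
    "cv": {"curriculum vitae", "work experience", "education"},
    "contract": {"agreement", "parties", "termination"},
}
MIN_HITS = 2  # assumed minimum number of key terms that must be present

# Retention period per document class; CVs were kept for 4 weeks in this case.
RETENTION = {"cv": timedelta(weeks=4)}


def document_class(text: str):
    """Return the class with the most key-term hits, or None if no class qualifies."""
    lowered = text.lower()
    best, best_hits = None, 0
    for cls, terms in CLASS_KEY_TERMS.items():
        hits = sum(term in lowered for term in terms)
        if hits >= MIN_HITS and hits > best_hits:
            best, best_hits = cls, hits
    return best


def retention_passed(cls: str, created: date, today: date) -> bool:
    """True if the class has a retention period and the file is older than it."""
    period = RETENTION.get(cls)
    return period is not None and today - created > period
```

A file whose class has no retention rule is never marked as expired here, which errs on the side of keeping data until a rule has been agreed.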

Level 5: add metadata

In the final level of the migration funnel, a check was done on the presence of metadata. Depending on the document class, specific information is required. For a CV, that would be the person's name, the document class (CV), the date it was received and the internal contact person who received it. Once all required fields were filled out, the file moved to the 5th and final level of the migration funnel.
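The metadata check itself is straightforward. The sketch below uses the CV example from the paragraph above; the field names are hypothetical, and the fallback for unknown document classes is an assumption.

```python
# Required metadata per document class, following the CV example above.
# Field names are hypothetical.
REQUIRED_FIELDS = {
    "cv": ("person_name", "document_class", "date_received", "internal_contact"),
}


def missing_fields(metadata: dict) -> list:
    """Return the required fields that are absent or empty for this file."""
    cls = metadata.get("document_class")
    # Assumption: without a known class, only the class field itself is required.
    required = REQUIRED_FIELDS.get(cls, ("document_class",))
    return [name for name in required if not metadata.get(name)]
```

A file is migration ready when `missing_fields` returns an empty list. Returning the missing field names, rather than just True or False, makes it easier to report what still needs to be filled in.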

Tips for when you start data cleansing

The result? A beautifully organized environment where files are easy to find thanks to their metadata. Overall, the proof of concept was a great success: we were able to employ text mining technologies to facilitate an extremely complex data migration. Looking back at this case, there are a few lessons learned. First of all, classification logic is only as good as its input data. An important step in creating classification logic is cleansing the source data. Searching for a client that is registered as 'DO NOT USE CLIENT NAME' instead of 'CLIENT NAME' will not return accurate results. What's more, an assumption that is often made is that folder names and structures will tell you something about the content of a specific folder. We learned that when dealing with large volumes, that is not the case. One must tread carefully when automating migration processes based on folder structures, because there is a risk that the new environment will be polluted with ROT data.

The same goes for the file properties (the technical metadata) of a file. When Person A creates a new file based on Person B's example, Person B will be shown as the author rather than Person A. It is best to run classification logic on the textual content of files alone, since project data can be found in all sorts of folders. A very valuable practice is making use of templates: these files will be found very quickly thanks to the consistency in their structure. It is a practical solution that can easily be implemented. Starting with your own data cleansing project? To be able to make some tough calls about deleting legacy data, make sure you have executive sponsorship. This way, you will have backing for decisions on which files to retain and which files to delete.

Regarding the deletion, factoring in a fallback plan (making files invisible to end users, but keeping them for a grace period) greatly helps with the transition. All in all, even if the migration had not taken place, the migration funnel showed that over 82% of all data could be cleansed. Struggling with high storage costs? A text mining based data cleansing program may be just the thing for you!

More information? Please do not hesitate to contact Remi Verhoeven or Simone Jeurissen.

© 2021 KPMG N.V., a Dutch public limited liability company and a member of the global KPMG organization of independent firms affiliated with KPMG International Limited, an English company "limited by guarantee". All rights reserved.
