The importance of text mining in times of digital transformation.
With the digital transformation of, well, the entire world, there has been an explosion of textual information from a wide array of sources. In this context, textual information refers to unstructured data in textual form: web formats such as HTML or XML and document formats such as Microsoft Word, PowerPoint, Adobe PDF and email. Along with the world’s digital transformation came the rise of data mining. Organisations are quickly taking steps to create value from data and data-driven insights, but these efforts have largely remained focused on structured data management rather than document management.
Yet the most natural form of storing information is text. To explain something, one is likely to write it out in a document in a structured way. However, no matter how much structure a document is given, to a computer it remains an unstructured collection of words. Even information about structured data is often communicated through text documents. There is no denying that textual data sources play a huge role in the documentation and communication of information. At the end of the day, it is through language that we communicate knowledge, and one could even argue that without language, knowledge does not exist at all.
Consumers can be grateful that out-of-the-box tooling such as Microsoft File Explorer or Apple Finder offers enough functionality to make private collections of documents (somewhat) manageable. Consumers also benefit from companies like Google, which make the vast ocean of information on the Internet searchable. Organisations, on the other hand, face a more complex struggle with the management of text documents. The majority of organisational data (about 80%) is stored in an unstructured form, yet control over all this information remains limited. Not being able to find a specific document on a company share probably sounds familiar to you as well.
Although organisations have recently been putting a lot of effort into the effective management of data, this generally remains limited to structured data. Organisations take up data management with goals related to growth, efficiency and compliance. So what about text documents? Why are those gains not pursued there? Why is structured data mining more common than text data mining?
Text mining is a type of data analysis that aims to retrieve valuable insights from textual information. It is part of the field of study referred to as Natural Language Processing (NLP), which sits at the intersection of computational linguistics, computer science and artificial intelligence. NLP is a way for computers to analyse and understand human language. It is commonly used for applications such as machine translation, automated question answering and of course, text mining.
Due to the ambiguity and complexity of human language, a lot of data preparation needs to be carried out prior to text mining to ensure that text documents are represented in an effective and suitable way. This stage can be referred to as text refinement. Text refinement involves cleansing activities such as the removal of stop words and the stemming of words. Stop words are essentially meaningless words such as ‘the’, ‘as’ and ‘a’: the most common word in this blog is ‘the’, but that is not a valuable insight, so such words are removed prior to analysis. Stemming reduces related word forms to a common root, so that, for example, ‘manager’, ‘manage’ and ‘management’ are all treated as the same term. During the text refinement phase, concepts are predefined and synonyms are identified. This is all done with the goal of creating value. To illustrate: a ‘good party’ means something entirely different in the context of a contract negotiation than in the context of a nightclub. Once the text is refined, it can be analysed. There will likely be a few jumps between refinement and analysis before the analytical goal is achieved.
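To make the refinement step concrete, here is a minimal Python sketch. The stop-word list and suffix stemmer are deliberately tiny illustrations of my own, not a real linguistic model; in practice you would use a library such as NLTK or spaCy for this.

```python
import re

# Toy stop-word list; real lists contain a few hundred words.
STOP_WORDS = {"the", "as", "a", "and", "of", "to", "in", "is"}
# Crude suffix list for illustration only; a real stemmer is more careful.
SUFFIXES = ["ement", "ment", "er", "ing", "e"]

def stem(word):
    # Strip the first matching suffix to approximate a common root.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def refine(text):
    # Lowercase, tokenise, drop stop words, then stem what remains.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(refine("The manager is managing the management of the team"))
# → ['manag', 'manag', 'manag', 'team']
```

Note how ‘manager’, ‘managing’ and ‘management’ all collapse into the same stem, so later analysis counts them as one concept.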
The organisational applications are twofold: analytics and enterprise search.
Descriptive analytics in text mining refers to the automated retrieval of information from documents (without having to read the documents in their entirety). A word cloud could be created to extract and visually represent the main topics of one or more documents, which can create interesting insights into your information landscape. The application of word clouds is very limited, however. A more value-adding exercise would be topic extraction and named-entity recognition. Topic extraction is the identification of meaningful terms within a document, while named-entity recognition is the extraction of names that fall into predefined categories such as people, organisations and locations. The extracted terms, topics and entities can be attached to the document by means of tags or metadata, which makes it much easier for end-users to find and understand documents.
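As a sketch of named-entity recognition with predefined categories, the toy example below matches text against hand-made name lists (the category names and entries are hypothetical). Production systems use trained statistical models, for instance spaCy’s NER component, rather than simple lookup.

```python
# Hypothetical predefined categories with known names (lowercased).
ENTITY_CATEGORIES = {
    "person": {"alice johnson"},
    "organisation": {"kpmg", "google", "netflix"},
    "location": {"amsterdam", "london"},
}

def extract_entities(text):
    # Scan the lowercased text for known names and tag each hit
    # with its predefined category. Crude substring matching:
    # fine as a sketch, too naive for production.
    lowered = text.lower()
    found = {}
    for category, names in ENTITY_CATEGORIES.items():
        hits = sorted(n for n in names if n in lowered)
        if hits:
            found[category] = hits
    return found

tags = extract_entities("KPMG opened a new office in Amsterdam.")
print(tags)  # → {'organisation': ['kpmg'], 'location': ['amsterdam']}
```

The returned tags are exactly the kind of metadata that can be attached to a document to make it findable.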
Another application within descriptive text analytics is sentiment analysis. A sentiment analysis can be carried out on a body of documents to determine whether the general sentiment of the documents is negative, positive or neutral. This is done by identifying positive and negative terms and counting the number of these terms in each document. An organisational application could be to determine the general sentiment of employee reviews or to assess the success of a company event using social media data. Starbucks, for example, uses real-time text mining and sentiment analysis to identify negative tweets and quickly respond to them.
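The term-counting approach described above fits in a few lines of Python. The word lists here are a small made-up sample; real lexicon-based sentiment tools (such as VADER) ship with much larger, weighted lexicons.

```python
# Toy sentiment lexicons; real tools use large weighted word lists.
POSITIVE = {"great", "good", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}

def sentiment(text):
    # Count positive and negative terms and compare the totals.
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("The event was great and the food was excellent"))  # → positive
print(sentiment("Terrible service and slow replies"))               # → negative
```

Run over a batch of employee reviews or event tweets, the per-document labels can then be aggregated into an overall sentiment picture.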
Predictive text analytics takes things one step further, as documents can also be clustered and classified based on their content. This is done by grouping similar documents based on the frequency of specific terms within a document relative to the frequency of those terms across other documents, a weighting referred to as term frequency-inverse document frequency (tf-idf). Knowing that a specific document is of a certain type, we can use text mining technologies to determine which other documents belong to that same category. This is a form of supervised learning that greatly benefits the organisation of textual information. Implementing such text mining technologies in your document landscape can help your organisation make its way towards effective content curation: the process of discovering, gathering and presenting information about a specific topic. This resembles what Netflix does with movies, as it suggests other movies based on the characteristics of the selected one. In a knowledge-driven organisation this can be of great benefit, for example when looking for a subject matter expert or when collecting existing information to write a new contract or proposal.
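A minimal from-scratch sketch of the tf-idf weighting looks as follows (the example documents are invented; in practice one would reach for scikit-learn’s TfidfVectorizer rather than hand-rolling this):

```python
import math
from collections import Counter

def tf_idf(documents):
    # Weight each term by its frequency within a document, discounted
    # by how many documents contain it: terms common to many documents
    # score low, terms distinctive for one document score high.
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "contract supplier payment terms",
    "contract proposal draft",
    "party invitation music",
]
weights = tf_idf(docs)
# 'contract' appears in two of the three documents, so in the first
# document it is weighted lower than 'supplier', which is unique to it.
```

These per-document weight vectors are what clustering and classification algorithms then compare to decide which documents belong together.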
Using prescriptive text analytics, the computer could predict where a document needs to be saved based on its content. Implementing real-time text mining technologies could allow your system to classify the documents you write on the go. Based on the words you use while writing, the system could detect anomalies: for example, it could give a warning asking whether the saving location is appropriate, suggest a different title or automatically generate document metadata and tags.
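A save-location suggestion of this kind can be sketched by comparing a document’s words with a keyword profile per folder. The folder names and term profiles below are hypothetical stand-ins; a real system would learn these profiles from the documents already stored in each location.

```python
# Hypothetical folders, each with terms typical for its documents.
FOLDER_PROFILES = {
    "Contracts": {"contract", "party", "clause", "agreement"},
    "HR": {"employee", "review", "salary", "vacation"},
    "Marketing": {"campaign", "brand", "social", "event"},
}

def suggest_folder(text):
    # Pick the folder whose term profile overlaps most with the
    # document; return None when nothing matches at all.
    words = set(text.lower().split())
    overlaps = {f: len(words & terms) for f, terms in FOLDER_PROFILES.items()}
    best = max(overlaps, key=overlaps.get)
    return best if overlaps[best] > 0 else None

print(suggest_folder("This agreement binds each party to the clause below"))
# → Contracts
```

Hooked into a save dialog, a mismatch between the suggested and the chosen folder is exactly the anomaly that would trigger a warning.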
In the context of organisational document management, the text mining technologies mentioned above all improve the quality of organisational search by making it easier to find information. Organisational search, however, can also be improved by implementing text mining technologies within the search itself. Say an employee wants to find all supplier telephone numbers, but these are ‘hidden’ somewhere in their inbox. Using a predefined ‘telephone number’ category, the search would not look for specific phone numbers one at a time but would return all results that match the predefined phone number category. Such generalisation techniques make search much more efficient. The concept of stemming helps here as well, as text mining technologies return search results even when the search term does not match exactly. Implementing such search technologies increases operational effectiveness, as employees no longer have to click through hundreds of folders to find specific documents. These search methods are not only useful for the end-user but also for managers trying to maintain control over the information landscape and remain compliant with laws and regulations. Generalised search categories can be used to monitor whether people are saving information in the right location: for example, if personal information (such as credit card numbers and social security numbers) appears in letters to customers, these can easily be found in batches and archived appropriately.
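Searching by category rather than by literal value can be sketched with a pattern that describes the category. The phone-number pattern and the inbox contents below are simplified, hypothetical examples; real telephone formats vary widely by country and system.

```python
import re

# Hypothetical pattern for a predefined 'telephone number' category,
# matching numbers like '+31 123 456 789'; real formats vary by country.
PHONE_PATTERN = re.compile(r"\+\d{2}\s?\d{2,3}(?:\s?\d{3}){2}")

def find_phone_numbers(documents):
    # Return every match of the category pattern across all documents,
    # paired with the id of the document it was found in.
    hits = []
    for doc_id, text in documents.items():
        for match in PHONE_PATTERN.findall(text):
            hits.append((doc_id, match))
    return hits

inbox = {
    "mail_1": "You can reach our supplier on +31 123 456 789.",
    "mail_2": "No numbers in this message.",
}
print(find_phone_numbers(inbox))  # → [('mail_1', '+31 123 456 789')]
```

The same mechanism, with patterns for credit card or social security numbers, is what lets a compliance team find misplaced personal data in batches.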
Naturally, there are many existing solutions available in the market that make use of the functionalities described above. Stay tuned for the next blog in this series to read more about existing market solutions.
This blog was written by Simone Jeurissen, you can reach her on +31 206 564 089 or via e-mail. Simone is a consultant in the KPMG Data & Analytics Advisory, Enterprise Data Management team. She advises organisations on the design and implementation of data, document and records management.