Text Mining with Simone - part 2

With the digital transformation of, well, the entire world, there has been an explosion of textual information from a wide array of sources. In part 2 of this story, we continue our quest for control.



So it has been decided. Control needs to be gained over that vast amount of unstructured information, and we can agree that merely relying on people is not feasible. Our employees have enough on their plates as it is. Cleaning up terabytes of unstructured data simply does not fit into their busy schedules. Automation is required. What are the options? Naturally, many out-of-the-box solutions exist to help organisations manage their unstructured data. The Enterprise Content Management (ECM) market has evolved and changed a lot over recent years. There have been many mergers and takeovers, as the focus of offerings has shifted from 'document management' to 'content management'. Gartner publishes quarterly reports on ECM services and, in their latest release, they have identified OpenText, Hyland, Microsoft and IBM as some of the top solution providers in this field.

The solution that best suits your needs depends on the ins and outs of your organisation. It is very important to keep in mind that technology is there to support you. What we often see in the ECM market is that employees have adjusted their ways of working to comply with what the solution demands, rather than the solution being adjusted to fit their demands. In this blog, we will focus on seven capabilities. The first four (creation, formalisation, archiving and destruction) fall in the more traditional document and records management sphere, whereas the other three (monitoring, metadata and search) fall in the content management sphere.

[Figure: The seven capabilities of content and records management]


To start off with creation: an ECM solution should support the end user's needs. In a digitising society, that means it should be possible to create documents on a number of different devices and to collaborate on documents with colleagues, in real time where needed. From a content management viewpoint, it is also desirable that the solution offers end users templates when they start creating a document.


Secondly, there is generally a point at which the document is formalised. This is the moment a document becomes a 'record' and can no longer be adjusted or changed. For a contract, for example, this would be the moment that it is signed by both parties. For other documents, this moment may not be as clearly defined. Generally, it would be the moment a 1.0 version is created, but even this 1.0 version may later be subject to change. Think about a policy document, for example. To support the formalisation of such documents, a workflow is desirable in which the head of a department or team can formally approve the document, making it crystal clear which version is the final one. For documents that do not require formal approval from a department head, end users should be able to publish the document as the final version themselves. Once formalised, these documents should no longer be confused with the body of active documents, such as earlier versions of the same document. That makes it clear to the entire organisation which document to consult when looking for specific information.


Thirdly, the solution should have an archiving feature. There must be a moment at which documents leave the active environment to be kept in a secure environment where they can be viewed when needed. These 'legacy documents' create unnecessary clutter when kept among active documents, yet need to be saved in case a question is raised regarding the history of documents. Therefore, these two types, legacy and active documents, should be stored in separate environments.


Naturally, legacy documents also stack up. What's more, they are subject to legal requirements such as statutory retention periods. When it comes to personally identifiable information, it may not even be permitted to store these documents at all. That is why a destruction mechanism is an essential part of any ECM solution. For formalised records, this can be automated by having documents destroyed when the retention period is over. Non-formal documents that never obtain a final status, such as notes or drafts, should periodically be cleansed according to a standardised destruction policy.
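
To make the two destruction rules a little more tangible, here is a minimal Python sketch. It is an illustration only, not how any particular ECM product works: the `Document` fields, the `due_for_destruction` helper and the one-year threshold for drafts are all assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Document:
    title: str
    status: str                              # "final" for formalised records, "draft" otherwise
    last_modified: date
    retention_until: Optional[date] = None   # only set once a document is formalised


# Hypothetical policy value: drafts and notes untouched for longer than this are cleansed.
DRAFT_MAX_AGE_DAYS = 365


def due_for_destruction(doc: Document, today: date) -> bool:
    """Return True when the document may be destroyed under this sketched policy."""
    if doc.status == "final":
        # Formalised records: destroy only after the retention period has ended.
        return doc.retention_until is not None and today > doc.retention_until
    # Non-formal documents: periodic clean-up based on the time since the last change.
    return (today - doc.last_modified).days > DRAFT_MAX_AGE_DAYS


docs = [
    Document("Contract A", "final", date(2012, 1, 10), retention_until=date(2019, 1, 10)),
    Document("Policy v1.0", "final", date(2019, 6, 1), retention_until=date(2026, 6, 1)),
    Document("Meeting notes", "draft", date(2018, 3, 5)),
    Document("Draft proposal", "draft", date(2020, 2, 1)),
]

today = date(2020, 6, 1)
to_destroy = [d.title for d in docs if due_for_destruction(d, today)]
print(to_destroy)  # ['Contract A', 'Meeting notes']
```

The point of the sketch is that both rules depend on metadata (status, retention date, last modification date) being present and reliable; without it, automated destruction has nothing to act on.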


This brings us to my favourite, the crux of ECM: metadata. Metadata is simply data about data. For a document, this could be the document type, the title, the status, the author or the date of creation. This is where text mining technologies do some of their best work. Unstructured information is standardised and structured by means of metadata. During the document's lifecycle, certain information should be attached to the document. For example, when the document is formalised, it is marked as 'final' and a retention period is added. This facilitates the automatic archiving and removal of the document after a certain period of time. For contracts, adding the start and end date as metadata can also facilitate the contract negotiation process, as organisations can quickly identify which contracts need to be renewed without having to sift through a pile of old documents. Ideally, a solution would not ask the document creator to add metadata manually every single time, but would generate metadata automatically based on content. Automatic generation is preferable, as manual input can lead to end user frustration and is more prone to error.
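
As a rough illustration of generating metadata from content, the Python sketch below guesses a document type from keywords and picks up dates written out in the text. It is a deliberately simple stand-in for real text mining: the keyword lists, the date format and the rule "earliest date is the start, latest date is the end" are assumptions for this example, and commercial tools would typically rely on trained models instead.

```python
import re
from datetime import date

# Hypothetical keyword lists per document type (an assumption for this sketch).
DOCUMENT_TYPE_KEYWORDS = {
    "contract": ["agreement", "the parties", "hereby"],
    "policy": ["policy", "must comply", "scope"],
    "invoice": ["invoice", "amount due", "vat"],
}

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Matches dates written like "1 March 2020"; other notations are ignored here.
DATE_PATTERN = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})\b")


def guess_document_type(text):
    """Pick the type whose keywords occur most often, or 'unknown' if none occur."""
    lowered = text.lower()
    scores = {
        doc_type: sum(lowered.count(keyword) for keyword in keywords)
        for doc_type, keywords in DOCUMENT_TYPE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def extract_dates(text):
    """Return all dates found in the text, in order of appearance."""
    return [
        date(int(year), MONTHS.index(month) + 1, int(day))
        for day, month, year in DATE_PATTERN.findall(text)
    ]


def generate_metadata(text):
    """Build a small metadata record from the document content alone."""
    dates = extract_dates(text)
    return {
        "document_type": guess_document_type(text),
        # Assumption: the earliest date is the start and the latest is the end.
        "start_date": min(dates) if dates else None,
        "end_date": max(dates) if dates else None,
    }


sample = ("This agreement is entered into by the parties on 1 March 2020 "
          "and ends on 28 February 2023.")
metadata = generate_metadata(sample)
print(metadata)
```

With an end date captured this way, a list of contracts that are up for renewal becomes a simple filter on metadata rather than a search through old files.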


As with any data management initiative, it is essential to implement a monitoring mechanism that ensures that data, once brought to an acceptable quality level, stays at that level. The solution should therefore provide an overview of the information landscape that an information manager can consult to gain insight into the document creation process, the availability of templates and the central management of metadata attributes. Text mining technologies that enable content curation can identify patterns and new document types. Content curation refers to the clustering of similar and related documents. It also identifies documents that do not fit within existing metadata attributes, and so indicates when a new document type and its retention period need to be defined. Monitoring should also be possible at 'object' level, allowing the information manager to identify documents that contain specific or forbidden information, such as personally identifiable information.
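
Object-level monitoring can be pictured as a scan that flags documents containing certain kinds of information. The Python sketch below does this with two regular expressions. The patterns, the `scan_for_pii` helper and the sample documents are assumptions for illustration; real solutions use far richer detection than this, and the IBAN pattern only checks a Dutch-style layout, not the checksum.

```python
import re

# Hypothetical detection patterns (assumptions for this sketch, not a complete PII definition).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Dutch-style IBAN layout: 2 letters, 2 digits, 4 letters, 10 digits. No checksum validation.
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z]{4}\d{10}\b"),
}


def scan_for_pii(documents):
    """Map each document name to the sorted list of PII categories found in its text."""
    findings = {}
    for name, text in documents.items():
        hits = sorted(label for label, pattern in PII_PATTERNS.items() if pattern.search(text))
        if hits:
            findings[name] = hits
    return findings


documents = {
    "newsletter.txt": "Our office is closed on Friday.",
    "hr_note.txt": "Please pay the bonus to NL91ABNA0417164300 and confirm to jan@example.com.",
    "memo.txt": "Questions can be sent to helpdesk@example.com.",
}

findings = scan_for_pii(documents)
print(findings)  # {'hr_note.txt': ['email', 'iban'], 'memo.txt': ['email']}
```

An information manager would review the flagged documents rather than act on them blindly, since simple patterns like these produce both false positives and misses.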


For the end user, search is probably the most essential aspect of any ECM solution. How do I search and, more importantly, find what I am looking for? Here, it is desirable to go far beyond having to remember the title of a document. A solution should scan both document content and metadata, and offer user-friendly filtering options to refine search results. A single access point with all the answers to your questions.
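
The idea of searching content and metadata together, with filters to narrow the results, can be sketched in a few lines of Python. This is a toy example under stated assumptions: the `Document` structure, the `search` function and its simple term-count ranking are invented here and do not reflect how any of the vendors' search engines work.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    content: str
    metadata: dict = field(default_factory=dict)


def search(documents, query, **filters):
    """Return titles of matching documents, best match first.

    A document matches when every metadata filter is satisfied and at least
    one query term occurs in its title, content or metadata values.
    """
    terms = query.lower().split()
    results = []
    for doc in documents:
        # Filtering: skip documents whose metadata does not match the requested values.
        if any(doc.metadata.get(key) != value for key, value in filters.items()):
            continue
        # Search title, content and metadata values as one piece of text.
        searchable = " ".join([doc.title, doc.content, *map(str, doc.metadata.values())]).lower()
        score = sum(searchable.count(term) for term in terms)
        if score > 0:
            results.append((score, doc.title))
    # Highest score first; ties are ordered alphabetically by title.
    results.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in results]


docs = [
    Document("Supplier contract 2020", "Agreement on the delivery of laptops and support.",
             {"type": "contract", "status": "final"}),
    Document("Laptop policy", "Employees must return their laptop when leaving.",
             {"type": "policy", "status": "final"}),
    Document("Notes laptop supplier", "Draft notes on laptop pricing.",
             {"type": "note", "status": "draft"}),
]

all_hits = search(docs, "laptop")
final_hits = search(docs, "laptop", status="final")
print(all_hits)
print(final_hits)
```

Note how the second query only works because each document carries a reliable `status` attribute, which is exactly why metadata and search belong together.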

Market solutions

Disclaimer: the list of available market solutions below is not exhaustive. The goal of this blog is not to provide a full list of solutions, but rather to explore various text mining functionalities.

Rather than focusing on the top solutions as identified by Gartner, this blog will consider solutions that vary in their offerings. SharePoint, for example, is one of the most popular solutions in the market, as it comes with Office 365 and is therefore often already part of an organisation's available licences. It covers many of the features discussed above through its Advanced Data Governance capability. The overview considers SharePoint Online (for Business). OpenText is the market leader and can therefore not be left out. The overview below considers their entire product stack; to cater to all requested features, a combination of their off-the-shelf offerings is needed. iManage is a lower-cost solution that offers many of the same functionalities as OpenText.

Indica and Index Engines are not traditional ECM solutions, in the sense that they focus on monitoring, metadata and search rather than document creation and storage. They have developed some extremely useful functionalities in this field. These solutions are great if you do not want to change your current architecture (e.g. you would like to keep using your current file shares) but do want to gain control over that landscape.

Below is an overview of some of the out-of-the-box features of these solutions, mapped to the focus areas defined above. Note that all of these solutions are continuously being developed and improved upon. This overview serves as an indication of the solutions' functionalities at this specific moment in time.

[Figure: Overview of functional requirements per vendor, per capability]

Wondering how one would go about the transition to Enterprise Content Management? Stay tuned for the next blog in this series for a deep dive into an ECM case, or contact Simone Jeurissen on (020) 656 4089 or by email.

© 2020 KPMG N.V., a Dutch limited liability company and a member firm of the KPMG global organization of independent member firms affiliated with KPMG International Limited, a private English company limited by guarantee. All rights reserved.

For more detail about the structure of the KPMG global organization please visit
