Having worked with Information Management, Records Management and PROV (Public Record Office Victoria) for many years, we have found it a constant struggle to get people to tag content. For years we have struggled to work with classic Records Management teams who insist on creating an excessive number of Content Types and mandatory metadata fields to make SharePoint “records compliant”.
Generically applying these classic methodologies to SharePoint Online fundamentally breaks the platform! Things like OneDrive sync, co-authoring and collaboration become harder or even impossible, but that’s not the hard part. The hard part is pushing standardised content types and values to all sites without massive amounts of work. Don’t get us wrong, it can be done, but pushing content types is only the start of the problem. The real problem is stopping people from creating libraries, columns and lists that don’t use the standard content types.
We are now seeing a fundamental change in the way information management and records management are being tackled. Cognitive Services, Machine Learning algorithms and AI have been touted for some time, but the technology, cost and time investment have been prohibitive until now.
SharePoint Syntex is the first product from the Project Cortex initiative by Microsoft and enables end users to build content models that index and understand documents or forms. Without getting too deep, organisations can train SharePoint to apply Content Types to items in SharePoint through machine learning. As well as applying content types, SharePoint Syntex can pull information from the actual documents/items and populate columns with values.

A quick example is our Statement of Work (SoW) model. With this model, we pointed SharePoint Syntex at our client libraries and found all the SoWs that have been created over the years. We can then extract key information from each SoW, like the customer’s name, topics, deliverables, dollar value, and date of delivery.
We recently had a customer who had their entire archive room of physical files scanned with Optical Character Recognition (OCR) and uploaded into SharePoint. When the scanned documents were returned, they had no metadata other than the file name, which corresponded to a box number.
As there were over 20 million documents, we estimated that if one person could open, review and tag 100 documents per day, it would take a VERY LONG TIME to get everything tagged. Alternatively, we could use SharePoint Syntex to do this for us.
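To put a rough number on that, here is a back-of-envelope calculation. The document count and daily rate come from the estimate above; the figure of 250 working days per year is our own assumption for illustration, not a number from the project.

```python
# Back-of-envelope estimate of the manual tagging effort.
TOTAL_DOCUMENTS = 20_000_000    # "over 20 million", so treat this as a lower bound
DOCS_PER_PERSON_PER_DAY = 100   # rate from the estimate above
WORKING_DAYS_PER_YEAR = 250     # assumption: roughly 50 weeks x 5 days


def manual_tagging_effort(total_docs, docs_per_day, days_per_year):
    """Return (working days, working years) for one person to tag everything."""
    working_days = -(-total_docs // docs_per_day)  # ceiling division
    return working_days, working_days / days_per_year


days, years = manual_tagging_effort(
    TOTAL_DOCUMENTS, DOCS_PER_PERSON_PER_DAY, WORKING_DAYS_PER_YEAR
)
print(f"{days:,} working days, or about {years:,.0f} years for one person")
# prints: 200,000 working days, or about 800 years for one person
```

Even with a team of ten people working in parallel, that is still around 80 years under the same assumptions.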
Using SharePoint Syntex, we were able to separate and group content first by high-level content type, like HR, Incident, Product, Finance, and many others. With the content types applied, we were able to use Power Automate to move documents to Teams or groups based on their classification, extract basic metadata like “date” for sorting, and then apply retention policies to these documents.
Once the documents/items were separated at a high level, we were able to train the model further to extract values from each document to be stored in columns, making search and other activities even easier.
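The routing decision itself is simple once a content type has been applied. In practice it lives in a Power Automate flow built in the visual designer, so nothing below is code we had to write; the sketch only spells out the logic. The destination paths, the fallback location and the file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Content type names are from the article; every path below is invented.
DESTINATIONS = {
    "HR": "/teams/hr/Documents",
    "Incident": "/teams/safety/Incidents",
    "Product": "/teams/product/Documents",
    "Finance": "/teams/finance/Documents",
}
FALLBACK = "/sites/archive/Unclassified"  # hypothetical holding area for manual review


@dataclass
class ClassifiedDocument:
    name: str
    content_type: Optional[str]  # None when the model could not classify the file


def route(doc: ClassifiedDocument) -> str:
    """Pick a destination from the content type, falling back to manual review."""
    return DESTINATIONS.get(doc.content_type, FALLBACK)


# Hypothetical file names, mirroring the box-number naming described earlier.
for doc in [
    ClassifiedDocument("Box-0142.pdf", "Finance"),
    ClassifiedDocument("Box-0143.pdf", None),
]:
    print(doc.name, "->", route(doc))
```

Keeping an explicit fallback means unclassified scans are held for review rather than silently dropped or misfiled.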

As an example, we have loaded some sample data into our Content Center. We can then build a classifier and train Syntex to extract an email field from the documents, then place the extracted data into a SharePoint list. The potential for this technology is huge, and it will revolutionise how organisations store, retain and classify data at massive scale.
The best part of the whole solution was that there was no code at all, no hosting or separate environments, just a small license cost. These models are also easy for end users to modify or tweak to extract even more valuable knowledge.