Enhancing regulatory content with AI metadata

Unlocking content discovery with metadata, auto-classification, and chemical entity recognition

by Kim Marshall

In our search for ways to help users find the content they need, we encountered a fascinating concept: ‘semantic fingerprinting’. Think of it as the DNA of each piece of content: a unique set of metadata that lets us match and connect related materials based on their attributed data. Picture yourself reading one of our news articles about PFAS in food packaging. Now imagine being seamlessly guided to in-depth analysis pieces or videos that offer more insight on the same topic. This comprehensive metadata holds the key to enhancing search results, refining filters, and suggesting relevant content throughout our Product Intelligence platform.

In this article, we’ll go into the details of the technologies that help us deliver these recommendations.

Tagging thousands of pieces of content

At Enhesa, our platforms house regulatory documents, news articles, insightful analyses, and video content. The challenge lies in tagging each piece with rich metadata that goes beyond the basic information of a conventional metadata record: the ‘semantic fingerprint’. But how do we achieve this consistently, with the right data providing context across diverse content categories?

Our team of over 20 expert writers kickstarts the process by adding crucial tags to their news and insight content. We strive for uniformity across all content types because we don’t want users to miss out on valuable information. To do this, we also auto-classify our content using a set of expert-created rules for metadata tagging, and enlist AI tools to suggest headings and associated metadata tags. By empowering our experts with AI, we achieve human-verified tagging for an almost endless sea of content, something that is only possible with the help of computational tools.
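To make the idea of expert-created tagging rules concrete, here is a minimal Python sketch. It assumes each rule is a regular expression paired with a tag; the `TagRule` class, the tag names, and the patterns are hypothetical illustrations, not the rules used on the platform.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TagRule:
    tag: str      # metadata tag to assign (hypothetical label)
    pattern: str  # regular expression that triggers the tag


# Illustrative rules only; real rules would be written by subject experts.
RULES = [
    TagRule("substance:PFAS", r"\bPFAS\b|per- and polyfluoroalkyl"),
    TagRule("topic:food-contact", r"food (packaging|contact)"),
    TagRule("jurisdiction:EU", r"\bEuropean Union\b|\bEU\b"),
]


def classify(text: str, rules=RULES) -> list[str]:
    """Return the sorted tags whose pattern matches the text (case-insensitive)."""
    return sorted(
        {rule.tag for rule in rules if re.search(rule.pattern, text, re.IGNORECASE)}
    )


suggested = classify("New EU limits on PFAS in food packaging")
print(suggested)
```

In a workflow like the one described above, tags suggested this way would still be reviewed by an expert before publication.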

Navigating chemicals: Text mining with OSCAR4

Beyond our published content, we extract the most important content from external sources. This becomes especially critical for chemical list regulations.

In chemical regulatory content, where myriad names and identification codes are used, finding specific chemicals is challenging. At Enhesa, we review and analyze thousands of pieces of content from diverse sources, each harboring thousands of potential chemicals, and each chemical can have multiple aliases across those sources. Our mission is to help users find regulatory news and content for specific chemicals by using a technique called “Chemical Entity Recognition.” This natural language processing method builds on the expert work done at the University of Cambridge as part of the OSCAR4 project. Put simply, it mines the text in our content to pinpoint text fragments that indicate chemical names, based on information and context within the text. OSCAR4 combines chemical naming rules, dictionaries, and machine learning to identify these substances and assign them unique, machine-readable structural IDs, such as InChI keys or SMILES strings.

The content we’re analyzing contains a complex, nuanced, and diverse mix of specialist phraseology, terminology, and symbols. This is why it’s important that our system is properly equipped with the expert knowledge to correctly identify, code, and catalogue the data. We’ve therefore focused our development efforts on crafting a bespoke Enhesa tool. Here’s how it works:

 

Content enrichment with comprehensive metadata tagging

Every piece of content drafted and published to our platform — whether it’s from the news and insight team, our ever-expanding regulatory database, or engaging videos from our events — is run through our enrichment process, which gathers and assigns all the necessary metadata tags. For chemicals, OSCAR4 automatically identifies the chemical names and assigns the structural keys. We then developed a process to match these names to chemicals in our comprehensive database of chemicals, CAS numbers, and alternative aliases. This dataset specifies a primary name for each chemical, which is used as a facet or filter our users can search by on our platform.
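The matching step can be pictured as an alias lookup. The sketch below assumes a tiny in-memory table; the record layout and the `normalize` and `resolve` helpers are hypothetical, and a real database would hold far more names. Mentions that match a name, alias, or CAS number resolve to the primary name, and anything else is kept as an isolated name.

```python
def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so lookups ignore formatting differences."""
    return " ".join(name.lower().split())


# Hypothetical records: primary name, CAS number, and known aliases.
CHEMICALS = [
    {"primary": "Bisphenol A", "cas": "80-05-7",
     "aliases": ["BPA", "4,4'-isopropylidenediphenol"]},
    {"primary": "Formaldehyde", "cas": "50-00-0", "aliases": ["methanal"]},
]

# Every known name, alias, and CAS number points to the primary name.
ALIAS_INDEX = {}
for record in CHEMICALS:
    for key in [record["primary"], record["cas"], *record["aliases"]]:
        ALIAS_INDEX[normalize(key)] = record["primary"]


def resolve(mention: str) -> tuple[str, bool]:
    """Return (name, matched). Unmatched mentions are kept as isolated names."""
    key = normalize(mention)
    if key in ALIAS_INDEX:
        return ALIAS_INDEX[key], True
    return mention.strip(), False


print(resolve("  bpa "))
print(resolve("50-00-0"))
print(resolve("Unobtainium"))
```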

 

Whitelisting and blacklisting key terms

A significant challenge is the diversity of terminology: hundreds of thousands of chemical names and their synonyms need to be identified correctly. Even if a name isn’t in our database, the enrichment process automatically adds it as an isolated chemical name, ensuring no vital information slips through the cracks. We also noticed that OSCAR4 does not detect some names, so we created a mechanism to whitelist the aliases it misses, such as PentaBDE, to ensure they’re not overlooked. Additionally, we found that some chemical aliases coincide with acronyms or names that refer to non-chemical terms. To address this, we developed a tool to blacklist these terms, based on context, as we identify them.
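A whitelist and blacklist of this kind can be sketched as a post-processing step on the recognizer’s output. In the Python below, `detected` stands in for whatever names the recognizer returned. The `refine` function, the cue-phrase approach to context, and the DMF entry are assumptions made for illustration; only PentaBDE comes from the text above.

```python
import re

# Aliases the recognizer is known to miss (PentaBDE is the example given above).
WHITELIST = {"PentaBDE"}

# Hypothetical entry: term -> cue phrases suggesting a non-chemical meaning.
BLACKLIST = {"DMF": {"drug master file"}}


def refine(text: str, detected: set[str]) -> set[str]:
    """Add whitelisted aliases found in the text, drop blacklisted terms in context."""
    lowered = text.lower()
    result = set(detected)
    for term in WHITELIST:
        if re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE):
            result.add(term)
    for term, cues in BLACKLIST.items():
        if any(cue in lowered for cue in cues):
            result.discard(term)
    return result


print(refine("The proposal restricts PentaBDE in foams.", set()))
print(refine("Submit the DMF (Drug Master File) with the toluene data.",
             {"DMF", "toluene"}))
```

Checking for cue phrases is a deliberately simple stand-in for context. It keeps DMF as a chemical when the text gives no sign of the non-chemical meaning.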

By addressing these challenges, we’ve unlocked content and data discovery for chemicals across a diverse range of content types delivered by the Enhesa Product Intelligence platform.

Supporting content discovery with semantic fingerprinting

With comprehensive metadata now attached to all our content, we’ve transformed the user experience on our Product Intelligence web platform. Here’s how:

 

Search facets

The classifications and chemical entities identified become dynamic search facets. Users can filter content based on these facets, narrowing down their results to precisely what they seek.
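As a rough illustration of how tagged metadata turns into facets, the Python sketch below counts facet values over a handful of records and narrows the list by one selected value. The record fields and the helper names are hypothetical.

```python
from collections import Counter

# Hypothetical enriched records; field names are illustrative only.
DOCS = [
    {"id": "news-1", "type": "news", "chemicals": ["Bisphenol A"]},
    {"id": "reg-2", "type": "regulation", "chemicals": ["Bisphenol A", "Formaldehyde"]},
    {"id": "video-3", "type": "video", "chemicals": ["Formaldehyde"]},
]


def values_of(doc, field):
    """Return a field's values as a list, whether stored as a list or a scalar."""
    value = doc.get(field, [])
    return value if isinstance(value, list) else [value]


def facet_counts(docs, field):
    """Count how many documents carry each value of a metadata field."""
    counts = Counter()
    for doc in docs:
        counts.update(values_of(doc, field))
    return counts


def apply_filter(docs, field, wanted):
    """Keep documents whose field equals or contains the wanted value."""
    return [doc for doc in docs if wanted in values_of(doc, field)]


bpa_docs = apply_filter(DOCS, "chemicals", "Bisphenol A")
type_facet = facet_counts(bpa_docs, "type")
print(type_facet)
```

Recomputing the counts on the filtered list is what makes the facets dynamic: each selection updates the remaining options.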

 

Chemical search

Users can search using the chemical name, synonym, or CAS number they prefer and find the content they need.
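One practical detail when accepting CAS numbers as search input is telling a real CAS number from a mistyped one. CAS Registry Numbers end in a check digit: reading right to left and skipping the check digit, each digit is multiplied by its position, and the sum modulo 10 must equal the check digit. The sketch below shows that check. The function name is hypothetical, and the source does not say whether the platform validates queries this way.

```python
import re

# Two to seven digits, two digits, one check digit, separated by hyphens.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(query: str) -> bool:
    """Check the format and check digit of a CAS Registry Number."""
    match = CAS_PATTERN.match(query.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check


print(is_valid_cas("7732-18-5"))
print(is_valid_cas("7732-18-4"))
```

A query that passes the check can be looked up as an identifier, and anything else can be treated as a name or synonym.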

 

Relevance-driven listings

Our search engine leverages the semantic fingerprint to ensure the most relevant content appears prominently in search results.

 

Sorting by relevance vs. recency

Balancing relevance and recency is crucial, so in addition to sorting by date, we developed two relevance-based sorting options:

  • Pure relevance: How closely the content’s semantic fingerprint matches the user’s search terms and filters, irrespective of publication date.
  • Weighted relevance: A blend of recency and relevance — 50% based on content freshness and 50% on the relevancy score.
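The two options can be compared with a small Python sketch. The 50/50 split comes from the description above. How freshness is measured is not stated, so the exponential decay with a 30-day half-life is an assumption, as are the function names and the sample items.

```python
from datetime import date


def freshness(published: date, today: date, half_life_days: float = 30.0) -> float:
    """Assumed freshness measure: 1.0 today, halving every `half_life_days`."""
    age_days = max((today - published).days, 0)
    return 0.5 ** (age_days / half_life_days)


def weighted_score(relevance: float, published: date, today: date) -> float:
    """50% content freshness plus 50% relevancy score (relevance in [0, 1])."""
    return 0.5 * freshness(published, today) + 0.5 * relevance


today = date(2024, 6, 1)
items = [
    {"id": "older-but-closer", "relevance": 0.9, "published": date(2024, 1, 1)},
    {"id": "newer-but-looser", "relevance": 0.6, "published": date(2024, 5, 30)},
]

by_pure = sorted(items, key=lambda item: item["relevance"], reverse=True)
by_weighted = sorted(
    items,
    key=lambda item: weighted_score(item["relevance"], item["published"], today),
    reverse=True,
)
print([item["id"] for item in by_pure])
print([item["id"] for item in by_weighted])
```

With these sample values the two orderings differ: pure relevance puts the older, closer match first, while the weighted score favours the recent item.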

 

Alerts for new content

By saving searches and search filters, users can receive alerts when new, relevant content aligns with their interests.
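A saved-search alert can be sketched as matching each new item’s tags against stored filters. The Python below assumes that values within one facet are alternatives (any one may match) and that every facet in the saved search must match. The source does not state these semantics, and the data shapes and names are hypothetical.

```python
def matches(saved_search: dict, item: dict) -> bool:
    """True when, for every saved facet, the item shares at least one value."""
    return all(
        wanted & item.get(facet, set())
        for facet, wanted in saved_search.items()
    )


def alerts_for(new_item: dict, saved_searches: dict) -> list:
    """Return the users whose saved search matches the new item."""
    return sorted(
        user for user, search in saved_searches.items() if matches(search, new_item)
    )


# Hypothetical saved searches keyed by user.
SAVED = {
    "alice": {"chemicals": {"PFAS"}, "jurisdiction": {"EU", "UK"}},
    "bob": {"chemicals": {"Formaldehyde"}},
}

new_item = {"chemicals": {"PFAS", "PFOA"}, "jurisdiction": {"EU"}, "type": {"news"}}
print(alerts_for(new_item, SAVED))
```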

Discovering onward content with semantic fingerprinting

When users engage with any piece of Enhesa’s content — whether it’s a news article or regulatory information — our platform uses semantic fingerprinting to dynamically suggest similar content based on the metadata and context attached to each piece. This includes videos, regulatory updates, news, and insights, all based on information gleaned from their semantic fingerprints. As new content is added, it’s seamlessly integrated into these recommendations, ensuring users always have access to the most relevant information.
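One simple way to picture fingerprint-based suggestions is to treat each fingerprint as a set of tags and rank other items by how much their tags overlap. The sketch below uses Jaccard similarity for that comparison. The measure is an assumption chosen for illustration, not one named in this article, and the identifiers and tags are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Shared tags divided by all distinct tags; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(current_id: str, fingerprints: dict, top_n: int = 3) -> list:
    """Rank other items by tag overlap with the current item, best first."""
    current = fingerprints[current_id]
    scored = [
        (jaccard(current, tags), other_id)
        for other_id, tags in fingerprints.items()
        if other_id != current_id
    ]
    # Drop items with no shared tags, then sort by score (ties broken by id).
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in scored[:top_n]]


# Hypothetical fingerprints reduced to plain tag sets.
FINGERPRINTS = {
    "news-1": {"PFAS", "food-contact", "EU"},
    "analysis-7": {"PFAS", "food-contact", "US"},
    "video-3": {"PFAS", "food-contact", "EU", "webinar"},
    "news-9": {"lead", "toys"},
}

print(recommend("news-1", FINGERPRINTS))
```

Because the ranking is computed from whatever fingerprints exist at the time, newly added items become candidates as soon as they are tagged.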

Helping Enhesa experts stay ahead of the latest regulatory updates

Beyond our published content, we also use auto-classification technology, including machine translation, to track and analyze external sources. This ensures we don’t miss critical regulatory updates across the world, and it complements the work of our seasoned expert research teams. By combining human expertise with machine efficiency, we stay on top of the latest news and regulatory changes.

What’s next?

In the future, we will continue to improve content and data discovery for our customers in the following areas:

  • Personalized content suggestions: A tailored experience where users receive content recommendations based on their past reading habits and the topic/jurisdiction/job profile they’ve shared.
  • Stronger user feedback loop: Mechanisms for user feedback, such as thumbs-up or thumbs-down ratings, that help us refine our models and better tailor content.
  • Large language model and generative AI support: Imagine having a custom ChatGPT for Enhesa content. We’re creating ways to allow users to query our content in a natural, chat-like manner.

The future is bright, and we’re dedicated to enhancing our customer content discovery journey. Keep an eye out for more updates!

