Just start typing...

LLM & NLP-based term standardization cuts weeks of manual work

Published January 22, 2026
Let us tell you more about our projects
Start here

For a Forbes Global 2000 client, we have automated the manual process of matching adverse event descriptions from clinical reports with standardized MedDRA vocabulary. Faster and more accurate processing of clinical data means lower R&D costs, faster regulatory submissions, and, ultimately, better treatments for patients.

The challenge: complex terminology and slow manual decoding

The client was manually converting unstructured medical terms from clinical study records into standardized MedDRA (Medical Dictionary for Regulatory Activities) codes, a global regulatory standard for pharmacovigilance and drug safety reporting.

Medical terms can be composed of multiple words and must be converted into words defined by MedDRA. Manual mapping required qualified experts and could take up to two weeks per study, creating a bottleneck in the data processing pipeline. The company needed an automated, reliable way to decode the terms, and integrate it into their existing ETL flows.

The Solution: automated MedDRA standardization pipeline

To ensure a focused and viable implementation, we first conducted a short AI Readiness Workshop with the client’s technical and business teams. In just three hours, the workshop helped

  • analyze the client’s existing data pipelines and constraints,
  • and select the simplest and most effective integration approach.

As a result, instead of redesigning the entire workflow, we defined a lightweight solution: a new automated decoding step embedded directly into the client’s existing ETL process.

We designed and implemented an automated term decoding pipeline using a combination of Large Language Models (LLM) and Dataiku, and integrated it with the client's internal systems via API.

The core of the solution is an LLM-powered service, prompted and configured within Dataiku to accurately map free-text medical terms to the correct, standardized MedDRA codes. This new step was implemented as a reusable pipeline component that can be embedded into any existing data workflow.

Technical implementation

LLM AI

We used Dataiku for pipeline orchestration, LLM integration, and API management with intelligent term standardization performed by ChatGPT.

API-driven integration

The client’s existing ETL pipeline sends batches of new medical terms to the Dataiku application via REST API. This step was added to their workflows with a single line of code.

LLM-powered decoding

Inside the secure Dataiku environment, LLM processes the incoming terms and proposes MedDRA terms based on semantic meaning.

Multi-stage validation and hallucination control

To achieve near-100% accuracy, we implemented a dedicated verification step:

  • All LLM-generated terms are automatically cross-checked against the official MedDRA dictionary. The system ensures that the proposed term actually exists in MedDRA before writing it to the results table.
  • If no valid match is found, the data is not brought further in the pipeline. As a result, the system outputs only 100% accurate data — or no data at all.

If no valid MedDRA match is found, the term undecoded at this stage. It is automatically queued for the next scheduled pipeline run and reprocessed when the workflow executes again. If the term still cannot be decoded after repeated attempts, it is flagged and routed for manual expert review.

Before production rollout, the system was additionally validated by domain experts. At this stage, around 95% of terms were successfully decoded automatically, including terms that specialists previously failed to decode manually.

Reliable handshake and storage

The ETL process monitors job completion in Dataiku and retrieves validated results via API. The standardized data is then written directly into the client’s Snowflake data warehouse for downstream regulatory and analytical use.

Data security

The system processes only study identifiers and medical terminology (disease names, adverse event descriptions). It does not operate on personal patient data, neither shares this data outside the organization, ensuring compliance with internal security and privacy requirements.

Operational integration and results

The solution runs as a scheduled step within the client’s production pipelines:

  • New study records (typically covering one day of gathered data) are processed every few hours.
  • Night-time execution ensures that by the start of the business day, specialists receive fully decoded and standardized terms.
  • Load is distributed across processing windows, ensuring stable system performance.

This approach turned term standardization into a fully automated background process instead of a manual expert task.

The project removed a critical manual bottleneck from the client’s pharmacovigilance data pipeline:

  • Manual MedDRA decoding time reduced from up to two weeks to near real time.
  • Near-100% accuracy in standardization, significantly reducing the risk of reporting errors.
  • Faster processing cycles for critical pharmacovigilance data.
  • Seamless integration with the existing IT landscape (ETL pipelines, Dataiku, Snowflake) without disrupting operations.

The client now has an automated, scalable, and trustworthy mechanism for medical term standardization — turning a slow expert-only task into a fully integrated data processing service that serves data every morning.

*** 

Previously, we Earlier, we described how we built a GenAI search platform for a major pharma company, turning days of manual research across scientific sources into a process that takes minutes. Read the full story here.

Energy Management System for Mata Energy

The solution is aimed to facilitate sector-coupled energy supply, improving its efficiency. For the client we provided a consulting session, and developed an MVP.

GenAI medical search cuts drug discovery time and costs

We developed a GenAI search platform for a major pharma company, transforming a days-long, manual research process across scientific resources into a task taking a minute. The solution accelerates drug development cycles, reduces high-value labor costs, and improves patient outcomes by ensuring critical scientific insights are captured and acted upon instantly.

Omnichannel VoC analytics for a pharmaceutical company

In the pharmaceutical industry, delays in addressing customer feedback can pose serious risks to both patient health and corporate reputation. Our client set out to build a solution for managing and analyzing user requests.
A major pharmaceutical company needed to improve how its medical staff learned about new products. Their manual process for analyzing training tests was inefficient. We developed an AI system that pinpoints weaknesses in training materials, allowing for quick, precise improvements. The result was an increase in knowledge retention and a more efficient training process.
IT architects increasingly treat AI as a practical tool. LLMs like DeepSeek, Claude, and Gemma handle routine tasks, support decision-making, and assist in system design. Here’s what you can delegate, and how to maintain control over the results.
WaveAccess develops custom devices with Embedded AI for Healthcare and other industries. One recent example is a portable diagnostic tool for medical video analysis — designed to deliver fast, intelligent insights wherever care is provided.

Related Services

How we process your personal data

When you submit the completed form, your personal data will be processed by WaveAccess USA. Due to our international presence, your data may be transferred and processed outside the country where you reside or are located. You have the right to withdraw your consent at any time.
Please read our Privacy Policy for more information.