Getting Started with Mirascope: Removing Semantic Duplicates Using an LLM

by Brenden Burgess


Mirascope is a powerful and user-friendly library that provides a unified interface for working with a wide range of large language model (LLM) providers, including OpenAI, Anthropic, Mistral, Google (Gemini and Vertex AI), Groq, Cohere, LiteLLM, Azure AI, and Amazon Bedrock. It simplifies everything from text generation and structured data extraction to building complex AI-powered workflows and agent systems.

In this guide, we will focus on using Mirascope's OpenAI integration to identify and remove semantic duplicates (entries that may differ in wording but carry the same meaning) from a list of customer reviews.


Install the dependencies

pip install "mirascope[openai]"

OpenAI API key

To get an OpenAI API key, visit https://platform.openai.com/settings/organization/api-keys and generate a new key. If you are a new user, you may need to add billing details and make a minimum payment of $5 to activate API access.

import os
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')
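
If the key may already be set in the environment (for example, in a shell profile or a CI job), prompting on every run is unnecessary. The sketch below only asks for the key when the variable is missing. The helper name ensure_api_key and its prompt_fn parameter are hypothetical and are not part of Mirascope or the OpenAI SDK.

```python
import os
from getpass import getpass


def ensure_api_key(var_name="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only if it is missing.

    Hypothetical helper for illustration; not a Mirascope or OpenAI API.
    """
    key = os.environ.get(var_name)
    if not key:
        # Ask interactively and store the value for the rest of the process.
        key = prompt_fn(f"Enter {var_name}: ")
        os.environ[var_name] = key
    return key
```

Calling ensure_api_key() once before the first model call would then replace the unconditional getpass prompt.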

Define the list of customer reviews

customer_reviews = [
    "Sound quality is amazing!",
    "Audio is crystal clear and very immersive.",
    "Incredible sound, especially the bass response.",
    "Battery doesn't last as advertised.",
    "Needs charging too often.",
    "Battery drains quickly -- not ideal for travel.",
    "Setup was super easy and straightforward.",
    "Very user-friendly, even for my parents.",
    "Simple interface and smooth experience.",
    "Feels cheap and plasticky.",
    "Build quality could be better.",
    "Broke within the first week of use.",
    "People say they can't hear me during calls.",
    "Mic quality is terrible on Zoom meetings.",
    "Great product for the price!"
]

These reviews capture the main customer sentiments: praise for sound quality and ease of use, complaints about battery life, build quality, and call/microphone problems, as well as a positive note on value for money. They reflect common themes found in real user feedback.
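
To see why plain string matching is not enough for a list like this, consider a minimal sketch of exact-match deduplication. The helper names normalize and dedupe_exact, and the normalization rules (lowercasing, stripping ASCII punctuation), are illustrative assumptions and not part of Mirascope. The approach collapses entries that differ only in case or punctuation but leaves paraphrases untouched.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def dedupe_exact(reviews):
    """Keep the first review for each normalized form (order preserved)."""
    seen = set()
    unique = []
    for review in reviews:
        key = normalize(review)
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique


sample = [
    "Battery doesn't last as advertised.",
    "battery doesn't last as advertised",  # differs only in case/punctuation
    "Needs charging too often.",           # same complaint, different words
]
result = dedupe_exact(sample)
print(result)
```

The near-identical second entry is dropped, but the two differently worded battery complaints both survive, which is the gap the LLM-based approach is meant to close.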

Define a Pydantic schema

This Pydantic model defines the structure of the response for a semantic deduplication task on customer reviews. The schema helps structure and validate the output of a language model tasked with clustering or deduplicating natural language input (for example, user feedback, bug reports, or product reviews).

from pydantic import BaseModel, Field

class DeduplicatedReviews(BaseModel):
    duplicates: list[list[str]] = Field(
        ..., description="A list of semantically equivalent customer review groups"
    )
    reviews: list[str] = Field(
        ..., description="The deduplicated list of core customer feedback themes"
    )
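
To get a feel for what this schema accepts and rejects, the sketch below validates two hand-written payloads with Pydantic's model_validate (Pydantic v2 is assumed). The payload values are made up for illustration and are not actual model output.

```python
from pydantic import BaseModel, Field, ValidationError


class DeduplicatedReviews(BaseModel):
    duplicates: list[list[str]] = Field(
        ..., description="A list of semantically equivalent customer review groups"
    )
    reviews: list[str] = Field(
        ..., description="The deduplicated list of core customer feedback themes"
    )


# A well-formed payload (illustrative values only).
good = DeduplicatedReviews.model_validate(
    {
        "duplicates": [["Needs charging too often.", "Battery drains quickly."]],
        "reviews": ["Battery life is poor."],
    }
)

# A malformed payload: `duplicates` must be a list of lists of strings,
# so a bare string where a group is expected fails validation.
try:
    DeduplicatedReviews.model_validate({"duplicates": ["not a group"], "reviews": []})
    rejected = False
except ValidationError:
    rejected = True

print(good.reviews, rejected)
```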

Define a Mirascope @openai.call for semantic deduplication

This code defines a semantic deduplication function using Mirascope's @openai.call decorator, which integrates with OpenAI's GPT-4o model. The deduplicate_customer_reviews function takes a list of customer reviews and uses a structured prompt, defined by the @prompt_template decorator, to guide the LLM in identifying and grouping semantically similar reviews.

The system message asks the model to analyze the meaning, tone, and intent behind each review, grouping those that convey the same feedback even if they are worded differently. The function expects a structured response conforming to the DeduplicatedReviews Pydantic model, which has two outputs: a list of unique, deduplicated review sentiments and a list of grouped duplicates.

This design keeps the LLM's output well-structured and machine-readable, which makes it well suited to customer feedback analysis, survey deduplication, or product review clustering.

from mirascope.core import openai, prompt_template

@openai.call(model="gpt-4o", response_model=DeduplicatedReviews)
@prompt_template(
    """
    SYSTEM:
    You are an AI assistant helping to analyze customer reviews. 
    Your task is to group semantically similar reviews together -- even if they are worded differently.

    - Use your understanding of meaning, tone, and implication to group duplicates.
    - Return two lists:
      1. A deduplicated list of the key distinct review sentiments.
      2. A list of grouped duplicates that share the same underlying feedback.

    USER:
    {reviews}
    """
)
def deduplicate_customer_reviews(reviews: list[str]): ...

The following code runs the deduplicate_customer_reviews function on the list of customer reviews and prints the structured output. First, it calls the function and stores the result in the response variable. To confirm that the model's output conforms to the expected format, it uses an assert statement to check that the response is an instance of the DeduplicatedReviews Pydantic model.

Once validated, it prints the deduplicated results in two sections. The first section, labeled "✅ Distinct Customer Feedback", displays the list of unique review sentiments identified by the model. The second section, "🌀 Grouped Duplicates", lists the clusters of reviews that were recognized as semantically equivalent.

response = deduplicate_customer_reviews(customer_reviews)

# Ensure response format
assert isinstance(response, DeduplicatedReviews)

# Print Output
print("✅ Distinct Customer Feedback:")
for item in response.reviews:
    print("-", item)

print("\n🌀 Grouped Duplicates:")
for group in response.duplicates:
    print("-", group)

The output shows a clean summary of customer feedback, with semantically similar reviews grouped together. The Distinct Customer Feedback section highlights the key insights, while the Grouped Duplicates section captures different phrasings of the same sentiment. This helps eliminate redundancy and makes the feedback easier to analyze.
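
The isinstance check only confirms the shape of the response, not that the model stayed faithful to the input. As a rough sketch, the hypothetical helper find_unknown_reviews below flags any grouped entry that does not appear verbatim in the original list. It works on plain lists, so it can be tried without calling the API, and the sample groups are invented for illustration.

```python
def find_unknown_reviews(original, groups):
    """Return grouped entries that are not verbatim members of `original`."""
    known = set(original)
    return [item for group in groups for item in group if item not in known]


original = [
    "Needs charging too often.",
    "Battery drains quickly -- not ideal for travel.",
    "Great product for the price!",
]
groups = [
    ["Needs charging too often.", "Battery drains quickly -- not ideal for travel."],
    ["Great value for money."],  # reworded entry, as a model might produce
]
unknown = find_unknown_reviews(original, groups)
print(unknown)
```

With a real response, the same check could be run as find_unknown_reviews(customer_reviews, response.duplicates); a non-empty result would suggest the model paraphrased or invented an entry.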



I am a Civil Engineering graduate (2022) from Jamia Millia Islamia, New Delhi, with a keen interest in data science, especially neural networks and their applications in various fields.
