Getting Started with MLflow for LLM Evaluation

by Brenden Burgess


MLflow is a powerful open-source platform for managing the machine learning lifecycle. While it is traditionally used for tracking model experiments, logging parameters, and managing deployments, MLflow has recently introduced support for evaluating large language models (LLMs).

In this tutorial, we explore how to use MLflow to evaluate the performance of an LLM, in our case Google's Gemini model, on a set of fact-based prompts. We will generate responses to these prompts using Gemini and assess their quality using a variety of metrics supported directly by MLflow.

Setting Up the Dependencies

For this tutorial, we will use both the OpenAI and Gemini APIs. MLflow's built-in GenAI evaluation metrics currently rely on OpenAI models (for example, GPT-4) to act as judges for metrics such as answer similarity or faithfulness, so an OpenAI API key is required in addition to the Gemini key. You can obtain both keys from their respective provider platforms.

Installing the Libraries

pip install mlflow openai pandas google-genai
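The GenAI evaluation metrics are a relatively recent addition to MLflow, so it can be worth confirming that the installed packages import cleanly and are recent enough before continuing. A minimal sanity check (purely informational; the exact minimum versions required are not specified here):

import mlflow
import openai
import pandas as pd

# Print the installed versions; a recent MLflow 2.x release is assumed
# for the GenAI evaluation metrics used later in this tutorial.
print("mlflow:", mlflow.__version__)
print("openai:", openai.__version__)
print("pandas:", pd.__version__)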

Setting the OpenAI and Google API Keys as Environment Variables

import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key:')
os.environ["GOOGLE_API_KEY"] = getpass('Enter Google API Key:')

Preparing the Evaluation Data and Fetching Gemini Outputs

import mlflow
import openai
import os
import pandas as pd
from google import genai

Creating the Evaluation Data

In this step, we define a small evaluation dataset containing factual prompts along with their correct ground-truth answers. These prompts cover topics such as science, health, web development, and programming. This structured format allows us to objectively compare Gemini's generated responses against the known correct answers using the various evaluation metrics available in MLflow.

eval_data = pd.DataFrame(
    {
        "inputs": [
            "Who developed the theory of general relativity?",
            "What are the primary functions of the liver in the human body?",
            "Explain what HTTP status code 404 means.",
            "What is the boiling point of water at sea level in Celsius?",
            "Name the largest planet in our solar system.",
            "What programming language is primarily used for developing iOS apps?",
        ],
        "ground_truth": [
            "Albert Einstein developed the theory of general relativity.",
            "The liver helps in detoxification, protein synthesis, and production of biochemicals necessary for digestion.",
            "HTTP 404 means 'Not Found' -- the server can't find the requested resource.",
            "The boiling point of water at sea level is 100 degrees Celsius.",
            "Jupiter is the largest planet in our solar system.",
            "Swift is the primary programming language used for iOS app development."
        ]
    }
)

eval_data

Getting Gemini Responses

This code block defines a helper function gemini_completion() that sends a prompt to the Gemini 1.5 Flash model using the Google Gen AI SDK and returns the generated response as plain text. We then apply this function to each prompt in our evaluation dataset to generate the model's predictions, storing them in a new "predictions" column. These predictions will later be evaluated against the ground-truth answers.

client = genai.Client()
def gemini_completion(prompt: str) -> str:
    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=prompt
    )
    return response.text.strip()

eval_data["predictions"] = eval_data["inputs"].apply(gemini_completion)
eval_data
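Calls to the Gemini API can occasionally fail because of rate limits or transient network errors. If that happens while generating predictions, a simple retry wrapper around the helper above can help; this is an optional hardening step sketched here, and the retry count and delay are arbitrary choices rather than part of the original tutorial:

import time

def gemini_completion_with_retry(prompt: str, retries: int = 3, delay: float = 2.0) -> str:
    # Retry the Gemini call a few times before giving up, pausing briefly
    # between attempts. Treating every exception as retryable is a
    # simplification for illustration purposes.
    for attempt in range(retries):
        try:
            return gemini_completion(prompt)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)

# Usage (drop-in replacement for the line above):
# eval_data["predictions"] = eval_data["inputs"].apply(gemini_completion_with_retry)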

Evaluating Gemini Outputs with MLflow

In this step, we start an MLflow run to evaluate the responses generated by the Gemini model against the factual ground-truth answers. We use the mlflow.evaluate() method with four lightweight metrics: answer_similarity (measures the semantic similarity between the model's output and the ground truth), exact_match (checks for word-for-word matches), latency (tracks response generation time), and token_count (records the number of output tokens).

It is important to note that the answer_similarity metric internally uses an OpenAI GPT model to judge the semantic closeness between answers, which is why access to the OpenAI API is required. This setup provides an efficient way to evaluate LLM outputs without writing custom evaluation logic. The final evaluation results are printed and also saved to a CSV file for later inspection or visualization.

mlflow.set_tracking_uri("mlruns")
mlflow.set_experiment("Gemini Simple Metrics Eval")

with mlflow.start_run():
    results = mlflow.evaluate(
        model_type="question-answering",
        data=eval_data,
        predictions="predictions",
        targets="ground_truth",
        extra_metrics=[
            mlflow.metrics.genai.answer_similarity(),
            mlflow.metrics.exact_match(),
            mlflow.metrics.latency(),
            mlflow.metrics.token_count()
        ]
    )
    print("Aggregated Metrics:")
    print(results.metrics)

    # Save detailed table
    results.tables["eval_results_table"].to_csv("gemini_eval_results.csv", index=False)
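By default, MLflow picks a GPT judge model for the GenAI metrics on its own; if you prefer to make that choice explicit, answer_similarity accepts a model URI. A minimal sketch, assuming an OpenAI judge is used (the specific model name "gpt-4o-mini" is our example choice, not a requirement):

# Pin the judge model explicitly instead of relying on the default.
# Any OpenAI chat model that your API key can access should work here.
similarity_metric = mlflow.metrics.genai.answer_similarity(model="openai:/gpt-4o-mini")

# Pass it to mlflow.evaluate() via extra_metrics exactly as before, e.g.:
# extra_metrics=[similarity_metric, mlflow.metrics.exact_match(), ...]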

To display the detailed evaluation results, we load the saved CSV file into a DataFrame and adjust the display settings to ensure full visibility of each response. This lets us inspect individual prompts, Gemini's predictions, the ground-truth answers, and the associated metric scores without truncation, which is particularly useful in notebook environments like Colab or Jupyter.

results = pd.read_csv('gemini_eval_results.csv')
pd.set_option('display.max_colwidth', None)
results
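Beyond inspecting individual rows, a quick aggregate view of the numeric metric columns (latency, token count, per-row scores) can be useful. A small sketch follows; because the exact column names in the results table can vary with the MLflow version, it selects numeric columns generically rather than hard-coding them:

# Summarize every numeric metric column without assuming exact column names.
numeric_summary = results.select_dtypes("number").describe().T
print(numeric_summary[["mean", "min", "max"]])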



