MLflow is a powerful open-source platform for managing the machine learning lifecycle. While it is traditionally used for tracking model experiments, logging parameters, and managing deployments, MLflow recently introduced support for evaluating large language models (LLMs).
In this tutorial, we explore how to use MLflow to evaluate the performance of an LLM, in our case Google's Gemini model, on a set of fact-based prompts. We will generate responses to the prompts with Gemini and assess their quality using a variety of metrics supported directly by MLflow.
Setting up the dependencies
For this tutorial, we will use both the OpenAI and Gemini APIs. MLflow's built-in generative AI evaluation metrics currently rely on OpenAI models (for example, GPT-4) to act as judges for metrics such as answer similarity or faithfulness, so an OpenAI API key is required. You can obtain an OpenAI API key from the OpenAI platform and a Gemini API key from Google AI Studio.
Installing the libraries
pip install mlflow openai pandas google-genai
Setting the OpenAI and Google API keys as environment variables
import os
from getpass import getpass
os.environ("OPENAI_API_KEY") = getpass('Enter OpenAI API Key:')
os.environ("GOOGLE_API_KEY") = getpass('Enter Google API Key:')
Preparing the evaluation data and fetching Gemini outputs
import mlflow
import openai
import os
import pandas as pd
from google import genai
Creating the evaluation data
In this step, we define a small evaluation dataset containing factual prompts along with their correct ground-truth answers. The prompts cover topics such as science, health, web development, and programming. This structured format lets us objectively compare the Gemini-generated responses against known correct answers using MLflow's various evaluation metrics.
eval_data = pd.DataFrame(
    {
        "inputs": [
            "Who developed the theory of general relativity?",
            "What are the primary functions of the liver in the human body?",
            "Explain what HTTP status code 404 means.",
            "What is the boiling point of water at sea level in Celsius?",
            "Name the largest planet in our solar system.",
            "What programming language is primarily used for developing iOS apps?",
        ],
        "ground_truth": [
            "Albert Einstein developed the theory of general relativity.",
            "The liver helps in detoxification, protein synthesis, and production of biochemicals necessary for digestion.",
            "HTTP 404 means 'Not Found' -- the server can't find the requested resource.",
            "The boiling point of water at sea level is 100 degrees Celsius.",
            "Jupiter is the largest planet in our solar system.",
            "Swift is the primary programming language used for iOS app development.",
        ],
    }
)
eval_data
Get Gemini answers
This code block defines a gemini_completion() helper function that sends a prompt to the Gemini 1.5 Flash model using the Google Gen AI SDK and returns the generated response as plain text. We then apply this function to each prompt in our evaluation dataset to generate the model's predictions, storing them in a new "predictions" column. These predictions will later be evaluated against the ground-truth answers.
# The client reads the GOOGLE_API_KEY environment variable set earlier.
client = genai.Client()

def gemini_completion(prompt: str) -> str:
    """Send a prompt to Gemini 1.5 Flash and return the raw text response."""
    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=prompt
    )
    return response.text.strip()

# Generate a prediction for every prompt and store it in a new column.
eval_data["predictions"] = eval_data["inputs"].apply(gemini_completion)
eval_data
Evaluating Gemini outputs with MLflow
In this step, we start an MLflow run to evaluate the responses generated by the Gemini model against the set of factual ground-truth answers. We use the mlflow.evaluate() method with four lightweight metrics: answer_similarity (measures the semantic similarity between the model's output and the ground truth), exact_match (checks for word-for-word matches), latency (tracks response generation time), and token_count (records the number of output tokens).
It is important to note that the answer_similarity metric internally uses an OpenAI GPT model to judge the semantic closeness between answers, which is why access to the OpenAI API is required. This setup provides an efficient way to evaluate LLM outputs without writing custom evaluation logic. The final evaluation results are printed and also saved to a CSV file for later inspection or visualization.
mlflow.set_tracking_uri("mlruns")
mlflow.set_experiment("Gemini Simple Metrics Eval")
with mlflow.start_run():
    results = mlflow.evaluate(
        model_type="question-answering",
        data=eval_data,
        predictions="predictions",
        targets="ground_truth",
        extra_metrics=[
            mlflow.metrics.genai.answer_similarity(),
            mlflow.metrics.exact_match(),
            mlflow.metrics.latency(),
            mlflow.metrics.token_count(),
        ],
    )
    print("Aggregated Metrics:")
    print(results.metrics)

    # Save the detailed per-row results table for later inspection
    results.tables["eval_results_table"].to_csv("gemini_eval_results.csv", index=False)
To display the detailed results of our evaluation, we load the saved CSV file into a DataFrame and adjust the display settings to ensure full visibility of each response. This lets us inspect individual prompts, the Gemini-generated predictions, the ground-truth answers, and the associated metric scores without truncation, which is especially useful in notebook environments like Colab or Jupyter.
results = pd.read_csv('gemini_eval_results.csv')
pd.set_option('display.max_colwidth', None)
results
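Beyond eyeballing the full table, it can help to sort the per-row results and look at the weakest answers first. The snippet below is a minimal sketch, assuming the score column follows MLflow's usual "<metric>/v1/score" naming; the column name is an assumption, so it is checked against results.columns before use.
# Hypothetical inspection step: surface the rows with the lowest judged similarity.
score_col = "answer_similarity/v1/score"  # assumed column name; verify via results.columns
if score_col in results.columns:
    cols = [c for c in ["inputs", "predictions", score_col] if c in results.columns]
    print(results.sort_values(score_col)[cols].head(3))
else:
    print("Score column not found; available columns:", list(results.columns))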
