Managing context effectively is an essential challenge when working with large language models, especially in environments like Google Colab, where resource constraints and long documents can quickly exceed the available token window. In this tutorial, we guide you through a practical implementation of the Model Context Protocol (MCP) by building a ModelContextManager that automatically chunks incoming text, generates semantic embeddings using sentence-transformers, and scores each chunk by recency, importance, and relevance. You will learn to integrate this manager with a Hugging Face sequence-to-sequence model, demonstrated here with FLAN-T5, to add, optimize, and retrieve only the most relevant pieces of context. Along the way, we cover token counting with a GPT-2 tokenizer, strategies for optimizing the context window, and interactive sessions that let you query and visualize your dynamic context in real time.
import torch
import numpy as np
from typing import List, Dict, Any, Optional, Union, Tuple
from dataclasses import dataclass
import time
import gc
from tqdm.notebook import tqdm
We import the essential libraries for building a dynamic context manager: torch and numpy handle tensor and numerical operations, while typing and dataclasses provide structured type annotations and data containers. Utility modules such as time and gc support timestamping and memory cleanup, and tqdm.notebook offers interactive progress bars for chunk processing in Colab.
@dataclass
class ContextChunk:
"""A chunk of text with metadata for the Model Context Protocol."""
text: str
    embedding: Optional[torch.Tensor] = None
importance: float = 1.0
timestamp: float = 0.0
    metadata: Dict[str, Any] = None
def __post_init__(self):
if self.metadata is None:
self.metadata = {}
if self.timestamp == 0.0:
self.timestamp = time.time()
The ContextChunk dataclass encapsulates a single text segment along with its embedding, a user-assigned importance score, a timestamp, and arbitrary metadata. Its __post_init__ method ensures that every chunk is stamped with the current time at creation and that metadata defaults to an empty dictionary if none is provided.
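To see how the dataclass behaves, here is a minimal, hypothetical sketch; the sample text, importance value, and metadata are illustrative and not part of the tutorial's pipeline:
# Illustrative only: create a chunk by hand and inspect the auto-filled fields.
chunk = ContextChunk(
    text="MCP keeps only the most relevant context in the window.",
    importance=0.8,
    metadata={"source": "notes"},
)
print(chunk.timestamp > 0)   # True: __post_init__ stamped the creation time
print(chunk.embedding)       # None until an embedding is attached by the manager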
class ModelContextManager:
"""
Manager for implementing Model Context Protocol in LLMs on Google Colab.
Handles context window optimization, token management, and relevance scoring.
"""
def __init__(
self,
max_context_length: int = 8192,
embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2",
relevance_threshold: float = 0.7,
recency_weight: float = 0.3,
importance_weight: float = 0.3,
semantic_weight: float = 0.4,
device: str = "cuda" if torch.cuda.is_available() else "cpu"
):
"""
Initialize the Model Context Manager.
Args:
max_context_length: Maximum number of tokens in context window
embedding_model: Model to use for text embeddings
relevance_threshold: Threshold for chunk relevance to be included
recency_weight: Weight for recency in relevance calculation
importance_weight: Weight for importance in relevance calculation
semantic_weight: Weight for semantic similarity in relevance calculation
device: Device to run computations on
"""
self.max_context_length = max_context_length
self.device = device
        self.chunks = []
self.current_token_count = 0
self.relevance_threshold = relevance_threshold
self.recency_weight = recency_weight
self.importance_weight = importance_weight
self.semantic_weight = semantic_weight
try:
from sentence_transformers import SentenceTransformer
print(f"Loading embedding model {embedding_model}...")
self.embedding_model = SentenceTransformer(embedding_model).to(self.device)
print(f"Embedding model loaded successfully on {self.device}")
except ImportError:
print("Installing sentence-transformers...")
import subprocess
subprocess.run(("pip", "install", "sentence-transformers"))
from sentence_transformers import SentenceTransformer
self.embedding_model = SentenceTransformer(embedding_model).to(self.device)
print(f"Embedding model loaded successfully on {self.device}")
try:
from transformers import GPT2Tokenizer
self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
except ImportError:
print("Installing transformers...")
import subprocess
subprocess.run(("pip", "install", "transformers"))
from transformers import GPT2Tokenizer
self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    def add_chunk(self, text: str, importance: float = 1.0, metadata: Dict[str, Any] = None) -> None:
"""
Add a new chunk of text to the context manager.
Args:
text: The text content to add
importance: Importance score (0-1)
metadata: Additional metadata for the chunk
"""
with torch.no_grad():
embedding = self.embedding_model.encode(text, convert_to_tensor=True)
chunk = ContextChunk(
text=text,
embedding=embedding,
importance=importance,
timestamp=time.time(),
metadata=metadata or {}
)
self.chunks.append(chunk)
self.current_token_count += len(self.tokenizer.encode(text))
if self.current_token_count > self.max_context_length:
self.optimize_context()
def optimize_context(self) -> None:
"""Optimize context by removing less relevant chunks to fit within token limit."""
if not self.chunks:
return
print("Optimizing context window...")
scores = self.score_chunks()
        sorted_indices = np.argsort(scores)[::-1]
        new_chunks = []
new_token_count = 0
for idx in sorted_indices:
            chunk = self.chunks[idx]
chunk_tokens = len(self.tokenizer.encode(chunk.text))
            if new_token_count + chunk_tokens <= self.max_context_length:
                new_chunks.append(chunk)
                new_token_count += chunk_tokens
            else:
                # A highly relevant chunk may replace a weaker chunk that was already kept
                if scores[idx] > self.relevance_threshold * 1.5:
                    for i, included_chunk in enumerate(new_chunks):
                        included_idx = sorted_indices[i]
                        if scores[included_idx] < self.relevance_threshold:
                            included_tokens = len(self.tokenizer.encode(included_chunk.text))
                            if new_token_count - included_tokens + chunk_tokens <= self.max_context_length:
                                new_chunks[i] = chunk
                                new_token_count = new_token_count - included_tokens + chunk_tokens
                                break

        removed_count = len(self.chunks) - len(new_chunks)
        self.chunks = new_chunks
        self.current_token_count = new_token_count
        print(f"Context optimized: removed {removed_count} chunks, {new_token_count} tokens in use")
        gc.collect()
        if self.device == "cuda":
            torch.cuda.empty_cache()

    def score_chunks(self, query: str = None) -> np.ndarray:
"""
Score chunks based on recency, importance, and semantic relevance.
Args:
query: Optional query to calculate semantic relevance against
Returns:
Array of scores for each chunk
"""
if not self.chunks:
            return np.array([])
current_time = time.time()
max_age = max(current_time - chunk.timestamp for chunk in self.chunks) or 1.0
        recency_scores = np.array([
            1.0 - ((current_time - chunk.timestamp) / max_age)
            for chunk in self.chunks
        ])
        importance_scores = np.array([chunk.importance for chunk in self.chunks])
if query is not None:
query_embedding = self.embedding_model.encode(query, convert_to_tensor=True)
            similarity_scores = np.array([
                torch.cosine_similarity(chunk.embedding, query_embedding, dim=0).item()
                for chunk in self.chunks
            ])
similarity_scores = (similarity_scores - similarity_scores.min()) / (similarity_scores.max() - similarity_scores.min() + 1e-8)
else:
similarity_scores = np.ones(len(self.chunks))
final_scores = (
self.recency_weight * recency_scores +
self.importance_weight * importance_scores +
self.semantic_weight * similarity_scores
)
return final_scores
def retrieve_context(self, query: str = None, k: int = None) -> str:
"""
Retrieve the most relevant context for a given query.
Args:
query: The query to retrieve context for
k: The maximum number of chunks to return (None = all relevant chunks)
Returns:
String containing the combined relevant context
"""
if not self.chunks:
return ""
scores = self.score_chunks(query)
        relevant_indices = np.where(scores >= self.relevance_threshold)[0]
        relevant_indices = relevant_indices[np.argsort(scores[relevant_indices])[::-1]]
        if k is not None:
            relevant_indices = relevant_indices[:k]
        relevant_texts = [self.chunks[i].text for i in relevant_indices]
return "\n\n".join(relevant_texts)
    def get_stats(self) -> Dict[str, Any]:
"""Get statistics about the current context state."""
return {
"chunk_count": len(self.chunks),
"token_count": self.current_token_count,
"max_tokens": self.max_context_length,
"usage_percentage": self.current_token_count / self.max_context_length * 100 if self.max_context_length else 0,
"avg_chunk_size": self.current_token_count / len(self.chunks) if self.chunks else 0,
"oldest_chunk_age": time.time() - min(chunk.timestamp for chunk in self.chunks) if self.chunks else 0,
}
def visualize_context(self):
"""Visualize the current context window distribution."""
try:
import matplotlib.pyplot as plt
import pandas as pd
if not self.chunks:
print("No chunks to visualize")
return
scores = self.score_chunks()
            chunk_sizes = [len(self.tokenizer.encode(chunk.text)) for chunk in self.chunks]
            timestamps = [chunk.timestamp for chunk in self.chunks]
            relative_times = [time.time() - ts for ts in timestamps]
            importance = [chunk.importance for chunk in self.chunks]
df = pd.DataFrame({
'Size (tokens)': chunk_sizes,
'Age (seconds)': relative_times,
'Importance': importance,
'Score': scores
})
fig, axs = plt.subplots(2, 2, figsize=(14, 10))
            axs[0, 0].bar(range(len(chunk_sizes)), chunk_sizes)
            axs[0, 0].set_title('Token Distribution by Chunk')
            axs[0, 0].set_ylabel('Tokens')
            axs[0, 0].set_xlabel('Chunk Index')
            axs[0, 1].scatter(chunk_sizes, scores)
            axs[0, 1].set_title('Score vs Chunk Size')
            axs[0, 1].set_xlabel('Tokens')
            axs[0, 1].set_ylabel('Score')
            axs[1, 0].scatter(relative_times, scores)
            axs[1, 0].set_title('Score vs Chunk Age')
            axs[1, 0].set_xlabel('Age (seconds)')
            axs[1, 0].set_ylabel('Score')
            axs[1, 1].scatter(importance, scores)
            axs[1, 1].set_title('Score vs Importance')
            axs[1, 1].set_xlabel('Importance')
            axs[1, 1].set_ylabel('Score')
plt.tight_layout()
plt.show()
except ImportError:
print("Please install matplotlib and pandas for visualization")
print('!pip install matplotlib pandas')
The ModelContextManager class orchestrates end-to-end context handling for the LLM by chunking input text, generating embeddings, and tracking token usage against a configurable limit. It implements relevance scoring (combining recency, importance, and semantic similarity to the query), automatic context pruning, retrieval of the most relevant chunks, and convenient utilities for monitoring and visualizing the context state.
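As a quick, hedged illustration of the API defined above (the sample texts, importance values, and query below are placeholders, not content from the tutorial):
# Illustrative sketch of using ModelContextManager; texts and query are placeholders.
manager = ModelContextManager(max_context_length=4096)
manager.add_chunk("MCP scores chunks by recency, importance, and semantic similarity.", importance=1.0)
manager.add_chunk("Unrelated filler text about a completely different topic.", importance=0.3)

# Only chunks scoring at or above relevance_threshold (0.7 by default) are returned,
# ordered from most to least relevant; k caps the number of chunks.
context = manager.retrieve_context(query="How are chunks scored?", k=2)
print(context)
print(manager.get_stats()["token_count"], "tokens currently tracked")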
class MCPColabDemo:
"""Demonstration of Model Context Protocol in Google Colab with a Language Model."""
def __init__(
self,
model_name: str = "google/flan-t5-base",
max_context_length: int = 2048,
device: str = "cuda" if torch.cuda.is_available() else "cpu"
):
"""
Initialize the MCP Colab demo with a specified model.
Args:
model_name: Hugging Face model name
max_context_length: Maximum context length for the MCP manager
device: Device to run the model on
"""
self.device = device
self.context_manager = ModelContextManager(
max_context_length=max_context_length,
device=device
)
try:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
print(f"Loading model {model_name}...")
self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
print(f"Model loaded successfully on {device}")
except ImportError:
print("Installing transformers...")
import subprocess
subprocess.run(("pip", "install", "transformers"))
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
print(f"Model loaded successfully on {device}")
def add_document(self, text: str, chunk_size: int = 512, overlap: int = 50) -> None:
"""
Add a document to the context by chunking it appropriately.
Args:
text: Document text
chunk_size: Size of each chunk in characters
overlap: Overlap between chunks in characters
"""
        chunks = []
        for i in range(0, len(text), chunk_size - overlap):
            chunk = text[i:i + chunk_size]
if len(chunk) > 20:
chunks.append(chunk)
print(f"Adding {len(chunks)} chunks to context...")
for i, chunk in enumerate(tqdm(chunks)):
pos = i / len(chunks)
importance = 1.0 - 0.5 * min(pos, 1 - pos)
self.context_manager.add_chunk(
text=chunk,
importance=importance,
metadata={"source": "document", "position": i, "total_chunks": len(chunks)}
)
def process_query(self, query: str, max_new_tokens: int = 256) -> str:
"""
Process a query using the context manager and model.
Args:
query: The query to process
max_new_tokens: Maximum number of tokens in response
Returns:
Model response
"""
self.context_manager.add_chunk(query, importance=1.0, metadata={"type": "query"})
relevant_context = self.context_manager.retrieve_context(query=query)
prompt = f"Context: {relevant_context}\n\nQuestion: {query}\n\nAnswer:"
inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
print("Generating response...")
with torch.no_grad():
outputs = self.model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True,
temperature=0.7,
top_p=0.9,
)
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
self.context_manager.add_chunk(
response,
importance=0.9,
metadata={"type": "response", "query": query}
)
return response
def interactive_session(self):
"""Run an interactive session in the notebook."""
from IPython.display import clear_output
print("Starting interactive MCP session. Type 'exit' to end.")
        conversation_history = []
while True:
query = input("\nYour query: ")
if query.lower() == 'exit':
break
if query.lower() == 'stats':
print("\nContext Statistics:")
stats = self.context_manager.get_stats()
for key, value in stats.items():
print(f"{key}: {value}")
self.context_manager.visualize_context()
continue
if query.lower() == 'clear':
                self.context_manager.chunks = []
                self.context_manager.current_token_count = 0
                conversation_history = []
clear_output(wait=True)
print("Context cleared!")
continue
response = self.process_query(query)
conversation_history.append((query, response))
print("\nResponse:")
print(response)
print("\n" + "-"*50)
stats = self.context_manager.get_stats()
print(f"Context usage: {stats('token_count')}/{stats('max_tokens')} tokens ({stats('usage_percentage'):.1f}%)")
The MCPColabDemo class ties the context manager to a seq2seq LLM, loading FLAN-T5 (or any specified Hugging Face model) on the chosen device, and provides utility methods for chunking and ingesting entire documents, processing user queries by prepending only the most relevant context, and running an interactive Colab session complete with real-time stats, visualizations, and commands for clearing or inspecting the evolving context window.
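To make the workflow concrete, a short, hypothetical usage sketch follows; the document text and question are placeholders:
# Hypothetical usage of MCPColabDemo; the document text and question are placeholders.
demo = MCPColabDemo(model_name="google/flan-t5-base", max_context_length=2048)
long_document = "..."  # replace with any long text the model should ground its answers in
demo.add_document(long_document, chunk_size=512, overlap=50)
answer = demo.process_query("Summarize the key points of this document.")
print(answer)
# Or hand control to the notebook user:
# demo.interactive_session()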
def run_mcp_demo():
"""Run a simple demo of the Model Context Protocol."""
print("Running Model Context Protocol Demo...")
context_manager = ModelContextManager(max_context_length=4096)
print("Adding sample chunks...")
context_manager.add_chunk(
"The Model Context Protocol (MCP) is a framework for managing context "
"windows in large language models. It helps optimize token usage and improve relevance.",
importance=1.0
)
context_manager.add_chunk(
"Context management involves techniques like sliding windows, chunking, "
"and relevance filtering to handle large documents efficiently.",
importance=0.8
)
for i in range(10):
context_manager.add_chunk(
f"This is test chunk {i} with some filler content to simulate a larger context "
f"window that needs optimization. This helps demonstrate the MCP functionality "
f"for context window management in language models on Google Colab.",
importance=0.5 - (i * 0.02)
)
stats = context_manager.get_stats()
print("\nInitial Statistics:")
for key, value in stats.items():
print(f"{key}: {value}")
query = "How does the Model Context Protocol work?"
print(f"\nRetrieving context for: '{query}'")
context = context_manager.retrieve_context(query)
print(f"\nRelevant context:\n{context}")
print("\nVisualizing context:")
context_manager.visualize_context()
print("\nDemo complete!")
The run_mcp_demo function ties everything together in a single script: it instantiates the ModelContextManager, adds a series of sample chunks with varying importance, prints initial statistics, retrieves and displays the most relevant context for a test query, and finally visualizes the context window, providing a complete demonstration of the Model Context Protocol in action.
if __name__ == "__main__":
run_mcp_demo()
Finally, this standard Python entry-point guard ensures that run_mcp_demo() executes only when the script is run directly (rather than imported as a module), triggering the end-to-end demonstration of the Model Context Protocol workflow.
In conclusion, we now have a fully functional MCP system that not only curbs runaway token usage but also prioritizes the context fragments that truly matter for your queries. The ModelContextManager gives you tools to balance semantic relevance, temporal freshness, and user-assigned importance, while the accompanying MCPColabDemo class provides an accessible framework for real-time experimentation and visualization. Armed with these patterns, you can extend the core principles by adjusting relevance thresholds, experimenting with different embedding models, or integrating alternative LLM backends to suit your domain-specific workflows; one possible configuration is sketched below. Ultimately, this approach lets you build concise yet highly relevant prompts, resulting in more accurate and efficient responses from your language models.
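For instance, a customized manager might look like the hedged sketch below; the alternative embedding model, threshold, and weights are illustrative choices rather than recommendations from this tutorial:
# Illustrative customization only: stricter relevance threshold, heavier semantic
# weighting, and a different sentence-transformers embedding model.
custom_manager = ModelContextManager(
    max_context_length=4096,
    embedding_model="sentence-transformers/all-mpnet-base-v2",
    relevance_threshold=0.8,
    recency_weight=0.2,
    importance_weight=0.2,
    semantic_weight=0.6,
)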
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent venture is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a broad audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
