A coding guide to creating a functional data analysis workflow using Lilac to transform, filter, and export structured information

by Brenden Burgess


In this tutorial, we demonstrate a fully functional, modular data analysis pipeline using the Lilac library, without relying on signal processing. It combines Lilac's dataset management capabilities with Python's functional programming paradigm to create a clean, extensible workflow. From setting up a project and generating realistic sample data to extracting insights and exporting filtered outputs, the tutorial emphasizes reusable, testable code structures. Core functional utilities, such as pipe, map_over, and filter_by, are used to build a declarative flow, while Pandas facilitates detailed data transformations and quality analysis.

!pip install lilac[all] pandas numpy

To start, we install the required libraries using the command !pip install lilac[all] pandas numpy. This ensures we have the complete Lilac suite alongside Pandas and NumPy for smooth data handling and analysis. We should run it in our notebook before continuing.

import json
import uuid
import pandas as pd
from pathlib import Path
from typing import List, Dict, Any, Tuple, Optional
from functools import reduce, partial
import lilac as ll

We import all the essential libraries. These include json and uuid for handling data and generating unique project names, pandas for working with tabular data, and pathlib's Path for directory management. We also bring in type hints to improve function clarity and functools for functional composition patterns. Finally, we import the core Lilac library as ll to manage our datasets.

def pipe(*functions):
   """Compose functions left to right (pipe operator)"""
   return lambda x: reduce(lambda acc, f: f(acc), functions, x)


def map_over(func, iterable):
   """Functional map wrapper"""
   return list(map(func, iterable))


def filter_by(predicate, iterable):
   """Functional filter wrapper"""
   return list(filter(predicate, iterable))


def create_sample_data() -> List[Dict[str, Any]]:
   """Generate realistic sample data for analysis"""
   return [
       {"id": 1, "text": "What is machine learning?", "category": "tech", "score": 0.9, "tokens": 5},
       {"id": 2, "text": "Machine learning is AI subset", "category": "tech", "score": 0.8, "tokens": 6},
       {"id": 3, "text": "Contact support for help", "category": "support", "score": 0.7, "tokens": 4},
       {"id": 4, "text": "What is machine learning?", "category": "tech", "score": 0.9, "tokens": 5}, 
       {"id": 5, "text": "Deep learning neural networks", "category": "tech", "score": 0.85, "tokens": 4},
       {"id": 6, "text": "How to optimize models?", "category": "tech", "score": 0.75, "tokens": 5},
       {"id": 7, "text": "Performance tuning guide", "category": "guide", "score": 0.6, "tokens": 3},
       {"id": 8, "text": "Advanced optimization techniques", "category": "tech", "score": 0.95, "tokens": 3},
       {"id": 9, "text": "Gradient descent algorithm", "category": "tech", "score": 0.88, "tokens": 3},
       {"id": 10, "text": "Model evaluation metrics", "category": "tech", "score": 0.82, "tokens": 3},
   ]

In this section, we define reusable functional utilities. The pipe function lets us chain transformations cleanly, while map_over and filter_by allow us to transform or filter iterables functionally, as the short sketch below illustrates. We then create a sample dataset that mimics real-world records, with fields such as text, category, score, and tokens, which we will use later to demonstrate Lilac's dataset curation capabilities.
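As a quick, illustrative sketch (not part of the main pipeline; the field names come from the sample data above), the utilities compose naturally, for example to pull out the text of all tech records:

# Illustrative only: keep tech records, then project them to their text field
sample = create_sample_data()

tech_texts = pipe(
    lambda rows: filter_by(lambda r: r["category"] == "tech", rows),
    lambda rows: map_over(lambda r: r["text"], rows),
)(sample)

print(tech_texts[:3])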

def setup_lilac_project(project_name: str) -> str:
   """Initialize Lilac project directory"""
   project_dir = f"./{project_name}-{uuid.uuid4().hex[:6]}"
   Path(project_dir).mkdir(exist_ok=True)
   ll.set_project_dir(project_dir)
   return project_dir


def create_dataset_from_data(name: str, data: List[Dict]) -> ll.Dataset:
   """Create Lilac dataset from data"""
   data_file = f"{name}.jsonl"
   with open(data_file, 'w') as f:
       for item in data:
           f.write(json.dumps(item) + '\n')
  
   config = ll.DatasetConfig(
       namespace="tutorial",
       name=name,
       source=ll.sources.JSONSource(filepaths=[data_file])
   )
  
   return ll.create_dataset(config)

With the setup_lilac_project function, we initialize a unique working directory for our Lilac project and register it through the Lilac API. Using create_dataset_from_data, we convert our raw list of dictionaries into a .jsonl file and create a Lilac dataset by defining its configuration. This prepares the data for clean, structured analysis; a brief usage sketch follows.
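As a brief usage sketch (the project and dataset names here are arbitrary, chosen only for illustration), the two helpers can be exercised on their own before wiring them into the full pipeline:

# Illustrative usage of the setup helpers with arbitrary names
demo_dir = setup_lilac_project("functional_demo")
demo_dataset = create_dataset_from_data("demo_records", create_sample_data())
print(f"Project created at {demo_dir}")
print(demo_dataset.to_pandas(['id', 'text', 'category']).head())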

def extract_dataframe(dataset: ll.Dataset, fields: List[str]) -> pd.DataFrame:
   """Extract data as pandas DataFrame"""
   return dataset.to_pandas(fields)


def apply_functional_filters(df: pd.DataFrame) -> Dict[str, pd.DataFrame]:
   """Apply various filters and return multiple filtered versions"""

   filters = {
       'high_score': lambda df: df[df['score'] >= 0.8],
       'tech_category': lambda df: df[df['category'] == 'tech'],
       'min_tokens': lambda df: df[df['tokens'] >= 4],
       'no_duplicates': lambda df: df.drop_duplicates(subset=['text'], keep='first'),
       'combined_quality': lambda df: df[(df['score'] >= 0.8) & (df['tokens'] >= 3) & (df['category'] == 'tech')]
   }

   return {name: filter_func(df.copy()) for name, filter_func in filters.items()}

We extract the dataset into a pandas DataFrame using extract_dataframe, which lets us work with selected fields in a familiar format. Then, using apply_functional_filters, we define and apply a set of logical filters, such as high-score selection, category-based filtering, token-count constraints, duplicate removal, and composite quality conditions, to generate multiple filtered views of the data, as the short check below shows.
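For a quick check outside the main pipeline (a minimal sketch that builds a DataFrame directly from the sample records rather than from the Lilac dataset), we can count how many rows each view keeps:

# Minimal sketch: build a DataFrame from the sample records and size each filtered view
df_demo = pd.DataFrame(create_sample_data())
views = apply_functional_filters(df_demo)
for name, view in views.items():
    print(f"{name}: {len(view)} rows")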

def analyze_data_quality(df: pd.DataFrame) -> Dict[str, Any]:
   """Analyze data quality metrics"""
   return {
       'total_records': len(df),
       'unique_texts': df['text'].nunique(),
       'duplicate_rate': 1 - (df['text'].nunique() / len(df)),
       'avg_score': df['score'].mean(),
       'category_distribution': df['category'].value_counts().to_dict(),
       'score_distribution': {
           'high': len(df[df['score'] >= 0.8]),
           'medium': len(df[(df['score'] >= 0.6) & (df['score'] < 0.8)]),
           'low': len(df[df['score'] < 0.6])
       },
       'token_stats': {
           'mean': df['tokens'].mean(),
           'min': df['tokens'].min(),
           'max': df['tokens'].max()
       }
   }


def create_data_transformations() -> Dict[str, callable]:
   """Create various data transformation functions"""
   return {
       'normalize_scores': lambda df: df.assign(norm_score=df['score'] / df['score'].max()),
       'add_length_category': lambda df: df.assign(
           length_cat=pd.cut(df['tokens'], bins=[0, 3, 5, float('inf')], labels=['short', 'medium', 'long'])
       ),
       'add_quality_tier': lambda df: df.assign(
           quality_tier=pd.cut(df['score'], bins=[0, 0.6, 0.8, 1.0], labels=['low', 'medium', 'high'])
       ),
       'add_category_rank': lambda df: df.assign(
           category_rank=df.groupby('category')['score'].rank(ascending=False)
       )
   }

To assess dataset quality, we use analyze_data_quality, which measures key metrics such as total and unique records, the duplicate rate, category breakdowns, and score/token distributions. This gives us a clear picture of the dataset's readiness and reliability. We also define transformation functions with create_data_transformations, enabling enrichments such as score normalization, token-length categorization, quality-tier assignment, and intra-category ranking; the binning behavior is illustrated in the sketch below.
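One detail worth making concrete is the pd.cut binning used by the quality-tier and length-category transforms: the bins are right-inclusive, so a score of exactly 0.8 lands in the 'medium' tier. A minimal sketch:

# Minimal sketch of the right-inclusive pd.cut binning behind add_quality_tier
scores = pd.Series([0.55, 0.6, 0.75, 0.8, 0.95])
tiers = pd.cut(scores, bins=[0, 0.6, 0.8, 1.0], labels=['low', 'medium', 'high'])
print(tiers.tolist())   # ['low', 'low', 'medium', 'medium', 'high']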

def apply_transformations(df: pd.DataFrame, transform_names: List[str]) -> pd.DataFrame:
   """Apply selected transformations"""
   transformations = create_data_transformations()
   selected_transforms = [transformations[name] for name in transform_names if name in transformations]

   return pipe(*selected_transforms)(df.copy()) if selected_transforms else df


def export_filtered_data(filtered_datasets: Dict[str, pd.DataFrame], output_dir: str) -> None:
   """Export filtered datasets to files"""
   Path(output_dir).mkdir(exist_ok=True)
  
   for name, df in filtered_datasets.items():
       output_file = Path(output_dir) / f"{name}_filtered.jsonl"
       with open(output_file, 'w') as f:
           for _, row in df.iterrows():
               f.write(json.dumps(row.to_dict()) + '\n')
       print(f"Exported {len(df)} records to {output_file}")

Then, via apply_transformations, we selectively apply the required transformations in a functional chain, ensuring our data is enriched and structured. Once filtered, we use export_filtered_data to write each filtered dataset variant to a separate .jsonl file. This lets us store subsets, such as high-quality entries or deduplicated records, in an organized format for downstream use, as sketched below.
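As a standalone sketch of these two steps (using only the helpers defined above; the export directory name is arbitrary), transformation and export can be combined like this:

# Illustrative: enrich the sample data and export it to an arbitrary directory
df_demo = pd.DataFrame(create_sample_data())
enriched = apply_transformations(df_demo, ['normalize_scores', 'add_quality_tier'])
export_filtered_data({'demo': enriched}, './demo_exports')   # writes demo_filtered.jsonl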

def main_analysis_pipeline():
   """Main analysis pipeline demonstrating functional approach"""
  
   print("🚀 Setting up Lilac project...")
   project_dir = setup_lilac_project("advanced_tutorial")
  
   print("📊 Creating sample dataset...")
   sample_data = create_sample_data()
   dataset = create_dataset_from_data("sample_data", sample_data)
  
   print("📋 Extracting data...")
   df = extract_dataframe(dataset, ['id', 'text', 'category', 'score', 'tokens'])
  
   print("🔍 Analyzing data quality...")
   quality_report = analyze_data_quality(df)
   print(f"Original data: {quality_report('total_records')} records")
   print(f"Duplicates: {quality_report('duplicate_rate'):.1%}")
   print(f"Average score: {quality_report('avg_score'):.2f}")
  
   print("🔄 Applying transformations...")
   transformed_df = apply_transformations(df, ['normalize_scores', 'add_length_category', 'add_quality_tier'])
  
   print("🎯 Applying filters...")
   filtered_datasets = apply_functional_filters(transformed_df)
  
   print("\n📈 Filter Results:")
   for name, filtered_df in filtered_datasets.items():
       print(f"  {name}: {len(filtered_df)} records")
  
   print("💾 Exporting filtered datasets...")
   export_filtered_data(filtered_datasets, f"{project_dir}/exports")
  
   print("\n🏆 Top Quality Records:")
   best_quality = filtered_datasets['combined_quality'].head(3)
   for _, row in best_quality.iterrows():
       print(f"  • {row('text')} (score: {row('score')}, category: {row('category')})")
  
   return {
       'original_data': df,
       'transformed_data': transformed_df,
       'filtered_data': filtered_datasets,
       'quality_report': quality_report
   }


if __name__ == "__main__":
   results = main_analysis_pipeline()
   print("\n✅ Analysis complete! Check the exports folder for filtered datasets.")

Finally, in main_analysis_pipeline, we execute the full workflow, from project setup through to data export, showing how Lilac, combined with functional programming, lets us build modular, scalable, and expressive pipelines. We even print the top quality records as a quick snapshot. This function represents our complete data curation loop, powered by Lilac.

In conclusion, readers will have gained a practical understanding of building a reproducible data pipeline that leverages Lilac's dataset abstractions and functional programming patterns for scalable, clean analysis. The pipeline covers all the critical stages, including dataset creation, transformation, filtering, quality analysis, and export, offering flexibility for both experimentation and deployment. It also shows how to attach meaningful metadata such as normalized scores, quality tiers, and length categories, which can support downstream tasks such as modeling or human review.





