Hitesh Sahu

AI & Machine Learning

Model Gym

Personal / Open Source

Ongoing

Creator / Maintainer

AI Infrastructure & LLM

Tech Stack
C++17
TensorRT
ONNX Runtime
PyTorch
bitsandbytes
auto-gptq
Quantization (INT8/FP8/GPTQ)

Summary

End-to-end LLM inference and quantization pipeline benchmarking latency, throughput, and cost across CPU, ONNX Runtime, and TensorRT backends.


What I Built

Project Overview

Model Gym is an open-source AI infrastructure project designed to evaluate, optimize, and benchmark Large Language Models across different inference runtimes, quantization strategies, and hardware configurations.

The project provides a reproducible pipeline that takes models from sources and formats such as Hugging Face, Ollama, NVIDIA NGC, GGUF, and SafeTensors, converts them into deployable formats, applies multiple quantization techniques, and benchmarks inference performance on CPU and GPU.

The primary goal is to help engineers make data-driven decisions when deploying LLMs by comparing latency, throughput, memory usage, and operational cost across different optimization strategies.


Key Features

Universal Model Import Pipeline

Imports models from Hugging Face, Ollama, and NVIDIA NGC containers, as well as from GGUF and SafeTensors files.

Automated Model Conversion

Converts foundation models into ONNX and deployment-ready formats suitable for optimized inference runtimes.

Multi-Format Quantization

Implements multiple quantization strategies to reduce memory footprint and inference cost while maintaining acceptable model quality.

Supported formats include:

  • INT8 Symmetric Quantization
  • INT8 Asymmetric Quantization
  • FP8 (E4M3)
  • FP8 (E5M2)
  • GPTQ W4A16

High-Performance Inference Engine

Custom C++17 inference engine supporting:

  • Native CPU execution
  • ONNX Runtime
  • NVIDIA TensorRT

Benchmarking & Evaluation Platform

Automated benchmarking framework that measures:

  • Latency
  • Throughput
  • Memory Consumption
  • GPU Utilization
  • Cost per Million Tokens
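
As a rough illustration of how the latency, throughput, and cost metrics above relate to each other, the sketch below derives them from a list of per-request samples. The names (`RunSample`, `summarize`, `hourly_cost_usd`) are hypothetical and not taken from Model Gym. It assumes requests run one after another on a single instance billed by the hour, so total time is the sum of the latencies.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunSample:
    latency_s: float     # wall-clock time for one request, in seconds
    output_tokens: int   # tokens generated by that request


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile on an already sorted list."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(samples: List[RunSample], hourly_cost_usd: float) -> Dict[str, float]:
    latencies = sorted(s.latency_s for s in samples)
    total_time_s = sum(latencies)  # assumes sequential, non-overlapping requests
    total_tokens = sum(s.output_tokens for s in samples)
    tokens_per_s = total_tokens / total_time_s
    # Cost of one second of compute divided by tokens produced in that second.
    cost_per_token = (hourly_cost_usd / 3600.0) / tokens_per_s
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "tokens_per_s": tokens_per_s,
        "cost_per_million_tokens_usd": cost_per_token * 1_000_000,
    }


if __name__ == "__main__":
    demo = [RunSample(0.5, 50) for _ in range(4)]
    print(summarize(demo, hourly_cost_usd=3.6))
```

With batching or concurrent requests, throughput would instead be measured over the wall-clock duration of the whole run, so the cost figure would change accordingly.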

My Contributions

  • Designed the overall architecture and benchmarking methodology.
  • Built model import and conversion pipelines supporting multiple model ecosystems.
  • Implemented quantization workflows for INT8, FP8, and GPTQ formats.
  • Developed a C++17 inference engine supporting multiple execution backends.
  • Integrated ONNX Runtime and TensorRT acceleration paths.
  • Created automated benchmark suites using Google Benchmark.
  • Built HTML-based dashboards for performance analysis and comparison.
  • Developed reproducible evaluation workflows for comparing model optimization strategies.
  • Maintained documentation, release automation, and open-source project governance.

Technical Highlights

End-to-End LLM Optimization Pipeline

Automates the entire workflow from model acquisition to deployment-ready artifacts and performance evaluation.

Cross-Backend Benchmarking

Enables direct comparison between CPU execution, ONNX Runtime acceleration, and TensorRT-optimized inference.
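
A like-for-like comparison depends on every backend seeing the same prompts under the same timing procedure. The sketch below shows one possible shape for such a harness. It is an assumption for illustration, not Model Gym's code: a backend is modelled as any callable that maps a prompt to a list of tokens, and `fake_backend` is a stand-in for a real CPU, ONNX Runtime, or TensorRT path.

```python
import time
from typing import Callable, Dict, List

# Hypothetical interface: a backend turns a prompt into generated tokens.
Backend = Callable[[str], List[str]]


def time_backend(backend: Backend, prompts: List[str], warmup: int = 1) -> Dict[str, float]:
    """Run every prompt once and record the mean latency and token count."""
    # Warm-up calls are excluded from the measurement.
    for prompt in prompts[:warmup]:
        backend(prompt)
    latencies: List[float] = []
    total_tokens = 0
    for prompt in prompts:
        start = time.perf_counter()
        output = backend(prompt)
        latencies.append(time.perf_counter() - start)
        total_tokens += len(output)
    return {
        "mean_latency_s": sum(latencies) / len(latencies),
        "total_tokens": total_tokens,
    }


def compare(backends: Dict[str, Backend], prompts: List[str]) -> Dict[str, Dict[str, float]]:
    """Apply the identical prompt set and timing procedure to each backend."""
    return {name: time_backend(fn, prompts) for name, fn in backends.items()}


def fake_backend(prompt: str) -> List[str]:
    # Stand-in that "generates" the whitespace-split words of the prompt.
    return prompt.split()


if __name__ == "__main__":
    print(compare({"fake": fake_backend}, ["hello world", "one two three"]))
```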

Quantization Research Platform

Provides a controlled environment for evaluating the trade-offs between model quality, latency, memory consumption, and infrastructure cost.
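
The memory side of this trade-off can be approximated with simple arithmetic: weight storage is roughly the parameter count times the bits per weight. The sketch below is a back-of-the-envelope estimate under stated assumptions. It uses a 16-bit baseline (an assumption, not something stated above) and ignores quantization scales, zero points, the KV cache, and activations, so real footprints will be somewhat larger.

```python
# Bits used to store each weight; the FP16 baseline is an assumed reference point.
BITS_PER_WEIGHT = {
    "FP16": 16,
    "INT8": 8,
    "FP8": 8,
    "GPTQ W4A16": 4,  # 4-bit weights, 16-bit activations
}


def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB, excluding all metadata overhead."""
    return num_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7-billion-parameter model
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name:>10}: {weight_memory_gib(params, bits):.2f} GiB")
```

For the hypothetical 7-billion-parameter model, this gives roughly 13 GiB at 16 bits and roughly 3.3 GiB at 4 bits, before overheads.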

Production Deployment Readiness

Generates artifacts and metrics that help teams select deployment strategies for production AI systems.

Performance Engineering

Built low-level inference capabilities in C++ to maximize execution efficiency and reduce runtime overhead.


Challenges & Solutions

Challenge

Modern LLM deployments involve dozens of optimization choices including model format, quantization strategy, runtime backend, and hardware configuration. Comparing these options manually is time-consuming and often produces inconsistent results.

Solution

Created a unified benchmarking platform capable of automating model conversion, quantization, inference testing, and performance reporting across multiple deployment scenarios.
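
One way to picture the automation is as a sweep over a configuration matrix. The sketch below enumerates every backend and quantization pairing named on this page. The identifier strings and the `skip` parameter are hypothetical, and which combinations are actually valid depends on the backend and hardware, so unsupported pairs are left for the caller to exclude.

```python
from itertools import product
from typing import Dict, FrozenSet, Iterator, Tuple

BACKENDS = ["cpu", "onnxruntime", "tensorrt"]
QUANT_FORMATS = [
    "int8_symmetric",
    "int8_asymmetric",
    "fp8_e4m3",
    "fp8_e5m2",
    "gptq_w4a16",
]


def enumerate_runs(
    skip: FrozenSet[Tuple[str, str]] = frozenset(),
) -> Iterator[Dict[str, str]]:
    """Yield one run configuration per (backend, quantization) pair."""
    for backend, quant in product(BACKENDS, QUANT_FORMATS):
        if (backend, quant) in skip:
            continue  # caller-supplied list of unsupported combinations
        yield {"backend": backend, "quantization": quant}


if __name__ == "__main__":
    for run in enumerate_runs():
        print(run)
```

Adding a further axis, such as model or batch size, multiplies the number of runs, which is why driving the sweep from one definition helps keep results consistent.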

Outcome

Model Gym enables engineers to evaluate deployment trade-offs quickly and objectively, reducing experimentation time while improving infrastructure and cost optimization decisions.


Technology Stack

AI Frameworks: PyTorch, Hugging Face Transformers

Inference: TensorRT, ONNX Runtime

Optimization: GPTQ, bitsandbytes, INT8 Quantization, FP8 Quantization

Programming Languages: C++17, Python

Model Formats: SafeTensors, GGUF, ONNX

Benchmarking: Google Benchmark

Visualization: HTML Dashboards, Performance Reports

Domain: LLM Inference, Model Optimization, AI Infrastructure, Performance Engineering

Made with NextJS by Hitesh Sahu | © 2026 All rights reserved.