Model Gym
Personal / Open Source
Ongoing
Creator / Maintainer
AI Infrastructure & LLM
Summary
End-to-end LLM inference and quantization pipeline that benchmarks latency, throughput, and cost across native CPU, ONNX Runtime, and TensorRT backends.
What I Built
Project Overview
Model Gym is an open-source AI infrastructure project designed to evaluate, optimize, and benchmark Large Language Models across different inference runtimes, quantization strategies, and hardware configurations.
The project provides a reproducible pipeline that imports models from sources such as Hugging Face, Ollama, and NVIDIA NGC, including models stored as GGUF or SafeTensors files. It converts them into deployable formats, applies multiple quantization techniques, and benchmarks inference performance across CPU and GPU execution environments.
The primary goal is to help engineers make data-driven decisions when deploying LLMs by comparing latency, throughput, memory usage, and operational cost across different optimization strategies.
Key Features
Universal Model Import Pipeline
Supports importing models from Hugging Face, Ollama, and NVIDIA NGC containers, as well as models distributed in GGUF and SafeTensors formats.
Automated Model Conversion
Converts foundation models into ONNX and other deployment-ready formats suitable for optimized inference runtimes.
Multi-Format Quantization
Implements multiple quantization strategies to reduce memory footprint and inference cost while maintaining acceptable model quality.
Supported formats include:
- INT8 Symmetric Quantization
- INT8 Asymmetric Quantization
- FP8 (E4M3)
- FP8 (E5M2)
- GPTQ W4A16
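To illustrate the difference between the first two formats in the list, here is a minimal pure-Python sketch of per-tensor INT8 symmetric and asymmetric quantization. It is not the project's implementation: the function names, the per-tensor granularity, and the [-127, 127] and [0, 255] integer ranges are assumptions chosen for illustration.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric INT8: zero-point is fixed at 0, integers lie in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Python's round() rounds exact halves to the nearest even integer.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_symmetric(q: List[int], scale: float) -> List[float]:
    return [x * scale for x in q]


def quantize_asymmetric(values: List[float]) -> Tuple[List[int], float, int]:
    """Asymmetric 8-bit: integers lie in [0, 255], shifted by a zero-point."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_asymmetric(q: List[int], scale: float, zero_point: int) -> List[float]:
    return [(x - zero_point) * scale for x in q]


if __name__ == "__main__":
    weights = [-1.0, 0.0, 0.5, 2.0]  # hypothetical, skewed toward positive values
    _, sym_scale = quantize_symmetric(weights)
    _, asym_scale, zp = quantize_asymmetric(weights)
    print(f"symmetric scale={sym_scale:.5f}")
    print(f"asymmetric scale={asym_scale:.5f} zero_point={zp}")
```

For skewed data like this, the asymmetric variant spends its 256 levels on the actual [min, max] range and so gets a smaller step size, at the cost of carrying a zero-point through the arithmetic.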
High-Performance Inference Engine
Custom C++17 inference engine supporting:
- Native CPU execution
- ONNX Runtime
- NVIDIA TensorRT
Benchmarking & Evaluation Platform
Automated benchmarking framework that measures:
- Latency
- Throughput
- Memory Consumption
- GPU Utilization
- Cost per Million Tokens
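As a rough sketch of how three of these metrics relate, the snippet below derives latency percentiles, throughput, and cost per million tokens from a list of timed runs. The source does not state how Model Gym computes cost, so the formula here (hourly instance price divided by sustained tokens per second) is an assumption, as are the RunSample type, the nearest-rank percentile method, and the example price.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunSample:
    latency_s: float        # wall-clock time for one generation request
    tokens_generated: int   # tokens produced by that request


def nearest_rank_percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile over an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(samples: List[RunSample], hourly_cost_usd: float) -> Dict[str, float]:
    latencies = sorted(s.latency_s for s in samples)
    total_tokens = sum(s.tokens_generated for s in samples)
    # Assumes requests ran one after another, so total time is the sum of latencies.
    tokens_per_s = total_tokens / sum(latencies)
    cost_per_token = (hourly_cost_usd / 3600.0) / tokens_per_s
    return {
        "p50_latency_s": nearest_rank_percentile(latencies, 50),
        "p95_latency_s": nearest_rank_percentile(latencies, 95),
        "tokens_per_s": tokens_per_s,
        "cost_per_million_tokens_usd": cost_per_token * 1_000_000,
    }


if __name__ == "__main__":
    # Hypothetical measurements and a hypothetical $3.60/hour instance price.
    runs = [RunSample(0.5, 100) for _ in range(4)]
    for name, value in summarize(runs, hourly_cost_usd=3.60).items():
        print(f"{name}: {value:.4f}")
```

Under this formula, cost per million tokens is inversely proportional to throughput, so a quantization format that doubles tokens per second on the same hardware halves the cost figure.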
My Contributions
- Designed the overall architecture and benchmarking methodology.
- Built model import and conversion pipelines supporting multiple model ecosystems.
- Implemented quantization workflows for INT8, FP8, and GPTQ formats.
- Developed a C++17 inference engine supporting multiple execution backends.
- Integrated ONNX Runtime and TensorRT acceleration paths.
- Created automated benchmark suites using Google Benchmark.
- Built HTML-based dashboards for performance analysis and comparison.
- Developed reproducible evaluation workflows for comparing model optimization strategies.
- Maintained documentation, release automation, and open-source project governance.
Technical Highlights
End-to-End LLM Optimization Pipeline
Automates the entire workflow from model acquisition to deployment-ready artifacts and performance evaluation.
Cross-Backend Benchmarking
Enables direct comparison between CPU execution, ONNX Runtime acceleration, and TensorRT-optimized inference.
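The idea of a like-for-like comparison can be sketched with a small timing harness. The project itself uses Google Benchmark with a C++17 engine, so this Python version is only a stand-in: the Backend callable type, the warmup and repeat counts, and the best-of-N policy are all assumptions, and the backends passed in would be thin wrappers around real runtimes.

```python
import time
from typing import Callable, Dict, List

# A backend is modeled as any callable that maps a prompt to generated text.
Backend = Callable[[str], str]


def benchmark_backend(run: Backend, prompts: List[str],
                      warmup: int = 2, repeats: int = 5) -> float:
    """Return the best observed mean latency per prompt, in seconds."""
    for _ in range(warmup):
        run(prompts[0])  # warm caches and lazy initialization before timing
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for prompt in prompts:
            run(prompt)
        timings.append(time.perf_counter() - start)
    return min(timings) / len(prompts)


def compare_backends(backends: Dict[str, Backend],
                     prompts: List[str]) -> Dict[str, float]:
    """Time every backend on the same prompts with the same settings."""
    return {name: benchmark_backend(fn, prompts) for name, fn in backends.items()}


if __name__ == "__main__":
    # Placeholder backends; real ones would call into each runtime.
    fake_backends = {
        "native-cpu": lambda p: p.upper(),
        "onnxruntime": lambda p: p[::-1],
    }
    results = compare_backends(fake_backends, ["hello", "world"])
    for name, latency in results.items():
        print(f"{name}: {latency * 1e6:.2f} us per prompt")
```

Holding the prompts, warmup, and repeat count fixed across backends is what makes the resulting numbers comparable.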
Quantization Research Platform
Provides a controlled environment for evaluating the trade-offs between model quality, latency, memory consumption, and infrastructure cost.
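The memory side of that trade-off can be estimated with simple arithmetic: weight storage is roughly parameter count times bits per weight. The sketch below assumes a hypothetical 7-billion-parameter model and an FP16 baseline (neither appears in the source), and it ignores quantization metadata such as per-group scales, as well as activations and the KV cache, so real footprints will be somewhat larger.

```python
# Bits used to store each weight. In W4A16, weights are 4-bit while
# activations stay 16-bit, so only the 4 bits count toward weight storage.
BITS_PER_WEIGHT = {
    "fp16": 16,        # assumed unquantized baseline
    "int8": 8,
    "fp8-e4m3": 8,
    "fp8-e5m2": 8,
    "gptq-w4a16": 4,
}


def weight_footprint_gb(num_params: int, fmt: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical model size
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt:>11}: {weight_footprint_gb(params, fmt):.1f} GB")
```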
Production Deployment Readiness
Generates artifacts and metrics that help teams select deployment strategies for production AI systems.
Performance Engineering
Implemented the inference path in C++17 to keep runtime overhead low and execution efficient.
Challenges & Solutions
Challenge
Modern LLM deployments involve dozens of possible combinations of model format, quantization strategy, runtime backend, and hardware configuration. Comparing these options manually is time-consuming and often produces inconsistent results.
Solution
Created a unified benchmarking platform capable of automating model conversion, quantization, inference testing, and performance reporting across multiple deployment scenarios.
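One way to picture the automation is as an enumeration of a configuration matrix, sketched below. The quantization and backend names come from the feature lists in this document, but the BenchmarkConfig type, the hardware labels, and the validity filter are assumptions; the source does not say which combinations the project actually runs.

```python
from itertools import product
from typing import List, NamedTuple


class BenchmarkConfig(NamedTuple):
    quantization: str
    backend: str
    hardware: str


QUANTIZATIONS = ["int8-symmetric", "int8-asymmetric",
                 "fp8-e4m3", "fp8-e5m2", "gptq-w4a16"]
BACKENDS = ["native-cpu", "onnxruntime", "tensorrt"]
HARDWARE = ["cpu", "gpu"]


def is_runnable(cfg: BenchmarkConfig) -> bool:
    """Illustrative filter: drop backend/hardware pairs that cannot run."""
    if cfg.backend == "tensorrt":
        return cfg.hardware == "gpu"   # TensorRT targets NVIDIA GPUs
    if cfg.backend == "native-cpu":
        return cfg.hardware == "cpu"
    return True                         # ONNX Runtime can target either


def build_matrix() -> List[BenchmarkConfig]:
    """Enumerate every combination, keeping only the runnable ones."""
    candidates = (BenchmarkConfig(q, b, h)
                  for q, b, h in product(QUANTIZATIONS, BACKENDS, HARDWARE))
    return [cfg for cfg in candidates if is_runnable(cfg)]


if __name__ == "__main__":
    matrix = build_matrix()
    print(f"{len(matrix)} configurations to benchmark")
    for cfg in matrix[:3]:
        print(cfg)
```

Generating the matrix up front means every configuration goes through the same conversion, quantization, and measurement steps, which is what keeps the results consistent.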
Outcome
Model Gym enables engineers to evaluate deployment trade-offs quickly and objectively, reducing experimentation time while improving infrastructure and cost optimization decisions.
Technology Stack
AI Frameworks: PyTorch, Hugging Face Transformers
Inference: TensorRT, ONNX Runtime
Optimization: GPTQ, bitsandbytes, INT8 Quantization, FP8 Quantization
Programming Languages: C++17, Python
Model Formats: SafeTensors, GGUF, ONNX
Benchmarking: Google Benchmark
Visualization: HTML Dashboards, Performance Reports
Domain: LLM Inference, Model Optimization, AI Infrastructure, Performance Engineering
