Sai Teja Srivillibhutturu

GPU Systems & Inference Optimization | LLM Production Systems | Published ML Research

8+ Production Deployments
2 IEEE Published

Featured Technical Work

Attention Kernel Compiler Benchmark

Inference Optimization
12.3× Throughput gain
99.7% Memory reduction
573K→6.03M tok/s (4× A30)

Production-grade benchmark comparing custom Triton attention kernels against FlashAttention-2 (CUDA) on the vLLM and SGLang inference engines under real concurrent load (1–32 concurrent users). Measures TTFT, token throughput, P50/P95/P99 latency, and GPU SM utilization. Demonstrates 12.3× throughput gains (573K → 6.03M tok/s) and 99.7% memory reduction vs vanilla attention on NVIDIA L4. Real hardware measurements on 4× A30 with reproducible benchmarks.

Triton | CUDA | vLLM | SGLang | PagedAttention
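
The latency and throughput metrics named above can be summarized roughly as follows. This is a minimal sketch, assuming nearest-rank percentiles; the function names and the sample numbers are hypothetical and are not taken from the benchmark code.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer pct keeps pct * n exact, so ceil() does not trip on float error.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(latencies_ms, total_tokens, wall_seconds):
    """Collect the tail-latency and throughput numbers for one benchmark run."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "tok_per_s": total_tokens / wall_seconds,
    }

# Hypothetical run: 100 requests with latencies 1..100 ms, 50,000 tokens in 2 s.
latencies = [float(ms) for ms in range(1, 101)]
report = summarize(latencies, total_tokens=50_000, wall_seconds=2.0)
print(report)
```

Nearest-rank is only one percentile convention; interpolating estimators give slightly different tail values on small samples.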

Flash Attention + Kernels from Scratch

Deep Learning Systems
1.41× RMSNorm speedup
105–120% of cuBLAS (int8)
11/11 Tests passing

High-performance GPU kernels written from scratch in Triton and CUDA C++: (1) Flash Attention forward pass with online softmax recurrence and causal masking support, using O(N·D) HBM memory instead of O(N²). (2) int8 GEMM with fused dequantization using DP4A, achieving 105–120% of fp16 cuBLAS throughput. (3) Fused RMSNorm+Linear (1.41–1.48× speedup). Benchmarked on NVIDIA A30 with 11/11 correctness tests passing. The Flash Attention kernel beats PyTorch SDPA at sequence lengths ≥ 512.

Triton | CUDA C++ | PyTorch | pybind11 | GPU Optimization
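
The online softmax recurrence mentioned above can be illustrated on a single query row in plain Python. This is a sketch of the general technique, not the Triton or CUDA kernel itself: the function names and block size are hypothetical, and causal masking is left out for brevity.

```python
import math

def attention_row_online(q, keys, values, block=2):
    """One attention output row, streaming over K/V blocks with online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")               # running maximum of the scores seen so far
    l = 0.0                         # running softmax denominator
    acc = [0.0] * len(values[0])    # running unnormalized output
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        m_new = max(m, max(scores))
        # Rescale earlier partial sums to the new maximum (exp(-inf) is 0.0).
        correction = math.exp(m - m_new)
        l *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * x for a, x in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]

def attention_row_reference(q, keys, values):
    """Same row computed with the full score vector held in memory."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

# Toy inputs (hypothetical): 5 keys/values of dimension 2.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
online = attention_row_online(q, keys, values)
reference = attention_row_reference(q, keys, values)
```

Because only the running maximum, denominator, and accumulator are kept, the full row of scores never has to be materialized, which is where the memory saving comes from.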

SGLang Speculative Decoding

Inference Optimization
2.5–3.2× Measured speedup
22/22 Unit tests pass
Lossless Accept-reject verified

Implemented lossless speculative decoding for SGLang on a 4× NVIDIA A30 cluster. Custom draft/verify architecture with RadixAttention-safe provisional KV cache (insert/commit/evict operations). Accept-reject sampling verified mathematically lossless. Achieves 2.5–3.2× measured speedup on Llama-3-8B with a TinyLlama-1.1B draft model (K=4). Full unit test suite: 22/22 tests passing. Production-ready with real hardware measurements and reproducible benchmarks.

SGLang | Speculative Decoding | RadixAttention | CUDA | Inference
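
The accept-reject step can be sketched for a single drafted token. This assumes the standard speculative sampling rule (accept with probability min(1, p/q), otherwise resample from the normalized residual max(0, p − q)); the function names and toy distributions are hypothetical and do not come from the SGLang implementation.

```python
import random

def residual_distribution(p, q):
    """Normalized max(0, p - q): the distribution used after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    return [r / total for r in raw]

def verify_token(draft_token, p, q, rng):
    """Return (token, accepted) for one token drafted from q and checked against p."""
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    residual = residual_distribution(p, q)
    return rng.choices(range(len(p)), weights=residual)[0], False

def output_distribution(p, q):
    """Exact distribution of the emitted token; it should equal p (losslessness)."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]  # q_i * min(1, p_i / q_i)
    reject_mass = 1.0 - sum(accept)
    if reject_mass <= 1e-12:                        # p and q already agree
        return accept
    residual = residual_distribution(p, q)
    return [a + reject_mass * r for a, r in zip(accept, residual)]

# Toy target (p) and draft (q) distributions over a 3-token vocabulary.
p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
emitted = output_distribution(p, q)
token, accepted = verify_token(0, p, q, random.Random(0))
```

Here `output_distribution` checks the lossless property analytically for one position: the accepted mass plus the resampled mass reproduces the target distribution.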

CUDA Attention Kernel + AWS Neuron SDK

Deep Learning
5.64× Faster vs PyTorch (N=32)
3.7× Neuron speedup
45ms→12ms Latency (GPT-2)

Two production-grade systems: (1) Custom CUDA C++ scaled-dot-product attention kernel with tiled QKᵀ, numerically stable softmax, and pybind11 PyTorch binding, 5.64× faster than PyTorch at N=32, correctness verified at max_diff < 1e-7. (2) GPT-2 ported to AWS Inferentia via neuronx-cc 3-step pipeline achieving 3.7× speedup (45ms to 12ms, 1,800 to 6,700 tokens/sec). Deployed as FastAPI REST endpoint on HuggingFace Spaces.

CUDA C++ | PyTorch | AWS Inferentia | pybind11 | FastAPI
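
The tiled QKᵀ computation and the max_diff style correctness check can be illustrated in plain Python. This is a sketch of the loop structure only, with a hypothetical tile size and toy matrices; the real kernel stages tiles in GPU shared memory, which is not modeled here.

```python
def qkt_naive(Q, K):
    """Reference scores: S[i][j] = dot(Q[i], K[j])."""
    return [[sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]

def qkt_tiled(Q, K, tile=2):
    """Same product, accumulated tile by tile over rows, columns, and depth."""
    n, m, d = len(Q), len(K), len(Q[0])
    S = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, d, tile):
                # On a GPU, these Q and K tiles would be loaded into shared memory once.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        partial = 0.0
                        for k in range(k0, min(k0 + tile, d)):
                            partial += Q[i][k] * K[j][k]
                        S[i][j] += partial
    return S

def max_abs_diff(A, B):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(a - b) for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

# Toy inputs (hypothetical): 3 queries and 2 keys of dimension 3.
Q = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
K = [[1.0, 0.0, 1.0], [0.5, 0.5, 0.5]]
diff = max_abs_diff(qkt_tiled(Q, K), qkt_naive(Q, K))
```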

Production LLM Fine-Tuning: Qwen-7B SFT

LLM / GenAI
17% Loss reduction
0.855 BERTScore
30min Training (T4)

Supervised fine-tuning of Qwen2.5-7B using LoRA (r=8, alpha=16) on UltraFeedback, training only 0.5% of parameters (35M of 7B) with QLoRA 4-bit quantization and FP16 mixed precision. Achieved 17% training loss reduction (1.412 to 1.176) and 0.855 BERTScore in 30 minutes on a T4 GPU. Model merged and deployed to HuggingFace Hub.

PyTorch | LoRA / PEFT | QLoRA | TRL | HuggingFace
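
The LoRA update and the final merge step can be sketched with small matrices. This assumes the usual LoRA formulation, W' = W + (alpha / r) · B · A; the helper names, the 4096-wide layer, and the toy weights are hypothetical and are not taken from the training code.

```python
def matmul(X, Y):
    """Plain nested-list matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_param_count(d_in, d_out, r):
    """Trainable parameters for one adapted layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

def merge_lora(W, A, B, alpha, r):
    """Fold the low-rank update into the frozen weight: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy 2x2 layer with a rank-1 adapter (hypothetical values).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[0.5], [-1.0]]
merged = merge_lora(W, A, B, alpha=2, r=1)

# With r=8 as above, a hypothetical 4096 x 4096 projection adds this many weights.
per_layer = lora_param_count(4096, 4096, r=8)

# Relative training-loss reduction from the figures quoted above.
loss_reduction = (1.412 - 1.176) / 1.412
```

After merging, the adapter adds no inference cost because the model is back to a single weight matrix per layer.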

Attention Mechanism Optimization Suite

Deep Learning

Benchmarking framework comparing 4 attention implementations on NVIDIA L4 (seq len 1024, batch 32): FlashAttention-2 achieves 12.3× throughput (573K to 6.03M tok/s) and 99.7% memory reduction (12,582 MB to 38 MB) vs vanilla. Includes batch-size auto-tuner using binary search and ONNX/TensorRT export benchmarks. Key finding: algorithm-level optimization (FlashAttention) outperforms hardware-level optimization (TensorRT) by 6× for attention operations.

PyTorch | FlashAttention-2 | xFormers | ONNX Runtime | CUDA
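
A batch-size auto-tuner based on binary search might look like the sketch below. It assumes that fitting in memory is monotonic in batch size; `max_batch_size`, the `fits` callback, and the memory figures are hypothetical stand-ins for a real probe that would run a forward pass and catch an out-of-memory error.

```python
def max_batch_size(fits, low=1, high=1024):
    """Largest batch size in [low, high] for which fits(batch) is True, or 0 if none."""
    if not fits(low):
        return 0
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always makes progress
        if fits(mid):
            low = mid                 # mid works, search the upper half
        else:
            high = mid - 1            # mid fails, search the lower half
    return low

# Hypothetical memory model: 100 MB per sample against a 24,000 MB budget.
def fits_in_memory(batch):
    return batch * 100 <= 24_000

best = max_batch_size(fits_in_memory)
print(best)
```

Binary search needs about log2(high) probes, which matters when each probe is a full forward pass.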

Advanced AI Agent System

GenAI

Multi-strategy reasoning system implementing 4 research papers: Chain-of-Thought with self-consistency voting (3 independent paths), Tree-of-Thoughts with beam search (width=3, depth=3), ReAct with real-time web search via Tavily API and ChromaDB vector memory, and Multi-Agent collaboration (Planner, Worker, Critic). LLM-based auto-classifier routes each query to the optimal strategy. Rate-limited (10/min, 100/day), with real-time streaming. Built with Groq LLM, deployed on HuggingFace Spaces.

Python | Groq | Tavily | ChromaDB | Gradio
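
Self-consistency voting over independent reasoning paths reduces to a majority vote on the final answers. The sketch below assumes answers are compared after simple whitespace and case normalization; the function name and sample answers are hypothetical.

```python
from collections import Counter

def self_consistency_vote(answers):
    """Return the most common normalized answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)

# Three independent Chain-of-Thought paths (hypothetical outputs).
answer, agreement = self_consistency_vote(["42", " 42 ", "41"])
```

The agreement ratio is a cheap confidence signal: a 1/3 split across three paths suggests the query may deserve a different strategy.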

Experience

Qure.ai Technologies • Healthcare AI

AI Solutions Engineer Intern

Mar 2026 - May 2026 • New York, NY (Remote)

End-to-end LLM deployment across 6+ live hospital systems (MedStar, Mount Sinai, UFL). Implemented clinical LLM post-training pipelines (SFT → DPO) for medical question answering on real FHIR/EPIC data. Deployed and maintained inference endpoints on AWS with real-time latency monitoring. Configured domain-specific prompting for radiologist report parsing and automated EHR data extraction. HIPAA-compliant: built DPIAs, security assessments, and bidirectional EPIC integration documentation.

Python | LLMs | EPIC/FHIR | AWS | Clinical AI

ReplyQuickAI (DentalScan) • Healthcare ML

Machine Learning Engineer Intern

Dec 2025 - Feb 2026 • United States

Built computer vision pipelines for intra-oral image analysis across 6+ clinical categories (gingivitis staging, plaque detection, recession classification) on a 50K+ labeled dataset. Engineered automated retraining pipeline on AWS SageMaker incorporating dentist-corrected labels, improving model accuracy iteratively across production inference endpoints.

PyTorch | CNNs | AWS SageMaker | Computer Vision | Python

The University of Texas at Arlington

Graduate Research Assistant

Jun 2025 - May 2026 • Arlington, TX

Built TopGPT, a full-stack LLM application fine-tuned on 3+ textbooks with RAG over 1,000+ research paper chunks stored in Pinecone on AWS. Led CTMap, an LLM fine-tuning pipeline for mmWave 6G path planning, resulting in two accepted IEEE publications: IEEE ICC 2026 (conference) and IEEE OJ-COMS 2026 (journal). Built SFT pipelines encoding Sionna channel maps and OpenStreetMap graphs into transformer-readable formats.

PyTorch | RAG | Pinecone | AWS | LLM Fine-tuning | Sionna

The University of Texas at Arlington

Graduate Teaching Assistant

Aug 2024 - May 2025 • Arlington, TX

Supported graduate courses in Numerical Methods for 50+ students, assisting with algorithmic problem solving, optimization, and computational modeling. Concurrently developed CTMap, an LLM-enabled 6G path planning system accepted at IEEE ICC 2026 and extended to a journal paper at IEEE OJ-COMS 2026, fine-tuning LLMs on Dijkstra-generated coordinate paths applied to Sionna wireless network simulation outputs.

Python | LLM Fine-tuning | OpenStreetMap | Sionna 6G | Dijkstra

Tata Consultancy Services

Senior Software Engineer → Software Engineer

Jun 2019 - May 2023 • 4 years • Chennai

Designed and owned Java-based distributed data processing services handling millions of records daily across production systems serving 10+ enterprise clients. Led system design and architecture reviews for data-intensive microservices. Built fault-tolerant backend pipelines with high availability and reduced processing latency by 40% through service refactoring.

Java | Spring Boot | Microservices | SQL | REST APIs

Research & Publications

ITSC 2026 • May 2026

Efficiency–Equity Trade-Off Characterization in Hierarchical Traffic Congestion Pricing

Sai Teja Srivillibhutturu et al. • IEEE Intelligent Transportation Systems Conference 2026

Structured efficiency–equity trade-off analysis within a hierarchical Stackelberg congestion pricing framework. Demonstrates that modest incorporation of equity considerations can meaningfully reduce travel cost disparities without significant congestion degradation, providing quantitative guidance for equity-aware congestion pricing design.

IEEE OJ-COMS 2026

Digital Twin–Guided AI Path Planning for Connectivity-Aware Mobility

Sai Teja Srivillibhutturu et al. • IEEE Open Journal of the Communications Society 2026

Comprehensive digital twin–guided AI framework for connectivity-aware mobility in 6G networks. Uses Sionna ray-tracing simulations to train AI agents that predict link quality along candidate routes and select paths maximizing sustained mmWave wireless connectivity.

IEEE ICC 2026

CTMap: LLM-Enabled Connectivity-Aware Path Planning for mmWave 6G Networks

Sai Teja Srivillibhutturu et al. • IEEE International Conference on Communications 2026

Fine-tuned LLMs on Dijkstra-generated coordinate paths from OpenStreetMap, applied to Sionna 6G wireless simulation outputs to produce real-time optimal paths for mmWave connectivity-aware routing.
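
Generating a Dijkstra coordinate path and serializing it as text for fine-tuning could look roughly like this. It is a sketch under stated assumptions: the graph, coordinates, edge weights, and the text format are hypothetical, and nothing here reflects the actual CTMap data pipeline.

```python
import heapq

def dijkstra_path(graph, source, target):
    """Shortest path on a weighted graph given as {node: [(neighbor, weight), ...]}."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue                  # stale heap entry
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[target]

def path_to_text(path, coords):
    """Serialize a node path as a coordinate string a language model can read."""
    return " -> ".join(f"({coords[n][0]:.4f}, {coords[n][1]:.4f})" for n in path)

# Hypothetical street graph with (lat, lon) coordinates per intersection.
graph = {"A": [("B", 1.0), ("C", 5.0)], "B": [("C", 2.0)], "C": []}
coords = {"A": (32.7300, -97.1100), "B": (32.7310, -97.1090), "C": (32.7320, -97.1080)}
path, cost = dijkstra_path(graph, "A", "C")
sample = path_to_text(path, coords)
```

In a connectivity-aware setting the edge weights would presumably combine distance with a link-quality term from the wireless simulation, but that weighting is not specified above.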

Let's Connect

Open to LLM Engineering, Healthcare AI, ML Research, and Deep Learning roles, applying language models to real clinical and physical systems.
