Sai Teja Srivillibhutturu | ML & Deep Learning Engineer

Featured Technical Work

Attention Kernel Compiler Benchmark

Inference Optimization

12.3× Throughput gain

99.7% Memory reduction

573K→6.03M tok/s (4× A30)

Production-grade benchmark comparing custom Triton attention kernels against FlashAttention-2 (CUDA) on both the vLLM and SGLang inference engines under real concurrent load (1–32 concurrent users). Measures time to first token (TTFT), token throughput, P50/P95/P99 latency, and GPU SM utilization. Demonstrates a 12.3× throughput gain (573K → 6.03M tok/s) and a 99.7% memory reduction versus vanilla attention on NVIDIA L4. Real hardware measurements on 4× A30, with reproducible benchmarks.
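As an illustration of the metrics listed above, here is a minimal sketch of how TTFT, latency percentiles, and aggregate throughput could be derived from per-request timestamps. The record layout `(start_s, first_token_s, end_s, tokens)` and the function names are assumptions made for this example, not the benchmark's actual code, and the nearest-rank percentile is one of several common definitions.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point edge cases.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(requests):
    """requests: list of (start_s, first_token_s, end_s, tokens) tuples (hypothetical layout)."""
    ttft = [first - start for start, first, _, _ in requests]
    latency = [end - start for start, _, end, _ in requests]
    # Throughput is measured over the wall-clock window covering all requests.
    wall = max(r[2] for r in requests) - min(r[0] for r in requests)
    total_tokens = sum(r[3] for r in requests)
    return {
        "ttft_p50": percentile(ttft, 50),
        "latency_p50": percentile(latency, 50),
        "latency_p95": percentile(latency, 95),
        "latency_p99": percentile(latency, 99),
        "throughput_tok_s": total_tokens / wall,
    }
```

Measuring throughput over the shared wall-clock window, rather than summing per-request rates, keeps the number honest when requests overlap under concurrent load.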

Triton • CUDA • vLLM • SGLang • PagedAttention

GitHub

Flash Attention + Kernels from Scratch

Deep Learning Systems

1.41× RMSNorm speedup

105–120% of cuBLAS (int8)

11/11 Tests passing

High-performance GPU kernels written from scratch in Triton and CUDA C++: (1) Flash Attention forward pass using the online softmax recurrence, with O(N·D) HBM usage instead of O(N²) and support for causal masking. (2) int8 GEMM with fused dequantization using DP4A, achieving 105–120% of fp16 cuBLAS throughput. (3) Fused RMSNorm+Linear (1.41–1.48× speedup). Benchmarked on NVIDIA A30 with 11/11 correctness tests passing. Outperforms torch SDPA at sequence lengths ≥ 512.
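The online softmax recurrence mentioned in item (1) can be sketched in plain Python. This is an illustrative CPU reference, not the Triton kernel itself: the function names and the `block` parameter are assumptions for this example, and causal masking is omitted for brevity. The point is that each query keeps only a running max, a running sum, and an accumulator, so the full N×N score matrix is never stored.

```python
import math


def naive_attention(q, k, v):
    """Reference: computes the full row of scores for each query before normalizing."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def online_softmax_attention(q, k, v, block=2):
    """Processes keys/values block by block with a running max and running sum."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dim_v = len(v[0])
    out = []
    for qi in q:
        m = -math.inf        # running max of all scores seen so far
        l = 0.0              # running sum of exp(score - m)
        acc = [0.0] * dim_v  # running weighted sum of value rows
        for start in range(0, len(k), block):
            k_blk = k[start:start + block]
            v_blk = v[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            m_new = max(m, max(scores))
            correction = math.exp(m - m_new)  # rescales old state to the new max
            p = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(p)
            acc = [a * correction + sum(pj * vj[d] for pj, vj in zip(p, v_blk))
                   for d, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

Both functions should agree up to floating-point rounding for any block size, which is the property a correctness test for such a kernel would typically check.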

Triton • CUDA C++ • PyTorch • pybind11 • GPU Optimization

GitHub

SGLang Speculative Decoding

Inference Optimization

2.5–3.2× Measured speedup

22/22 Unit tests pass

Lossless Accept-reject verified

Implemented lossless speculative decoding for SGLang on a 4× NVIDIA A30 cluster. Custom draft/verify architecture with a RadixAttention-safe provisional KV cache (insert/commit/evict operations). Accept-reject sampling is verified to be mathematically lossless. Achieves a 2.5–3.2× measured speedup on Llama-3-8B with a TinyLlama-1.1B draft model (K=4). Full unit test suite: 22/22 tests passing. Production-ready, with real hardware measurements and reproducible benchmarks.
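The accept-reject step can be illustrated with the standard speculative sampling rule: accept a drafted token with probability min(1, p/q), and on rejection resample from the normalized residual max(0, p − q). The sketch below assumes full probability vectors are available for both models; `verify_draft` and its argument layout are hypothetical names for this example and do not reflect the project's SGLang integration or its KV cache handling.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff]


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """
    draft_tokens:  K token ids proposed by the draft model
    draft_probs:   K draft distributions (one list over the vocabulary per position)
    target_probs:  K + 1 target distributions (the last one is for the bonus token)
    Returns the tokens emitted for this verification step.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # Rejected: resample from the residual and stop at this position.
        residual = residual_distribution(p, q)
        emitted.append(rng.choices(range(len(p)), weights=residual)[0])
        return emitted
    # Every draft token was accepted: sample one bonus token from the target.
    last = target_probs[-1]
    emitted.append(rng.choices(range(len(last)), weights=last)[0])
    return emitted
```

Under this rule the marginal distribution of each emitted token equals the target distribution, which is what "lossless" refers to; a rejection can only happen when p(tok) < q(tok), so the residual always has positive mass.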

SGLang • Speculative Decoding • RadixAttention • CUDA • Inference

GitHub

CUDA Attention Kernel + AWS Neuron SDK

Deep Learning

5.64× Faster vs PyTorch (N=32)

3.7× Neuron speedup

45ms→12ms Latency (GPT-2)

Two production-grade systems: (1) Custom CUDA C++ scaled-dot-product attention kernel with tiled QKᵀ, numerically stable softmax, and a pybind11 PyTorch binding; 5.64× faster than PyTorch at N=32, with correctness verified at max_diff < 1e-7. (2) GPT-2 ported to AWS Inferentia via a 3-step neuronx-cc pipeline, achieving a 3.7× speedup (45 ms to 12 ms, 1,800 to 6,700 tokens/sec). Deployed as a FastAPI REST endpoint on HuggingFace Spaces.
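To show why the softmax in item (1) needs to be numerically stable, here is a small sketch in plain Python. Subtracting the row maximum before exponentiating leaves the result unchanged mathematically but keeps every exponent at or below zero, so large scores cannot overflow. The helper names are assumptions for this example; `max_abs_diff` mirrors the kind of max_diff comparison used to check a kernel against a reference.

```python
import math


def stable_softmax(row):
    """Softmax with the row maximum subtracted so exp() never overflows."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two equal-length vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# math.exp(1000.0) raises OverflowError, but the shifted form handles it.
large = stable_softmax([1000.0, 1001.0, 1002.0])
small = stable_softmax([0.0, 1.0, 2.0])
assert max_abs_diff(large, small) < 1e-7
```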

CUDA C++ • PyTorch • AWS Inferentia • pybind11 • FastAPI

GitHub • Live API

Production LLM Fine-Tuning: Qwen-7B SFT

LLM / GenAI

17% Loss reduction

0.855 BERTScore

30min Training (T4)

Supervised fine-tuning of Qwen2.5-7B using LoRA (r=8, alpha=16) on UltraFeedback, training only 0.5% of parameters (35M of 7B) with QLoRA 4-bit quantization and FP16 mixed precision. Achieved 17% training loss reduction (1.412 to 1.176) and 0.855 BERTScore in 30 minutes on a T4 GPU. Model merged and deployed to HuggingFace Hub.
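A minimal sketch of how a LoRA adapter with r=8 and alpha=16 modifies a frozen linear layer, written in plain Python for illustration. The class name, the initialization scale, and the toy layer size are assumptions for this example; the actual run used PEFT and TRL on the full model. The update is scaled by alpha / r, and the B matrix starts at zero so the adapted layer initially matches the base layer.

```python
import random


def matvec(matrix, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


class LoRALinear:
    def __init__(self, weight, r=8, alpha=16, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weight, shape d_out x d_in
        # Trainable low-rank factors: A is r x d_in, B is d_out x r (zero init).
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.b = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * u for y, u in zip(base, update)]

    def trainable_fraction(self):
        """Share of this layer's parameters that LoRA actually trains."""
        d_out, d_in = len(self.weight), len(self.weight[0])
        lora_params = len(self.a) * (d_in + d_out)
        return lora_params / (d_out * d_in + lora_params)
```

On a toy 16×16 layer the trainable share is large, but for realistic hidden sizes the r × (d_in + d_out) adapter is tiny next to d_in × d_out, which is how the overall figure of 35M trainable parameters out of 7B (0.5%) arises.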

PyTorch • LoRA / PEFT • QLoRA • TRL • HuggingFace

GitHub • Model on HF Hub

Attention Mechanism Optimization Suite

Deep Learning

Benchmarking framework comparing 4 attention implementations on NVIDIA L4 (sequence length 1024, batch size 32): FlashAttention-2 achieves 12.3× throughput (573K to 6.03M tok/s) and a 99.7% memory reduction (12,582 MB to 38 MB) versus vanilla attention. Includes a batch-size auto-tuner based on binary search, plus ONNX/TensorRT export benchmarks. Key finding: algorithm-level optimization (FlashAttention) outperforms hardware-level optimization (TensorRT) by 6× for attention operations.
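The binary-search auto-tuner can be sketched as follows. This is an illustrative version under one stated assumption: `fits` is a hypothetical caller-supplied predicate that returns True when a batch size runs without exhausting GPU memory (in practice it might attempt a forward pass and catch an out-of-memory error), and it is monotonic, meaning that if a batch size fits then every smaller one fits too.

```python
def largest_fitting_batch(fits, low=1, high=1024):
    """Largest batch size in [low, high] for which fits(batch) is True, or 0 if none."""
    best = 0
    while low <= high:
        mid = (low + high) // 2
        if fits(mid):
            best = mid       # mid works, so try larger sizes
            low = mid + 1
        else:
            high = mid - 1   # mid fails, so only smaller sizes can work
    return best


# Toy memory model: 100 MB per sample against a 1,000 MB budget (made-up numbers).
def toy_fits(batch):
    return batch * 100 <= 1000


assert largest_fitting_batch(toy_fits) == 10
```

Binary search needs only about log2(high) probes, which matters when each probe is a real forward pass on the GPU.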

PyTorch • FlashAttention-2 • xFormers • ONNX Runtime • CUDA

GitHub

Advanced AI Agent System

GenAI

Multi-strategy reasoning system implementing 4 research papers: Chain-of-Thought with self-consistency voting (3 independent paths), Tree-of-Thoughts with beam search (width=3, depth=3), ReAct with real-time web search via Tavily API and ChromaDB vector memory, and Multi-Agent collaboration (Planner, Worker, Critic). LLM-based auto-classifier routes each query to the optimal strategy. Rate-limited (10/min, 100/day), with real-time streaming. Built with Groq LLM, deployed on HuggingFace Spaces.
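Self-consistency voting over independent reasoning paths can be sketched as a simple majority vote on the final answers. The function name and the normalization step (trim whitespace, lowercase) are assumptions for this example; the deployed system obtains its answers from separate LLM calls, which are represented here by plain strings.

```python
from collections import Counter


def self_consistency_vote(answers):
    """Returns (most common answer, share of paths that agreed with it)."""
    if not answers:
        raise ValueError("need at least one answer")
    # Normalize so trivially different spellings count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


# Three independent paths, two of which agree.
answer, agreement = self_consistency_vote(["42", " 42 ", "41"])
assert answer == "42"
```

The agreement ratio is a cheap confidence signal: a low value can be used to trigger a fallback to a different reasoning strategy.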

Python • Groq • Tavily • ChromaDB • Gradio

GitHub • Live Demo

Experience

Qure.ai Technologies • Healthcare AI

AI Solutions Engineer Intern

Mar 2026 - May 2026 • New York, NY (Remote)

Delivered end-to-end LLM deployment across 6+ live hospital systems (Medstar, Mount Sinai, UFL). Implemented clinical LLM post-training pipelines (SFT → DPO) for medical question answering on real FHIR/EPIC data. Deployed and maintained inference endpoints on AWS with real-time latency monitoring. Configured domain-specific prompting for radiologist report parsing and automated EHR data extraction. Supported HIPAA compliance by producing DPIAs, security assessments, and bidirectional EPIC integration documentation.

Python • LLMs • EPIC/FHIR • AWS • Clinical AI

ReplyQuickAI (DentalScan) • Healthcare ML

Machine Learning Engineer Intern

Dec 2025 - Feb 2026 • United States

Built computer vision pipelines for intra-oral image analysis across 6+ clinical categories (gingivitis staging, plaque detection, recession classification) on a 50K+ labeled dataset. Engineered automated retraining pipeline on AWS SageMaker incorporating dentist-corrected labels, improving model accuracy iteratively across production inference endpoints.

PyTorch • CNNs • AWS SageMaker • Computer Vision • Python

The University of Texas at Arlington

Graduate Research Assistant

Jun 2025 - May 2026 • Arlington, TX

Built TopGPT, a full-stack LLM application fine-tuned on 3+ textbooks with RAG over 1,000+ research paper chunks stored in Pinecone on AWS. Led CTMap, an LLM fine-tuning pipeline for mmWave 6G path planning, resulting in two accepted IEEE publications: IEEE ICC 2026 (conference) and IEEE OJ-COMS 2026 (journal). Built SFT pipelines encoding Sionna channel maps and OpenStreetMap graphs into transformer-readable formats.

PyTorch • RAG • Pinecone • AWS • LLM Fine-tuning • Sionna

The University of Texas at Arlington

Graduate Teaching Assistant

Aug 2024 - May 2025 • Arlington, TX

Supported graduate courses in Numerical Methods for 50+ students, assisting with algorithmic problem solving, optimization, and computational modeling. Concurrently developed CTMap, an LLM-enabled 6G path planning system accepted at IEEE ICC 2026 and extended to a journal paper at IEEE OJ-COMS 2026, fine-tuning LLMs on Dijkstra-generated coordinate paths applied to Sionna wireless network simulation outputs.
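As an illustration of Dijkstra-generated coordinate paths, the sketch below finds a shortest path over a small graph whose nodes are (x, y) coordinates and flattens it into a text string of the kind that could serve as a fine-tuning target. The graph layout, function names, and the " -> " serialization format are assumptions for this example and are not taken from CTMap.

```python
import heapq


def dijkstra_path(graph, source, target):
    """graph: {node: [(neighbor, weight), ...]} with (x, y) tuples as nodes."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale heap entry for an already settled node
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return None  # target is unreachable from source
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path


def serialize_path(path):
    """Flattens a coordinate path into a plain-text sequence (hypothetical format)."""
    return " -> ".join(f"({x},{y})" for x, y in path)
```

For example, on a four-node grid where the direct edge from (0, 0) to (0, 1) is expensive, the search routes through (1, 0) instead, and the serialized path is what a model would be trained to reproduce.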

Python • LLM Fine-tuning • OpenStreetMap • Sionna 6G • Dijkstra

Tata Consultancy Services

Senior Software Engineer → Software Engineer

Jun 2019 - May 2023 • 4 years • Chennai

Designed and owned Java-based distributed data processing services handling millions of records daily across production systems serving 10+ enterprise clients. Led system design and architecture reviews for data-intensive microservices. Built fault-tolerant backend pipelines with high availability and reduced processing latency by 40% through service refactoring.

Java • Spring Boot • Microservices • SQL • REST APIs
