Sai Teja Srivillibhutturu

About Me

Deep Learning Engineer specializing in GPU Optimization and LLM Inference. MS in Computer Science from UT Arlington (May 2025).

I optimize ML systems for speed, memory, and cost. My work focuses on CUDA kernels, FlashAttention, quantization techniques, and building production-grade inference infrastructure.

Published researcher with work accepted at IEEE ICC 2026 on LLM-enabled path planning for wireless networks.

My Skills

Python
PyTorch
TensorFlow
CUDA
C++
Java
Go
Hugging Face
Docker
Kubernetes
AWS
PostgreSQL
Redis
MongoDB
Airflow
Spark
Linux
Git
Neo4j
FastAPI
Flask
NumPy

Experience

Qure.ai Technologies 🏥 Healthcare AI

AI Solutions Engineer Intern

Mar 2026 - May 2026 • New York, NY (Remote)

Configured LLMs for protocol-specific clinical workflows, orchestrating radiologist report parsing and EMR data extraction across EPIC/FHIR-integrated hospital systems including Medstar, Mount Sinai, and UFL. Deployed and maintained AI inference endpoints supporting 6+ live health system sites under real-time US time zone SLAs. Prepared HIPAA-aligned technical documentation including DPIAs, security questionnaires, and EPIC bidirectional integration guides for clinical AI deployment.

PythonLLMsEPIC/FHIRAWSClinical AI

ReplyQuickAI (DentalScan) 🏥 Healthcare ML

Machine Learning Engineer Intern

Dec 2025 - Feb 2026 • United States

Built computer vision pipelines for intra-oral image analysis across 6+ clinical categories (gingivitis staging, plaque detection, recession classification) on a 50K+ labeled dataset. Engineered automated retraining pipeline on AWS SageMaker incorporating dentist-corrected labels, improving model accuracy iteratively across production inference endpoints.

PyTorchCNNsAWS SageMakerComputer VisionPython

The University of Texas at Arlington

Graduate Research Assistant

Jun 2025 - Present • Arlington, TX

Built TopGPT, a full-stack LLM application fine-tuned on 3+ textbooks with RAG over 1,000+ research paper chunks stored in Pinecone on AWS. Also contributed to CTMap, an LLM-enabled path planning system for mmWave 6G networks (accepted at IEEE ICC 2026), fine-tuning LLMs on Dijkstra-generated paths with real-time OpenStreetMap + Sionna 6G integration.
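The retrieval step of such a RAG pipeline reduces to nearest-neighbor search over chunk embeddings. A minimal sketch of that step, with toy 3-D vectors standing in for real embedding-model outputs (Pinecone's API and the actual chunk store are not shown):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, chunks, k=2):
    # chunks: list of (chunk_text, embedding) pairs; return the k
    # chunk texts most similar to the query embedding.
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in scored[:k]]

# Hypothetical chunks with toy embeddings.
chunks = [
    ("beamforming in mmWave", [0.9, 0.1, 0.0]),
    ("RAG chunking strategies", [0.1, 0.9, 0.1]),
    ("Dijkstra shortest paths", [0.0, 0.2, 0.9]),
]
print(top_k([0.8, 0.2, 0.0], chunks, k=1))  # → ['beamforming in mmWave']
```

A vector database like Pinecone performs the same ranking with approximate nearest-neighbor indexes instead of an exhaustive scan.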

PyTorchRAGPineconeAWSLLM Fine-tuningSionna

The University of Texas at Arlington

Graduate Teaching Assistant

Aug 2024 - May 2025 • Arlington, TX

Supported graduate courses in Numerical Methods for 50+ students, assisting with algorithmic problem solving, optimization, and computational modeling. Concurrently developed CTMap, an LLM-enabled 6G path planning system accepted at IEEE ICC 2026, fine-tuning LLMs on Dijkstra-generated coordinate paths and applying them to Sionna wireless network simulation outputs.

PythonLLM Fine-tuningOpenStreetMapSionna 6GDijkstra

Tata Consultancy Services

Software Engineer → Senior Software Engineer

Jun 2019 - May 2023 • 4 years • Chennai

Designed and owned Java-based distributed data processing services handling millions of records daily across production systems serving 10+ enterprise clients. Led system design and architecture reviews for data-intensive microservices. Built fault-tolerant backend pipelines with high availability and reduced processing latency by 40% through service refactoring.

JavaSpring BootMicroservicesSQLREST APIs

Education

The University of Texas at Arlington

Master of Science - Computer Science & Engineering

📜 Specialization in Deep Learning

Aug 2023 - May 2025

4.0 / 4.0

Completed 10+ graduate-level courses in advanced computing and ML systems; graduated May 2025 with a 4.0 GPA.

Graduate Coursework

🧠 Neural Networks
🤖 Artificial Intelligence
👁️ Computer Vision
📈 Machine Learning
📊 Data Analysis & Modeling
⛏️ Data Mining
⚙️ Design & Analysis of Algorithms

Andhra University

Bachelor of Technology - Computer Science Engineering

2015 - 2019

8.2 / 10.0

Featured Projects

IEEE ICC 2026

CTMap: LLM-Enabled Connectivity-Aware Path Planning for mmWave 6G Networks

Sai Teja Srivillibhutturu et al. • IEEE International Conference on Communications 2026

Fine-tuned LLMs on Dijkstra-generated coordinate paths from OpenStreetMap and applied them to Sionna 6G wireless simulation outputs, producing real-time optimal paths for mmWave connectivity-aware routing.
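The supervision signal here is shortest paths over a road graph. A minimal sketch of the kind of Dijkstra routine that could generate such training paths, on a toy graph with hypothetical node names (not the actual CTMap data pipeline):

```python
import heapq

def dijkstra(graph, src, dst):
    """Shortest path on a weighted digraph; graph maps node -> {neighbor: cost}."""
    dist = {src: 0.0}
    prev = {}
    pq = [(0.0, src)]  # min-heap of (distance-so-far, node)
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, already relaxed via a shorter route
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    # Walk predecessors back from dst to recover the path.
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[dst]

graph = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"C": 2.0, "D": 5.0},
    "C": {"D": 1.0},
    "D": {},
}
print(dijkstra(graph, "A", "D"))  # → (['A', 'B', 'C', 'D'], 4.0)
```

In the paper's setting, nodes would be OpenStreetMap coordinates and edge costs would reflect mmWave connectivity rather than pure distance.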

CUDA Attention Kernel + AWS Neuron SDK

Deep Learning

Two production-grade systems: (1) a custom CUDA C++ scaled-dot-product attention kernel with tiled QKᵀ, numerically stable softmax, and a pybind11 PyTorch binding, running 5.64x faster than PyTorch at N=32 with correctness verified at max_diff < 1e-7; (2) GPT-2 ported to AWS Inferentia via a three-step neuronx-cc pipeline (TorchScript trace, neuronx-cc compile, profiling), achieving a 3.7x speedup (45 ms to 12 ms; 1,800 to 6,700 tokens/sec). Deployed as a FastAPI REST endpoint on HuggingFace Spaces.
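The kernel itself is CUDA C++, but the math it implements is small. A plain-Python reference of the same computation (scaling by 1/√d, row-max-shifted softmax) is the sort of oracle used to check a kernel's output; tiling and the pybind11 binding are omitted:

```python
import math

def softmax(row):
    # Shift by the row max before exponentiating: this is the
    # numerical-stability trick the kernel applies per tile.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sdpa(Q, K, V):
    """Reference scaled-dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        w = softmax(scores)
        out.append([sum(wi * vi[j] for wi, vi in zip(w, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = sdpa(Q, K, V)
```

A kernel test would compare the CUDA output against this reference elementwise, which is where a bound like max_diff < 1e-7 comes from.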

CUDA C++PyTorchAWS Inferentiapybind11FastAPI

Production LLM Fine-Tuning: Qwen-7B SFT

LLM / GenAI

Supervised fine-tuning of Qwen2.5-7B using LoRA (r=8, alpha=16) on UltraFeedback, training only 0.5% of parameters (35M of 7B) with QLoRA 4-bit quantization and FP16 mixed precision. Achieved 17% training loss reduction (1.412 to 1.176) and 0.855 BERTScore in 30 minutes on a T4 GPU. Model merged and deployed to HuggingFace Hub.
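The sub-1% trainable fraction follows directly from LoRA's parameter arithmetic. A sketch with an illustrative 4096-wide projection (not Qwen2.5-7B's actual shapes, and the exact 35M total depends on which modules are adapted):

```python
def lora_added_params(d_out, d_in, r=8):
    # LoRA freezes the base weight W (d_out x d_in) and trains a
    # low-rank update B @ A, with B (d_out x r) and A (r x d_in),
    # so only r * (d_out + d_in) new parameters are trained.
    return r * (d_out + d_in)

full = 4096 * 4096                         # dense weight parameters
lora = lora_added_params(4096, 4096, r=8)  # adapter parameters
print(lora, lora / full)  # → 65536 0.00390625 (~0.4% of the matrix)
```

Summing this over every adapted projection in every layer gives the model-wide trainable count; QLoRA then keeps the frozen base weights in 4-bit to fit training on a single T4.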

PyTorchLoRA / PEFTQLoRATRLHuggingFace

Attention Mechanism Optimization Suite

Deep Learning

Benchmarking framework comparing 4 attention implementations on NVIDIA L4 (seq len 1024, batch 32): FlashAttention-2 achieves 12.3x throughput (573K to 6.03M tok/s) and 99.7% memory reduction (12,582 MB to 38 MB) vs vanilla. Includes batch-size auto-tuner using binary search and ONNX/TensorRT export benchmarks. Key finding: algorithm-level optimization (FlashAttention) outperforms hardware-level optimization (TensorRT) by 6x for attention operations.
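The ~99.7% memory reduction comes from never materializing the full N×N score matrix: FlashAttention computes softmax in one streaming pass with running statistics. A scalar-valued sketch of that online-softmax recurrence (a toy illustration, not the benchmarked implementation):

```python
import math

def online_softmax_weighted_sum(scores, values):
    """One-pass softmax-weighted sum over (score, value) pairs.
    Keeps a running max, normalizer, and accumulator instead of
    storing the whole score row: the core recurrence behind
    FlashAttention's O(N) memory."""
    m = float("-inf")  # running max of scores seen so far
    s = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running sum of exp(score - m) * value
    for x, v in zip(scores, values):
        m_new = max(m, x)
        c = math.exp(m - m_new)  # rescales old partials when the max grows
        s = s * c + math.exp(x - m_new)
        acc = acc * c + math.exp(x - m_new) * v
        m = m_new
    return acc / s

scores = [0.1, 2.0, -1.0]
values = [1.0, 2.0, 3.0]
two_pass = (sum(math.exp(x) * v for x, v in zip(scores, values))
            / sum(math.exp(x) for x in scores))
one_pass = online_softmax_weighted_sum(scores, values)
```

Because the recurrence matches the two-pass result exactly (up to floating-point error), the tiled kernel can stream K/V blocks through on-chip memory instead of writing N×N scores to HBM.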

PyTorchFlashAttention-2xFormersONNX RuntimeCUDA

Advanced AI Agent System

GenAI

Multi-strategy reasoning system implementing 4 research papers: Chain-of-Thought with self-consistency voting (3 independent paths), Tree-of-Thoughts with beam search (width=3, depth=3), ReAct with real-time web search via Tavily API and ChromaDB vector memory, and Multi-Agent collaboration (Planner, Worker, Critic). LLM-based auto-classifier routes each query to the optimal strategy. Rate-limited (10/min, 100/day), with real-time streaming. Built with Groq LLM, deployed on HuggingFace Spaces.
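Of those strategies, self-consistency is the simplest to state: sample several independent reasoning paths and majority-vote their final answers. A minimal sketch with hypothetical parsed answers (the LLM sampling itself is not shown):

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers from independent CoT samples.
    Returns the winning answer and its vote share."""
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

# Hypothetical final answers parsed from 3 sampled reasoning paths.
print(self_consistency(["42", "42", "41"]))  # → ('42', 0.6666666666666666)
```

The vote share doubles as a cheap confidence signal: low agreement across paths can be used to route the query to a heavier strategy such as Tree-of-Thoughts.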

PythonGroqTavilyChromaDBGradio

Certifications

UC Berkeley

Advanced Large Language Model Agents

UC Berkeley EECS • Jul 2025

AWS

AWS Data Engineer Associate

Amazon Web Services • Dec 2024 - Dec 2027

Microsoft

Microsoft Fabric Data Engineer Associate

Microsoft • Aug 2025

Oracle

Oracle GenAI Professional

Oracle Cloud • Jun 2024 - Jun 2026

Let's Connect

Open to ML Engineering, Deep Learning, LLM/GenAI, and Backend roles.


Or reach me directly: