Sai Teja Srivillibhutturu

About Me

Deep Learning Engineer specializing in GPU Optimization and LLM Inference. MS in Computer Science from UT Arlington (May 2025).

I optimize ML systems for speed, memory, and cost. My work focuses on CUDA kernels, FlashAttention, quantization techniques, and building production-grade inference infrastructure.

Published researcher with work accepted at IEEE ICC 2026 on LLM-enabled path planning for wireless networks.

My Skills

Python
PyTorch
TensorFlow
CUDA
C++
Java
Go
Hugging Face
Docker
Kubernetes
AWS
PostgreSQL
Redis
MongoDB
Airflow
Spark
Linux
Git
Neo4j
FastAPI
Flask
NumPy

Experience

Qure.ai Technologies 🏥 Healthcare AI

AI Solutions Engineer Intern

Mar 2026 - May 2026 • New York, NY (Remote)

Configured LLMs for protocol-specific clinical workflows, orchestrating radiologist report parsing and EMR data extraction across EPIC/FHIR-integrated hospital systems including Medstar, Mount Sinai, and UFL. Deployed and maintained AI inference endpoints supporting 6+ live health system sites under real-time US time zone SLAs. Prepared HIPAA-aligned technical documentation including DPIAs, security questionnaires, and EPIC bidirectional integration guides for clinical AI deployment.

PythonLLMsEPIC/FHIRAWSClinical AI

ReplyQuickAI (DentalScan) 🏥 Healthcare ML

Machine Learning Engineer Intern

Dec 2025 - Feb 2026 • United States

Built computer vision pipelines for intra-oral image analysis across 6+ clinical categories (gingivitis staging, plaque detection, recession classification) on a 50K+ labeled dataset. Engineered automated retraining pipeline on AWS SageMaker incorporating dentist-corrected labels, improving model accuracy iteratively across production inference endpoints.

PyTorchCNNsAWS SageMakerComputer VisionPython

The University of Texas at Arlington

Graduate Research Assistant

Jun 2025 - Present • Arlington, TX

Built TopGPT, a full-stack LLM application fine-tuned on 3+ textbooks with RAG over 1,000+ research paper chunks stored in Pinecone on AWS. Also contributed to CTMap, an LLM-enabled path planning system for mmWave 6G networks (accepted at IEEE ICC 2026), fine-tuning LLMs on Dijkstra-generated paths with real-time OpenStreetMap + Sionna 6G integration.
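The retrieval step of such a RAG pipeline reduces to nearest-neighbor search over chunk embeddings. A minimal sketch of that step, with toy 3-D vectors standing in for real embedding-model outputs (Pinecone's API and the actual chunk store are not shown):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, chunks, k=2):
    # chunks: list of (chunk_text, embedding) pairs; return the k
    # chunk texts most similar to the query embedding.
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in scored[:k]]

# Hypothetical chunks with toy embeddings.
chunks = [
    ("beamforming in mmWave", [0.9, 0.1, 0.0]),
    ("RAG chunking strategies", [0.1, 0.9, 0.1]),
    ("Dijkstra shortest paths", [0.0, 0.2, 0.9]),
]
print(top_k([0.8, 0.2, 0.0], chunks, k=1))  # → ['beamforming in mmWave']
```

A vector database like Pinecone performs the same ranking with approximate nearest-neighbor indexes instead of an exhaustive scan.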

PyTorchRAGPineconeAWSLLM Fine-tuningSionna

The University of Texas at Arlington

Graduate Teaching Assistant

Aug 2024 - May 2025 • Arlington, TX

Supported graduate courses in Numerical Methods for 50+ students, assisting with algorithmic problem solving, optimization, and computational modeling. Concurrently developed CTMap, an LLM-enabled 6G path planning system accepted at IEEE ICC 2026, fine-tuning LLMs on Dijkstra-generated coordinate paths and applying them to Sionna wireless network simulation outputs.

PythonLLM Fine-tuningOpenStreetMapSionna 6GDijkstra

Tata Consultancy Services

Software Engineer → Senior Software Engineer

Jun 2019 - May 2023 • 4 years • Chennai

Designed and owned Java-based distributed data processing services handling millions of records daily across production systems serving 10+ enterprise clients. Led system design and architecture reviews for data-intensive microservices. Built fault-tolerant backend pipelines with high availability and reduced processing latency by 40% through service refactoring.

JavaSpring BootMicroservicesSQLREST APIs

Education

The University of Texas at Arlington

Master of Science - Computer Science & Engineering

📜 Specialization in Deep Learning

Aug 2023 - May 2025

4.0 / 4.0

Completed 10+ graduate-level courses in advanced computing and ML systems; graduated May 2025 with a 4.0 GPA.

Graduate Coursework

🧠 Neural Networks
🤖 Artificial Intelligence
👁️ Computer Vision
📈 Machine Learning
📊 Data Analysis & Modeling
⛏️ Data Mining
⚙️ Design & Analysis of Algorithms

Andhra University

Bachelor of Technology - Computer Science Engineering

2015 - 2019

8.2 / 10.0

Featured Projects

IEEE ICC 2026

CTMap: LLM-Enabled Connectivity-Aware Path Planning for mmWave 6G Networks

Sai Teja Srivillibhutturu et al. • IEEE International Conference on Communications 2026

Fine-tuned LLMs on Dijkstra-generated coordinate paths from OpenStreetMap and applied them to Sionna 6G wireless simulation outputs, producing real-time optimal paths for mmWave connectivity-aware routing.
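The supervision signal here is shortest paths over a road graph. A minimal sketch of the kind of Dijkstra routine that could generate such training paths, on a toy graph with hypothetical node names (not the actual CTMap data pipeline):

```python
import heapq

def dijkstra(graph, src, dst):
    """Shortest path on a weighted digraph; graph maps node -> {neighbor: cost}."""
    dist = {src: 0.0}
    prev = {}
    pq = [(0.0, src)]  # min-heap of (distance-so-far, node)
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, already relaxed via a shorter route
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    # Walk predecessors back from dst to recover the path.
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[dst]

graph = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"C": 2.0, "D": 5.0},
    "C": {"D": 1.0},
    "D": {},
}
print(dijkstra(graph, "A", "D"))  # → (['A', 'B', 'C', 'D'], 4.0)
```

In the paper's setting, nodes would be OpenStreetMap coordinates and edge costs would reflect mmWave connectivity rather than pure distance.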

CUDA Attention Kernel + AWS Neuron SDK

Deep Learning

Two production-grade systems: (1) a custom CUDA C++ scaled-dot-product attention kernel with tiled QKᵀ, numerically stable softmax, and a pybind11 PyTorch binding, running 5.64x faster than PyTorch at N=32 with correctness verified at max_diff < 1e-7; (2) GPT-2 ported to AWS Inferentia via a three-step neuronx-cc pipeline (TorchScript trace, neuronx-cc compile, profiling), achieving a 3.7x speedup (45 ms to 12 ms; 1,800 to 6,700 tokens/sec). Deployed as a FastAPI REST endpoint on HuggingFace Spaces.
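The kernel itself is CUDA C++, but the math it implements is small. A plain-Python reference of the same computation (scaling by 1/√d, row-max-shifted softmax) is the sort of oracle used to check a kernel's output; tiling and the pybind11 binding are omitted:

```python
import math

def softmax(row):
    # Shift by the row max before exponentiating: this is the
    # numerical-stability trick the kernel applies per tile.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sdpa(Q, K, V):
    """Reference scaled-dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        w = softmax(scores)
        out.append([sum(wi * vi[j] for wi, vi in zip(w, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = sdpa(Q, K, V)
```

A kernel test would compare the CUDA output against this reference elementwise, which is where a bound like max_diff < 1e-7 comes from.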

CUDA C++PyTorchAWS Inferentiapybind11FastAPI

Production LLM Fine-Tuning: Qwen-7B SFT

LLM / GenAI

Supervised fine-tuning of Qwen2.5-7B using LoRA (r=8, alpha=16) on UltraFeedback, training only 0.5% of parameters (35M of 7B) with QLoRA 4-bit quantization and FP16 mixed precision. Achieved 17% training loss reduction (1.412 to 1.176) and 0.855 BERTScore in 30 minutes on a T4 GPU. Model merged and deployed to HuggingFace Hub.
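The sub-1% trainable fraction follows directly from LoRA's parameter arithmetic. A sketch with an illustrative 4096-wide projection (not Qwen2.5-7B's actual shapes, and the exact 35M total depends on which modules are adapted):

```python
def lora_added_params(d_out, d_in, r=8):
    # LoRA freezes the base weight W (d_out x d_in) and trains a
    # low-rank update B @ A, with B (d_out x r) and A (r x d_in),
    # so only r * (d_out + d_in) new parameters are trained.
    return r * (d_out + d_in)

full = 4096 * 4096                         # dense weight parameters
lora = lora_added_params(4096, 4096, r=8)  # adapter parameters
print(lora, lora / full)  # → 65536 0.00390625 (~0.4% of the matrix)
```

Summing this over every adapted projection in every layer gives the model-wide trainable count; QLoRA then keeps the frozen base weights in 4-bit to fit training on a single T4.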

PyTorchLoRA / PEFTQLoRATRLHuggingFace

Attention Mechanism Optimization Suite

Deep Learning

Benchmarking framework comparing 4 attention implementations on NVIDIA L4 (seq len 1024, batch 32): FlashAttention-2 achieves 12.3x throughput (573K to 6.03M tok/s) and 99.7% memory reduction (12,582 MB to 38 MB) vs vanilla. Includes batch-size auto-tuner using binary search and ONNX/TensorRT export benchmarks. Key finding: algorithm-level optimization (FlashAttention) outperforms hardware-level optimization (TensorRT) by 6x for attention operations.
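The ~99.7% memory reduction comes from never materializing the full N×N score matrix: FlashAttention computes softmax in one streaming pass with running statistics. A scalar-valued sketch of that online-softmax recurrence (a toy illustration, not the benchmarked implementation):

```python
import math

def online_softmax_weighted_sum(scores, values):
    """One-pass softmax-weighted sum over (score, value) pairs.
    Keeps a running max, normalizer, and accumulator instead of
    storing the whole score row: the core recurrence behind
    FlashAttention's O(N) memory."""
    m = float("-inf")  # running max of scores seen so far
    s = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running sum of exp(score - m) * value
    for x, v in zip(scores, values):
        m_new = max(m, x)
        c = math.exp(m - m_new)  # rescales old partials when the max grows
        s = s * c + math.exp(x - m_new)
        acc = acc * c + math.exp(x - m_new) * v
        m = m_new
    return acc / s

scores = [0.1, 2.0, -1.0]
values = [1.0, 2.0, 3.0]
two_pass = (sum(math.exp(x) * v for x, v in zip(scores, values))
            / sum(math.exp(x) for x in scores))
one_pass = online_softmax_weighted_sum(scores, values)
```

Because the recurrence matches the two-pass result exactly (up to floating-point error), the tiled kernel can stream K/V blocks through on-chip memory instead of writing N×N scores to HBM.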

PyTorchFlashAttention-2xFormersONNX RuntimeCUDA

Advanced AI Agent System

GenAI

Multi-strategy reasoning system implementing 4 research papers: Chain-of-Thought with self-consistency voting (3 independent paths), Tree-of-Thoughts with beam search (width=3, depth=3), ReAct with real-time web search via Tavily API and ChromaDB vector memory, and Multi-Agent collaboration (Planner, Worker, Critic). LLM-based auto-classifier routes each query to the optimal strategy. Rate-limited (10/min, 100/day), with real-time streaming. Built with Groq LLM, deployed on HuggingFace Spaces.
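Of those strategies, self-consistency is the simplest to state: sample several independent reasoning paths and majority-vote their final answers. A minimal sketch with hypothetical parsed answers (the LLM sampling itself is not shown):

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers from independent CoT samples.
    Returns the winning answer and its vote share."""
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

# Hypothetical final answers parsed from 3 sampled reasoning paths.
print(self_consistency(["42", "42", "41"]))  # → ('42', 0.6666666666666666)
```

The vote share doubles as a cheap confidence signal: low agreement across paths can be used to route the query to a heavier strategy such as Tree-of-Thoughts.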

PythonGroqTavilyChromaDBGradio

Certifications

UC Berkeley

Advanced Large Language Model Agents

UC Berkeley EECS • Jul 2025

AWS

AWS Data Engineer Associate

Amazon Web Services • Dec 2024 - Dec 2027

Microsoft

Microsoft Fabric Data Engineer Associate

Microsoft • Aug 2025

Oracle

Oracle GenAI Professional

Oracle Cloud • Jun 2024 - Jun 2026

Let's Connect

Open to ML Engineering, Deep Learning, LLM/GenAI, and Backend roles.


Or reach me directly: