Projects

Selected Projects

A selected set of projects spanning CUDA optimization, multimodal AI, code intelligence, and systems tooling.


Project Lens

These projects are organized to show the progression from model understanding to system optimization and hardware-aware execution.

Case 01

CUDA GEMM Optimization and Architectural Analysis

Independent Researcher / Mar 2026

Implemented and systematically optimized GEMM kernels while studying how memory hierarchy and arithmetic intensity shape end-to-end execution performance.

  • Used 2D block tiling and shared memory to improve data reuse within a thread block.
  • Applied register blocking to raise arithmetic intensity from 7.2 to 14.1 FLOPs/Byte.
  • Used Nsight Compute to confirm a substantial reduction in DRAM traffic and a 3.57x overall speedup.
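The arithmetic-intensity gain from tiling can be sketched with a back-of-the-envelope model: per K-step, a BMxBN output tile loads BM*BK + BK*BN elements from global memory and performs 2*BM*BN*BK FLOPs. The tile sizes below are illustrative assumptions, not the values used in this project.

```python
# Sketch: how tile size drives the arithmetic intensity of a tiled GEMM.
# Tile dimensions are illustrative, not the project's actual configuration.

def gemm_tile_intensity(bm: int, bn: int, bk: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte of global-memory traffic for one BMxBN output tile.

    Per BK-wide K-step the tile loads bm*bk + bk*bn elements and performs
    2*bm*bn*bk multiply-add FLOPs, so bk cancels out of the ratio.
    """
    flops = 2 * bm * bn * bk
    bytes_moved = (bm * bk + bk * bn) * bytes_per_elem
    return flops / bytes_moved

print(gemm_tile_intensity(32, 32, 8))   # 8.0 FLOPs/byte
print(gemm_tile_intensity(64, 64, 8))   # 16.0 FLOPs/byte: doubling the tile doubles reuse
```

The ratio depends only on the output-tile shape, which is why enlarging the per-block (and per-register) tile raises arithmetic intensity.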

Case 02

LLM + RAG Code Architecture Analysis System

Independent Developer / Mar 2026

Built a repository analysis tool that combines LLM reasoning, AST-based chunking, vector retrieval, and CUDA-aware parsing for structured source code understanding.

  • Automated GitHub repository ingestion and function-level source chunking.
  • Added CUDA-specific parsing rules to inspect kernels, shared memory, and memory access patterns.
  • Used vector search and LLM inference for cross-file reasoning and documentation generation.
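Function-level chunking of the kind described above can be sketched with Python's `ast` module; the project's actual chunker and its CUDA-specific parsing rules are not reproduced here.

```python
# Minimal sketch of AST-based, function-level source chunking (Python files
# only); the project's real pipeline also handles CUDA sources.
import ast

def chunk_functions(source: str) -> list[dict]:
    """Split a Python source file into one chunk per function definition."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                "name": node.name,
                "start": node.lineno,
                "end": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks

src = "def add(a, b):\n    return a + b\n\ndef mul(a, b):\n    return a * b\n"
for c in chunk_functions(src):
    print(c["name"], c["start"], c["end"])
```

Each chunk keeps its line span and exact source text, which is what a vector index needs to map retrieved hits back to repository locations.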

Case 03

Multimodal Video Captioning Research

Core Researcher / Aug 2025 - Dec 2025

Designed a multimodal system that translates video content into structured textual descriptions, with attention to efficient deployment behavior.

  • Used a ViT-style visual encoder with patch projection and positional embeddings.
  • Built transformer-based cross-modal alignment between visual and textual representations.
  • Prepared the work for manuscript submission and connected model design to system efficiency concerns.
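The patch-projection step of a ViT-style encoder can be sketched in NumPy: split a frame into patches, project each patch linearly, and add positional embeddings. All dimensions below are illustrative assumptions; the project's encoder details are not reproduced here.

```python
import numpy as np

# Sketch of ViT-style patch embedding with illustrative dimensions.
rng = np.random.default_rng(0)
H = W = 32; P = 8; C = 3; D = 64          # frame size, patch size, channels, embed dim
n_patches = (H // P) * (W // P)           # 4 x 4 = 16 patches

frame = rng.standard_normal((H, W, C))
W_proj = rng.standard_normal((P * P * C, D)) * 0.02   # linear patch projection
pos_emb = rng.standard_normal((n_patches, D)) * 0.02  # positional embeddings

# Rearrange (H, W, C) -> (n_patches, P*P*C): one flattened row per patch.
patches = (frame.reshape(H // P, P, W // P, P, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(n_patches, P * P * C))

tokens = patches @ W_proj + pos_emb       # (n_patches, D) token sequence
print(tokens.shape)                       # (16, 64)
```

The resulting token sequence is what the cross-modal transformer layers then align with textual representations.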

Case 04

Transformer-based LLM from Scratch

Independent Developer / Jul 2025 - Aug 2025

Built a small generative language model from scratch to better understand tokenization, attention, pre-training, and fine-tuning dynamics.

  • Implemented data preprocessing and tokenizer pipeline independently.
  • Completed pre-training and domain-specific fine-tuning workflow.
  • Focused on training strategy, memory control, and domain generation quality.
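The central operation of such a from-scratch model, scaled dot-product attention, can be sketched in NumPy. This is a generic single-head sketch, not the project's training code.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # rows sum to 1
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))   # 4 tokens, head dim 8
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
out = attention(q, k, v)
print(out.shape)  # (4, 8)
```

A full decoder block adds a causal mask, multiple heads, and a feed-forward sublayer around this core.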

Case 05

Etshark Packet Analysis System

Independent Developer / Feb 2025 - May 2025

Developed a full-stack packet capture and protocol analysis tool inspired by tshark, with C++ backend parsing and a localized frontend experience.

  • Implemented offline and online packet parsing with protocol tree translation.
  • Integrated C++ backend with Electron and JavaScript packaging workflow.
  • Improved accessibility for Chinese-speaking developers through interface localization.
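The fixed-layout decoding at the heart of a packet parser can be sketched with Python's `struct` module: here, a 14-byte Ethernet II header. This illustrates the technique only; the project's C++ backend and full protocol tree are not shown.

```python
import struct

def parse_ethernet(frame: bytes) -> dict:
    """Decode the 14-byte Ethernet II header: dst MAC, src MAC, EtherType."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])  # network byte order
    fmt_mac = lambda b: ":".join(f"{x:02x}" for x in b)
    return {"dst": fmt_mac(dst), "src": fmt_mac(src), "ethertype": hex(ethertype)}

# Broadcast frame carrying IPv4 (EtherType 0x0800).
raw = bytes.fromhex("ffffffffffff" "001122334455" "0800")
print(parse_ethernet(raw))
# {'dst': 'ff:ff:ff:ff:ff:ff', 'src': '00:11:22:33:44:55', 'ethertype': '0x800'}
```

The EtherType value then selects the next parser (IPv4, ARP, ...), which is how a protocol tree is built layer by layer.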