Past event

School of Computer Science PGR Seminar: Honglin Zhang

Honglin Zhang will present "RepoSnipy-LLM: Structured Knowledge-Grounded Explanations for Scientific Software Mining".

Abstract: Understanding and reusing software repositories is a fundamental challenge in modern scientific research. While existing tools such as RepoSim4Py and RepoSnipy can identify semantically related repositories, they lack the interpretability researchers need to assess architectural choices or execution requirements. In this paper, we present RepoSnipy-LLM, a question-driven semantic search and explanation tool that extends the RepoSnipy ecosystem with Retrieval-Augmented Generation (RAG). To improve the trustworthiness and verifiability of its insights, our system grounds generated answers in structured repository knowledge extracted via inspect4py and stored in Elasticsearch. The system supports 18 structured questions across five categories (repository understanding, architecture, execution, implementation, and reuse), aligned with the FAIR principles for research software. We evaluate the system on 487 Python repositories from the Awesome Python dataset, assessing both retrieval quality and generation quality. The retrieval results show that code-level embeddings perform best, outperforming keyword-based baselines and full concatenated repository embeddings. The RAG-based generation evaluation further demonstrates that structured repository knowledge supports well-grounded, contextually relevant explanations for question-driven repository exploration.

Bio: Honglin Zhang is a first-year PhD student supervised by Dr Lei Fang and Professor Juan Ye. His research sits at the intersection of mining software repositories and large language models, with a focus on code representation learning and on building explainable, evidence-grounded systems for scientific software discovery and reuse.