Project description

Modern DNA sequencing lets us study entire microbial communities, including bacteria and viruses from the human gut, ocean, soil, and clinical samples. The data is often represented as a sequence graph, where nodes are DNA fragments and edges show how those fragments connect. These graphs can become enormous, complex, and difficult to store, search, and analyse efficiently. As sequencing datasets continue to grow, efficient graph storage and search are becoming essential for turning raw sequencing data into useful biological insights.

In this project, you will design and evaluate smarter ways to represent large sequence graphs. You will explore compressed data structures and indexing strategies that make it possible to answer graph queries quickly while using less memory. For example, you may investigate how to efficiently find neighbouring DNA fragments, connected components, paths, repeated sequences, or circular structures in very large graphs. You will tackle a real scalability problem using data structures, graph algorithms, performance benchmarking, and software design. You may compare approaches such as adjacency lists, sparse matrices, compressed sparse row formats, disk-backed indexes, or succinct graph structures.

This project gives you the opportunity to contribute to open-source bioinformatics software and work on a genuine scalability bottleneck in modern genomics. By the end of the project, you will have extended our published library agtools with the new indexing capabilities and built a benchmarking framework that reveals the trade-offs between memory usage, query speed, scalability, and implementation complexity.
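To give a feel for the kind of comparison involved, here is a minimal Python sketch of two of the representations mentioned above, an adjacency list and a compressed sparse row (CSR) layout, for a tiny undirected graph. The function names and the toy edge list are purely illustrative assumptions; this is not the agtools API, and a real sequence graph would carry sequences, orientations, and far more nodes.

```python
from collections import defaultdict

# Hypothetical toy graph: integer node IDs stand in for DNA fragments.
# Assumes an undirected graph with no self-loops.
edges = [(0, 1), (1, 2), (2, 0), (3, 4)]
num_nodes = 5


def build_adjacency_list(edges):
    """Dictionary of Python lists: simple, but one list object per node."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def build_csr(num_nodes, edges):
    """Two flat arrays: offsets[i]..offsets[i+1] delimits node i's neighbours."""
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Prefix sum of degrees gives the start of each node's slice.
    offsets = [0] * (num_nodes + 1)
    for i in range(num_nodes):
        offsets[i + 1] = offsets[i] + degree[i]

    # Fill the neighbour array, tracking the next free slot per node.
    targets = [0] * offsets[num_nodes]
    cursor = offsets[:-1]  # slicing makes a copy, so offsets is untouched
    for u, v in edges:
        targets[cursor[u]] = v
        cursor[u] += 1
        targets[cursor[v]] = u
        cursor[v] += 1
    return offsets, targets


def csr_neighbours(offsets, targets, node):
    """Neighbour lookup is a single contiguous slice."""
    return targets[offsets[node]:offsets[node + 1]]


adj = build_adjacency_list(edges)
offsets, targets = build_csr(num_nodes, edges)
print(adj[0], csr_neighbours(offsets, targets, 0))
```

Both structures answer the same neighbour query. The CSR form stores everything in two flat arrays, which is generally more compact and cache-friendly but harder to update once built. Trade-offs like this are what a benchmarking framework would measure.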

Co-supervisors

Prof Robert Edwards, Flinders Accelerator for Microbiome Exploration, College of Science & Engineering

Further information

For more information about our research, check out our GitHub profiles (metagentools, Vini2, linsalrob), the Edwards lab website, and the FAME group's website.

Assumed knowledge

You should have a background in Computer Science, Information Technology, or a related field. A basic understanding of programming in Python, data structures, and graph theory would be awesome. We'll teach you bioinformatics and the related biology.


Note: You need to register interest in projects from different supervisors, not in several projects with the same supervisor.
You must also contact each supervisor directly to discuss both the project details and your suitability to undertake the project.