Tech Tree - A Systematic Approach to ML Research (and Beyond)
I have always wondered how deep learning heroes or leading labs develop such ingenious and elegant architectures and algorithmic improvements — especially the ones that stand the test of time. The challenge lies in navigating a design space so vast that, despite attempts to automate the search, humans still outperform machines, thanks to intuition built on theoretical knowledge and countless empirical experiences.
Fortunately, many scientists have shared their strategies for conducting research, such as maintaining journals or spending a lot of time reproducing methods from papers. Among these, Andrej Karpathy’s amazing "A Recipe for Training Neural Networks" stands out as a uniquely structured framework that provides clear, step-by-step guidelines to maximize success when training neural networks. However, deep learning remains a highly empirical domain with many non-trivial dynamics.
This blog presents my attempt to develop a systematic approach to tracking and developing research ideas, which I call the "Tech Tree". The methodology serves a dual purpose: it documents research progress while helping to understand the relationships between components to build better mental models of their interactions and impact. By making the experimental process explicit and visually traceable, this methodology could improve collaboration by lowering the barrier to entry for newcomers to the project.
Method
The core of the process involves building a graph of the performed experiments, which serves as a visual representation of a "compressed chain of thought" documenting the entire research process.
In the following example, I had three prior requirements, which I think are fairly standard for most machine learning setups, though not strictly necessary:
- Robust evaluation pipeline: Ideally, to simplify judgment, it should return numerical scores that we want to minimize or maximize, for example from a benchmark we care about or a weighted combination of multiple benchmarks (see the sketch after this list). However, one needs to be aware of Goodhart’s law, the stochastic run-to-run variance in the scores, and other possible biases such as RL reward hacking.
- Reproducible environment: The setup should be as deterministic as possible, minimizing the impact of random seeds on the experiments. This ensures that changes in performance are primarily due to intentional modifications.
- Qualitative tooling: Tools that provide qualitative descriptions and insights for the nodes, beyond just numerical metrics.
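As a rough sketch of the first two requirements, the snippet below pins the main sources of randomness and collapses several benchmark scores into a single weighted objective. The benchmark names, weights, and the seed value are illustrative placeholders of mine, not part of any particular pipeline.

```python
import random

import numpy as np
import torch


def set_reproducible(seed: int = 0) -> None:
    """Pin the main sources of randomness so reruns stay comparable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Collapse several benchmark scores into one number to maximize.

    Keep Goodhart's law in mind: the weighted sum is a proxy for what
    we actually care about, not the goal itself.
    """
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical usage with made-up benchmark names, scores, and weights.
set_reproducible(seed=42)
objective = weighted_score(
    scores={"benchmark_a": 0.90, "benchmark_b": 0.75},
    weights={"benchmark_a": 0.8, "benchmark_b": 0.2},
)
```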
Each experiment is represented as a node in the graph, encapsulating key information to minimize information loss during rapid prototyping. A node consists of a name indicating its modification relative to its predecessor, benchmark metrics quantifying its impact, and optional qualitative notes capturing insights beyond numerical scores. To illustrate relationships between consecutive experiments, nodes use a color-coded scheme: green for improvements, yellow for negligible change, and red for performance degradation or other undesirable effects.
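A minimal sketch of how such a node could be stored, assuming a plain Python dataclass; the field names and the improvement threshold `tol` are illustrative choices of mine, not a prescribed schema.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentNode:
    """One experiment in the tech tree."""

    name: str                      # modification relative to the parent, e.g. "+ point patching"
    metrics: dict[str, float]      # benchmark scores for this run
    notes: str = ""                # optional qualitative insights
    parent: "ExperimentNode | None" = None
    children: list["ExperimentNode"] = field(default_factory=list)

    def status(self, key: str, tol: float = 0.002) -> str:
        """Color of the link to the parent: green / yellow / red."""
        if self.parent is None or key not in self.parent.metrics:
            return "gray"  # root node or incomparable metrics
        delta = self.metrics[key] - self.parent.metrics[key]
        if delta > tol:
            return "green"
        if delta < -tol:
            return "red"
        return "yellow"
```

Because each child inherits everything from its parent except the change named in `name`, comparing its metrics against the parent's is enough to pick the color of the connecting link.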
Background
To illustrate the method, I’ll share examples from my work on adapting the transformer architecture for point cloud understanding. This technique helped me evolve an initial concept from 73% accuracy into a solution achieving over 90% on the ModelNet40 benchmark, matching other transformer adaptations reproduced in the same environment.
Example
This method is particularly effective when applied to complex systems composed of multiple interacting elements, where each experiment builds on previous discoveries. In my attempt, I initiated the research tree with a simple linear embedding sanity check to verify the training pipeline — this baseline served as the root node from which the research progressed. Each subsequent experiment inherited all elements of the previous setup except for a specific modification noted in its name. For example, applying the initial patching idea to the baseline configuration achieved 73% accuracy, creating a node connected to its parent by a green link. The experimental setup remained identical across iterations, with the best score for each run recorded within a 100-epoch budget.
Disclaimer: Evaluating on the test set instead of the validation set during model development is not a good idea. However, I made an exception in this case, as the actual test set for my method comprised other benchmarks that were evaluated only after finalizing the model.
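To make the bookkeeping concrete, here is a small sketch that links the baseline to the patching experiment and renders a color-coded diagram. It reuses the hypothetical `ExperimentNode` above and assumes the `graphviz` Python package (plus the Graphviz binaries) is installed; the baseline accuracy is a made-up placeholder, while 73% mirrors the example above.

```python
import graphviz  # assumes both the Python package and the Graphviz binaries are available


def render_tree(root: ExperimentNode, metric: str, path: str = "tech_tree") -> None:
    """Walk the tree and emit a diagram with color-coded parent links."""
    dot = graphviz.Digraph(comment="Tech Tree")
    stack = [root]
    while stack:
        node = stack.pop()
        label = f"{node.name}\\n{metric}: {node.metrics.get(metric, float('nan')):.1%}"
        dot.node(str(id(node)), label)
        if node.parent is not None:
            dot.edge(str(id(node.parent)), str(id(node)), color=node.status(metric))
        stack.extend(node.children)
    dot.render(path, format="png", cleanup=True)


# Hypothetical usage: the baseline accuracy is a placeholder; 73% comes from the example above.
root = ExperimentNode("linear embedding sanity check", {"test_acc": 0.60})
patching = ExperimentNode("+ point patching", {"test_acc": 0.73}, parent=root)
root.children.append(patching)
render_tree(root, metric="test_acc")
```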
I continued to explore the solution by incorporating both successful and unsuccessful ideas, mapping their interactions to deepen my understanding of the system. A particularly powerful aspect of this approach was the ability to observe the same idea applied at different levels, simplifying the assessment of its impact and identifying potential dependencies.
In some cases, the tree contains red branches without recorded results. These represent experiments that performed so poorly or fell outside the main scope of the research that I decided to stop them early to avoid wasting resources. As this was my first time using this method, there is still plenty of room for improvement in defining clearer rules and improving visual consistency.
When time allowed, I also recorded simple hyperparameter changes to separate the effects of architectural modifications from non-architectural ones.
Figure: Complete research tree for this study. The final solution achieved 90.4% accuracy with a model that has over 10x fewer parameters. The entire research process was especially engaging, driven by the sense that just a few layers deeper, significant improvements were waiting to be discovered.
Figure: Test accuracy curves from most of the experiments carried out during the development of the architecture.

Another common concept, which I did not use in this example, is experiment parallelism. When possible, multiple ideas can be pursued simultaneously, either by a single researcher or by whole teams working on completely different parts of the system for efficiency. Ideally, these changes should combine and lead to the best model. However, this is often not the case because of complex interplay and diminishing returns. Nevertheless, it's valuable to try, save the data, and perhaps plan a different mix of improvements to deepen the understanding of the system.
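One simple way to check whether two parallel improvements actually compose is to compare the combined gain against the sum of the individual gains; a sketch, again reusing the hypothetical node structure from above:

```python
def interaction_gap(parent, branch_a, branch_b, combined, metric: str = "test_acc") -> float:
    """How much the combined experiment deviates from the naive sum of the
    two individual improvements over their shared parent.

    ~0       -> the changes compose roughly additively
    negative -> diminishing returns or a conflict between the two ideas
    positive -> the ideas reinforce each other
    """
    gain_a = branch_a.metrics[metric] - parent.metrics[metric]
    gain_b = branch_b.metrics[metric] - parent.metrics[metric]
    gain_combined = combined.metrics[metric] - parent.metrics[metric]
    return gain_combined - (gain_a + gain_b)
```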
Summary
The search process can be viewed as a type of genetic algorithm, where certain evolutionary changes improve or degrade genetic traits, and the final offspring represents the best solution found for the problem. Fitness is determined by metrics and qualitative assessment, while the researcher is responsible for manually performing the mutation and crossover operations.
I have found that representing the research landscape as a tree (or DAG) helps to visualize the "gradient" of improvements, allowing one to better understand the relationships between ideas rather than relying on the linear run history of typical experiment-tracking dashboards. Being able to zoom out on the research process helped me visualize the exploration-exploitation tradeoff and stay aware of the sunk-cost fallacy.
However, this method is definitely not without downsides, particularly when it comes to fast prototyping. The research tree captures only about 30% of the total experiments performed, omitting numerous dead ends, minor training optimizations, subtle modifications, and simple mistakes. Many of these were either too minor to warrant a full evaluation, or there was no time to test them separately.
Future Ideas
I believe this approach could be a valuable way to store and share research maps with both humans and AI, providing structured context for new collaborators joining a project. LLM-based agents could potentially analyze patterns, compare findings with existing literature, and even suggest or implement improvements. Moreover, the proposed framework could improve reproducibility efforts by tracking not only what worked, but also why certain paths were abandoned: crucial information that's often lost in traditional publications.
The tree shown in this post was built manually throughout the research process. For deployment-focused teams, it might be useful to develop a tool that automates the construction and updating of the graph and integrates it into the MLOps pipeline.
I hope this blog will, at the very least, inspire others and encourage conversations about effective ways to approach research and complex, open-ended problem solving, which could be especially useful for early practitioners.
Acknowledgments
I’m sincerely grateful to Vladyslav Kyryk, Michał Wiliński, and Yassine Taoudi-Benchekroun for their insightful feedback on this blog.