HardToCache: An Automated Framework for Quantifying and Enhancing Reproducibility Risks in Computational Research
Creators
Description
Computational studies face a significant reproducibility crisis, with approximately 70% failing to replicate due to undocumented dependencies, environmental drift, and fragmented workflows. To address this, we introduce HardToCache, a framework for systematically quantifying reproducibility risks across research artifacts.
Our methodology uses an automated pipeline to: (1) collect code and data from publications with scraping tools, (2) score artifacts against an 8-dimensional scorecard, with each dimension rated from 0 to 5, and (3) share results through visualizations and open GitHub repositories. Our evaluation reveals critical gaps: few papers provide fully executable code or complete environment specifications. A minimal scoring sketch follows.
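The description does not list the scorecard's eight categories, so the sketch below uses hypothetical dimension names (code_availability, environment_spec, and so on) purely for illustration. It shows one plausible shape for the scoring step: rate each dimension 0-5, validate the range, and aggregate.

```python
from dataclasses import dataclass, fields

# Hypothetical scorecard dimensions: the description does not list the
# eight categories, so these names are illustrative assumptions only.
@dataclass
class ReproducibilityScorecard:
    code_availability: int
    data_availability: int
    environment_spec: int
    dependency_pinning: int
    documentation: int
    automation: int
    licensing: int
    result_verifiability: int

    def __post_init__(self):
        # Enforce the 0-5 rating range described for each category.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0 <= value <= 5:
                raise ValueError(f"{f.name} must be 0-5, got {value}")

    def total(self) -> int:
        """Aggregate score across all eight dimensions (maximum 40)."""
        return sum(getattr(self, f.name) for f in fields(self))

# Example: code is public but the environment is barely specified.
card = ReproducibilityScorecard(5, 4, 1, 0, 3, 2, 5, 1)
print(card.total())  # 21 out of 40
```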
The framework builds on Python (with libraries such as pandas and matplotlib), Google Colab, and GitHub to streamline data collection, analysis, and dissemination. Future work will extend the framework to biomedical and climate domains and develop browser extensions for real-time reproducibility scoring.
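As a rough illustration of the analysis-and-visualization step, the following sketch uses pandas and matplotlib (both named above) to summarize per-dimension scores across a toy corpus. The paper names and dimension columns are invented for this example, not the framework's published schema.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy per-paper scores; papers and dimension columns are invented and
# mirror the hypothetical scorecard sketch above, not a published schema.
scores = pd.DataFrame(
    {
        "code_availability": [5, 2, 0, 4],
        "environment_spec": [1, 0, 0, 2],
        "documentation": [3, 1, 2, 4],
    },
    index=["paper_a", "paper_b", "paper_c", "paper_d"],
)

# Mean score per dimension shows where a corpus falls short, e.g. the
# consistently low environment specification noted in the evaluation.
scores.mean().sort_values().plot(kind="barh", xlim=(0, 5))
plt.xlabel("mean score (0-5)")
plt.title("Average reproducibility score per dimension")
plt.tight_layout()
plt.savefig("scorecard_summary.png")
```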
HardToCache establishes a measurable foundation for auditing, improving, and promoting transparency in computational science.
Files

| Name | MD5 checksum | Size |
|---|---|---|
| Gateways2025_paper_13 (1).pdf | b72a6d9ca75e5ee7baca4c3763712d26 | 1.8 MB |