repboxr

Repbox

Author: Sebastian Kranz, Ulm University

The repbox project is a collection of tools (mostly R packages) that aim to facilitate reproducible research. For the project, I created the repboxr organisation on GitHub (the name repbox was already taken). The project currently concentrates on economics and the social sciences.

Main goals:

Help data editors and authors check the replication packages of scientific articles.
Create a systematic database for meta studies that contains mapped information from supplements that have been run and from articles. A focus will be on regression analyses.

Stata and R scripts in replication packages can be analyzed. Currently, more functionality is implemented for Stata scripts.

Far from being generally usable

Currently, all packages are in a pilot phase and the project needs substantial large-scale testing, improvement and documentation. The exact stage of development differs between packages, but it will take substantial effort before researchers can use repbox to run their own meta studies. Also, the overall design is far from settled and you should expect many breaking changes in the future.

Data editors can test the pilot version of repbox to check reproduction packages, but it is far from being usable for meta studies.

At the moment, more features are implemented for Stata than for R, such as the mapping of regression results. The goal is to have a similar feature set for both in the future. While almost all contributions have so far been made by Sebastian Kranz, the goal is that a future stable version will have a wide community including data editors, authors and researchers performing meta studies.

Testing pilot version of repbox as a data editor

If you want to test repbox as a data editor (or member of a data editor team), I recommend using it via GitHub Actions. There is a tutorial video here:

https://www.youtube.com/watch?v=T7DWBMzKboQ

and a short overview in the README.md here:

https://github.com/repboxr/gha_repbox_mono

In the future there shall be different ways to use the repbox toolchain to check reproduction packages:

Running directly on your local system
Running inside Docker containers
Running in a GitHub Actions pipeline or similar frameworks

To install repbox on your local system, you first need to install R. Then you can install all required packages by running in R:

install.packages('repboxverse', repos = c('https://repboxr.r-universe.dev', 'https://cloud.r-project.org'))

If possible, this will install binary packages for your OS, which are built and hosted by r-universe. I have not yet developed an example of using repbox on a local system, partly because I recommend testing it via GitHub Actions.
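
After installation, a quick check like the following can show which repbox packages are available in the local R library. This is only a sketch: the package names are taken from the list further below, and whether repboxverse pulls in every one of them is an assumption.

```r
# Package names as listed in this README; the selection is illustrative.
repbox_pkgs <- c("repboxRun", "repboxArt", "repboxStata", "repboxR",
                 "repboxReg", "repboxMap", "repboxHtml", "repboxDB")

# requireNamespace() returns FALSE instead of failing if a package is missing.
installed <- vapply(repbox_pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)
```

Packages reported as FALSE are not installed (or could not be loaded) on the current system.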

Packages and repositories

General packages

repboxRun: Functions that help running repbox analysis steps that are implemented in the different packages.

repboxArt: Analyse articles in PDF or HTML versions.

Convert PDF or HTML to a common representation that stores text information including sections, paragraphs and footnote markers.
Extract and store scientific tables (relying on the ExtractSciTab package explained further below).
Analyse keywords, like regression analysis, and links to tables or figures in the text.

repboxCodeText: Study keywords in comments of supplement script files.

Currently it only analyses whether comments in the Stata do files of supplements contain links to tables or figures.
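
The idea of finding table or figure references in do file comments can be sketched in a few lines of base R. This is not the repboxCodeText implementation: the function name find_comment_refs is hypothetical, and the sketch only handles full-line "*" comments and trailing "//" comments, ignoring /* ... */ blocks.

```r
# Find references such as "Table 7" or "Figure 2" in comments of a Stata do file.
# Simplification (assumption): only full-line "*" comments and trailing "//"
# comments are considered; /* ... */ block comments are ignored.
find_comment_refs <- function(do_lines) {
  is_star   <- grepl("^\\s*\\*", do_lines)
  has_slash <- grepl("//", do_lines, fixed = TRUE)
  # Keep only the comment part of each line (empty string if there is none)
  comment <- ifelse(is_star, do_lines,
             ifelse(has_slash, sub("^.*?//", "", do_lines, perl = TRUE), ""))
  m <- regmatches(comment,
                  gregexpr("(?i)\\b(table|figure)\\s+[0-9]+", comment, perl = TRUE))
  data.frame(
    line = rep(seq_along(do_lines), lengths(m)),
    ref  = as.character(unlist(m)),
    stringsAsFactors = FALSE
  )
}

do_lines <- c(
  "* Table 7: main results",
  "regress y x1 x2, robust",
  "summarize y // see Figure 2",
  "display \"Table 9\""
)
refs <- find_comment_refs(do_lines)
print(refs)
```

The last line of the example mentions "Table 9" inside a command rather than a comment, so it is deliberately not reported.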

repboxStata: Analyse, modify and run Stata do files in supplements

Contains a rudimentary Stata parser that can write Stata do files in a more canonical form. That is required to perform code injections and to systematically store information about the Stata scripts.
Allows preparing do files via specific code injections. Some code injections, like automatic path corrections, aim to increase the share of do files that run out of the box. Other injections help to systematically extract log information that can be stored in a database. While the prototype works, there are many special cases that will require adapting the code base. It will be a lengthy process.
Run the prepared Stata code in the correct order.
Extract the raw log information generated by the code injection and convert it into a format better suited for subsequent analysis.
The package also contains Stata code. These are mainly ado functions that will be called to perform path correction at runtime or extract information.
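
To illustrate what a code injection means here, the following sketch inserts a marker line after each regression command of a do file. It is a simplified illustration, not the repboxStata code: the ado name repbox_after_reg and the list of regression commands are assumptions made for this example.

```r
# Insert a marker line after each regression command of a Stata do file.
# "repbox_after_reg" is a hypothetical ado name used only for illustration.
inject_after_regressions <- function(do_lines,
                                     reg_cmds = c("regress", "reg", "ivregress", "xtreg")) {
  # First word of each line (leading whitespace removed)
  first_word <- sub("^\\s*([A-Za-z_]+).*$", "\\1", do_lines)
  is_reg <- first_word %in% reg_cmds
  out <- vector("list", length(do_lines))
  for (i in seq_along(do_lines)) {
    out[[i]] <- if (is_reg[i]) {
      # Keep the original command and add the injected line directly after it
      c(do_lines[i], sprintf("repbox_after_reg, line(%d)", i))
    } else {
      do_lines[i]
    }
  }
  unlist(out)
}

do_lines <- c(
  "use data.dta, clear",
  "regress y x1 x2, robust",
  "summarize y",
  "  xtreg y x1, fe"
)
mod_lines <- inject_after_regressions(do_lines)
writeLines(mod_lines)
```

The sketch fails for commands that span several lines, use prefixes such as quietly, or are abbreviated in unusual ways. Cases like these are the reason a parser that first writes do files in a canonical form is needed.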

repboxR: Analyse, modify and run R files in supplements

While repboxStata has a rather monolithic structure, repboxR uses several other helper packages described further below. They include sourcemodify for code analysis and injections, repboxRfun for functions that will be called by injected code for path correction and more.

repboxReg: Analyse regressions in Stata and R scripts

Currently mainly implemented for Stata scripts. The prototype for R is in the repboxR package.
Use code injection to extract detailed information after a regression command is run, including non-rounded values for coefficients, standard errors, t- and p-values. Also general regression statistics like R2 will be stored.
The code injection may itself cause errors. Typically, the Stata scripts will therefore be run twice: first without and then with the code injections related to the regression analysis.
We also extract information from a static analysis of the Stata commands and store it in a systematic fashion. One main example is the systematic extraction of the type of standard errors (including clustering variables).
We also store information about the data set used in the regressions, e.g. which variables are numeric, categorical or dummy variables.
Regression formulas will be stored in a canonical table representation that contains information about variable roles (e.g. explanatory variable or instrument), features like interaction effects or variable transformations.
In the results table, coefficient names are not normalized across Stata commands (or R commands). Using the systematic information extracted from the regression formulas, the repboxReg package transforms coefficient names into a canonical form that will also allow mapping back to the information extracted from the data set.
In the long run, I would like to have a framework that makes it easy to run meta studies that systematically study the effects of modifying existing regression analyses. E.g. one might check whether results are robust to different specifications of standard errors. My codename for that framework is metareg. The repboxReg package already contains functionality for the metareg project and tries to store information in a way that can be used for metareg studies. But the metareg project is still in its infancy.
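
The idea of a table representation of a regression formula can be sketched with base R. The function formula_to_table and its column names are hypothetical and do not reflect the actual repboxReg table format. The sketch covers only R formulas and does not handle instruments.

```r
# Convert an R regression formula into a flat table with one row per term.
# Column names are illustrative assumptions, not the repboxReg specification.
formula_to_table <- function(form) {
  term_labels <- attr(terms(form), "term.labels")
  dep_var <- deparse(form[[2]])
  rhs <- data.frame(
    term = term_labels,
    role = rep("explanatory", length(term_labels)),
    is_interaction = grepl(":", term_labels, fixed = TRUE),
    is_transformed = grepl("(", term_labels, fixed = TRUE),
    stringsAsFactors = FALSE
  )
  # Underlying variable names of each term, collapsed into one string
  rhs$vars <- vapply(term_labels, function(lab) {
    paste(all.vars(parse(text = lab)[[1]]), collapse = ",")
  }, character(1), USE.NAMES = FALSE)
  lhs <- data.frame(
    term = dep_var,
    role = "dependent",
    is_interaction = FALSE,
    is_transformed = grepl("(", dep_var, fixed = TRUE),
    vars = paste(all.vars(form[[2]]), collapse = ","),
    stringsAsFactors = FALSE
  )
  rbind(lhs, rhs)
}

tab <- formula_to_table(log(wage) ~ educ + exper * female + I(exper^2))
print(tab)
```

Because every term is a row with explicit flags, such a table can be filtered or joined with other tables, e.g. to find all regressions that contain an interaction with a particular variable.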

repboxMap: Map results of the different analysis steps and sources.

Currently it is all about mapping the numbers shown in tables in the article with corresponding commands and output from the supplements.
A particular focus is on mapping results from regression tables for which repboxReg stores information in a systematic way.
Also numbers in non-regression tables can be mapped, but less well.
Use case 1 is to help authors and data editors who want to check whether results shown in the article's tables can indeed be reproduced by running the code in the supplement or whether there are differences.
Use case 2 is to store the mapped information in a database that can be used for meta studies. In the future it would be nice if an LLM could extract better information from the article, e.g. the variable of interest in a regression, and if we could also map this information from the LLM.
We currently use quite crude heuristics and there is much scope for continuous improvement by studying a lot of articles and supplements.
The core structure currently distinguishes between a matching and a mapping step. The matching step relatively mechanically matches numbers (taking different roundings into account) between outputs of scripts and tables in the article. Typically, one part of a table has matchings with different outputs of several commands in the code. The mapping step tries to select the best matchings for a table. Mapping heuristics can use different information. E.g. all things equal it seems more likely that all values in a table's column are generated by the same command rather than by different commands, and that two different columns in a table are generated by similar types of commands (e.g. both by regression commands). Also keyword information from the table or code comments can help mapping. For example, if the analysis in repboxCodeText determines based on the comments in the code that some lines in a Stata script should correspond to Table 7 in the article, a mapping algorithm should prefer matchings from those commands. But mistakes can happen: possibly the code says "Table 7" while in the final version of the article the table has been renamed to "Table 8". Thus most mapping algorithms will likely rely on a point system to transform matchings into the preferred mapping.
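
The distinction between matching and mapping can be illustrated with a small sketch. It is not the repboxMap algorithm: the function names, the column names and the point weights (1 point per matched cell, 2 extra points if a code comment mentions the table) are assumptions chosen for this example.

```r
# Number of decimal places shown in a table entry such as "0.123".
shown_digits <- function(s) nchar(sub("^[^.]*\\.?", "", s))

# Matching step: does a script value equal the table entry once it is rounded
# to the number of digits shown in the table?
is_match <- function(table_entry, script_value, tol = 1e-9) {
  digits <- shown_digits(table_entry)
  abs(round(script_value, digits) - as.numeric(table_entry)) < tol
}

print(is_match("0.123", c(0.12349, 0.1241)))

# Mapping step: hypothetical matchings, one row per (table cell, command) pair.
# comment_hit marks commands whose code comments mention the table.
matches <- data.frame(
  col = c(1, 1, 1, 1, 2, 2, 2),
  cmd = c("reg_A", "reg_A", "reg_B", "reg_A", "reg_C", "reg_C", "sum_D"),
  comment_hit = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Simple point system: 1 point per matched cell, 2 extra points for a comment hit.
matches$points <- 1 + 2 * matches$comment_hit
score <- aggregate(points ~ col + cmd, data = matches, FUN = sum)

# For each table column, pick the command with the most points.
best <- do.call(rbind, lapply(split(score, score$col), function(d) {
  d[which.max(d$points), ]
}))
print(best)
```

Summing points per column and command favours mappings in which a whole column comes from the same command, while the comment bonus is only one input among several and can be outweighed if the table was renumbered.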

repboxHtml: Generate reports consisting of HTML pages that describe the results of repbox analyses.

Currently, we create HTML pages that show the script files on the left hand side. On the right hand side are the extracted tables from the articles with color coding to describe the mapping and quickly detect possible problems.

repboxDB: Functionality to systematically store data extracted in repbox analyses.

The database's abbreviation, e.g. used in function names and directories, is repdb.
We will store externally accessible data in a flat table format. This means we will not allow nested JSON structures. Rather, we follow the standard design of SQL databases and use key variables to link entries of different tables.
Key components are table specifications as YAML files (see e.g. https://github.com/repboxr/repboxDB/blob/main/inst/repdb/stata_cmd.yml). They describe the fields in each table, their data types and, optionally, an explanation of the fields.
Currently, not all table definitions are in the repboxDB package. E.g. the repboxArt package contains table definitions related to articles.
While also indices of the tables can be specified, they are not yet used.
Currently, table data will just be stored in R's native Rds format separately for each project. Information from multiple tables can be stored together in a single Rds file. We call such a collection a parcel. Internally, a parcel is an R list that can contain data frames of multiple tables. While the different packages store specific parcels, the exact specification of the parcels is not yet described in any YAML files.
Different parcels may contain information from the same table. E.g. there is a table reg that stores general information about a regression. We may have one parcel that stores information from regressions run in R scripts and another parcel that stores information from regressions run in Stata scripts.
The table definitions also allow other storage modes, e.g. a SQLite database.
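
The following sketch illustrates the flat table design and the parcel idea with base R only. The table name reg and the key artid appear in this README. The second table regcoef and all other column names are hypothetical, and the code does not use the repboxDB functions.

```r
# Two flat tables linked by key variables instead of nested structures.
# "regcoef" and the columns regid, cmd, var and coef are illustrative assumptions.
reg <- data.frame(
  artid = "aer_112_9_9",
  regid = c(1L, 2L),
  cmd   = c("regress", "ivregress"),
  stringsAsFactors = FALSE
)
regcoef <- data.frame(
  artid = "aer_112_9_9",
  regid = c(1L, 1L, 2L),
  var   = c("x1", "x2", "x1"),
  coef  = c(0.52, -1.30, 0.47),
  stringsAsFactors = FALSE
)

# A parcel is an R list of data frames that is stored in a single Rds file.
parcel <- list(reg = reg, regcoef = regcoef)
parcel_file <- tempfile(fileext = ".Rds")
saveRDS(parcel, parcel_file)

# Load the parcel again and join the tables via their key variables.
loaded <- readRDS(parcel_file)
coef_with_cmd <- merge(loaded$regcoef, loaded$reg, by = c("artid", "regid"))
print(coef_with_cmd)
```

Because all tables are flat and linked by keys, the same data could later be written to a SQLite database without restructuring.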

Repositories for GitHub Actions Pipelines

One way to use repbox is via GitHub Actions pipelines. The following repositories help:

GithubActions: Contains functions that make it easier to use GitHub Actions from R.

repboxGithub: Functions for GitHub interactions specific to repbox.

gha_repbox_mono: A template repo that can be adapted to run a repbox analysis of a single article & supplement via GitHub Actions.

Repositories for LLM analysis (do not yet exist)

One of the future goals is to add large language model analysis of the code and article text to the repbox framework. That could both benefit the checking of replication packages and augment databases for meta studies. But so far there is nothing.

Further utility packages

ExtractSciTab: Extract scientific tables from PDF.

The package considers text representations of article PDF files generated with pdftotext. It then uses heuristics to detect and extract scientific tables.
It is used by the repboxArt package, which further transforms the table information into a common format for tables extracted from PDF and HTML more suitable for further analysis.
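
As a rough idea of what such a heuristic could look like, the sketch below flags lines of pdftotext-style output that contain at least two number-like tokens. This is an assumption made for illustration and not the actual ExtractSciTab heuristic. The function names are hypothetical.

```r
# Count number-like tokens (e.g. 0.082, -1.5, 204) in each line of text.
count_numbers <- function(lines) {
  m <- gregexpr("-?[0-9]+(\\.[0-9]+)?", lines)
  vapply(regmatches(lines, m), length, integer(1))
}

# A line is a candidate table row if it contains at least min_numbers numbers.
is_table_row <- function(lines, min_numbers = 2) {
  count_numbers(lines) >= min_numbers
}

txt <- c(
  "Table 3: Wage regressions",
  "Education        0.082***     0.075***",
  "                 (0.011)      (0.012)",
  "We find that education raises wages.",
  "Observations     1,204        1,204"
)
print(data.frame(is_row = is_table_row(txt), line = txt))
```

A rule this simple also flags ordinary sentences that happen to contain two numbers, so a practical heuristic would need further signals such as column alignment or the proximity of a table title.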

sourcemodify: Analyse and modify R source code

Extends utils::getParseData to parse R source code and return the information about every token in a convenient data frame format. We augment the information, e.g. by determining which expressions are inside a function.
The package then allows code to be modified systematically, e.g. by replacing or wrapping certain function calls or function arguments.
The package is used by repboxR for static code analysis and to modify code for automatic path correction and extraction of regression specific information.
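
The token table that utils::getParseData returns can be inspected directly in base R. The sketch below only shows this starting point and does not use sourcemodify itself. The example script is made up.

```r
# A small example script given as text.
code <- paste(
  "dat <- read.csv('data/wages.csv')",
  "fit <- lm(wage ~ educ, data = dat)",
  "summary(fit)",
  sep = "\n"
)

# keep.source = TRUE is needed so that parse data is attached to the result.
parsed <- parse(text = code, keep.source = TRUE)
tokens <- utils::getParseData(parsed)

# All function calls with their position in the source code.
calls <- tokens[tokens$token == "SYMBOL_FUNCTION_CALL", c("line1", "col1", "text")]
print(calls)

# String constants are candidates for file paths that may need correction.
strings <- tokens[tokens$token == "STR_CONST", c("line1", "col1", "text")]
print(strings)
```

Since every token comes with its line and column, a modification step can replace or wrap exactly the call or argument it targets and leave the rest of the script untouched.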

repboxEvaluate: A fork of the evaluate package used by repboxR to evaluate R scripts in a way that facilitates systematic storage of results.

repboxDeploy: Tools to deploy a repbox project in different forms

Currently it contains only some very basic functionality to deploy a project's results as an example for repboxExamples.

repboxRfun: Contains functions called when modified R scripts are run.

Mainly functions for automatic path correction at run time and to extract and store information after a regression command was run.
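
A minimal sketch of run-time path correction, assuming the simplest possible strategy: if a path used in a script does not exist, search the project directory for a file with the same name. The function correct_path is hypothetical and not part of repboxRfun.

```r
# Return `path` if it exists. Otherwise look for a file with the same name
# anywhere below `project_dir` and return the first hit (NA if none is found).
correct_path <- function(path, project_dir) {
  if (file.exists(path)) return(path)
  candidates <- list.files(project_dir, recursive = TRUE, full.names = TRUE)
  hits <- candidates[basename(candidates) == basename(path)]
  if (length(hits) == 0) return(NA_character_)
  hits[1]
}

# Set up a small demo project in a temporary directory.
project_dir <- file.path(tempdir(), "demo_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(project_dir, "data", "wages.csv"),
          row.names = FALSE)

# The script refers to an absolute path from the original author's machine.
fixed <- correct_path("C:/Users/author/project/data/wages.csv", project_dir)
print(fixed)
```

Taking the first hit is a deliberate simplification. If several files share the same name, a more careful rule would be needed, e.g. preferring the candidate whose parent folders overlap most with the original path.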

repboxUtils: A collection of utility functions shared across repbox packages.

repboxverse: Similar to the tidyverse package, a package that helps with installing and loading all relevant repbox packages.

pkgFunIndex: Extract information about functions in R packages

The goal of this package is to generate systematic information about exported functions of all R package versions on CRAN.
That information could be helpful to determine appropriate package versions to run a historic replication package. But currently, the database is not yet generated and not used by any other repbox package.
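
For packages that are installed locally, the kind of information meant here can be collected with base R alone. The sketch below is not pkgFunIndex and covers only the installed version of each package, not all versions on CRAN. The function export_index is hypothetical.

```r
# Build a table of exported functions for locally installed packages.
export_index <- function(pkgs) {
  rows <- lapply(pkgs, function(pkg) {
    exports <- sort(getNamespaceExports(pkg))
    # Keep only exports that are functions (a namespace can also export data).
    is_fun <- vapply(exports, function(nm) {
      is.function(getExportedValue(pkg, nm))
    }, logical(1))
    data.frame(
      pkg = pkg,
      version = as.character(utils::packageVersion(pkg)),
      fun = exports[is_fun],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

idx <- export_index(c("stats", "utils"))
print(head(idx))
```

With such a table for many package versions, one could look up the versions in which a function used by an old script was still exported.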

Repositories for examples, issues and testing

The following repositories are so far pretty empty, but should be helpful in the longer run.

repboxExamples: Shall contain one or several example projects.

repboxIssues: Central repository to discuss repbox related issues.

repboxTests: Unit tests

Projects

A project refers to a particular research article, its data and code supplement and the corresponding results of the performed repbox analysis steps.

Every project has a separate directory. In the repbox database a project is identified by a short, unique artid, like aer_112_9_9. The basename of the project directory should be equal to its artid.

The exact structure of a project directory is not yet fixed. As of December 2023, the following subfolders can be found:

org contains the original code and data supplement
mod contains a modified code and data supplement including scripts modified by code injection
repbox contains extracted information from static and runtime analysis of the supplements. There can be subfolders for R and Stata.
art contains the original article (PDF or HTML) and information extracted from it
meta contains meta information from the article, currently always taken from the EJD database. After the repbox analysis is run, the meta information will also be stored in the table art.
map contains results from mapping information about extracted tables from the article and the results from the scripts.
reports contains HTML reports of the repbox results
metareg also contains extracted information from the repbox run, in a format that will facilitate future meta studies. Far from being settled and well documented.

The generated data parcels (see description of repboxDB above) of the different repbox steps are scattered around the different subfolders and can typically be found in a repdb subdirectory. To load them, the repboxDB::repdb_load_parcels function is recommended.
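
To get an overview of which parcel files a project contains, a few lines of base R are enough. The sketch below does not use repboxDB. The function list_parcel_files is hypothetical, the demo folder layout is made up, and the name of the parcel subdirectory is passed as an argument, so it can be adjusted to the actual layout.

```r
# List Rds files that are located in subdirectories with a given name.
list_parcel_files <- function(project_dir, parcel_dir_name) {
  files <- list.files(project_dir, pattern = "\\.Rds$", recursive = TRUE,
                      full.names = TRUE)
  files[basename(dirname(files)) == parcel_dir_name]
}

# Demo project in a temporary directory; the directory name equals the artid.
artid <- "aer_112_9_9"
project_dir <- file.path(tempdir(), artid)
# "repdb" is the database abbreviation mentioned above; adjust if needed.
parcel_dir_name <- "repdb"
dirs <- file.path(project_dir, c("repbox/stata", "art"), parcel_dir_name)
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
saveRDS(list(reg = data.frame(artid = artid)), file.path(dirs[1], "reg.Rds"))
saveRDS(list(art = data.frame(artid = artid)), file.path(dirs[2], "art.Rds"))

# The basename of the project directory should be equal to the artid.
stopifnot(basename(project_dir) == artid)

parcel_files <- list_parcel_files(project_dir, parcel_dir_name)
print(basename(parcel_files))
```

Each listed file can then be read with readRDS, which returns the parcel as a list of data frames.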

The idea is that appropriate information from the project directories can be aggregated and exported in different forms. For example, one might generate larger databases that allow to search for particular regression specifications used in articles.
