The Reproducibility Challenge With Economic Data

Vilhuber writes: "In 1960, 76% of empirical AER [American Economic Review] articles used public-use data. By 2010, 60% used administrative data, presumably none of which is public use ..."

You can't just write to, say, the Internal Revenue Service and ask to see all the detailed data from tax returns. Nor can you directly access detailed data from Social Security or Medicare or a school district, or from what people reported in the US Census. There are obvious privacy concerns here. 

Thus, one change in recent years is the rise of what are called "restricted access data environments," where accredited researchers can get access to detailed data, but in ways that protect individual privacy. For example, there are now 30 Federal Statistical Research Data Centers around the country, mostly located close to big universities. Vilhuber writes (citations omitted):

It is worth pointing out the increase in the past 2 decades of formal restricted-access data environments (RADEs), sponsored or funded by national statistical offices and funding agencies. RADE networks, with formal, nondiscriminatory, albeit often lengthy access protocols, have been set up in the United States (FSRDC), France, and many other countries. Often, these networks have been initiated by economists, though widespread use is made by other social scientists and in some cases health researchers. RADE are less common for private-sector data, although several initiatives have made progress and are frequently used by researchers: Institute for Research on Innovation and Science, Health Care Cost Institute, Private Capital Research Institute (PCRI). When such nondiscriminatory agreements are implemented at scale, a significant number of researchers can obtain access to these data under strict security protocols. As of 2018, the FSRDC hosted more than 750 researchers on over 300 projects, of which 140 had started within the last 12 months. The IAB FDZ [a source of German employment data] lists over 500 projects active as of September 2019, most with multiple authors. In these and other networks, many researchers share access to the same data sets, and could potentially conduct reproducibility studies. Typically, access is via a network of secure rooms (FSRDC, Canada, Germany), but in some cases, remote access via ‘thin clients’ (France) or virtual desktop infrastructure (some Scandinavian countries, data from the Economic Research Service of the United States Department of Agriculture [USDA] via NORC) is allowed.


Data of this kind usually cannot be put into the public domain. Instead, you would need to apply for and gain access to the "restricted access data environment," and then work with the data there.

Another issue is that some of these data sources do not give researchers access to all of the data; instead, to protect privacy, each researcher receives an extract of the overall data. As a result, two researchers who go to the data center and make the same data request will not necessarily get the same data. If random samples are used, the overall patterns in the two extracts should be pretty close, but the extracts themselves won't be identical. Vilhuber writes:

Some widely used data sets are accessible by any researcher, but the license they are subject to prevents their redistribution and thus their inclusion as part of data deposits. This includes nonconfidential data sets from the Health and Retirement Study (HRS) and the Panel Study of Income Dynamics (PSID) at the University of Michigan and data provided by IPUMS at the Minnesota Population Center. All of these data can be freely downloaded, subject to agreement to a license. IPUMS lists 963 publications for 2015 alone that use one of its data sources. The typical user will create a custom extract of the PSID and IPUMS databases through a data query system, not download specific data sets. Thus, each extract is essentially unique. Yet that same extract cannot be redistributed, or deposited at a journal or any other archive. In 2018, the PSID, in collaboration with ICPSR, has addressed this issue with the PSID Repository, which allows researchers to deposit their custom extracts in full compliance with the PSID Conditions of Use.
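
To make the earlier point about random extracts concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how any actual data center draws its samples: the "full file" is synthetic lognormal income data, the 10% sampling rate is arbitrary, and the function name draw_extract is hypothetical. It shows two researchers receiving different random extracts of the same underlying file, with summary statistics that are close but not equal.

```python
import random
import statistics


def draw_extract(population, fraction, seed):
    """Return a simple random sample (without replacement) of the population.

    Hypothetical stand-in for whatever procedure a data provider uses.
    """
    rng = random.Random(seed)
    k = int(len(population) * fraction)
    return rng.sample(population, k)


# Synthetic "confidential" file: 100,000 made-up incomes (assumption, not real data).
pop_rng = random.Random(0)
population = [pop_rng.lognormvariate(10.5, 0.8) for _ in range(100_000)]

# Two researchers make the same request but receive independent 10% extracts.
extract_a = draw_extract(population, 0.10, seed=1)
extract_b = draw_extract(population, 0.10, seed=2)

mean_full = statistics.fmean(population)
mean_a = statistics.fmean(extract_a)
mean_b = statistics.fmean(extract_b)

print(f"full file mean: {mean_full:,.0f}")
print(f"extract A mean: {mean_a:,.0f}")
print(f"extract B mean: {mean_b:,.0f}")
print("extracts identical:", extract_a == extract_b)
```

A replication run on extract B would be expected to reproduce the pattern of results found with extract A, but not the exact published numbers, which is why depositing the specific extract (where the license allows it) matters for reproducibility.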
