The Reproducibility Challenge With Economic Data

Yet another issue arises with data from commercial sources, which often require a fee to access: 

Commercial (‘proprietary’) data is typically subject to licenses that also prohibit redistribution. Larger companies may have data provision as part of their service, but providing it to academic researchers is only a small part of the overall business. Standard & Poor’s Compustat, Bureau van Dijk’s Orbis, Nielsen Scanner data via the Kilts Center at Chicago Booth (Kilts Center, n.d.), or Twitter data are all used frequently by economists and other social scientists. But providing robust and curated archives of data as used by clients over 5 or more years is typically not part of their service.

Research using social media data can pose particular problems for someone who wants to reproduce the study using the same data:

Difficulties when citing data are compounded when the data is either changing, or is a potentially ill-defined subset of a larger static or dynamic database. ‘Big data’ have always posed challenges (see the earlier discussion of the 1950s–1960s demand for access to government databases). By nature, they most often fall into the ‘proprietary’ and ‘commercial’ category, with the problems that entails for reproducibility. However, beyond the (solvable) problem of providing replicators with authorized access and enough computing resources to replicate original research, even defining or acquiring the original data inputs may be hard. Big data may be ephemeral by nature, too big to retain for significant duration (sometimes referred to as ‘velocity’), or temporally or cross-sectionally inconsistent (variable specifications change, sometimes referred to as ‘variety’). This may make computational reproducibility impossible. ... For instance, a study that uses data from an ephemeral social media platform where posts last no more than 24 hours (‘velocity’) and where the data schema may mutate over time (‘variety’) may not be computationally reproducible, because the posts will have been deleted (and terms of use may prohibit redistribution of any scraped data). But the same data collection (scraping or data extraction) can be repeated, albeit with some complexity in reprogramming to address the variety problem, leading to a replication study.
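As a purely illustrative aside, here is a minimal sketch (in Python) of what a replication-friendly collection script for that kind of ephemeral data might look like. Nothing in it corresponds to any real platform's API: the field names ("text", "message", "created_at") and the file layout are assumptions invented for the example. The point is only that recording when the data was collected and which fields were actually present gives a later replicator something concrete to compare a repeated data collection against.

```python
import json
import datetime as dt


def normalize(post: dict) -> dict:
    """Map whatever fields the platform currently exposes onto one stable schema."""
    return {
        # Hypothetical field names: tolerate a renamed text field across schema versions.
        "text": post.get("text") or post.get("message"),
        "created_at": post.get("created_at"),
        # Keep the raw record so a later replicator can see exactly what was collected.
        "raw": post,
    }


def archive(posts: list[dict], path: str) -> None:
    """Write the collected posts plus collection metadata into a replication archive."""
    record = {
        "collected_at": dt.datetime.now(dt.timezone.utc).isoformat(),
        "n_posts": len(posts),
        # Record which fields were actually present, since the schema may drift ('variety').
        "fields_seen": sorted({key for p in posts for key in p}),
        "posts": [normalize(p) for p in posts],
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)


if __name__ == "__main__":
    # Toy stand-in for a scraped batch; a real collection step would go here.
    archive([{"message": "hello", "created_at": "2023-01-01T00:00:00Z"}],
            "collection_2023-01-01.json")
```

Whether such an archive could then be shared is a separate question, since, as the quoted passage notes, terms of use may prohibit redistributing scraped data.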

Finally, there is the problem of "cleaning" data. "Raw" data always has errors. Sometimes data isn't filled in. Other times it may show a nonsensical finding, like someone having a negative level of income in a year, or an entry where it looks as if several zeros were added to a number by accident. Thus, the data needs to be "cleaned" before it's used. For well-known data, there are archives of documentation for how the data has been cleaned, and why. But for lots of data, the documentation for how it has been cleaned isn't available. Vilhuber writes:

While in theory, researchers are able to at least informally describe the data extraction and cleaning processes when run on third-party–controlled systems that are typical of big data, in practice, this does not happen. An informal analysis of various Twitter-related economics articles shows very little or no description of the data extraction and cleaning process. The problem, however, is not unique to big-data articles—most articles provide little if any input data cleaning code in reproducibility archives, in large part because provision of the code that manipulates the input data is only suggested, but not required by most data deposit policies.
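To make that last point concrete, here is a minimal sketch (in Python, using pandas) of what shipping documented cleaning code with a paper might look like. The file names, the "income" column, and the threshold for an "implausible" value are assumptions invented for the example; the substance is that each rule mentioned above (negative incomes, suspicious extra zeros, values that simply aren't filled in) is written down in code and logged, so a replicator can see exactly what was done to the raw data and why.

```python
import pandas as pd

# Hypothetical input: a raw file with an "income" column.
raw = pd.read_csv("raw_survey.csv")
cleaning_log = []

# Rule 0: coerce non-numeric entries to missing rather than letting them fail silently.
raw["income"] = pd.to_numeric(raw["income"], errors="coerce").astype(float)

# Rule 1: negative incomes are treated as nonsensical and set to missing, not dropped.
n_negative = int((raw["income"] < 0).sum())
raw.loc[raw["income"] < 0, "income"] = float("nan")
cleaning_log.append(f"Set {n_negative} negative income values to missing")

# Rule 2: implausibly large values (possible accidental extra zeros) are flagged for
# review, not silently altered. The threshold is an assumption for illustration only.
IMPLAUSIBLE = 10_000_000
raw["income_flagged"] = raw["income"] > IMPLAUSIBLE
cleaning_log.append(f"Flagged {int(raw['income_flagged'].sum())} implausibly large incomes")

# Rule 3: record how much is simply not filled in after cleaning.
cleaning_log.append(f"{int(raw['income'].isna().sum())} income values missing after cleaning")

# Ship both the cleaned file and the log in the reproducibility archive.
raw.to_csv("cleaned_survey.csv", index=False)
with open("cleaning_log.txt", "w") as log:
    log.write("\n".join(cleaning_log) + "\n")
```

None of this is sophisticated, which is precisely the point of the quoted passage: even this level of documentation is rarely included in reproducibility archives, because most data deposit policies only suggest, rather than require, providing the code that manipulates the input data.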