Artificial Intelligence And The Risks Of HARKing (Hypothesizing After-the-Fact)


Academics have long been aware of the risks of data mining: torturing the data until it confesses. The concern is that a correlation between variables does not imply a causal relationship. That is why the prevailing academic standard is for researchers to develop their hypotheses and predictions before testing them against the data. To minimize the risk that a finding is merely the result of a data mining exercise, in our book “Your Complete Guide to Factor-Based Investing,” Andrew Berkin and I recommend applying all of the following tests before considering an investment in a factor-based strategy. To start, the factor must provide explanatory power to portfolio returns and have delivered a premium (higher returns). Additionally, the factor must be:

  • Persistent — It holds across long periods of time and different economic regimes.
  • Pervasive — It holds across countries, regions, sectors, and even asset classes.
  • Robust — It holds for various definitions (for example, there is a value premium whether it is measured by price-to-book, earnings, cash flow, or sales).
  • Investable — It holds up not just on paper, but also after considering actual implementation issues, such as trading costs.
  • Intuitive — There are logical risk-based or behavioral-based explanations for its premium and why it should continue to exist.
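The requirement that a factor pass all of these tests, not just some of them, can be written down as a simple checklist. The following is a minimal sketch: the class and field names are hypothetical, and the true/false values would have to come from the empirical work itself, which this sketch does not perform.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FactorEvidence:
    """Hypothetical record of how a candidate factor fares on each test."""
    explains_returns: bool  # explanatory power and a delivered premium
    persistent: bool        # holds across long periods and economic regimes
    pervasive: bool         # holds across countries, sectors, asset classes
    robust: bool            # holds for various definitions
    investable: bool        # survives trading costs and implementation
    intuitive: bool         # has a risk-based or behavioral explanation


def failed_tests(evidence: FactorEvidence) -> list[str]:
    """Return the names of the tests that fail (an empty list means all pass)."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name)]


# Illustrative values only: a mined signal that looks good on paper
# but fails three of the tests.
mined_signal = FactorEvidence(
    explains_returns=True, persistent=True, pervasive=False,
    robust=False, investable=True, intuitive=False,
)
print(failed_tests(mined_signal))  # ['pervasive', 'robust', 'intuitive']
```

A single failed test is enough to reject the factor under this rule, which is what makes it a higher hurdle than statistical significance alone.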

These criteria have become even more important as the tools of artificial intelligence, including large language models (LLMs), have grown more powerful.


The Role of AI in Financial Research

Artificial Intelligence (AI) offers the intriguing potential to revolutionize investment decision-making by providing important advantages such as:  

Enhanced Data Analysis: AI can process and analyze vast amounts of data from various sources, including financial news, market trends, and company fundamentals, at a speed and scale far surpassing human capabilities. This enables investors to identify patterns, correlations, and anomalies that may be difficult for humans to detect.  

Improved Prediction Accuracy: AI algorithms can leverage historical data and machine learning techniques to build predictive models that forecast future market movements, asset prices, and investment returns with greater accuracy than traditional methods, in part because AI is not subject to the cognitive biases that affect human judgment.

However, using AI to build predictive models increases the risk of data mining. In their December 2024 paper, “AI-Powered (Finance) Scholarship,” authors Robert Novy-Marx and Mihail Velikov began by noting that prior research (on LLMs and novel research, generative AI, language models, the increasing use of LLMs in scientific papers, and scientific discovery) has shown that AI systems can not only meaningfully engage with economic reasoning and prediction but are also capable of testing scientific hypotheses in silico, that is, using computer programs and algorithms to model a system, simulate experiments, and analyze data.


Benefits of in silico testing:

  • Faster and cheaper: Compared to traditional lab experiments, in silico methods can be much faster and less expensive.
  • More efficient: Allows researchers to explore a wider range of possibilities and test more hypotheses in a shorter amount of time.

Novy-Marx and Velikov described a process for automatically generating academic finance papers using LLMs. They began by mining over 30,000 potential stock return predictor signals from accounting data and applied the Novy-Marx and Velikov (2024) “Assaying Anomalies” protocol to generate standardized “template reports” for the 96 signals that passed the protocol’s rigorous criteria. The protocol identifies issues that commonly arise when testing equity strategies, paying particular attention to arbitrage limits that can make a strategy look good on paper even when it cannot be profitably traded in practice. Each report detailed a signal’s performance in predicting stock returns using a wide array of tests and benchmarked it against more than 200 other known anomalies.

They then used state-of-the-art LLMs to generate three distinct complete versions of an academic paper for each signal. The different versions included creative names for the signals, contained custom introductions providing different theoretical justifications for the observed predictability patterns, and incorporated citations to existing (and, on occasion, imagined) literature supporting their respective claims. As my friend and co-author Andrew Berkin pointed out, this is emblematic of some of the problems that currently exist with AI: It will give you an answer, but not necessarily a correct one. For that reason, some call AI a “lying machine.”
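To see why mining tens of thousands of candidate signals calls for strict screening, consider a small simulation in which every "signal" is pure noise. This is a sketch under loudly simplified assumptions: the signals are independent Gaussian return series with zero true mean, the sample sizes are illustrative, and nothing here reproduces the authors' data or the "Assaying Anomalies" protocol.

```python
import random
import statistics


def t_stat(returns):
    """t-statistic for the null hypothesis that the mean return is zero."""
    n = len(returns)
    return statistics.fmean(returns) / (statistics.stdev(returns) / n ** 0.5)


def count_false_discoveries(n_signals, n_months, threshold, seed=0):
    """Count pure-noise signals whose |t| exceeds the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_signals):
        # Each hypothetical signal earns a zero-mean random monthly return.
        returns = [rng.gauss(0.0, 1.0) for _ in range(n_months)]
        if abs(t_stat(returns)) > threshold:
            hits += 1
    return hits


# Illustrative sizes only, chosen so the example runs quickly.
n_signals, n_months = 3_000, 120
for threshold in (1.96, 3.0):
    hits = count_false_discoveries(n_signals, n_months, threshold)
    print(f"|t| > {threshold}: {hits} of {n_signals} noise signals look significant")
```

At the conventional 1.96 cutoff, roughly 5% of noise signals clear the bar by chance alone. If the tests were independent, that rate applied to 30,000 candidates would imply on the order of 1,500 spurious "discoveries," each one available for an after-the-fact explanation. Raising the threshold shrinks that count sharply, which is the logic behind demanding a high hurdle.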

The “288 fully programmatically-generated papers contain introductions that follow standard academic conventions, developing theoretical arguments that connect the documented return patterns to established economic mechanisms, incorporating citations to existing (and, at least for now, on occasion hallucinated) literature. Each paper includes comprehensive descriptions of the data and methodology, detailed discussion of results, and contextualized conclusions.”

They then used a more advanced LLM (Claude 3.5-Sonnet) to generate the core textual content of each paper. For example, the introduction, composed of roughly 1,100 words, was subdivided into four sections to ensure a balanced, academically coherent narrative:

1. Motivation (200 words): Frames the research question within the broader asset pricing literature, discussing market efficiency, cross-sectional predictability, and recent developments in factor research.

2. Hypothesis Development (300 words): Proposes economic mechanisms justifying the signal’s predictive power, citing relevant theoretical and empirical studies to maintain a scholarly tone and contextualize the new factor.

3. Results Summary (300 words): Presents key empirical findings, highlighting statistical significance, robustness checks, and comparisons to established anomalies.

4. Contribution (300 words): Places the proposed signal in relation to 3–4 closely related studies, articulating how the new evidence enhances our understanding of systematic return drivers and contributes to ongoing debates in the literature.
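The section plan above can be expressed as a small configuration that checks the word budgets and paper counts reported in the article. This is only a sketch: the section titles, word targets, and counts come from the description above, while the `IntroSection` class, the `build_prompt` helper, and the prompt wording are hypothetical and are not the authors' actual prompts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntroSection:
    title: str
    target_words: int
    purpose: str


# Titles and word budgets as described in the text; purposes are abbreviated.
INTRO_PLAN = [
    IntroSection("Motivation", 200,
                 "Frame the research question within the asset pricing literature."),
    IntroSection("Hypothesis Development", 300,
                 "Propose economic mechanisms for the signal's predictive power."),
    IntroSection("Results Summary", 300,
                 "Present key empirical findings and robustness checks."),
    IntroSection("Contribution", 300,
                 "Relate the signal to 3-4 closely related studies."),
]


def build_prompt(section: IntroSection, signal_name: str) -> str:
    """Assemble a hypothetical prompt for one section of the introduction."""
    return (
        f"Write the '{section.title}' part of the introduction for the signal "
        f"'{signal_name}' in about {section.target_words} words. {section.purpose}"
    )


total_words = sum(section.target_words for section in INTRO_PLAN)
print(total_words)  # 1100, matching the roughly 1,100-word introduction

# 96 signals passed the protocol, with three paper versions per signal.
SIGNALS_PASSING_PROTOCOL = 96
VERSIONS_PER_SIGNAL = 3
print(SIGNALS_PASSING_PROTOCOL * VERSIONS_PER_SIGNAL)  # 288 papers

print(build_prompt(INTRO_PLAN[0], "hypothetical_signal"))
```

Fixing the structure and word budget up front is what lets the same template be reused across every signal, with only the signal name and its template report changing.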

All generated text adhered to a formal academic writing style, used active voice, carefully distinguished correlation from causation, avoided unwarranted claims, and used tense appropriately to separate established knowledge from new findings. In addition, citations were embedded using LaTeX-formatted references, and all writing conventions aligned with norms in leading finance journals.

The other sections of each manuscript, including Data and Conclusion, were generated following similarly structured prompts. Novy-Marx and Velikov added:
 

“While the papers and their theoretical frameworks are automatically generated, it’s important to note that all empirical analyses and statistical validations are conducted using rigorous methods developed in the academic literature, ensuring the reliability (if not the interpretation) of the underlying findings.”


Novy-Marx and Velikov noted:
 

“The process is remarkably efficient – while the data mining, validation, and generation of the PDF “template reports” from the “Assaying Anomalies” protocol takes about a day of computation time, the final paper generation takes minutes. This represents a dramatic acceleration compared to traditional research paper development.”


Their findings led Novy-Marx and Velikov to conclude:
 

“This experiment illustrates AI’s potential for enhancing financial research efficiency, but also serves as a cautionary tale, illustrating how it can be abused to industrialize HARKing (Hypothesizing After Results are Known).”


They added this caution:
 

“The ease with which AI can generate convincing theoretical frameworks that reference prior literature may inadvertently create a new form of academic arbitrage – where researchers can boost their citation counts through automated paper generation. It is actually easy to imagine a scenario in which entire fictitious sub-fields of a literature emerge in which all of the citations are from AI-generated papers to other reciprocally citing AI-generated papers.”


Investor Takeaways

Novy-Marx and Velikov provided a concrete demonstration of how LLMs can be used to automate the generation of academic finance papers at scale. As they put it: “Our results show that AI can now develop hypotheses at an unprecedented scale.” They also warned that the emergence of sophisticated AI systems capable of generating (multiple) plausible theoretical frameworks at scale poses novel challenges to the traditional mechanisms used to judge the reliability of research findings. The takeaway, then, is that because AI systems can produce hundreds of seemingly coherent theoretical explanations for mined empirical results, investors need to establish high hurdles before allocating to anomaly-based strategies.

