
Request for Proposals: AI for Forecasting and Sound Reasoning

We believe that modern AI has the potential to improve our ability to reason in a structured and quantified way, and thereby improve human decision-making. To help realize this potential, we are launching a request for proposals (RFP) for two related areas of interest:

AI for forecasting: We are looking for proposals for AI models that help to make forecasts more accurate or more relevant. We are primarily interested in probabilistic, judgmental forecasting, i.e. quantitative forecasts that cannot be based fully on large sets of structured data. Aside from models that directly produce forecasts, ideally approaching or exceeding human performance on forecasting tasks, we’re also looking to fund work on models that perform one or more of the subtasks involved in using forecasts for decision-making, such as explaining the reasoning behind forecasts or building forecasting models.

See “Further detail on our areas of interest” below for a more detailed discussion of what we’re looking for.

AI for sound reasoning: Modern AI models are being adopted at a rapid pace throughout society, including for high-stakes decisions in law, academia and policy. We expect this trend to continue over the coming years, and possibly accelerate. It seems crucial to us that models that are used for highly consequential decisions are generally truth-oriented, and support such tendencies among their users. We see two main paths to this goal that we’re interested in funding:

  • Research into understanding when models do and do not support sound reasoning, including evaluations of models with respect to principles of sound reasoning like truthfulness, meta-reasoning, or consistency.
  • Developing tools that directly help with specific tasks that are disproportionately helpful for clear reasoning, like fact-checkers, fact tracers, arbitrators, or argument analyzers.

See “Further detail on our areas of interest” below for a more detailed discussion of what we’re looking for.


We are primarily interested in the potential of AI models to improve high-stakes decision making, especially in the context of global catastrophic risks (GCRs). However, we expect that most projects we fund will also apply to decision-making with lower stakes, with only a minority of projects directly focusing on applications to GCRs. See our Q&A for more detail.


Apply here: https://airtable.com/appAxw3NwvGNOg0pe/pagu4hOkFRcmj9rpq/form


Submissions will be evaluated by Open Philanthropy’s Forecasting team. We will be accepting proposals at least until January 30, 2026. Until then, we expect to spend most of the Forecasting team’s grantmaking time on the RFP.

We plan to make grants totaling around $8-10 million for proposals submitted in this period. We expect the typical successful proposal will receive funding in the range of $100,000 to $1 million for a project duration between 6 months and 2 years. While AI for forecasting is a subcategory of AI for sound reasoning in principle, we are more confident about our vision for the former, and currently expect to end up with more projects in this area. However, this assessment will be subject to continued reevaluation, and may very well be reversed in the future.

We’ve seriously considered the risk that supporting projects to improve forecasting and epistemic tools might inadvertently accelerate AI capabilities — e.g. by generating useful training data or by incentivizing benchmark-climbing. However, given the relatively modest scale of funding, the focus on downstream tasks rather than fundamental model improvements, and the nature of the work (e.g. helping models extract, structure, and express knowledge more accurately), we believe the net effect is more likely to be positive. In particular, these tools seem well-suited to improving model transparency, calibration, and alignment with truth — qualities we consider central to mitigating catastrophic risk. We’re also taking care to define the scope of this RFP so it avoids incentivizing the development of capabilities with agentic or deceptive potential, instead favoring tools that increase understanding and support good decision-making.

Submission guide

Submitting your application

Please submit your proposal using this form. Include a short project description (up to 300 words) at the first stage, though you can add a more extensive description in your supporting material. For the short description, focus on concrete descriptions of the specific actions you and/or your team are planning as part of this project.

These descriptions can include rough guesses, and do not need to be polished. We strongly suggest that you aim to spend no longer than 4 hours filling out the form, and potentially much less than that if you have an existing description of your idea. We deliberately keep this form brief because it is the first stage of a longer process; if we decide to move forward with your application, we will request further information from you as needed. We will be accepting proposals on a rolling basis at least until January 30, 2026, but we encourage submissions before December 1, 2025 to be included in the first round of reviews.

We will aim to respond to you within 6-8 weeks of receiving your proposal. We may need additional time to respond in some cases — for example, if we decide to consult with external advisors, or if application volume is unexpectedly high.

We do not plan to provide feedback for rejected proposals in most instances; we want to focus on evaluating the most promising proposals and responding quickly to all of our applicants. If we move forward with your application, we will introduce you to the grant investigator who will evaluate your proposal in depth. They will then start a conversation with you, typically requesting further information as a first step. At this stage, you’ll have the opportunity to clarify and evolve your proposal in dialogue with the grant investigator and to develop a finalized budget. See this page for more details on the grantmaking process.

Eligibility

A wide range of projects are eligible for this RFP.

  1. Types of applicants: We’re open to funding both nonprofit and for-profit organizations (including academic institutions), as well as individuals. However, we prefer nonprofit organizations. This means that if your organization is for-profit or if you apply for individual funding, your project will have to clear a higher bar.
  2. Experience: We’re open to funding people at all levels of seniority and organizations at all levels of maturity, but we will value relevant experience highly. That said, relevant experience can come in many forms, and we especially encourage people without traditional academic credentials to apply.
  3. Location: There are in principle no restrictions with respect to location — you can apply from anywhere in the world. However, some locations pose legal and logistical challenges that may make it difficult to fund you. While we will do our best to fund all projects that excite us, this may be impossible in practice.
  4. Available funding: We plan to make grants totaling around $8-10 million for proposals received until January 30, 2026, and expect that in this period, the typical successful proposal will receive funding in the range of $100,000 to $1 million for a project duration between 6 months and 2 years.

What makes for a strong submission

Key criteria

When evaluating submissions, we will primarily consider (i) how likely the project is to be successful, and (ii) how much impact it could have on high-stakes decision-making if successfully completed.

We’ll be more confident about the success of your project if:

  1. The budget and project plan are realistic given the project’s goals.
  2. You and your team have relevant expertise and/or experience. For instance:
    1. A technical background in machine learning — whether it involves generative AI or more traditional approaches. We expect that some of the most exciting technical projects we fund will combine traditional and gen-AI approaches.
    2. A demonstrated track record of developing or marketing apps/products that have found a large user base.
    3. A relevant publication record or academic background (e.g. a Ph.D. in statistics, cognitive science, decision science).
    4. Experience with judgmental forecasting, e.g. experience as a forecaster or experience working for a forecasting platform. If you have relevant expertise or experience that isn’t directly obvious from your CV, we recommend you highlight it elsewhere in your application.
  3. Similar projects have previously been successful (for instance, you could show that the training or post-training approach you’re proposing has previously led to improvements on a task that is structurally similar to forecasting).

The following criteria will make us more confident about the project’s potential impact, though successful submissions won’t need to satisfy all of these (especially #6 and #7):

  4. Your submission clearly articulates how your approach will make forecasting more accurate or relevant (AI for probabilistic forecasting), or how it will improve sound reasoning relevant for high-stakes decision-making in a consequential way (AI for sound reasoning).
  5. Your submission includes a description of how your approach differs from the current standard in the industry or research area your project falls into, and why you expect it to have an advantage over alternatives.
  6. You/your organization have experience working with high-stakes decision-makers, or you can provide evidence of demand for this project from such decision-makers.
  7. Your submission is focused on reducing global catastrophic risks or provides a clear story for how your idea can be applied to this domain.

We understand that you don’t have much space to provide detail in the first submission round, so we only expect you to describe the rough outline of how your project fits these criteria. We will ask for additional detail if your project advances to the next stage.

Further detail on our areas of interest

AI for forecasting

We are looking for proposals for AI models that help to make forecasts more accurate and relevant. We are thinking primarily of forecasting that is

  1. Probabilistic: Quantitative forecasts that express uncertainty in terms of probabilities (though this might include the use of confidence intervals over continuous parameters). Probabilistic forecasts stand in contrast to qualitative forecasting methods like scenario analysis or expert interviews.
  2. Judgmental: Forecasts that require judgment calls by the forecaster, i.e. forecasts in a setting that does not allow straightforward derivation of forecasts via statistical models or trend extrapolation. Such judgment is required for decisions about new and untested technology, or when trying to estimate the risk of extreme events that have never occurred, like nuclear war. Judgmental forecasts stand in contrast to model-based quantitative forecasting as it is used in e.g. weather forecasting, supply-chain analysis, or financial trading, which usually require large sets of structured data in the domain of application.[1]Note that automated forecasters also require such datasets in training, but they do not require them after deployment.

That said, we see potential in combining probabilistic forecasts with qualitative approaches, and in combining judgmental forecasting with empirically informed quantitative modeling.

Projects in this area that we currently see as most promising mostly fall into two categories: Projects that work directly on automated forecasting, and projects that automate the “infrastructure” that connects forecasting to decision processes. We are also interested in evaluations and benchmarks, although we expect that they will take a smaller share of our funding in this area than in the AI for sound reasoning area.

We are most excited about projects that make AI forecasting (or AI-supported human forecasting) more:

  1. Accurate. Here, we’re thinking of accuracy in terms of how forecasts are assessed by a proper scoring rule (see Gneiting and Raftery, 2007, for an academic overview). Proper scores combine an evaluation of a forecast’s informativeness (did it assign high probability to events that actually occurred?) and calibration (did events assigned a probability of 70% occur roughly 70% of the time?), and can sometimes be decomposed precisely into these two components; see the sketch after this list for an illustration.
  2. Relevant. The forecasting questions are relevant for high-stakes decision-making, e.g., they are consequential for a large number of individuals. They should also be action-guiding. For instance, the probability of a candidate winning an election moving from 45% to 60%, or the predicted average global temperature in 10 years increasing by 0.1°C, won’t have clear policy implications for most individual actors, even if the forecast events themselves are consequential. Forecasts of conditional outcomes (if a carbon tax of X% is implemented, emission levels are estimated to be Y), and collections of forecasts of more fine-grained events (e.g. hurricane incidence in specific regions) will often be more relevant in that sense.
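To make the accuracy criterion concrete, here is a minimal Python sketch (illustrative only, with made-up forecasts and outcomes) of how a proper scoring rule like the Brier score can be computed and split into a calibration (reliability) term and an informativeness (resolution) term via the standard Murphy decomposition:

```python
# Illustrative sketch only: computing the Brier score for binary-outcome
# forecasts and splitting it into calibration and informativeness terms via
# the standard Murphy decomposition. The forecasts/outcomes below are made up.
import numpy as np

def brier_score(forecasts: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    return float(np.mean((forecasts - outcomes) ** 2))

def murphy_decomposition(forecasts, outcomes, n_bins: int = 10):
    """Return (reliability, resolution, uncertainty) such that
    Brier score = reliability - resolution + uncertainty.
    Reliability measures miscalibration; resolution measures informativeness."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    base_rate = outcomes.mean()
    bins = np.clip((forecasts * n_bins).astype(int), 0, n_bins - 1)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            weight = mask.mean()                    # fraction of forecasts in this bin
            mean_forecast = forecasts[mask].mean()  # average stated probability
            event_rate = outcomes[mask].mean()      # how often the event actually occurred
            reliability += weight * (mean_forecast - event_rate) ** 2
            resolution += weight * (event_rate - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

# A forecaster who says 70% is well calibrated if such events occur ~70% of the time.
forecasts = np.array([0.7] * 10)
outcomes = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
print(brier_score(forecasts, outcomes))           # ≈ 0.21
print(murphy_decomposition(forecasts, outcomes))  # ≈ (0.0, 0.0, 0.21)
```

In this toy example the forecaster is well calibrated but uninformative: the score equals the irreducible outcome uncertainty.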

Examples of projects we’re interested in funding

Automated Forecasting

  1. AI forecasters: Systems that can respond to forecasting questions as they might appear on a public forecasting platform. Typically, forecasting platforms ask for probabilities for discrete outcomes or for point estimates of a continuous variable. We’re interested both in AI forecasters that deal with these kinds of questions and in those that handle more complex questions — for instance, systems that output a full forecasting model involving explicit causal relationships between quantitative variables. Existing work in this area includes Schoenegger et al (2023), Halawi et al (2024), FutureSearch, and submissions to Metaculus’ AI Benchmark competition. The key metric for such systems will typically be accuracy. AI forecasters are currently nearing performance parity with pooled non-expert human forecasts, and we anticipate they will soon match or outperform top superforecasters (see our Q&A for why we believe this). This would make AI forecasting significantly cheaper, faster, and easier to deploy than human forecasting, enabling broader application in critical decision-making.
  2. Forecasting model builders: Quantitative forecasting models (e.g. causal influence diagrams or programs in a probabilistic programming language) are sometimes an important analytical tool that improves the performance of human forecasters. They also make forecasters’ reasoning more transparent to other forecasters, and to stakeholders who ultimately make use of the forecasts and rationales they produce. Developing AI tools that assist or automate the development of such models (perhaps after seeing a prompt with a qualitative explanation of a given model) could increase the performance and transparency of human and AI forecasters; the sketch after this list illustrates the kind of small Monte Carlo model such a tool might produce. We’re particularly interested in tools designed with AI forecasters in mind, given that we expect a growing share of forecasting work to be done by AIs in the future. Existing work in this domain includes SquiggleAI, an LLM-powered app that generates probabilistic programs for estimation. Model-building might improve forecasting on several dimensions, but will especially make it more relevant.
  3. Automated foresight or risk analysis: Another important class of tools could generate qualitative foresight analyses complementary to quantitative forecasting, in particular scenario planning: the concrete description of possible future events for the purpose of informing decision-making. For instance, users might start by describing the features of the future they’re envisioning and some criteria the scenario should satisfy. Then, the tool will suggest specific events or decision points that the scenario could contain. The tool might also run a simulation of the scenario (often called a “wargame”), though we’re unsure whether current AI capabilities will be sufficient for coherent real-time gameplay.
  4. Automated question generation for training and evaluating AI forecasters: One approach to building better AI forecasters involves generating and resolving new forecasting questions. Currently, public forecasting platforms offer a fairly limited set of questions, many of which have already been resolved (which makes it harder to use them to evaluate forecasting skill). Thus, generating a large and curated set of forecasting questions that resolve in the near-term future will likely improve the training process for AI forecasters. We’d be interested to see research into validating the usefulness of these question sets for increasing the accuracy of AI forecasters.
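As a concrete illustration of item 2 above, here is a minimal sketch of the kind of small probabilistic program a model-building tool might generate from a qualitative description. Every variable, distribution, and number is a hypothetical placeholder chosen for illustration, not an output of any existing tool or an estimate we endorse:

```python
# Minimal sketch of a probabilistic program a forecasting-model builder might
# generate. All distributions and parameters are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # Monte Carlo samples

# Assumptions, expressed as distributions rather than point estimates.
baseline_emissions = rng.normal(loc=100.0, scale=5.0, size=N)   # emissions index, current year
annual_drift = rng.normal(loc=-0.01, scale=0.01, size=N)        # yearly change without the policy
tax_passes = rng.random(N) < 0.4                                # 40% chance the carbon tax is enacted
tax_effect = rng.normal(loc=-0.10, scale=0.05, size=N)          # extra reduction if enacted

# Explicit causal structure: emissions in 5 years depend on drift and the policy.
policy_multiplier = np.where(tax_passes, 1.0 + tax_effect, 1.0)
emissions_in_5y = baseline_emissions * (1.0 + annual_drift) ** 5 * policy_multiplier

# Summarize the implied probabilistic forecast, including a conditional one.
print("median emissions index in 5y:", round(float(np.median(emissions_in_5y)), 1))
print("90% interval:", np.percentile(emissions_in_5y, [5, 95]).round(1))
print("P(emissions < 90 | tax passes):",
      round(float((emissions_in_5y[tax_passes] < 90).mean()), 2))
```

A model-building tool would aim to produce (and explain) this kind of explicit structure automatically, making the forecaster’s assumptions transparent and easy to adjust.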

Connecting Forecasting to Decision Processes

Using forecasting for decision-making requires “infrastructure” that connects the forecasts to the processes that determine the decision. A key part of such infrastructure will be to ensure that forecasts are addressing the right questions, and that forecasts are legible to decision-makers.

  1. Automated forecasting rationales: Forecasting rationales are explanations of the reasoning underlying a forecast, similar to a forecasting model, but typically of a more qualitative nature. Improving rationale quality might be an effective way to build better AI forecasters. Rationales can also make AI forecasters more legible, as a pure accuracy track record is often not sufficient to convince decision-makers (see also related commentary by FutureSearch). Rationales can also allow easier combination of forecasts, including with forecasts that use other sources of information, and they can be tailored to the needs of decision-makers with different backgrounds and idiosyncrasies.
  2. Automated question decomposition: Question decomposition is central to defining forecastable proxies for high-level, decision-relevant questions that are otherwise too vague to operationalize. A question decomposition tool would support this by taking ambiguous prompts like “What will be the overall effect of X?” and breaking them down into subquestions about measurable, forecastable quantities; the sketch after this list illustrates one possible output format. This often involves identifying intermediate outcomes, precursors, or necessary conditions that can be forecast directly. In the context of global catastrophic risks, where direct feedback is rare or delayed, this decomposition is especially critical.[2]For example, trying to forecast the odds of an engineered pandemic is more complicated in some ways than forecasting the odds of an economic recession; we have a lot of historical data about recessions, and practically none about engineered pandemics. So rather than looking at past data, we need to … While there has already been some research on this, we think AI has significant potential for expanding and accelerating it. Question decomposition will likely help with making forecasting more relevant.
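To illustrate item 2 above, here is a minimal sketch of one possible output format for a question-decomposition tool: a vague top-level question paired with concrete, resolvable subquestions. The class names, example questions, dates, and resolution sources are hypothetical placeholders, not a specification of any existing tool:

```python
# Minimal sketch of how a question-decomposition tool could represent its
# output: a vague top-level question broken into measurable, resolvable
# subquestions. Example questions, dates, and sources are placeholders.
from dataclasses import dataclass, field

@dataclass
class SubQuestion:
    text: str               # a concrete, forecastable proxy
    resolution_source: str  # where/how the question resolves
    resolution_date: str    # when it resolves (ISO date)

@dataclass
class Decomposition:
    top_level_question: str                       # the vague, decision-relevant question
    subquestions: list[SubQuestion] = field(default_factory=list)

decomp = Decomposition(
    top_level_question="What will be the overall effect of a carbon tax of X%?",
    subquestions=[
        SubQuestion("Will a carbon tax of at least X% be enacted by 2027-12-31?",
                    "official legislative record", "2027-12-31"),
        SubQuestion("Conditional on enactment, will national emissions in 2030 be "
                    "at least 5% below the 2025 level?",
                    "national emissions inventory", "2031-06-30"),
    ],
)
for sq in decomp.subquestions:
    print(f"- {sq.text} (resolves {sq.resolution_date} via {sq.resolution_source})")
```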

Evaluations and benchmarks

  1. Evaluations and benchmarks: We’re further interested in benchmarks that evaluate how relevant AI forecasting tools are, including tools for question generation. We’re also open to funding further evaluations for forecasting accuracy, but submissions would need to present a strong case for why they would be improvements over existing efforts like ForecastBench, Bench to the Future, and the Metaculus AI Benchmark.

AI for sound reasoning

Background

AI models, especially LLMs (and LLM-based models), are being incorporated into decision-making processes at a rapid pace across society, including in legal, academic and political domains. We expect this trend to continue over the coming years, and potentially to accelerate. It seems crucial to us that models that are used for highly consequential decisions follow principles of sound reasoning, and nudge their users towards following those principles as well.

By default, AI companies are strongly incentivized to create broadly capable, agentic models that can accomplish real-world tasks. In significant part, we expect them to do this by iteratively trying to improve performance on easy-to-measure tasks of clear economic importance. This will improve their ability to solve a wide variety of epistemic challenges, turning models into general-purpose digital assistants. They’ll likely integrate search and reasoning features, learn to perform tasks in unfamiliar domains without needing extensive retraining or human instruction (“learning on the job”), and excel at tasks like navigating browsers or APIs. We want to support work that takes increasing capabilities in such domains as a backdrop, and adds on top something that AI companies may not provide by default.

We expect that AI companies will have an incentive for their models not to commit egregious factual errors, and to implement high standards with respect to a model’s core function. Thus, we expect AI companies to reduce the rate of hallucinations / confabulations by default, and also expect that specific tools like research assistants will become more reliable, and will generally not make up academic papers or court cases once the technology has matured. However, we believe that there will still be many more subtle errors and biases that won’t hurt AI models’ commercial viability, but might have strongly adverse effects on the reliability of their reasoning.

We want to especially flag two directions:

  1. Research: We are interested in evaluations of AIs with respect to principles of sound reasoning like truthfulness, metareasoning, or consistency (see below), in work that investigates why AI models do or do not display these principles, and in work that develops techniques or builds relevant datasets to help AI display them more reliably.
  2. Developing AI tools that directly help with specific tasks that are disproportionately helpful for sound reasoning, like fact-checkers, arbitrators, argument analyzers, or fact tracers.

We believe that evaluations and related research matter not only because they help researchers improve on the relevant metrics, but also because they might create a race to the top among AI developers to improve their models on the evaluated characteristics. Therefore, we’d like the evaluation results to be easily accessible (e.g. on a dedicated website) and updated to include new models as they are developed. We will preferentially (but not exclusively) fund efforts that plan to do this, and are also potentially interested in projects entirely focused on curation and communication of existing benchmarks.


Examples of projects we’re interested in funding

Here we list examples of principles that seem relevant for sound reasoning and that could plausibly satisfy the above criteria:

  1. Truthfulness: Seeking truthful sources and making truthful statements is a core epistemic virtue, and there are several promising ways to operationalize it:
    1. Detection of false or misleading claims. The ability to identify incorrect or deceptive statements in longer contexts, including against adversarial pressure (i.e. in the context of texts deliberately meant to mislead).
    2. Informativeness. How much better informed (if at all) are humans after interacting with an AI model? We see room for several experimental studies in this area, across different domains of knowledge and types of interactions.

There are also several promising tools in this area like fact-checkers and fact-tracers, i.e. tools that start with a claim, search for sources, and recursively follow them. They might then output a judgment on whether the original claim can be supported (fact-checkers), or produce a report that provides context on the root-source of a claim (fact-tracers), e.g. tracing a common belief to its first mention, identifying the original/seminal work in a line of research, identifying the primary source for a quote, etc. Tools applying the concept of Community Notes to context-provision in other domains also seem promising.

  2. Meta-reasoning: We expect that the AI models most useful for sound reasoning will be highly aware of their own limitations and will be able to reason competently about their own reasoning.[3]Assigning mental states like “awareness” or competencies like “reasoning” to current AI models is controversial, but we mean it here in the restricted sense of AI models highly reliably behaving as if they possessed these states or competencies, in the relevant domain of application. Research in this area might measure one of the following:
    1. Calibration: Does the confidence AI models express in their own claims match actual accuracy? One potential avenue is to measure confidence in probabilistic terms, evaluated analogously to probabilistic forecasts. However, confidence could also be expressed in different ways, e.g. through the willingness or refusal to answer a prompt, through deference to other sources, through asking the user clarifying questions, etc., and we’d also be interested in proposals for how to quantify and evaluate those.
    2. Explanation: Evaluating whether models can accurately explain their own output. We might for instance test this by asking a model to justify an answer, then testing whether changing the factor it cites leads to a different answer. This method helps to assess whether stated rationales are genuine drivers of decisions. This method is related to the study of chain-of-thought faithfulness, but is distinct from this area in that the rationales would be generated after the original output is produced.
    3. Self-awareness: Are models aware of their own biases and inconsistencies? Can they predict whether they’ll be biased or inconsistent on a given question or topic? Systems with strong self-awareness could provide users with more reliable guidance about when to trust their responses.
  3. Consistency: An important characteristic of epistemically sound AI models is that their output should be internally consistent, robust across users, and not vulnerable to framing effects (giving answers that match “desired” responses implicit in the prompt). Promising properties in this cluster are (a minimal framing-robustness sketch follows this list):
    1. Vulnerability to framing: How much does the output generated by the model depend on framing effects? Is it possible to move responses in a particular direction (e.g. a particular political ideology) through subtle variations in otherwise semantically equivalent prompts?
    2. Sycophancy: Do language models systematically agree with or flatter users, especially when presented with controversial or value-laden topics? There is a reasonably large existing literature on this, but sycophancy seems to be a case where market incentives will consistently push in the wrong direction (since sycophancy, at least when mild, often increases user satisfaction) and can be particularly damaging for sound reasoning, so careful measurement work in this area still seems very valuable.
    3. Logical consistency: Does the model’s output adhere to (logical) consistency requirements?[4]See here for early research on this. Tests could measure whether models understand logical relationships between concepts (e.g. that properties like height are transitive), whether they adhere to the laws of probability (e.g. by not falling prey to the conjunction fallacy), and how frequently they directly contradict themselves.
  4. Navigating debates: Sound reasoning often involves engaging with contradictory evidence and opposing viewpoints. Research in this area could evaluate how models update (or don’t) in response to counterarguments or new evidence, or whether they can faithfully reproduce different viewpoints on a controversial question.
    • This principle of sound reasoning is also related to argument analyzers, a broad class of potential AI tools that analyze the structure of human arguments and suggest improvements, addressing several principles at once. Such tools could break down a text written by a human into different components (e.g. premises, conclusions, and the relationship between them), and spot errors, biases, and inconsistencies. Other versions of such tools could suggest counterarguments, or ways of strengthening the existing argument.
    • Another class of tool related to this principle would be arbitrators, tools that help different sides of a discourse to communicate with each other, one variant of which would be a disagree-and-bet operationalizer. Closely related to automated question generation in the forecasting context, a disagree-and-bet operationalizer reads through a discussion between people with opposing views (e.g. a debate transcript, or a series of articles replying to each other), and suggests bets the participants could make based on their views.
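As one concrete example of the kind of consistency evaluation described above, here is a minimal sketch of a framing-robustness check. `query_model` is a hypothetical stand-in for whatever model interface a project would actually evaluate, and the prompts and toy model are placeholders:

```python
# Minimal sketch of a framing-robustness check: ask a model several
# semantically equivalent phrasings of the same question and measure how much
# its numeric answer moves. `query_model` is a hypothetical stand-in for a
# real model API.
from statistics import mean, pstdev
from typing import Callable

def framing_sensitivity(query_model: Callable[[str], float],
                        prompt_variants: list[str]) -> dict:
    """Query each variant and report the spread of answers.
    A perfectly framing-robust model would return identical answers."""
    answers = [query_model(p) for p in prompt_variants]
    return {
        "answers": answers,
        "mean": mean(answers),
        "spread": pstdev(answers),  # 0.0 means fully consistent across framings
    }

# Example usage with a toy stand-in model (replace with a real model call).
variants = [
    "What is the probability (0-1) that global EV sales grow next year?",
    "Some analysts doubt it, but what is the probability (0-1) that global EV sales grow next year?",
    "Most experts expect growth; what is the probability (0-1) that global EV sales grow next year?",
]
toy_model = lambda prompt: 0.8 if "experts expect" in prompt else 0.7
print(framing_sensitivity(toy_model, variants))
```

A real evaluation would use many questions and paraphrase sets, and could report the average spread (or the rate of flips in categorical answers) as a framing-vulnerability score.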

Please note that the properties and tools listed above represent our current best guess about the most exciting work in this area. However, we are much more uncertain about this area than we are about AI for forecasting, and would not be surprised if some of the best projects we end up funding are missing from this example list. To assess whether a project idea of yours would be a good fit for this RFP, check whether it fits our background motivation and consult our submission guide.

Q&A

What kinds of progress are you expecting from AI forecasters?

We think AI forecasting models offer a scalable and rapidly improving opportunity. They are currently nearing performance parity with pooled non-expert human forecasts,[5]Some evidence for this includes the ForecastBench leaderboard (where AI models do slightly worse than superforecasters but outperform non-expert public participants in terms of Brier score as of Oct 8, 2025) and the Metaculus AI Benchmarking Tournament (where Metaculus Pro forecasters outperformed …) and we anticipate they will soon match or outperform top superforecasters. This would make AI forecasting significantly cheaper, faster, and easier to deploy than human forecasting, enabling broader application in critical decision-making.

  • Scalability: AI models offer cheap, on-demand forecasting, free from some human limitations like fatigue, bias, or the risk of leaking sensitive information.[6]While we acknowledge that AI models introduce biases and privacy risks of their own, we believe that these are easier to monitor and mitigate than in humans. They can be integrated into tools and adapted to institutional needs, and can handle multiple questions at once. Even without full automation, models can support individual forecasting steps — like decomposing questions, finding evidence, fitting models, or writing rationales. Once they reach human level, their scalability and availability will make them even more useful.[7]This argument rests on a few core assumptions: that inference costs of AI models will continue to fall (or at least won’t rise) relative to human wages, that output quality will keep improving (or at least not degrade) with scale and schlep, and that AI-generated forecasts will either soon …
  • Fast improvement: We also believe that the performance of AI forecasters, in particular LLM-based forecasters, is likely to improve significantly. Several studies and public benchmarks already suggest that GPT-4o-level models match the performance of average human forecasters across a range of domains (while still falling short of top-human performance),[8]See e.g. Schoenegger et al (2023), Halawi et al (2024), Karger et al (2024), and the most recent results from the Metaculus AI Benchmark. and there’s evidence of a consistent scaling trend: forecasting accuracy improves with training compute (see Figure 1). If this continues, frontier models might reach superforecaster-level performance before 2030 — even without bespoke fine-tuning or specialized forecasting architectures.[9]We also note that, in the Metaculus AI benchmark, which is run on a quarterly basis, AI systems did not significantly improve their performance between 2024 Q3 and 2025 Q1. However, this may be because the humans these AI systems are benchmarked against also have access to the most recent (and …

Figure 1: Brier score (a measure of forecasting skill; lower is better) vs training compute for selected frontier models and superforecasters. From Karger et al (2025).

Given this, we are especially excited about research or tools to improve LLM forecasting performance and/or integrate LLM-based forecasting systems into researcher and decision-maker workflows. While LLMs still suffer from failures in factual accuracy and quantitative reasoning, we think that combining them with approaches from traditional ML and applied statistics could boost their performance, transparency, and interpretability.

What do you mean by high-stakes decision-making?

One of Open Philanthropy’s key principles is to avoid scope neglect by focusing on the importance of an opportunity. We care deeply about how many individuals are affected by an intervention, and how large the impact is on their lives. As a result, the Forecasting team is especially interested in informing high-stakes decisions, i.e. decisions that have a sizable impact on a large number of individuals. Furthermore, since our team exists within Open Phil’s Global Catastrophic Risks portfolio, we are particularly interested in decisions about issues that could pose an existential threat to humanity, particularly issues related to AI and biosecurity.

The individuals making these decisions, high-stakes decision-makers, will often be lawmakers or high-ranking government officials, but depending on the context, they might also include business executives or members of academia, the media, and civil society. In the context of GCRs, we’re especially thinking about people developing high-risk technologies (e.g. scientists and executives at AI and biotech firms), as well as policymakers deciding on regulations and treaties related to those technologies. We want people in these positions to have access to high-quality tools and accurate forecasts as they carry out their work.

Given AI’s rapidly advancing capabilities, we expect that many high-stakes decisions will have to be made about its deployment, plausibly under significant time pressure. Between the many unprecedented features of this technology and the disorienting pace of progress, we think decision-makers are likely to make significant mistakes, which could lead to catastrophic consequences. One key reason for our interest in epistemic tools for high-stakes decision-making is to reduce the likelihood of catastrophic errors (though we remain interested in tools that are meant to be applied more generally — see below).

Should I apply if I have never worked on GCRs, and don’t intend to?

Despite our core interest in high-stakes decision-making, and specifically the reduction of Global Catastrophic Risks (GCRs), we believe that the principles behind accurate forecasting and sound reasoning are fairly universal in nature, such that projects that improve decision-making in a specific domain will often generalize to contexts we particularly care about. We therefore expect that most projects we fund will apply to decision-making in a fairly general way, and won’t need to have a direct link to the reduction of GCRs.

We still intend to favor projects that align with Open Phil’s priorities, and applicants who have a history of engaging with GCRs. But neither of these is a strict requirement, and we still encourage applicants whose projects fulfill some of our other criteria to apply.

We think it’s plausible that some of the most promising projects will be led by individuals who disagree with us about the likelihood or urgency of GCRs, and we expect to receive some strong submissions from experts in other domains. For instance:

  1. Some of the best submissions under “AI for probabilistic forecasting” might come from experts in established forecasting domains (such as weather forecasting, supply chain management, or financial risk assessment).
  2. Some of the best submissions under “AI for sound reasoning” might come from applicants with substantive experience in software development, or with a background in reasoning-related online communities (such as fact-checking websites, wikis, or discussion boards).

If you don’t intend to apply your expertise to GCRs, we don’t expect you to draw a link between your work and GCR-related applications in your submission. However, we would like the projects we fund to be generalizable — they should not be applicable only to domains outside our core priorities. As noted above, you can strengthen your submission by discussing the generalizability of your proposed project, and the types of decisions for which it might or might not make a difference.

What if I have other questions?

If you have any questions, please email us at forecasting@coefficientgiving.org.
