How do AI engines pick which sources to cite?

Question

Accepted Answer

AI answer engines select citations through a three-stage pipeline: **retrieve, rank, and quote**. Retrieval pulls 10–50 candidate documents from a search index (Bing for ChatGPT Search and Copilot, a hybrid index for Perplexity, Google's own index for AI Overviews). Ranking re-scores those candidates using a combination of relevance, freshness, source authority, and, increasingly, model-judged answer quality. Quoting selects the 3–8 sources that actually appear in the final answer, biased toward sources whose content most directly contains the answer phrasing.

The signals that consistently move citation share across all major engines, ranked by impact:

1. **Topical relevance** of the page to the specific query phrasing. This dwarfs every other signal.
2. **Source authority**, measured by domain rating, Wikipedia/Wikidata presence, and consistent entity mentions across the web.
3. **Answer-shaped formatting**: clear headings, 40–80 word answer blocks, declarative first sentences, named statistics.
4. **Schema markup**: FAQPage, HowTo, Article, and Organization schema demonstrably increase parse rates.
5. **Freshness**, especially for "current state of X" queries where models heavily prefer timestamps under 12 months.
6. **Third-party validation**: Reddit threads, Wikipedia citations, and high-DR media mentions act as trust amplifiers in the retrieval re-ranker.

The 2023 GEO paper from Princeton/Georgia Tech/AI2 empirically validated that adding quotations, statistics, and citations increased visibility by up to 40%, and those findings have held across model generations.
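
To make the retrieve, rank, and quote stages concrete, here is a minimal Python sketch of how such a pipeline could be wired together. It is a toy model, not any engine's actual implementation: the `WEIGHTS` values, the `Candidate` class, the term-overlap retrieval, and the example URLs are all hypothetical, chosen only so that the weights follow the same order as the six signals in this answer.

```python
from dataclasses import dataclass

# Hypothetical weights (assumption): ordered to mirror the signal ranking
# in the answer. The numbers are illustrative, not measured.
WEIGHTS = {
    "relevance": 0.40,
    "authority": 0.20,
    "formatting": 0.15,
    "schema": 0.10,
    "freshness": 0.10,
    "validation": 0.05,
}


@dataclass
class Candidate:
    url: str
    text: str
    signals: dict  # signal name -> value normalized to [0, 1]


def retrieve(index, query, limit=50):
    """Stage 1: keep documents that share at least one term with the query."""
    terms = set(query.lower().split())
    hits = [doc for doc in index if terms & set(doc.text.lower().split())]
    return hits[:limit]


def score(candidate):
    """Weighted sum of the six signals; missing signals count as 0."""
    return sum(w * candidate.signals.get(name, 0.0) for name, w in WEIGHTS.items())


def rank(candidates):
    """Stage 2: re-score the retrieved candidates, best first."""
    return sorted(candidates, key=score, reverse=True)


def quote(ranked, max_citations=8):
    """Stage 3: cite only the top few sources."""
    return [c.url for c in ranked[:max_citations]]


# Toy index with made-up pages and signal values.
index = [
    Candidate(
        "https://example.org/news",
        "ai engines news roundup",
        {"relevance": 0.4, "authority": 0.9, "formatting": 0.3,
         "schema": 0.0, "freshness": 0.9, "validation": 0.8},
    ),
    Candidate(
        "https://example.com/guide",
        "how ai engines pick sources to cite",
        {"relevance": 0.9, "authority": 0.5, "formatting": 0.8,
         "schema": 1.0, "freshness": 0.7, "validation": 0.2},
    ),
    Candidate(
        "https://example.net/recipes",
        "weeknight pasta recipes",
        {"relevance": 0.0, "authority": 0.6},
    ),
]

query = "how do ai engines cite sources"
citations = quote(rank(retrieve(index, query)))
print(citations)
```

In this toy run the recipes page never reaches ranking because retrieval filters it out, and the guide outranks the higher-authority news page because relevance carries the largest weight. That is the practical reading of the ordering above: authority and freshness help only among pages that already match the query.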