Conversational bibliometrics needs a recipe, not just ingredients
Research thoughts bite: 09.
Key takeaways:
Bibliometric analyses are shaped by infrastructural design choices as much as by explicit methodological decisions.
Conversational AI for bibliometrics requires external structural constraints (Model Context Protocols) to prevent hallucinations and ensure methodological validity.
Once mediated by MCPs, bibliometric systems implicitly govern which methods and questions are legitimate.
Bibliometric systems are often treated as neutral tools for measuring scientific activity, with methodological debates focusing on indicator choice, data coverage, or normalisation strategies. Yet bibliometric knowledge is also shaped by infrastructure (Wouters, 2014): how data are accessed, which classifications and analyses are available out of the box, and which assumptions are embedded in default representations and metrics. As conversational interfaces and large language models are increasingly tested as new entry points to bibliometric data, these infrastructural choices become more consequential. In this Research Musings, I show that conversational bibliometrics is not a simple interface change: it shifts power over bibliometric knowledge by embedding methodological decisions in infrastructure rather than leaving them to interpretation.
From printed data to interactive systems
Data sharing has evolved alongside computers, shifting how data are accessed and how knowledge is shaped. Early technical data were often shared through printed tables and handbooks: production was computer-assisted, and consumption was static. When Eugene Garfield introduced the Science Citation Index (SCI) in the 1960s, it relied on computers for large-scale citation matching, while dissemination was annual and printed. Users did not query a database; they consulted a static set of printed tables, accompanied by expert interpretation. Computers enabled scale, but interpretation remained largely centralised (Garfield, 1979). As computing power and personal access expanded, data could be distributed digitally, first via magnetic tape and later on CD-ROMs, but the data remained disconnected and local, evolving only through static snapshots. Bibliometric methods grew more sophisticated (e.g., co-citation analysis, bibliographic coupling, network-based approaches), but their use was largely confined to specialists with both data access and technical expertise.
The Internet marked a further shift, introducing the ability to search and filter, and later dashboards. Users could explore, but only within the boundaries set by a web application; questions were shaped by the available filters. APIs and SQL-based access reopened the data layer. For the first time, bibliometric data could be embedded into other systems, recombined across sources, and used to build custom analyses. Open infrastructures such as OpenAlex aggregate multi-sourced bibliometric data into a unified representational model and provide public interfaces to these records without requiring paid access. However, they still embed consequential modelling decisions about entities, classifications, and coverage that are difficult to bypass without reconstructing the database itself. What is opened, in this sense, is access and reuse, not control over the data-making process.
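To make concrete what this kind of programmatic access looks like, here is a minimal sketch of a query against the OpenAlex works endpoint; the filter syntax, parameters, and response fields follow the public OpenAlex documentation as I understand it, so treat the specifics as assumptions to check against the current API reference rather than a definitive client.

```python
import requests

# Minimal sketch: count and sample works from a given publication year via OpenAlex.
# Endpoint, filter syntax, and response fields are assumptions based on the public docs.
BASE_URL = "https://api.openalex.org/works"

params = {
    "filter": "publication_year:2020",   # OpenAlex filters are key:value pairs
    "per-page": 5,                        # small page, for illustration only
    "mailto": "you@example.org",          # identifies the caller (polite pool)
}

response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()
payload = response.json()

print("Total works matching the filter:", payload["meta"]["count"])
for work in payload["results"]:
    print(work["publication_year"], work["cited_by_count"], work["display_name"])
```

Even in this tiny example, the available filters and the underlying entity model (works, authors, institutions, and so on) are fixed by the provider: the API opens access and reuse, but the representational choices remain embedded in the endpoint itself.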
At the same time, indicators became the dominant tool through which bibliometric results were communicated. Indicators such as the journal impact factor and the h-index reduced complex scholarly performance to single numerical values while resting on problematic assumptions about comparability across fields and the skewed distribution of citations. Both metrics also incentivised gaming behaviours such as self-citation rings and citation clubs. Table 1 below summarises this progression, showing how bibliometric infrastructures moved from restricted data release to formal openness that nonetheless remained epistemically inaccessible.
These assumptions were reinforced by how research was classified. Historically, bibliometric datasets used a journal-centric model: subject categories were assigned at the journal level and then applied to all articles within it. This approach, rooted in Garfield’s original citation indexing framework and maintained by Clarivate, treated journals as coherent disciplinary containers and made journal-level normalisation both practical and conceptually central. Dimensions marked a structural break by introducing article-level classification as a default (Hook et al., 2018). Using machine learning, each publication could be assigned to multiple fields independently of the journal in which it appeared. This shifted attention away from the journal as the primary epistemic unit and made interdisciplinarity visible without requiring dedicated analysis. It also mirrored a broader shift in access: instead of printed journals circulating through libraries, articles were now discovered independently of their journal through aggregated databases.
This shift expanded the available analytical tools but also reshaped how bibliometric knowledge was produced and used. Although APIs and SQL-based access enabled sophisticated analyses beyond precomputed indicators, they also reinforced reliance on aggregates and shortcuts for most users. As a result, bibliometric knowledge continued to reflect assumptions embedded in systems designed primarily for expert use, even as formal access to the data layer became more open.
The move from curated indicators to APIs and SQL was often framed as a democratisation of bibliometric knowledge. In practice, it relocated responsibility for understanding data construction, coverage, and methodological implications almost entirely to the user: an arrangement that implicitly assumed substantial bibliometric expertise. Conversational bibliometrics aims to rebalance this by allowing users to articulate complex questions without direct engagement with schemas or query languages. However, this shifts the responsibility for methodological validity back to the data provider, who must ensure that conversational access is constrained by explicit and enforceable analytical rules, and that these constraints are transparent to users.
But, as bibliometric systems became more open and powerful, they also became harder to use correctly. The relationship between what users can do and what they need to know has shifted across eras in revealing ways.
The graph highlights that expertise requirements and analytical freedom do not share a linear relationship. While access through curated APIs and Google BigQuery (4a) grants users significant freedom, reaching the “maximal freedom” of the Open Data era (4b) imposes a massive technical tax, requiring users to reconstruct databases and handle raw metadata to bypass embedded modelling decisions. In both cases, this agency depends on high levels of technical and methodological expertise, effectively shifting the entire responsibility for validity and interpretation onto the user.
Conversational bibliometrics (5) attempts to reconfigure this by lowering the expertise barrier at the point of interaction, but without restoring equivalent safeguards for methodological validity. As a result, conversational systems make the non-linear relationship between expertise and freedom consequential: they lower the perceived expertise required to perform analyses, while leaving the burden of methodological validity unresolved and largely invisible.
Limits of ‘conversational bibliometrics’
Could bibliometrics have its “ChatGPT moment”? What would that even mean? Researchers, institutions, funders, and publishers might imagine conversing with bibliometric systems to explore trends, test assumptions, or discuss the implications of decisions. Custom GPTs, Gems, and direct LLM API integrations have already attempted this by providing models with schema documentation and task-specific instructions.
In my experience, these approaches (using post-trained conversational LLMs as the primary analytical engine) often fail. Even when supplied with controlled vocabularies, explicit identifiers, and detailed instructions, LLMs will hallucinate non-existent database fields, substitute fuzzy name matching for required identifiers (such as GRID IDs), or bypass methodological constraints. The outputs are fluent and plausible, but analytically invalid.
These failures are not primarily a prompt-engineering problem. In my own experiments, adding more instructions or documentation often degrades performance rather than improving it. As input length increases, reasoning deteriorates well below context window limits, and models revert to pattern completion drawn from training data (for example, repeatedly introducing a fictitious citation.year field). As a result, even deterministic configurations can produce invalid analyses. The issue is not conversational access itself, but the assumption that analytical validity can be reliably enforced through post-training and prompting rather than through executable constraints.
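To illustrate the difference, here is a minimal sketch of an executable constraint: a validation step that rejects hallucinated fields and fuzzy organisation references before any query runs. The field whitelist, the GRID ID pattern, and the function name are illustrative assumptions, not an actual platform schema.

```python
import re

# Illustrative whitelist of queryable fields; a real platform schema would differ.
ALLOWED_FIELDS = {
    "publications.year",
    "publications.times_cited",
    "publications.research_orgs",
}
# Rough pattern for GRID identifiers such as "grid.4991.5" (illustrative, not authoritative).
GRID_ID = re.compile(r"^grid\.\d+\.[0-9a-f]+$")

def validate_query(fields: list[str], org_ids: list[str]) -> None:
    """Reject hallucinated fields and non-identifier organisation references before execution."""
    unknown = [f for f in fields if f not in ALLOWED_FIELDS]
    if unknown:
        # Catches fabrications like a fictitious "citation.year" field.
        raise ValueError(f"Unknown field(s): {unknown}")
    fuzzy = [o for o in org_ids if not GRID_ID.match(o)]
    if fuzzy:
        raise ValueError(f"Organisations must be GRID IDs, not names: {fuzzy}")

# A fluent but invalid LLM proposal fails fast instead of producing a plausible wrong answer:
try:
    validate_query(["citation.year"], ["University of Somewhere"])
except ValueError as err:
    print("Rejected before execution:", err)
```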
A second, less visible limitation concerns methodology rather than correctness. Post-trained bibliometric LLMs do not operate within a single methodological tradition; they borrow freely from adjacent fields such as information retrieval, network science, and machine learning. While this methodological borrowing can be productive and innovative, and bibliometrics as an applied field has often benefitted from it, it also risks presenting non-canonical or field-external techniques as default bibliometric practice. In conversational settings, where outputs are framed as answers rather than suggestions, this blurs the distinction between exploratory analysis and established method, and risks treating attractive analogies as canon.
Executable constraints make this distinction explicit. By formalising which workflows are recognised as valid defaults, MCP-mediated systems can support methodological innovation without silently redefining what counts as bibliometric practice.
Model Context Protocols
That is where Model Context Protocols (MCPs) come in. MCPs are standardised interfaces that allow LLMs to access external tools and data sources through controlled, auditable gateways. MCPs matter here not only because they reduce errors, but because they determine which analytical constructions can exist at all. Think of it like baking cookies. You could give someone unrestricted access to your kitchen, say “make chocolate chip cookies”, and hope for the best. Or you could provide a detailed recipe that specifies butter temperature (room temperature vs. melted vs. cold creates different textures), mixing techniques, and baking times. The difference between “cookies” and “delicious cookies” lies in these seemingly small details, as illustrated in this serious cookie experiment. MCPs are the recipe that ensures the data (the ingredients) are handled properly; they do not define analytical workflows, but they determine which formally specified workflows can be executed, logged, and reproduced within the system. Once analytical workflows are formalised and passed through MCP-enforced execution, they cease to function as informal methodological guidance and instead become durable, auditable procedures.
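As a sketch of what such a recipe can look like in code, the following exposes a single constrained bibliometric workflow as an MCP tool using the MCP Python SDK's FastMCP helper; the import path and decorator reflect my reading of the official SDK, and the tool itself (name, parameters, checks) is a hypothetical stub rather than a real Dimensions workflow.

```python
from mcp.server.fastmcp import FastMCP  # official MCP Python SDK (import path assumed)

# A minimal MCP server exposing one formally specified bibliometric workflow.
mcp = FastMCP("bibliometrics-gateway")

@mcp.tool()
def field_normalised_citations(grid_id: str, start_year: int, end_year: int) -> dict:
    """Field-normalised citation impact for one organisation, identified strictly by GRID ID.

    The gateway, not the language model, decides how the analysis is constructed:
    identifiers are required, the citation window is bounded, and every call is logged.
    """
    if not grid_id.startswith("grid."):
        raise ValueError("Organisations must be referenced by GRID ID, not by name.")
    if end_year - start_year > 10:
        raise ValueError("Citation windows longer than 10 years are not a supported default.")
    # ... execute the curated, versioned workflow against the data backend here ...
    return {"grid_id": grid_id, "window": [start_year, end_year], "score": None}  # stub

if __name__ == "__main__":
    mcp.run()  # serve the tool over the protocol to any MCP-capable client
```

The point is not the specific checks but where they live: in versioned, server-side code that any conversational front end has to pass through.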
In bibliometric systems, MCPs can be implemented as gateways stored on the server side of platforms like Dimensions, where they can be versioned, curated, and linked to specific categories of tasks. This means researchers can use any LLM interface they prefer, including proprietary tools like ChatGPT, Gemini, or Claude, or open models such as GPT-OSS or Mistral, confident that the MCP gateway ensures consistent, methodologically valid execution behind the scenes.
When a user asks a question, the MCP gateway structures and validates the analytical workflow before any data is accessed. This ensures that bibliometric tasks follow explicit rules (using GRID IDs rather than fuzzy name matching, or enforcing field-normalised citation windows) even if the user never sees these constraints directly. In baking terms, it makes sure that no ingredient is substituted (baking powder leads to different cookies than baking soda, as demonstrated in this video), no steps are skipped, and no timings are quietly adjusted.
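One way to picture the recipe the gateway enforces is as a declarative workflow specification that incoming requests are checked against before anything executes; the structure and field names below are entirely hypothetical.

```python
# Hypothetical, declarative "recipe" that the gateway validates requests against.
# Field names and rules are illustrative, not an actual platform specification.
CITATION_IMPACT_RECIPE = {
    "workflow": "organisation_citation_impact",
    "version": "1.2.0",                       # versioned server-side, like any recipe revision
    "required_inputs": {
        "organisation": {"type": "identifier", "scheme": "GRID"},  # no fuzzy name matching
        "years": {"type": "range", "max_span": 10},
    },
    "fixed_steps": [
        "resolve_publications_by_grid_id",
        "apply_field_normalised_citation_window",  # normalisation is not optional
        "aggregate_by_field_and_year",
    ],
    "logged": True,                            # every execution is auditable
}
```

Whether a proposed analysis runs at all then depends on whether it matches a specification like this, not on how persuasive the conversational layer sounds.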
In this framing, a robust conversational system does not rely on a single cook. One component interprets the request and identifies the right recipe. Another follows that recipe precisely, applying the defined rules and transformations. A third checks that the result makes sense given the inputs and assumptions. Conversation remains central (it is how questions are asked), but correctness depends on separating interpretation from execution and enforcing structure at each step.
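A minimal sketch of that division of labour, with each component reduced to a hypothetical stub:

```python
# Hypothetical division of labour; names, signatures, and identifiers are illustrative only.

def interpret(question: str) -> dict:
    """Conversational layer: map a natural-language question to a named recipe plus typed inputs."""
    # In a real system this is the LLM's job; its output is a plan, not an answer.
    return {"recipe": "organisation_citation_impact",
            "inputs": {"organisation": "grid.4991.5", "years": (2015, 2024)}}

def execute(plan: dict) -> dict:
    """Execution layer: run the versioned workflow exactly as specified; no improvisation."""
    raise NotImplementedError("Runs server-side behind the MCP gateway.")

def check(plan: dict, result: dict) -> dict:
    """Validation layer: confirm the result is consistent with the plan's inputs and assumptions."""
    raise NotImplementedError("Compares outputs against the declared constraints and logs the run.")

def answer(question: str) -> dict:
    plan = interpret(question)   # conversation at the front
    result = execute(plan)       # enforced structure in the middle
    return check(plan, result)   # audit at the end
```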
Without MCPs, conversational bibliometrics risks producing answers that are fluent, confident even, but wrong. With MCPs, LLMs become more useful: they stop improvising in the kitchen and become a trained assistant working from a trusted cookbook.
Who designs the cookbook?
This shift from indicators to workflows introduces a new form of epistemological infrastructure, and with it, new questions about authority and participation. If bibliometric knowledge increasingly circulates as executable procedures rather than static outputs, then control over workflow design becomes a form of methodological power.
Design choices about bibliometric indicators extend beyond what is displayed to how analyses are constructed by default. For instance, while Dimensions does not surface the h-index as a default summary measure (to reflect critiques of composite indicators and alignment with recommendations such as DORA), it does apply field-normalised citation metrics as a standard analytical baseline. This choice privileges comparative interpretability across disciplines while downplaying raw citation counts, illustrating how bibliometric platforms already govern knowledge production by embedding methodological preferences into defaults rather than prohibiting alternative analyses. MCP-mediated workflows extend this logic: governance shifts from curating visible indicators to defining which analytical constructions are standard, repeatable, and implicitly endorsed.
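To make the embedded preference concrete, field normalisation typically divides a publication's citations by the average citations of comparable publications (same field and publication year, often also document type); the sketch below implements that common construction in simplified form, and the exact baseline any given platform uses is an assumption to verify rather than something this example settles.

```python
from collections import defaultdict
from statistics import mean

# Simplified field-normalised citation score: each paper's citations divided by the
# mean citations of papers in the same (field, year) baseline. Real implementations
# also condition on document type and handle small baselines more carefully.
papers = [
    {"id": "p1", "field": "oncology",    "year": 2020, "citations": 40},
    {"id": "p2", "field": "oncology",    "year": 2020, "citations": 10},
    {"id": "p3", "field": "mathematics", "year": 2020, "citations": 4},
    {"id": "p4", "field": "mathematics", "year": 2020, "citations": 2},
]

baseline = defaultdict(list)
for p in papers:
    baseline[(p["field"], p["year"])].append(p["citations"])

expected = {key: mean(vals) for key, vals in baseline.items()}

for p in papers:
    score = p["citations"] / expected[(p["field"], p["year"])]
    print(p["id"], round(score, 2))  # 1.0 means "cited as expected for its field and year"
```

In this toy example the mathematics paper with 4 citations scores higher than the oncology paper with 10, which is precisely the comparative interpretability the default privileges, and precisely the kind of choice a workflow encodes without the user ever seeing it.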
Table 2 situates this shift within a longer scientometric trajectory, showing how authority over valid methods has moved from expert interpretation to infrastructural enforcement.
Conversational bibliometrics is not primarily a usability innovation. Once mediated by MCPs, it becomes an infrastructural regime in which methodological defaults govern what counts as legitimate bibliometric knowledge. Data producers therefore assume a new form of epistemic responsibility. At minimum, this means being explicit about which analytical constructions are defaults, which methodological choices are embedded in workflows, and why. Transparency here is not an add-on to the system but a condition of its legitimacy.
To infinity, with a recipe
If we want to build conversational bibliometrics, therefore, we cannot rely on conversational fluency alone. The arguments above point toward a specific architectural consequence: separating interpretation from execution, formalising analytical workflows, and making methodological defaults explicit and inspectable.
Recent work in Nature Computational Science has begun to explore multi-agent AI collaborators such as SciSciGPT, which orchestrate specialist agents across literature search, data access, analysis, and evaluation within auditable workflows. These systems operationalise some of the same structural separations argued for here — between interpretation, execution, and validation — but they are primarily designed to support exploratory inquiry and sense-making, and continue to rely on substantial domain expertise from the user to assess methodological appropriateness. As such, they remain domain-specific research prototypes rather than general, governed infrastructures that assume responsibility for methodological validity in conversational analytical systems.
In the next piece, I introduce the concept of an AI metascientist: not as a single model, but as a governed system architecture in which conversational interfaces, executable workflows, and validation layers are deliberately separated and coordinated. It offers one possible answer to the question posed here — how to support conversational inquiry while making methodological responsibility explicit, rather than leaving it implicit or deferred to user judgment.
Bibliography
Garfield, E. (1979) Citation indexing: its theory and application in science, technology, and humanities. New York, NY: Wiley.
Hook, D. W., Porter, S. J., & Herzog, C. (2018) Dimensions: Building context for search and evaluation. Frontiers in Research Metrics and Analytics, 3, 23.
Wouters, P. (2014) The citation: From culture to infrastructure. In B. Cronin & C. R. Sugimoto (Eds.), Beyond bibliometrics: Harnessing multidimensional indicators of scholarly impact (pp. 47–66). Cambridge, MA: MIT Press.



Interesting, though I wonder whether APIs really offer maximum freedom, since the design of the API also limits what is practically possible. E.g., I ran into this trying to use the OpenAlex API.
I'm also curious how using LLMs to vibe-code analyses or dashboards fits into your categories. I've been doing this recently with great success...
I'm also wondering about MCP servers, but isn't one way to think of them that they are just wrappers around the API? They encode even more restrictions, of course, set by the designer of the MCP server.