Doctor AI Agent
Multimodal Medical VQA · LLM Agent System · GraphQL Sub-questions · Tool-Calling · Clinical Reasoning
First-author multimodal LLM agent for medical visual question answering

As the first author, I collaborated with the BioNLP lab to design Doctor AI Agent, a multimodal large language model (LLM) agent system for medical visual question answering (VQA). The system decomposes a clinician’s natural-language query into a graph of sub-questions represented in GraphQL, then orchestrates a sequence of tool calls over images, reports, and external medical knowledge to generate grounded, step-by-step reasoning.

At the core of the project is a sub-question-driven planning module that structures each case into atomic reasoning steps such as “localize finding,” “retrieve prior study,” “compare with guideline,” and “summarize for patient safety.” These nodes are stored as a GraphQL schema, enabling the agent to dynamically select and recombine reasoning paths depending on the question type (diagnosis, prognosis, treatment recommendation, or follow-up planning).
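To make the planning idea concrete, here is a minimal sketch of how sub-question nodes and per-question-type reasoning paths could be represented. The SDL string, node names, and plan library are illustrative assumptions, not the project's actual schema or code.

```python
from dataclasses import dataclass, field

# Illustrative GraphQL SDL for a sub-question node (assumed shape,
# not the project's actual schema).
SUBQUESTION_SDL = """
type SubQuestion {
  name: String!
  tool: String!
  dependsOn: [String!]
}
"""

@dataclass
class SubQuestion:
    """One atomic reasoning step in the sub-question graph."""
    name: str
    tool: str                               # tool invoked to answer this node
    depends_on: list = field(default_factory=list)

# Hypothetical plan library keyed by question type; a real system would
# resolve these paths dynamically from the stored graph.
PLANS = {
    "diagnosis": [
        SubQuestion("localize_finding", tool="vision_encoder"),
        SubQuestion("retrieve_prior_study", tool="report_retriever"),
        SubQuestion("compare_with_guideline", tool="knowledge_retriever",
                    depends_on=["localize_finding"]),
        SubQuestion("summarize_for_patient_safety", tool="llm_reasoner",
                    depends_on=["compare_with_guideline"]),
    ],
}

def plan_for(question_type: str) -> list:
    """Select the ordered sub-question path for a given question type."""
    return [sq.name for sq in PLANS[question_type]]

print(plan_for("diagnosis"))
```

Keeping plans as data rather than control flow is what lets the agent recombine reasoning paths per question type without code changes.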

The agent integrates multimodal tools including a vision encoder for radiology images, text encoders for clinical notes and reports, retrieval modules over curated medical knowledge, and LLM-based reasoning modules. Each tool call is logged as a structured step, allowing transparent inspection of which evidence was used and how intermediate conclusions were derived before returning the final answer.
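The structured logging of tool calls can be sketched as follows; the tool names, inputs, and evidence strings are hypothetical examples, not outputs of the real system.

```python
import json

class ToolTrace:
    """Records each tool call as a structured, inspectable step."""

    def __init__(self):
        self.steps = []

    def log(self, tool, inputs, evidence, conclusion):
        # Each entry captures which evidence was used and what
        # intermediate conclusion was derived from it.
        self.steps.append({
            "step": len(self.steps) + 1,
            "tool": tool,
            "inputs": inputs,
            "evidence": evidence,
            "conclusion": conclusion,
        })

    def dump(self) -> str:
        """Serialize the full trace for inspection or audit."""
        return json.dumps(self.steps, indent=2)

trace = ToolTrace()
trace.log("vision_encoder", {"image": "study_001.dcm"},
          evidence="2 cm opacity, right lower lobe",
          conclusion="candidate lesion localized")
trace.log("knowledge_retriever", {"query": "pulmonary opacity follow-up"},
          evidence="retrieved guideline excerpt",
          conclusion="follow-up imaging recommended")
print(trace.dump())
```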

Using this pipeline, we aim to improve knowledge-grounded diagnostic reasoning, targeting a gain of over 15% in medical VQA accuracy over a single-pass LLM baseline that answers questions directly from raw inputs. An Excel-based analysis sheet tracks per-question performance, error categories (reasoning, perception, retrieval, and hallucination), and ablation results for each tool and planning strategy.
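The per-question tracking described above amounts to tallying accuracy and error categories over annotated results. A minimal sketch, using made-up rows in place of the actual analysis sheet:

```python
from collections import Counter

# Hypothetical per-question annotations; "error" is None when correct,
# otherwise one of: reasoning, perception, retrieval, hallucination.
results = [
    {"qid": "q1", "correct": True,  "error": None},
    {"qid": "q2", "correct": False, "error": "perception"},
    {"qid": "q3", "correct": False, "error": "hallucination"},
    {"qid": "q4", "correct": True,  "error": None},
]

# Overall accuracy and a breakdown of failures by category.
accuracy = sum(r["correct"] for r in results) / len(results)
error_counts = Counter(r["error"] for r in results if r["error"])

print(accuracy)               # → 0.5
print(dict(error_counts))     # → {'perception': 1, 'hallucination': 1}
```

Running the same tally per tool-ablation configuration gives the ablation columns of the sheet.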

Beyond raw accuracy, the project emphasizes clinical safety and interpretability. By exposing sub-question graphs and tool-call traces, Doctor AI Agent can explain which lesions were inspected, which guidelines were consulted, and why a particular treatment or follow-up recommendation was made. This structure also makes it easier to plug in future domain-specific tools, such as risk calculators or guideline-specific scoring functions, without redesigning the overall agent.
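The plug-in property mentioned above is naturally realized with a tool registry: new domain tools register themselves and become callable without touching the agent loop. A sketch under assumed names (the registry, decorator, and toy scoring function are all illustrative, not validated clinical logic):

```python
# Maps tool names to callables; the agent dispatches through this table.
TOOL_REGISTRY = {}

def register_tool(name: str):
    """Decorator that plugs a new tool into the registry."""
    def wrap(fn):
        TOOL_REGISTRY[name] = fn
        return fn
    return wrap

@register_tool("risk_calculator")
def risk_calculator(age: int, smoker: bool) -> float:
    # Hypothetical toy score for illustration only,
    # not a validated clinical risk model.
    return min(1.0, 0.01 * age + (0.2 if smoker else 0.0))

def call_tool(name: str, **kwargs):
    """Uniform dispatch: the agent never imports tools directly."""
    return TOOL_REGISTRY[name](**kwargs)

print(call_tool("risk_calculator", age=60, smoker=True))
```

Adding a guideline-specific scoring function is then just another `@register_tool(...)` definition.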

Key responsibilities & contributions