
The Unexpected Polymorphism Pitfalls when Structuring LLM Outputs

A deeper dive into structured outputs than you may have wished for.

Overview

This post covers how to best utilise the Pydantic library for manipulating LLM responses. It starts off pretty simple, but we dive into some esoteric gotchas that may be novel even to hardcore LLM hackers.

I'm assuming some prerequisite knowledge here, such as adequate Python skills and familiarity with Anthropic's SDK.

💡 I don't have access to the code powering OpenAI's structured outputs, but I'm curious as to whether they address the problems outlined in this post. My guess is they don't, but get in touch if you think otherwise.

Basic Structured Outputs

In a nutshell, structured outputs allow us to enforce some schema onto an LLM. Take a prompt that looks like:

prompt = """
You are an expert in some field.
Given the following context:

{context}

Answer the following question:

{question}
"""

response: str = call_anthropic(
    prompt.format(
        context=some_context,
        question=some_question
    )
)

assert isinstance(response, str)  # True

The LLM will generate a raw string containing the answer to the question. Great. Now let's look at the structured output approach:

import json

prompt = """
You are an expert in some field.
Given the following context:

{context}

Answer the following question:

{question}

You must return your response in the following format:

```json
{{
    "answer": str
}}
```
"""

response: str = call_anthropic(
    prompt.format(
        context=some_context,
        question=some_question
    )
)

assert "answer" in json.loads(response)  # True

Now there's nothing special about doing this on its own, but it starts to get fun when you approach the problem from first principles.

LLMs are autoregressive, meaning each token is conditioned on the tokens that came before it, i.e. the model defines a conditional distribution over tokens at each time step:

$$P(x_t | x_{t-1},\ \ldots, x_0)$$

This means we can use structure to force the model to generate some preamble that might improve its answer, i.e.

$$P(answer\ |\ preamble,\ \ldots, context)$$

Let's look at a prompt that conditions the answer on some reasoning preamble. This is loosely similar to what reasoning models do; the difference here is that the format of our reasoning isn't optimised via a reinforcement learning process:

prompt = """
You are an expert in some field.
Given the following context:

{context}

Answer the following question:

{question}

You must return your response in the following format:

```json
{{
    "reasoning": str,
    "answer": str
}}
```
"""

Great! But our model is hallucinating and for some reason always messes up answers which require reasoning over some time period!

If we give the model a full medical record and a question such as "has the patient received treatment X in the past two years?", the model might be inclined to say "yes, absolutely" at any sign of treatment X, disregarding the temporal constraint even if the treatment took place 10 years ago. Let's update our prompt with a simple solution that absolutely will not work:

prompt = """
You are an expert in some field.
Given the following context:

{context}

Answer the following question,
and when reasoning over time you must be extra careful!!

{question}

You must return your response in the following format:

```json
{{
    "reasoning": str,
    "answer": str
}}
```
"""

A much more robust way to solve this is with structured outputs; remember that we can manipulate what the model generates:

prompt = """
You are an expert in some field.
Given the following context:

{context}

Answer the following question:

{question}

You must return your response in the following format:

```json
{{
    "this_question_requires_temporal_reasoning": bool,
    "thorough_temporal_reasoning": str,
    "succinct_reasoning": str,
    "answer": str
}}
```
"""

There's more we could say here but that's the gist of it. Let's move on to Pydantic.

Introducing Pydantic

Enter Pydantic, Python's go-to library for structured data validation and schemas. Let's rewrite the last example:

from typing import Optional

from pydantic import BaseModel

prompt = """
You are an expert in some field.
Given the following context:

{context}

Answer the following question:

{question}

You must return your response in the following format:

```json
{{
    "this_question_requires_temporal_reasoning": bool,
    "thorough_temporal_reasoning": str,
    "succinct_reasoning": str,
    "answer": str
}}
```
"""

class Response(BaseModel):
    this_question_requires_temporal_reasoning: bool
    thorough_temporal_reasoning: Optional[str] = None
    succinct_reasoning: str
    answer: str

response: Response = call_anthropic_structured(
    prompt.format(
        context=some_context,
        question=some_question
    ),
    output_schema=Response
)

assert isinstance(response, Response)
assert isinstance(response.answer, str)

With the overview out of the way, the rest of this blog will dive into the implementation of call_anthropic_structured(). Some people might stop here and ask:

"But wait, can't we just use libraries such as instructor?"

The answer is: sometimes. Let's keep going, starting by simplifying our prompt using Pydantic's out-of-the-box features:

from typing import Optional

from pydantic import BaseModel

prompt = """
You are an expert in some field.
Given the following context:

{context}

Answer the following question:

{question}

You must return your response in the following format:

```json
{output_schema}   # NEW CHANGE
```
"""

class Response(BaseModel):
    this_question_requires_temporal_reasoning: bool
    thorough_temporal_reasoning: Optional[str] = None
    succinct_reasoning: str
    answer: str

response: Response = call_anthropic_structured(
    prompt.format(
        context=some_context,
        question=some_question,
        output_schema=Response.model_json_schema(),  # NEW CHANGE
    ),
    output_schema=Response
)

assert isinstance(response, Response)
assert isinstance(response.answer, str)

This is very useful when our models get increasingly complex, because we don't want human error introducing discrepancies between our prompt and our output schema. Here's what the output of Response.model_json_schema() looks like:

{
  "properties": {
    "this_question_requires_temporal_reasoning": {
      "title": "This Question Requires Temporal Reasoning",
      "type": "boolean"
    },
    "thorough_temporal_reasoning": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Thorough Temporal Reasoning"
    },
    "succinct_reasoning": {
      "title": "Succinct Reasoning",
      "type": "string"
    },
    "answer": {
      "title": "Answer",
      "type": "string"
    }
  },
  "required": [
    "this_question_requires_temporal_reasoning",
    "succinct_reasoning",
    "answer"
  ],
  "title": "Response",
  "type": "object"
}

💡 Another easy win is to use Pydantic's Field(description="your description here"), which helps the LLM understand exactly what it's looking for at each attribute, but I've excluded this from my examples because it adds clutter.

Of course, this still doesn't explain why we can't always use Instructor. Let's look at a more complicated real-world example: helping a farmer decide which pesticides they should purchase for their crops.

Assume we have two inputs:

  • The farm record, which contains all irrigation and crop details
  • The pesticide guidance released by the farmers' association

guidance = """
-------------------------------------------------------------------------------------
AGRICULTURE PESTICIDES GUIDANCE
-------------------------------------------------------------------------------------
Determine the crop type
[ ] A. Barley and other cereals
[ ] B. Beans and other legume-based crops
[ ] C. Sunflowers and other oil generating plants

-------------------------------------------------------------------------------------
A. Barley and other cereals
-------------------------------------------------------------------------------------
Approve PESTICIDE-A if one or more selected
[ ] Soil is loam based
[ ] Drainage is greater than x per y.
[ ] The crop has not been sown for 2 years.

Approve PESTICIDE-B if all are selected
[ ] No animals at risk of trespassing
[ ] No methane tanks within 500m of crop
[ ] The crop has not been sown for 1 year.

-------------------------------------------------------------------------------------
B. Beans and other legume-based crops
-------------------------------------------------------------------------------------
Approve PESTICIDE-C if none selected
[ ] You get the picture
"""

from pydantic import BaseModel

prompt = """
Determine which pesticide the farmer should use given the following
farm records and pesticide guidance:

<farm-record>
{farm_record}
</farm-record>

<guidance>
{guidance}
</guidance>

You must return your response in the following format:

```json
{output_schema}
```
"""

class Response(BaseModel):
    approved_pesticides: list[str]

response: Response = call_anthropic_structured(
    prompt.format(
        farm_record=some_record,
        guidance=guidance,
        output_schema=Response.model_json_schema(),
    ),
    output_schema=Response
)

print(
    f"The farmer should use {', '.join(response.approved_pesticides)}"
)

We've used a string for the guidance here, but most LLM providers now support direct PDF ingestion, so this structured approach generalises pretty well to more complicated examples. Obviously real PDFs may not have such lovely structure, but you can make your schemas more general to compensate.
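
For instance, with Anthropic's SDK the guidance PDF can be attached as a base64-encoded document block alongside the text prompt. A rough sketch is below: the model id and file path are placeholders, pdf_prompt is assumed to be the prompt above with the <guidance> block removed, and the exact request shape is worth checking against the current docs.

import base64

import anthropic

client = anthropic.Anthropic()

# Read and base64-encode the guidance PDF (path is illustrative)
with open("pesticide_guidance.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder, use your model id
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                # The PDF stands in for the {guidance} string in the prompt
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_b64,
                    },
                },
                {"type": "text", "text": pdf_prompt},
            ],
        }
    ],
)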

The Meat of the Issue

Now let's make the problem a little harder and add a requirement that the developer must dynamically render a frontend, showing the reasoning for each bullet point and whether each bullet's logical conditions were met.

We want to capture everything in an easily parseable form, and we can sketch a rough template with the following schema:

class Option(BaseModel):
    option: str
    temporal_reasoning_required: bool
    temporal_reasoning: Optional[str] = None
    succinct_reasoning: str
    condition_met: bool

class Section(BaseModel):
    heading: str
    option: Option

class Response(BaseModel):
    sections: list[Section]

But if we use this pattern everywhere, we'd likely create some base abstraction, e.g.:

# base.py
from typing import Optional

from pydantic import BaseModel

class BaseReasoning(BaseModel):
    citation: str
    reasoning: str
    condition_met: bool

class TemporalReasoning(BaseReasoning):
    temporal_reasoning_required: bool
    temporal_reasoning: Optional[str] = None

class MathematicalReasoning(BaseReasoning):
    mathematical_reasoning_required: bool
    mathematical_reasoning: Optional[str] = None

# farm.py
from pydantic import BaseModel

from base import TemporalReasoning

class Option(BaseModel):
    reasoning: TemporalReasoning  # inherits citation, reasoning, condition_met from BaseReasoning

class Section(BaseModel):
    heading: str
    option: Option

class Response(BaseModel):
    sections: list[Section]

The problem is that if we dump this schema, whether via Instructor or directly with Response.model_json_schema(), we're going to hit a highly esoteric bug, and herein lies the core issue of this blog post.

If we dump the schema for Response, we get the following:

{
  "$defs": {
    "Option": {
      "properties": {
        "reasoning": {
          "$ref": "#/$defs/TemporalReasoning"
        }
      },
      "required": [
        "reasoning"
      ],
      "title": "Option",
      "type": "object"
    },
    "Section": {
      "properties": {
        "heading": {
          "title": "Heading",
          "type": "string"
        },
        "option": {
          "$ref": "#/$defs/Option"
        }
      },
      "required": [
        "heading",
        "option"
      ],
      "title": "Section",
      "type": "object"
    },
    "TemporalReasoning": {
      "properties": {
        "citation": {
          "title": "Citation",
          "type": "string"
        },
        "reasoning": {
          "title": "Reasoning",
          "type": "string"
        },
        "condition_met": {
          "title": "Condition Met",
          "type": "boolean"
        },
        "temporal_reasoning_required": {
          "title": "Temporal Reasoning Required",
          "type": "boolean"
        },
        "temporal_reasoning": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Temporal Reasoning"
        }
      },
      "required": [
        "citation",
        "reasoning",
        "condition_met",
        "temporal_reasoning_required"   /* HERE LIES THE ISSUE */
      ],
      "title": "TemporalReasoning",
      "type": "object"
    }
  },
  "properties": {
    "sections": {
      "items": {
        "$ref": "#/$defs/Section"
      },
      "title": "Sections",
      "type": "array"
    }
  },
  "required": [
    "sections"
  ],
  "title": "Response",
  "type": "object"
}

If you immediately understood the issue from just reading that schema, take a bow. I was stumped for hours, but essentially Pydantic dumps keys in order of inheritance.

In case the issue isn't fully clear: Pydantic serialises schema attributes in order of inheritance, meaning it outputs BaseReasoning's attributes first and TemporalReasoning's own attributes second. When we dump the schema into the prompt, the LLM will generate the BaseReasoning attributes (including condition_met) before the TemporalReasoning attributes, essentially nullifying the point of the child class.
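
You can see the ordering directly by inspecting the dumped properties of the child model:

list(TemporalReasoning.model_json_schema()["properties"].keys())
# ['citation', 'reasoning', 'condition_met',
#  'temporal_reasoning_required', 'temporal_reasoning']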

💡 What this means is that your responses will contain all the expected keys, but the order in which the LLM generates those keys won't force the answer to be conditioned on the reasoning. In the schema above, temporal_reasoning_required and temporal_reasoning are only generated AFTER the model has already committed to condition_met, which is effectively its answer.

Your responses will therefore contain all the correct keys, but we'll have been generating

$$P(reasoning\ |\ answer,\ \ldots, context)$$

instead of

$$P(answer\ |\ reasoning,\ \ldots, context)$$

which is even more pernicious than you might initially realise, because if a model's accuracy is low, a sensible dev's first instinct is to inspect the reasoning…

What's the fix?

I'm sure there are lots of great fixes out there, but I've settled on a custom function that recursively re-orders the schema based on some preferred order. It's not a lot of code and I've attached it to a mixin.
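
Roughly, it looks like the sketch below. The names here (ReorderSchemaMixin, PREFERRED_KEY_ORDER, ordered_json_schema) are illustrative rather than the exact code, but the mechanism is the same: walk the dumped schema and re-sort every "properties" mapping so that reasoning keys come before conclusion keys.

from typing import Any, ClassVar

from pydantic import BaseModel


class ReorderSchemaMixin:
    # Keys listed here are emitted first, in this order; anything not listed
    # keeps its original relative position after them (sorted() is stable)
    PREFERRED_KEY_ORDER: ClassVar[list[str]] = [
        "citation",
        "temporal_reasoning_required",
        "temporal_reasoning",
        "reasoning",
        "condition_met",
    ]

    @classmethod
    def ordered_json_schema(cls) -> dict[str, Any]:
        # Assumes the mixin is combined with a pydantic.BaseModel,
        # so cls.model_json_schema() exists
        order = cls.PREFERRED_KEY_ORDER

        def rank(key: str) -> int:
            return order.index(key) if key in order else len(order)

        def reorder(node: Any) -> Any:
            # Recurse through the whole schema, including "$defs",
            # re-sorting every "properties" mapping we find
            if isinstance(node, dict):
                out = {key: reorder(value) for key, value in node.items()}
                if isinstance(out.get("properties"), dict):
                    props = out["properties"]
                    out["properties"] = {key: props[key] for key in sorted(props, key=rank)}
                return out
            if isinstance(node, list):
                return [reorder(item) for item in node]
            return node

        return reorder(cls.model_json_schema())


# farm.py, updated
class Response(ReorderSchemaMixin, BaseModel):
    sections: list[Section]

When formatting the prompt, you then call Response.ordered_json_schema() in place of Response.model_json_schema().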

There are other problems that can arise, of course, such as ensuring the model dump actually contains all the child attributes, but that's a deeper dive into Python's many polymorphism pitfalls and would require a much longer post.

Appendix

1. Handling Model Validation Failures

I didn't cover extracting the JSON or what to do if model validation fails. Look at Instructor, but I've always wondered whether there's a better approach that retries responses that failed validation using smaller/faster/local models.
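
For completeness, here's roughly what the call_anthropic_structured() helper used throughout this post might look like, with a naive retry loop; the model id is a placeholder and the smaller-model fallback is left as a comment:

import json
import re

import anthropic
from pydantic import BaseModel, ValidationError

client = anthropic.Anthropic()


def call_anthropic_structured(
    prompt: str,
    output_schema: type[BaseModel],
    model: str = "claude-sonnet-4-20250514",  # placeholder, use your model id
    max_retries: int = 2,
) -> BaseModel:
    for attempt in range(max_retries + 1):
        # On retries you could swap `model` for a smaller/faster/local model here
        message = client.messages.create(
            model=model,
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}],
        )
        text = message.content[0].text

        # Pull out the ```json ... ``` block if present, otherwise assume raw JSON
        match = re.search(r"```json\s*(.*?)```", text, re.DOTALL)
        raw = match.group(1) if match else text

        try:
            return output_schema.model_validate(json.loads(raw))
        except (json.JSONDecodeError, ValidationError):
            if attempt == max_retries:
                raise

    raise RuntimeError("unreachable")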

2. Self-Checking Validation

A clever approach might be to get the model to generate post-JSON reasoning that checks whether it produced all the right keys. This works, and the model will often regenerate a new, complete response within the same completion, but you need to watch out for context length (and handle parsing multiple ```json blocks).
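
A sketch of that pattern (the suffix wording and helper below are illustrative): append an instruction asking the model to audit its own JSON and emit a corrected block if needed, then parse the last ```json block it produced.

import json
import re

# Appended to the structured-output prompt (illustrative wording)
self_check_suffix = """
After the ```json``` block, briefly check that every required key is present.
If anything is missing, output a corrected ```json``` block.
Your final ```json``` block will be treated as the answer.
"""


def parse_last_json_block(text: str) -> dict:
    # Take the last fenced json block, since the model may have re-generated one
    blocks = re.findall(r"```json\s*(.*?)```", text, re.DOTALL)
    return json.loads(blocks[-1])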

3. Improving Accuracy with Field Descriptions

A very easy win for accuracy is using Pydantic's Field and adding a description for each key. I didn't bother in the examples above because it adds clutter.
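
For example, on the farm schema it might look like this (the description text is just illustrative); the description ends up in the model_json_schema() output, so the LLM sees it in the prompt:

from pydantic import BaseModel, Field


class Response(BaseModel):
    approved_pesticides: list[str] = Field(
        description="Names of every pesticide approved by the guidance, e.g. ['PESTICIDE-A']"
    )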

4. Missing Child Attributes in Polymorphic Classes

The pytest file below contains an illustrative dive into the missing child attributes that can occur if you're not careful. These issues are really tricky to spot, especially if you can't see into the internals of the library you're using, e.g. Instructor or the generic OpenAI client.

from pydantic import BaseModel

class Section(BaseModel):
    name: str

class BaseStep(BaseModel):
    parent: str
    child: str

class ChildStepA(BaseStep):
    section: Section

class ChildStepB(BaseStep):
    section: Section

class Task(BaseModel):
    steps: list[BaseStep]  # Note "BaseStep"

class PolymorphicTask(BaseModel):
    steps: list[ChildStepA | ChildStepB]  # Note explicit ChildStepA | ChildStepB

def test_polymorphic_task_definition():
    # Demonstrate that child classes are not dumped correctly by pydantic
    # >>> The dump won't contain Section because it's calling BaseStep.model_dump()
    task = Task(
        steps=[
            ChildStepA(
                parent="a",
                child="b",
                section=Section(
                    name="section",
                ),
            )
        ]
    )
    task_dump = task.model_dump()
    assert task_dump["steps"][0].get("section") is None

    # Fix it with a polymorphic-proof approach
    polymorphic_task = PolymorphicTask(
        steps=[
            ChildStepA(
                parent="a",
                child="b",
                section=Section(
                    name="section",
                ),
            )
        ]
    )
    polymorphic_task_dump = polymorphic_task.model_dump()
    assert polymorphic_task_dump["steps"][0].get("section")