Prompt engineering for multilingual structured output
Engineering


9 min read | 21 April 2026 | Neural Summary

Neural Summary generates 30 types of professional documents from conversation transcripts. Each document is produced in the user's language. Each one is a typed JSON structure, not markdown. And each one is generated by a single LLM call with a carefully designed prompt.

This post covers the prompt engineering patterns we developed over eight months and 30 template iterations. Some of these patterns are obvious in retrospect. Most were not obvious when we started.

The decision: JSON over markdown

Our V1 system generated markdown. The LLM received a transcript and returned formatted text with headings, bullet points, and bold emphasis. This worked for rendering a single document in English. It failed at everything else.

Markdown is a rendering format. It encodes both content and presentation in one string. You cannot search it semantically. You cannot translate it without also translating the formatting markers. You cannot validate whether it contains the sections you expect. You cannot render it differently on web versus mobile.

In V2, every template generates a typed JSON structure. An executive summary returns an object with recommendation, findings (each with finding, implication, evidence), decisions (each with decision, owner, rationale, status), and risks (each with risk, severity, mitigation).
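The executive-summary shape described above can be sketched as typed structures. Field names follow the post; the exact Python types are illustrative assumptions:

```python
from typing import List, TypedDict

# Sketch of the executive-summary schema described in the post.
# Field names are from the post; types are assumptions.

class Finding(TypedDict):
    finding: str
    implication: str
    evidence: str

class Decision(TypedDict):
    decision: str
    owner: str
    rationale: str
    status: str

class Risk(TypedDict):
    risk: str
    severity: str
    mitigation: str

class ExecutiveSummary(TypedDict):
    recommendation: str
    findings: List[Finding]
    decisions: List[Decision]
    risks: List[Risk]
```

A typed shape like this is what the validation layer (covered later) checks generated output against.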

This separation of content from presentation enables five things that matter:

  1. The same output renders differently on desktop and mobile.
  2. The same output displays in five languages without re-generation.
  3. Schema validation catches incomplete or malformed output before it reaches the user.
  4. Semantic search indexes specific fields (decisions, action items, risks) rather than full text.
  5. Export to markdown, PDF, or clipboard produces format-appropriate output from the same data.

The upfront cost is higher. You need a JSON schema for every template. You need validation logic. You need a rendering component for each schema. But the payback is immediate and compounds.

Why generic prompts produced generic output

Early prompts said things like "Summarize this transcript and identify key themes." The output was generic. Correct, but unremarkable.

The fix was positioning. Every prompt now opens with a specific professional identity:

  • The executive summary prompt positions the LLM as a chief of staff using the Pyramid Principle.
  • The agile backlog prompt positions it as a certified Scrum Product Owner with business analysis expertise.
  • The coaching notes prompt positions it as a developmental psychologist specializing in executive coaching.
  • The competitive intelligence prompt positions it as a VP of competitive intelligence at a B2B SaaS company.

This is not anthropomorphism. It is context setting. The positioning activates the domain knowledge embedded in the model's training data. A prompt that says "as a VP of competitive intelligence, produce a battlecard" generates output with the specificity and structure that a VP of CI would actually expect.

The more specific the positioning, the better the output. "You are a helpful assistant" produces assistant-quality output. "You are a chief of staff preparing a board-ready executive summary following the Pyramid Principle" produces chief-of-staff-quality output.
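In a prompt-assembly layer, positioning is just a prefix prepended before the rest of the instructions. A sketch, with persona strings paraphrased from the list above and a hypothetical helper name:

```python
# Hypothetical sketch: persona lines prepended to each template prompt.
PERSONAS = {
    "executive_summary": (
        "You are a chief of staff preparing a board-ready executive "
        "summary following the Pyramid Principle."
    ),
    "agile_backlog": (
        "You are a certified Scrum Product Owner with business "
        "analysis expertise."
    ),
}

def build_prompt(template: str, instructions: str) -> str:
    """Open the prompt with a specific professional identity."""
    return f"{PERSONAS[template]}\n\n{instructions}"
```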

Showing the model what good looks like (and what bad looks like)

Telling the LLM what you want is less effective than showing it. Every template prompt includes examples of both good and bad output for critical fields.

For action items:

GOOD: "Ship revised pricing page by Friday → owner: Sarah, priority: high"
BAD: "We should probably look into updating the pricing page at some point"

For user stories in the agile backlog template:

GOOD: "As a hiring manager, I can filter candidates by skill match 
       score so that I review the most qualified applicants first"
BAD: "As a user, I want to see candidates so that I can hire people"

The bad examples are as important as the good ones. Without them, the model gravitates toward the median quality of its training data, which for meeting notes is low.
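The good/bad pairs can live alongside each field definition and be rendered into the prompt verbatim. A minimal sketch (structure and names are hypothetical; the example strings are the ones above):

```python
# Hypothetical structure: each critical field carries a good/bad pair.
FIELD_EXAMPLES = {
    "action_item": {
        "good": "Ship revised pricing page by Friday → owner: Sarah, priority: high",
        "bad": "We should probably look into updating the pricing page at some point",
    },
}

def render_examples(field: str) -> str:
    """Render a field's GOOD/BAD pair for inclusion in the prompt."""
    ex = FIELD_EXAMPLES[field]
    return f'GOOD: "{ex["good"]}"\nBAD: "{ex["bad"]}"'
```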

Forcing language compliance, not suggesting it

Supporting five languages means the LLM must generate all content, including field labels and structural text, in the user's language. This is harder than it sounds.

Left to its own devices, the model generates JSON keys in English and values in mixed languages. German output had English section headers. French output fell back to English for technical terms. We tested polite formulations and they produced English leakage in roughly 15% of non-English outputs.

Our solution is a mandatory language block injected into every prompt with emphatic language:

CRITICAL LANGUAGE REQUIREMENT: You MUST generate ALL JSON text 
values in {language}. This includes all headings, descriptions, 
content, labels, and any other text. Do NOT fall back to English 
for any field. Every string value must be in {language}.

The emphatic version reduced English leakage to under 1%. Without strong instruction, the model treats language as a suggestion rather than a constraint.
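A sketch of the injection, assuming prompts are assembled as plain strings (the helper name is hypothetical; the block text is the one above):

```python
# Mandatory language block, injected into every prompt.
LANGUAGE_BLOCK = (
    "CRITICAL LANGUAGE REQUIREMENT: You MUST generate ALL JSON text "
    "values in {language}. This includes all headings, descriptions, "
    "content, labels, and any other text. Do NOT fall back to English "
    "for any field. Every string value must be in {language}."
)

def with_language(prompt: str, language: str) -> str:
    """Prepend the mandatory language block to a template prompt."""
    return f"{LANGUAGE_BLOCK.format(language=language)}\n\n{prompt}"
```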

Constraining verbosity before it reaches the user

LLMs are verbose by default. Without constraints, an action item that should say "Ship revised pricing page by Friday" becomes "It was discussed and agreed upon that the team should prioritize the revision of the pricing page, with a target completion date of Friday."

We enforce conciseness in two ways.

Word limits per field. Each template specifies maximum word counts for specific fields. Action item descriptions: 8-15 words. Key finding summaries: one sentence. Section headings: 3-6 words.

Action-verb-first patterns. For tasks, action items, and next steps, the prompt requires that every item begin with a verb in imperative form. "Ship revised pricing page" not "The pricing page should be revised." This single constraint eliminates most filler language.
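Both constraints are cheap to check after generation. A minimal sketch: the word budgets are from this section, while the verb list is a tiny illustrative stand-in, since a real imperative-form check needs a POS tagger:

```python
# Word budgets per field (from the post) and a stand-in verb list (assumption).
WORD_LIMITS = {"action_item": (8, 15), "heading": (3, 6)}
ACTION_VERBS = {"ship", "draft", "review", "schedule", "send", "update"}

def within_limit(field: str, text: str) -> bool:
    """Check generated text against the field's word budget."""
    lo, hi = WORD_LIMITS[field]
    return lo <= len(text.split()) <= hi

def starts_with_verb(text: str) -> bool:
    """Heuristic: require the first word to be a known imperative verb."""
    words = text.split()
    return bool(words) and words[0].lower() in ACTION_VERBS
```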

Detecting conversation type to improve output relevance

Not all conversations are equal. A sales discovery call has different content than a sprint retrospective. A coaching session has different dynamics than a board meeting.

Our system auto-detects the conversation type during the initial transcription pass. We classify into 14 types: sales call, interview, brainstorm, sprint planning, coaching session, one-on-one, board meeting, and so on. This classification is stored alongside the transcript and used in two ways.

First, it informs template recommendations. A detected sales call surfaces CRM notes, deal qualification, and follow-up email as suggested templates. A detected retrospective surfaces the retrospective template and action items.

Second, the conversation type is passed to the prompt as context. The same executive summary template produces subtly different output for a sales call versus a strategy session, because the prompt knows what kind of conversation it is analyzing.
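The first use is essentially a lookup from detected type to suggested templates. A sketch using the examples above (dictionary keys and template identifiers are illustrative):

```python
# Detected conversation type drives template suggestions.
# Mapping values follow the examples in the post; names are illustrative.
TEMPLATE_SUGGESTIONS = {
    "sales_call": ["crm_notes", "deal_qualification", "follow_up_email"],
    "retrospective": ["retrospective", "action_items"],
}

def suggest_templates(conversation_type: str) -> list:
    """Return recommended templates for a detected conversation type."""
    return TEMPLATE_SUGGESTIONS.get(conversation_type, [])
```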

Validating output before the user sees it

LLMs do not always produce valid JSON. Even when they do, the JSON may not conform to the expected schema. A coaching notes output might be missing the key moments array. An executive summary might have a recommendation field that is null.

We validate every output against the template's expected schema before storing it. Required fields must exist and have the correct type. Arrays must not be empty when they are expected to contain items. String fields must not be empty.

When validation fails, we have two options: reject and retry, or patch the output with defaults. For critical fields (the executive summary's recommendation, the action items list), we retry once with a slightly modified prompt. For optional enhancement fields (risk severity, priority badges), we patch with sensible defaults.

This validation layer catches roughly 3-5% of outputs. Without it, users would occasionally see empty sections or missing data.
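A minimal sketch of the validate-then-retry-or-patch decision; field names, defaults, and the retry signal are illustrative assumptions:

```python
# Illustrative field sets: critical fields trigger a retry,
# optional ones are patched with defaults.
CRITICAL_FIELDS = {"recommendation", "action_items"}
PATCH_DEFAULTS = {"severity": "medium", "priority": "normal"}

def missing_fields(output: dict, required: set) -> set:
    """Required fields that are absent, None, empty string, or empty list."""
    return {f for f in required if not output.get(f)}

def repair(output: dict, required: set) -> dict:
    """Signal a retry for critical gaps; patch optional gaps with defaults."""
    missing = missing_fields(output, required)
    if missing & CRITICAL_FIELDS:
        raise ValueError(f"retry: missing critical fields {sorted(missing)}")
    for field in missing:
        output[field] = PATCH_DEFAULTS.get(field, "")
    return output
```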

The hardest language: German

German is our most challenging language. Not because of character encoding or font rendering, but because of capitalization.

German capitalizes all nouns. English capitalizes proper nouns and sentence beginnings. Spanish, French, and Dutch follow English-like rules with minor variations.

When the LLM generates a heading like "Key Findings" in English, the equivalent in German is "Wichtigste Erkenntnisse," where "Erkenntnisse" (findings) is capitalized because it is a noun. But the LLM, trained primarily on English text, sometimes produces "wichtigste erkenntnisse" (all lowercase after the first word) or "Wichtigste erkenntnisse" (missing the noun capitalization).

We solved this with explicit capitalization instructions per language in the prompt, plus a post-processing step that checks German output for common capitalization errors. The post-processing is imperfect, but it catches the most visible mistakes.
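The post-processing check can be approximated with a lexicon lookup. This sketch uses a three-word stand-in lexicon; a real implementation would need a full German noun list or POS tagging:

```python
# Illustrative heuristic: capitalize known German nouns in headings.
# The lexicon here is a stand-in, not the production word list.
KNOWN_NOUNS = {"erkenntnisse", "entscheidungen", "risiken"}

def fix_german_heading(heading: str) -> str:
    """Uppercase the first word and any known nouns in a German heading."""
    words = heading.split()
    if not words:
        return heading
    fixed = [words[0][0].upper() + words[0][1:]]
    for w in words[1:]:
        fixed.append(w.capitalize() if w.lower() in KNOWN_NOUNS else w)
    return " ".join(fixed)
```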

What the prompts look like

A single template prompt averages 400-600 words. It contains:

  1. Domain expert positioning (2-3 sentences)
  2. Output format specification with JSON schema
  3. Language requirement block
  4. Good and bad examples for critical fields
  5. Conciseness constraints and word limits
  6. Template-specific instructions (Pyramid Principle for executive summaries, BANT framework for deal qualification, developmental psychology framework for coaching notes)
  7. Edge case handling ("If no clear decisions were made, state that explicitly rather than fabricating decisions")
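Assembled in order, a template prompt is these sections joined together. A sketch with illustrative section keys, not the production names:

```python
def assemble_prompt(parts: dict) -> str:
    """Join the seven prompt sections in the order listed above.
    Section keys are illustrative; absent sections are skipped."""
    order = ["positioning", "format", "language", "examples",
             "conciseness", "template_rules", "edge_cases"]
    return "\n\n".join(parts[k] for k in order if k in parts)
```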

The prompts are the most carefully maintained code in the system. A small change in wording can meaningfully shift output quality. We version them, review changes, and test against a consistent set of sample transcripts before deploying updates.

What this changed about how we build AI features

Structured output is strictly better than markdown for any application that needs to do more than render text. The upfront cost is higher, but the optionality it creates is worth more than the cost within weeks.

Prompt quality matters more than model size. A well-designed prompt on a mid-range model consistently outperforms a vague prompt on a larger model. Invest in the prompt.

Language support is not a feature, it is an architecture decision. If you build an English-first system and add languages later, you will rewrite the prompt layer. If you build language-aware from the start, additional languages are incremental.

Specificity is proof. A vague summary feels generated. A specific finding with evidence, an implication, and a cited quote feels like it was written by someone who listened. The same principle from good writing applies to good prompt engineering: concrete details are what make output credible.

Thirty templates. Five languages. Structured JSON. Built one prompt iteration at a time.
