AI & Automation

Building AI Agents for a
Legal Marketing Agency

How we built five production AI agents for a legal marketing agency. Architecture, failure modes, costs, and what we tried and abandoned.


17 min read · 3,400 words · 10 FAQs answered · Last updated Mar 31, 2026

There is a specific moment when a marketing agency outgrows manual processes. For legal marketing agencies, it usually happens between 200 and 500 clients. The work is specialized enough that you cannot hire generalists, but repetitive enough that your specialists are spending half their time on tasks that follow a predictable pattern.

We hit that wall and decided to build AI agents to handle the repetitive parts. Not chatbots. Not copilots. Autonomous agents that take structured input, do multi-step work, produce structured output, and wait for human approval before anything touches a client account.

This is what we built, what actually worked, and what we would do differently.

What we mean by “agent”

The word gets thrown around loosely. In our system, an agent is a Python function that:

  1. Receives a structured input (JSON with a defined schema)
  2. Makes one or more calls to the Claude API with specific system prompts and tool definitions
  3. Validates the output against a schema
  4. Writes the result to a database for human review
  5. Optionally triggers the next agent in a sequence

That is it. No autonomous browsing. No self-modifying goals. No recursive planning loops. Each agent has a single job, defined inputs, defined outputs, and a human checkpoint before anything goes live.
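The five steps above fit in a single function. This is a minimal sketch, not our production code; the names (`run_agent`, `call_model`, `save_for_review`) are illustrative.

```python
def validate(output: dict, schema: dict) -> list:
    """Return a list of validation errors; an empty list means valid.

    Simplified: the schema here is just {field_name: expected_type}.
    """
    errors = []
    for field, expected_type in schema.items():
        if field not in output:
            errors.append(f"missing field: {field}")
        elif not isinstance(output[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    return errors

def run_agent(payload: dict, schema: dict, call_model, save_for_review):
    output = call_model(payload)        # step 2: one or more Claude API calls
    errors = validate(output, schema)   # step 3: schema validation
    if errors:
        raise ValueError("; ".join(errors))
    save_for_review(output)             # step 4: queue for human review
    return output                       # step 5: hand off to the next agent
```

Everything else in the system is plumbing around this loop.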

We have five agents in production:

  • Campaign Builder — takes client intake data, produces Google Ads campaign structure
  • Schema Generator — takes page metadata, produces JSON-LD structured data
  • Content Brief Writer — takes a keyword cluster, produces a detailed content brief with outline, word count targets, and internal linking plan
  • Title and Meta Auditor — takes a batch of page URLs, pulls current titles and metas, produces optimized versions with character counts and CTA recommendations
  • Monthly Report Narrator — takes raw analytics data (GSC impressions, clicks, rankings, conversions), produces a human-readable summary with insights

Each agent took 1-3 weeks to build and refine to the point where the human review step rarely needs to make significant changes.

The architecture that works

Every agent follows the same pattern:

Input (JSON) → Prompt Assembly → Claude API Call → Schema Validation → Database Write → Human Review Queue

Prompt assembly is where the real work lives. The system prompt for each agent is 1,500-3,000 words of specific instructions. It encodes domain expertise: what makes a good title tag for a personal injury page, how to structure ad groups by intent level, which metrics matter in a monthly report for a law firm that does family law versus one that does criminal defense.

These prompts are not generic. They are the distilled knowledge of people who have been doing legal marketing for years, translated into instructions that Claude can follow consistently.

Claude API calls use tool_use (function calling) to force structured output. We never ask Claude to return freeform text when we need structured data. The tool schema defines exactly what fields are expected, what types they should be, and what constraints they must satisfy (character limits, enum values, required fields).
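A tool definition for the Title and Meta Auditor might look like the following. The tool name, field names, and character limits here are illustrative, not our exact schema:

```python
# Hypothetical tool definition; the Anthropic API validates the model's
# output against this JSON Schema before your code ever sees it.
TITLE_META_TOOL = {
    "name": "emit_title_meta",
    "description": "Return the optimized title tag and meta description.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "maxLength": 60},
            "meta_description": {"type": "string", "maxLength": 155},
            "cta_recommendation": {"type": "string"},
        },
        "required": ["title", "meta_description"],
    },
}

# The call then forces Claude to answer through the tool rather than
# in freeform text (model name is illustrative):
#
# client.messages.create(
#     model="claude-sonnet-4-20250514",
#     max_tokens=1024,
#     tools=[TITLE_META_TOOL],
#     tool_choice={"type": "tool", "name": "emit_title_meta"},
#     messages=[{"role": "user", "content": page_payload}],
# )
```

`tool_choice` with `"type": "tool"` is what turns "please respond in JSON" from a polite request into a guarantee.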

When the tool call fails schema validation (Claude returns a field with the wrong type, or a headline that exceeds the character limit), the system retries with the validation error included in the next message. Claude reads the error and corrects it. This works about 95% of the time on the first retry. If it fails three times, the task goes to a human with the partial output and the error log.
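The repair loop is a few lines. A sketch, with illustrative names: `generate(feedback)` performs the API call (feedback is `None` on the first attempt, otherwise the prior validation error text), and `validate(output)` returns an error string or `None`.

```python
def call_with_repair(generate, validate, max_attempts=3):
    """Retry the model call, feeding validation errors back in."""
    feedback = None
    for attempt in range(max_attempts):
        output = generate(feedback)   # feedback becomes part of the next message
        feedback = validate(output)
        if feedback is None:
            return output
    # after three failures, route to a human with the partial output
    raise RuntimeError(f"failed after {max_attempts} attempts: {feedback}")
```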

Database writes store the full output along with metadata: which prompt version was used, how many API calls it took, the total token count, and a timestamp. This lets us track quality over time and correlate prompt changes with output quality.
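The row we record per run is roughly the following. Illustrative sketch only: we use Postgres via Supabase; sqlite here just keeps the example self-contained, and the column names are made up for the sketch.

```python
import json
import sqlite3
import time

def record_run(conn, agent, prompt_version, output, api_calls, total_tokens):
    """Store the full output plus the metadata needed to track quality."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS agent_runs (
               agent TEXT, prompt_version TEXT, output TEXT,
               api_calls INTEGER, total_tokens INTEGER, created_at REAL)"""
    )
    conn.execute(
        "INSERT INTO agent_runs VALUES (?, ?, ?, ?, ?, ?)",
        (agent, prompt_version, json.dumps(output),
         api_calls, total_tokens, time.time()),
    )
    conn.commit()
```

Joining this table against review outcomes is what lets you answer "did last month's prompt change actually help?"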

The human review queue is a web interface where reviewers see the generated output alongside the input data. They can approve, modify, or reject each item. Approved items either go live immediately (for things like schema markup that deploy with the next site build) or queue for execution (for things like Google Ads campaigns that require API calls).

What we tried and abandoned

LangChain. We started here because it seemed like the standard tool for building AI workflows. Within two weeks we ripped it out. The abstraction layer added complexity without solving any problem we actually had. Our agents are linear pipelines. LangChain is designed for dynamic chains with branching, memory, and tool selection. Using it for a linear pipeline is like using a full ORM when you need to run three SQL queries.

The other issue is debugging. When a LangChain pipeline fails, the stack trace goes through multiple layers of abstraction before you reach the actual API call that broke. With direct API calls, the stack trace is your code, the Anthropic SDK, and the API response. Three layers, not twelve.

Multi-agent conversations. We tried having one agent generate a campaign and then pass it to a “reviewer” agent that would check the work. The idea was to catch errors before the human review step. In practice, the reviewer agent either rubber-stamped everything or flagged things that were actually correct. It added latency and API cost without improving quality. A simple programmatic validation step (check character limits, verify budget math, confirm required fields) catches real errors. An AI reviewing another AI’s work catches vibes.

CrewAI and AutoGen. We evaluated both. Both are designed for scenarios where multiple agents need to collaborate dynamically — negotiating, delegating, and adapting based on intermediate results. Our agents do not collaborate. They run in sequence. Agent A’s output is Agent B’s input. There is no negotiation. Using a multi-agent framework for sequential processing is over-engineering.

Scheduled autonomous runs. We tried having agents run on a cron schedule — generate monthly reports every first of the month, audit title tags every Monday. It worked technically, but it created a problem: the output piled up in the review queue faster than humans could review it. Scheduled runs make sense only when the review capacity matches the generation rate. We switched most workflows to on-demand triggering: a human clicks “generate” when they are ready to review the output. Only a small number of predictable reporting jobs still run on a schedule.

Failure modes and how we handle them

The AI hallucinates data. This happened with the monthly report agent. It would sometimes invent a percentage improvement or cite a ranking position that did not match the input data. The fix was adding a validation step that cross-references every number in the narrative against the raw data that was provided. If the report says “organic traffic increased 23%” and the actual increase was 19%, the validator flags it.
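The core of that validator is simple. This sketch only matches bare percentages; the real check also covers ranking positions and absolute counts:

```python
import re

def check_claimed_numbers(narrative: str, allowed: set) -> list:
    """Return every percentage claimed in the narrative that does not
    appear in the raw input data. Anything returned gets flagged for
    human review before the report goes out."""
    claimed = re.findall(r"\d+(?:\.\d+)?%", narrative)
    return [c for c in claimed if c not in allowed]
```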

The API times out. Claude API calls for complex outputs (a full campaign structure with 10+ ad groups) can take 30-60 seconds. Network issues or API load spikes cause occasional timeouts. We use exponential backoff with a maximum of three retries. If all three fail, the task goes to the error queue with a “retry later” option. We do not block on failures.
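The backoff wrapper is standard. A sketch, catching `TimeoutError` for brevity where production code would catch the SDK's specific exception types:

```python
import random
import time

def with_backoff(call, max_attempts=3, base_delay=1.0):
    """Retry `call` with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # caller routes the task to the error queue
            # 1s, 2s, 4s... with up to 2x jitter to avoid thundering herds
            time.sleep(base_delay * 2 ** attempt * (1 + random.random()))
```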

Prompt changes break edge cases. Improving the prompt for one practice area sometimes degrades output quality for another. We discovered this the hard way when we improved the personal injury keyword heuristics and the system started generating overly aggressive keywords for estate planning (a practice area where the search intent is very different). Now we test prompt changes against a suite of 10 representative client profiles spanning different practice areas, locations, and budgets. The suite runs automatically when a prompt is updated.
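The regression suite amounts to running the updated prompt against fixed profiles and asserting a few measurable properties of each output. The profile fields and thresholds below are invented for the sketch:

```python
# Hypothetical profiles; the real suite has 10 spanning practice areas,
# locations, and budgets.
PROFILES = [
    {"practice_area": "personal_injury", "max_keywords": 50},
    {"practice_area": "estate_planning", "max_keywords": 20},
]

def regression_check(generate, profiles=PROFILES):
    """Return the practice areas whose output violates its constraints."""
    failures = []
    for profile in profiles:
        output = generate(profile)  # runs the agent with the new prompt
        if len(output["keywords"]) > profile["max_keywords"]:
            failures.append(profile["practice_area"])
    return failures
```

A non-empty result blocks the prompt update until a human looks at it.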

The output is technically valid but strategically wrong. Schema validation catches structural errors. It does not catch a campaign that targets the wrong intent level for a client’s budget, or a content brief that focuses on the wrong angle for a practice area. This is why the human review step exists. About 15% of outputs need strategic adjustments that only a domain expert would catch.

Costs

A typical agent call costs $0.05-0.30 in Claude API usage, depending on the prompt length and output size. The Campaign Builder agent is the most expensive at around $0.25 per run because the prompt is long and the output is large.

At 650+ client accounts running various agents on demand, our monthly API spend is roughly $800-1,200. That covers all five agents across all clients.

For comparison, the manual labor those agents replaced cost approximately $35,000/month in specialist time. The ROI is not subtle.

What the stack looks like

  • Python 3.12 — all agent logic
  • Anthropic Python SDK — Claude API calls with tool_use
  • Supabase (PostgreSQL) — client data, agent outputs, review queue state, prompt versions
  • Flask — review interface (intentionally simple, no frontend framework)
  • Google Ads API client — campaign deployment for approved outputs
  • cron + systemd — for the few agents that do run on schedule (report generation on the 1st)

The orchestration layer runs on one VPS at $40/month. Database hosting and API usage are separate line items. The agents are not resource-intensive. They are I/O bound (waiting for API responses), not CPU bound.

If you are building this

For a legal marketing agency considering AI agents, here is what actually matters:

  1. Start with one agent. Pick the task that is most repetitive, most structured, and most annoying for your team. Build one agent for that task. Get it through 50 real uses. Then build the next one.

  2. Invest in the prompt, not the framework. The quality of your agent is 90% the quality of the system prompt. Spend three weeks writing and testing the prompt. Spend three days on the code. If you find yourself spending more time on infrastructure than on prompts, you are solving the wrong problem.

  3. Structured output is non-negotiable. If your agent returns freeform text, you cannot validate it programmatically, you cannot feed it to the next agent reliably, and you cannot deploy it through an API. Use tool_use. Define schemas. Validate everything.

  4. The human review step is the product. Do not think of it as a limitation. Think of it as the feature that makes the system trustworthy. Your clients are paying for expert judgment. The AI handles the labor. The human provides the judgment. The system makes both faster.

  5. Track prompt versions and output quality. You will change prompts frequently in the first few months. Without version tracking, you will not know which changes helped and which ones introduced regressions. Simple file versioning with a changelog is sufficient. You do not need a prompt management platform.

  6. Do not build what you can buy. Analytics dashboards, CRM integration, email sending — these are solved problems. Build agents for the parts that are specific to your domain and your workflow. Buy everything else.
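The version tracking in point 5 really can be this small. A sketch assuming one file per version named `<agent>_<YYYYMMDD>.md` with a changelog comment at the top; the naming scheme is illustrative:

```python
from pathlib import Path

def latest_prompt(prompt_dir: str, agent: str) -> Path:
    """Return the newest prompt file for an agent. Old versions stay
    on disk so any recorded output can be traced back to its prompt."""
    versions = sorted(Path(prompt_dir).glob(f"{agent}_*.md"))
    if not versions:
        raise FileNotFoundError(f"no prompt files for {agent}")
    return versions[-1]
```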

The agencies that will be left behind are the ones that try to do everything manually at scale. The ones that will fail are the ones that try to automate everything with no human oversight. The sweet spot is agents that handle labor, humans that handle judgment, and a system that makes the handoff seamless.

Need a clearer next move?

Get a Free Marketing Automation Audit

We'll map your current workflows, identify which tasks are candidates for AI automation, and build a prioritized roadmap based on ROI potential.


Frequently asked questions

AI & Automation FAQ

Quick answers to the most common questions about this topic.

01

What are AI agents in the context of legal marketing?

In our system, an AI agent is a Python function that receives structured JSON input, makes one or more calls to the Claude API with specific system prompts and tool definitions, validates the output against a schema, writes the result to a database for human review, and optionally triggers the next agent in a sequence. No autonomous browsing, no self-modifying goals, no recursive planning loops. Each agent has a single job, defined inputs, defined outputs, and a human checkpoint before anything goes live.

02

How many AI agents do you run in production for law firm marketing?

Five. Campaign Builder takes client intake data and produces Google Ads campaign structure. Schema Generator takes page metadata and produces JSON-LD structured data. Content Brief Writer takes a keyword cluster and produces a detailed outline with word count targets and internal linking plan. Title and Meta Auditor takes page URLs and produces optimized versions with character counts. Monthly Report Narrator takes raw analytics data and produces a human-readable summary with insights.

03

How much does it cost to run AI agents for a legal marketing agency?

A typical agent call costs $0.05-0.30 in Claude API usage depending on prompt length and output size. At 650+ client accounts, our monthly API spend is roughly $800-1,200 covering all five agents. The orchestration layer runs on one VPS at $40/month, with database hosting and API usage as separate line items. For comparison, the manual labor those agents replaced cost approximately $35,000/month in specialist time.

04

Why did you abandon LangChain for AI agent development?

Our agents are linear pipelines. LangChain is designed for dynamic chains with branching, memory, and tool selection. Using it for a linear pipeline added complexity without solving any problem we actually had. The other issue is debugging: when a LangChain pipeline fails, the stack trace goes through multiple abstraction layers before you reach the actual API call that broke. With direct API calls, the stack trace is your code, the Anthropic SDK, and the API response. Three layers, not twelve.

05

Do you use CrewAI or AutoGen for multi-agent workflows?

No. We evaluated both. Both are designed for scenarios where multiple agents need to collaborate dynamically. Our agents do not collaborate. They run in sequence. Agent A's output is Agent B's input. There is no negotiation or delegation between agents. Using a multi-agent framework for sequential processing adds overhead without benefit.

06

How do you handle AI hallucinations in legal marketing reports?

The monthly report agent sometimes invented percentage improvements or cited ranking positions that did not match the input data. The fix was adding a validation step that cross-references every number in the narrative against the raw data. If the report says organic traffic increased 23% and the actual increase was 19%, the validator flags it before it reaches the review queue.

07

How do you version AI prompts for production agents?

Each prompt version is a file with a timestamp and a changelog comment. When we update a prompt, the old version stays. The database stores which prompt version produced each output. We test prompt changes against a suite of 10 representative client profiles spanning different practice areas, locations, and budgets. The suite runs automatically when a prompt is updated to catch regressions before they reach clients.

08

Should AI agents run on a schedule or on demand?

We tried scheduled runs and abandoned them for most agents. The output piled up in the review queue faster than humans could review it. Scheduled runs make sense only when the review capacity matches the generation rate. We switched most workflows to on-demand triggering: a human clicks generate when they are ready to review the output. The few predictable reporting jobs that still run on schedule are timed to match reviewer availability.

09

What is the most important factor in AI agent quality?

The system prompt. The quality of your agent is 90% the quality of the prompt. Our prompts are 1,500-3,000 words of specific instructions that encode domain expertise: what makes a good title tag for a personal injury page, how to structure ad groups by intent level, which metrics matter in a monthly report for different practice areas. If you spend more time on infrastructure than on prompts, you are solving the wrong problem.

10

How do you handle AI agent failures gracefully?

When a tool call fails schema validation, the system retries with the validation error included in the next message. This works about 95% of the time on the first retry. If it fails three times, the task goes to a human with the partial output and the error log. For API timeouts, we use exponential backoff with a maximum of three retries. Failed tasks go to an error queue with a retry-later option. We never block on failures.

Next step

Ready to Automate Your Legal Marketing Workflows?

Book a free strategy session. We'll audit your current marketing operations, identify the biggest automation opportunities, and show you what AI agents can handle today.

Book my strategy call · Free SEO Audit
No obligation · 100% confidential · Custom roadmap included