
We Will Win by Doing the Hard Things


Eddy Atkins

January 23, 2026

Head of Machine Learning

As the Head of Machine Learning at Sandstone, I'm sharing how we're thinking about building AI systems for in-house legal teams—what feels clear, what feels uncertain, and where the real work is accumulating.

This is the first post in an ongoing series from the ML team. Many of the ideas touched on here will be expanded in future posts.

-

The 80% Problem

The cost of getting to an "80% solution" has effectively collapsed to zero. What once took real time, effort, and expertise is now often a prompt away. In a world where convincing demos can be produced almost instantly, a POC no longer proves viability, correctness, or durability.

When the cost of reaching 80% goes to zero, the value of that work diminishes with it. Prototypes, passable outputs, things that look like they work—these no longer differentiate products. What remains difficult, and valuable, is the last mile.

In legal tech, the last mile has a name: the Context Crisis.

Our CEO, Nick Fleisher, recently wrote about how legal requests arrive stripped of their commercial context—deal value, relationship history, prior concessions. Attorneys pay a "Reconstruction Tax" hunting for information. It's the core reason why naive AI deployments stall.

The 80% solution can draft a contract clause. It cannot know whether this is a strategic account, what risk tolerance applies, or what was agreed eighteen months ago. That gap is the last 20%—and it's where the Machine Learning & Engineering side of Sandstone is focused.

-

Where We Will Compete

Foundation models are already robust at straightforward legal tasks and analysis—and they will get better. So where do we play? Not by trying to out-model the model providers. By building the Unified Legal Context that makes AI actually useful.

The hard part isn't making something that works. That's table stakes. The hard part is deciding what matters, and being willing to be opinionated about it. Most efforts don't fail because the execution is sloppy; they fail because the underlying priorities are fuzzy.

For us, being opinionated means a few things:

Solving data at its source, not just its retrieval. We've built integrations with Salesforce, HubSpot, ServiceNow, Ironclad, Workday—and the technical challenge isn't connecting to these systems. It's that each one is unique. Every Salesforce instance is set up and utilized differently. The work is in modeling data so it's actually useful for legal workflows, not just accessible.

Codifying institutional knowledge into something executable. We have a standard but flexible playbook system that our GenAI systems are optimized around. These playbooks can be enriched with important context and dynamically updated—including surfacing suggestions for updates—based on ongoing usage. Risk frameworks shouldn't live in PDFs on shared drives. They should inform every request.
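To make "executable" concrete, here is a minimal sketch of what a playbook rule looks like as data rather than prose. The schema, field names, and thresholds are invented for illustration—not our actual system—but they show the shape of the idea: a clause position plus the commercial context that gates it.

```python
from dataclasses import dataclass

@dataclass
class PlaybookRule:
    """One illustrative playbook rule: a clause position plus the context that gates it."""
    clause: str                 # e.g. "limitation_of_liability"
    preferred_position: str     # the standard language we want
    fallback_position: str      # a pre-approved concession
    escalation_threshold: float # deal value above which the fallback needs approval

def resolve_position(rule: PlaybookRule, deal_value: float, counterparty_is_strategic: bool) -> str:
    """Pick a position using commercial context instead of a static PDF."""
    if counterparty_is_strategic and deal_value < rule.escalation_threshold:
        return rule.fallback_position
    return rule.preferred_position

rule = PlaybookRule(
    clause="limitation_of_liability",
    preferred_position="cap at 1x fees",
    fallback_position="cap at 2x fees",
    escalation_threshold=500_000,
)

# The same clause question resolves differently depending on the account's context.
print(resolve_position(rule, deal_value=250_000, counterparty_is_strategic=True))
print(resolve_position(rule, deal_value=250_000, counterparty_is_strategic=False))
```

The point of the data representation is that rules like this can be evaluated on every request—and updated programmatically as usage surfaces better positions.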

-

RAG Is Easy. Solving the Context Crisis Is Not.

Embedding documents and retrieving "relevant" chunks, with some reranking thrown in for good measure, is no longer hard. That capability alone doesn't differentiate a product—and it doesn't solve the problems in knowledge workers' workflows.

Here's why: semantic similarity is not business context.

A naive RAG system might surface a prior NDA because it's textually similar to a new one. But it cannot tell you that the prior contract was with a customer who churned, that the clause was a one-time concession, or that the deal value makes aggressive terms inappropriate.

This solution has not mitigated the Reconstruction Tax—it has just changed its surface area. The attorney still has to hunt for context to augment this 80% solution.

We operate on a spectrum. On one end, we're opinionated about what data matters—extracting and storing it in structured form before runtime, so our AI system's analysis can be infused with business context directly relevant to its outputs. Entity names, dates, contract values. A question like "Find the last two MSAs with counterparty X between 2023 and 2024" should return a reliable and deterministic answer.
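As a sketch of that deterministic end of the spectrum (the schema, table, and values are hypothetical), the MSA question above collapses into a plain SQL query once entities and dates have been extracted up front:

```python
import sqlite3

# Illustrative schema: contract metadata extracted and normalized before runtime.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contracts (
        id INTEGER PRIMARY KEY,
        contract_type TEXT,
        counterparty TEXT,
        effective_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO contracts (contract_type, counterparty, effective_date) VALUES (?, ?, ?)",
    [
        ("MSA", "Acme Corp", "2022-06-01"),
        ("MSA", "Acme Corp", "2023-03-15"),
        ("NDA", "Acme Corp", "2023-08-01"),
        ("MSA", "Acme Corp", "2024-01-20"),
    ],
)

# "Find the last two MSAs with counterparty X between 2023 and 2024"
# becomes a deterministic query, not a similarity search.
rows = conn.execute(
    """
    SELECT effective_date FROM contracts
    WHERE contract_type = 'MSA'
      AND counterparty = 'Acme Corp'
      AND effective_date BETWEEN '2023-01-01' AND '2024-12-31'
    ORDER BY effective_date DESC
    LIMIT 2
    """
).fetchall()
print(rows)  # the same two rows every time, no retrieval heuristics involved
```

No embedding model can guarantee this answer; a structured store can.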

On the other end, we have the flexibility to augment our analysis with the long tail of data—which cannot be structured up front. Meeting notes linked to those MSAs carry just as important a signal about how our system should operate.

Making both work requires layers: opinionated pre-processing, user-influenced data tagging to canonically represent entities, embedding all supplementary data, and a spectrum of retrieval approaches—from pre-defined queries to a flexible text-to-SQL builder to embedding-based RAG with deliberate reranking pipelines.
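One way to picture those layers composing (a toy sketch with made-up documents and three-dimensional stand-in vectors, not our pipeline): filter deterministically on structured business metadata first, then rank only the survivors by semantic similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Each document carries structured metadata (from opinionated pre-processing)
# and an embedding (for the long tail). Vectors here are toy stand-ins.
docs = [
    {"id": "nda-1", "counterparty": "Acme Corp", "churned": True,  "vec": [0.9, 0.1, 0.0]},
    {"id": "nda-2", "counterparty": "Acme Corp", "churned": False, "vec": [0.8, 0.2, 0.1]},
    {"id": "msa-1", "counterparty": "Other Co",  "churned": False, "vec": [0.1, 0.9, 0.2]},
]

def retrieve(query_vec, counterparty):
    # Layer 1: deterministic metadata filter -- business context, not similarity.
    candidates = [d for d in docs if d["counterparty"] == counterparty and not d["churned"]]
    # Layer 2: semantic ranking over the survivors only.
    return sorted(candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)

results = retrieve([0.85, 0.15, 0.05], "Acme Corp")
print([d["id"] for d in results])
```

Note that the most textually similar document (`nda-1`) never reaches the ranking step: the churned-customer flag—business context a naive RAG system would miss—excludes it first.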

The point isn't to eliminate judgment—it's to surface the few things that genuinely deserve attention, and get everything else out of the way.

-

Evaluation

One of the clearest examples of the "last mile" problem is evaluation.

In coding, evaluation is largely verifiable—code runs or it doesn't, tests pass or fail. That's enabled enormous success with reinforcement learning at the foundation-model level.

Legal is different. The space is largely non-verifiable. There's rarely a single correct answer. A contract redline might be "correct" in one commercial context and inappropriate in another.

Our evals serve two purposes. First, codifying the behavior we intend the system to have—things we don't want to degrade. Second, capturing behavior we observe in bug reports and less-than-ideal outputs from users, so we can hill-climb on fixes immediately.

We work side by side with our legal engineering team on this. They're often the ones defining the evals. We evaluate on many dimensions, starting with classification-style questions—did we identify the right issue?—before moving to legal relevance, accuracy, and whether outputs integrate with downstream systems.
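A minimal sketch of the classification-style starting point (the case structure, labels, and stand-in classifier are invented for illustration): each eval case pins an expected issue label, so a regression shows up as a falling score.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    request: str
    expected_issue: str  # label defined with the legal engineering team

def run_evals(cases, classify):
    """Score an issue classifier against pinned expectations."""
    passed = sum(1 for c in cases if classify(c.request) == c.expected_issue)
    return passed / len(cases)

cases = [
    EvalCase("Counterparty wants unlimited liability", "liability_cap"),
    EvalCase("They struck the mutual indemnification clause", "indemnification"),
    EvalCase("Payment terms extended to net-90", "payment_terms"),
]

# A trivial stand-in classifier; in practice this would be the
# GenAI system's issue-spotting step.
def keyword_classifier(request: str) -> str:
    text = request.lower()
    if "liability" in text:
        return "liability_cap"
    if "indemnification" in text:
        return "indemnification"
    return "payment_terms"

print(run_evals(cases, keyword_classifier))  # fraction of cases with the right issue identified
```

The "did we identify the right issue?" layer is mechanically checkable like this; the later layers—legal relevance, accuracy, downstream integration—need rubric-based or human-in-the-loop grading.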

Done properly, this is a durable edge. It requires people who understand the legal domain, the underlying technology, and the evaluation process itself. That combination is rare.

-

Why This Blog Exists

The above covers the areas where we have a clear vision and are executing deliberately. But these are "strong opinions, loosely held"—and plenty remains genuinely uncertain.

This blog isn't meant to present a clean, linear story of progress. The point is to capture the messiness—and to share what we're learning as we build. Some open questions we are sitting with:

Cost Trajectory: Will intelligence become too cheap to meter, or will today's subsidized pricing reverse? Either path would have profound implications for how we build our AI solutions.

Eval Maintenance: How do we build an eval framework and process that is robust to the rapid expansion and drift in our product surface?

Feedback Loops: Interaction with AI should improve the system over time. We've centered on this from Day 1—but it's hard in practice. How do we design systems and UX to capture user preference and outcomes effectively?

We don't have answers yet. But building serious products means engaging with these questions directly rather than pretending they're settled.

The companies that win won't be the ones who rushed to the easiest 80%, or mistook a demo for a solution. They'll be the ones willing to solve the Context Crisis and do the hard things anyway.