{"id":3410,"date":"2026-05-04T20:52:54","date_gmt":"2026-05-04T20:52:54","guid":{"rendered":"https:\/\/deepseek.international\/?p=3410"},"modified":"2026-05-04T20:53:10","modified_gmt":"2026-05-04T20:53:10","slug":"startup-case-study-building-on-deepseek","status":"publish","type":"post","link":"https:\/\/deepseek.international\/zh\/startup-case-study-building-on-deepseek\/","title":{"rendered":"Building on DeepSeek in 2026: A Startup Case Study That Didn\u2019t Go as Planned"},"content":{"rendered":"<p>I didn\u2019t start this thinking it would turn into a \u201ccase study.\u201d It was supposed to be a quick internal tool. Something narrow. We had a content ops problem\u2014too many inputs, inconsistent briefs, and no one wanted to touch the spreadsheet anymore because it had become this semi-broken, semi-sacred document that nobody owned.<\/p>\n\n\n\n<p>DeepSeek looked like the obvious choice at the time.<\/p>\n\n\n\n<p>Not because it was perfect, but because it was weirdly good at handling long context chains without collapsing into summaries too early. That mattered more than benchmarks. Benchmarks never tell you what happens when your prompt is 4,000 tokens deep and half of it is conflicting instructions from three different people.<\/p>\n\n\n\n<p>So we built on it.<\/p>\n\n\n\n<p>Or tried to.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>The original idea wasn\u2019t even ambitious. 
We wanted an agent that could:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>ingest messy client briefs (Google Docs, Notion exports, Slack threads dumped into text)<\/li>\n\n\n\n<li>normalize them into structured inputs<\/li>\n\n\n\n<li>generate drafts<\/li>\n\n\n\n<li>push those drafts into a review queue<\/li>\n\n\n\n<li>track revisions across versions without losing intent<\/li>\n<\/ul>\n\n\n\n<p>That\u2019s it. No \u201cAI platform.\u201d No pitch deck language. Just a pipeline that didn\u2019t fall apart halfway through.<\/p>\n\n\n\n<p>DeepSeek handled the ingestion phase surprisingly well. It didn\u2019t freak out when the input included duplicate instructions or contradicting tone guidelines. It just\u2026 tried. Sometimes too hard.<\/p>\n\n\n\n<p>There were moments where it would reconcile contradictions in ways that looked intelligent, but weren\u2019t actually helpful. Like merging two different brand voices into something that sounded \u201cbalanced,\u201d which is exactly what nobody wanted.<\/p>\n\n\n\n<p>Still, early results were good enough to keep going.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>Where things started getting unstable was when we introduced agents.<\/p>\n\n\n\n<p>Agent Mode in 2026 feels like it should be reliable by now, but it still behaves like an overconfident intern that occasionally ignores half the instructions because it found a \u201cbetter way\u201d to do something.<\/p>\n\n\n\n<p>We set up a chain:<\/p>\n\n\n\n<p>Input Agent \u2192 Structuring Agent \u2192 Draft Agent \u2192 QA Agent<\/p>\n\n\n\n<p>Simple on paper.<\/p>\n\n\n\n<p>In reality, the Structuring Agent would sometimes rewrite the brief instead of structuring it. Not summarize\u2014rewrite. 
It would reinterpret client intent based on patterns it had seen before.<\/p>\n\n\n\n<p>That sounds impressive until you realize it was hallucinating constraints that didn\u2019t exist.<\/p>\n\n\n\n<p>We tried locking it down with stricter instructions. That helped for about a day.<\/p>\n\n\n\n<p>Then it started overfitting to those instructions and ignoring edge cases completely.<\/p>\n\n\n\n<p>There\u2019s this weird middle zone with DeepSeek where if you give it too much freedom, it improvises. If you restrict it too much, it becomes brittle.<\/p>\n\n\n\n<p>We never really found the balance. We just kept adjusting.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>Memory 2.0 made things worse before it made anything better.<\/p>\n\n\n\n<p>At first, persistent memory seemed like exactly what we needed. The system could remember client preferences, tone guidelines, formatting quirks. In theory, that removes a lot of repeated prompt overhead.<\/p>\n\n\n\n<p>In practice, it started storing the wrong things.<\/p>\n\n\n\n<p>It would remember a one-off correction as if it were a permanent rule.<\/p>\n\n\n\n<p>Example: a client once asked for shorter paragraphs in a single draft. That instruction got stored, and suddenly every future draft for that client became aggressively fragmented. Even when longer form was explicitly requested.<\/p>\n\n\n\n<p>There was no clean way to audit what the memory layer had absorbed without digging through logs that weren\u2019t really meant for humans.<\/p>\n\n\n\n<p>We ended up building a manual \u201cmemory reset\u201d step into the workflow. Which defeats the purpose, but it was the only way to stop drift.<\/p>\n\n\n\n<p>And drift is the right word for it. Nothing breaks suddenly. 
It just slowly stops aligning.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>Somewhere in the middle of all this, we hit scaling issues that weren\u2019t obvious at first.<\/p>\n\n\n\n<p>DeepSeek handled individual runs well. But when we tried to process batches\u2014say 40 briefs at once\u2014the behavior became inconsistent.<\/p>\n\n\n\n<p>Not slower. Just inconsistent.<\/p>\n\n\n\n<p>Some outputs would be clean. Others would ignore half the structure. A few would revert to earlier formatting styles we thought we had eliminated.<\/p>\n\n\n\n<p>We spent too long assuming this was our fault.<\/p>\n\n\n\n<p>Maybe it was prompt variance. Maybe it was input formatting. Maybe something in the agent chain was introducing randomness.<\/p>\n\n\n\n<p>We tried standardizing everything.<\/p>\n\n\n\n<p>Same templates. Same structure. Same input cleaning process.<\/p>\n\n\n\n<p>It reduced variability slightly, but didn\u2019t eliminate it.<\/p>\n\n\n\n<p>At some point, we had to accept that the model itself behaves differently under load. Not in a documented way. Just\u2026 differently.<\/p>\n\n\n\n<p>That\u2019s hard to design around.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>There was also this ongoing tension between DeepSeek and traditional search.<\/p>\n\n\n\n<p>We initially tried to let the system \u201cresearch\u201d missing context using AI-powered search integrations. 
That worked fine for broad topics, but it struggled with anything niche or recent.<\/p>\n\n\n\n<p>The model would confidently fill gaps using outdated patterns rather than admitting uncertainty.<\/p>\n\n\n\n<p>Which meant the QA Agent had to become more aggressive.<\/p>\n\n\n\n<p>But then the QA Agent started overcorrecting.<\/p>\n\n\n\n<p>It would flag things that were actually fine, just because they didn\u2019t match its internal expectations.<\/p>\n\n\n\n<p>So now we had a system where one agent was too confident and another was too skeptical.<\/p>\n\n\n\n<p>And we were stuck in the middle, trying to decide which one to trust.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>One of the more frustrating issues wasn\u2019t technical. It was plan limitations.<\/p>\n\n\n\n<p>We were running this on a mix of Pro and Go tier accounts during early testing. Not ideal, but it gave us a sense of how the system would behave under constraints.<\/p>\n\n\n\n<p>Usage caps hit faster than expected.<\/p>\n\n\n\n<p>Not because of output size, but because of retries.<\/p>\n\n\n\n<p>When an agent chain fails halfway through, you don\u2019t just lose that run. You often need to rerun the entire sequence because intermediate states aren\u2019t reusable in a clean way.<\/p>\n\n\n\n<p>So a single \u201ctask\u201d might consume 3\u20134x the expected usage.<\/p>\n\n\n\n<p>That makes cost modeling messy.<\/p>\n\n\n\n<p>And it also changes how you design the system.<\/p>\n\n\n\n<p>We started breaking workflows into smaller chunks just to reduce the impact of failures. But that introduced new problems with context fragmentation.<\/p>\n\n\n\n<p>There\u2019s no clean solution there. 
Just tradeoffs.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>At some point, we tried comparing this setup with a GPT-5.5-based stack.<\/p>\n\n\n\n<p>Not because we wanted to switch, but because we needed a baseline.<\/p>\n\n\n\n<p>GPT-5.5 was more predictable in agent chains. Less creative, but more consistent.<\/p>\n\n\n\n<p>DeepSeek was better at handling messy inputs, but less stable when chained across multiple steps.<\/p>\n\n\n\n<p>So we ended up in this weird hybrid mindset:<\/p>\n\n\n\n<p>Use DeepSeek for ingestion and early-stage synthesis<br>Use GPT-5.5 for structured output and validation<\/p>\n\n\n\n<p>But that added complexity we were originally trying to avoid.<\/p>\n\n\n\n<p>And syncing context between two systems is not trivial. You lose nuance. You introduce translation errors.<\/p>\n\n\n\n<p>We stuck with DeepSeek longer than we probably should have, partly because switching felt like admitting the architecture was flawed.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>There was one failure that kind of forced us to rethink everything.<\/p>\n\n\n\n<p>We had a batch of about 60 briefs. Not huge, but enough to test reliability.<\/p>\n\n\n\n<p>Halfway through, the Draft Agent started producing outputs in a completely different format.<\/p>\n\n\n\n<p>Not random\u2014consistent, but wrong.<\/p>\n\n\n\n<p>It looked like it had picked up a pattern from somewhere else and applied it across the batch.<\/p>\n\n\n\n<p>We checked the inputs. Nothing had changed.<\/p>\n\n\n\n<p>We checked the prompts. No updates.<\/p>\n\n\n\n<p>We checked memory. 
Nothing obvious.<\/p>\n\n\n\n<p>The only theory that made sense was that the model had drifted internally during the run.<\/p>\n\n\n\n<p>That\u2019s not something you can debug in a traditional way.<\/p>\n\n\n\n<p>You can\u2019t roll back a model state mid-process.<\/p>\n\n\n\n<p>So we had to discard half the outputs and rerun everything.<\/p>\n\n\n\n<p>Which brings us back to usage caps.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>At this point, the project wasn\u2019t failing, but it wasn\u2019t stable either.<\/p>\n\n\n\n<p>We could get good results. Just not consistently enough to trust automation fully.<\/p>\n\n\n\n<p>So we shifted the goal.<\/p>\n\n\n\n<p>Instead of building a fully autonomous pipeline, we moved toward a semi-assisted system.<\/p>\n\n\n\n<p>Agents still handled ingestion and initial drafting, but humans stayed in the loop earlier.<\/p>\n\n\n\n<p>That reduced efficiency, but increased reliability.<\/p>\n\n\n\n<p>Not a satisfying tradeoff, but a necessary one.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>One thing that did improve over time was prompt layering.<\/p>\n\n\n\n<p>We stopped trying to write perfect prompts.<\/p>\n\n\n\n<p>Instead, we used smaller, imperfect prompts stacked together.<\/p>\n\n\n\n<p>Each agent had a narrower role.<\/p>\n\n\n\n<p>Less room for interpretation.<\/p>\n\n\n\n<p>It didn\u2019t eliminate errors, but it made them easier to trace.<\/p>\n\n\n\n<p>If something went wrong, we could usually identify which layer introduced the issue.<\/p>\n\n\n\n<p>That alone saved a lot of time.<\/p>\n\n\n\n<p><a href=\"https:\/\/cloudsecurityalliance.org\/blog\/2025\/03\/25\/deepseek-behind-the-hype-and-headlines\" target=\"_blank\" rel=\"noopener\">DeepSeek: Behind the Hype and Headlines | CSA<\/a><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>I still think DeepSeek is useful.<\/p>\n\n\n\n<p>But not in the way 
most startup case studies frame it.<\/p>\n\n\n\n<p>It\u2019s not a plug-and-play foundation for fully autonomous systems.<\/p>\n\n\n\n<p>It\u2019s more like a powerful but unpredictable collaborator.<\/p>\n\n\n\n<p>Good at handling chaos, bad at maintaining consistency over long chains.<\/p>\n\n\n\n<p>If your product depends on stability, you\u2019ll spend a lot of time compensating for that.<\/p>\n\n\n\n<p>If your product benefits from flexibility, it can feel like a superpower.<\/p>\n\n\n\n<p>We were somewhere in between.<\/p>\n\n\n\n<p>Which is probably why this never turned into a clean success story.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>There are still parts of the system running today.<\/p>\n\n\n\n<p>Mostly the ingestion layer.<\/p>\n\n\n\n<p>We\u2019ve replaced or modified almost everything else.<\/p>\n\n\n\n<p>Not because DeepSeek failed completely, but because the friction never really went away.<\/p>\n\n\n\n<p>And when you\u2019re building something that needs to run every day, small inconsistencies become big problems.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>I don\u2019t think the lesson here is \u201cdon\u2019t build on DeepSeek.\u201d<\/p>\n\n\n\n<p>It\u2019s more like:<\/p>\n\n\n\n<p>Don\u2019t assume the model will behave the same way twice just because it did once.<\/p>\n\n\n\n<p>And don\u2019t design your system as if it will.<\/p>\n\n\n\n<p>Everything we built that assumed consistency eventually had to be rewritten.<\/p>\n\n\n\n<p>Everything that allowed for drift, retries, and human intervention\u2026 survived.<\/p>\n\n\n\n<p>Not elegantly. But it worked.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>There\u2019s probably a version of this story where everything lines up and the system just works.<\/p>\n\n\n\n<p>We didn\u2019t hit that version.<\/p>\n\n\n\n<p>Maybe it exists. 
Maybe it doesn\u2019t.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>Some questions that kept coming up during this whole process:<\/p>\n\n\n\n<p>Why does DeepSeek behave differently under batch load even with identical inputs?<br>We never got a clear answer. It doesn\u2019t seem purely random, but it\u2019s not predictable either. Feels like internal optimization tradeoffs leaking into behavior.<\/p>\n\n\n\n<p>Is Memory 2.0 actually useful in production workflows?<br>Sometimes. But only if you actively manage it. Left alone, it accumulates noise faster than signal.<\/p>\n\n\n\n<p>Can agent chains be trusted for end-to-end automation?<br>Not fully. Not yet. They\u2019re good at segments, not full pipelines.<\/p>\n\n\n\n<p>Why not just switch entirely to GPT-5.5?<br>We tried parts of that. It solved some problems and introduced others. Less drift, but also less flexibility with messy inputs.<\/p>\n\n\n\n<p>Is this a DeepSeek problem or an \u201cAI in 2026\u201d problem?<br>Probably both. The tools are powerful, but they\u2019re still inconsistent in ways that are hard to abstract away.<\/p>\n\n\n\n<p>What would we do differently starting over?<br>Smaller scope. Fewer assumptions about reliability. More tolerance for manual checkpoints.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>This isn\u2019t the kind of case study that ends with traction graphs or revenue milestones.<\/p>\n\n\n\n<p>It\u2019s more like a snapshot of what building with AI actually feels like right now.<\/p>\n\n\n\n<p>Messy, uneven, occasionally impressive, and often harder than it looks from the outside.<\/p>\n\n\n\n<p>And still\u2026 kind of worth doing.<\/p>","protected":false},"excerpt":{"rendered":"<p>This isn\u2019t a clean success story. 
It\u2019s what happened when we tried to build something real on DeepSeek in 2026 and ran into the parts nobody documents.<\/p>","protected":false},"author":91,"featured_media":1371,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_gspb_post_css":"","iawp_total_views":0,"footnotes":""},"categories":[24],"tags":[88],"class_list":["post-3410","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-deepseek-stories","tag-breaking"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/deepseek.international\/zh\/wp-json\/wp\/v2\/posts\/3410","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/deepseek.international\/zh\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/deepseek.international\/zh\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/deepseek.international\/zh\/wp-json\/wp\/v2\/users\/91"}],"replies":[{"embeddable":true,"href":"https:\/\/deepseek.international\/zh\/wp-json\/wp\/v2\/comments?post=3410"}],"version-history":[{"count":3,"href":"https:\/\/deepseek.international\/zh\/wp-json\/wp\/v2\/posts\/3410\/revisions"}],"predecessor-version":[{"id":3413,"href":"https:\/\/deepseek.international\/zh\/wp-json\/wp\/v2\/posts\/3410\/revisions\/3413"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/deepseek.international\/zh\/wp-json\/wp\/v2\/media\/1371"}],"wp:attachment":[{"href":"https:\/\/deepseek.international\/zh\/wp-json\/wp\/v2\/media?parent=3410"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/deepseek.international\/zh\/wp-json\/wp\/v2\/categories?post=3410"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/deepseek.international\/zh\/wp-json\/wp\/v2\/tags?post=3410"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}