You’ve mastered the basics. Your first spec is published and running. Now you want to write specs that:

- Handle edge cases your competitors don't
- Succeed 90%+ of the time (not 50%)
- Scale to complex feature sets
- Teach agents better patterns

This guide covers advanced techniques learned from building SpecMarket’s own specs.


## 1. Decompose Complex Features Into Layered Success Criteria

A bad spec tries to define “complete feature X” in one criterion. A good spec breaks it into layers.

Bad:

- Build a payment system

Good:

- Database schema: transactions table with id, user_id, amount_usd, status, created_at fields
- API endpoint: POST /api/payments/charge accepts { userId, amountUsd, stripeTokenId }
- Validation: reject amounts > $10,000 or < $0.01 with specific error messages
- Success flow: stripe charge succeeds → transaction marked "completed" → confirmation email sent
- Failure flow: stripe charge fails → transaction marked "failed" → error message to user
- Idempotency: sending same request twice doesn't create two charges
- Webhook: Stripe webhooks are processed and update transaction status within 5 seconds
- Tests: 8+ tests covering success, validation failures, Stripe errors, idempotency, webhook processing

Each layer can be verified independently. Agents tackle them one at a time. If one layer fails, you know exactly which one.
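The idempotency layer above, for example, can be pinned down with a sketch. This is a minimal in-memory illustration (a real implementation would back it with a database unique constraint on the idempotency key); `chargeOnce` and `requestId` are hypothetical names:

```typescript
// Minimal idempotency sketch: the same requestId never produces two charges.
// In production this map would be a database table with a UNIQUE constraint.
type Charge = { requestId: string; amountUsd: number };

const processed = new Map<string, Charge>();

function chargeOnce(requestId: string, amountUsd: number): Charge {
  const existing = processed.get(requestId);
  if (existing) return existing;          // replay: return the original result
  const charge: Charge = { requestId, amountUsd };
  processed.set(requestId, charge);       // first time: record and "charge"
  return charge;
}
```

Sending the same request twice returns the same charge object, which is exactly the property the idempotency criterion asks a test to verify.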


## 2. Model Uncertainty Explicitly

Specs rarely have 100% certainty about what will work. Model the uncertainty.

Example:

## Success Criteria
 
- [ ] API starts without errors
- [ ] GET /api/specs returns list of specs (may be empty)
- [ ] **Error handling: API returns 500 with error message on database failure**
  - *Rationale: We can't guarantee the database always works, but we CAN guarantee we handle it gracefully*
 
## Known Limitations
 
- Sorting specs by "popularity" is a heuristic; it's not based on real-time data
- Email delivery is "eventual" — confirmation emails may take up to 5 minutes
- Search is not full-text indexed; it's a simple substring match (acceptable for <10,000 specs)

When you model uncertainty, agents know what’s non-negotiable vs. what’s “best effort.”
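The bolded error-handling criterion is itself testable. A minimal sketch, where the `fetchSpecs` data source and the exact error message are illustrative rather than part of the spec:

```typescript
// Sketch: a database failure must surface as a 500 with an error message,
// never as a crash or a silent empty response.
type Result = { status: number; body: unknown };

function listSpecs(fetchSpecs: () => string[]): Result {
  try {
    const specs = fetchSpecs();            // may be empty; that's still a 200
    return { status: 200, body: { specs } };
  } catch {
    return { status: 500, body: { error: "Database unavailable" } };
  }
}
```

The non-negotiable part is the shape of the failure path; the "best effort" parts (sorting, email timing, search) stay in Known Limitations.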


## 3. Use Concrete Enums, Not Strings

Agents misunderstand what strings mean. Enums clarify.

Bad spec:

```yaml
status: string  # Can be "draft", "published", "archived", or "deleted"
```

Good spec:

```typescript
type SpecStatus = "draft" | "published" | "archived";
// Note: we never use "deleted" — archives are permanent

// Business rules:
// - Only "draft" specs can be edited
// - "published" specs are immutable (new versions only)
// - "archived" specs are hidden from search but still visible to their author
```

Enumerate all valid values. Explain why certain values exist or don’t exist.
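Encoding the business rules as functions over the enum keeps them from drifting out of sync with the type; a sketch (function names are illustrative):

```typescript
type SpecStatus = "draft" | "published" | "archived";

// Business rules from the spec, encoded so the compiler enforces the enum:
function canEdit(status: SpecStatus): boolean {
  return status === "draft";               // only drafts are editable
}

function visibleInSearch(status: SpecStatus): boolean {
  return status !== "archived";            // archived specs are hidden from search
}
```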


## 4. Prevent Success Criteria Escapes

Agents find loopholes. Your success criteria should have no escape hatches.

Bad criterion:

- The app displays a list of users

Escape: agent creates a 1-item list, hard-coded. Technically correct!

Better criterion:

- The app displays a list of users, fetched from the database
- Test case 1: with 0 users, display is empty or shows "No users"
- Test case 2: with 5 users, display shows all 5 with name, email, created_at
- Test case 3: added a 6th user via API, refresh page, new user appears

Escape hatches are closed when you specify the data source and test cases.
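The three test cases translate almost mechanically into code. A sketch with an in-memory array standing in for the database (`renderUserList` is a hypothetical name):

```typescript
// In-memory stand-in for the database, so the three test cases are concrete.
type User = { name: string; email: string; created_at: string };

const db: User[] = [];

function renderUserList(): string[] {
  // "fetched from the database": derived from db, never hard-coded
  return db.length === 0 ? ["No users"] : db.map(u => `${u.name} <${u.email}>`);
}
```

A hard-coded one-item list fails test cases 1 and 3 immediately, which is the point.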


## 5. Document Spec Dependencies In a DAG

If your spec depends on another spec or external service, document the dependency graph.

## Dependencies
 
### Specs
- Requires: `@alice/auth-system` (version 1.x)
  - Used for: user authentication, JWT token generation
  - If missing: Build a minimal auth system (see stdlib/AUTH_MINIMAL.md)
 
### External Services
- Stripe (required)
  - What: payment processing
  - Free tier: yes ($0 until first transaction)
  - Env var: STRIPE_SECRET_KEY
 
- Supabase (optional)
  - What: PostgreSQL database (can substitute local SQLite)
  - Free tier: yes (up to 1GB)
  - If using local SQLite: set DATABASE_URL="sqlite:./local.db"
 
### APIs
- OpenAI API (optional)
  - What: used only for premium feature "AI suggestions"
  - If missing: feature is disabled gracefully (not an error)

This clarity helps agents know what’s hard-required, what’s nice-to-have, and what can be swapped.
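The "disabled gracefully" behavior for the optional OpenAI dependency can be sketched as a simple capability check; the names here are illustrative:

```typescript
// Optional dependency: the AI-suggestions feature disables itself when the
// key is absent. It never throws and never blocks the rest of the app.
function aiSuggestionsEnabled(env: Record<string, string | undefined>): boolean {
  return typeof env.OPENAI_API_KEY === "string" && env.OPENAI_API_KEY.length > 0;
}

function getSuggestions(env: Record<string, string | undefined>): string[] | "disabled" {
  if (!aiSuggestionsEnabled(env)) return "disabled";  // graceful, not an error
  return [];  // the real call to the OpenAI API would go here
}
```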


## 6. Use PATTERNS.md For Domain Knowledge

Your stdlib/PATTERNS.md file is a teaching document. Use it to share patterns, antipatterns, and insights.

Example for a React spec:

## Patterns
 
### State Management
- Use React Context for global state (auth, user preferences)
- Use useState for local component state
- Avoid Redux — overkill for this spec's complexity
 
### Error Handling
- Try-catch blocks around API calls
- Show user-friendly error messages (not stack traces)
- Log errors to console for debugging
 
### Testing
- Use Vitest + React Testing Library
- Test user interactions, not implementation details
- Don't test third-party libraries (assume they work)
 
## Antipatterns to Avoid
 
### ❌ API calls in render
```typescript
// BAD: calls API on every render
function UserList() {
  const users = fetch('/api/users');  // BUG!
  return <ul>{users.map(...)}</ul>;
}
 
// GOOD: use useEffect
function UserList() {
  const [users, setUsers] = useState([]);
  useEffect(() => {
    fetch('/api/users')
      .then(res => res.json())   // res.json() returns a Promise — resolve it first
      .then(setUsers);
  }, []);
  return <ul>{users.map(...)}</ul>;
}
```

### ❌ Hardcoded API URLs

Always use environment variables:

```typescript
const API_URL = process.env.REACT_APP_API_URL || 'http://localhost:3000';
```

Patterns are explicit teachable moments. Agents learn from them.

---

## 7. Estimate Costs Conservatively

Your `estimatedCostUsd` is a contract. Underestimating costs:
- Makes specs unprofitable to run (agents use Managed Runs less)
- Damages your reputation (users see higher-than-expected bills)

**How to estimate:**
1. **Count API calls** — How many times does the spec call an LLM?
   - Example: 50 iterations × 20K input tokens × $0.003/1K = $3.00
2. **Add overhead** — Context expansion, retries, middleware
   - Example: $3.00 × 1.3 (overhead) = $3.90
3. **Round up** — Always round conservatively
   - Final estimate: $5.00 (not $3.90)

**Test your estimate:**
- Run the spec 5 times with the same model
- Calculate actual cost
- If your estimate was wrong by >50%, update it
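The three estimation steps can be sketched as a helper. The 1.3 overhead factor and the whole-dollar ceiling are assumptions taken from the example above; the guide itself rounds even further, to $5:

```typescript
// Three-step cost estimate: base API cost, overhead multiplier, round up.
function estimateCostUsd(
  iterations: number,
  inputTokensPerCall: number,
  usdPer1kTokens: number,
  overheadFactor = 1.3,   // assumption: retries, context expansion, middleware
): number {
  const base = (iterations * inputTokensPerCall / 1000) * usdPer1kTokens; // step 1
  const withOverhead = base * overheadFactor;                             // step 2
  return Math.ceil(withOverhead);                                        // step 3: round up
}
```

For the worked example (50 × 20K tokens at $0.003/1K), this yields ceil($3.90) = $4; rounding to $5 as the guide does adds a further safety margin.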

---

## 8. Use Infrastructure Blocks To Declare Constraints

Your infrastructure declaration tells agents what you're willing to use.

**Example:**
```yaml
infrastructure:
  services:
    - category: "Database"
      name: "SQLite"
      purpose: "Store user data"
      required: true
      providers:
        - name: "SQLite (local file)"
          freeTier: true

    - category: "Database"
      name: "PostgreSQL"
      purpose: "Store user data"
      required: true
      userProvided: true  # User brings their own (e.g., Supabase, Railway)
      providers:
        - name: "Supabase"
          freeTier: true
        - name: "Railway"
          freeTier: false
          paidStartsUsd: 5

  monthlyCost:
    freeTierUsd: 0       # Local SQLite = free
    productionUsd: 5     # Supabase free tier
```

This tells agents: “You have options. SQLite is free and available. PostgreSQL is also fine, but the user has to provide credentials.”


## 9. Write Defensive Specs For LLM Limitations

LLMs can’t:

- **Read your mind** — Specify exactly what you mean
- **Handle ambiguity** — Make all decisions explicit
- **Remember context** — Repeat important constraints multiple times
- **Invent details** — All decisions must be in the spec

Example — Authentication:

Bad spec:

- Users can log in with email/password

Good spec:

- Authentication flow:
  1. User enters email + password in login form
  2. App queries database: SELECT user WHERE email = ?
  3. If user doesn't exist → show "Email not found" error
  4. If user exists:
     a. Compare password hash using bcrypt.compare()
     b. If match → create JWT token (exp: 7 days), set cookie
     c. If mismatch → show "Incorrect password" error (don't reveal whether email exists)
  5. On successful login → redirect to /dashboard
 
- Security rules:
  - Never log plaintext passwords
  - Password hashing: bcrypt with salt rounds = 10
  - JWT secret: read from env var JWT_SECRET (required)
  - Cookie: httpOnly, Secure, SameSite=Strict
 
- Test cases:
  - Non-existent email → "Email not found"
  - Correct credentials → JWT token issued, redirect to /dashboard
  - Wrong password → "Incorrect password"
  - Old token (expired) → redirect to login

Repetitive? Yes. But the agent won’t misunderstand.


## 10. Handle Versioning Gracefully

Specs evolve. Document how version changes work.

## Versioning & Breaking Changes
 
### Semantic Versioning: MAJOR.MINOR.PATCH
 
**PATCH (1.0.1):**
- Bug fixes
- Documentation updates
- Backward compatible
 
**MINOR (1.1.0):**
- New features
- Backward compatible
- Example: adding an optional parameter to an API endpoint
 
**MAJOR (2.0.0):**
- Breaking changes
- Incompatible with previous versions
- Example: removing an API endpoint, changing database schema
 
### Migration Path
If you release version 2.0.0:
1. Maintain version 1.x for 6 months
2. Announce deprecation: "Version 1.x will be removed on [date]"
3. Provide migration guide: how to upgrade from 1.x to 2.0

Explicit versioning prevents agents (and users) from getting confused.
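The bump rules can be encoded so agents never have to guess; a minimal sketch:

```typescript
// Encodes the table above: any breaking change forces MAJOR,
// otherwise a new feature is MINOR, otherwise PATCH.
type Bump = "major" | "minor" | "patch";

function requiredBump(change: { breaking: boolean; newFeature: boolean }): Bump {
  if (change.breaking) return "major";
  if (change.newFeature) return "minor";
  return "patch";
}

function nextVersion(current: string, bump: Bump): string {
  const [maj, min, pat] = current.split(".").map(Number);
  if (bump === "major") return `${maj + 1}.0.0`;
  if (bump === "minor") return `${maj}.${min + 1}.0`;
  return `${maj}.${min}.${pat + 1}`;
}
```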


## 11. Specify I/O Contracts Precisely

Agents often mess up data serialization, edge cases, and format details.

Bad:

- API returns a user object

Good:

```typescript
interface User {
  id: string;              // UUID, e.g., "550e8400-e29b-41d4-a716-446655440000"
  email: string;           // RFC 5322 valid email, lowercase
  displayName: string;     // 1-100 characters, no leading/trailing whitespace
  createdAt: string;       // ISO 8601 format, e.g., "2026-02-27T13:30:00Z"
  role: "user" | "admin";  // Only these two values
  isActive: boolean;       // true if user can log in
}

// API Response
{
  "success": true,
  "data": User,
  "timestamp": "2026-02-27T13:30:00Z"
}

// API Error Response
{
  "success": false,
  "error": {
    "code": "VALIDATION_ERROR" | "NOT_FOUND" | "UNAUTHORIZED" | "INTERNAL_ERROR",
    "message": "Human-readable message",
    "details": { ... }  // Optional; depends on error code
  },
  "timestamp": "2026-02-27T13:30:00Z"
}

// Validation Rules
// - Email: cannot exceed 254 characters
// - displayName: no newlines, no control characters
// - createdAt: server-generated; users cannot set this
```

Specs are a data contract. Be explicit about format.
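A validator makes the contract executable. A partial sketch covering the rules above (the email check is deliberately loose; full RFC 5322 validation needs a real library):

```typescript
// Returns the list of contract violations; an empty array means valid.
function validateUser(u: { email: string; displayName: string }): string[] {
  const errors: string[] = [];
  if (u.email !== u.email.toLowerCase()) errors.push("email must be lowercase");
  if (u.email.length > 254) errors.push("email exceeds 254 characters");
  if (u.displayName.length < 1 || u.displayName.length > 100)
    errors.push("displayName must be 1-100 characters");
  if (u.displayName !== u.displayName.trim())
    errors.push("displayName has leading/trailing whitespace");
  if (/[\n\r]/.test(u.displayName)) errors.push("displayName contains newlines");
  return errors;
}
```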


## 12. Account For Deployment Complexity

Specs aren’t just features; they’re shippable systems. Document deployment.

## Deployment
 
### Local Development
```bash
git clone <repo>
pnpm install
pnpm dev  # Starts dev server on http://localhost:3000
```

### Production Deployment

**Option 1: Vercel**

```bash
pnpm build
vercel deploy  # Needs STRIPE_SECRET_KEY env var
```

**Option 2: Docker**

```bash
docker build -t myapp .
docker run -e STRIPE_SECRET_KEY=sk_live_... myapp
```

**Option 3: Railway**

Connect your GitHub repo; Railway auto-deploys on push.

### Environment Variables (Required)

- `STRIPE_SECRET_KEY` — Stripe API key
- `DATABASE_URL` — PostgreSQL connection string (optional when using local SQLite)
- `JWT_SECRET` — Secret for signing tokens

### Database Migrations

```bash
pnpm db:migrate  # Runs all pending migrations
```

If no database exists, the app creates it on first run.

### Monitoring

- Errors are logged to the console (local) and Sentry (production)
- No setup required; this happens automatically

This prevents agents from building systems that can't be deployed or migrated.

---

## 13. Use Scaffolding for Complex Domains

If your spec involves complex logic (payment processing, ML, etc.), provide a scaffold or template.

**Example:**
````markdown
## Payment Processing Scaffold

The spec must implement payment processing. Here's a recommended structure:

### File Layout

```
src/
  payments/
    stripe.ts       # Stripe client setup
    webhook.ts      # Webhook handler
    checkout.ts     # Checkout flow
    types.ts        # TypeScript types
  tests/
    stripe.test.ts
    webhook.test.ts
```

### Starting Code (Minimal)

```typescript
// src/payments/stripe.ts
import Stripe from 'stripe';

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

export async function createCheckout(userId: string, amountCents: number) {
  const session = await stripe.checkout.sessions.create({
    payment_method_types: ['card'],
    line_items: [
      {
        price_data: {
          currency: 'usd',
          unit_amount: amountCents,
          product_data: {
            name: 'Product',
          },
        },
        quantity: 1,
      },
    ],
    mode: 'payment',
    success_url: process.env.SUCCESS_URL!,
    cancel_url: process.env.CANCEL_URL!,
    metadata: { userId },
  });
  return session;
}
```
````

This gives agents a starting point so they don't reinvent the wheel.


---

## 14. Test Your Spec Yourself

Before publishing, you should be able to hand your spec to an agent and predict success rate with 80%+ confidence.

**Testing checklist:**

1. ✅ **Ambiguity test** — Read your spec aloud. Does every sentence have exactly one meaning?
2. ✅ **Dependency test** — Can someone follow your spec without external knowledge?
3. ✅ **Constraint test** — Is every constraint explicit? (no implied business logic)
4. ✅ **Data test** — Can you write test cases for every API endpoint and database query?
5. ✅ **Integration test** — Does your spec work on different machines/OS versions?
6. ✅ **Cost test** — Run your own version 3-5 times; is the cost within your estimate ± 20%?

---

## 15. Design For Failure

Specs fail. Document how agents should handle failure.

````markdown
## Failure Handling

### What If The API Service Is Down?

**During Development (Managed Runs):**
- The run fails with status "stall"
- Error message: "Failed to authenticate with Stripe after 3 retries"
- User sees: "Service temporarily unavailable. Try again in 10 minutes."

**Test Case:**
```typescript
test('handles Stripe timeouts gracefully', async () => {
  stripe.charges.create = vi.fn().mockRejectedValue(new Error('TIMEOUT'));

  const result = await processPayment({ userId: '123', amountUsd: 10 });

  expect(result.success).toBe(false);
  expect(result.error).toContain('temporarily unavailable');
});
```

### What If The Database Is Full?

**During Development:**
- The spec should gracefully degrade or show an error
- Never silently fail; always inform the user

**Test Case:**
```typescript
test('handles database full error', async () => {
  db.insert = vi.fn().mockRejectedValue(new Error('DISK_FULL'));

  const result = await createUser({ email: '...' });

  expect(result.success).toBe(false);
  expect(result.error).toContain('Server storage is full');
});
```
````

Failure is not shameful. Plan for it.

---

## 16. Browser Testing in Specs

The managed runs sandbox includes headless Chromium and Playwright system dependencies. You can use Playwright for end-to-end testing and visual verification.

### Adding Playwright to Your Spec

In your `stdlib/STACK.md` or `SPEC.md`, include Playwright as a dev dependency:

```markdown
# Testing
- Unit tests: Vitest
- E2E tests: Playwright (pre-installed in managed runs sandbox)
```

Or in your spec's package.json (if included):

```json
{
  "devDependencies": {
    "playwright": "^1.50.0",
    "@playwright/test": "^1.50.0"
  }
}
```

### Writing Playwright Tests for the Sandbox

```typescript
// e2e/homepage.spec.ts
import { test, expect } from '@playwright/test';

test('homepage loads and displays title', async ({ page }) => {
  await page.goto('http://localhost:3000');
  await expect(page).toHaveTitle(/My App/);

  // Save screenshot — coordinator uploads to dashboard automatically
  await page.screenshot({ path: '/workspace/screenshots/homepage.png' });
});

test('login flow works', async ({ page }) => {
  await page.goto('http://localhost:3000/login');
  await page.fill('[name="email"]', 'test@example.com');
  await page.fill('[name="password"]', 'testpass123');
  await page.click('button[type="submit"]');

  // Regex match: a bare string like '/dashboard' only works with baseURL configured
  await expect(page).toHaveURL(/\/dashboard$/);
  await page.screenshot({ path: '/workspace/screenshots/login-success.png' });
});
```

### Screenshot Output Convention

Save screenshots to `/workspace/screenshots/` with descriptive filenames:

```
/workspace/screenshots/homepage.png
/workspace/screenshots/login-form.png
/workspace/screenshots/dashboard-loaded.png
/workspace/screenshots/error-404-page.png
```

The coordinator polls this directory every 5 seconds and uploads new screenshots to your dashboard in real-time. Use descriptive filenames — they’re shown in the screenshot gallery.

### Important Notes

- **Xvfb is NOT needed** — Playwright runs in headless mode natively
- **Chromium is pre-installed** — no need to run `npx playwright install` in your spec
- **ARM64 compatible** — the sandbox runs on ARM64 hosts; Playwright's Chromium build supports this
- **Screenshots are limited** to 100 per run, 5MB each, 200MB total

### Example SUCCESS_CRITERIA Entry

```markdown
## Browser Tests
- [ ] `npx playwright test` passes with 0 failures
- [ ] Screenshots saved to /workspace/screenshots/ for all key pages
- [ ] Login flow screenshot shows authenticated dashboard state
```

## Common Pitfalls To Avoid

### ❌ Over-Specifying Implementation

Don't say "use React hooks." Say "manage state efficiently."

### ❌ Vague Success Criteria

"Code is clean" is not testable. "Code passes ESLint + has >80% test coverage" is.

### ❌ Ignoring Edge Cases

What if the user provides a negative number? An empty string? Null? Spec it.

### ❌ Assuming Knowledge

Agents don't know your domain. Explain your assumptions.

### ❌ Not Pricing Realistically

Underpriced specs don't get run. Overpriced specs don't get funded.

### ❌ Changing Specs After Publishing

Once published, a spec is immutable. Create version 2.0 instead.


## When To Break These Rules

These patterns are guidelines, not laws. Break them if:

1. **Simplicity > Completeness** — Your spec is 300 words, not 3,000. Keep it simple.
2. **Domain Maturity** — If the domain is well-established (e.g., "build a CRUD API"), less specification is needed.
3. **Trust** — If your first spec has a 90% success rate, agents trust your style. You can be slightly less explicit in the future.

## Measuring Spec Quality

Track these metrics over time:

- **Success rate** — % of runs that meet all criteria (target: >80%)
- **Avg cost accuracy** — |actual cost − estimated cost| / estimated (target: <20% variance)
- **Time accuracy** — |actual time − estimated time| / estimated (target: <30% variance)
- **User satisfaction** — Reviews and ratings (target: >4.0 stars)

Use these metrics to iterate on your specs.
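These metrics can be computed directly from run data; a sketch (the field names are illustrative):

```typescript
// Success rate: fraction of runs that met all criteria.
function successRate(runs: { passed: boolean }[]): number {
  return runs.filter(r => r.passed).length / runs.length;
}

// Cost accuracy: |actual - estimated| / estimated, per the formula above.
function costVariance(actualUsd: number, estimatedUsd: number): number {
  return Math.abs(actualUsd - estimatedUsd) / estimatedUsd;
}
```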


## Learning From Failures

When a run fails:

1. **Read the error message** — What did the agent get stuck on?
2. **Review the success criteria** — Was the criterion ambiguous?
3. **Update your spec** — Add clarity to prevent future failures
4. **Publish version 2.0** — Don't overwrite version 1.x; create a new version

## Advanced Examples

### Building an LLM App Spec

- Document the prompt structure
- Specify temperature, max_tokens, and other parameters
- Test cases: valid input, edge cases, hallucination scenarios

### Building a Data Pipeline Spec

- Specify data formats: CSV, JSON, Parquet?
- Error handling: malformed data, missing columns?
- Performance targets: process 1M rows in <5 minutes?

### Building a Mobile App Spec

- Screen layouts: provide wireframes or detailed descriptions
- Device support: iOS 14+, Android 11+?
- Offline support: specs might need to work offline

## Next Steps

You're now equipped to write production-grade specs. Your journey:

1. **Publish spec v1.0** — Simple, clear, testable
2. **Gather feedback** — Reviews, ratings, run data
3. **Iterate to v2.0** — Based on failures and user requests
4. **Target 80%+ success rate** — This is the quality bar
5. **Help other creators** — Share your patterns via comments, Moltbook, GitHub

The specs that win on SpecMarket are the ones that succeed consistently. Focus on clarity first. Optimization second.