How to Evaluate You.com Search API

A practical guide to benchmarking You.com’s Search API: methodology, configurations, datasets, and real performance tradeoffs.

Why This Guide Exists

Most developer docs treat evaluation like checking boxes. This guide treats it like shipping production code: you need real benchmarks, honest tradeoffs, and configurations that actually work.

We’ll cover:

  • Retrieval Quality - Does it actually find what you need?
  • Latency - Fast enough for your users?
  • Freshness - Can it handle “what happened today?”
  • Cost - What’s your burn rate per query?
  • Agent Performance - Does it work in multi-step reasoning workflows?

Want help running your eval? Our team can design and run custom benchmarks for your use case. Talk to us

The Golden Rule: Start Simple, Stay Fair

TL;DR: Use default settings. Don’t over-engineer your first eval.

Most failed evaluations have one thing in common: people add too many parameters too early.

```python
from youdotcom import You

with You("your_api_key") as you:
    result = you.search.unified(
        query=query,
        count=10
    )

# That's it. No date filters. No domain whitelists. Just search.
```

When to Add Complexity

Add parameters ONLY when:

  1. Your evaluation explicitly tests that feature (e.g., freshness requires the freshness parameter)
  2. You’ve already run baseline evals and know what you’re optimizing for
  3. The parameter reflects actual production usage, not hypothetical edge cases

Anti-pattern: “Let me add every possible parameter to make this perfect”

Better approach: “Let me run this with defaults, measure performance, then iterate”
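
A minimal sketch of that iterate loop, reusing the `you` client from the snippet above (the scoring step is yours to supply):

```python
# Baseline first, then one targeted variant at a time
baseline = {"count": 10}
freshness_variant = {**baseline, "freshness": "week"}  # add only because the eval tests freshness

for name, params in [("baseline", baseline), ("freshness", freshness_variant)]:
    result = you.search.unified(query=query, **params)
    # ...score the result with your own grader and compare before adding anything else
```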


API Parameters Reference

The Search API (GET https://ydc-index.io/v1/search) accepts these parameters:

| Parameter | Type | Purpose | When to Use |
| --- | --- | --- | --- |
| query | string (required) | The search query | Always |
| count | integer (default: 10) | Max results per section (web/news) | Fix at 10 for fair comparisons |
| freshness | string | Time filter: day, week, month, year, or YYYY-MM-DDtoYYYY-MM-DD | Time-sensitive queries only |
| livecrawl | string | Get full page content: web, news, or all | Essential for RAG/synthesis |
| livecrawl_formats | string | Content format: html or markdown | When using livecrawl |
| country | string | Geographic focus (e.g., US, GB) | Location-specific queries |
| language | string | Result language in BCP 47 (default: EN) | Non-English queries |
| safesearch | string | Content filter: off, moderate, strict | Production (default: moderate) |
| offset | integer | Pagination (multiples of count) | Fetching more results |
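
These names map onto keyword arguments in the Python SDK examples later in this guide; a sketch of the full set (offset isn't shown elsewhere here, so treat that kwarg as an assumption):

```python
# Every HTTP parameter above as a keyword argument; only add the ones your eval needs
result = you.search.unified(
    query="solid-state battery startups",
    count=10,
    freshness="month",
    livecrawl="web",
    livecrawl_formats="markdown",
    country="US",
    language="EN",
    safesearch="moderate",
    offset=0,  # assumption: forwarded like the other parameters; check the API reference
)
```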

Latency: Compare Apples to Apples

Critical insight: Never compare APIs with wildly different latency profiles.

A 200ms API and a 3000ms API serve different use cases. Comparing them is like comparing a bicycle to a freight train.

Latency Buckets

| Latency Class | Use Cases | Compare Within |
| --- | --- | --- |
| Ultra-fast (< 200ms) | Autocomplete, real-time voice agents | Other sub-200ms systems |
| Fast (200-800ms) | Chatbots, user-facing QA | Similar mid-latency APIs |
| Deep (> 1000ms) | Research, multi-hop reasoning, batch processing | Other comprehensive search systems |

Fair Comparison Framework

```python
# Good: Comparing within the same latency class
compare_systems([
    "You.com (350ms p50)",
    "Competitor A (380ms p50)",
    "Competitor B (290ms p50)"
])

# Bad: Comparing across latency classes
compare_systems([
    "You.com (350ms p50)",
    "Deep Research API (2800ms p50)"  # Meaningless comparison
])
```
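
`compare_systems` above is illustrative. To find out which latency class a system actually falls into before comparing, a rough p50 probe might look like this (assuming a small list of representative queries and the client from earlier):

```python
import time

def p50_latency_ms(you, queries):
    # Time a handful of representative queries and take the median
    timings = []
    for q in queries:
        start = time.time()
        you.search.unified(query=q, count=10)
        timings.append((time.time() - start) * 1000)
    timings.sort()
    return timings[len(timings) // 2]
```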

Configuration Examples

Minimal Config (Start Here)

```python
from youdotcom import You

with You("your_api_key") as you:
    result = you.search.unified(
        query=query,
        count=10
    )
```

With Full Page Content (for RAG)

```python
result = you.search.unified(
    query=query,
    count=10,
    livecrawl="web",
    livecrawl_formats="markdown"
)

# Access full content via result.results.web[0].contents.markdown
```

Freshness Config (Time-Sensitive Queries)

```python
result = you.search.unified(
    query="latest AI news",
    count=10,
    freshness="day"  # or "week", "month", "year", "2024-01-01to2024-12-31"
)
```

Raw HTTP Request

```bash
curl -X GET "https://ydc-index.io/v1/search?query=climate+tech+startups&count=10&livecrawl=web&livecrawl_formats=markdown" \
  -H "X-API-Key: your_api_key"
```

Evaluation Workflow: 4 Steps That Actually Work

1. Define What You’re Testing

Don’t start with “let’s evaluate everything.” Start with:

  • What capability matters? (speed? accuracy? freshness?)
  • What latency can you tolerate?
  • Single-step retrieval or multi-step reasoning?

Example scope: “We need 90%+ accuracy on customer support questions with < 500ms latency”
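
One way to keep that scope honest is to write it down as explicit pass/fail thresholds before running anything (illustrative names, not part of the API):

```python
# Hypothetical example: encode the eval scope up front
EVAL_SCOPE = {
    "capability": "customer support QA",
    "min_accuracy": 0.90,       # "90%+ accuracy"
    "max_p50_latency_ms": 500,  # "< 500ms latency"
}

def meets_scope(accuracy: float, p50_latency_ms: float) -> bool:
    return (accuracy >= EVAL_SCOPE["min_accuracy"]
            and p50_latency_ms <= EVAL_SCOPE["max_p50_latency_ms"])
```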

2. Pick Your Dataset

| Dataset | Tests | Notes |
| --- | --- | --- |
| SimpleQA | Fast factual QA | Good baseline |
| FRAMES | Multi-step reasoning | Agentic workflows |
| FreshQA | Time-sensitive queries | Use with freshness param |
| Custom (your data) | Domain-specific accuracy | Start here |

Pro tip: Start with public benchmarks, but your production queries are the real test.

Need help building a custom dataset? We can help
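
The eval loop in the next step calls a load_dataset() helper that yields items with question and answer fields; a minimal sketch, assuming your custom data lives in a JSONL file with those two keys:

```python
import json

def load_dataset(path):
    # One JSON object per line: {"question": "...", "answer": "..."}
    with open(path) as f:
        for line in f:
            yield json.loads(line)
```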

3. Run Your Eval

```python
from youdotcom import You
import time

def run_eval(dataset_path, config):
    # Assumes you supply load_dataset(), an llm client, and evaluate_answer()
    results = []

    with You("your_api_key") as you:
        for item in load_dataset(dataset_path):
            query = item['question']
            expected = item['answer']

            # Step 1: Retrieve
            start = time.time()
            search_results = you.search.unified(
                query=query,
                count=config.get('count', 10),
                livecrawl=config.get('livecrawl'),
                freshness=config.get('freshness')
            )
            latency = (time.time() - start) * 1000

            # Step 2: Synthesize answer using your LLM
            snippets = [r.snippets[0] for r in search_results.results.web if r.snippets]
            context = "\n".join(snippets)
            answer = llm.generate(
                f"Answer using only this context:\n{context}\n\n"
                f"Question: {query}\nAnswer:"
            )

            # Step 3: Grade
            grade = evaluate_answer(expected, answer)

            results.append({
                'correct': grade == 'correct',
                'latency_ms': latency
            })

    # Calculate metrics
    accuracy = sum(r['correct'] for r in results) / len(results)
    p50_latency = sorted([r['latency_ms'] for r in results])[len(results)//2]

    return {
        'accuracy': f"{accuracy:.1%}",
        'p50_latency': f"{p50_latency:.0f}ms"
    }
```
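
run_eval() above also assumes an evaluate_answer() grader. A naive sketch to get started; for real numbers you'd typically swap in an LLM judge or token-level scoring:

```python
def evaluate_answer(expected: str, answer: str) -> str:
    # Crude substring match -- fine for smoke tests, too blunt for final reporting
    return 'correct' if expected.strip().lower() in answer.lower() else 'incorrect'
```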

4. Analyze & Iterate

Look at:

  • Accuracy vs latency tradeoff - Can you get 95% accuracy at 300ms?
  • Failure modes - Which queries fail? Is there a pattern?
  • Cost - What’s your $/1000 queries?

Then iterate:

  • Add livecrawl if snippets aren’t giving enough context
  • Add freshness if failures are due to stale content
  • Compare against competitors in the same latency class
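
To spot those failure patterns and decide which change to try, a rough bucketing pass helps, assuming you also record each query alongside its grade (hypothetical heuristics; tune them to your domain):

```python
from collections import Counter

def bucket_failures(results):
    # results: list of dicts like {'query': ..., 'correct': bool}
    buckets = Counter()
    for r in results:
        if r['correct']:
            continue
        q = r['query'].lower()
        if any(tok in q for tok in ('today', 'latest', 'this week')):
            buckets['time-sensitive -> try freshness'] += 1
        elif len(q.split()) > 12:
            buckets['long / multi-hop -> try livecrawl or more results'] += 1
        else:
            buckets['other'] += 1
    return buckets
```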

Response Structure

The API returns results in two sections:

```json
{
  "results": {
    "web": [
      {
        "url": "https://example.com/article",
        "title": "Article Title",
        "description": "Brief description",
        "snippets": ["Relevant text from the page..."],
        "page_age": "2025-01-20T12:00:00Z",
        "contents": {
          "markdown": "Full page content if livecrawl enabled"
        }
      }
    ],
    "news": [
      {
        "title": "News Headline",
        "description": "News summary",
        "url": "https://news.example.com/story",
        "page_age": "2025-01-25T08:00:00Z"
      }
    ]
  },
  "metadata": {
    "search_uuid": "uuid",
    "query": "your query",
    "latency": 0.234
  }
}
```
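
A quick way to walk that response in the Python SDK, assuming the news and metadata sections map onto attributes the same way results.web does in the earlier snippets:

```python
for hit in result.results.web:
    print(hit.title, hit.url)
    if hit.snippets:
        print("  snippet:", hit.snippets[0])
    if hit.contents and hit.contents.markdown:  # populated only when livecrawl is enabled
        print("  full content chars:", len(hit.contents.markdown))

for story in result.results.news:
    print(story.title, story.page_age)

print("server-reported latency (s):", result.metadata.latency)
```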

Tool Calling for Agents

When evaluating You.com in agentic workflows, keep the tool definition minimal.

Open-source evaluation framework: Check out Agentic Web Search Playoffs for a ready-to-use benchmark comparing web search providers in agentic contexts.

```python
search_tool = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web using You.com. Returns relevant snippets and URLs.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query"
                }
            },
            "required": ["query"]
        }
    }
}
```

Note: Don’t expose freshness, livecrawl, or other parameters to the agent unless necessary. Let the agent focus on formulating good queries.

Implementation

```python
import json

from youdotcom import You

def handle_tool_call(tool_call):
    query = tool_call.arguments["query"]

    with You("your_api_key") as you:
        results = you.search.unified(query=query, count=10)

    # Format for agent consumption
    formatted = []
    for r in results.results.web[:5]:
        formatted.append({
            "title": r.title,
            "snippet": r.snippets[0] if r.snippets else r.description,
            "url": r.url
        })

    return json.dumps(formatted)
```

Common Mistakes to Avoid

1. Over-Filtering Too Early

Don’t:

```python
result = you.search.unified(
    query=query,
    freshness="week",
    country="US",
    language="EN",
    safesearch="strict"
)
```

Do:

```python
result = you.search.unified(query=query, count=10)  # Start simple
```

2. Ignoring Your Actual Queries

Don’t just run: Public benchmarks

Also run: Your actual user queries from production logs

3. Not Measuring What Users Care About

Don’t only measure: Technical accuracy

Also measure: Click-through rate, task completion, reformulation rate

4. Testing in Isolation

Don’t test: Search API alone

Test: Full workflow (search -> synthesis -> grading) with your actual LLM and prompts


Debugging Performance Issues

If Accuracy is Low (< 85%)

  1. Are you requesting enough results? Try count=15
  2. Enable livecrawl for full page content:

```python
results = you.search.unified(
    query=query,
    count=15,
    livecrawl="web",
    livecrawl_formats="markdown"
)
```

  3. Is your synthesis prompt good? Test with GPT-4
  4. Is your grading fair? Manually review a sample (see the sampling sketch below)
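
A small helper for that manual review, assuming your result records also keep the question, the expected answer, and the model answer so a human can judge them side by side:

```python
import random

def sample_for_review(results, k=20):
    # Pull a random slice of graded items so a human can spot-check the grader
    graded = [r for r in results if 'correct' in r]
    return random.sample(graded, min(k, len(graded)))
```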

If Results are Stale

```python
# Force fresh results
results = you.search.unified(
    query=query,
    count=10,
    freshness="day"  # or "week", "month"
)
```

Still stuck? Our team has run hundreds of search evals. Get hands-on help

Production Checklist

1. Run Comparative Benchmarks

```python
configs = [
    {'count': 5},
    {'count': 10},
    {'count': 10, 'livecrawl': 'web', 'livecrawl_formats': 'markdown'}
]

for config in configs:
    results = run_eval('your_dataset.json', config)
    print(f"{config}: {results}")
```

2. Set Up Monitoring

```python
# What to log for each search call
{
    'query': query,
    'latency_ms': latency,
    'num_results_returned': len(results.results.web),
    'used_livecrawl': bool(livecrawl),
    'freshness_filter': freshness,
    'timestamp': now()
}
```
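
A thin wrapper is one way to emit that record on every call; a sketch using the standard logging module and the client from earlier, with field names matching the schema above:

```python
import logging
import time
from datetime import datetime, timezone

logger = logging.getLogger("search_eval")

def logged_search(you, query, *, count=10, livecrawl=None, freshness=None):
    start = time.time()
    results = you.search.unified(query=query, count=count,
                                 livecrawl=livecrawl, freshness=freshness)
    logger.info({
        'query': query,
        'latency_ms': (time.time() - start) * 1000,
        'num_results_returned': len(results.results.web),
        'used_livecrawl': bool(livecrawl),
        'freshness_filter': freshness,
        'timestamp': datetime.now(timezone.utc).isoformat(),
    })
    return results
```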

3. Document Everything

```markdown
## Search Evaluation - 2025-01-26

**Dataset**: 500 customer support queries
**Config**: count=10, livecrawl=web
**Results**:
- Accuracy: 91.2%
- P50 Latency: 445ms
- P95 Latency: 892ms

**Decision**: Ship with livecrawl enabled - improves synthesis quality
```

Getting Help

Our team has run hundreds of search evals and can design and run custom benchmarks for your use case. Talk to us

Remember: The best evaluation is the one you actually run. Start simple, measure what matters, and iterate.