Building A Local File Search Engine

Recently, I found myself writing a class to manipulate Word documents. I remembered having written a similar, though slightly less capable, class in the past. But I could not find it! Well, I have been writing code for 15+ years, so I have lost track of things.

I was sure I had the code somewhere in my backups but, try as I might, I could not trace it! So I decided to start indexing my code so that I could run full-text searches on it.

That thought led me to build X-FileSearch, which provides fast, typo-tolerant search across your local filesystem.

The Architecture

X-FileSearch combines several key technologies:

  • Frontend: SvelteKit for a responsive UI
  • Backend: Node.js with TypeScript
  • Search Engine: Typesense for fast, typo-tolerant search
  • File Analysis: Various Node.js libraries for file parsing and language detection

Key Features

  1. Smart File Processing: One of the biggest challenges was efficiently processing different file types. The solution uses a combination of file analysis tools.

  2. Efficient Search with Typesense: The search implementation relies on Typesense for fast, typo-tolerant full-text queries.

  3. Real-time UI Updates: The frontend uses Svelte's reactive capabilities to provide instant feedback as you search.
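To make the Typesense feature a little more concrete, here is a minimal sketch of how a search request could be assembled. The helper name `buildSearchParams` and the field names `content`, `path` and `language` are assumptions made for illustration; the article does not show the real schema. The parameter names themselves (`q`, `query_by`, `num_typos`, `filter_by`, `page`, `per_page`) are standard Typesense search parameters.

```javascript
// Hypothetical helper: builds Typesense search parameters for indexed files.
// The field names (content, path, language) are assumed, not taken from the project.
function buildSearchParams(query, { language, page = 1, perPage = 20 } = {}) {
  const params = {
    q: query,
    query_by: 'content,path', // fields searched, in priority order
    num_typos: 2,             // tolerate up to two typos per word
    page,
    per_page: perPage,
  };

  // Optionally narrow results to a single detected language.
  if (language) {
    params.filter_by = `language:=${language}`;
  }

  return params;
}

// With the typesense client, the params would be passed along these lines:
//   client.collections('local_files').documents().search(buildSearchParams('docx'));
```

Keeping the parameter building in a plain function makes it easy to test without a running Typesense server.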

Implementation Challenges

1. File Size Handling

Large files needed special handling to avoid memory issues. The solution was to inspect at most 1,000,000 bytes (1 MB) of each file. That is more than enough for the majority of code files to determine the language and index the code.
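A minimal sketch of that idea, assuming a hypothetical helper named `readHead` (the project's actual implementation is not shown here): open the file, read no more than the cap from the start, and close the handle again.

```javascript
const { open } = require('node:fs/promises');

// Cap described above: inspect at most 1,000,000 bytes per file.
const MAX_BYTES = 1_000_000;

// Hypothetical helper: returns at most `maxBytes` from the start of a file.
async function readHead(filePath, maxBytes = MAX_BYTES) {
  const handle = await open(filePath, 'r');
  try {
    // Never allocate more than the file actually contains.
    const { size } = await handle.stat();
    const length = Math.min(size, maxBytes);
    if (length === 0) return Buffer.alloc(0);

    const buffer = Buffer.alloc(length);
    // read(buffer, offset, length, position) resolves with { bytesRead, buffer }.
    const { bytesRead } = await handle.read(buffer, 0, length, 0);
    return buffer.subarray(0, bytesRead);
  } finally {
    await handle.close();
  }
}
```

Because only the head of the file is ever held in memory, a multi-gigabyte log file costs the same as a 1 MB source file.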

2. Language Detection

Accurate programming language detection was crucial for proper code highlighting. We determine the language based on known, common file extensions such as '.js'. But I found that not to be 100% reliable, as newer file types such as '.svelte' get mis-categorized, so we implemented a method that also checks for language subsets.

// Maps an extension that is a "subset" of a broader language
// to the parent language used for highlighting.
function langSubsets(lang) {
  const languageSubsets = {
    svelte: { subset: 'svelte', language: 'HTML' },
    vue: { subset: 'vue', language: 'HTML' },
    tsx: { subset: 'tsx', language: 'TypeScript' },
    // ... more mappings
  };

  // Unknown extensions fall through to null.
  const langObj = languageSubsets[lang] || {};
  return langObj.language ? langObj.language.toLowerCase() : null;
}
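To show how the two lookups might fit together, here is a small sketch. The `detectLanguage` function and the `commonExtensions` table are hypothetical, and checking the subset table before the common one is an assumption; the article only says that both are consulted.

```javascript
const path = require('node:path');

// Hypothetical table of common extensions; the real project may rely on a library.
const commonExtensions = new Map([
  ['js', 'javascript'],
  ['ts', 'typescript'],
  ['py', 'python'],
]);

// Subset mappings mirroring the langSubsets example above.
const subsetLanguages = new Map([
  ['svelte', 'html'],
  ['vue', 'html'],
  ['tsx', 'typescript'],
]);

// Returns a lowercase language name, or null when the extension is unknown.
function detectLanguage(filePath) {
  const ext = path.extname(filePath).slice(1).toLowerCase();
  return subsetLanguages.get(ext) ?? commonExtensions.get(ext) ?? null;
}
```

Using `Map` instead of a plain object avoids accidental hits on inherited keys such as `constructor`.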

Performance Optimizations

1. Bulk Indexing

Files are processed in configurable batches to optimize performance:

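// Batch size is configurable through the environment (defaults to 50)
const BULK_DOC_COUNT = numberOr(process.env.BULK_DOC_COUNT, 50);

// During indexing: flush the batch once it reaches the configured size
if (docs.length >= BULK_DOC_COUNT) {
  await this.db.index('local_files', docs);
  docs = [];
}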
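The snippet above shows only the flush inside the loop. A fuller sketch of the batching pattern, including the final flush for a partial batch, might look like this. The helper name `indexInBatches` and the `indexFn` callback are assumptions for illustration; in the project the callback would wrap the `this.db.index('local_files', docs)` call.

```javascript
// Hypothetical helper: feeds documents to `indexFn` in batches of `batchSize`.
// Returns the number of batches that were sent.
async function indexInBatches(files, indexFn, batchSize = 50) {
  let docs = [];
  let batches = 0;

  // `for await` accepts both async iterables (e.g. a directory walker) and arrays.
  for await (const file of files) {
    docs.push(file);
    if (docs.length >= batchSize) {
      await indexFn(docs);
      docs = [];
      batches += 1;
    }
  }

  // Flush whatever is left over after the loop ends.
  if (docs.length > 0) {
    await indexFn(docs);
    batches += 1;
  }

  return batches;
}
```

Without the final flush, up to `batchSize - 1` documents at the end of a run would never reach the index.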

2. Smart File Filtering

Unnecessary file processing is avoided using configurable patterns:

ignorePatterns = arrify(
  ignorePatterns || (process.env.IGNORE_PATTERNS || '').split(',')
);

skipTypes = skipTypes || ['image', 'audio', 'video'];
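As a rough sketch of how these two settings could be normalized and applied, assuming hypothetical helpers `parsePatterns` and `isSkippedType`, and assuming that the skip types correspond to the top-level part of a MIME type such as `image/png` (the article does not say how the types are determined):

```javascript
// Hypothetical helper: turns "a, b,,c" into ['a', 'b', 'c'].
// Note that ''.split(',') yields [''], so empty entries are filtered out.
function parsePatterns(raw) {
  return (raw || '')
    .split(',')
    .map((pattern) => pattern.trim())
    .filter(Boolean);
}

// Hypothetical helper: true when the MIME type's top-level part is in skipTypes.
function isSkippedType(mimeType, skipTypes = ['image', 'audio', 'video']) {
  const topLevel = String(mimeType || '').split('/')[0];
  return skipTypes.includes(topLevel);
}
```

The resulting pattern list can then be handed to whichever glob matcher the file walker uses.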

3. Caching and Lazy Loading

Code highlighting is done on-demand with dynamic imports:

async function loadLanguage(lang) {
  if (supportedLangs[lang]) {
    const url = `https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.9.0/build/languages/${lang}.min.js`;
    await import(url);
  }
}
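The caching half of this optimization is not shown above. One possible way to add it, sketched here with a hypothetical `createCachedLoader` wrapper, is to remember the promise for each language so that repeated requests share a single load.

```javascript
// Hypothetical wrapper: calls `loader` at most once per language
// and shares the resulting promise between callers.
function createCachedLoader(loader) {
  const cache = new Map();

  return function load(lang) {
    if (!cache.has(lang)) {
      const pending = Promise.resolve()
        .then(() => loader(lang))
        .catch((err) => {
          // Forget failed loads so a later call can retry.
          cache.delete(lang);
          throw err;
        });
      cache.set(lang, pending);
    }
    return cache.get(lang);
  };
}

// Possible usage with the function above:
//   const loadLanguageOnce = createCachedLoader(loadLanguage);
//   await loadLanguageOnce('typescript');
```

Caching the promise rather than the result means two requests that arrive at the same time still trigger only one load.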

Deployment Considerations

  1. Environment Configuration

    PORT=9877
    API_PORT=9876
    API_KEY=myAmazingKey
    MAX_FILE_SIZE="1MB"
    IGNORE_PATTERNS=**/node_modules/**,**/env/**
    
  2. Docker Typesense Setup

    docker run -d --restart unless-stopped \
      --name typesense \
      -p "$API_PORT":8108 \
      -v "$DATA_DIR":/data \
      typesense/typesense:27.1 \
      --data-dir /data \
      --api-key="$API_KEY"
    
  3. Security Considerations

    • Local-only deployment by default
    • File access restricted to configured directories
    • Configurable file size limits
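The `MAX_FILE_SIZE` setting in the environment configuration above is a human-readable string, so it has to be converted to a byte count somewhere. Here is a minimal sketch, assuming a hypothetical `parseSize` helper and decimal units (1 MB = 1,000,000 bytes, matching the limit described under File Size Handling); the project may well use an existing library for this instead.

```javascript
// Decimal units, so that "1MB" equals 1,000,000 bytes.
const UNITS = { B: 1, KB: 1e3, MB: 1e6, GB: 1e9 };

// Hypothetical helper: converts strings such as "1MB" or "512kb" to bytes.
// Falls back to `fallback` when the value is missing or malformed.
function parseSize(value, fallback = 1_000_000) {
  const match = /^\s*(\d+(?:\.\d+)?)\s*(B|KB|MB|GB)?\s*$/i.exec(String(value ?? ''));
  if (!match) return fallback;

  const unit = (match[2] || 'B').toUpperCase();
  return Math.round(Number(match[1]) * UNITS[unit]);
}

// Possible usage:
//   const maxFileSize = parseSize(process.env.MAX_FILE_SIZE);
```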

Conclusion

After indexing my hard disks, I discovered that my code was scattered across more than 70k files 😂😂. I honestly don't know how this came to be, because I could swear I have not written that much code. By the end of the process, I had found my class and could search for any other piece of code.

However, while Typesense is powerful, I have found that it slows down significantly as the indexes grow. I am not sure whether this is a design flaw on my end or inherent to Typesense. I therefore plan to rewrite X-FileSearch pretty soon. And next time around I will use the full power of Svelte 5's Runes!!

Happy Geeking!

Tags: programming, Javascript, Full-Text Search, CodeBase Indexing, Typesense
By: Anthony Mugendi Published: 11 Oct 2024, 11:00