
Building A Local File Search Engine
Recently, I found myself writing a class to manipulate Word documents. I remembered having written a similar, though slightly less capable, class in the past. But I could not find it! Well, I have been writing code for 15+ years, so I have lost track of a few things.
I was sure I had the code somewhere in my backups but, try as I might, I could not track it down. So I decided to start indexing my code so that I could run full-text searches on it.
That thought led me to build X-FileSearch, which provides fast, typo-tolerant search across your local filesystem.
The Architecture
X-FileSearch combines several key technologies:
- Frontend: SvelteKit for a responsive UI
- Backend: Node.js with TypeScript
- Search Engine: Typesense for fast, typo-tolerant search
- File Analysis: Various Node.js libraries for file parsing and language detection
Key Features
- Smart File Processing: One of the biggest challenges was efficiently processing different file types. The solution uses a combination of file analysis tools.
- Efficient Search with Typesense: The search implementation leverages Typesense's powerful features.
- Real-time UI Updates: The frontend uses Svelte's reactive capabilities to provide instant feedback.
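To make the search side more concrete, here is a rough sketch of how a typo-tolerant query could be sent to Typesense's HTTP search endpoint. It is not X-FileSearch's actual code: the `local_files` collection name and port 9876 are taken from the snippets further down, while the `content` and `path` field names and the `buildSearchUrl` helper are assumptions made for illustration.

```javascript
// Hypothetical helper: builds a Typesense search URL for the local_files collection.
// The searched fields (content, path) are assumptions about the schema.
function buildSearchUrl(query, { host = 'http://localhost:9876', perPage = 20 } = {}) {
  const params = new URLSearchParams({
    q: query,
    query_by: 'content,path', // fields to search, in priority order
    num_typos: '2',           // tolerate up to two typos per word
    per_page: String(perPage),
  });
  return `${host}/collections/local_files/documents/search?${params}`;
}

// Usage (needs a running Typesense server and a valid API key):
// const res = await fetch(buildSearchUrl('docx writer'), {
//   headers: { 'X-TYPESENSE-API-KEY': process.env.API_KEY },
// });
// const { hits } = await res.json();
```

Keeping the URL construction in a pure function makes it easy to check the query parameters without a server running.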
Implementation Challenges
1. File Size Handling
Large files needed special handling to avoid memory issues. The solution was to inspect at most 1,000,000 bytes (1 MB) of each file. That is more than enough to determine the language of most code files and to index their contents.
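One way to cap the read, sketched with Node's built-in `fs` module. The `readHead` name and its defaults are assumptions for illustration, not the project's real function.

```javascript
const fs = require('node:fs');

const MAX_BYTES = 1_000_000; // 1 MB cap, matching the limit described above

// Hypothetical helper: returns at most maxBytes from the start of a file.
function readHead(filePath, maxBytes = MAX_BYTES) {
  const fd = fs.openSync(filePath, 'r');
  try {
    // Never allocate more than the file actually holds.
    const size = Math.min(maxBytes, fs.fstatSync(fd).size);
    const buffer = Buffer.alloc(size);
    const bytesRead = fs.readSync(fd, buffer, 0, size, 0);
    return buffer.subarray(0, bytesRead);
  } finally {
    fs.closeSync(fd); // always release the descriptor
  }
}
```

Because only the head of the file is ever loaded, memory use stays bounded no matter how large the file on disk is.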
2. Language Detection
Accurate programming language detection was crucial for proper code highlighting. We determine the language from known, common file extensions such as '.js'. But I found that approach not to be 100% accurate, as newer file types such as '.svelte' get mis-categorized, so we implemented a method that also checks for language subsets.
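// Map a "subset" file type (e.g. svelte, tsx) to the parent language used for highlighting.
function langSubsets(lang) {
    const languageSubsets = {
        svelte: { subset: 'svelte', language: 'HTML' },
        vue: { subset: 'vue', language: 'HTML' },
        tsx: { subset: 'tsx', language: 'TypeScript' },
        // ... more mappings
    };
    const langObj = languageSubsets[lang] || {};
    // Return the lowercased parent language, or null when there is no mapping
    return langObj.language ? langObj.language.toLowerCase() : null;
}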
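Putting the two steps together, detection could look roughly like the sketch below: try the plain extension table first, then fall back to the subset table. The `detectLanguage` helper and both (deliberately tiny) lookup tables are assumptions for illustration.

```javascript
const path = require('node:path');

// Assumed lookup tables; a real index would cover many more file types.
const EXTENSION_LANGUAGES = new Map([
  ['js', 'javascript'],
  ['ts', 'typescript'],
  ['py', 'python'],
]);
const SUBSET_LANGUAGES = new Map([
  ['svelte', 'html'],
  ['vue', 'html'],
  ['tsx', 'typescript'],
]);

// Hypothetical helper: resolve a highlighting language from a file path.
function detectLanguage(filePath) {
  const ext = path.extname(filePath).slice(1).toLowerCase(); // '.svelte' -> 'svelte'
  return EXTENSION_LANGUAGES.get(ext) ?? SUBSET_LANGUAGES.get(ext) ?? null;
}
```

Using `Map` instead of plain objects avoids accidental hits on inherited keys such as `constructor`.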
Performance Optimizations
1. Bulk Indexing
Files are processed in configurable batches to optimize performance:
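// Batch size, configurable via the BULK_DOC_COUNT environment variable (default: 50)
const BULK_DOC_COUNT = numberOr(process.env.BULK_DOC_COUNT, 50);

// During indexing: flush the buffer once it holds a full batch.
// bulkDocs is the active batch size (assumed to default to BULK_DOC_COUNT).
if (docs.length >= bulkDocs) {
    await this.db.index('local_files', docs);
    docs = [];
}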
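One detail that is easy to miss with this pattern is the final, partially filled batch, which never reaches the threshold. A small generator, sketched here under the assumption that documents arrive as any iterable (the `inBatches` name is hypothetical), handles that remainder in one place.

```javascript
// Hypothetical helper: yields fixed-size batches, including the final partial one.
function* inBatches(items, batchSize = 50) {
  let batch = [];
  for (const item of items) {
    batch.push(item);
    if (batch.length >= batchSize) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush whatever is left over
}

// Usage inside an async indexer (db.index is the call shown above):
// for (const docs of inBatches(allDocs, BULK_DOC_COUNT)) {
//   await db.index('local_files', docs);
// }
```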
2. Smart File Filtering
Unnecessary file processing is avoided using configurable patterns:
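// Use the patterns passed in, or fall back to the comma-separated IGNORE_PATTERNS variable.
// filter(Boolean) drops the empty string that ''.split(',') yields when the variable is unset.
ignorePatterns = arrify(
    ignorePatterns || (process.env.IGNORE_PATTERNS || '').split(',').filter(Boolean)
);
// File categories that are never indexed
skipTypes = skipTypes || ['image', 'audio', 'video'];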
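As a rough illustration of how these settings might be applied, the sketch below parses the pattern list and checks a file's type against `skipTypes`. It assumes the skip types are top-level MIME categories (so 'image' covers 'image/png'); both helper names are hypothetical, and glob matching itself is left to whichever library the project uses.

```javascript
// Split a comma-separated pattern list, dropping blanks and stray whitespace.
function parsePatterns(raw) {
  return (raw || '')
    .split(',')
    .map((pattern) => pattern.trim())
    .filter(Boolean);
}

// Assumes skip types are top-level MIME categories, e.g. 'image' for 'image/png'.
function isSkippedType(mimeType, skipTypes = ['image', 'audio', 'video']) {
  const category = String(mimeType || '').split('/')[0];
  return skipTypes.includes(category);
}
```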
3. Caching and Lazy Loading
Code highlighting is done on-demand with dynamic imports:
// Fetch a highlight.js language definition from the CDN only when it is first needed.
async function loadLanguage(lang) {
    if (supportedLangs[lang]) {
        const url = `https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.9.0/build/languages/${lang}.min.js`;
        await import(url);
    }
}
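The snippet above shows the lazy-loading half. For the caching half, one common approach is to memoize the load promise per language so that repeated requests share a single import. The sketch below is an assumption about how that could be done, with a hypothetical `createCachedLoader` wrapper that accepts any loader function.

```javascript
// Hypothetical wrapper: ensures each language is loaded at most once.
function createCachedLoader(loadFn) {
  const cache = new Map(); // lang -> in-flight or settled promise
  return function load(lang) {
    if (!cache.has(lang)) {
      const promise = new Promise((resolve) => resolve(loadFn(lang))).catch((err) => {
        cache.delete(lang); // allow a retry after a failed load
        throw err;
      });
      cache.set(lang, promise);
    }
    return cache.get(lang);
  };
}

// Usage with the loader shown above:
// const loadLanguageOnce = createCachedLoader(loadLanguage);
// await loadLanguageOnce('python');
```

Caching the promise rather than the result means two components asking for the same language at the same moment still trigger only one download.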
Deployment Considerations
- Environment Configuration
  PORT=9877
  API_PORT=9876
  API_KEY=myAmazingKey
  MAX_FILE_SIZE="1MB"
  IGNORE_PATTERNS=**/node_modules/**,**/env/**
- Docker Typesense Setup
  docker run -d --restart unless-stopped \
    --name typesense \
    -p "$API_PORT":8108 \
    -v "$DATA_DIR":/data \
    typesense/typesense:27.1 \
    --data-dir /data --api-key="$API_KEY"
- Security Considerations
  - Local-only deployment by default
  - File access restricted to configured directories
  - Configurable file size limits
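For completeness, here is a sketch of how the environment variables above could be read into a typed config object, including turning `MAX_FILE_SIZE="1MB"` into a byte count. The `parseSize` and `loadConfig` helpers, the decimal unit table, and the fallback values are all assumptions for illustration.

```javascript
// Hypothetical helper: convert a size such as "1MB" or "512KB" to bytes (decimal units).
function parseSize(value, fallback = 1_000_000) {
  const match = /^(\d+(?:\.\d+)?)\s*(B|KB|MB|GB)?$/i.exec(String(value ?? '').trim());
  if (!match) return fallback; // unset or unparseable: use the default cap
  const units = { B: 1, KB: 1e3, MB: 1e6, GB: 1e9 };
  return Math.round(Number(match[1]) * units[(match[2] || 'B').toUpperCase()]);
}

// Hypothetical helper: read settings from an env-like object, with assumed defaults.
function loadConfig(env = process.env) {
  return {
    port: Number(env.PORT) || 9877,
    apiPort: Number(env.API_PORT) || 9876,
    apiKey: env.API_KEY || '',
    maxFileSize: parseSize(env.MAX_FILE_SIZE),
    ignorePatterns: (env.IGNORE_PATTERNS || '').split(',').filter(Boolean),
  };
}
```

Passing the environment in as a parameter keeps the loader easy to exercise with a plain object instead of the real `process.env`.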
Conclusion
After indexing my hard disks, I discovered that my code was scattered across more than 70k files. I honestly don't know how this came to be, because I could swear I have not written that much code. By the end of the process, I had found my class and could search for any other piece of code.
However, while Typesense is powerful, I have found that it slows down significantly as the indexes grow. I am not sure whether this is a design flaw on my end or inherent to Typesense. I therefore plan to rewrite X-FileSearch pretty soon. And next time around I will use the full power of Svelte 5's Runes!
Happy Geeking!