01 — Problem
Three Years of Thinking, Trapped in the Wrong System
I had accumulated 847 notes in Google Keep over 3 years — meeting fragments, project ideas, reading notes, half-formed architectural decisions, and the occasional grocery list. Keep excels at capture but fails at retrieval. There’s no linking between notes, no tagging taxonomy, no way to surface connections between an idea I had in March and a project I started in November. When I committed to Obsidian as my knowledge management system, the migration problem was immediate: 847 unstructured text files needed to become properly formatted markdown with YAML frontmatter, categories, cross-references, and deduplication. Doing this by hand would have taken 40+ hours.
I needed a converter that understood the implicit structure of messy notes — distinguishing a todo list from a journal entry from a technical reference — and imposed the Obsidian conventions automatically.
02 — Architecture
Parse, Classify, Deduplicate, Link
The converter operates in four stages on the Google Keep export:
Stage 1 — Export Parsing
Reads the Google Keep text export (one .txt file per note) and extracts content, creation date, and any embedded metadata. Handles encoding inconsistencies and strips Keep-specific formatting artifacts that don’t translate to markdown.
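Stage 1 might look something like the following minimal sketch. The helper name `parse_keep_export`, the latin-1 fallback, and the mtime-based date are assumptions for illustration, not the tool's actual internals:

```python
# Sketch of Stage 1: parse one Google Keep .txt export into a note dict.
# parse_keep_export is a hypothetical name; the real tool may differ.
from pathlib import Path
from datetime import datetime

def parse_keep_export(path: Path) -> dict:
    """Read one exported note, tolerating encoding inconsistencies."""
    raw = path.read_bytes()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("latin-1")  # fallback for mixed encodings
    lines = text.splitlines()
    title = lines[0].strip() if lines else path.stem
    body = "\n".join(lines[1:]).strip()
    # Assumed: fall back to file mtime when no date is embedded in the note.
    created = datetime.fromtimestamp(path.stat().st_mtime)
    # Translate Keep's checkbox glyphs into markdown task syntax.
    body = body.replace("\u2610", "- [ ]").replace("\u2611", "- [x]")
    return {"title": title, "body": body, "created": created}
```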
Stage 2 — Auto-Classification
Regex-based heuristics classify each note into one of 5 categories: Todo (checkbox patterns), Thoughts (reflective language, first-person statements), Sensitive (email addresses, phone numbers, API key patterns), Links (URL-heavy content), and Other. Classification drives both folder placement and frontmatter metadata in the output. The sensitive category triggers a privacy flag that excludes those notes from any cross-reference indexing.
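A sketch of the Stage 2 heuristics, with illustrative patterns rather than the tool's exact rules. The priority ordering (Sensitive first, so privacy flags are never missed) is an assumption:

```python
# Sketch of Stage 2: regex-based classification into the five categories.
# Patterns are simplified examples, not the production rules.
import re

PATTERNS = {
    "Todo":      re.compile(r"^\s*(-\s*\[[ xX]\]|\u2610|\u2611)", re.M),
    "Sensitive": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+|\+?\d[\d\s-]{8,}\d"),
    "Links":     re.compile(r"https?://\S+"),
    "Thoughts":  re.compile(r"\b(I think|I feel|I wonder|maybe I)\b", re.I),
}

def classify(body: str) -> str:
    # Assumed priority: Sensitive wins outright, then Todo, Links, Thoughts.
    for category in ("Sensitive", "Todo", "Links", "Thoughts"):
        if PATTERNS[category].search(body):
            return category
    return "Other"
```

Deterministic, dependency-free, and fast enough to run over hundreds of notes in well under a second, which is the whole argument for regex over an LLM here.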
Stage 3 — Content-Hash Deduplication
MD5 hashing on normalized content detects duplicate notes — a common artifact of Keep’s sync behavior, where the same note appears multiple times with slightly different timestamps. Deduplication reduced my 847 exports to 691 unique notes, eliminating 156 duplicates that would have polluted the Obsidian vault.
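Stage 3 reduces to a few lines. The normalization shown (collapse whitespace, lowercase) is an assumption about what "normalized content" means, chosen so two syncs of the same note hash identically despite whitespace or timestamp drift:

```python
# Sketch of Stage 3: MD5 over normalized content for duplicate detection.
import hashlib
import re

def content_hash(body: str) -> str:
    # Assumed normalization: collapse whitespace and lowercase.
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def deduplicate(notes: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct content hash."""
    seen: set[str] = set()
    unique = []
    for note in notes:
        h = content_hash(note["body"])
        if h not in seen:
            seen.add(h)
            unique.append(note)
    return unique
```

MD5 is fine here because this is duplicate detection on trusted local files, not a security context.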
Stage 4 — Obsidian Output Generation
Each unique note becomes a markdown file with YAML frontmatter (title, date, category, tags), auto-generated content tags extracted from keyword frequency, and bidirectional wikilinks ([[related note]]) generated by detecting shared keywords across notes. An archive-after-processing flag ensures safe re-runs on incremental exports without reprocessing previously converted notes.
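The output shape can be sketched as below. The frontmatter field names and the decision to append wikilinks at the bottom of the file are assumptions about the tool's conventions:

```python
# Sketch of Stage 4: render one note as markdown with YAML frontmatter
# and wikilinks to related notes. Field names are illustrative.
def render_note(note: dict, related: list[str]) -> str:
    frontmatter = "\n".join([
        "---",
        f"title: {note['title']}",
        f"date: {note['created']:%Y-%m-%d}",
        f"category: {note['category']}",
        f"tags: [{', '.join(note.get('tags', []))}]",
        "---",
    ])
    links = "\n".join(f"[[{name}]]" for name in related)
    return f"{frontmatter}\n\n{note['body']}\n\n{links}\n"
```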
Key Design Decisions
Why regex-based classification instead of an LLM? The classification task is shallow — it’s distinguishing between checkboxes and paragraphs, not understanding semantic intent. Regex handles this at thousands of notes per second with deterministic results. An LLM would add latency, cost, and non-determinism to a problem that doesn’t need any of those. Sometimes the right tool is the old one.
Why package as a standalone .exe? I built this for myself, but I also wanted to share it with colleagues who use Keep but don’t have Python installed. PyInstaller produces a single executable that runs on any Windows machine without dependencies. The .exe adds 15 seconds to startup (bundled interpreter) but eliminates the “install Python, create a venv, pip install” barrier entirely.
03 — Outcomes
Measured Results
847 Notes Processed
from 3 years of Google Keep accumulation
156 Duplicates Eliminated
via MD5 content hashing — 18.4% of total exports
5 Auto-Categories
Todo, Thoughts, Sensitive, Links, and Other
Under 8 Seconds Total Processing Time
for the complete 847-note export on a standard machine
04 — Reflection
Migration Is a Knowledge Archaeology Problem
The unexpected value of this project was the classification step. Forcing every note through a category heuristic revealed patterns in my own note-taking behavior: 34% were todos I’d never completed, 22% were links I’d saved and never revisited, and only 28% contained substantive thinking worth preserving. The migration wasn’t just a format conversion — it was a curation. The tool helped me distinguish signal from noise across 3 years of accumulated capture.
What I’d change: the bidirectional wikilink generation is based on keyword overlap, which produces some false connections (two notes mentioning “Python” aren’t necessarily related). A lightweight embedding-based similarity check would produce more meaningful links, though it would require adding a dependency on a sentence-transformer model — which conflicts with the .exe packaging goal. The right answer is probably offering both modes: fast keyword-based linking by default, and optional semantic linking when Python is available.
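The keyword-overlap linking described above, and the false connections it produces, can be illustrated with a minimal sketch (stopword list, keyword count, and overlap threshold are all assumed values):

```python
# Illustrative keyword-overlap linking: two notes are "related" if they
# share at least `threshold` frequent non-stopword keywords. This is the
# mechanism that links any two notes mentioning "python".
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "with"}

def keywords(body: str, top_n: int = 10) -> set[str]:
    words = [w.lower() for w in re.findall(r"[a-zA-Z]{4,}", body)]
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {w for w, _ in counts.most_common(top_n)}

def related(a: str, b: str, threshold: int = 2) -> bool:
    return len(keywords(a) & keywords(b)) >= threshold
```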
“The value of a note-taking system isn’t how much it captures. It’s how much it lets you throw away with confidence.”
Outcomes
847 notes processed in under 8 seconds; 156 duplicates eliminated (18.4%); 5 auto-classification categories; Standalone .exe with zero dependencies