Infrastructure AA-016

Portfolio Context Builder

Python tool that mines 3 years of ChatGPT exports (1,195 conversations, 27,689 messages) for portfolio-worthy project candidates through heuristic scoring, 5-theme classification, and privacy-aware sensitive data detection — processing the complete archive in 23 seconds with zero external dependencies.

01 — Problem

Three Years of Work, Buried in a JSON File

Over 3 years of using ChatGPT as a daily working tool, I had accumulated 1,195 conversations containing 27,689 messages. Embedded in that archive was the actual record of my professional development — architecture discussions, debugging sessions, code generation, design decisions, data pipeline builds, and deployment troubleshooting. The problem was extraction. The raw export was a single JSON file of conversational fragments. No structure, no indexing, no way to distinguish a throwaway question from a 40-message session that produced a working data pipeline. Manually reviewing 27,689 messages to identify portfolio-worthy projects would have consumed weeks of time I didn’t have.

I needed a tool that could mine this archive for signal — identifying conversations that represented substantial technical work, categorizing them by theme, and producing structured artifacts I could use as the foundation for portfolio case studies.

02 — Architecture

Parse, Score, Categorize, Protect

The builder operates in four stages on the ChatGPT conversations.json export:

Stage 1 — Export Parsing

Reads the full JSON export and normalizes the conversation structure: extracting message text, timestamps, role labels (user/assistant), and conversation metadata. Handles encoding edge cases, truncated messages, and multi-turn conversations with branching (where the user regenerated a response, creating parallel conversation paths).
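The branching traversal can be sketched roughly like this. It assumes the standard ChatGPT export shape (a per-conversation `mapping` of node id → `{message, parent, children}`); following the *last* child at each node is one way to walk the path the UI displays after a regeneration, though the actual tool's traversal may differ:

```python
from typing import Any

def extract_messages(conversation: dict[str, Any]) -> list[dict[str, str]]:
    """Flatten one conversation's node mapping into an ordered message list.

    Sketch only: assumes the export's mapping of node-id ->
    {message, parent, children}. Branches from regenerated responses
    are resolved by always following the last child.
    """
    mapping = conversation.get("mapping", {})
    # The root node is the one with no parent.
    node_id = next(
        (nid for nid, node in mapping.items() if node.get("parent") is None), None
    )
    messages: list[dict[str, str]] = []
    while node_id is not None:
        node = mapping[node_id]
        msg = node.get("message")
        if msg and msg.get("content", {}).get("parts"):
            role = msg.get("author", {}).get("role", "unknown")
            text = "".join(p for p in msg["content"]["parts"] if isinstance(p, str))
            if text.strip():
                messages.append({"role": role, "text": text})
        children = node.get("children") or []
        node_id = children[-1] if children else None  # last child = active branch
    return messages
```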

Stage 2 — Heuristic Scoring

Each conversation is scored on a composite metric that estimates “portfolio worthiness.” Signals include: message count (longer conversations suggest sustained work), code block frequency (technical substance), technical keyword density (architecture, pipeline, deployment, database), and user-to-assistant turn ratio (higher ratios suggest debugging sessions or iterative development — real work, not casual questions). Conversations scoring above the 80th percentile are flagged as portfolio candidates.
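A minimal sketch of the composite score and percentile cut, combining the four signals above. The weights, caps, and keyword list here are illustrative stand-ins, not the tool's actual values:

```python
import re
import statistics

# Illustrative keyword list; the tool's real vocabulary is not shown here.
TECH_KEYWORDS = {"architecture", "pipeline", "deployment", "database", "schema", "api"}

def score_conversation(messages: list[dict[str, str]]) -> float:
    """Composite 'portfolio worthiness' estimate in [0, 1].

    Combines message count, code-block frequency, technical keyword
    density, and user-to-assistant turn ratio. Weights and caps are guesses.
    """
    n = len(messages)
    if n == 0:
        return 0.0
    text = " ".join(m["text"] for m in messages)
    code_blocks = text.count("```") // 2  # fenced blocks come in open/close pairs
    words = re.findall(r"[a-z]+", text.lower())
    kw_density = sum(w in TECH_KEYWORDS for w in words) / max(len(words), 1)
    user_turns = sum(m["role"] == "user" for m in messages)
    turn_ratio = user_turns / max(n - user_turns, 1)
    return (0.30 * min(n / 40, 1.0)
            + 0.30 * min(code_blocks / 10, 1.0)
            + 0.25 * min(kw_density * 50, 1.0)
            + 0.15 * min(turn_ratio, 1.0))

def flag_candidates(scores: list[float]) -> list[int]:
    """Indices of conversations scoring above the 80th percentile."""
    cut = statistics.quantiles(scores, n=100)[79]  # 80th-percentile cut point
    return [i for i, s in enumerate(scores) if s > cut]
```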

Stage 3 — Theme Classification

Flagged conversations are classified into 5 theme categories based on keyword analysis: AI/ML Engineering, Data Systems, Automation/Tooling, Web Development, and Career/Professional. Each theme tracks coverage counts and representative conversations. The classification drives the portfolio gap analysis — revealing which domains have the most documented work and which are underrepresented.
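Keyword-based classification can be as simple as counting vocabulary hits per theme and taking the best match. The keyword sets below are invented for illustration; the tool's real vocabularies differ:

```python
import re

# Invented keyword sets for illustration only.
THEME_KEYWORDS = {
    "AI/ML Engineering": {"model", "llm", "embedding", "prompt", "inference"},
    "Data Systems": {"sql", "etl", "warehouse", "schema", "pipeline"},
    "Automation/Tooling": {"script", "cron", "cli", "automation", "scrape"},
    "Web Development": {"react", "css", "endpoint", "frontend", "html"},
    "Career/Professional": {"resume", "interview", "portfolio", "salary"},
}

def classify(text: str) -> str:
    """Assign the theme whose keyword set overlaps the text the most.

    Ties (including zero overlap) fall to the earliest theme in the dict;
    a real implementation would likely want an explicit tie-break rule.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    return max(THEME_KEYWORDS, key=lambda theme: len(THEME_KEYWORDS[theme] & words))
```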

Stage 4 — Privacy-Aware Output

Before any output is generated, the full archive is scanned for sensitive data signals: email addresses (5,308 patterns detected), phone-like numbers (3,758 patterns), and API key-like strings (36 patterns). Conversations containing sensitive signals are flagged but not excluded — the flag alerts me to redact specific content before using the extracted artifacts. All processing is entirely local; the archive never leaves my machine.
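A detection pass along these lines needs only `re`; the patterns below are illustrative and deliberately loose, since the design goal is to over-flag and leave redaction to human judgment:

```python
import re

# Illustrative patterns; they over-match on purpose (flag, don't redact).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(
        r"\b(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"
    ),
    # Shapes resembling common API keys (e.g. sk-... or AKIA... prefixes).
    "api_key": re.compile(r"\b(?:sk-[A-Za-z0-9]{20,}|AKIA[0-9A-Z]{16})\b"),
}

def scan(text: str) -> dict[str, int]:
    """Count sensitive-data signals per category for one conversation."""
    return {name: len(pattern.findall(text)) for name, pattern in PATTERNS.items()}
```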

Key Design Decisions

Why heuristic scoring instead of LLM-based evaluation? I considered using an LLM to evaluate each conversation’s portfolio potential. For 1,195 conversations, that would require ~1,200 API calls or ~2 hours of local inference. The heuristic approach processes the entire archive in 23 seconds. More importantly, the heuristic signals (code blocks, message count, keyword density) correlate well enough with portfolio worthiness that LLM evaluation would add cost without meaningfully improving the candidate list. Perfect is the enemy of shipped.

Why detect sensitive data rather than redact it automatically? Automatic redaction risks false positives that destroy useful context (redacting a phone number that’s actually a port number, or an email pattern that’s actually a log format string). Flagging preserves the content and gives me human judgment over what to redact. The privacy layer is a filter, not a censor.

03 — Outcomes

Measured Results

1,195
Conversations Parsed

from 3 years of ChatGPT usage spanning all professional domains

27,689
Messages Analyzed

scored, classified, and scanned for sensitive content

5
Theme Categories

with coverage analysis revealing portfolio gaps and strengths

23s
Total Processing Time

full archive parse, score, classify, and privacy scan

04 — Reflection

Your Archive Is a Mirror You Haven’t Looked Into

The most unexpected output of this project wasn’t the portfolio candidate list. It was the theme distribution chart. Seeing 3 years of professional activity decomposed into 5 categories revealed patterns I hadn’t consciously recognized: 42% of my substantial technical work was in AI/ML Engineering, but only 15% was in Data Systems — despite data work being a core part of my professional identity. The archive told a different story than the one I’d been telling myself. That discrepancy became the basis for deliberately building projects in underrepresented categories, turning a gap analysis into a development roadmap.

The sensitive data scan was similarly illuminating. 5,308 email-like patterns and 3,758 phone-like patterns accumulated across 3 years of casual conversation with an AI assistant. I had shared PII without thinking about it — not because I was careless, but because the conversational interface makes sharing feel ephemeral. It isn’t. This project made the persistence of that data uncomfortably visible, which is exactly the kind of discomfort that improves future behavior.

“We build archives without intending to. The question is whether we ever look back at what they reveal — not about the work we did, but about the patterns of thought we didn’t notice while doing it.”

Outcomes

1,195 conversations parsed; 27,689 messages scored and classified; 5 theme categories with gap analysis; 23-second full-archive processing time; 9,102 sensitive data signals detected