meGPT
by adrianco
Code to process many kinds of content by an author into an MCP server
## Installation

```
pip install -r requirements.txt
```
# meGPT - upload an author's content into an LLM

I have 20 years of public content I've produced and presented over my career, and I'd like to have an LLM that is trained to answer questions and generate summaries of my opinions, in my "voice". At this point, I've found a few companies that are building personas and tried out soopra.ai. To encourage development and competition in this space, I have organized my public content and references to sources in this repo.

My own content is stored or linked to in authors/virtual_adrianco and consists of:

- 4 published books (pdf of two provided), ~10 forewords to books, ~100 blog posts (text)
- Twitter archive 2008-2022 (conversation text)
- Mastodon.social - 2021-now https://mastodon.social/@adrianco (RSS at https://mastodon.social/@adrianco.rss)
- GitHub projects (code)
- Blog posts from https://adrianco.medium.com extracted as text to authors/virtual_adrianco/medium_posts (with extraction script)
- Blog posts from https://perfcap.blogspot.com extracted as text to authors/virtual_adrianco/blogger_percap_posts (with extraction script)
- ~100 presentation decks (images); greatest hits: https://github.com/adrianco/slides/tree/master/Greatest%20Hits
- ~20 podcasts (audio conversations, should be good Q&A training material)
- over 135 videos of talks and interviews (audio/video/YouTube individual videos, playlists, and entire channels)

If another author wants to use this repo as a starting point, clone it and add your own directory of content under authors. If you want to contribute the content freely for other people to use as a training data set, then send a pull request and I'll include it here. The scripts in the code directory are there to help pre-process content for an author by extracting from a Twitter or Medium archive that has to be downloaded by the account owner (a rough sketch of the idea follows this section).

Creative Commons Attribution-ShareAlike. Permission explicitly granted for anyone to use this as a training set to develop the meGPT concept. Free for use by any author/speaker/expert, resulting in a chatbot that can answer questions as if it were the author, with reference to published content.

I have called my own build of this virtual_adrianco, with opinions on cloud computing, sustainability, performance tools, microservices, speeding up innovation, Wardley mapping, open source, chaos engineering, resilience, Sun Microsystems, Netflix, AWS etc. etc. I'm happy to share any models that are developed. I don't need to monetize this; I'm semi-retired and have managed to monetize this content well enough already, and I don't work for a big corporation any more.

# I am not a Python programmer

All the code in this repo was initially written by the free version of ChatGPT 4 or Cursor (Claude Sonnet 3.7) based on short prompts, with no subsequent edits, in a few minutes of my time here and there. I can read Python and mostly make sense of it, but I'm not an experienced Python programmer. Look in the relevant issue for a public link to the chat thread that generated the code from ChatGPT. When I transitioned to Cursor, I got the context included as a block comment at the start of each file. This is a ridiculously low-friction and easy way to write simple code. Development was migrated to Cursor as it has a much better approach to managing the context of a whole project.
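As an illustration of the pre-processing idea above, here is a minimal sketch of extracting plain text from a downloaded Medium archive. It is not the repo's actual script; the paths and the BeautifulSoup dependency are assumptions.

```python
# Hypothetical sketch: read the HTML files in a downloaded Medium export and
# write plain-text copies into the author's content directory. Paths and the
# BeautifulSoup dependency are assumptions, not the repo's actual script.
from pathlib import Path
from bs4 import BeautifulSoup  # pip install beautifulsoup4

ARCHIVE_DIR = Path("medium-export/posts")                # from Medium's "download your data"
OUTPUT_DIR = Path("authors/virtual_adrianco/medium_posts")

def extract_posts(archive_dir: Path, output_dir: Path) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    for html_file in sorted(archive_dir.glob("*.html")):
        soup = BeautifulSoup(html_file.read_text(encoding="utf-8"), "html.parser")
        # Medium exports wrap the post body in a <section>; fall back to the whole page.
        body = soup.find("section") or soup
        text = body.get_text(separator="\n", strip=True)
        (output_dir / html_file.with_suffix(".txt").name).write_text(text, encoding="utf-8")

if __name__ == "__main__":
    extract_posts(ARCHIVE_DIR, OUTPUT_DIR)
```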
# YouTube Processing

The YouTube processor has been enhanced to handle multiple types of YouTube content automatically.

## Supported YouTube URL Types

- **Individual Videos**: `https://www.youtube.com/watch?v=VIDEO_ID`
- **Playlists**: `https://www.youtube.com/playlist?list=PLAYLIST_ID`
- **Entire Channels**: `https://www.youtube.com/@username/videos` or `https://www.youtube.com/c/channelname/videos`

## Processing Features

- **Automatic Detection**: The processor automatically detects whether a URL is an individual video, a playlist, or a channel (a sketch of this check follows this section)
- **Bulk Processing**: Channels and playlists are automatically expanded into individual video entries
- **Consent Page Handling**: Automatically handles YouTube's "Before you continue" consent pages
- **Robust Extraction**: Multiple fallback methods for extracting video metadata and IDs
- **MCP Compatibility**: All videos are saved as individual MCP-compatible JSON files (a hypothetical example entry follows this section)

## YouTube Downloads

YouTube has strict bot detection measures that can make automatic downloads challenging. The script attempts multiple methods to process YouTube content:

1. First, it tries to extract video metadata directly from the page HTML
2. For consent pages, it automatically submits the consent form
3. Uses multiple regex patterns to find video IDs in the page source (illustrated in the sketch after this section)
4. Falls back to alternative extraction methods if standard approaches fail
5. Creates placeholder entries for channels when individual videos can't be extracted

Despite these measures, YouTube's bot detection is sophisticated and some content may still fail to process automatically. In such cases, the processor will create placeholder entries that reference the original URL.
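The "Automatic Detection" feature can be pictured as a simple URL classifier. This is a hedged sketch, not the processor's actual code; the function name and return values are illustrative.

```python
# Minimal sketch of classifying a YouTube URL as a video, playlist, or channel.
from urllib.parse import urlparse, parse_qs

def classify_youtube_url(url: str) -> str:
    """Return 'video', 'playlist', or 'channel' for a YouTube URL."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    if parsed.path == "/watch" and "v" in query:
        return "video"
    if parsed.path == "/playlist" and "list" in query:
        return "playlist"
    if parsed.path.startswith(("/@", "/c/", "/channel/", "/user/")):
        return "channel"
    raise ValueError(f"Unrecognized YouTube URL: {url}")

assert classify_youtube_url("https://www.youtube.com/watch?v=VIDEO_ID") == "video"
assert classify_youtube_url("https://www.youtube.com/playlist?list=PLAYLIST_ID") == "playlist"
assert classify_youtube_url("https://www.youtube.com/@adrianco/videos") == "channel"
```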
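For the "MCP Compatibility" feature, the README doesn't spell out the JSON schema, so the field names below are assumptions; only the one-file-per-video layout comes from the text above.

```python
# Hypothetical per-video JSON entry; the schema is an assumption.
import json
from pathlib import Path

def save_video_entry(video_id: str, title: str, url: str,
                     out_dir: Path = Path("authors/virtual_adrianco/youtube")) -> Path:
    out_dir.mkdir(parents=True, exist_ok=True)
    entry = {
        "type": "youtube_video",   # hypothetical field names
        "video_id": video_id,
        "title": title,
        "url": url,
    }
    path = out_dir / f"{video_id}.json"
    path.write_text(json.dumps(entry, indent=2), encoding="utf-8")
    return path
```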
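Step 3 of the download flow mentions multiple regex patterns for video IDs. Here is a minimal sketch, assuming the requests library; the patterns are illustrative, YouTube's markup changes often, and real runs may hit the bot detection described above.

```python
# Sketch of the regex fallback: fetch the page HTML and try a few
# patterns for the 11-character video ID.
import re
import requests

VIDEO_ID_PATTERNS = [
    r'"videoId"\s*:\s*"([A-Za-z0-9_-]{11})"',   # embedded player/metadata JSON
    r'watch\?v=([A-Za-z0-9_-]{11})',            # plain links in the page
]

def find_video_ids(page_url: str) -> list[str]:
    html = requests.get(page_url, timeout=30).text
    ids: list[str] = []
    for pattern in VIDEO_ID_PATTERNS:
        for video_id in re.findall(pattern, html):
            if video_id not in ids:   # preserve order, drop duplicates
                ids.append(video_id)
    return ids
```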