May 19, 2025
TLDR
This article describes how to build an AI system that answers questions about the Great Attractor using the Model Context Protocol (MCP). In part 1 you’ll learn how to use Mage Pro to create a data pipeline that connects knowledge sources to AI systems, including fetching PDF data, cleaning the text, and preparing it for use with Anthropic’s Claude. This approach produces more accurate, source-specific AI responses rather than relying solely on general training data.
Table of contents
Introduction
What is Model Context Protocol?
Create a new pipeline
Fetch information from your knowledge systems
Clean and format pdf text
Chunk your data dynamically
What’s next?
Conclusion
Introduction
Over the past few months my son became interested in space exploration. He’s brought home various books on the Milky Way Galaxy and on different space exploration technologies. A few days ago we watched a short video clip about the Great Attractor and he started asking questions I wasn’t able to answer. So I thought, what the hell, let me build something to help me find answers.

Meet “Attractor”, my new MCP-powered AI assistant that answers questions about the Great Attractor from “Evidence of the Great Attractor and Great Repeller from Artificial Neural Network Imputation of Sloan Digital Sky Survey” by Christopher Cillian O’Neill. So how did I do this? Let’s get into it.
What is Model Context Protocol?
Think of MCP as your physics professor: ask it a question and it answers from the knowledge available to it. Technically, MCP is a connector between your local knowledge systems and AI, creating a bridge between source materials and AI applications so they can provide accurate, authoritative responses grounded in targeted information sources rather than general training data alone. With MCP, your AI can finally interact with your specific knowledge assets and answer based only on that information. Next, we’ll create a new pipeline in Mage Pro.

Create a new pipeline
To create a new pipeline, hover over the left navigation pop-out menu on the Mage Pro home page and select “Pipelines.” From there, head over to the create pipeline page by hitting the green “New pipeline” button.

Next, we’ll create our pipeline using the UI below:
Select the Batch pipeline option.
Give the pipeline a name (mine is “Meet Attractor”).
Optionally provide a description for the pipeline.
Tag the pipeline with something memorable.
Click the blue “Create new pipeline” button.
These 5 simple steps create a batch data pipeline in Mage Pro. You don’t need to write any code to do this, it’s all done within the UI. Super simple!

Finally, before we get into the code, let’s go over how to add a block in Mage Pro. Think of a block like a task: it’s where your code is written, in a Jupyter Notebook-style environment. To create a block, follow the instructions below:
Click the “Blocks” button
Hover over the “Loader” option
Click API
Give the block a name
Click “Save and add”
These simple steps will create a new API block template for extracting data from an API or other URL. For this tutorial we’ll use a GitHub URL to extract the content of a PDF file.

Fetch information from your knowledge systems
The data loader block fetches a PDF document from GitHub and transforms it into structured data. After fetching the PDF, it uses the PyPDF2 Python library to extract the paper’s metadata and content from the document. The content is organized into a dictionary containing document details and the complete text. This structured output serves as the foundation for the rest of the MCP pipeline.
Clean and format pdf text
The first transformation block cleans and refines the raw PDF text to improve readability and consistency. It applies regular expressions to normalize formatting issues commonly found in content extracted from PDFs: excess line breaks, words hyphenated across line breaks, and repeated page headers. The block then adds a structured document header, including title, author, and source information, before returning the cleaned text along with metadata tracking character count and processing time.
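A sketch of that cleaning step is below. The specific regexes and the repeated-header heuristic are assumptions about typical PDF artifacts, not the exact rules used in the article’s block; in Mage Pro this function would sit in a transformer block decorated with `@transformer`.

```python
import re
import time


def clean_pdf_text(raw: str, title: str = "", author: str = "", source: str = "") -> dict:
    """Normalize common PDF extraction artifacts and prepend a document header."""
    start = time.time()
    # Re-join words hyphenated across line breaks: "attrac-\ntor" -> "attractor".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse runs of three or more newlines down to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)

    # Drop page headers that repeat throughout the document. Simple heuristic:
    # an identical non-blank line appearing more than twice is a header, and we
    # keep only its first occurrence.
    lines = text.split("\n")
    counts: dict[str, int] = {}
    for line in lines:
        stripped = line.strip()
        if stripped:
            counts[stripped] = counts.get(stripped, 0) + 1
    seen: set[str] = set()
    kept = []
    for line in lines:
        stripped = line.strip()
        if stripped and counts[stripped] > 2 and stripped in seen:
            continue  # repeated header/footer: skip everything after the first copy
        seen.add(stripped)
        kept.append(line)
    text = "\n".join(kept).strip()

    header = f"Title: {title}\nAuthor: {author}\nSource: {source}\n\n"
    return {
        "cleaned_text": header + text,
        "char_count": len(text),
        "processing_seconds": round(time.time() - start, 4),
    }
```

The count-then-filter approach keeps the first copy of a repeating header so no verifiable information is lost while the noise is stripped.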
Chunk your data dynamically
The second transformation block divides the cleaned document into manageable chunks for more efficient processing. The chunks maintain context while enabling targeted information retrieval, and every segment includes metadata that prepares the content for retrieval when answering user questions. This chunking approach ensures Claude can efficiently access relevant, coherent sections of the document rather than disconnected fragments when responding to questions about the Great Attractor.
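One way to implement that chunking, sketched below under my own assumptions: chunks are built by accumulating whole paragraphs up to a character budget (so context isn’t cut mid-thought), and the `source` value is a hypothetical filename. The exact chunk size and metadata fields in the article’s block may differ.

```python
def chunk_document(cleaned_text: str, max_chars: int = 1500) -> list[dict]:
    """Group paragraphs into chunks of up to max_chars, each with retrieval metadata."""
    paragraphs = [p.strip() for p in cleaned_text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget.
        # Note: a single paragraph longer than max_chars becomes its own chunk.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)

    return [
        {
            "chunk_id": i,
            "char_count": len(text),
            "source": "great_attractor.pdf",  # hypothetical source identifier
            "text": text,
        }
        for i, text in enumerate(chunks)
    ]
```

Chunking on paragraph boundaries, rather than slicing at fixed character offsets, is what keeps each segment self-contained enough for Claude to quote and reason over it.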
What’s next?
Stay tuned for part 2, where we’ll ask “Attractor” some interesting questions about the universe. This will be both fun to code and fun for finding out the answers to some questions about our universe. We’ll use Anthropic’s MCP libraries to interact specifically with our knowledge document, leaving general training data out of the loop.

Conclusion
MCP is a significant advancement in how we can leverage AI to interact with business-specific knowledge systems. By creating pipelines that fetch, clean, and chunk targeted information sources before passing them to LLMs, we can build AI assistants that give grounded answers about the content we provide. Building a system like “Attractor” demonstrates how we can point AI at specific knowledge systems and use it to answer questions based only on the information it is given.
Want to build a RAG pipeline using MCP methods discussed above? Schedule a free demo with Mage to get started today.