In this tutorial, I’ll walk through an example implementation of a lightweight GraphRAG (Graph Retrieval-Augmented Generation) system using SQLite instead of Neo4j (which is what I used in the previous article). This setup offers a portable, serverless solution for document processing, relationship extraction, and graph-based querying.
We’ll be using SQLite to store entities and relationships extracted from documents using OpenAI’s GPT models. By calculating centrality measures (such as degree and betweenness), we can prioritize the most relevant entities in the graph to improve query responses. Although SQLite lacks some of the advanced features of dedicated graph databases, it is an ideal solution for small-to-medium datasets that don’t require a full graph database like Neo4j.
This guide will cover:
Setting up SQLite for managing graph data
Extracting entities and relationships using OpenAI GPT models
Using centrality measures to improve query relevance
Visualizing graph data with D3.js
You can follow along with the complete source code on GitHub.
Prerequisites
Ensure you have the following:
Python 3.9+
SQLite (the sqlite3 module ships with Python)
Necessary libraries:
openai
python-dotenv
You can install these dependencies with:
pip install openai python-dotenv
Project Overview
This GraphRAG system is designed using an object-oriented approach. It processes documents, extracts entities and relationships, and stores them in a SQLite database. By calculating centrality measures, we can identify the most important nodes (entities) in the graph, enabling more accurate and relevant query responses.
Project Structure
app.py: Orchestrates the document processing and querying workflow.
GraphManager: Manages SQLite operations, builds the graph, calculates centrality measures, and handles graph updates.
QueryHandler: Handles user queries and uses GPT models to generate responses based on centrality measures.
DocumentProcessor: Splits documents, extracts entities and relationships, and summarizes them.
GraphDatabaseConnection: Manages the SQLite database connection.
Logger: Provides logging utilities to track the workflow.
Using SQLite to Manage Graph Data
SQLite offers a simple yet effective way to store graph data, such as entities and relationships, extracted from documents. The `GraphDatabaseConnection` class manages the connection to the SQLite database:
```python
import sqlite3

class GraphDatabaseConnection:
    def __init__(self, db_path='graph_database.db'):
        self.conn = sqlite3.connect(db_path)
        # Use WAL mode for better write performance
        self.conn.execute('PRAGMA journal_mode=WAL;')
        self.initialize_schema()

    def initialize_schema(self):
        with self.conn:
            self.conn.execute('''
                CREATE TABLE IF NOT EXISTS nodes (
                    id TEXT PRIMARY KEY,
                    properties TEXT
                )
            ''')
            self.conn.execute('''
                CREATE TABLE IF NOT EXISTS edges (
                    source TEXT,
                    target TEXT,
                    relationship TEXT,
                    weight REAL,
                    PRIMARY KEY (source, target, relationship),
                    FOREIGN KEY (source) REFERENCES nodes(id),
                    FOREIGN KEY (target) REFERENCES nodes(id)
                )
            ''')
            self.conn.execute('''
                CREATE INDEX IF NOT EXISTS source_idx ON edges(source)
            ''')
            self.conn.execute('''
                CREATE INDEX IF NOT EXISTS target_idx ON edges(target)
            ''')

    def close(self):
        self.conn.close()

    def get_session(self):
        return self.conn
```
This setup ensures that we have two main tables: `nodes` (for storing entities) and `edges` (for relationships between entities). Each time we process documents, entities and their relationships are inserted into these tables. Additionally, indices are created on the `source` and `target` columns of the `edges` table to improve query performance.
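With this schema in place, storing extracted data comes down to a pair of upserts. The helpers below are a minimal sketch (not part of the repository's code) showing how a node and a weighted edge could be written, using SQLite's `ON CONFLICT` clause so reprocessing a document updates rows instead of duplicating them:

```python
import json
import sqlite3

def upsert_node(conn, node_id, properties):
    # Store arbitrary properties as a JSON string in the TEXT column
    conn.execute(
        "INSERT INTO nodes (id, properties) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET properties = excluded.properties",
        (node_id, json.dumps(properties)),
    )

def upsert_edge(conn, source, target, relationship, weight):
    # The primary key (source, target, relationship) makes this idempotent
    conn.execute(
        "INSERT INTO edges (source, target, relationship, weight) VALUES (?, ?, ?, ?) "
        "ON CONFLICT(source, target, relationship) DO UPDATE SET weight = excluded.weight",
        (source, target, relationship, weight),
    )

# Demo against an in-memory database with the same schema
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE nodes (id TEXT PRIMARY KEY, properties TEXT)')
conn.execute('''CREATE TABLE edges (
    source TEXT, target TEXT, relationship TEXT, weight REAL,
    PRIMARY KEY (source, target, relationship))''')
with conn:
    upsert_node(conn, 'SQLite', {'type': 'database'})
    upsert_node(conn, 'GraphRAG', {'type': 'system'})
    upsert_edge(conn, 'GraphRAG', 'SQLite', 'depends on', 0.9)
```

Note that `ON CONFLICT ... DO UPDATE` requires SQLite 3.24 or later, which any recent Python build includes.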
Document Processing with DocumentProcessor
The `DocumentProcessor` class is responsible for splitting documents, extracting key entities and relationships, and summarizing the results using OpenAI’s GPT models. Here’s how you can process a batch of documents:
Example: Document Processing
```python
from logger import Logger

class DocumentProcessor:
    logger = Logger("DocumentProcessor").get_logger()

    def __init__(self, client, model):
        self.client = client
        self.model = model

    def split_documents(self, documents, chunk_size=600, overlap_size=100):
        chunks = []
        for document in documents:
            # Slide a window across the text so consecutive chunks overlap
            for i in range(0, len(document), chunk_size - overlap_size):
                chunk = document[i:i + chunk_size]
                chunks.append(chunk)
        self.logger.debug("Documents split into %d chunks", len(chunks))
        return chunks

    def extract_elements(self, chunks):
        elements = []
        for index, chunk in enumerate(chunks):
            self.logger.debug(
                f"Extracting elements and relationship strength from chunk {index + 1}")
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system",
                     "content": "Extract entities, relationships, and their strength from the following text. Use common terms such as 'related to', 'depends on', 'influences', etc., for relationships, and estimate a strength between 0.0 (very weak) and 1.0 (very strong). Do not include any other text in your response. Use this exact format: Parsed relationship: Entity1 -> Relationship -> Entity2 [strength: X.X]."},
                    {"role": "user", "content": chunk}
                ]
            )
            entities_and_relations = response.choices[0].message.content
            elements.append(entities_and_relations)
        self.logger.debug("Elements extracted")
        return elements

    def summarize_elements(self, elements):
        summaries = []
        for index, element in enumerate(elements):
            self.logger.debug(f"Summarizing element {index + 1}")
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": "Summarize the following entities and relationships in a structured format. Use common terms such as 'related to', 'depends on', 'influences', etc., for relationships. Use '->' to represent relationships after the 'Relationships:' word."},
                    {"role": "user", "content": element}
                ]
            )
            summary = response.choices[0].message.content
            summaries.append(summary)
        self.logger.debug("Summaries created")
        return summaries
```
This class will split a large document into manageable chunks and use OpenAI’s GPT model to extract and summarize the key entities and relationships.
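The chunking step is purely string-based, so its behavior is easy to verify without an API key. A standalone sketch of the same sliding-window logic: each step advances by `chunk_size - overlap_size` characters, so consecutive chunks share `overlap_size` characters of context.

```python
def split_text(text, chunk_size=600, overlap_size=100):
    # Each window starts (chunk_size - overlap_size) characters after the
    # previous one, so adjacent chunks overlap by overlap_size characters
    step = chunk_size - overlap_size
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 1200-character document with the default settings starts chunks at
# offsets 0, 500, and 1000, producing three chunks (the last one shorter)
chunks = split_text("a" * 1200)
```

The overlap keeps entities that straddle a chunk boundary visible in at least one complete window, at the cost of re-processing some text.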
Graph Management with GraphManager
Once the entities and relationships have been extracted, the `GraphManager` class stores them in SQLite as a graph, where nodes represent entities and edges represent relationships. It also calculates centrality measures to identify the most important nodes.
Example: Building the Graph
```python
from logger import Logger

class GraphManager:
    logger = Logger("GraphManager").get_logger()

    def __init__(self, db_connection):
        self.db_connection = db_connection
        # Optionally clear the database on startup; remove this line to keep existing data
        self.db_connection.clear_database()

    def build_graph(self, summaries):
        if self.db_connection is None:
            self.logger.error("Graph database connection is not available.")
            return
        entities = {}
        conn = self.db_connection.get_session()
        with conn:
            for summary in summaries:
                # Parse the summary to extract entities and relationships,
                # then insert them into the nodes and edges tables
                # ...
```
This method creates nodes and relationships based on the processed document summaries and stores them in the SQLite database.
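The parsing step elided above depends on the exact text the model returns. Assuming summaries contain lines like `Entity1 -> depends on -> Entity2 [strength: 0.8]` (the format our prompts request), one possible sketch of the parse-and-insert logic looks like this; the regex and helper name are illustrative, not the repository's exact implementation:

```python
import re
import sqlite3

# Matches lines like "A -> depends on -> B [strength: 0.8]";
# the trailing strength annotation is optional
RELATION_RE = re.compile(
    r"(.+?)\s*->\s*(.+?)\s*->\s*(.+?)\s*(?:\[strength:\s*([\d.]+)\])?$"
)

def insert_summary(conn, summary):
    for line in summary.splitlines():
        line = line.strip()
        if line.startswith("Parsed relationship:"):
            line = line[len("Parsed relationship:"):].strip()
        match = RELATION_RE.match(line)
        if not match:
            continue  # skip headers and free-form text
        source, relationship, target, strength = match.groups()
        weight = float(strength) if strength else 1.0
        conn.execute("INSERT OR IGNORE INTO nodes (id, properties) VALUES (?, '{}')", (source,))
        conn.execute("INSERT OR IGNORE INTO nodes (id, properties) VALUES (?, '{}')", (target,))
        conn.execute(
            "INSERT OR REPLACE INTO edges (source, target, relationship, weight) "
            "VALUES (?, ?, ?, ?)",
            (source, target, relationship, weight),
        )

# Demo against an in-memory database
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY, properties TEXT)")
conn.execute("""CREATE TABLE edges (
    source TEXT, target TEXT, relationship TEXT, weight REAL,
    PRIMARY KEY (source, target, relationship))""")
with conn:
    insert_summary(conn,
        "Parsed relationship: GraphRAG -> depends on -> SQLite [strength: 0.9]\n"
        "SQLite -> related to -> Python")
```

In practice you would also want to handle malformed lines the model occasionally produces, which is why unmatched lines are silently skipped.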
Using Centrality Measures to Enhance Query Relevance
The `GraphManager` class also calculates centrality measures such as degree, betweenness, and closeness centrality. These metrics help prioritize the most relevant entities for answering user queries.
Example: Calculating Centrality
```python
# Method of GraphManager (requires `import time` at the top of the module)
def calculate_centrality_measures(self):
    conn = self.db_connection.get_session()
    # Degree centrality: count incoming and outgoing edges for each node
    degree_centrality_query = """
        SELECT id,
               (SELECT COUNT(*) FROM edges WHERE source = nodes.id) +
               (SELECT COUNT(*) FROM edges WHERE target = nodes.id) AS degree
        FROM nodes
        ORDER BY degree DESC
        LIMIT 10
    """
    self.logger.debug("Starting degree centrality query")
    start_time = time.time()
    degree_centrality_result = conn.execute(degree_centrality_query).fetchall()
    end_time = time.time()
    self.logger.debug(
        f"Degree centrality query completed in {end_time - start_time:.8f} seconds")
    # Betweenness and closeness are left empty for now, since SQLite
    # has no graph-native support for computing them efficiently
    centrality_data = {
        "degree": [{"entityName": row[0], "score": row[1]} for row in degree_centrality_result],
        "betweenness": [],
        "closeness": []
    }
    return centrality_data
```
This function calculates degree centrality—the number of connections each node has—which is a simple yet effective way to identify key entities in the graph. By leveraging the indices on the `source` and `target` columns of the `edges` table, the function ensures faster lookup and improved performance.
Running the Application
To get started, follow these steps:
Clone the repository:
git clone git@github.com:stephenc222/example-graphrag-with-sqlite.git
cd example-graphrag-with-sqlite
Install the dependencies:
pip install -r requirements.txt
Create a .env file with your OpenAI API key:
OPENAI_API_KEY=<your-openai-api-key>
DB_PATH=data/graph_database.sqlite
Run the application:
python app.py
This will:
Index the initial documents.
Process a user query to extract key themes.
Reindex the graph with new documents and update centrality measures.
Handle another query based on the updated graph.
Visualizing the Graph with D3.js
To visualize the graph, we can use D3.js. First, we need to export the graph data from SQLite to a JSON file using the export_graph_data.py script:
python export_graph_data.py data/graph_database.sqlite
Next, we use D3.js to visualize the graph. From the root of the project, run:
python -m http.server --directory public 8000
Then, open your browser and navigate to http://localhost:8000 to see the graph visualization.
Conclusion
SQLite provides a lightweight and easy-to-set-up solution for managing a GraphRAG system, ideal for small-to-medium datasets (conservatively, around 100 to 1,000 documents). By combining it with OpenAI’s GPT models, we can efficiently process documents, extract entities, and handle queries with relevance-enhancing centrality measures. While SQLite doesn’t offer advanced graph algorithms like Neo4j, it’s an excellent choice for lightweight, self-contained applications.
Check out the full source code on GitHub, and feel free to extend this implementation with more advanced features or optimizations tailored to your use case.