Building a BioCypher-Powered AI Agent for Biomedical Knowledge Graph Generation and Querying

In this tutorial, we implement the BioCypher AI Agent, a powerful tool designed for building, querying, and analyzing biomedical knowledge graphs using the BioCypher framework. By combining the strengths of BioCypher, a high-performance, schema-based interface for biological data integration, with the flexibility of NetworkX, this tutorial empowers users to simulate complex biological relationships such as gene-disease associations, drug-target interactions, and pathway involvements. The agent also includes capabilities for generating synthetic biomedical data, visualizing knowledge graphs, and performing intelligent queries, such as centrality analysis and neighbor detection.

!pip install biocypher pandas numpy networkx matplotlib seaborn


import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import json
import random
from typing import Dict, List, Tuple, Any

We begin by installing the essential Python libraries required for our biomedical graph analysis, including biocypher, Pandas, NumPy, NetworkX, Matplotlib, and Seaborn. These packages enable us to handle data, create and manipulate knowledge graphs, and effectively visualize relationships. Once installed, we import all necessary modules to set up our development environment.

try:
   from biocypher import BioCypher
   from biocypher._config import config
   BIOCYPHER_AVAILABLE = True
except ImportError:
   print("BioCypher not available, using NetworkX-only implementation")
   BIOCYPHER_AVAILABLE = False

We attempt to import the BioCypher framework, which provides a schema-based interface for managing biomedical knowledge graphs. If the import is successful, we enable BioCypher features; otherwise, we gracefully fall back to a NetworkX-only mode, ensuring that the rest of the analysis can still proceed without interruption.

class BiomedicalAIAgent:
   """Advanced AI Agent for biomedical knowledge graph analysis using BioCypher"""
  
   def __init__(self):
       if BIOCYPHER_AVAILABLE:
           try:
               self.bc = BioCypher()
               self.use_biocypher = True
           except Exception as e:
               print(f"BioCypher initialization failed: {e}")
               self.use_biocypher = False
       else:
           self.use_biocypher = False
          
       self.graph = nx.Graph()
       self.entities = {}
       self.relationships = []
       self.knowledge_base = self._initialize_knowledge_base()
      
   def _initialize_knowledge_base(self) -> Dict[str, List[str]]:
       """Initialize sample biomedical knowledge base"""
       return {
           "genes": ["BRCA1", "TP53", "EGFR", "KRAS", "MYC", "PIK3CA", "PTEN"],
           "diseases": ["breast_cancer", "lung_cancer", "diabetes", "alzheimer", "heart_disease"],
           "drugs": ["aspirin", "metformin", "doxorubicin", "paclitaxel", "imatinib"],
           "pathways": ["apoptosis", "cell_cycle", "DNA_repair", "metabolism", "inflammation"],
           "proteins": ["p53", "EGFR", "insulin", "hemoglobin", "collagen"]
       }
  
   def generate_synthetic_data(self, n_entities: int = 50) -> None:
       """Generate synthetic biomedical data for demonstration"""
       print("🧬 Generating synthetic biomedical data...")
      
       for entity_type, items in self.knowledge_base.items():
           for item in items:
               entity_id = f"{entity_type}_{item}"
               self.entities[entity_id] = {
                   "id": entity_id,
                   "type": entity_type,
                   "name": item,
                   "properties": self._generate_properties(entity_type)
               }
      
       entity_ids = list(self.entities.keys())
       for _ in range(n_entities):
           source = random.choice(entity_ids)
           target = random.choice(entity_ids)
           if source != target:
               rel_type = self._determine_relationship_type(
                   self.entities[source]["type"],
                   self.entities[target]["type"]
               )
               self.relationships.append({
                   "source": source,
                   "target": target,
                   "type": rel_type,
                   "confidence": random.uniform(0.5, 1.0)
               })

We define the BiomedicalAIAgent class as the core engine for analyzing biomedical knowledge graphs using BioCypher. In the constructor, we check whether BioCypher is available and initialize it if possible; otherwise, we default to a NetworkX-only approach. We also set up our base structures, including an empty graph, dictionaries for entities and relationships, and a predefined biomedical knowledge base. We then use generate_synthetic_data() to populate this graph with realistic biological entities, such as genes, diseases, drugs, and pathways, and simulate their interactions through randomly generated but biologically meaningful relationships.

  def _generate_properties(self, entity_type: str) -> Dict[str, Any]:
       """Generate realistic properties for different entity types"""
       base_props = {"created_at": "2024-01-01", "source": "synthetic"}
      
       if entity_type == "genes":
           base_props.update({
               "chromosome": f"chr{random.randint(1, 22)}",
               "expression_level": random.uniform(0.1, 10.0),
               "mutation_frequency": random.uniform(0.01, 0.3)
           })
       elif entity_type == "diseases":
           base_props.update({
               "prevalence": random.uniform(0.001, 0.1),
               "severity": random.choice(["mild", "moderate", "severe"]),
               "age_of_onset": random.randint(20, 80)
           })
       elif entity_type == "drugs":
           base_props.update({
               "dosage": f"{random.randint(10, 500)}mg",
               "efficacy": random.uniform(0.3, 0.95),
               "side_effects": random.randint(1, 10)
           })
      
       return base_props


   def _determine_relationship_type(self, source_type: str, target_type: str) -> str:
       """Determine biologically meaningful relationship types"""
       relationships_map = {
           ("genes", "diseases"): "associated_with",
           ("genes", "drugs"): "targeted_by",
           ("genes", "pathways"): "participates_in",
           ("drugs", "diseases"): "treats",
           ("proteins", "pathways"): "involved_in",
           ("diseases", "pathways"): "disrupts"
       }
      
       return relationships_map.get((source_type, target_type),
                                  relationships_map.get((target_type, source_type), "related_to"))


   def build_knowledge_graph(self) -> None:
       """Build knowledge graph using BioCypher or NetworkX"""
       print("🔗 Building knowledge graph...")
      
       if self.use_biocypher:
           try:
               for entity_id, entity_data in self.entities.items():
                   self.bc.add_node(
                       node_id=entity_id,
                       node_label=entity_data["type"],
                       node_properties=entity_data["properties"]
                   )
                  
               for rel in self.relationships:
                   self.bc.add_edge(
                       source_id=rel["source"],
                       target_id=rel["target"],
                       edge_label=rel["type"],
                       edge_properties={"confidence": rel["confidence"]}
                   )
               print("✅ BioCypher graph built successfully")
           except Exception as e:
               print(f"BioCypher build failed, using NetworkX only: {e}")
               self.use_biocypher = False
          
       for entity_id, entity_data in self.entities.items():
           self.graph.add_node(entity_id, **entity_data)
          
       for rel in self.relationships:
           self.graph.add_edge(rel["source"], rel["target"],
                             type=rel["type"], confidence=rel["confidence"])
      
       print(f"✅ NetworkX graph built with {len(self.graph.nodes())} nodes and {len(self.graph.edges())} edges")


   def intelligent_query(self, query_type: str, entity: str = None) -> Dict[str, Any]:
       """Intelligent querying system with multiple analysis types"""
       print(f"🤖 Processing intelligent query: {query_type}")
      
       if query_type == "drug_targets":
           return self._find_drug_targets()
       elif query_type == "disease_genes":
           return self._find_disease_associated_genes()
       elif query_type == "pathway_analysis":
           return self._analyze_pathways()
       elif query_type == "centrality_analysis":
           return self._analyze_network_centrality()
       elif query_type == "entity_neighbors" and entity:
           return self._find_entity_neighbors(entity)
       else:
           return {"error": "Unknown query type"}


   def _find_drug_targets(self) -> Dict[str, List[str]]:
       """Find potential drug targets"""
       drug_targets = {}
       for rel in self.relationships:
           if (rel["type"] == "targeted_by" and
               self.entities[rel["source"]]["type"] == "genes"):
               drug = self.entities[rel["target"]]["name"]
               target = self.entities[rel["source"]]["name"]
               if drug not in drug_targets:
                   drug_targets[drug] = []
               drug_targets[drug].append(target)
       return drug_targets


   def _find_disease_associated_genes(self) -> Dict[str, List[str]]:
       """Find genes associated with diseases"""
       disease_genes = {}
       for rel in self.relationships:
           if (rel["type"] == "associated_with" and
               self.entities[rel["target"]]["type"] == "diseases"):
               disease = self.entities[rel["target"]]["name"]
               gene = self.entities[rel["source"]]["name"]
               if disease not in disease_genes:
                   disease_genes[disease] = []
               disease_genes[disease].append(gene)
       return disease_genes


   def _analyze_pathways(self) -> Dict[str, int]:
       """Analyze pathway connectivity"""
       pathway_connections = {}
       for rel in self.relationships:
           if rel["type"] in ["participates_in", "involved_in"]:
               if self.entities[rel["target"]]["type"] == "pathways":
                   pathway = self.entities[rel["target"]]["name"]
                   pathway_connections[pathway] = pathway_connections.get(pathway, 0) + 1
       return dict(sorted(pathway_connections.items(), key=lambda x: x[1], reverse=True))


   def _analyze_network_centrality(self) -> Dict[str, Dict[str, float]]:
       """Analyze network centrality measures"""
       if len(self.graph.nodes()) == 0:
           return {}
          
       centrality_measures = {
           "degree": nx.degree_centrality(self.graph),
           "betweenness": nx.betweenness_centrality(self.graph),
           "closeness": nx.closeness_centrality(self.graph)
       }
      
       top_nodes = {}
       for measure, values in centrality_measures.items():
           top_nodes[measure] = dict(sorted(values.items(), key=lambda x: x[1], reverse=True)[:5])
      
       return top_nodes


   def _find_entity_neighbors(self, entity_name: str) -> Dict[str, List[str]]:
       """Find neighbors of a specific entity"""
       neighbors = {"direct": [], "indirect": []}
       entity_id = None
      
       for eid, edata in self.entities.items():
           if edata["name"].lower() == entity_name.lower():
               entity_id = eid
               break
              
       if not entity_id or entity_id not in self.graph:
           return {"error": f"Entity '{entity_name}' not found"}
          
       for neighbor in self.graph.neighbors(entity_id):
           neighbors["direct"].append(self.entities[neighbor]["name"])
          
       for direct_neighbor in self.graph.neighbors(entity_id):
           for indirect_neighbor in self.graph.neighbors(direct_neighbor):
               if (indirect_neighbor != entity_id and
                   indirect_neighbor not in list(self.graph.neighbors(entity_id))):
                   neighbor_name = self.entities[indirect_neighbor]["name"]
                   if neighbor_name not in neighbors["indirect"]:
                       neighbors["indirect"].append(neighbor_name)
                      
       return neighbors


   def visualize_network(self, max_nodes: int = 30) -> None:
       """Visualize the knowledge graph"""
       print("📊 Creating network visualization...")
      
       nodes_to_show = list(self.graph.nodes())[:max_nodes]
       subgraph = self.graph.subgraph(nodes_to_show)
      
       plt.figure(figsize=(12, 8))
       pos = nx.spring_layout(subgraph, k=2, iterations=50)
      
       node_colors = []
       color_map = {"genes": "red", "diseases": "blue", "drugs": "green",
                   "pathways": "orange", "proteins": "purple"}
      
       for node in subgraph.nodes():
           entity_type = self.entities[node]["type"]
           node_colors.append(color_map.get(entity_type, "gray"))
      
       nx.draw(subgraph, pos, node_color=node_colors, node_size=300,
               with_labels=False, alpha=0.7, edge_color="gray", width=0.5)
      
       plt.title("Biomedical Knowledge Graph Network")
       plt.axis('off')
       plt.tight_layout()
       plt.show()

We designed a set of intelligent functions within the BiomedicalAIAgent class to simulate real-world biomedical scenarios. We generate realistic properties for each entity type, define biologically meaningful relationship types, and build a structured knowledge graph using either BioCypher or NetworkX. To gain insights, we included functions for analyzing drug targets, disease-gene associations, pathway connectivity, and network centrality, along with a visual graph explorer that helps us intuitively understand the interactions between biomedical entities.

  def run_analysis_pipeline(self) -> None:
       """Run complete analysis pipeline"""
       print("🚀 Starting BioCypher AI Agent Analysis Pipeline\n")
      
       self.generate_synthetic_data()
       self.build_knowledge_graph()
      
       print(f"📈 Graph Statistics:")
       print(f"   Entities: {len(self.entities)}")
       print(f"   Relationships: {len(self.relationships)}")
       print(f"   Graph Nodes: {len(self.graph.nodes())}")
       print(f"   Graph Edges: {len(self.graph.edges())}\n")
      
       analyses = [
           ("drug_targets", "Drug Target Analysis"),
           ("disease_genes", "Disease-Gene Associations"),
           ("pathway_analysis", "Pathway Connectivity Analysis"),
           ("centrality_analysis", "Network Centrality Analysis")
       ]
      
       for query_type, title in analyses:
           print(f"🔍 {title}:")
           results = self.intelligent_query(query_type)
           self._display_results(results)
           print()
      
       self.visualize_network()
      
       print("✅ Analysis complete! AI Agent successfully analyzed biomedical data.")
      
   def _display_results(self, results: Dict[str, Any], max_items: int = 5) -> None:
       """Display analysis results in a formatted way"""
       if isinstance(results, dict) and "error" not in results:
           for key, value in list(results.items())[:max_items]:
               if isinstance(value, list):
                   print(f"   {key}: {', '.join(value[:3])}{'...' if len(value) > 3 else ''}")
               elif isinstance(value, dict):
                   print(f"   {key}: {dict(list(value.items())[:3])}")
               else:
                   print(f"   {key}: {value}")
       else:
           print(f"   {results}")


   def export_to_formats(self) -> None:
       """Export knowledge graph to various formats"""
       if self.use_biocypher:
           try:
               print("📤 Exporting BioCypher graph...")
               print("✅ BioCypher export completed")
           except Exception as e:
               print(f"BioCypher export failed: {e}")
      
       print("📤 Exporting NetworkX graph to formats...")
      
       graph_data = {
           "nodes": [{"id": n, **self.graph.nodes[n]} for n in self.graph.nodes()],
           "edges": [{"source": u, "target": v, **self.graph.edges[u, v]}
                    for u, v in self.graph.edges()]
       }
      
       try:
           with open("biomedical_graph.json", "w") as f:
               json.dump(graph_data, f, indent=2, default=str)
          
           nx.write_graphml(self.graph, "biomedical_graph.graphml")
           print("✅ Graph exported to JSON and GraphML formats")
       except Exception as e:
           print(f"Export failed: {e}")


   def export_to_formats(self) -> None:
       """Export knowledge graph to various formats"""
       if self.use_biocypher:
           try:
               print("📤 Exporting BioCypher graph...")
               print("✅ BioCypher export completed")
           except Exception as e:
               print(f"BioCypher export failed: {e}")
      
       print("📤 Exporting NetworkX graph to formats...")
      
       graph_data = {
           "nodes": [{"id": n, **self.graph.nodes[n]} for n in self.graph.nodes()],
           "edges": [{"source": u, "target": v, **self.graph.edges[u, v]}
                    for u, v in self.graph.edges()]
       }
      
       with open("biomedical_graph.json", "w") as f:
           json.dump(graph_data, f, indent=2, default=str)
      
       nx.write_graphml(self.graph, "biomedical_graph.graphml")
      
       print("✅ Graph exported to JSON and GraphML formats")
       """Display analysis results in a formatted way"""
       if isinstance(results, dict) and "error" not in results:
           for key, value in list(results.items())[:max_items]:
               if isinstance(value, list):
                   print(f"   {key}: {', '.join(value[:3])}{'...' if len(value) > 3 else ''}")
               elif isinstance(value, dict):
                   print(f"   {key}: {dict(list(value.items())[:3])}")
               else:
                   print(f"   {key}: {value}")
       else:
           print(f"   {results}")

We wrap up the AI agent workflow with a streamlined run_analysis_pipeline() function that ties everything together, from synthetic data generation and graph construction to intelligent query execution and final visualization. This automated pipeline enables us to observe biomedical relationships, analyze central entities, and understand how different biological concepts are interconnected. Finally, using export_to_formats(), we ensure that the resulting graph can be saved in both JSON and GraphML formats for further use, making our analysis both shareable and reproducible.

if __name__ == "__main__":
   agent = BiomedicalAIAgent()
   agent.run_analysis_pipeline()

We conclude the tutorial by instantiating our BiomedicalAIAgent and running the full analysis pipeline. This entry point enables us to execute all steps, including data generation, graph building, intelligent querying, visualization, and reporting, in a single, streamlined command, making it easy to explore biomedical knowledge using BioCypher.

In conclusion, through this advanced tutorial, we gain practical experience working with BioCypher to create scalable biomedical knowledge graphs and perform insightful biological analyses. The dual-mode support ensures that even if BioCypher is unavailable, the system gracefully falls back to NetworkX for full functionality. The ability to generate synthetic datasets, execute intelligent graph queries, visualize relationships, and export in multiple formats showcases the flexibility and analytical power of the BioCypher-based agent. Overall, this tutorial exemplifies how BioCypher can serve as a critical infrastructure layer for biomedical AI systems, making complex biological data both usable and insightful for downstream applications.

Check out the Codes here. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.

Building a BioCypher-Powered AI Agent for Biomedical Knowledge Graph Generation and Querying

“Periodic table of machine learning” could fuel AI discovery | MIT News

ReasonFlux-PRM: A Trajectory-Aware Reward Model Enhancing Chain-of-Thought Reasoning in LLMs

Don’t let hype about AI agents get ahead of reality

New model predicts a chemical reaction’s point of no return | MIT News

Top Insights

Analysts Predict Major Dogecoin Price Rally After Breaking 50-Day Trendline

Ethereum.org year in review | Ethereum Foundation Blog

“Periodic table of machine learning” could fuel AI discovery | MIT News

Building a BioCypher-Powered AI Agent for Biomedical Knowledge Graph Generation and Querying

Related Posts

“Periodic table of machine learning” could fuel AI discovery | MIT News

ReasonFlux-PRM: A Trajectory-Aware Reward Model Enhancing Chain-of-Thought Reasoning in LLMs

Don’t let hype about AI agents get ahead of reality

New model predicts a chemical reaction’s point of no return | MIT News

Analysts Predict Major Dogecoin Price Rally After Breaking 50-Day Trendline

Ethereum.org year in review | Ethereum Foundation Blog

“Periodic table of machine learning” could fuel AI discovery | MIT News

Subscribe to Updates