The Wikipedia Search Crawler finds the shortest path between two Wikipedia articles using a choice of graph search algorithms. It uses Python's asynchronous capabilities to fetch and parse web pages, constructs a graph representing the links between Wikipedia pages, and then runs a search algorithm over that graph to determine the path.
### Features
- Graph Construction: Dynamically builds a graph by fetching Wikipedia pages, starting from the specified start and end pages and exploring up to a user-defined depth (see the sketch after this list).
- Progress Tracking: Uses a progress bar to provide real-time feedback on the number of pages processed during graph construction.
- Error Handling: Robust error handling keeps the crawler stable across network failures and data-parsing issues.
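As a rough illustration of the fetch-and-expand loop described above, here is a minimal sketch using `aiohttp` and BeautifulSoup. The function names, the depth-limited expansion strategy, and the link filtering are illustrative assumptions, not the crawler's actual API:

```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup

WIKI_BASE = "https://en.wikipedia.org/wiki/"

async def fetch_links(session, title):
    """Fetch one Wikipedia article and return the titles it links to."""
    async with session.get(WIKI_BASE + title) as resp:
        resp.raise_for_status()
        html = await resp.text()
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.select("a[href^='/wiki/']"):
        target = a["href"][len("/wiki/"):]
        if ":" not in target:  # skip namespace pages like File: and Category:
            links.add(target)
    return sorted(links)

async def build_graph(start, depth):
    """Expand the link graph breadth-first up to `depth` hops from `start`."""
    graph = {}
    frontier = [start]
    async with aiohttp.ClientSession() as session:
        for _ in range(depth):
            pending = [t for t in frontier if t not in graph]
            results = await asyncio.gather(
                *(fetch_links(session, t) for t in pending),
                return_exceptions=True,  # one failed fetch shouldn't abort the crawl
            )
            frontier = []
            for title, links in zip(pending, results):
                if isinstance(links, Exception):
                    continue  # network or parsing error: leave this node out
                graph[title] = links
                frontier.extend(links)
    return graph

# Example: asyncio.run(build_graph("Graph_theory", depth=2))
```

Passing `return_exceptions=True` to `asyncio.gather` lets individual failed fetches be skipped without aborting the whole batch, in the spirit of the error-handling goal above.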
### Search Algorithms
- Breadth-First Search (BFS): Finds the shortest path in an unweighted graph in terms of the number of edges traversed (see the sketch after this list).
- Depth-First Search (DFS): Explores as far as possible along each branch before backtracking; useful for enumerating paths when modified to limit depth.
- Dijkstra's Algorithm: Finds the shortest path in graphs whose edges may carry weights.
- A* (A-Star) Search: Uses a heuristic estimate of the cost from the current node to the end node to guide the search toward the shortest path efficiently.
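To make the BFS case concrete, here is a sketch of shortest-path recovery over an adjacency dict like the one the `build_graph` sketch above returns (the graph representation is an assumption, not the crawler's actual data structure):

```python
from collections import deque

def bfs_path(graph, start, goal):
    """Shortest path by number of edges in an unweighted adjacency dict."""
    if start == goal:
        return [start]
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor in parents:
                continue  # already visited
            parents[neighbor] = node
            if neighbor == goal:
                path = [goal]  # walk parent pointers back to the start
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None  # goal not reachable within the crawled subgraph
```

Dijkstra's algorithm and A* follow the same skeleton but replace the FIFO queue with a `heapq` priority queue, ordered by accumulated cost (plus the heuristic estimate, in the case of A*).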
### Testing
The project was tested with a variety of start and end pages on Wikipedia and with each of the search algorithms to check reliability and accuracy. Tests focused on verifying that correct paths were found, on the efficiency of the algorithms, and on the handling of non-existent or looped paths.
To reproduce the test results, run the crawler with different start and end Wikipedia articles and experiment with the available search algorithms, checking that the crawler constructs the graph and finds appropriate paths. For example:
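The article titles below are illustrative; the flags match the Usage section at the end of this README:

```bash
# Same pair of articles with two different algorithms, to compare the paths found
python crawler.py --start_page "Python_(programming_language)" --end_page "Graph_theory" --algorithm bfs
python crawler.py --start_page "Python_(programming_language)" --end_page "Graph_theory" --algorithm a_star
```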
### Prerequisites
- Python 3.7 or higher
- pip for Python package management
### Installation
Install the required Python packages using pip:
```bash
pip install -r requirements.txt
```
### Setting Up
1. **Clone the repository**:
   ```bash
   git clone https://github.com/memaxo/WikipediaGame.git
   cd WikipediaGame
   ```
2. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
3. **Set environment variables**: Create a `.env` file in the root directory and specify the following variables:
   ```
   RATE_LIMIT=5/minute  # Adjust rate limiting for API requests as needed
   ```
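This README doesn't show how the crawler reads `RATE_LIMIT`; one common pattern, assumed here, uses the `python-dotenv` package:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read key=value pairs from .env into the process environment
rate_limit = os.getenv("RATE_LIMIT", "5/minute")  # fall back to the documented default
print("Using rate limit:", rate_limit)
```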
### Usage
To run the crawler directly from the command line and find a path between two Wikipedia pages:
```bash
python crawler.py --start_page "Start_Article" --end_page "End_Article" --algorithm bfs
```
Replace `"Start_Article"` and `"End_Article"` with your chosen Wikipedia articles, and select an algorithm (`bfs`, `dfs`, `dijkstra`, or `a_star`).