Extracting Wikipedia Information Using Python in a Windows Environment


Python offers a versatile set of libraries for working with online databases and resources. One such resource is Wikipedia: with the wikipedia package (installed with pip install wikipedia and imported as wikipedia), developers can fetch search results, summaries, page content, and metadata programmatically. This guide covers using the package in a Windows environment, with step-by-step instructions for installation and usage and Python code examples.

Setting Up Python in a Windows Environment

  1. Install Python
    If you don’t already have Python installed, download it from the official website (https://www.python.org/downloads/windows/). During installation, check the box “Add Python to PATH” (labelled “Add python.exe to PATH” in recent installers) so that the python and pip commands work from the command line.
  2. Install pip (if not already installed)
    pip is the package manager for Python. It normally comes pre-installed with Python. If it is missing, running python -m ensurepip --upgrade from the Command Prompt installs it.
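  3. Install the wikipedia package
    Once Python and pip are set up, open the Command Prompt (Windows + R, type cmd, and press Enter) and run the following command:
   pip install wikipedia

This downloads and installs the wikipedia package along with its dependencies. The examples below use import wikipedia, which this package provides. The similarly named wikipedia-api package is a different library (imported as wikipediaapi) with a different interface, so installing it will not make these examples run.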
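
Before moving on, it can help to confirm that the module the examples import is actually visible to the interpreter you are running. The following is a minimal sketch using only the standard library; the helper name package_status is hypothetical, and it assumes the distribution and the module share the name wikipedia.

```python
import importlib.util
from importlib import metadata


def package_status(module_name, dist_name=None):
    """Return (importable, version) for a top-level module.

    version is None when no installed distribution with that name is found.
    """
    dist_name = dist_name or module_name
    importable = importlib.util.find_spec(module_name) is not None
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    return importable, version


# The examples in this guide start with "import wikipedia",
# so that is the module name checked here.
importable, version = package_status("wikipedia")
print(f"wikipedia importable: {importable}, installed version: {version}")
```

If the first value is False, the package was probably installed for a different Python installation than the one running the script.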

Code Example
Once the package is installed, you can use the following Python code in your Windows environment to retrieve data from Wikipedia:

  1. Create a Python File
    Open Notepad (or any Python IDE such as PyCharm or VS Code), and create a new Python file named wikipedia_example.py.
  2. Python Code for Wikipedia Extraction
# Import the Wikipedia library
import wikipedia

# 1. Search for a topic and retrieve suggestions
print("Search Results for 'Programming':")
search_results = wikipedia.search("Programming")
print(search_results)

# 2. Extract summary of a Wikipedia article (e.g., Linux)
print("\nSummary of Linux:")
summary = wikipedia.summary("Linux")
print(summary)

# 3. Extract a limited number of sentences from a Wikipedia article (e.g., Android)
print("\nFirst 2 sentences of the Android article:")
short_summary = wikipedia.summary("Android", sentences=2)
print(short_summary)

# 4. Load the Wikipedia page object (e.g., Android operating system)
# Printing the object shows only a short representation with its title;
# the content and metadata are read from its attributes (steps 5-8).
print("\nPage object for 'Android (operating system)':")
page = wikipedia.page("Android (operating system)")
print(page)

# 5. Load the page used for the metadata steps below and print its title
python_page = wikipedia.page("Python (programming language)")
print("\nTitle of 'Python (programming language)':")
print(python_page.title)

# 6. Extract the plain text content of the page
print("\nContent of 'Python (programming language)':")
print(python_page.content)

# 7. Retrieve references from the page
print("\nReferences of 'Python (programming language)':")
print(python_page.references)

# 8. Retrieve categories from the page
print("\nCategories of 'Python (programming language)':")
print(python_page.categories)
  3. Save and Run the Script
    Save the file and run it from the Command Prompt or your preferred Python IDE.
  • Using Command Prompt: Navigate to the folder where your script is saved and run the following command:
   python wikipedia_example.py
    The script fetches information from Wikipedia and displays the output in the Command Prompt.

Code Explanation

  1. Search for Topics
    The script starts by calling wikipedia.search() with a topic (e.g., “Programming”), which returns a list of related article titles.
   search_results = wikipedia.search("Programming")
   print(search_results)
  2. Extract Summaries
    The wikipedia.summary() function retrieves the summary of the specified article (e.g., “Linux”). With the sentences argument, it returns only that many sentences (e.g., two for “Android”).
   summary = wikipedia.summary("Linux")
   short_summary = wikipedia.summary("Android", sentences=2)
   print(summary)
   print(short_summary)
  3. Retrieve the Full Wikipedia Page
    The wikipedia.page() function returns a page object for a specific article. Printing the object shows only a short representation with its title; the full content and metadata are available through its attributes.
   page = wikipedia.page("Android (operating system)")
   print(page)
  4. Retrieve Metadata (Title, References, Categories)
    The code then reads the title, content, references, and categories from the attributes of the page object.
   python_page = wikipedia.page("Python (programming language)")
   print(python_page.title)
   print(python_page.content)
   print(python_page.references)
   print(python_page.categories)
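
Printing the full content, references, and categories of a large article produces a lot of output. One way to get a quick overview is to condense the page attributes into a small dictionary first. This is a sketch, not part of the library: page_overview and its preview_chars parameter are hypothetical names, and the stand-in object below only mimics the title, url, content, references, and categories attributes of a real page object so that the example runs without network access.

```python
from types import SimpleNamespace


def page_overview(page, preview_chars=200):
    """Summarize a page-like object into a small dictionary."""
    return {
        "title": page.title,
        "url": page.url,
        "preview": page.content[:preview_chars],
        "reference_count": len(page.references),
        "category_count": len(page.categories),
    }


# Offline stand-in with the same attribute names as a page object.
fake_page = SimpleNamespace(
    title="Example",
    url="https://example.org/wiki/Example",
    content="Example text. " * 50,
    references=["https://example.org/a", "https://example.org/b"],
    categories=["Examples"],
)

for key, value in page_overview(fake_page, preview_chars=40).items():
    print(f"{key}: {value}")
```

With the package installed, the same helper could be called as page_overview(wikipedia.page("Python (programming language)")).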

Handling Errors

Calls to the library can raise exceptions, most commonly DisambiguationError when a title matches several articles and PageError when no matching page exists. You can handle them as follows:

import wikipedia

try:
    page = wikipedia.page("Python (programming language)")
    print(page.content)
except wikipedia.exceptions.DisambiguationError as e:
    print(f"DisambiguationError: {e.options}")
except wikipedia.exceptions.PageError:
    print("Page not found")
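
The same handling can be wrapped in a reusable helper. The sketch below is one possible policy, not the library's own behavior: on an ambiguous title it retries once with the first suggested option and otherwise gives up and returns None. The name safe_summary is hypothetical, and the helper receives the imported module as its api argument so that it can be exercised without network access. It assumes the module exposes summary() and an exceptions namespace containing WikipediaException, DisambiguationError, and PageError.

```python
def safe_summary(title, api, sentences=2):
    """Return a short summary, or None if the title cannot be resolved.

    api is the imported wikipedia module, or any stand-in object that
    provides summary() and the same exceptions namespace.
    """
    try:
        return api.summary(title, sentences=sentences)
    except api.exceptions.DisambiguationError as err:
        # Ambiguous title: retry once with the first suggested option.
        if not err.options:
            return None
        try:
            return api.summary(err.options[0], sentences=sentences)
        except api.exceptions.WikipediaException:
            # The suggestion was also ambiguous, missing, or failed to load.
            return None
    except api.exceptions.PageError:
        # No page matches the requested title.
        return None
```

With the package installed, usage would look like safe_summary("Mercury", wikipedia) after import wikipedia. Taking the first option is a blunt choice; for interactive use it is usually better to show the options and let the user pick.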

Using the Windows Environment for Automation

Once set up, Python scripts can be automated in a Windows environment using Task Scheduler or by running batch files.

Automating with Task Scheduler

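  • Open Task Scheduler (Windows + R, type taskschd.msc, and press Enter).
  • Create a new task and set the trigger (e.g., every day at a specific time).
  • Under the “Actions” tab, choose “Start a program”. In “Program/script”, browse to your Python interpreter (python.exe), and in “Add arguments”, enter the full path to your script, for example C:\path\to\your\script\wikipedia_example.py.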
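
A scheduled task usually runs without a visible console, so anything the script prints is easy to lose. One option is to write the results to a timestamped file instead. The following is a minimal sketch; save_text, the output folder, and the file-name pattern are all hypothetical choices. The file is written as UTF-8 explicitly because Wikipedia text often contains characters that the default Windows encoding cannot represent.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def save_text(text, out_dir, stem="wikipedia_output"):
    """Write text to a new timestamped UTF-8 file and return its path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = out_dir / f"{stem}_{stamp}.txt"
    path.write_text(text, encoding="utf-8")
    return path


# Demonstration in a temporary folder; a scheduled script would pass
# a permanent folder of its own choosing instead.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_text("Sample summary text", tmp)
    print(f"Wrote {saved.name}")
```

In the scheduled script, the string passed to save_text would be the summary or page content retrieved earlier.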

Running with Batch Files

You can create a batch file (run_wiki_script.bat) to automate the process of running your Python script:

   @echo off
   python C:\path\to\your\script\wikipedia_example.py
   pause

Save the batch file and double-click it to run your Python script.
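
The batch file above relies on python being on PATH and on a script path without spaces. As an alternative, the batch file can be generated from Python itself, using the absolute path of the interpreter that is currently running and quoting both paths. This is only a sketch: batch_file_text and write_batch_file are hypothetical helper names, and the paths in the example are placeholders.

```python
import sys


def batch_file_text(script_path, python_exe=None, keep_open=True):
    """Build the contents of a batch file that runs script_path."""
    python_exe = python_exe or sys.executable
    lines = ["@echo off", f'"{python_exe}" "{script_path}"']
    if keep_open:
        # Keeps the window open after a double-click so output stays visible.
        lines.append("pause")
    return "\r\n".join(lines) + "\r\n"


def write_batch_file(bat_path, script_path, **options):
    """Write the batch file; newline='' keeps the CRLF endings unchanged."""
    with open(bat_path, "w", newline="") as handle:
        handle.write(batch_file_text(script_path, **options))


# Placeholder paths for illustration only.
print(batch_file_text(
    r"C:\path\to\your\script\wikipedia_example.py",
    python_exe=r"C:\path\to\python.exe",
))
```

For a batch file started by Task Scheduler rather than by double-click, passing keep_open=False avoids a task that waits forever at the pause prompt.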

Conclusion

This guide demonstrates how to set up Python in a Windows environment to retrieve and process information from Wikipedia. With a single pip install and a few lines of code, you can integrate the wikipedia library into your workflow. From searches and short summaries to full page content, references, and categories, the library covers the common ways of working with one of the most comprehensive information repositories available.

