Information Retrieval Systems

Practical Lab 1: Introduction to Search, Documents, and Boolean Retrieval

Author: Engr. Mbiarrambang Alain

Affiliation: PaveWay Academy

Published: May 15, 2026

1 Lab Overview

This first practical lab introduces students to the core idea of Information Retrieval (IR): finding relevant documents from a collection using simple search logic.

Students will learn how to:

  • inspect a small document collection,
  • tokenize text into words,
  • build a simple term-document table,
  • perform Boolean retrieval using AND, OR, and NOT,
  • understand why indexes are used in real search engines.

2 Learning Outcomes

By the end of this lab, students should be able to:

  1. explain the meaning of a document collection in IR,
  2. identify terms in a text collection,
  3. build a small binary term-document matrix,
  4. run simple Boolean searches on documents,
  5. compare linear search with indexed search,
  6. relate the lab work to real-world systems such as Google Search and enterprise document search.

3 Lab Requirements

3.1 Software Needed

  • Python 3.x
  • Jupyter Notebook or Quarto
  • Basic text editor or IDE

3.2 Python Libraries

import re
from collections import defaultdict

4 Introduction

An Information Retrieval system helps a user find documents that satisfy an information need.

A simple example is a search over a small set of documents such as:

  • lecture notes,
  • emails,
  • news articles,
  • PDF files,
  • web pages.

In this lab, we will use a tiny artificial collection so that the ideas are easy to understand before moving to large-scale systems.

5 Part A: Create a Small Document Collection

Use the following collection of documents.

documents = {
    1: "Artificial intelligence is changing healthcare",
    2: "Machine learning and artificial intelligence are related",
    3: "Cybersecurity protects computers and networks",
    4: "Healthcare systems use data and machine learning",
    5: "Cloud computing supports modern applications"
}

documents
{1: 'Artificial intelligence is changing healthcare',
 2: 'Machine learning and artificial intelligence are related',
 3: 'Cybersecurity protects computers and networks',
 4: 'Healthcare systems use data and machine learning',
 5: 'Cloud computing supports modern applications'}

5.1 Student Task

Read each document carefully and identify the most important words.

6 Part B: Text Tokenization

Tokenization means splitting text into individual words or terms.

For example:

"Machine learning is powerful"

becomes:

["machine", "learning", "is", "powerful"]

6.1 Code

def tokenize(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return text.split()

for doc_id, text in documents.items():
    print(f"Document {doc_id}: {tokenize(text)}")
Document 1: ['artificial', 'intelligence', 'is', 'changing', 'healthcare']
Document 2: ['machine', 'learning', 'and', 'artificial', 'intelligence', 'are', 'related']
Document 3: ['cybersecurity', 'protects', 'computers', 'and', 'networks']
Document 4: ['healthcare', 'systems', 'use', 'data', 'and', 'machine', 'learning']
Document 5: ['cloud', 'computing', 'supports', 'modern', 'applications']
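To explore the student questions below, it can help to run the tokenizer on text that contains punctuation, hyphens, and accented letters. The following is a small sketch that reuses the `tokenize` function from this part; the sample sentences are made up for illustration and are not part of the lab collection.

```python
import re

def tokenize(text):
    # Same tokenizer as in the lab: lowercase, delete every character
    # that is not a-z, 0-9 or whitespace, then split on whitespace.
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return text.split()

# Hypothetical sample sentences (not from the lab collection)
samples = [
    "Hello, World!",
    "State-of-the-art AI",
    "Don't panic",
    "Caf\u00e9 menu",  # "Café menu", written with an escape for clarity
]

for sample in samples:
    print(sample, "->", tokenize(sample))
```

Note that removed characters are deleted rather than replaced by a space, so "State-of-the-art" becomes the single token `stateoftheart`, and the accented letter in "Café" is dropped, leaving `caf`. Whether that behaviour is acceptable depends on the collection being searched.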

6.2 Student Questions

  1. What does the tokenizer remove?
  2. Why do we convert text to lowercase?
  3. What happens to punctuation marks?

7 Part C: Build a Term-Document Matrix

A binary term-document matrix has one row per term and one column per document. Each cell records whether that term appears in that document.

  • 1 means the term is present.
  • 0 means the term is absent.

7.1 Step 1: Find the Vocabulary

vocabulary = sorted(set(term for text in documents.values() for term in tokenize(text)))
vocabulary
['and',
 'applications',
 'are',
 'artificial',
 'changing',
 'cloud',
 'computers',
 'computing',
 'cybersecurity',
 'data',
 'healthcare',
 'intelligence',
 'is',
 'learning',
 'machine',
 'modern',
 'networks',
 'protects',
 'related',
 'supports',
 'systems',
 'use']

7.2 Step 2: Create the Matrix

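# Tokenize each document once and keep its tokens as a set for fast lookup
doc_tokens = {doc_id: set(tokenize(text)) for doc_id, text in documents.items()}

term_document_matrix = []

for term in vocabulary:
    # One row per term, one column per document (in dictionary order)
    row = [1 if term in doc_tokens[doc_id] else 0 for doc_id in documents]
    term_document_matrix.append(row)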

# Display in a readable form
for term, row in zip(vocabulary, term_document_matrix):
    print(f"{term:15} {row}")
and             [0, 1, 1, 1, 0]
applications    [0, 0, 0, 0, 1]
are             [0, 1, 0, 0, 0]
artificial      [1, 1, 0, 0, 0]
changing        [1, 0, 0, 0, 0]
cloud           [0, 0, 0, 0, 1]
computers       [0, 0, 1, 0, 0]
computing       [0, 0, 0, 0, 1]
cybersecurity   [0, 0, 1, 0, 0]
data            [0, 0, 0, 1, 0]
healthcare      [1, 0, 0, 1, 0]
intelligence    [1, 1, 0, 0, 0]
is              [1, 0, 0, 0, 0]
learning        [0, 1, 0, 1, 0]
machine         [0, 1, 0, 1, 0]
modern          [0, 0, 0, 0, 1]
networks        [0, 0, 1, 0, 0]
protects        [0, 0, 1, 0, 0]
related         [0, 1, 0, 0, 0]
supports        [0, 0, 0, 0, 1]
systems         [0, 0, 0, 1, 0]
use             [0, 0, 0, 1, 0]

7.3 Interpretation

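A single-term query is answered by reading that term's row: the 1s mark the matching documents. A Boolean query combines rows position by position. Most cells are 0, however, so the matrix wastes space as the collection grows, which motivates the inverted index in Part D.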
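As an illustration of how matrix rows can answer a Boolean query, the sketch below copies three rows from the output above and combines them element by element. The helper names (`row_and`, `row_or`, `row_not`, `matching_docs`) are assumptions introduced here for illustration, not part of the lab code.

```python
# Three rows copied from the term-document matrix above.
# Columns correspond to documents 1 to 5.
doc_ids = [1, 2, 3, 4, 5]
matrix_rows = {
    "healthcare": [1, 0, 0, 1, 0],
    "machine":    [0, 1, 0, 1, 0],
    "cloud":      [0, 0, 0, 0, 1],
}

def row_and(row_a, row_b):
    # 1 only where both rows have a 1
    return [a & b for a, b in zip(row_a, row_b)]

def row_or(row_a, row_b):
    # 1 where at least one row has a 1
    return [a | b for a, b in zip(row_a, row_b)]

def row_not(row):
    # Flip every bit
    return [1 - a for a in row]

def matching_docs(row):
    # Translate a 0/1 row back into document IDs
    return [doc_id for doc_id, bit in zip(doc_ids, row) if bit]

print(matching_docs(row_and(matrix_rows["healthcare"], matrix_rows["machine"])))  # [4]
print(matching_docs(row_or(matrix_rows["healthcare"], matrix_rows["machine"])))   # [1, 2, 4]
print(matching_docs(row_not(matrix_rows["cloud"])))                               # [1, 2, 3, 4]
```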

8 Part D: Create an Inverted Index

An inverted index maps each term to the documents that contain it.

Example:

  • machine -> [2, 4]
  • healthcare -> [1, 4]

8.1 Code

inverted_index = defaultdict(list)

for doc_id, text in documents.items():
    tokens = set(tokenize(text))
    for term in tokens:
        inverted_index[term].append(doc_id)

for term in sorted(inverted_index):
    print(f"{term:15} -> {inverted_index[term]}")
and             -> [2, 3, 4]
applications    -> [5]
are             -> [2]
artificial      -> [1, 2]
changing        -> [1]
cloud           -> [5]
computers       -> [3]
computing       -> [5]
cybersecurity   -> [3]
data            -> [4]
healthcare      -> [1, 4]
intelligence    -> [1, 2]
is              -> [1]
learning        -> [2, 4]
machine         -> [2, 4]
modern          -> [5]
networks        -> [3]
protects        -> [3]
related         -> [2]
supports        -> [5]
systems         -> [4]
use             -> [4]

8.2 Why This Matters

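Without an inverted index, a search engine would need to scan every document, one by one, for every query. With an inverted index, the system looks up the query term once and reads off the list of documents that contain it. The index is built once, in advance, and reused for every query.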
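The difference between the two approaches can be sketched on the lab collection. The function names `linear_search` and `indexed_search` are assumptions introduced for this comparison; both return the same answer, but they do different amounts of work per query.

```python
import re
from collections import defaultdict

documents = {
    1: "Artificial intelligence is changing healthcare",
    2: "Machine learning and artificial intelligence are related",
    3: "Cybersecurity protects computers and networks",
    4: "Healthcare systems use data and machine learning",
    5: "Cloud computing supports modern applications",
}

def tokenize(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return text.split()

def linear_search(term):
    # Re-tokenizes and inspects every document on every query
    return [doc_id for doc_id, text in documents.items() if term in tokenize(text)]

# Build the inverted index once, before any query arrives
inverted_index = defaultdict(list)
for doc_id, text in documents.items():
    for term in set(tokenize(text)):
        inverted_index[term].append(doc_id)

def indexed_search(term):
    # One dictionary lookup; no document text is read at query time
    return inverted_index.get(term, [])

print(linear_search("healthcare"))   # [1, 4]
print(indexed_search("healthcare"))  # [1, 4]
```

With five documents the difference is invisible, but the work done by `linear_search` grows with the total amount of text in the collection, while `indexed_search` only touches the entry for the query term.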

9 Part E: Boolean Retrieval

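Boolean retrieval combines query terms with logical operators and returns exactly the documents that satisfy the expression. A document either matches or it does not; the results are not ranked.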

  • AND means both terms must appear.
  • OR means at least one term must appear.
  • NOT means the term must not appear.

9.1 Boolean Search Functions

def boolean_and(term1, term2):
    return sorted(set(inverted_index.get(term1, [])) & set(inverted_index.get(term2, [])))

def boolean_or(term1, term2):
    return sorted(set(inverted_index.get(term1, [])) | set(inverted_index.get(term2, [])))

def boolean_not(term, all_doc_ids):
    return sorted(set(all_doc_ids) - set(inverted_index.get(term, [])))

9.2 Sample Queries

all_doc_ids = list(documents.keys())

print("machine AND learning ->", boolean_and("machine", "learning"))
print("healthcare OR cybersecurity ->", boolean_or("healthcare", "cybersecurity"))
print("NOT cloud ->", boolean_not("cloud", all_doc_ids))
machine AND learning -> [2, 4]
healthcare OR cybersecurity -> [1, 3, 4]
NOT cloud -> [1, 2, 3, 4]
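The three functions above each take term names, so they cannot be nested directly. One possible way to combine operators, shown here only as a sketch, is to work on sets of document IDs instead. The helper `postings` is an assumption introduced for this example, and the small index is copied from the Part D output.

```python
# A few entries copied from the inverted index in Part D
inverted_index = {
    "artificial": [1, 2],
    "cloud": [5],
    "healthcare": [1, 4],
    "learning": [2, 4],
    "machine": [2, 4],
}
all_doc_ids = {1, 2, 3, 4, 5}

def postings(term):
    # Set of document IDs containing the term (empty set if unknown)
    return set(inverted_index.get(term, []))

# healthcare AND NOT machine
healthcare_not_machine = sorted(postings("healthcare") - postings("machine"))

# (artificial OR cloud) AND NOT machine
artificial_or_cloud_not_machine = sorted(
    (postings("artificial") | postings("cloud")) - postings("machine")
)

# NOT (machine OR cloud)
not_machine_or_cloud = sorted(all_doc_ids - (postings("machine") | postings("cloud")))

print(healthcare_not_machine)           # [1]
print(artificial_or_cloud_not_machine)  # [1, 5]
print(not_machine_or_cloud)             # [1, 3]
```

Python's set operators map directly onto the Boolean operators: `&` for AND, `|` for OR, and `-` for AND NOT.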

10 Part F: Build a Simple Query Processor

Now we combine the ideas into a query processor that accepts simple Boolean queries.

10.1 Code

def search(query):
    query = query.lower().strip()

    if " and " in query:
        parts = query.split(" and ")
        if len(parts) == 2:
            return boolean_and(parts[0].strip(), parts[1].strip())

    if " or " in query:
        parts = query.split(" or ")
        if len(parts) == 2:
            return boolean_or(parts[0].strip(), parts[1].strip())

    if query.startswith("not "):
        # Strip only the leading operator; replace() would also remove
        # any later "not " inside the query
        term = query[len("not "):].strip()
        return boolean_not(term, all_doc_ids)

    return sorted(inverted_index.get(query, []))

print(search("machine and learning"))
print(search("healthcare or cloud"))
print(search("not cybersecurity"))
print(search("artificial"))
[2, 4]
[1, 4, 5]
[1, 2, 4, 5]
[1, 2]
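The `search` function above handles exactly one operator between two terms. A query with three terms, such as "machine and learning and healthcare", falls through every branch and is looked up as a single term, so it returns an empty list. The sketch below shows one possible generalization to chains of the same operator; `search_chain` is a hypothetical name, and mixing AND with OR in one query is still not supported.

```python
# A few entries copied from the inverted index in Part D
inverted_index = {
    "cloud": [5],
    "cybersecurity": [3],
    "healthcare": [1, 4],
    "learning": [2, 4],
    "machine": [2, 4],
}

def search_chain(query):
    """Evaluate 'a and b and c' or 'a or b or c' (one operator type per query)."""
    query = query.lower().strip()

    if " and " in query:
        terms = [t.strip() for t in query.split(" and ")]
        # Start from the first postings list and narrow it term by term
        result = set(inverted_index.get(terms[0], []))
        for term in terms[1:]:
            result &= set(inverted_index.get(term, []))
        return sorted(result)

    if " or " in query:
        terms = [t.strip() for t in query.split(" or ")]
        # Start from nothing and add the postings of every term
        result = set()
        for term in terms:
            result |= set(inverted_index.get(term, []))
        return sorted(result)

    return sorted(inverted_index.get(query, []))

print(search_chain("machine and learning and healthcare"))   # [4]
print(search_chain("healthcare or cloud or cybersecurity"))  # [1, 3, 4, 5]
```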

11 Part G: Read the Results

For each query below, write down the returned document IDs and explain the result in plain English.

  1. artificial and intelligence
  2. healthcare or cloud
  3. not machine
  4. cybersecurity

11.1 Student Exercise

For each search result, answer:

  • Which documents matched?
  • Why did they match?
  • Which terms were responsible for the match?

12 Part H: Real-World Scenario

Imagine a university digital library.

A student searches for:

machine learning and healthcare

The search system should return documents such as:

  • research papers on medical AI,
  • lecture notes on data science in medicine,
  • project reports on hospital prediction systems.

In a real search engine, this process happens on millions or billions of documents.

13 Part I: Reflection Questions

Answer the following questions in your lab notebook:

  1. Why is an inverted index faster than scanning all documents?
  2. What is the advantage of Boolean retrieval?
  3. Why is Boolean retrieval sometimes too strict for real users?
  4. How does tokenization improve search?
  5. Why do search engines need ranking beyond Boolean matching?

14 Challenge Exercise

Extend the search system so that it supports phrase-style matching such as:

"machine learning"

Hint: you may need to store word positions in each document.
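As a possible starting point for the hint, the sketch below builds a positional index for the lab collection: for each term it records the documents that contain it and the word positions inside each document. The name `positional_index` and its nested layout are assumptions, one of several reasonable designs. The phrase check itself is left as the exercise.

```python
import re
from collections import defaultdict

documents = {
    1: "Artificial intelligence is changing healthcare",
    2: "Machine learning and artificial intelligence are related",
    3: "Cybersecurity protects computers and networks",
    4: "Healthcare systems use data and machine learning",
    5: "Cloud computing supports modern applications",
}

def tokenize(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return text.split()

# term -> {doc_id: [positions of the term in that document]}
positional_index = defaultdict(dict)

for doc_id, text in documents.items():
    for position, term in enumerate(tokenize(text)):
        positional_index[term].setdefault(doc_id, []).append(position)

print(positional_index["machine"])   # {2: [0], 4: [5]}
print(positional_index["learning"])  # {2: [1], 4: [6]}
```

In both documents, "learning" sits exactly one position after "machine", which is the condition a phrase query for "machine learning" needs to test.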

15 Lab Assignment

Submit the following:

  1. The Python code used in the lab.
  2. The outputs of the sample queries.
  3. Answers to the reflection questions.
  4. A short paragraph explaining one real-life use of Boolean retrieval.

16 Summary

In this lab, students learned the foundation of search systems:

  • documents are processed into terms,
  • terms are organized into indexes,
  • Boolean queries are evaluated on those indexes,
  • the system returns matching documents quickly.

These ideas underpin later topics such as ranked retrieval, tf-idf, BM25, evaluation metrics, web crawling, clustering, and machine learning-based search.