import re
from collections import defaultdict

Information Retrieval Systems
Practical Lab 1: Introduction to Search, Documents, and Boolean Retrieval
1 Lab Overview
This first practical lab introduces students to the core idea of Information Retrieval (IR): finding relevant documents from a collection using simple search logic.
Students will learn how to:
- inspect a small document collection,
- tokenize text into words,
- build a simple term-document table,
- perform Boolean retrieval using AND, OR, and NOT,
- understand why indexes are used in real search engines.
2 Learning Outcomes
By the end of this lab, students should be able to:
- explain the meaning of a document collection in IR,
- identify terms in a text collection,
- build a small binary term-document matrix,
- run simple Boolean searches on documents,
- compare linear search with indexed search,
- relate the lab work to real-world systems such as Google Search and enterprise document search.
3 Lab Requirements
3.1 Software Needed
- Python 3.x
- Jupyter Notebook or Quarto
- Basic text editor or IDE
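3.2 Python Libraries
- re (standard library), used by the tokenizer to strip unwanted characters
- collections.defaultdict (standard library), used to build the inverted index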
4 Introduction
An Information Retrieval system helps a user find documents that satisfy an information need.
A simple example is a search over a small set of documents:
- lecture notes,
- emails,
- news articles,
- PDF files,
- web pages.
In this lab, we will use a tiny artificial collection so that the ideas are easy to understand before moving to large-scale systems.
5 Part A: Create a Small Document Collection
Use the following collection of documents.
documents = {
    1: "Artificial intelligence is changing healthcare",
    2: "Machine learning and artificial intelligence are related",
    3: "Cybersecurity protects computers and networks",
    4: "Healthcare systems use data and machine learning",
    5: "Cloud computing supports modern applications"
}
documents

Output:
{1: 'Artificial intelligence is changing healthcare',
 2: 'Machine learning and artificial intelligence are related',
 3: 'Cybersecurity protects computers and networks',
 4: 'Healthcare systems use data and machine learning',
 5: 'Cloud computing supports modern applications'}
5.1 Student Task
Read each document carefully and identify the most important words.
6 Part B: Text Tokenization
Tokenization means splitting text into individual words or terms.
For example:
"Machine learning is powerful"
becomes:
["machine", "learning", "is", "powerful"]
6.1 Code
def tokenize(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return text.split()

for doc_id, text in documents.items():
    print(f"Document {doc_id}: {tokenize(text)}")

Output:
Document 1: ['artificial', 'intelligence', 'is', 'changing', 'healthcare']
Document 2: ['machine', 'learning', 'and', 'artificial', 'intelligence', 'are', 'related']
Document 3: ['cybersecurity', 'protects', 'computers', 'and', 'networks']
Document 4: ['healthcare', 'systems', 'use', 'data', 'and', 'machine', 'learning']
Document 5: ['cloud', 'computing', 'supports', 'modern', 'applications']
6.2 Student Questions
- What does the tokenizer remove?
- Why do we convert text to lowercase?
- What happens to punctuation marks?
7 Part C: Build a Term-Document Matrix
A term-document matrix shows whether a term appears in a document.
- 1 means the term is present.
- 0 means the term is absent.
7.1 Step 1: Find the Vocabulary
vocabulary = sorted(set(term for text in documents.values() for term in tokenize(text)))
vocabulary

Output:
['and',
 'applications',
 'are',
 'artificial',
 'changing',
 'cloud',
 'computers',
 'computing',
 'cybersecurity',
 'data',
 'healthcare',
 'intelligence',
 'is',
 'learning',
 'machine',
 'modern',
 'networks',
 'protects',
 'related',
 'supports',
 'systems',
 'use']
7.2 Step 2: Create the Matrix
term_document_matrix = []
for term in vocabulary:
    row = []
    for doc_id, text in documents.items():
        tokens = tokenize(text)
        row.append(1 if term in tokens else 0)
    term_document_matrix.append(row)

# Display in a readable form
for term, row in zip(vocabulary, term_document_matrix):
    print(f"{term:15} {row}")

Output:
and             [0, 1, 1, 1, 0]
applications    [0, 0, 0, 0, 1]
are             [0, 1, 0, 0, 0]
artificial      [1, 1, 0, 0, 0]
changing        [1, 0, 0, 0, 0]
cloud           [0, 0, 0, 0, 1]
computers       [0, 0, 1, 0, 0]
computing       [0, 0, 0, 0, 1]
cybersecurity   [0, 0, 1, 0, 0]
data            [0, 0, 0, 1, 0]
healthcare      [1, 0, 0, 1, 0]
intelligence    [1, 1, 0, 0, 0]
is              [1, 0, 0, 0, 0]
learning        [0, 1, 0, 1, 0]
machine         [0, 1, 0, 1, 0]
modern          [0, 0, 0, 0, 1]
networks        [0, 0, 1, 0, 0]
protects        [0, 0, 1, 0, 0]
related         [0, 1, 0, 0, 0]
supports        [0, 0, 0, 0, 1]
systems         [0, 0, 0, 1, 0]
use             [0, 0, 0, 1, 0]
7.3 Interpretation
Each row records which documents contain one term, so the computer can answer a query by reading the rows for the query terms instead of rereading every document.
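As a minimal sketch of how the matrix rows can answer a query, two rows can be combined position by position. The helpers row_for and and_rows are names assumed here for illustration, and the collection and tokenizer are repeated so the block runs on its own.

```python
import re

documents = {
    1: "Artificial intelligence is changing healthcare",
    2: "Machine learning and artificial intelligence are related",
    3: "Cybersecurity protects computers and networks",
    4: "Healthcare systems use data and machine learning",
    5: "Cloud computing supports modern applications",
}

def tokenize(text):
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return text.split()

doc_ids = list(documents)
# Tokenize each document once and keep the terms as a set for fast membership tests.
doc_tokens = {doc_id: set(tokenize(text)) for doc_id, text in documents.items()}

def row_for(term):
    """Binary matrix row for one term (hypothetical helper)."""
    return [1 if term in doc_tokens[doc_id] else 0 for doc_id in doc_ids]

def and_rows(row_a, row_b):
    """Element-wise AND of two binary rows (hypothetical helper)."""
    return [a & b for a, b in zip(row_a, row_b)]

combined = and_rows(row_for("machine"), row_for("learning"))
matching = [doc_id for doc_id, bit in zip(doc_ids, combined) if bit]

print(combined)  # [0, 1, 0, 1, 0]
print(matching)  # [2, 4]
```

A 1 survives in the combined row only where both terms are present, which is why documents 2 and 4 are returned.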
8 Part D: Create an Inverted Index
An inverted index maps each term to the documents that contain it.
Example:
machine -> [2, 4]
healthcare -> [1, 4]
8.1 Code
inverted_index = defaultdict(list)
for doc_id, text in documents.items():
    tokens = set(tokenize(text))
    for term in tokens:
        inverted_index[term].append(doc_id)

for term in sorted(inverted_index):
    print(f"{term:15} -> {inverted_index[term]}")

Output:
and             -> [2, 3, 4]
applications    -> [5]
are             -> [2]
artificial      -> [1, 2]
changing        -> [1]
cloud           -> [5]
computers       -> [3]
computing       -> [5]
cybersecurity   -> [3]
data            -> [4]
healthcare      -> [1, 4]
intelligence    -> [1, 2]
is              -> [1]
learning        -> [2, 4]
machine         -> [2, 4]
modern          -> [5]
networks        -> [3]
protects        -> [3]
related         -> [2]
supports        -> [5]
systems         -> [4]
use             -> [4]
8.2 Why This Matters
Without an inverted index, a search engine would need to scan every document one by one. With an inverted index, the system jumps directly to the documents containing the query term.
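The contrast can be sketched on the lab collection as follows. The function names linear_search and indexed_search are assumptions made for this illustration. Both return the same documents, but the first tokenizes every document for each query while the second performs a single dictionary lookup.

```python
import re
from collections import defaultdict

documents = {
    1: "Artificial intelligence is changing healthcare",
    2: "Machine learning and artificial intelligence are related",
    3: "Cybersecurity protects computers and networks",
    4: "Healthcare systems use data and machine learning",
    5: "Cloud computing supports modern applications",
}

def tokenize(text):
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return text.split()

def linear_search(term):
    """Scan every document for the term; the work grows with the collection size."""
    return [doc_id for doc_id, text in documents.items() if term in tokenize(text)]

# Build the index once, before any query arrives.
inverted_index = defaultdict(list)
for doc_id, text in documents.items():
    for term in set(tokenize(text)):
        inverted_index[term].append(doc_id)

def indexed_search(term):
    """Look the term up directly; no document text is read at query time."""
    return inverted_index.get(term, [])

print(linear_search("healthcare"))   # [1, 4]
print(indexed_search("healthcare"))  # [1, 4]
```

With five documents the difference is not noticeable, but the linear scan repeats its work for every query, whereas the cost of building the index is paid only once.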
9 Part E: Boolean Retrieval
Boolean retrieval uses logical operators.
- AND means both terms must appear.
- OR means at least one term must appear.
- NOT means the term must not appear.
9.1 Boolean Search Functions
def boolean_and(term1, term2):
    return sorted(set(inverted_index.get(term1, [])) & set(inverted_index.get(term2, [])))

def boolean_or(term1, term2):
    return sorted(set(inverted_index.get(term1, [])) | set(inverted_index.get(term2, [])))

def boolean_not(term, all_doc_ids):
    return sorted(set(all_doc_ids) - set(inverted_index.get(term, [])))
9.2 Sample Queries
all_doc_ids = list(documents.keys())

print("machine AND learning ->", boolean_and("machine", "learning"))
print("healthcare OR cybersecurity ->", boolean_or("healthcare", "cybersecurity"))
print("NOT cloud ->", boolean_not("cloud", all_doc_ids))

Output:
machine AND learning -> [2, 4]
healthcare OR cybersecurity -> [1, 3, 4]
NOT cloud -> [1, 2, 3, 4]
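The three operators can also be combined in a single query by working on sets of document IDs. The following is a minimal sketch: the helper postings is a name assumed here, not part of the lab code, and the index is rebuilt so the block runs on its own.

```python
import re
from collections import defaultdict

documents = {
    1: "Artificial intelligence is changing healthcare",
    2: "Machine learning and artificial intelligence are related",
    3: "Cybersecurity protects computers and networks",
    4: "Healthcare systems use data and machine learning",
    5: "Cloud computing supports modern applications",
}

def tokenize(text):
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return text.split()

inverted_index = defaultdict(list)
for doc_id, text in documents.items():
    for term in set(tokenize(text)):
        inverted_index[term].append(doc_id)

def postings(term):
    """Set of document IDs containing the term (hypothetical helper)."""
    return set(inverted_index.get(term, []))

all_docs = set(documents)

# healthcare AND NOT machine
result_1 = sorted(postings("healthcare") & (all_docs - postings("machine")))

# (artificial OR cloud) AND NOT learning
result_2 = sorted((postings("artificial") | postings("cloud")) - postings("learning"))

print(result_1)  # [1]
print(result_2)  # [1, 5]
```

Set intersection, union, and difference correspond directly to AND, OR, and AND NOT, so longer expressions can be evaluated by nesting these operations.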
10 Part F: Build a Simple Query Processor
Now we combine the ideas into a query processor that accepts simple Boolean queries.
10.1 Code
def search(query):
    query = query.lower().strip()

    if " and " in query:
        parts = query.split(" and ")
        if len(parts) == 2:
            return boolean_and(parts[0].strip(), parts[1].strip())

    if " or " in query:
        parts = query.split(" or ")
        if len(parts) == 2:
            return boolean_or(parts[0].strip(), parts[1].strip())

    if query.startswith("not "):
        # Remove only the leading operator, not every "not " inside the query
        term = query[len("not "):].strip()
        return boolean_not(term, all_doc_ids)

    # No operator found: treat the whole query as a single term
    return sorted(inverted_index.get(query, []))

print(search("machine and learning"))
print(search("healthcare or cloud"))
print(search("not cybersecurity"))
print(search("artificial"))

Output:
[2, 4]
[1, 4, 5]
[1, 2, 4, 5]
[1, 2]
11 Part G: Read the Results
For each query below, write down the returned document IDs and explain the result in plain English.
- artificial and intelligence
- healthcare or cloud
- not machine
- cybersecurity
11.1 Student Exercise
For each search result, answer:
- Which documents matched?
- Why did they match?
- Which terms were responsible for the match?
12 Part H: Real-World Scenario
Imagine a university digital library.
A student searches for:
machine learning and healthcare
The search system should return documents such as:
- research papers on medical AI,
- lecture notes on data science in medicine,
- project reports on hospital prediction systems.
In a real search engine, this process happens on millions or billions of documents.
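On the lab's five-document collection, one simple reading of this query is that every word other than the operator must appear in the document. The sketch below follows that assumption; search_all_terms is a hypothetical function introduced here, and real search engines interpret such queries in more sophisticated ways.

```python
import re
from collections import defaultdict

documents = {
    1: "Artificial intelligence is changing healthcare",
    2: "Machine learning and artificial intelligence are related",
    3: "Cybersecurity protects computers and networks",
    4: "Healthcare systems use data and machine learning",
    5: "Cloud computing supports modern applications",
}

def tokenize(text):
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return text.split()

inverted_index = defaultdict(list)
for doc_id, text in documents.items():
    for term in set(tokenize(text)):
        inverted_index[term].append(doc_id)

def search_all_terms(query):
    """Return documents containing every query word (assumed interpretation)."""
    # Treat "and" as an operator here, so it is not looked up as a term.
    terms = [term for term in tokenize(query) if term != "and"]
    if not terms:
        return []
    result = set(inverted_index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(inverted_index.get(term, []))
    return sorted(result)

print(search_all_terms("machine learning and healthcare"))  # [4]
```

Only document 4 contains machine, learning, and healthcare together, so it is the single match.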
13 Part I: Reflection Questions
Answer the following questions in your lab notebook:
- Why is an inverted index faster than scanning all documents?
- What is the advantage of Boolean retrieval?
- Why is Boolean retrieval sometimes too strict for real users?
- How does tokenization improve search?
- Why do search engines need ranking beyond Boolean matching?
14 Challenge Exercise
Extend the search system so that it supports phrase-style matching such as:
"machine learning"
Hint: you may need to store word positions in each document.
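One possible starting point for the hint is a positional index, which stores the positions of each term in each document rather than only the document IDs. The name positional_index and its nested-dictionary layout are assumptions made for this sketch.

```python
import re
from collections import defaultdict

documents = {
    1: "Artificial intelligence is changing healthcare",
    2: "Machine learning and artificial intelligence are related",
    3: "Cybersecurity protects computers and networks",
    4: "Healthcare systems use data and machine learning",
    5: "Cloud computing supports modern applications",
}

def tokenize(text):
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return text.split()

# term -> {doc_id -> [positions of the term in that document]}
positional_index = defaultdict(lambda: defaultdict(list))
for doc_id, text in documents.items():
    for position, term in enumerate(tokenize(text)):
        positional_index[term][doc_id].append(position)

print(dict(positional_index["machine"]))   # {2: [0], 4: [5]}
print(dict(positional_index["learning"]))  # {2: [1], 4: [6]}
```

In documents 2 and 4 the position of learning is exactly one more than the position of machine, which is the pattern a phrase match needs to detect.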
15 Lab Assignment
Submit the following:
- The Python code used in the lab.
- The outputs of the sample queries.
- Answers to the reflection questions.
- A short paragraph explaining one real-life use of Boolean retrieval.
16 Summary
In this lab, students learned the foundation of search systems:
- documents are processed into terms,
- terms are organized into indexes,
- Boolean queries are evaluated on those indexes,
- the system returns matching documents quickly.
This lab is the foundation for later topics such as ranked retrieval, tf-idf, BM25, evaluation metrics, web crawling, clustering, and machine learning-based search.