COME5106 - INFORMATION RETRIEVAL SYSTEM (IRS)

Author

Affiliation

Engr. Mbiarrambang Alain

PaveWay Academy

Published

May 15, 2026

Complete Semester Notes for Computer Science & Computer Engineering Students

COURSE INFORMATION

Course Title

Information Retrieval System (IRS)

Credit Value

6 Credits

Duration

15–16 Weeks

Target Students

Computer Science Students
Computer Engineering Students
Software Engineering Students
Data Science Students
AI and Machine Learning Students

Course Description

Information Retrieval (IR) focuses on how computers organize, index, search, retrieve, rank, and recommend information from massive collections of data such as websites, digital libraries, emails, enterprise documents, research databases, and social media platforms.

This course introduces students to:

Search engines
Indexing algorithms
Text processing
Ranking systems
Web search
Machine learning for retrieval
Clustering and classification
Information filtering
Web crawling
Relevance ranking
Natural language processing in search systems

The course combines:

Theory
Algorithms
Mathematical models
Real-world industrial applications
Practical labs
Mini-projects

COURSE LEARNING OUTCOMES

At the end of this course, students should be able to:

Explain the principles of Information Retrieval Systems.
Design and implement simple search engines.
Build inverted indexes.
Process Boolean and ranked queries.
Apply text preprocessing techniques.
Compute tf-idf and cosine similarity.
Evaluate retrieval performance using precision and recall.
Implement document classification and clustering.
Explain web crawling and web indexing.
Understand PageRank and link analysis.
Build mini search systems for real-world applications.
Apply machine learning techniques in retrieval systems.

COURSE OUTLINE

Week	Topic
1	Introduction to Information Retrieval
2	Boolean Retrieval Model
3	Inverted Index and Postings Lists
4	Text Processing and Tokenization
5	Dictionaries and Tolerant Retrieval
6	Index Construction
7	Index Compression
8	Vector Space Model and tf-idf
9	Ranked Retrieval and Scoring
10	Evaluation of IR Systems
11	Relevance Feedback and Query Expansion
12	Probabilistic Retrieval Models
13	Text Classification
14	Clustering and Latent Semantic Indexing
15	Web Search Engines and Crawling
16	Link Analysis, Revision, and Project Presentation

CHAPTER 1 — INTRODUCTION TO INFORMATION RETRIEVAL

1.1 What is Information Retrieval?

Textbook Definition

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Information Retrieval (IR) is the process of obtaining relevant information from large collections of data.

Simple Definition

Information Retrieval helps users find needed information quickly from huge datasets.

Examples

Google Search
YouTube Search
Spotify music recommendation
Netflix movie recommendation
Amazon product search
Gmail email search
Library catalog systems
Digital medical records search

1.2 Difference Between IR and Database Systems

Information Retrieval	Database Systems
Deals with unstructured data	Deals with structured data
Results may be approximate	Results are exact
Uses ranking	Uses exact matching
Example: Google	Example: Banking System

Example

Database Query:

SELECT * FROM students WHERE matricule='FE21A123';

IR Query:

best universities in africa for computer science

1.3 Components of an Information Retrieval System

Major Components

Document Collection
Text Processing Unit
Indexing Engine
Query Processor
Ranking Module
Retrieval Module
User Interface

Real-World Scenario

When a user searches:

cheap smartphones under $200

The system:

Processes the query
Searches indexes
Ranks results
Displays relevant products

1.4 Applications of Information Retrieval

Industrial Applications

Search Engines

Google
Bing
DuckDuckGo

E-commerce

Amazon
Alibaba
Jumia

Healthcare

Patient record retrieval
Disease diagnosis systems

Cybersecurity

Threat intelligence search
Log analysis

Education

Digital libraries
Research paper retrieval

1.5 Challenges in Information Retrieval

Major Challenges

Ambiguity

Example:

Java

Could mean:

Programming language
Island in Indonesia
Coffee

Synonyms

Different words with same meaning.

Example:

Car = Automobile

Large Data Volumes

Search engines process billions of pages.

Misspellings

Example:

Micorsoft

instead of:

Microsoft

Relevance Ranking

Which result should appear first?

LAB 1

Objective

Understand basic Information Retrieval concepts.

Practical Activity

Task 1

Use Google search and analyze:

Ranking order
Autocomplete suggestions
Search snippets
Related searches

Task 2

Compare:

Google
Bing
DuckDuckGo

Search for:

Artificial Intelligence in Healthcare

Document differences.

CHAPTER 2 — BOOLEAN RETRIEVAL MODEL

2.1 Boolean Retrieval

Boolean Retrieval uses logical operators.

Text Definition

The Boolean retrieval model is a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words.

Operators

Operator	Meaning
AND	Both terms must appear
OR	Either term may appear
NOT	Exclude a term

2.2 Examples

AND Query

machine AND learning

Returns documents containing both words.

OR Query

doctor OR physician

NOT Query

python NOT snake

2.3 Advantages of Boolean Retrieval

Simple
Fast
Efficient for exact search
Good for legal and medical systems

2.4 Limitations

No ranking
Difficult for ordinary users
Too many or too few results

2.5 Term-Document Matrix

A binary matrix showing whether terms exist in documents.

Term	D1	D2	D3
AI	1	1	0
Cloud	0	1	1
Security	1	0	1

1 = Present 0 = Absent

A term-document incidence matrix. Matrix element (t, d) is 1 if the document in column d contains the word in row t, and is 0 otherwise.

Note

To answer a query like AI AND Cloud AND NOT Security, we take the vectors for AI, Cloud and Security, complement the last, and then do a bitwise AND:

110 AND 011 AND 010 = 010

The answer for this query is thus document D2.

Step-by-Step Breakdown for Your Class

The 3-bit vectors represent the presence (1) or absence (0) of terms across 3 documents tracked in this exact sequence:

1. Retrieve the Vector Profiles

AI: 110 (Present in D1 and D2)
Cloud: 011 (Present in D2 and D3)
Security: 101 (Present in D1 and D3)

2. Execute the NOT Operator (Bitwise Complement)

Complementing Security’s profile flips every bit (\(0 \rightarrow 1\) and \(1 \rightarrow 0\)): \[\text{NOT } (101) = 010\]

3. Execute the AND Operations (Bitwise Intersection)

Align the bits vertically and apply standard boolean multiplication:

110 (AI) 011 (Cloud) 010 (NOT Security) —— 010 (Resulting Bitstring)

Deciphering the Final Output Vector (010)

Looking at where the active 1 bits land in our predefined matrix sequence:

Bit 2 is 1: Maps to D2.

2.6 Linear Search vs Indexed Search

Linear Search

Checks all documents one by one.

Indexed Search

Uses indexes for faster retrieval.

LAB 2

Objective

Implement a simple Boolean Retrieval system.

Python Example


documents = {
    1: "machine learning and artificial intelligence",
    2: "cloud computing and cybersecurity",
    3: "artificial intelligence in healthcare"
}

query = ["artificial", "intelligence"]

for doc_id, text in documents.items():
    if all(term in text for term in query):
        print(doc_id, text)

Expected Output

Documents containing all query terms.

CHAPTER 3 — INVERTED INDEX AND POSTINGS LISTS

3.1 Inverted Index

An inverted index maps terms to documents.

Example

Term	Documents
AI	D1, D3
Cloud	D2
Security	D2, D3

3.2 Why Inverted Indexes Matter

Without indexes:

Search is slow
Systems become inefficient

With indexes:

Retrieval becomes extremely fast

Google uses massive distributed indexes.

3.3 Postings Lists

A postings list stores document IDs for each term.

Example:

AI → [1, 3, 7, 9]

3.4 Building an Inverted Index

Steps

Collect documents
Tokenize text
Normalize terms
Remove stop words
Build postings lists

3.5 Real-World Application

E-commerce Search

When users search:

wireless headphones

The search engine retrieves product IDs from postings lists.

LAB 3

Objective

Build an inverted index.


docs = {
    1: "AI and machine learning",
    2: "Cloud computing and AI",
    3: "Cybersecurity and cloud systems"
}

index = {}

for doc_id, text in docs.items():
    for word in text.lower().split():
        if word not in index:
            index[word] = []
        index[word].append(doc_id)

print(index)

CHAPTER 4 — TEXT PROCESSING AND TOKENIZATION

4.1 Tokenization

Tokenization splits text into smaller units called tokens.

Example:

"Machine Learning is Powerful"

Tokens:

[Machine, Learning, is, Powerful]

4.2 Stop Words

Common words often removed.

Examples:

4.3 Normalization

Converting text into standard format.

Examples

Original	Normalized
COMPUTER	computer
Running	run

4.4 Stemming

Reduces words to root forms.

Example:

playing → play

Popular Stemmer

Porter Stemmer

4.5 Lemmatization

Uses dictionary meaning to obtain base word.

Example:

better → good

4.6 Challenges in Text Processing

Multilingual Data

French
English
Chinese
Arabic

Emojis and Slang

Social media introduces noisy text.

LAB 4

Objective

Perform text preprocessing.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "Machine Learning is changing the world"

words = word_tokenize(text.lower())
stop_words = set(stopwords.words('english'))

filtered = [w for w in words if w not in stop_words]

ps = PorterStemmer()
stemmed = [ps.stem(w) for w in filtered]

print(stemmed)

CHAPTER 5 — DICTIONARIES AND TOLERANT RETRIEVAL

5.1 Dictionary Structures

A dictionary stores all vocabulary terms.

Structures Used

Hash Tables
B-Trees
Tries

5.2 Wildcard Queries

Example:

comput*

Matches:

computer
computing
computation

5.3 Spelling Correction

Example:

Gooogle

Corrected to:

Google

5.4 Edit Distance

Measures similarity between strings.

Example

Distance between:

cat → cut

Distance = 1

5.5 Phonetic Correction

Words sounding alike.

Example:

nite → night

LAB 5

Objective

Implement spelling correction.

from textblob import TextBlob

word = TextBlob("googel")
print(word.correct())

CHAPTER 6 — INDEX CONSTRUCTION

6.1 What is Index Construction?

Building searchable structures from documents.

6.2 Steps in Index Construction

Parse documents
Extract tokens
Normalize text
Generate term IDs
Build postings lists
Store indexes

6.3 Distributed Indexing

Large systems distribute indexing.

Technologies

Hadoop
Spark
MapReduce

6.4 Dynamic Indexing

Allows updates while search engine runs.

Used by:

Google
Twitter/X
Facebook

6.5 Enterprise Scenario

A university repository indexes:

Theses
PDFs
Lecture notes
Journals

LAB 6

Objective

Build a scalable indexing pipeline.

Activity

Use:

Python
Elasticsearch

Index 100 sample documents.

CHAPTER 7 — INDEX COMPRESSION

7.1 Why Compression Matters

Search engines store huge indexes.

Compression:

Saves storage
Improves speed
Reduces memory usage

7.2 Zipf’s Law

Few words appear frequently.

Example:

7.3 Heaps’ Law

Vocabulary grows with collection size.

7.4 Compression Techniques

Variable Byte Encoding

Gamma Encoding

Front Coding

7.5 Industrial Relevance

Google compresses indexes heavily to support billions of searches.

LAB 7

Objective

Compare compressed vs uncompressed indexes.

Tasks

Measure storage size
Measure retrieval speed

CHAPTER 8 — VECTOR SPACE MODEL

8.1 Ranked Retrieval

Modern systems rank results by relevance.

8.2 Vector Space Model

Documents and queries are represented as vectors.

8.3 Term Frequency (TF)

Measures how often a term appears.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“tf_{t,d}=”}}

8.4 Inverse Document Frequency (IDF)

Measures importance of terms.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“idf_t=()”}}

8.5 TF-IDF

Combines TF and IDF.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“tfidf_{t,d}=tf_{t,d}idf_t”}}

8.6 Cosine Similarity

Measures similarity between vectors.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“()=”}}

8.7 Real-World Application

Resume Matching

Recruitment systems compare resumes against job descriptions.

Recommendation Systems

Netflix compares user preferences.

LAB 8

Objective

Compute tf-idf manually and programmatically.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "machine learning is powerful",
    "artificial intelligence and machine learning",
    "cloud computing systems"
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(X.toarray())

CHAPTER 9 — RANKED RETRIEVAL AND SCORING

9.1 Ranking

Search engines rank documents by relevance score.

9.2 Scoring Functions

Factors

Term frequency
Document length
Link popularity
User behavior

9.3 Top-K Retrieval

Retrieve only best results.

Example: Top 10 Google results.

9.4 Query Processing Pipeline

Query Parsing
Tokenization
Retrieval
Ranking
Snippet Generation

9.5 Snippet Generation

Example: Google highlights matching words.

LAB 9

Objective

Implement document ranking.

Activity

Use cosine similarity to rank documents.

CHAPTER 10 — EVALUATION OF INFORMATION RETRIEVAL SYSTEMS

10.1 Why Evaluation Matters

We need to measure retrieval quality.

10.2 Precision

Fraction of retrieved documents that are relevant.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“Precision=”}}

10.3 Recall

Fraction of relevant documents retrieved.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“Recall=”}}

10.4 F1 Score

genui{“math_block_widget_always_prefetch_v2”:{“content”:“F1=”}}

10.5 User Satisfaction Metrics

Click-through rate
Dwell time
Bounce rate

10.6 Industrial Scenario

E-commerce companies evaluate:

Product search accuracy
Conversion rates
Customer satisfaction

LAB 10

Objective

Evaluate retrieval effectiveness.

Task

Compute:

Precision
Recall
F1 Score

for a sample dataset.

CHAPTER 11 — RELEVANCE FEEDBACK AND QUERY EXPANSION

11.1 Relevance Feedback

Users indicate useful results.

System improves retrieval.

11.2 Query Expansion

Expands search terms.

Example:

car → automobile, vehicle

11.3 Pseudo-Relevance Feedback

Assumes top results are relevant.

11.4 Real-World Example

Google suggests:

Related searches
Synonyms
Alternative spellings

LAB 11

Objective

Implement query expansion.

Task

Use WordNet synonyms for expansion.

CHAPTER 12 — PROBABILISTIC INFORMATION RETRIEVAL

12.1 Probability Ranking Principle

Documents ranked by probability of relevance.

12.2 Binary Independence Model

Assumes terms occur independently.

12.3 BM25 Ranking

Industry-standard ranking algorithm.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“BM25(q,d)=_{tq}idf(t)“}}

12.4 Applications

Search engines
Legal retrieval
Scientific document retrieval

LAB 12

Objective

Implement BM25 ranking.

Suggested Tools

Python
rank_bm25 library

CHAPTER 13 — TEXT CLASSIFICATION

13.1 Text Classification

Assign documents to categories.

13.2 Examples

Spam Detection

Spam
Not Spam

News Categorization

Sports
Politics
Entertainment

13.3 Naive Bayes Classifier

Popular classification algorithm.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“P(c|d)=”}}

13.4 Feature Selection

Important words selected for classification.

13.5 Industrial Applications

Email filtering
Sentiment analysis
Fake news detection
Content moderation

LAB 13

Objective

Build a spam classifier.

from sklearn.naive_bayes import MultinomialNB

Train and evaluate classifier.

CHAPTER 14 — CLUSTERING AND LATENT SEMANTIC INDEXING

14.1 Clustering

Grouping similar documents.

14.2 K-Means Clustering

Popular clustering algorithm.

14.3 Applications

Search result grouping
Customer segmentation
Topic modeling

14.4 Hierarchical Clustering

Creates cluster trees.

14.5 Latent Semantic Indexing (LSI)

Finds hidden semantic relationships.

14.6 Real-World Scenario

Netflix groups similar movies.

Spotify groups similar songs.

LAB 14

Objective

Cluster documents using K-Means.

from sklearn.cluster import KMeans

CHAPTER 15 — WEB SEARCH ENGINES AND WEB CRAWLING

15.1 Search Engine Architecture

Components

Web crawler
Indexer
Query processor
Ranking engine

15.2 Web Crawling

Crawler downloads web pages.

15.3 URL Frontier

List of URLs waiting to be crawled.

15.4 Challenges

Duplicate pages
Dead links
Dynamic pages
Spam pages

15.5 Distributed Crawling

Used for web-scale search engines.

15.6 Industrial Scenario

Googlebot continuously crawls billions of pages.

LAB 15

Objective

Build a simple web crawler.

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))

CHAPTER 16 — LINK ANALYSIS AND PAGERANK

16.1 The Web as a Graph

Web pages connected through hyperlinks.

16.2 PageRank

Measures page importance.

Developed by Google founders.

16.3 Intuition Behind PageRank

A page is important if important pages link to it.

16.4 Applications

Search ranking
Influence analysis
Social network analysis

16.5 HITS Algorithm

Uses:

Hubs
Authorities

LAB 16

Objective

Simulate PageRank.

Task

Calculate PageRank for a small web graph.

MODERN TRENDS IN INFORMATION RETRIEVAL

Artificial Intelligence in Search

AI improves:

Recommendations
Semantic search
Conversational search

Voice Search

Examples:

Siri
Alexa
Google Assistant

Semantic Search

Understands meaning instead of keywords.

Generative AI Search

Examples:

ChatGPT
Gemini
Perplexity AI

Vector Databases

Modern AI retrieval uses:

Pinecone
Weaviate
FAISS
ChromaDB

MINI PROJECTS

Project 1

Build a University Search Engine.

Features:

Search lecture notes
Search PDFs
Ranking system
Autocomplete

Project 2

Build a Job Recommendation System.

Project 3

Build a News Classification System.

Project 4

Build a Movie Recommendation System.

SAMPLE EXAM QUESTIONS

Theory Questions

Define Information Retrieval.
Explain inverted indexes.
Differentiate precision and recall.
Explain tf-idf.
Discuss PageRank.
Explain text preprocessing.
Explain Boolean retrieval.
Describe BM25.
Explain clustering.
Explain Naive Bayes classification.

Practical Questions

Build an inverted index.
Compute tf-idf.
Calculate cosine similarity.
Build a simple crawler.
Implement document ranking.

CAREER OPPORTUNITIES

Students can become:

Search Engineer
NLP Engineer
Machine Learning Engineer
Data Scientist
AI Engineer
Recommendation Systems Engineer
Search Relevance Engineer
Big Data Engineer
Cybersecurity Analyst
Research Scientist

RECOMMENDED TOOLS

Programming Languages

Python
Java
Scala

Libraries

NLTK
Scikit-learn
SpaCy
Gensim
TensorFlow
PyTorch

Search Platforms

Elasticsearch
Apache Solr
OpenSearch

Big Data Platforms

Hadoop
Spark

FINAL ADVICE TO STUDENTS

To succeed in Information Retrieval Systems:

Master data structures and algorithms.
Learn linear algebra and probability.
Practice Python programming.
Build real projects.
Understand search engine architecture.
Learn machine learning fundamentals.
Practice evaluating retrieval systems.
Study modern AI retrieval systems.

Information Retrieval is one of the most important foundations of:

Artificial Intelligence
Search Engines
Recommender Systems
NLP
Data Science
Generative AI

Note

Modern companies such as Google, Amazon, Netflix, TikTok, Meta, Microsoft, OpenAI, and Spotify depend heavily on Information Retrieval technologies.

Complete Semester Notes for Computer Science & Computer Engineering Students

COURSE INFORMATION

Course Title

Credit Value

Duration

Target Students

Course Description

COURSE LEARNING OUTCOMES

COURSE OUTLINE

CHAPTER 1 — INTRODUCTION TO INFORMATION RETRIEVAL

1.1 What is Information Retrieval?

Simple Definition

Examples

1.2 Difference Between IR and Database Systems

Example

1.3 Components of an Information Retrieval System

Major Components

Real-World Scenario

1.4 Applications of Information Retrieval

Industrial Applications

Search Engines

E-commerce

Healthcare

Cybersecurity

Education

Social Media

1.5 Challenges in Information Retrieval

Major Challenges

Ambiguity

Synonyms

Large Data Volumes

Misspellings

Relevance Ranking

LAB 1

Objective

Practical Activity

Task 1

Task 2

CHAPTER 2 — BOOLEAN RETRIEVAL MODEL

2.1 Boolean Retrieval

Operators

2.2 Examples

AND Query

OR Query

NOT Query

2.3 Advantages of Boolean Retrieval

2.4 Limitations

2.5 Term-Document Matrix

Step-by-Step Breakdown for Your Class

1. Retrieve the Vector Profiles

2. Execute the NOT Operator (Bitwise Complement)

3. Execute the AND Operations (Bitwise Intersection)

Deciphering the Final Output Vector (010)

2.6 Linear Search vs Indexed Search

Linear Search

Indexed Search

LAB 2

Objective

Python Example

Expected Output

CHAPTER 3 — INVERTED INDEX AND POSTINGS LISTS

3.1 Inverted Index

Example

3.2 Why Inverted Indexes Matter

3.3 Postings Lists

3.4 Building an Inverted Index

Steps

3.5 Real-World Application

E-commerce Search

LAB 3

Objective

CHAPTER 4 — TEXT PROCESSING AND TOKENIZATION

4.1 Tokenization

4.2 Stop Words

4.3 Normalization

Examples

4.4 Stemming

Popular Stemmer

4.5 Lemmatization

4.6 Challenges in Text Processing