COME5106 - INFORMATION RETRIEVAL SYSTEM (IRS)

Author
Affiliation

Engr. Mbiarrambang Alain

PaveWay Academy

Published

May 15, 2026

Complete Semester Notes for Computer Science & Computer Engineering Students

COURSE INFORMATION

Course Title

Information Retrieval System (IRS)

Credit Value

6 Credits

Duration

15–16 Weeks

Target Students

  • Computer Science Students
  • Computer Engineering Students
  • Software Engineering Students
  • Data Science Students
  • AI and Machine Learning Students

Course Description

Information Retrieval (IR) focuses on how computers organize, index, search, retrieve, rank, and recommend information from massive collections of data such as websites, digital libraries, emails, enterprise documents, research databases, and social media platforms.

This course introduces students to:

  • Search engines
  • Indexing algorithms
  • Text processing
  • Ranking systems
  • Web search
  • Machine learning for retrieval
  • Clustering and classification
  • Information filtering
  • Web crawling
  • Relevance ranking
  • Natural language processing in search systems

The course combines:

  • Theory
  • Algorithms
  • Mathematical models
  • Real-world industrial applications
  • Practical labs
  • Mini-projects

COURSE LEARNING OUTCOMES

At the end of this course, students should be able to:

  1. Explain the principles of Information Retrieval Systems.
  2. Design and implement simple search engines.
  3. Build inverted indexes.
  4. Process Boolean and ranked queries.
  5. Apply text preprocessing techniques.
  6. Compute tf-idf and cosine similarity.
  7. Evaluate retrieval performance using precision and recall.
  8. Implement document classification and clustering.
  9. Explain web crawling and web indexing.
  10. Understand PageRank and link analysis.
  11. Build mini search systems for real-world applications.
  12. Apply machine learning techniques in retrieval systems.

COURSE OUTLINE

Week Topic
1 Introduction to Information Retrieval
2 Boolean Retrieval Model
3 Inverted Index and Postings Lists
4 Text Processing and Tokenization
5 Dictionaries and Tolerant Retrieval
6 Index Construction
7 Index Compression
8 Vector Space Model and tf-idf
9 Ranked Retrieval and Scoring
10 Evaluation of IR Systems
11 Relevance Feedback and Query Expansion
12 Probabilistic Retrieval Models
13 Text Classification
14 Clustering and Latent Semantic Indexing
15 Web Search Engines and Crawling
16 Link Analysis, Revision, and Project Presentation

CHAPTER 1 — INTRODUCTION TO INFORMATION RETRIEVAL

1.1 What is Information Retrieval?

NoteTextbook Definition

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Information Retrieval (IR) is the process of obtaining relevant information from large collections of data.

Simple Definition

Information Retrieval helps users find needed information quickly from huge datasets.

Examples

  • Google Search
  • YouTube Search
  • Spotify music recommendation
  • Netflix movie recommendation
  • Amazon product search
  • Gmail email search
  • Library catalog systems
  • Digital medical records search

1.2 Difference Between IR and Database Systems

Information Retrieval Database Systems
Deals with unstructured data Deals with structured data
Results may be approximate Results are exact
Uses ranking Uses exact matching
Example: Google Example: Banking System

Example

Database Query:

SELECT * FROM students WHERE matricule='FE21A123';

IR Query:

best universities in africa for computer science

1.3 Components of an Information Retrieval System

Major Components

  1. Document Collection
  2. Text Processing Unit
  3. Indexing Engine
  4. Query Processor
  5. Ranking Module
  6. Retrieval Module
  7. User Interface

Real-World Scenario

When a user searches:

cheap smartphones under $200

The system:

  1. Processes the query
  2. Searches indexes
  3. Ranks results
  4. Displays relevant products

1.4 Applications of Information Retrieval

Industrial Applications

Search Engines

  • Google
  • Bing
  • DuckDuckGo

E-commerce

  • Amazon
  • Alibaba
  • Jumia

Healthcare

  • Patient record retrieval
  • Disease diagnosis systems

Cybersecurity

  • Threat intelligence search
  • Log analysis

Education

  • Digital libraries
  • Research paper retrieval

Social Media

  • Facebook search
  • TikTok recommendation
  • Twitter/X search

1.5 Challenges in Information Retrieval

Major Challenges

Ambiguity

Example:

Java

Could mean:

  • Programming language
  • Island in Indonesia
  • Coffee

Synonyms

Different words with same meaning.

Example:

Car = Automobile

Large Data Volumes

Search engines process billions of pages.

Misspellings

Example:

Micorsoft

instead of:

Microsoft

Relevance Ranking

Which result should appear first?


LAB 1

Objective

Understand basic Information Retrieval concepts.

Practical Activity

Task 1

Use Google search and analyze:

  • Ranking order
  • Autocomplete suggestions
  • Search snippets
  • Related searches

Task 2

Compare:

  • Google
  • Bing
  • DuckDuckGo

Search for:

Artificial Intelligence in Healthcare

Document differences.


CHAPTER 2 — BOOLEAN RETRIEVAL MODEL

2.1 Boolean Retrieval

Boolean Retrieval uses logical operators.

ImportantText Definition

The Boolean retrieval model is a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words.

Operators

Operator Meaning
AND Both terms must appear
OR Either term may appear
NOT Exclude a term

2.2 Examples

AND Query

machine AND learning

Returns documents containing both words.

OR Query

doctor OR physician

NOT Query

python NOT snake

2.3 Advantages of Boolean Retrieval

  • Simple
  • Fast
  • Efficient for exact search
  • Good for legal and medical systems

2.4 Limitations

  • No ranking
  • Difficult for ordinary users
  • Too many or too few results

2.5 Term-Document Matrix

A binary matrix showing whether terms exist in documents.

Term D1 D2 D3
AI 1 1 0
Cloud 0 1 1
Security 1 0 1

1 = Present 0 = Absent

A term-document incidence matrix. Matrix element (t, d) is 1 if the document in column d contains the word in row t, and is 0 otherwise.

Note

To answer a query like AI AND Cloud AND NOT Security, we take the vectors for AI, Cloud and Security, complement the last, and then do a bitwise AND:

110 AND 011 AND 010 = 010

The answer for this query is thus document D2.

Step-by-Step Breakdown for Your Class

The 3-bit vectors represent the presence (1) or absence (0) of terms across 3 documents tracked in this exact sequence:

  1. D1
  2. D2
  3. D3

1. Retrieve the Vector Profiles

  • AI: 110 (Present in D1 and D2)
  • Cloud: 011 (Present in D2 and D3)
  • Security: 101 (Present in D1 and D3)

2. Execute the NOT Operator (Bitwise Complement)

Complementing Security’s profile flips every bit (\(0 \rightarrow 1\) and \(1 \rightarrow 0\)): \[\text{NOT } (101) = 010\]

3. Execute the AND Operations (Bitwise Intersection)

Align the bits vertically and apply standard boolean multiplication:

110 (AI) 011 (Cloud) 010 (NOT Security) —— 010 (Resulting Bitstring)

Deciphering the Final Output Vector (010)

Looking at where the active 1 bits land in our predefined matrix sequence:

  • Bit 2 is 1: Maps to D2.

LAB 2

Objective

Implement a simple Boolean Retrieval system.

Python Example


documents = {
    1: "machine learning and artificial intelligence",
    2: "cloud computing and cybersecurity",
    3: "artificial intelligence in healthcare"
}

query = ["artificial", "intelligence"]

for doc_id, text in documents.items():
    if all(term in text for term in query):
        print(doc_id, text)

Expected Output

Documents containing all query terms.


CHAPTER 3 — INVERTED INDEX AND POSTINGS LISTS

3.1 Inverted Index

An inverted index maps terms to documents.

Example

Term Documents
AI D1, D3
Cloud D2
Security D2, D3

3.2 Why Inverted Indexes Matter

Without indexes:

  • Search is slow
  • Systems become inefficient

With indexes:

  • Retrieval becomes extremely fast

Google uses massive distributed indexes.


3.3 Postings Lists

A postings list stores document IDs for each term.

Example:

AI → [1, 3, 7, 9]

3.4 Building an Inverted Index

Steps

  1. Collect documents
  2. Tokenize text
  3. Normalize terms
  4. Remove stop words
  5. Build postings lists

3.5 Real-World Application

LAB 3

Objective

Build an inverted index.


docs = {
    1: "AI and machine learning",
    2: "Cloud computing and AI",
    3: "Cybersecurity and cloud systems"
}

index = {}

for doc_id, text in docs.items():
    for word in text.lower().split():
        if word not in index:
            index[word] = []
        index[word].append(doc_id)

print(index)

CHAPTER 4 — TEXT PROCESSING AND TOKENIZATION

4.1 Tokenization

Tokenization splits text into smaller units called tokens.

Example:

"Machine Learning is Powerful"

Tokens:

[Machine, Learning, is, Powerful]

4.2 Stop Words

Common words often removed.

Examples:

  • the
  • is
  • and
  • of

4.3 Normalization

Converting text into standard format.

Examples

Original Normalized
COMPUTER computer
Running run

4.4 Stemming

Reduces words to root forms.

Example:

playing → play

4.5 Lemmatization

Uses dictionary meaning to obtain base word.

Example:

better → good

4.6 Challenges in Text Processing

Multilingual Data

  • French
  • English
  • Chinese
  • Arabic

Emojis and Slang

Social media introduces noisy text.


LAB 4

Objective

Perform text preprocessing.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "Machine Learning is changing the world"

words = word_tokenize(text.lower())
stop_words = set(stopwords.words('english'))

filtered = [w for w in words if w not in stop_words]

ps = PorterStemmer()
stemmed = [ps.stem(w) for w in filtered]

print(stemmed)

CHAPTER 5 — DICTIONARIES AND TOLERANT RETRIEVAL

5.1 Dictionary Structures

A dictionary stores all vocabulary terms.

Structures Used

  • Hash Tables
  • B-Trees
  • Tries

5.2 Wildcard Queries

Example:

comput*

Matches:

  • computer
  • computing
  • computation

5.3 Spelling Correction

Example:

Gooogle

Corrected to:

Google

5.4 Edit Distance

Measures similarity between strings.

Example

Distance between:

cat → cut

Distance = 1


5.5 Phonetic Correction

Words sounding alike.

Example:

nite → night

LAB 5

Objective

Implement spelling correction.

from textblob import TextBlob

word = TextBlob("googel")
print(word.correct())

CHAPTER 6 — INDEX CONSTRUCTION

6.1 What is Index Construction?

Building searchable structures from documents.


6.2 Steps in Index Construction

  1. Parse documents
  2. Extract tokens
  3. Normalize text
  4. Generate term IDs
  5. Build postings lists
  6. Store indexes

6.3 Distributed Indexing

Large systems distribute indexing.

Technologies

  • Hadoop
  • Spark
  • MapReduce

6.4 Dynamic Indexing

Allows updates while search engine runs.

Used by:

  • Google
  • Twitter/X
  • Facebook

6.5 Enterprise Scenario

A university repository indexes:

  • Theses
  • PDFs
  • Lecture notes
  • Journals

LAB 6

Objective

Build a scalable indexing pipeline.

Activity

Use:

  • Python
  • Elasticsearch

Index 100 sample documents.


CHAPTER 7 — INDEX COMPRESSION

7.1 Why Compression Matters

Search engines store huge indexes.

Compression:

  • Saves storage
  • Improves speed
  • Reduces memory usage

7.2 Zipf’s Law

Few words appear frequently.

Example:

  • the
  • and
  • is

7.3 Heaps’ Law

Vocabulary grows with collection size.


7.4 Compression Techniques

Variable Byte Encoding

Gamma Encoding

Front Coding


7.5 Industrial Relevance

Google compresses indexes heavily to support billions of searches.


LAB 7

Objective

Compare compressed vs uncompressed indexes.

Tasks

  • Measure storage size
  • Measure retrieval speed

CHAPTER 8 — VECTOR SPACE MODEL

8.1 Ranked Retrieval

Modern systems rank results by relevance.


8.2 Vector Space Model

Documents and queries are represented as vectors.


8.3 Term Frequency (TF)

Measures how often a term appears.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“tf_{t,d}=”}}


8.4 Inverse Document Frequency (IDF)

Measures importance of terms.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“idf_t=()”}}


8.5 TF-IDF

Combines TF and IDF.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“tfidf_{t,d}=tf_{t,d}idf_t”}}


8.6 Cosine Similarity

Measures similarity between vectors.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“()=”}}


8.7 Real-World Application

Resume Matching

Recruitment systems compare resumes against job descriptions.

Recommendation Systems

Netflix compares user preferences.


LAB 8

Objective

Compute tf-idf manually and programmatically.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "machine learning is powerful",
    "artificial intelligence and machine learning",
    "cloud computing systems"
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(X.toarray())

CHAPTER 9 — RANKED RETRIEVAL AND SCORING

9.1 Ranking

Search engines rank documents by relevance score.


9.2 Scoring Functions

Factors

  • Term frequency
  • Document length
  • Link popularity
  • User behavior

9.3 Top-K Retrieval

Retrieve only best results.

Example: Top 10 Google results.


9.4 Query Processing Pipeline

  1. Query Parsing
  2. Tokenization
  3. Retrieval
  4. Ranking
  5. Snippet Generation

9.5 Snippet Generation

Example: Google highlights matching words.


LAB 9

Objective

Implement document ranking.

Activity

Use cosine similarity to rank documents.


CHAPTER 10 — EVALUATION OF INFORMATION RETRIEVAL SYSTEMS

10.1 Why Evaluation Matters

We need to measure retrieval quality.


10.2 Precision

Fraction of retrieved documents that are relevant.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“Precision=”}}


10.3 Recall

Fraction of relevant documents retrieved.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“Recall=”}}


10.4 F1 Score

genui{“math_block_widget_always_prefetch_v2”:{“content”:“F1=”}}


10.5 User Satisfaction Metrics

  • Click-through rate
  • Dwell time
  • Bounce rate

10.6 Industrial Scenario

E-commerce companies evaluate:

  • Product search accuracy
  • Conversion rates
  • Customer satisfaction

LAB 10

Objective

Evaluate retrieval effectiveness.

Task

Compute:

  • Precision
  • Recall
  • F1 Score

for a sample dataset.


CHAPTER 11 — RELEVANCE FEEDBACK AND QUERY EXPANSION

11.1 Relevance Feedback

Users indicate useful results.

System improves retrieval.


11.2 Query Expansion

Expands search terms.

Example:

car → automobile, vehicle

11.3 Pseudo-Relevance Feedback

Assumes top results are relevant.


11.4 Real-World Example

Google suggests:

  • Related searches
  • Synonyms
  • Alternative spellings

LAB 11

Objective

Implement query expansion.

Task

Use WordNet synonyms for expansion.


CHAPTER 12 — PROBABILISTIC INFORMATION RETRIEVAL

12.1 Probability Ranking Principle

Documents ranked by probability of relevance.


12.2 Binary Independence Model

Assumes terms occur independently.


12.3 BM25 Ranking

Industry-standard ranking algorithm.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“BM25(q,d)=_{tq}idf(t)“}}


12.4 Applications

  • Search engines
  • Legal retrieval
  • Scientific document retrieval

LAB 12

Objective

Implement BM25 ranking.

Suggested Tools

  • Python
  • rank_bm25 library

CHAPTER 13 — TEXT CLASSIFICATION

13.1 Text Classification

Assign documents to categories.


13.2 Examples

Spam Detection

  • Spam
  • Not Spam

News Categorization

  • Sports
  • Politics
  • Entertainment

13.3 Naive Bayes Classifier

Popular classification algorithm.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“P(c|d)=”}}


13.4 Feature Selection

Important words selected for classification.


13.5 Industrial Applications

  • Email filtering
  • Sentiment analysis
  • Fake news detection
  • Content moderation

LAB 13

Objective

Build a spam classifier.

from sklearn.naive_bayes import MultinomialNB

Train and evaluate classifier.


CHAPTER 14 — CLUSTERING AND LATENT SEMANTIC INDEXING

14.1 Clustering

Grouping similar documents.


14.2 K-Means Clustering

Popular clustering algorithm.


14.3 Applications

  • Search result grouping
  • Customer segmentation
  • Topic modeling

14.4 Hierarchical Clustering

Creates cluster trees.


14.5 Latent Semantic Indexing (LSI)

Finds hidden semantic relationships.


14.6 Real-World Scenario

Netflix groups similar movies.

Spotify groups similar songs.


LAB 14

Objective

Cluster documents using K-Means.

from sklearn.cluster import KMeans

CHAPTER 15 — WEB SEARCH ENGINES AND WEB CRAWLING

15.1 Search Engine Architecture

Components

  • Web crawler
  • Indexer
  • Query processor
  • Ranking engine

15.2 Web Crawling

Crawler downloads web pages.


15.3 URL Frontier

List of URLs waiting to be crawled.


15.4 Challenges

  • Duplicate pages
  • Dead links
  • Dynamic pages
  • Spam pages

15.5 Distributed Crawling

Used for web-scale search engines.


15.6 Industrial Scenario

Googlebot continuously crawls billions of pages.


LAB 15

Objective

Build a simple web crawler.

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))

LAB 16

Objective

Simulate PageRank.

Task

Calculate PageRank for a small web graph.


MINI PROJECTS

Project 1

Build a University Search Engine.

Features:

  • Search lecture notes
  • Search PDFs
  • Ranking system
  • Autocomplete

Project 2

Build a Job Recommendation System.


Project 3

Build a News Classification System.


Project 4

Build a Movie Recommendation System.


SAMPLE EXAM QUESTIONS

Theory Questions

  1. Define Information Retrieval.
  2. Explain inverted indexes.
  3. Differentiate precision and recall.
  4. Explain tf-idf.
  5. Discuss PageRank.
  6. Explain text preprocessing.
  7. Explain Boolean retrieval.
  8. Describe BM25.
  9. Explain clustering.
  10. Explain Naive Bayes classification.

Practical Questions

  1. Build an inverted index.
  2. Compute tf-idf.
  3. Calculate cosine similarity.
  4. Build a simple crawler.
  5. Implement document ranking.

CAREER OPPORTUNITIES

Students can become:

  • Search Engineer
  • NLP Engineer
  • Machine Learning Engineer
  • Data Scientist
  • AI Engineer
  • Recommendation Systems Engineer
  • Search Relevance Engineer
  • Big Data Engineer
  • Cybersecurity Analyst
  • Research Scientist

FINAL ADVICE TO STUDENTS

To succeed in Information Retrieval Systems:

  1. Master data structures and algorithms.
  2. Learn linear algebra and probability.
  3. Practice Python programming.
  4. Build real projects.
  5. Understand search engine architecture.
  6. Learn machine learning fundamentals.
  7. Practice evaluating retrieval systems.
  8. Study modern AI retrieval systems.

Information Retrieval is one of the most important foundations of:

  • Artificial Intelligence
  • Search Engines
  • Recommender Systems
  • NLP
  • Data Science
  • Generative AI
Note

Modern companies such as Google, Amazon, Netflix, TikTok, Meta, Microsoft, OpenAI, and Spotify depend heavily on Information Retrieval technologies.