COME5106 - INFORMATION RETRIEVAL SYSTEM (IRS)
Complete Semester Notes for Computer Science & Computer Engineering Students
COURSE INFORMATION
Course Title
Information Retrieval System (IRS)
Credit Value
6 Credits
Duration
15–16 Weeks
Target Students
- Computer Science Students
- Computer Engineering Students
- Software Engineering Students
- Data Science Students
- AI and Machine Learning Students
Course Description
Information Retrieval (IR) focuses on how computers organize, index, search, retrieve, rank, and recommend information from massive collections of data such as websites, digital libraries, emails, enterprise documents, research databases, and social media platforms.
This course introduces students to:
- Search engines
- Indexing algorithms
- Text processing
- Ranking systems
- Web search
- Machine learning for retrieval
- Clustering and classification
- Information filtering
- Web crawling
- Relevance ranking
- Natural language processing in search systems
The course combines:
- Theory
- Algorithms
- Mathematical models
- Real-world industrial applications
- Practical labs
- Mini-projects
COURSE LEARNING OUTCOMES
At the end of this course, students should be able to:
- Explain the principles of Information Retrieval Systems.
- Design and implement simple search engines.
- Build inverted indexes.
- Process Boolean and ranked queries.
- Apply text preprocessing techniques.
- Compute tf-idf and cosine similarity.
- Evaluate retrieval performance using precision and recall.
- Implement document classification and clustering.
- Explain web crawling and web indexing.
- Understand PageRank and link analysis.
- Build mini search systems for real-world applications.
- Apply machine learning techniques in retrieval systems.
COURSE OUTLINE
| Week | Topic |
|---|---|
| 1 | Introduction to Information Retrieval |
| 2 | Boolean Retrieval Model |
| 3 | Inverted Index and Postings Lists |
| 4 | Text Processing and Tokenization |
| 5 | Dictionaries and Tolerant Retrieval |
| 6 | Index Construction |
| 7 | Index Compression |
| 8 | Vector Space Model and tf-idf |
| 9 | Ranked Retrieval and Scoring |
| 10 | Evaluation of IR Systems |
| 11 | Relevance Feedback and Query Expansion |
| 12 | Probabilistic Retrieval Models |
| 13 | Text Classification |
| 14 | Clustering and Latent Semantic Indexing |
| 15 | Web Search Engines and Crawling |
| 16 | Link Analysis, Revision, and Project Presentation |
CHAPTER 1 — INTRODUCTION TO INFORMATION RETRIEVAL
1.1 What is Information Retrieval?
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Information Retrieval (IR) is the process of obtaining relevant information from large collections of data.
Simple Definition
Information Retrieval helps users find needed information quickly from huge datasets.
Examples
- Google Search
- YouTube Search
- Spotify music recommendation
- Netflix movie recommendation
- Amazon product search
- Gmail email search
- Library catalog systems
- Digital medical records search
1.2 Difference Between IR and Database Systems
| Information Retrieval | Database Systems |
|---|---|
| Deals with unstructured data | Deals with structured data |
| Results may be approximate | Results are exact |
| Uses ranking | Uses exact matching |
| Example: Google | Example: Banking System |
Example
Database Query:
SELECT * FROM students WHERE matricule='FE21A123';IR Query:
best universities in africa for computer science
1.3 Components of an Information Retrieval System
Major Components
- Document Collection
- Text Processing Unit
- Indexing Engine
- Query Processor
- Ranking Module
- Retrieval Module
- User Interface
Real-World Scenario
When a user searches:
cheap smartphones under $200
The system:
- Processes the query
- Searches indexes
- Ranks results
- Displays relevant products
1.4 Applications of Information Retrieval
Industrial Applications
Search Engines
- Bing
- DuckDuckGo
E-commerce
- Amazon
- Alibaba
- Jumia
Healthcare
- Patient record retrieval
- Disease diagnosis systems
Cybersecurity
- Threat intelligence search
- Log analysis
Education
- Digital libraries
- Research paper retrieval
1.5 Challenges in Information Retrieval
Major Challenges
Ambiguity
Example:
Java
Could mean:
- Programming language
- Island in Indonesia
- Coffee
Synonyms
Different words with same meaning.
Example:
Car = Automobile
Large Data Volumes
Search engines process billions of pages.
Misspellings
Example:
Micorsoft
instead of:
Microsoft
Relevance Ranking
Which result should appear first?
LAB 1
Objective
Understand basic Information Retrieval concepts.
Practical Activity
Task 1
Use Google search and analyze:
- Ranking order
- Autocomplete suggestions
- Search snippets
- Related searches
Task 2
Compare:
- Bing
- DuckDuckGo
Search for:
Artificial Intelligence in Healthcare
Document differences.
CHAPTER 2 — BOOLEAN RETRIEVAL MODEL
2.1 Boolean Retrieval
Boolean Retrieval uses logical operators.
The Boolean retrieval model is a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words.
Operators
| Operator | Meaning |
|---|---|
| AND | Both terms must appear |
| OR | Either term may appear |
| NOT | Exclude a term |
2.2 Examples
AND Query
machine AND learning
Returns documents containing both words.
OR Query
doctor OR physician
NOT Query
python NOT snake
2.3 Advantages of Boolean Retrieval
- Simple
- Fast
- Efficient for exact search
- Good for legal and medical systems
2.4 Limitations
- No ranking
- Difficult for ordinary users
- Too many or too few results
2.5 Term-Document Matrix
A binary matrix showing whether terms exist in documents.
| Term | D1 | D2 | D3 |
|---|---|---|---|
| AI | 1 | 1 | 0 |
| Cloud | 0 | 1 | 1 |
| Security | 1 | 0 | 1 |
1 = Present 0 = Absent
A term-document incidence matrix. Matrix element (t, d) is 1 if the document in column d contains the word in row t, and is 0 otherwise.
To answer a query like AI AND Cloud AND NOT Security, we take the vectors for AI, Cloud and Security, complement the last, and then do a bitwise AND:
110 AND 011 AND 010 = 010The answer for this query is thus document D2.
Step-by-Step Breakdown for Your Class
The 3-bit vectors represent the presence (1) or absence (0) of terms across 3 documents tracked in this exact sequence:
- D1
- D2
- D3
1. Retrieve the Vector Profiles
- AI: 110 (Present in D1 and D2)
- Cloud: 011 (Present in D2 and D3)
- Security: 101 (Present in D1 and D3)
2. Execute the NOT Operator (Bitwise Complement)
Complementing Security’s profile flips every bit (\(0 \rightarrow 1\) and \(1 \rightarrow 0\)): \[\text{NOT } (101) = 010\]
3. Execute the AND Operations (Bitwise Intersection)
Align the bits vertically and apply standard boolean multiplication:
110 (AI) 011 (Cloud) 010 (NOT Security) —— 010 (Resulting Bitstring)
Deciphering the Final Output Vector (010)
Looking at where the active 1 bits land in our predefined matrix sequence:
- Bit 2 is 1: Maps to D2.
2.6 Linear Search vs Indexed Search
Linear Search
Checks all documents one by one.
Indexed Search
Uses indexes for faster retrieval.
LAB 2
Objective
Implement a simple Boolean Retrieval system.
Python Example
documents = {
1: "machine learning and artificial intelligence",
2: "cloud computing and cybersecurity",
3: "artificial intelligence in healthcare"
}
query = ["artificial", "intelligence"]
for doc_id, text in documents.items():
if all(term in text for term in query):
print(doc_id, text)Expected Output
Documents containing all query terms.
CHAPTER 3 — INVERTED INDEX AND POSTINGS LISTS
3.1 Inverted Index
An inverted index maps terms to documents.
Example
| Term | Documents |
|---|---|
| AI | D1, D3 |
| Cloud | D2 |
| Security | D2, D3 |
3.2 Why Inverted Indexes Matter
Without indexes:
- Search is slow
- Systems become inefficient
With indexes:
- Retrieval becomes extremely fast
Google uses massive distributed indexes.
3.3 Postings Lists
A postings list stores document IDs for each term.
Example:
AI → [1, 3, 7, 9]
3.4 Building an Inverted Index
Steps
- Collect documents
- Tokenize text
- Normalize terms
- Remove stop words
- Build postings lists
3.5 Real-World Application
E-commerce Search
When users search:
wireless headphones
The search engine retrieves product IDs from postings lists.
LAB 3
Objective
Build an inverted index.
docs = {
1: "AI and machine learning",
2: "Cloud computing and AI",
3: "Cybersecurity and cloud systems"
}
index = {}
for doc_id, text in docs.items():
for word in text.lower().split():
if word not in index:
index[word] = []
index[word].append(doc_id)
print(index)CHAPTER 4 — TEXT PROCESSING AND TOKENIZATION
4.1 Tokenization
Tokenization splits text into smaller units called tokens.
Example:
"Machine Learning is Powerful"
Tokens:
[Machine, Learning, is, Powerful]
4.2 Stop Words
Common words often removed.
Examples:
- the
- is
- and
- of
4.3 Normalization
Converting text into standard format.
Examples
| Original | Normalized |
|---|---|
| COMPUTER | computer |
| Running | run |
4.4 Stemming
Reduces words to root forms.
Example:
playing → play
Popular Stemmer
Porter Stemmer
4.5 Lemmatization
Uses dictionary meaning to obtain base word.
Example:
better → good
4.6 Challenges in Text Processing
Multilingual Data
- French
- English
- Chinese
- Arabic
Emojis and Slang
Social media introduces noisy text.
LAB 4
Objective
Perform text preprocessing.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
text = "Machine Learning is changing the world"
words = word_tokenize(text.lower())
stop_words = set(stopwords.words('english'))
filtered = [w for w in words if w not in stop_words]
ps = PorterStemmer()
stemmed = [ps.stem(w) for w in filtered]
print(stemmed)CHAPTER 5 — DICTIONARIES AND TOLERANT RETRIEVAL
5.1 Dictionary Structures
A dictionary stores all vocabulary terms.
Structures Used
- Hash Tables
- B-Trees
- Tries
5.2 Wildcard Queries
Example:
comput*
Matches:
- computer
- computing
- computation
5.3 Spelling Correction
Example:
Gooogle
Corrected to:
Google
5.4 Edit Distance
Measures similarity between strings.
Example
Distance between:
cat → cut
Distance = 1
5.5 Phonetic Correction
Words sounding alike.
Example:
nite → night
LAB 5
Objective
Implement spelling correction.
from textblob import TextBlob
word = TextBlob("googel")
print(word.correct())CHAPTER 6 — INDEX CONSTRUCTION
6.1 What is Index Construction?
Building searchable structures from documents.
6.2 Steps in Index Construction
- Parse documents
- Extract tokens
- Normalize text
- Generate term IDs
- Build postings lists
- Store indexes
6.3 Distributed Indexing
Large systems distribute indexing.
Technologies
- Hadoop
- Spark
- MapReduce
6.4 Dynamic Indexing
Allows updates while search engine runs.
Used by:
- Twitter/X
6.5 Enterprise Scenario
A university repository indexes:
- Theses
- PDFs
- Lecture notes
- Journals
LAB 6
Objective
Build a scalable indexing pipeline.
Activity
Use:
- Python
- Elasticsearch
Index 100 sample documents.
CHAPTER 7 — INDEX COMPRESSION
7.1 Why Compression Matters
Search engines store huge indexes.
Compression:
- Saves storage
- Improves speed
- Reduces memory usage
7.2 Zipf’s Law
Few words appear frequently.
Example:
- the
- and
- is
7.3 Heaps’ Law
Vocabulary grows with collection size.
7.4 Compression Techniques
Variable Byte Encoding
Gamma Encoding
Front Coding
7.5 Industrial Relevance
Google compresses indexes heavily to support billions of searches.
LAB 7
Objective
Compare compressed vs uncompressed indexes.
Tasks
- Measure storage size
- Measure retrieval speed
CHAPTER 8 — VECTOR SPACE MODEL
8.1 Ranked Retrieval
Modern systems rank results by relevance.
8.2 Vector Space Model
Documents and queries are represented as vectors.
8.3 Term Frequency (TF)
Measures how often a term appears.
genui{“math_block_widget_always_prefetch_v2”:{“content”:“tf_{t,d}=”}}
8.4 Inverse Document Frequency (IDF)
Measures importance of terms.
genui{“math_block_widget_always_prefetch_v2”:{“content”:“idf_t=()”}}
8.5 TF-IDF
Combines TF and IDF.
genui{“math_block_widget_always_prefetch_v2”:{“content”:“tfidf_{t,d}=tf_{t,d}idf_t”}}
8.6 Cosine Similarity
Measures similarity between vectors.
genui{“math_block_widget_always_prefetch_v2”:{“content”:“()=”}}
8.7 Real-World Application
Resume Matching
Recruitment systems compare resumes against job descriptions.
Recommendation Systems
Netflix compares user preferences.
LAB 8
Objective
Compute tf-idf manually and programmatically.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
"machine learning is powerful",
"artificial intelligence and machine learning",
"cloud computing systems"
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())CHAPTER 9 — RANKED RETRIEVAL AND SCORING
9.1 Ranking
Search engines rank documents by relevance score.
9.2 Scoring Functions
Factors
- Term frequency
- Document length
- Link popularity
- User behavior
9.3 Top-K Retrieval
Retrieve only best results.
Example: Top 10 Google results.
9.4 Query Processing Pipeline
- Query Parsing
- Tokenization
- Retrieval
- Ranking
- Snippet Generation
9.5 Snippet Generation
Example: Google highlights matching words.
LAB 9
Objective
Implement document ranking.
Activity
Use cosine similarity to rank documents.
CHAPTER 10 — EVALUATION OF INFORMATION RETRIEVAL SYSTEMS
10.1 Why Evaluation Matters
We need to measure retrieval quality.
10.2 Precision
Fraction of retrieved documents that are relevant.
genui{“math_block_widget_always_prefetch_v2”:{“content”:“Precision=”}}
10.3 Recall
Fraction of relevant documents retrieved.
genui{“math_block_widget_always_prefetch_v2”:{“content”:“Recall=”}}
10.4 F1 Score
genui{“math_block_widget_always_prefetch_v2”:{“content”:“F1=”}}
10.5 User Satisfaction Metrics
- Click-through rate
- Dwell time
- Bounce rate
10.6 Industrial Scenario
E-commerce companies evaluate:
- Product search accuracy
- Conversion rates
- Customer satisfaction
LAB 10
Objective
Evaluate retrieval effectiveness.
Task
Compute:
- Precision
- Recall
- F1 Score
for a sample dataset.
CHAPTER 11 — RELEVANCE FEEDBACK AND QUERY EXPANSION
11.1 Relevance Feedback
Users indicate useful results.
System improves retrieval.
11.2 Query Expansion
Expands search terms.
Example:
car → automobile, vehicle
11.3 Pseudo-Relevance Feedback
Assumes top results are relevant.
11.4 Real-World Example
Google suggests:
- Related searches
- Synonyms
- Alternative spellings
LAB 11
Objective
Implement query expansion.
Task
Use WordNet synonyms for expansion.
CHAPTER 12 — PROBABILISTIC INFORMATION RETRIEVAL
12.1 Probability Ranking Principle
Documents ranked by probability of relevance.
12.2 Binary Independence Model
Assumes terms occur independently.
12.3 BM25 Ranking
Industry-standard ranking algorithm.
genui{“math_block_widget_always_prefetch_v2”:{“content”:“BM25(q,d)=_{tq}idf(t)“}}
12.4 Applications
- Search engines
- Legal retrieval
- Scientific document retrieval
LAB 12
Objective
Implement BM25 ranking.
Suggested Tools
- Python
- rank_bm25 library
CHAPTER 13 — TEXT CLASSIFICATION
13.1 Text Classification
Assign documents to categories.
13.2 Examples
Spam Detection
- Spam
- Not Spam
News Categorization
- Sports
- Politics
- Entertainment
13.3 Naive Bayes Classifier
Popular classification algorithm.
genui{“math_block_widget_always_prefetch_v2”:{“content”:“P(c|d)=”}}
13.4 Feature Selection
Important words selected for classification.
13.5 Industrial Applications
- Email filtering
- Sentiment analysis
- Fake news detection
- Content moderation
LAB 13
Objective
Build a spam classifier.
from sklearn.naive_bayes import MultinomialNBTrain and evaluate classifier.
CHAPTER 14 — CLUSTERING AND LATENT SEMANTIC INDEXING
14.1 Clustering
Grouping similar documents.
14.2 K-Means Clustering
Popular clustering algorithm.
14.3 Applications
- Search result grouping
- Customer segmentation
- Topic modeling
14.4 Hierarchical Clustering
Creates cluster trees.
14.5 Latent Semantic Indexing (LSI)
Finds hidden semantic relationships.
14.6 Real-World Scenario
Netflix groups similar movies.
Spotify groups similar songs.
LAB 14
Objective
Cluster documents using K-Means.
from sklearn.cluster import KMeansCHAPTER 15 — WEB SEARCH ENGINES AND WEB CRAWLING
15.1 Search Engine Architecture
Components
- Web crawler
- Indexer
- Query processor
- Ranking engine
15.2 Web Crawling
Crawler downloads web pages.
15.3 URL Frontier
List of URLs waiting to be crawled.
15.4 Challenges
- Duplicate pages
- Dead links
- Dynamic pages
- Spam pages
15.5 Distributed Crawling
Used for web-scale search engines.
15.6 Industrial Scenario
Googlebot continuously crawls billions of pages.
LAB 15
Objective
Build a simple web crawler.
import requests
from bs4 import BeautifulSoup
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for link in soup.find_all('a'):
print(link.get('href'))CHAPTER 16 — LINK ANALYSIS AND PAGERANK
16.1 The Web as a Graph
Web pages connected through hyperlinks.
16.2 PageRank
Measures page importance.
Developed by Google founders.
16.3 Intuition Behind PageRank
A page is important if important pages link to it.
16.4 Applications
- Search ranking
- Influence analysis
- Social network analysis
16.5 HITS Algorithm
Uses:
- Hubs
- Authorities
LAB 16
Objective
Simulate PageRank.
Task
Calculate PageRank for a small web graph.
MODERN TRENDS IN INFORMATION RETRIEVAL
Artificial Intelligence in Search
AI improves:
- Recommendations
- Semantic search
- Conversational search
Voice Search
Examples:
- Siri
- Alexa
- Google Assistant
Semantic Search
Understands meaning instead of keywords.
Generative AI Search
Examples:
- ChatGPT
- Gemini
- Perplexity AI
Vector Databases
Modern AI retrieval uses:
- Pinecone
- Weaviate
- FAISS
- ChromaDB
MINI PROJECTS
Project 1
Build a University Search Engine.
Features:
- Search lecture notes
- Search PDFs
- Ranking system
- Autocomplete
Project 2
Build a Job Recommendation System.
Project 3
Build a News Classification System.
Project 4
Build a Movie Recommendation System.
SAMPLE EXAM QUESTIONS
Theory Questions
- Define Information Retrieval.
- Explain inverted indexes.
- Differentiate precision and recall.
- Explain tf-idf.
- Discuss PageRank.
- Explain text preprocessing.
- Explain Boolean retrieval.
- Describe BM25.
- Explain clustering.
- Explain Naive Bayes classification.
Practical Questions
- Build an inverted index.
- Compute tf-idf.
- Calculate cosine similarity.
- Build a simple crawler.
- Implement document ranking.
CAREER OPPORTUNITIES
Students can become:
- Search Engineer
- NLP Engineer
- Machine Learning Engineer
- Data Scientist
- AI Engineer
- Recommendation Systems Engineer
- Search Relevance Engineer
- Big Data Engineer
- Cybersecurity Analyst
- Research Scientist
RECOMMENDED TOOLS
Programming Languages
- Python
- Java
- Scala
Libraries
- NLTK
- Scikit-learn
- SpaCy
- Gensim
- TensorFlow
- PyTorch
Search Platforms
- Elasticsearch
- Apache Solr
- OpenSearch
Big Data Platforms
- Hadoop
- Spark
RECOMMENDED TEXTBOOKS
- Introduction to Information Retrieval — Manning, Raghavan & Schütze
- Modern Information Retrieval — Ricardo Baeza-Yates
- Speech and Language Processing — Jurafsky & Martin
- Mining Massive Datasets — Leskovec
FINAL ADVICE TO STUDENTS
To succeed in Information Retrieval Systems:
- Master data structures and algorithms.
- Learn linear algebra and probability.
- Practice Python programming.
- Build real projects.
- Understand search engine architecture.
- Learn machine learning fundamentals.
- Practice evaluating retrieval systems.
- Study modern AI retrieval systems.
Information Retrieval is one of the most important foundations of:
- Artificial Intelligence
- Search Engines
- Recommender Systems
- NLP
- Data Science
- Generative AI
Modern companies such as Google, Amazon, Netflix, TikTok, Meta, Microsoft, OpenAI, and Spotify depend heavily on Information Retrieval technologies.
Social Media