Semester Offering: August

With the growth of massive digital data archives, which are not necessarily organized in any order, the twin and complementary processes of information retrieval and data mining have emerged together as a particularly important discipline within the information sciences. The object of information retrieval is to automatically search a data archive in order to respond to a user’s query. The object of data mining, on the other hand, is to automatically process a data archive in order to find patterns that represent knowledge or, equivalently, information interesting to the user (not necessarily in response to a targeted query). Information retrieval and data mining draw on multidisciplinary techniques, including those from artificial intelligence, statistics, machine learning, pattern analysis, and others.


The object of this course is to introduce information retrieval and data mining techniques with a view to practical application. Topics covered will include association and rule generation, classification and prediction (including Bayesian and rule-based), cluster analysis (including partitioning, hierarchical and grid-based methods, and outlier analysis), data stream mining, social network analysis, Boolean retrieval, index construction and compression, the vector space model, relevance feedback and query expansion, and probabilistic information retrieval. Practical case studies will use both commercial and non-commercial software packages.




I.             Boolean Retrieval
1.      Inverted index
2.      Processing Boolean queries
3.      Extended Boolean model
4.      Ranked retrieval

II.          Index Construction
1.      Blocked sort-based indexing
2.      Single-pass in-memory indexing
3.      Distributed indexing
4.      Dynamic indexing

III.       Index Compression
1.      Statistical properties of terms in information retrieval
2.      Dictionary compression
3.      Postings file compression

IV.       Scoring and the Vector Space Model
1.      Parametric and zone indexes
2.      Vector space model for scoring

V.          Mining Frequent Patterns, Associations, and Correlations
1.      Efficient and scalable frequent itemset mining methods
2.      Mining association rules
3.      From association mining to correlation analysis
4.      Constraint-based association mining

VI.       Classification and Prediction
1.      Classification and prediction methods
2.      Accuracy and error measures
3.      Evaluation techniques
4.      Model selection

VII.    Cluster Analysis
1.      Clustering methods
2.      High-dimensional data
3.      Constraint-based cluster analysis
4.      Outlier analysis

VIII.    Special Applications
1.      Mining data streams
2.      Mining time series data
3.      Graph mining
4.      Social network analysis

IX.       Case Studies


J. Han and M. Kamber (2006), Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann.
C. D. Manning, P. Raghavan and H. Schütze (2009), An Introduction to Information Retrieval, Cambridge University Press.


M. J. A. Berry and G. Linoff (1997), Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, Wiley.
I. H. Witten and E. Frank (2001), Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann.
T. Soukup and I. Davidson (2002), Visual Data Mining: Techniques and Tools for Data Visualization and Mining, Wiley.
P. Tan, M. Steinbach and V. Kumar (2005), Introduction to Data Mining, Addison-Wesley.
D. T. Larose (2006), Data Mining Methods and Models, Wiley.


Assignment 30%
Midterm exam 30%
Final exam 40%