Notes

Notes - notes.io

MIS 331
2016/2017 Fall, Homework #1
Due: 31.10.2016 17:00

Submit as hard copy Question 3 can
1. Consider a real-life situation that interests you, for example a university registration system, a hospital information system, or any other business problem. Go over the steps of the KDD process as follows:
a) Briefly describe the environment (in 60-100 words).
An online shopping system where customers can purchase products. To register, users enter information such as name, age, and occupation, and choose a username and password. Customers can choose from a range of products of different brands and prices. The database stores more detailed information about the products, such as type, supplier, and cost. It also holds information about each transaction: when the product was ordered, how many units were ordered, and which payment method was used. To keep track of stock, the items sold are recorded as well.
b) Describe very briefly the database where relevant attributes are stored.
An online shopping system can be described by a relational database, which is a collection of tables consisting of attributes and tuples. The relational database of the online shopping system contains the following relations: customer, item, purchases, and items_sold.
customer (cust_ID, name, address, age, occupation, password, username, monthly_income, location, category)
item (item_ID, brand, category, type, price, place_made, supplier, cost)
purchases (trans_ID, date, time, method_paid, amount)
items_sold (trans_ID, item_ID, quantity)

c) Define three data mining problems from the environment, requiring three different functionalities such as association, classification, and clustering. Clearly state the importance of each problem for the organization (one problem for each functionality).

-With association we can find frequent patterns, that is, the market basket. For example, let us assume there is an overstock of software products.

buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
A confidence of 50% means that if a customer buys a computer, there is a 50% chance that he or she will also buy software. A support of 1% means that 1% of all transactions contain both a computer and software. In this case, by putting software products on sale we can increase their purchase rate and reduce our stock.
-With classification we can analyze class-labeled data sets. For example, let us assume that we want to predict three kinds of responses (no response, mild response, good response) to a new product based on income level, assuming that middle_aged and senior customers share the same income level.
age(X, “youth”) AND income(X, “high”) ⇒ class(X, “Mild Response”)
age(X, “youth”) AND income(X, “low”) ⇒ class(X, “No Response”)
age(X, “middle_aged”) ⇒ class(X, “Good Response”)
age(X, “senior”) ⇒ class(X, “Good Response”)
According to the IF-THEN rules, we can conclude that the product is viewed as “expensive” and suits the “upper” class.
-Unlike classification, clustering analyzes data without class labels and can be used to generate labels for groups of data. Objects are grouped so as to maximize the intraclass similarity and minimize the interclass similarity. For example, let us assume sales are lower in city B than in city A. By clustering customers according to location, we can make special offers to the residents of the corresponding city and increase sales.
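
The IF-THEN rules of the classification example can be written as a small function. This is only a sketch: the function name predict_response and the string encodings of age group and income are assumptions made for illustration, and combinations that the rules do not cover return None.

```python
def predict_response(age_group, income):
    """Apply the four IF-THEN rules from the classification example.

    Returns None when no rule covers the given combination
    (for example a youth with medium income).
    """
    if age_group == "youth" and income == "high":
        return "Mild Response"
    if age_group == "youth" and income == "low":
        return "No Response"
    if age_group in ("middle_aged", "senior"):
        # These two rules do not depend on income.
        return "Good Response"
    return None


# Hypothetical customers, used only to exercise the rules.
for age_group, income in [("youth", "high"), ("youth", "low"), ("senior", "low")]:
    print(age_group, income, "->", predict_response(age_group, income))
```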

d) Describe the variables in the database to be used in the solution of these problems.

-In the association method we use the purchases and items_sold tables; by joining them on trans_ID we find out which products are sold together. Assuming that there is an overstock of software products, we look for the market baskets that contain software products and apply a discount.

-In the classification method we use the customer table and model IF-THEN rules using descriptive features such as age and monthly_income to find out how they relate to the responses.

-In the clustering method we use the customer table and cluster the customers according to the variable location.
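
For the association problem, the following sketch shows one way to group the rows of items_sold by trans_ID into market baskets and to compute the support and confidence of buys(X, “computer”) ⇒ buys(X, “software”). The sample rows, the use of product names as item_ID values, and the function names are assumptions made for illustration. The purchases table is not needed for the counts themselves; it would only add the date, time, and payment method of each transaction.

```python
from collections import defaultdict

# Hypothetical rows of items_sold: (trans_ID, item_ID, quantity).
items_sold = [
    (1, "computer", 1),
    (1, "software", 2),
    (2, "computer", 1),
    (3, "printer", 1),
    (4, "software", 1),
]


def build_baskets(rows):
    """Group item IDs by trans_ID so each transaction becomes one basket."""
    baskets = defaultdict(set)
    for trans_id, item_id, _quantity in rows:
        baskets[trans_id].add(item_id)
    return dict(baskets)


def support_confidence(baskets, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    n = len(baskets)
    with_antecedent = sum(1 for items in baskets.values() if antecedent in items)
    with_both = sum(
        1 for items in baskets.values()
        if antecedent in items and consequent in items
    )
    support = with_both / n if n else 0.0
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


baskets = build_baskets(items_sold)
support, confidence = support_confidence(baskets, "computer", "software")
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

With these four sample transactions, one basket contains both items and two contain a computer, so the support is 25% and the confidence is 50%.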

e) Are there any data problems such as missing data, outliers, or inconsistencies?

-Users may purposely submit incorrect data values or leave fields blank when they do not wish to share personal information. In the clustering example, location information could be missing.

-Inconsistent data can also result from human or computer errors at data entry. In the classification example, responses could be entered incorrectly, which would make the outcome misleading.
-Outliers occur when data objects do not comply with the general behavior of the data. In the association example, fraudulent credit card usage can be detected by comparing a purchase against the customer's usual purchases.

f) Do you define new variables that do not exist in the original database to solve these problems?

-To avoid blank data values, mandatory fields can be established, or we can create a new variable that holds a default value. In the clustering example we can generate a default_location, which lets us fill the gaps and proceed with our calculations.
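
A minimal sketch of the default-value idea, assuming customer rows are held as dictionaries and that a missing location appears as None or as an empty string. The placeholder value "unknown" and the function name are illustrative assumptions.

```python
DEFAULT_LOCATION = "unknown"  # assumed placeholder value

# Hypothetical customer rows with some missing locations.
customers = [
    {"cust_ID": 1, "location": "A"},
    {"cust_ID": 2, "location": None},
    {"cust_ID": 3, "location": ""},
]


def fill_default_location(rows, default=DEFAULT_LOCATION):
    """Return copies of the rows with blank locations replaced by the default."""
    return [{**row, "location": row.get("location") or default} for row in rows]


cleaned = fill_default_location(customers)
print([row["location"] for row in cleaned])
```

Customers that receive the default value end up in a group of their own when clustering by location, so they should be interpreted separately from real locations.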
g) What are your input and output variables if you are to solve a classification or numerical prediction problem?
Predicting the amount of revenue that the product will generate (output) based on previous sales data (input) is an example of numerical prediction by regression analysis, applied to the classification example. Instead of labeling customers with no response, mild response, or good response, we can focus on the revenue.
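
As a rough illustration of numerical prediction, the sketch below fits a straight line by least squares with one input variable. The data values and the names past_sales, revenue, and fit_line are assumptions; a real solution would use the actual sales history and probably more than one input.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: units sold in past periods (input) and revenue (output).
past_sales = [1.0, 2.0, 3.0, 4.0]
revenue = [2.0, 4.0, 6.0, 8.0]

intercept, slope = fit_line(past_sales, revenue)
predicted = intercept + slope * 5.0  # predicted revenue for 5 units
print(intercept, slope, predicted)
```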

h) Suppose the problem is solved successfully. Describe the implementation of the solution in the environment. What are some possible impacts of the data mining solution? Can you imagine any unanticipated events after implementation? (in 40-60 words)

2. From the Foodmart dataset perform the following operations:
a) Calculate and show frequency distributions for income level, gender, number of children, education level, and profession using SPSS.
b) Show a cross tabulation of how average sales vary according to education and income level.
c) Draw box plots of the different sales categories and interpret the plots.
d) In how many different components can the three sales variables be summarized? Apply a PCA to the three main categories of sales variables and interpret the output of the PCA.

3. Show that the correlation coefficient is not affected by a change in the measurement units of the variables.
E.g., suppose x is temperature measured in °C and y is sales in any currency. Will the correlation between temperature and sales be affected if temperature is measured in Fahrenheit? Note that a linear transformation is applied to convert from Celsius to Fahrenheit:
X’ = aX + b, where X’ is the new unit, X is the old unit, and a and b are constants.
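
A quick numerical check of this claim, using made-up temperature and sales figures (the data and the helper name pearson are assumptions). It does not replace the proof asked for in the question. The conversion uses a = 1.8 and b = 32; a third case with a negative factor is included for comparison.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical observations.
celsius = [10.0, 15.0, 20.0, 25.0, 30.0]
sales = [120.0, 150.0, 170.0, 210.0, 260.0]

fahrenheit = [1.8 * c + 32 for c in celsius]   # X' = aX + b with a > 0
negated = [-c for c in celsius]                # X' = aX + b with a < 0

r_celsius = pearson(celsius, sales)
r_fahrenheit = pearson(fahrenheit, sales)
r_negated = pearson(negated, sales)
print(r_celsius, r_fahrenheit, r_negated)
```

The first two values agree up to rounding, while the third has the same magnitude with the opposite sign.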

Consider another example, the correlation between the height and weight of people. Will it change if we measure height in meters, centimeters, or inches? Or weight in kilograms or pounds?

Here the unit change is
X’ = aX + b, y’ = cy + d, but b and d are zero.

Formulate the unit changes as a general linear transformation of both the X and y variables.
Formulate under what condition the correlation coefficient between x and y changes.
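
One possible way to set up the derivation is sketched below, assuming a and c are nonzero and using the standard properties of covariance and standard deviation under linear transformations. The notation r, Cov, and σ is introduced here for illustration.

```latex
\begin{align*}
X' &= aX + b, \qquad y' = cy + d, \qquad a \neq 0,\; c \neq 0 \\
\operatorname{Cov}(X', y') &= ac \,\operatorname{Cov}(X, y) \\
\sigma_{X'} &= |a|\,\sigma_X, \qquad \sigma_{y'} = |c|\,\sigma_y \\
r_{X'y'} &= \frac{\operatorname{Cov}(X', y')}{\sigma_{X'}\,\sigma_{y'}}
          = \frac{ac}{|a|\,|c|}\, r_{Xy}
          = \operatorname{sgn}(ac)\, r_{Xy}
\end{align*}
```

Under these assumptions the shifts b and d drop out, the coefficient is unchanged when ac > 0, and only its sign flips when ac < 0.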

4. Voters are asked whether they support a political leader in the elections (either support or not support), separately for males and females. Determine the critical region for both males and females when the null hypothesis is that there is no association between voting behavior (support or not) and gender, against the alternative that voting behavior depends on gender, when the p-value of the χ² test statistic is 0.05, for a sample of 200 observations consisting of 100 males and 100 females.
That is, what are the minimum and maximum numbers of males (females) supporting the candidate under the null hypothesis at a significance level of 5%? Note: use contingency analysis.
Hint: You need to solve a quadratic equation to determine the upper or lower limits of the critical region.
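
The limits can also be explored numerically instead of solving the quadratic by hand. The sketch below assumes the Pearson χ² statistic for a 2x2 table without continuity correction, the 5% critical value 3.841 for one degree of freedom, and, purely for illustration, a fixed number of 50 female supporters. The function names are hypothetical.

```python
CHI2_CRITICAL_5PCT_DF1 = 3.841  # chi-square critical value, 1 df, alpha = 0.05


def chi_square_2x2(male_support, female_support, n_male=100, n_female=100):
    """Pearson chi-square statistic for the gender-by-support 2x2 table."""
    total = n_male + n_female
    supporters = male_support + female_support
    opponents = total - supporters
    if supporters == 0 or opponents == 0:
        return 0.0  # degenerate table: one column is empty
    diff = (male_support * (n_female - female_support)
            - female_support * (n_male - male_support))
    return total * diff ** 2 / (n_male * n_female * supporters * opponents)


def acceptance_range(female_support, n_male=100, n_female=100):
    """Smallest and largest male supporter counts that do not reject H0."""
    accepted = [
        m for m in range(n_male + 1)
        if chi_square_2x2(m, female_support, n_male, n_female)
        <= CHI2_CRITICAL_5PCT_DF1
    ]
    return min(accepted), max(accepted)


low, high = acceptance_range(female_support=50)
print(low, high)
```

Under these assumptions, male supporter counts below the lower limit or above the upper limit fall in the critical region. A different count of female supporters shifts both limits.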
