Data Mining. Author: Mehmed Kantardzic

measurements. This data set was used by R. A. Fisher in 1936 as an example for discriminant analysis.

Adult Data Set.

http://archive.ics.uci.edu/ml/datasets/Adult

The Adult Data Set contains 48,842 samples extracted from the U.S. Census. The task is to classify individuals as having an income that does or does not exceed $50,000/year based on factors such as age, education, race, sex, and native country.
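The classification task above reduces to predicting a binary target. The following is a minimal sketch of deriving that target, assuming a simplified, hypothetical CSV layout and made-up rows; the column names and the `>50K` income encoding are assumptions for illustration and may not match the actual file.

```python
import csv
import io

# Made-up rows in a simplified, hypothetical layout (not the real file's columns).
SAMPLE = """age,education,sex,native_country,income
39,Bachelors,Male,United-States,<=50K
52,HS-grad,Female,United-States,>50K
28,Masters,Female,Cuba,>50K
"""


def load_rows(text):
    """Parse CSV text and add a binary label: 1 if income exceeds $50,000/year."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["age"] = int(row["age"])
        # Assumed encoding: the string ">50K" marks the positive class.
        row["label"] = 1 if row["income"].strip() == ">50K" else 0
        rows.append(row)
    return rows


rows = load_rows(SAMPLE)
labels = [r["label"] for r in rows]
print(labels)
```

The remaining columns (age, education, and so on) would serve as the input features for whichever classifier is being evaluated.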

Breast Cancer Wisconsin (Diagnostic) Data Set.

http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

This data set consists of a number of measurements taken over a “digitized image of a fine needle aspirate (FNA) of a breast mass.” There are 569 samples. The task is to classify each data point as benign or malignant.

A.4.2 Clustering

Bag of Words Data Set.

http://archive.ics.uci.edu/ml/datasets/Bag+of+Words

Word counts have been extracted from five document sources: Enron Emails, NIPS full papers, KOS blog entries, NYTimes news articles and Pubmed abstracts. The task is to cluster the documents used in this data set based on the word counts found. One may compare the output clusters with the sources from which each document came.
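One possible way to compare output clusters with the document sources is cluster purity. The sketch below is an illustration under that assumption, not a measure prescribed by the text; the cluster assignments and source labels are made up.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, sources):
    """Fraction of documents whose source equals the majority source of their cluster."""
    if len(cluster_ids) != len(sources):
        raise ValueError("cluster_ids and sources must have the same length")
    if not sources:
        raise ValueError("at least one document is required")
    by_cluster = defaultdict(Counter)
    for cluster, source in zip(cluster_ids, sources):
        by_cluster[cluster][source] += 1
    # For each cluster, count only the documents from its most common source.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(sources)


# Hypothetical clustering output for six documents.
clusters = [0, 0, 0, 1, 1, 2]
sources = ["enron", "enron", "nips", "nips", "nips", "kos"]
purity = cluster_purity(clusters, sources)
print(round(purity, 3))
```

A purity of 1.0 would mean every cluster contains documents from a single source. Purity rises as the number of clusters grows, so it is best compared across runs with the same cluster count.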

US Census Data (1990) Data Set.

http://archive.ics.uci.edu/ml/datasets/US+Census+Data+%281990%29

This data set is a one percent sample from the 1990 Public Use Microdata Samples (PUMS). It contains 2,458,285 records and 68 attributes.

A.4.3 Regression

Auto MPG Data Set.

http://archive.ics.uci.edu/ml/datasets/Auto+MPG

This data set provides a number of attributes of cars that can be used to attempt to predict the “city-cycle fuel consumption in miles per gallon.” There are 398 data points and eight attributes.

Computer Hardware Data Set.

http://archive.ics.uci.edu/ml/datasets/Computer+Hardware

This data set provides a number of CPU attributes that can be used to predict relative CPU performance. It contains 209 data points and 10 attributes.

A.4.4 Web Mining

Anonymous Microsoft Web Data.

http://archive.ics.uci.edu/ml/datasets/Anonymous+Microsoft+Web+Data

This data set contains page visits for a number of anonymous users who visited www.microsoft.com. The task is to predict future categories of pages a user will visit based on the Web pages previously visited.

KDD Cup 2000.

http://www.sigkdd.org

This Web site hosts KDD Cup, a data-mining competition run yearly. KDD Cup 2000 consisted of five tasks based on clickstream and purchase data obtained from Gazelle.com. Gazelle.com sold legwear and legcare products and closed its online store that same year. The site provides links to the papers and posters of the winners of the various tasks and outlines their effective methods. Additionally, the task descriptions provide insight into original approaches to using data mining with clickstream data.

A.4.5 Text Mining

Reuters-21578 Text Categorization Collection.

http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

This is a collection of news articles that appeared on Reuters newswire in 1987. All of the news articles have been categorized. The categorization provides opportunities to test text classification or clustering methodologies.

20 Newsgroups.

http://people.csail.mit.edu/jrennie/20Newsgroups/

The 20 Newsgroups data set contains 20,000 newsgroup documents. These documents are divided nearly evenly among 20 different newsgroups. Similar to the Reuters collection, this data set provides opportunities for text classification and clustering.

A.4.6 Time Series

Dodgers Loop Sensor Data Set.

http://archive.ics.uci.edu/ml/datasets/Dodgers+Loop+Sensor

This data set provides the number of cars counted by a sensor every 5 min over 25 weeks. The sensor was located on the Glendale on-ramp for the 101 North Freeway in Los Angeles. The goal of this data set is to “predict the presence of a baseball game at Dodgers stadium.”
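As a rough illustration of the sampling scheme, the sketch below works out how many 5-min readings 25 weeks would produce and aggregates such readings into hourly totals. It assumes a complete series with no missing readings; the function names and sample counts are hypothetical, and real sensor data may contain gaps that this sketch does not handle.

```python
READINGS_PER_HOUR = 60 // 5  # one count every 5 minutes


def expected_readings(weeks):
    """Number of 5-minute readings in the given number of full weeks."""
    return weeks * 7 * 24 * READINGS_PER_HOUR


def hourly_totals(counts):
    """Sum consecutive 5-minute counts into hourly totals, dropping a trailing partial hour."""
    full = len(counts) - len(counts) % READINGS_PER_HOUR
    return [
        sum(counts[i:i + READINGS_PER_HOUR])
        for i in range(0, full, READINGS_PER_HOUR)
    ]


total = expected_readings(25)
# Made-up counts: two full hours followed by three extra readings.
sample = [1] * 12 + [2] * 12 + [5] * 3
totals = hourly_totals(sample)
print(total, totals)
```

Aggregating to a coarser interval is one common first step before looking for the traffic surges that might accompany a game.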

Australia Gun Deaths.

http://robjhyndman.com/TSDL/crime.html

These data give the yearly death rates in Australia for gun-related and non-gun-related homicides and suicides for the years 1915–2004.

A.4.7 Data for Association Rule Mining

BMS-POS.

http://www.sigkdd.org/kddcup

This data set gives the category for each product purchased from a large electronics retailer. It covers several years' worth of point-of-sale data. This data set contains 515,597 transactions and 1,657 distinct items.

BMS-WebView1.

http://www.sigkdd.org/kddcup

This data set contains several months of clickstream sessions for Gazelle.com. A transaction is defined in this data set as the detail pages viewed per session. This data set contains 59,602 transactions and 497 distinct items.
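Transaction data of this kind is typically summarized with the support and confidence of candidate rules. The sketch below computes both on a handful of made-up sessions, each treated as the set of detail pages viewed; the page names and helper functions are hypothetical and not part of either data set.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0


# Made-up sessions; each is the set of detail pages viewed.
sessions = [
    {"page_a", "page_b"},
    {"page_a", "page_c"},
    {"page_a", "page_b", "page_c"},
    {"page_b"},
]
s_ab = support(sessions, {"page_a", "page_b"})
c_ab = confidence(sessions, {"page_a"}, {"page_b"})
print(s_ab, round(c_ab, 3))
```

A full association-rule miner would enumerate frequent itemsets first (for example with Apriori) rather than scoring rules one at a time; this sketch only shows the two measures.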

A.5 COMMERCIALLY AND PUBLICLY AVAILABLE TOOLS

This summary of some publicly available commercial data-mining products is being provided to help readers better understand what software tools can be found on the market and what their features are. It is not intended to endorse or critique any specific product. Potential users will need to decide for themselves the suitability of each product for their specific applications and data-mining environments. This is primarily intended as a starting point from which users can obtain more information. There is a constant stream of new products appearing in the market and hence this list is by no means comprehensive. Because these changes are very frequent, the author suggests two Web sites for information about the latest tools and their performances: http://www.kdnuggets.com and http://www.knowledgestorm.com.

A.5.1 Free Software

DataLab

Publisher: Epina Software Labs (www.lohninger.com/datalab/en_home.html)

DataLab is a complete and powerful data-mining tool with a unique data exploration process, a focus on marketing, and interoperability with SAS. There is a public version for students.

DBMiner

Publisher: Simon Fraser University (http://ddm.cs.sfu.ca)

DBMiner is a publicly available tool for data mining. It is a multiple-strategy tool and it supports methodologies such as clustering, association rules, summarization, and visualization. DBMiner uses Microsoft SQL Server 7.0 Plato and runs on different Windows platforms.

GenIQ Model

Publisher: DM STAT-1 Consulting (www.geniqmodel.com)

GenIQ Model uses machine learning for regression tasks. It automatically performs variable selection and new variable construction, and then specifies the model equation to “optimize the decile table.”

NETMAP

Publisher: http://sourceforge.net/projects/netmap

NETMAP is a general-purpose, information-visualization tool. It is most effective for large, qualitative, text-based data sets. It runs on Unix workstations.

RapidMiner

Publisher: Rapid-I (http://rapid-i.com)

Rapid-I provides software, solutions, and services in the fields of predictive analytics, data mining, and text mining. The company concentrates on automatic intelligent analyses on a large scale, that is, for large amounts of structured data, such as database systems, and unstructured data, such as texts. As an open-source data-mining specialist, Rapid-I enables other companies to use leading-edge technologies for data mining and business intelligence. Discovering and leveraging unused business intelligence in existing data enables better-informed decisions and allows for process optimization.

SIPINA

Publisher: http://eric.univ-lyon2.fr/~ricco/sipina.html

Sipina-W is publicly available software that includes different traditional data-mining techniques such as CART, Elisee, ID3, C4.5, and some new methods for generating decision trees.

SNNS

Publisher: University of Stuttgart (http://www.nada.kth.se/~orre/snns-manual/)

SNNS is publicly available software. It is a simulation environment for research on and application
