Archive for February, 2010

Data manipulation

February 16, 2010

This is based on a book: Data Manipulation with R (Phil Spector).

page 136:
Datasets can be wide or long. When there are multiple occurrences of values for a single observation:

  • a dataset is said to be long if each occurrence is a separate row in the data frame (most IDR data, EAV (entity-attribute-value) design).
  • a dataset is said to be wide if all of the occurrences of values for a given observation are in the same row.

R’s reshape function (in the stats package) converts a data frame between the long and wide layouts (http://stat.ethz.ch/R-manual/R-patched/library/stats/html/reshape.html).
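As a minimal sketch of the wide and long layouts, the following uses base R's reshape on a made-up data frame. The column names (id, weight.1, weight.2, visit) and the values are assumptions for illustration only; they are not from the book.

```r
# Hypothetical wide data: one row per patient, one column per visit
wide <- data.frame(
  id = c(1, 2),
  weight.1 = c(70, 82),
  weight.2 = c(71, 80)
)

# Wide -> long: each (id, visit) pair becomes its own row
long <- reshape(
  wide,
  direction = "long",
  varying = c("weight.1", "weight.2"),
  v.names = "weight",
  timevar = "visit",
  times = 1:2,
  idvar = "id"
)

# Long -> wide: all visits for one id go back into a single row
wide_again <- reshape(
  long,
  direction = "wide",
  idvar = "id",
  timevar = "visit",
  v.names = "weight"
)
```

Here long has one row per id and visit, while wide_again has one row per id with a weight column for each visit.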

Also, a dataset can be “melted” into long form and then cast to a desired shape using the reshape package (http://cran.r-project.org/web/packages/reshape/reshape.pdf):

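library(reshape)
# melt to long form: one row per (id, variable, value)
melted_data <- melt(data)
# cast to the desired shape; FORMULA is a placeholder, e.g. id ~ variable
desired_shape_data <- cast(melted_data, FORMULA)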
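A small, concrete version of the melt and cast steps might look like the sketch below. It assumes the reshape package is installed; the data frame and its column names (id, height, weight) are hypothetical.

```r
library(reshape)  # assumes the reshape package is installed

# Hypothetical wide data: one row per subject
wide <- data.frame(id = c(1, 2), height = c(170, 182), weight = c(70, 82))

# melt: long form with columns id, variable, value
long <- melt(wide, id.vars = "id")

# cast: back to one row per id, one column per variable
wide_again <- cast(long, id ~ variable)
```

Changing the formula passed to cast (for example variable ~ id) gives a different shape from the same melted data.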

Very useful.

Example: Representing logic for genetic counselling referral

February 12, 2010

The red node is a subflow.
major and minor are counts of major risk items and minor risk items.
E.g., a history of breast cancer under age 50 is a major risk item.