resources

Notes, links and commands from the work we did.

Git Repository

posted Jul 8, 2015, 6:15 PM by Onno Benschop

Our source code is stored in the following repositories:

image source: data centre

posted Jul 4, 2015, 6:49 PM by Aisling Blackmore

http://codecondo.com/wp-content/uploads/2014/04/9-Free-Books-for-Learning-Data-Mining-Data-Analysis.jpg

wc

posted Jul 4, 2015, 1:27 AM by Onno Benschop   [ updated Jul 5, 2015, 12:05 AM ]

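Count the words in every file in the current directory, merge the counts into word_list.csv (file, word, count), and count words from stdin with each row prefixed by ${url}:
  • mkdir -p tmp ; for n in * ; do [ -f "$n" ] || continue ; tr -sc '[:alnum:]' '\n' < "$n" | sort -f | uniq -ic | sort -rn > "./tmp/$n" ; done
  • for n in ./tmp/* ; do tr -s ' ' , < "$n" | sed "s|^|$n|" ; done | awk -F, '{print $1,$3,$2}' OFS=, > word_list.csv
  • tr -sc '[:alnum:]' '\n' | sort -f | uniq -ic | sort -rn | tr -s ' ' , | sed -e "s|^|${url}|"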
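
The word-count pipeline above can be wrapped in a function so it is easy to try on a small input. This is a minimal sketch, not the exact commands we ran: the name word_freq and the label sample.txt are made up for illustration, and it assumes a sort that supports -f (fold case) so that "The" and "the" end up next to each other before uniq -i counts them.

```shell
# word_freq LABEL: read text on stdin, write LABEL,word,count rows on stdout.
# LABEL is a hypothetical source name (a file name or a URL) chosen by the caller.
word_freq() {
    tr -sc '[:alnum:]' '\n' |   # split into one word per line
        sed '/^$/d' |           # drop the empty line left by leading punctuation
        sort -f |               # fold case so equal words are adjacent
        uniq -ic |              # count, ignoring case
        sort -rn |              # most frequent first
        awk -v label="$1" '{print label, $2, $1}' OFS=,
}

# Small demo on inline text instead of a real file.
printf 'the cat and the hat\n' | word_freq sample.txt
```

The first row of the demo is sample.txt,the,2; the order of the words that appear once depends on the locale.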

aws credit

posted Jul 3, 2015, 8:45 PM by Onno Benschop

  • https://www.govhack.org/amazon-web-services/

moore's law of big data

posted Jul 3, 2015, 8:44 PM by Onno Benschop

  • http://www.ni.com/newsletter/51649/en/

5 tb download during lunch

posted Jul 3, 2015, 7:25 PM by Onno Benschop

  • http://dius.com.au/2014/01/07/eat-5-terabytes-lunch-hour-elastic-mapreduce/

elastic map reduce

posted Jul 3, 2015, 6:48 PM by Onno Benschop   [ updated Jul 3, 2015, 6:55 PM ]

To do a word count across a large data-set:
  • https://aws.amazon.com/articles/Elastic-MapReduce/2273
  • http://hci.stanford.edu/courses/cs448g/a2/
The awscli reference for EMR:
  • http://docs.aws.amazon.com/cli/latest/reference/emr/index.html
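
To get a feel for the map and reduce steps before paying for a cluster, the word count can be simulated locally. This is a rough sketch in the style of Hadoop streaming (tab-separated key and value lines on stdin and stdout), not code taken from the articles above; the function names mapper and reducer are our own, and a plain sort stands in for the shuffle phase.

```shell
# mapper: emit one "word<TAB>1" pair per word read from stdin.
mapper() {
    tr -sc '[:alnum:]' '\n' | tr '[:upper:]' '[:lower:]' | awk 'NF {print $1 "\t1"}'
}

# reducer: input must be sorted by key so equal words are adjacent;
# sum the counts for each run of identical keys.
reducer() {
    awk -F'\t' '
        $1 != key { if (NR > 1) print key "\t" sum; key = $1; sum = 0 }
        { sum += $2 }
        END { if (NR > 0) print key "\t" sum }'
}

# Local stand-in for the shuffle phase: sort brings equal keys together.
printf 'to be or not to be\n' | mapper | sort | reducer
```

On a real cluster the sort between the two steps is done by the framework, and many mappers and reducers run in parallel on slices of the data.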

mongodb in aws

posted Jul 3, 2015, 6:45 PM by Onno Benschop

To deploy MongoDB within AWS for massively parallel performance and storage, I read these documents:
  • https://d0.awsstatic.com/whitepapers/AWS_NoSQL_MongoDB.pdf
  • https://s3.amazonaws.com/quickstart-reference/mongodb/latest/doc/MongoDB_on_the_AWS_Cloud.pdf

awscli

posted Jul 3, 2015, 6:43 PM by Onno Benschop

To use Elastic MapReduce on Debian, install awscli:
  • aptitude install python-pip
  • pip install awscli
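
A quick way to confirm the install worked is to check that the aws command is on the PATH. This is only a sketch: need_cmd is a helper name made up here, and it does not check credentials, which still have to be set up (for example with aws configure) before any EMR command will run.

```shell
# need_cmd NAME: succeed if NAME is on PATH, otherwise report it and fail.
need_cmd() {
    command -v "$1" >/dev/null 2>&1 || { echo "missing: $1" >&2; return 1; }
}

# Print the installed version only when the aws command is actually present.
if need_cmd aws; then
    aws --version
fi
```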
