Friday, August 29, 2014

About the Google API

 * Google Drive API

Create a project and enable the API in the developer console:
https://console.developers.google.com/project

Python quickstart documentation:
https://developers.google.com/drive/web/quickstart/quickstart-python

-- sample: copy a document to Google Drive
#!/usr/bin/python
#Copy a document to google drive.

import httplib2
import pprint

from apiclient.discovery import build
from apiclient.http import MediaFileUpload
from oauth2client.client import OAuth2WebServerFlow

from oauth2client.file import Storage
from oauth2client.util import logger


# Copy your credentials from the console
CLIENT_ID = '955147XXX1-o0jt3XXXXXXXXXXXXXXXXXXpfsnfe.apps.googleusercontent.com'
CLIENT_SECRET = 'UgXXXXXXXXXXXEoU7hG'

# Check https://developers.google.com/drive/scopes for all available scopes
OAUTH_SCOPE = 'https://www.googleapis.com/auth/drive'

# Redirect URI for installed apps
REDIRECT_URI = 'urn:ietf:wg:oauth:2.0:oob'

# Path to the file to upload
FILENAME = 'document.txt'

# Run through the OAuth flow and retrieve credentials
flow = OAuth2WebServerFlow(CLIENT_ID, CLIENT_SECRET, OAUTH_SCOPE, REDIRECT_URI)

# Check whether stored OAuth credentials already exist
storage = Storage('OAuthCredentials.txt')
credentials = storage.get()

if credentials is None:
    # Authorization step 1: have the user authorize access in a browser
    authorize_url = flow.step1_get_authorize_url()
    print 'Go to the following link in your browser: ' + authorize_url
    code = raw_input('Enter verification code: ').strip()
    # Authorization step 2: exchange the verification code for credentials
    credentials = flow.step2_exchange(code)

# store to local file
storage.put(credentials)

# Create an httplib2.Http object and authorize it with our credentials
http = httplib2.Http()
http = credentials.authorize(http)

drive_service = build('drive', 'v2', http=http)

# Insert a file
media_body = MediaFileUpload(FILENAME, mimetype='text/plain', resumable=True)
body = {
  'title': 'InsertTest1',
  'description': 'A test document',
  'mimeType': 'text/plain'
}

file = drive_service.files().insert(body=body, media_body=media_body).execute()
pprint.pprint(file)
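As a small follow-up (not part of the original sample; the call and field names assume the Drive v2 API used above), the same drive_service object can list a few files to confirm the upload:

# Illustrative check: list a few Drive files and print their titles and IDs.
result = drive_service.files().list(maxResults=5).execute()
for item in result.get('items', []):
    print item['title'], item['id']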

Wednesday, August 20, 2014

Image analysis with Python

1. matplotlib (a Python 2D plotting library)
 - site (api)
http://matplotlib.org/api/pyplot_api.html

2. OpenCV (an open-source computer vision and machine learning library)
 - site (api)
http://opencv.org/
 - template matching
http://docs.opencv.org/doc/tutorials/imgproc/histograms/template_matching/template_matching.html#which-are-the-matching-methods-available-in-opencv

3. NumPy (the fundamental package for scientific computing with Python)
http://www.numpy.org/

4. sample code
# workaround for my environment: force the TkAgg backend before pyplot is imported
import matplotlib
matplotlib.use('TkAgg')
# begin
import cv2
import numpy as np
from matplotlib import pyplot as plt

pic = '/home/fox/park.JPG'
compare_pic = '/home/fox/car.JPG'

img_rgb = cv2.imread(pic)
img_gray = cv2.cvtColor(img_rgb, cv2.COLOR_BGR2GRAY)
template = cv2.imread(compare_pic,0)
w, h = template.shape[::-1]

res = cv2.matchTemplate(img_gray,template,cv2.TM_CCOEFF_NORMED)
threshold = 0.7   # to find only the best match, use the maximum score instead (np.amax(res); see the sketch below)
loc = np.where( res >= threshold)
for pt in zip(*loc[::-1]):
   cv2.rectangle(img_rgb, pt, (pt[0] + w, pt[1] + h), (0,0,255), 2)

cv2.imwrite('/home/fox/res3.png',img_rgb)
plt.subplot(121),plt.imshow(res,cmap = 'gray')
plt.title('Matching Result'), plt.xticks([]), plt.yticks([])
plt.subplot(122),plt.imshow(img_rgb,cmap = 'gray')
plt.title('Detected Point'), plt.xticks([]), plt.yticks([])
plt.show()

Images shown in the post, top to bottom: car.JPG, park.JPG, best_one.JPG, density_0.6.JPG
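If only the single best match is needed (as the inline comment hints), here is a minimal sketch using cv2.minMaxLoc instead of a fixed threshold. It reuses res, img_rgb, w, h from the script above; the output path is just an example:

# Locate only the best match: for TM_CCOEFF_NORMED the maximum value wins.
min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(res)
top_left = max_loc
bottom_right = (top_left[0] + w, top_left[1] + h)
cv2.rectangle(img_rgb, top_left, bottom_right, (255, 0, 0), 2)
cv2.imwrite('/home/fox/best_match.png', img_rgb)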

Monday, August 18, 2014

Some things about Linux

Referred link:
http://www.tecmint.com/install-google-chrome-on-redhat-centos-fedora-linux/

Step 1: Enable the Google YUM repository

Create a file called /etc/yum.repos.d/google-chrome.repo and add the following lines to it.
[google-chrome]
name=google-chrome
baseurl=http://dl.google.com/linux/chrome/rpm/stable/$basearch
enabled=1
gpgcheck=1
gpgkey=https://dl-ssl.google.com/linux/linux_signing_key.pub

Step 2: Install the Chrome Web Browser

Download and install Chrome with the yum command; it will pull in all dependencies automatically.
# yum install google-chrome-stable
Update: Sadly, the Google Chrome browser no longer supports the most famous commercial distribution, Red Hat, or its free clones such as CentOS and Scientific Linux.
Yes, Google has discontinued Chrome support for RHEL 6.x, while the latest Firefox and Opera browsers still run fine on the same platforms.
Luckily, there is a script developed by Richard Lloyd that automatically downloads and installs the latest Google Chrome browser, picking libraries from a more recently released distro and putting them in /opt/google/chrome/lib, so that Google Chrome can run on CentOS 6.x.
# wget http://chrome.richardlloyd.org.uk/install_chrome.sh
# chmod u+x install_chrome.sh
# ./install_chrome.sh

In effect, this partially upgrades the OS!



Tuesday, August 12, 2014

Using Postgres-XL

INSTALL
-- initialize each instance
initgtm -Z gtm -D /var/lib/pgxl/9.2/data_gtm
initdb -D /var/lib/pgxl/9.2/coord01 --nodename coord01
initdb -D /var/lib/pgxl/9.2/data01 --nodename data01
initdb -D /var/lib/pgxl/9.2/data02 --nodename data02
-- start each instance
gtm_ctl -Z gtm start -D /var/lib/pgxl/9.2/data_gtm
pg_ctl start -D /var/lib/pgxl/9.2/data01 -Z datanode -l logfile
pg_ctl start -D /var/lib/pgxl/9.2/data02 -Z datanode -l logfile
pg_ctl start -D /var/lib/pgxl/9.2/coord01 -Z coordinator -l logfile
-- referred to http://files.postgres-xl.org/documentation/index.html

DEBUG
-- if all the instances run on a single host, give each one its own values in postgresql.conf
-- (the commands below assume data01 on port 5433 and data02 on port 5434):
port = 5432 ~ X
pooler_port = 6668 ~ Y
-- register the nodes with each other (this step is missing from the manual; a verification sketch follows below)
psql -c "EXECUTE DIRECT ON (coord01) 'CREATE NODE data01 WITH (TYPE = ''datanode'', HOST = ''localhost'', PORT = 5433)'" postgres
psql -c "EXECUTE DIRECT ON (coord01) 'CREATE NODE data02 WITH (TYPE = ''datanode'', HOST = ''localhost'', PORT = 5434)'" postgres
psql -c "EXECUTE DIRECT ON (data01) 'ALTER NODE data01 WITH (TYPE = ''datanode'', HOST = ''localhost'', PORT = 5433)'" postgres
psql -c "EXECUTE DIRECT ON (data01) 'CREATE NODE data02 WITH (TYPE = ''datanode'', HOST = ''localhost'', PORT = 5434)'" postgres
psql -c "EXECUTE DIRECT ON (data01) 'SELECT pgxc_pool_reload()'" postgres
psql -c "EXECUTE DIRECT ON (data02) 'CREATE NODE data01 WITH (TYPE = ''datanode'', HOST = ''localhost'', PORT = 5433)'" postgres
psql -c "EXECUTE DIRECT ON (data02) 'ALTER NODE data02 WITH (TYPE = ''datanode'', HOST = ''localhost'', PORT = 5434)'" postgres
psql -c "EXECUTE DIRECT ON (data02) 'SELECT pgxc_pool_reload()'" postgres
-- referred to http://sourceforge.net/p/postgres-xl/tickets/18/
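To confirm that the nodes were registered correctly, here is a minimal sketch using psycopg2 (the connection parameters are assumptions; adjust the database, user, and port to your coordinator):

#!/usr/bin/python
# Hypothetical check: connect to the coordinator and list the registered nodes.
import psycopg2

conn = psycopg2.connect("dbname='postgres' user='postgres' host='localhost' port=5432")
cur = conn.cursor()
cur.execute("SELECT node_name, node_type, node_host, node_port FROM pgxc_node")
for row in cur.fetchall():
    print row
conn.close()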


Friday, August 8, 2014

Sample code for collecting data


1. data collection using python

# python library for pulling data out of html or xml
# http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html
-- pulling_data.py
import codecs
import urllib2
from bs4 import BeautifulSoup

f = urllib2.urlopen('http://www.daum.net')
html_doc = f.read()

soup = BeautifulSoup(html_doc)
# use UTF-8 so Hangul text is written correctly
with codecs.open('result_daum.txt','w',encoding='utf8') as f:
        for s in soup.body.strings:
                f.write(s)

# soup.strings yields every string in the document.
# soup.body.strings yields only the strings that are descendants of <body>.
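A tiny illustrative snippet (reusing the soup object above) to see the difference:

# soup.strings walks the whole document (including <head>),
# soup.body.strings only the text under <body>.
print len(list(soup.strings)), len(list(soup.body.strings))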


2. Ingest data, count each word, and write the result to PostgreSQL

import codecs
import urllib2
from bs4 import BeautifulSoup

# get a site page
site = 'http://www.auction.co.kr'

f = urllib2.urlopen(site)
html_doc = f.read()

result = []

soup = BeautifulSoup(html_doc)

# follow each link on the front page to fetch the second-level pages
for link in soup.find_all('a'):
        link_tmp = link.get('href')
        try:
                f = urllib2.urlopen(link_tmp)
                html_doc = f.read()
                soup = BeautifulSoup(html_doc)
                for str in soup.body.strings:
                        result.append(str)
        except:
                pass

# count occurrences of each unique word
wordcount={}

for line in result:
        for word in line.split():
                if word not in wordcount:
                        wordcount[word] = 1
                else:
                        wordcount[word] += 1

f.close()

#with codecs.open('get.txt','w',encoding='utf8') as f:
#        for word,cnt in wordcount.items():
#                f.write("%s     %d\n" % (word,cnt))

# write the result to PostgreSQL (database 'ant'; see the table sketch after this script)
import psycopg2

try:
    conn = psycopg2.connect("dbname='ant' user='ant' host='zoo' password='ant'")
except:
    print "I am unable to connect to the database"

cur = conn.cursor()

for word, cnt in wordcount.items():
        cur.execute("INSERT INTO commerce(tm,site,lev,word,cnt) VALUES (now(),%s,2,%s, %s)", (site,word,cnt,) )

conn.commit()
conn.close()
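The script assumes the commerce table already exists; its actual DDL is not shown in the post, so this is a hypothetical definition that matches the INSERT above:

# Hypothetical one-off setup: create the commerce table used by the INSERT.
import psycopg2

conn = psycopg2.connect("dbname='ant' user='ant' host='zoo' password='ant'")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS commerce (
        tm   timestamp,
        site text,
        lev  integer,
        word text,
        cnt  integer
    )
""")
conn.commit()
conn.close()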


3. Using the Twitter API with the Python tweepy library
import tweepy

consumer_key = '0EBFhXXXXXXXgcG9ouIGZ6l'
consumer_secret = 'spmIzBLO24MqEXXXXXXXXXXXX35K4FyUlLoAw'

access_token = '151351809-REDL20AXXXXXXXXXXXXXXGhsDRCjd9Y0jtrDH'
access_token_secret = 'lawyXXXXXXXXXXXXXXXXXXXlon0YmtwZTd'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

public_tweets = api.home_timeline()
for tweet in public_tweets:
    print tweet.text
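The same api object can also search public tweets; a small sketch (the query string is just an example, and the call follows the tweepy API of that era):

# Search recent public tweets for a keyword.
for tweet in api.search(q='bigdata'):
    print tweet.text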

Sample code for handling data in the Hadoop framework

1. flume

-- fox.conf
# Name the components on this agent
# fox -> zoo -> koala
agent.sinks = koala
agent.sources = fox
agent.channels = zoo

# Describe/configure the source
agent.sources.fox.type = spooldir
agent.sources.fox.spoolDir = /home/flume/dump

# Describe the sink
agent.sinks.koala.type = hdfs
agent.sinks.koala.hdfs.path = /flume/events
agent.sinks.koala.hdfs.fileType = DataStream
agent.sinks.koala.hdfs.writeFormat = Text
agent.sinks.koala.hdfs.rollSize = 0
agent.sinks.koala.hdfs.rollCount = 10000

# Use a channel which buffers events in memory
agent.channels.zoo.type = file

# Bind the source and sink to the channel
agent.sources.fox.channels = zoo
agent.sinks.koala.channel = zoo

-- start the agent with the configuration above (fox.conf)
shell$ flume-ng agent --conf conf --conf-file fox.conf --name agent


2. hcatalog

hcat -e "create table koala (cnt bigint, wd string)"


3. pig

a = load '/flume/events/*';
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
c = group b by word;
d = foreach c generate COUNT(b) as cnt, group as wd;
store d into 'koala' using org.apache.hcatalog.pig.HCatStorer();


4. hive

select wd, cnt from koala order by cnt desc limit 10;

Popular Baby Names Top 50 since 1980

I am studying data analysis with R.
First, I wondered how many people have used my name, among other things.

The word cloud gave the best visualization.


The histogram-style plot for my name, Mark, by year.


R code tested with R version 3.1.1 and RStudio version 0.98.978
# national popular baby names
# url : http://www.ssa.gov/oact/babynames/limits.html
# national data .zip

# get a file list
setwd("C:/Users/Mark/Downloads/names")
files<-list.files()
files<-files[grepl(".txt",files)]
files<-files[files!="NationalReadMe.pdf"]

# import data to data frame
fox <- NULL
for (i in 1:length(files))
{
  data <- read.csv(files[i], header=F)
  data["year"] <- substr(files[i],4,7)
  fox <- rbind(fox,data)
}

# assign column name to data frame
colnames(fox) <- c('name','gender','cnt','year')

# word count
library(sqldf)
koala <- sqldf("select name,sum(cnt) as cnt from fox group by name")

# drawing a word cloud
library(wordcloud)
wordcloud( as.character(koala$name),as.integer(koala$cnt),
           scale=c(5,0.5), max.words=50, random.order=FALSE,
           rot.per=0.35, use.r.layout=FALSE,
           colors=brewer.pal(8, "Dark2"))

# drawing a plot of one name's count by year
koala <- sqldf("select year,sum(cnt) as cnt from fox where name = 'Mark' group by year")
plot (koala, type = 'h', ylab = 'Baby Name (Mark)s Count', col = 'Purple')

Tuesday, July 15, 2014

Useful MongoDB scripts

# get an average over a collection with a condition
db.POINT_TOTAL_OBS_STATION_DATA.group(
   { cond: { obs_item_id : "OBSCD00074" }
   , initial: {count: 0, total:0}
   , reduce: function(doc, out) { out.count++ ; out.total += doc.v1 }
   , finalize: function(out) { out.avg = out.total / out.count }

} )

# the aggregation pipeline performs better, though I don't know exactly why... maybe hashing!!
db.POINT_TOTAL_OBS_STATION_DATA.aggregate( [ 
     { $match: { obs_item_id : "OBSCD00074" } }, 
     { $group: { _id : 0 , v1_avg : { $avg: "$v1"} } } ] )


# group by each value of obs_item_id
db.POINT_TOTAL_OBS_STATION_DATA.aggregate( [ 
     { $group: { _id : { key : "$obs_item_id" },  v1_avg : { $avg: "$v1"} } } ] )


# join query for a special case
db.POINT_TOTAL_OBS_STATION_DATA.aggregate( [{ $group: { _id : "$obs_item_id"} } , { $out : "fox_out" } ] )
fox = db.fox_out.find().toArray()
for ( var i = 0 ; i < fox.length ; i ++ ) {   db.fox_result.insert (db.OBS_ITEM_CODE.find( { obs_item_id : fox[i]._id }, { obs_item_id : 1, item_name_kor : 1 } ).toArray() ) }
db.fox_result.find().sort( { item_name_kor : 1 } )


# average over a given time range
db.POINT_TOTAL_OBS_STATION_DATA.aggregate( [
     { $match: { tm : { $gte : '2011-01-01 00:00:00', $lt : '2012-01-01 00:00:00' } }},
     { $group: { _id : 0 , v1_avg : { $avg: "$v1"} } } ] )


# Create index
db.POINT_TOTAL_OBS_STATION_DATA.ensureIndex( { obs_item_id : 1 } )
db.POINT_TOTAL_OBS_STATION_DATA.ensureIndex( { obs_time : 1 } )
# ... create more as needed; to list the existing indexes:
db.system.indexes.find()

# how to check elapsed time (query profiling)
db.setProfilingLevel(0)  // disable profiling
db.setProfilingLevel(1)  // profile slow operations only (level 1)
db.system.profile.find().limit(10).sort( { ts : -1 } ).pretty()


Monday, July 7, 2014

Assorted R sample code

# R environment settings
-- to change the Java heap size
-- R_HOME/etc/Rprofile.site
options(java.parameters = c("-Xmx16g","-Dfile.encoding=UTF-8"))
-- to read a UTF-8 encoded file
f <- file("d:/parser.txt", blocking=F,encoding="UTF-8")
txtLines <- readLines(f)


# 1. how to collect stock info

install.packages("fImport")
library(fImport)
s_e <- yahooSeries("005935.KS")
plot(s_e)


# 2. gathering Twitter posts

install.packages("twitteR");install.packages("tm");
install.packages("wordcloud")
library(twitteR)
library(ROAuth)
library(RCurl)
library(tm)
library(wordcloud)
library(RColorBrewer)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "EOpIQtUgZmLGX04G1BYGIPDSE"
consumerSecret <- "N1OtR9YfulAL9OMjtN51XrAQ7DlnrjarfZo0Y6YJn11MOrfnXf"
twitCred <- OAuthFactory$new(consumerKey=consumerKey,consumerSecret = consumerSecret, requestURL = reqURL, accessURL = accessURL, authURL = authURL)
download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")
twitCred$handshake(cainfo = "cacert.pem")
save(list='twitCred',file="twitteR_credentials")

registerTwitterOAuth(twitCred)
hilton.tweets <- searchTwitter('@ford',n=90,cainfo='cacert.pem')
hilton.tweets

install.packages("plyr")
library(plyr)
hilton.tweets[1]
hilton.txt <- laply(hilton.tweets,function(t) t$getText())
hilton<-as.data.frame(hilton.txt)


# 3. network analysis

install.packages('igraph')
library(igraph)
ga.data <- read.csv('http://www.babelgraph.org/data/ga_edgelist.csv', header=TRUE)
str(ga.data)
head(ga.data)
g1 <- graph.data.frame(ga.data, directed=FALSE)
summary(g1)
str(g1)
V(g1)
E(g1)

set.seed(2020)
plot(g1)


# 4. word cloud

library(NLP)
library(tm)
library(SnowballC)
# source directory
lords <- Corpus (DirSource("/Users/Mark/Desktop/AnalysisR/tmp01"))
inspect(lords)
lords <- tm_map(lords, stripWhitespace)
lords <- tm_map(lords, PlainTextDocument)
lords <- tm_map(lords, removeWords, stopwords("english"))
lords <- tm_map(lords, stemDocument)
wordcloud(lords, scale=c(5,0.5), max.words=100, random.order=FALSE,
          rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))
# control cloud graph
lords <- tm_map(lords, removeWords, c("textbook","book"))


# 5. loading an image and extracting its RGB channels

# install.packages("jpeg")
library(jpeg)

# read jpg image to matrix
slide1 <- readJPEG("brazil.jpg")
# matrix scale (dimension)
dim(slide1)

library(raster)
library(grid)

# extract each channel (readJPEG returns them in R, G, B order)
rst.red <- raster(slide1[,,1])
rst.green <- raster(slide1[,,2])
rst.blue <- raster(slide1[,,3])
# convert each channel back to a plain matrix
fox.red <- as.matrix(rst.red)
fox.green <- as.matrix(rst.green)
fox.blue <- as.matrix(rst.blue)
# drawing
grid.raster(slide1)
grid.raster(fox.blue)


# 6. linear and integer programming
# - finding a feasible integer point that maximizes the objective

install.packages("lpSolve",dependencies=T)
library(lpSolve)

# weight of elements (A,B,C)
f.obj <- c(1,2,1)
f.con <- matrix (c(1, 1, 1,
                   1, 0, 0,
                   1, 1, 0,
                   0, 1, 0,
                   0, 0, 1), nrow=5, byrow=T)

f.dir <- c( "<=", ">=", "<=", ">=",">=")
f.rhs <- c(90,30,75,10,20)

lp ("max", f.obj, f.con, f.dir, f.rhs)
lp ("max", f.obj, f.con, f.dir, f.rhs)$solution


# 7. get the google map

library(ggmap)
# central park
mapImageData1 <- get_map(location = c(lon = -73.960987, lat = 40.783598),
                         color = c("color"),source = "google",
                         maptype = c("satellite"),zoom = 14)

ggmap(mapImageData1,
      extent = "device",
      ylab = "Latitude",
      xlab = "Longitude")


# 8. using MongoDB

# library(rmongodb) - lower-level, more detailed API
# ...
library(RMongo)     # simpler, easier to use
mongo <- mongoDbConnect("zoo", "192.168.50.81", 27017)
output <- dbInsertDocument(mongo, "fox_r", '{"foo": "fox","type":"anymal"}')
output <- dbGetQuery(mongo, "fox_r",'{"foo": "bar"}')
print (output)

# 9. connect to PostgreSQL

library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="elephant",host="192.168.50.82",port=5432,user="fox",password="pw")
koala <- dbGetQuery(con,"select * from koala")
...
# mark the text encoding as UTF-8 for Hangul

Encoding(koala$contents) <- "UTF-8"


# 10. extract nouns from Hangul text

library(KoNLP)
library(RColorBrewer)
library(wordcloud)
f <- file("d:/parser.txt", blocking=F)

txtLines <- readLines(f)
Encoding(txtLines) <- "UTF-8"

nouns <- sapply(txtLines, extractNoun, USE.NAMES=F)
close(f)
wordcount <- table(unlist(nouns))
pal <- brewer.pal(12,"Set3")
pal <- pal[-c(1:2)]
wordcloud(names(wordcount),freq=wordcount,scale=c(6,0.3),min.freq=40,
          random.order=T,rot.per=.1,colors=pal)

Monday, April 28, 2014

How to use bzip2, which supports splittable compression, with the hadoop-streaming package.

Sample code for testing

1. compression

# To produce one compressed file per mapper (map-only job).
hadoop jar hadoop-streaming-2.2.0.2.1.0.0-92.jar \
    -D mapreduce.output.fileoutputformat.compress=TRUE \
    -D mapreduce.output.fileoutputformat.compress.type=RECORD \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
    -D mapreduce.job.reduces=0 \
    -mapper /bin/cat \
    -input <input files> \
    -output <output directory>
#    -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat

# To produce a single compressed file (one reducer).
hadoop jar hadoop-streaming-2.2.0.2.1.0.0-92.jar \
    -D mapreduce.output.fileoutputformat.compress=TRUE \
    -D mapreduce.output.fileoutputformat.compress.type=RECORD \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
    -D mapreduce.job.reduces=1 \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -input <input files> \
    -output <output directory>

# To produce compressed files with a specific HDFS block size (using Pig).
- Pig
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;
A = LOAD '/lab/efdc/EFDC_2010020100.nc' USING PigStorage();
SET dfs.block.size 36700160;
STORE A INTO '/lab/Dave/output/pig.nc' USING PigStorage();

# copy a file to a new file with a specific block size
hadoop fs -D dfs.block.size=36700160 -cp <source directory> <target directory>


2. decompression & analysis

hadoop jar hadoop-streaming-2.2.0.2.1.0.0-92.jar \
        -D mapred.input.compress=true \
        -D mapred.input.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
        -files ./mapper.py,./reducer.py -mapper ./mapper.py -reducer ./reducer.py \
        -input <bzip2 compressed files> -output <output directory>
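The job above assumes mapper.py and reducer.py already exist; their contents are not shown in the post, so here is a hypothetical word-count pair as a sketch:

-- mapper.py (hypothetical)
#!/usr/bin/env python
# Emit "word<TAB>1" for every whitespace-separated token on stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print '%s\t1' % word

-- reducer.py (hypothetical)
#!/usr/bin/env python
# Sum the counts per word; streaming delivers input sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print '%s\t%d' % (current_word, current_count)
        current_word, current_count = word, int(count)
if current_word is not None:
    print '%s\t%d' % (current_word, current_count)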

3. ETC
# the properties below were deprecated in favor of the ones that follow
    -D mapred.input.compress=TRUE \
    -D mapred.input.compress.type=BLOCK \
    -D mapred.input.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
# their replacements
    -D mapreduce.output.fileoutputformat.compress=TRUE \
    -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \

Full list of deprecated properties:
http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.1.2/hadoop-project-dist/hadoop-common/DeprecatedProperties.html

# default block size
hdfs getconf -confKey dfs.blocksize

# actual block size of a file
hadoop fs -stat %o <hdfs directory>