Thursday, November 23, 2017

JavaScript Closures: setTimeout and loop


for (var i = 0; i < 3; i++) {
    setTimeout(function() {
        console.log("My value: " + i);
    }, 1);

My value: 3
My value: 3
My value: 3



Variable Global scope, and JavaScript asynchronous event loop single thread execution.


// Block scope, ES2005
for (let i = 0; i < 3; i++) {
    setTimeout(function() {
        console.log("My value: " + i);
    }, 1000);

//Immediately Invoked creation, and function scope.
for (var i = 0; i < 3; i++) {
    setTimeout((function(pNum) {
        return function() {
            console.log('i: ', pNum)
    })(i), 1000);

//  IIFE (Immediately Invoked Function Expression)
for (var i = 0; i < 3; i++) {
    (function(pNum = i) {
            function() {
                console.log('i : ', pNum)
            }, 1000)

Enjoy the sunshine,

Tuesday, November 21, 2017

Running jupyter notebook with pyspark shell on Windows 10

# Running jupyter notebook with pyspark shell.

This notebook will not run in an ordinary jupyter notebook server.
Here is how we can load pyspark to use Jupyter notebooks.
On Mac, Linux, something like this will work:
On Windows 10, it works at 2017-Nov-21.

# Install Anaconda

## Run "Anaconda Prompt" to open command line:

%windir%\System32\cmd.exe "/K" E:\Users\datab_000\Anaconda3\Scripts\activate.bat E:\Users\datab_000\Anaconda3

Note: above commands are in one single line.

## Then set system environment variables:

rem set PYSPARK_PYTHON=python3

## start pyspark



Friday, November 17, 2017

Complete spark 2.2.0 installation guide on windows 10.

The step by step Apache Spark 2.2.0 installation guide on windows 10.

Download winutils.exe from git,

mkdir /tmp/hive

winutils.exe chmod -R 777 E:\tmp\hive
winutils.exe chmod -R 777 /tmp/hive

set HADOOP_HOME=E:\movie\spark\hadoop
mkdir %HADOOP_HOME%\bin
copy winutils.exe to %HADOOP_HOME%\bin

Download Spark: spark-2.2.0-bin-hadoop2.7.tgz from
cd E:\movie\spark\
# in MINGW64 (git / cywin)
tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz
# or use 7-Zip

cd E:\movie\spark\spark-2.2.0-bin-hadoop2.7

%HADOOP_HOME%\bin\winutils.exe must be locatable.

Folder "E:\movie\spark\hadoop" is just an example, it can be any folder.

Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+.


Here is the example output when start pyspark successfully:


Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/11/17 19:07:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/17 19:07:36 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0

Using Python version 3.6.1 (v3.6.1:69c0db5, Mar 21 2017 18:41:36)
SparkSession available as 'spark'.
>>> textFile ="")
17/11/17 19:08:03 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
>>> textFile.count()

>>>, "\s+")).name("word")).groupBy("word").count().show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'explode' is not defined

>>> from pyspark.sql.functions import *
>>>, "\s+")).name("word")).groupBy("word").count().show()
|                word|count|
|              online|    1|
|              graphs|    1|
|          ["Parallel|    1|
|          ["Building|    1|
|              thread|    1|
|       documentation|    3|
|            command,|    2|
|         abbreviated|    1|
|            overview|    1|
|                rich|    1|
|                 set|    2|
|         -DskipTests|    1|
|                name|    1|
|page](http://spar...|    1|
|        ["Specifying|    1|
|              stream|    1|
|                run:|    1|
|                 not|    1|
|            programs|    2|
|               tests|    2|
only showing top 20 rows


from pyspark.sql.functions import *

# module for pyspark,
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import *
from pyspark.sql import *

Employee = Row("empno", "ename", "job", "mgr", "hiredate", "sal", "comm", "deptno")

emp1 = Employee(7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10)
emp2 = Employee(7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20)

df1 = sqlContext.createDataFrame([emp1, emp2])

sparkSession = SparkSession.builder.master("local").appName("Window Function").getOrCreate()

empDF = sparkSession.createDataFrame([
      Employee(7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10),
      Employee(7499, "ALLEN", "SALESMAN", 7698, "20-Feb-81", 1600, 300, 30),
      Employee(7521, "WARD", "SALESMAN", 7698, "22-Feb-81", 1250, 500, 30),
      Employee(7566, "JONES", "MANAGER", 7839, "2-Apr-81", 2975, 0, 20),
      Employee(7654, "MARTIN", "SALESMAN", 7698, "28-Sep-81", 1250, 1400, 30),
      Employee(7698, "BLAKE", "MANAGER", 7839, "1-May-81", 2850, 0, 30),
      Employee(7782, "CLARK", "MANAGER", 7839, "9-Jun-81", 2450, 0, 10),
      Employee(7788, "SCOTT", "ANALYST", 7566, "19-Apr-87", 3000, 0, 20),
      Employee(7839, "KING", "PRESIDENT", 0, "17-Nov-81", 5000, 0, 10),
      Employee(7844, "TURNER", "SALESMAN", 7698, "8-Sep-81", 1500, 0, 30),
      Employee(7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20)

partitionWindow = Window.partitionBy("deptno").orderBy(desc("empno"))
sumTest = sum("sal").over(partitionWindow)"*","PartSum")).show()

partitionWindowRow = Window.partitionBy("deptno").orderBy(desc("sal")).rowsBetween(-1, 1)
sumTest = sum("sal").over(partitionWindowRow)"*","PartSum")).orderBy("deptno").show()

Monday, March 13, 2017

code review

The biggest rule is that the point of code review is to find problems in code before it gets committed - what you're looking for is correctness.
The most common mistake in code review - the mistake that everyone makes when they're new to it - is judging code by whether it's what the reviewer would have written.
Remember that there are many ways to solve a problem.


  • What you're looking for is correctness.

Programming decisions are a matter of opinion. Reviewers and developers should seek to understand each other’s perspective but shouldn’t get into a philosophical debate.
  • Be humble.

Don’t be a prick.  :-)
“I didn’t see where these variables were initialized”.
“What do you think about Standard DRY and if it’s applies here?”, 
“I don’t understand why this is a global variable “
  • Make sure you have coding standards in place.

Coding standards are shared set of guidelines in an organization with buy-in from everyone. If you don’t have coding standards, then don’t let the discussion turn into a pissing contest over coding styles (opening braces ‘{‘ on the same line or the next!) If you run into a situation like that, take the discussion offline to your coding standards forum.
One of the goals of code reviews is to create ‘maintainable’ software.
  • Learn to communicate well.

You must be able to clearly express your ideas and reasons.
  • Authors should annotate source code before the review.

Authors should annotate code, what is the problem and goaldesign and solution implementation, before the review occurs, because annotations guide the reviewer through the changes, showing which files to look at first and defending the reason behind each code modification. Annotations should be directed at other reviewers to ease the process and provide more depth in context. As an added benefit, the author will often find additional errors before the peer review even begins. More bugs found prior to peer review will yield in lower defect density because fewer bugs exist overall.
Effective code reviews require a healthy culture that values quality and excellence. Code reviews will not give you the desired results if the team doesn’t believe in delivering high-quality products. You need a positive culture in which people are engaged – one that thrives on constructive criticism and allows the best ideas to win.
Peer code review can be efficient and effective while fostering open communication and knowledge-share between teammates.


Tuesday, December 13, 2016

INSERT ALL and sequence NextVal

drop table t1 purge;
drop table t2 purge;
drop table t3 purge;

create table t1(n1 number, n2 number);
create table t2(n1 number, n2 number);
create table t3(n1 number, n2 number);

create sequence s1;


    INTO t1(n1, n2) values(n1_seq, nn)
    INTO t2(n1, n2) values(n1_seq, nn) 
select s1.nextval n1_seq, rownum*10 nn from dual connect by level <= 2;

Error report -
SQL Error: ORA-02287: sequence number not allowed here
02287. 00000 -  "sequence number not allowed here"
*Cause:    The specified sequence number (CURRVAL or NEXTVAL) is inappropriate
           here in the statement.
*Action:   Remove the sequence number.


    INTO t1(n1, n2) values(s1.nextval, nn)
    INTO t2(n1, n2) values(s1.currval, nn) 
    INTO t3(n1, n2) values(s1.currval, nn) 
select rownum*10 nn from dual connect by level <= 2;

select * from t1;
select * from t2;
select * from t3;

Friday, October 14, 2016

When ANSI join comes to play?

ANSI join has been around for many years.
There are a couple of situations better off using ANSI join rather than Oracle native join.

1. Join many tables.


Use explicit JOINs rather than implicit (regardless whether they are outer joins or not) is that it's much easier to accidentally create a cartesian product with the implicit joins. With explicit JOINs you cannot "by accident" create one. The more tables are involved the higher the risk is that you miss one join condition.

2. Outer join.


Basically (+) is severely limited compared to ANSI joins. ANSI syntax is cleaner - you are not going to normal join if you forget (+) in some multi-column outer join.
In the past there were some bugs with ANSI syntax but if you go with latest 11.2 or 12.1 that should be fixed already.


  FROM table_a a 
  JOIN table_b b 
    ON a.col1 = b.col1
  JOIN table_c c
    ON b.col2 = c.col2;

Example 2:

SELECT d.department_id, e.last_name, j.job_name
  FROM departments d 
  LEFT OUTER JOIN employees e
    ON d.department_id = e.department_id
    ON e.job_id = j.job_id

Wednesday, July 13, 2016

Join 3 tables - Oracle native outer join v.s. ANSI outer join

select t1_id, t2_id, t3_id
from t1,t2,t3
where =
and =
--order by 1

     T1_ID      T2_ID      T3_ID
---------- ---------- ----------
         1          1          1
         2          2           

select t1_id, t2_id, t3_id
from t1
left join t2 on =
left join t3 on =
--order by 1

     T1_ID      T2_ID      T3_ID
---------- ---------- ----------
         1          1          1
         2          2           

Observation: They return same result.

Fyi,  init testing data.

drop table t1 purge;
drop table t2 purge;
drop table t3 purge;
create table t1 (id number);
create table t2 (id number);
create table t3 (id number);
insert into t1(id) select rownum from dual connect by level <= 3;
insert into t2(id) select rownum from dual connect by level <= 2;
insert into t3(id) select rownum from dual connect by level <= 1;