Thursday, November 23, 2017

JavaScript Closures: setTimeout and loop

Problem:

for (var i = 0; i < 3; i++) {
    setTimeout(function() {
        console.log("My value: " + i);
    }, 1);
}

My value: 3
My value: 3
My value: 3


Why:

The variable i is declared with var, so all three callbacks close over the same function-scoped (here effectively global) variable. The setTimeout callbacks run later on JavaScript's single-threaded event loop, after the loop has already finished, by which time i is 3.

Solution:

// Block scope: let (ES2015) creates a new binding of i for each iteration.
for (let i = 0; i < 3; i++) {
    setTimeout(function() {
        console.log("My value: " + i);
    }, 1000);
}

// Immediately invoked function: pass the current i into a new function scope.
for (var i = 0; i < 3; i++) {
    setTimeout((function(pNum) {
        return function() {
            console.log('i: ', pNum)
        }
    })(i), 1000);
}

//  IIFE (Immediately Invoked Function Expression)
for (var i = 0; i < 3; i++) {
    (function(pNum) {
        setTimeout(
            function() {
                console.log('i : ', pNum)
            }, 1000)
    })(i);
}


Enjoy the sunshine,


Tuesday, November 21, 2017

Running jupyter notebook with pyspark shell on Windows 10

# Running jupyter notebook with pyspark shell.

A PySpark notebook will not run in an ordinary Jupyter notebook server, so here is how to launch Jupyter through the pyspark shell.
A similar approach works on Mac and Linux; the steps below were verified on Windows 10 as of 2017-Nov-21.

# Install Anaconda

https://www.anaconda.com/download/

## Run "Anaconda Prompt" to open command line:

%windir%\System32\cmd.exe "/K" E:\Users\datab_000\Anaconda3\Scripts\activate.bat E:\Users\datab_000\Anaconda3

Note: the above command is a single line.

## Then set environment variables (these tell pyspark to launch Jupyter notebook as its driver):

rem set PYSPARK_PYTHON=python3
set PYSPARK_DRIVER_PYTHON=jupyter
set PYSPARK_DRIVER_PYTHON_OPTS=notebook

## start pyspark

E:\movie\spark\spark-2.2.0-bin-hadoop2.7\bin\pyspark.cmd
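Once those variables are set, pyspark.cmd opens a Jupyter notebook instead of the plain shell. A minimal first-cell sanity check (a sketch; it relies on the spark and sc objects that the pyspark launcher creates):

# First notebook cell: confirm the SparkSession/SparkContext created by pyspark are usable.
print(spark.version)   # e.g. 2.2.0
print(sc.master)       # e.g. local[*]

# Tiny end-to-end check: distribute a small range of numbers and sum it.
print(sc.parallelize(range(10)).sum())   # 45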



Friday, November 17, 2017

Complete Spark 2.2.0 installation guide on Windows 10.

Goal
----
A step-by-step Apache Spark 2.2.0 installation guide for Windows 10.


Steps
----
Download winutils.exe from GitHub: https://github.com/steveloughran/winutils
E.g.: https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe

mkdir E:\tmp\hive     (or "mkdir /tmp/hive" from Git Bash)

winutils.exe chmod -R 777 E:\tmp\hive
or
winutils.exe chmod -R 777 /tmp/hive

set HADOOP_HOME=E:\movie\spark\hadoop
mkdir %HADOOP_HOME%\bin
copy winutils.exe %HADOOP_HOME%\bin\

Download Spark: spark-2.2.0-bin-hadoop2.7.tgz from http://spark.apache.org/downloads.html
cd E:\movie\spark\
# in MINGW64 (Git Bash / Cygwin)
tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz
# or use 7-Zip

cd E:\movie\spark\spark-2.2.0-bin-hadoop2.7
bin\pyspark.cmd



Notes
----
%HADOOP_HOME%\bin\winutils.exe must be locatable.

Folder "E:\movie\spark\hadoop" is just an example, it can be any folder.

Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+.
http://spark.apache.org/docs/latest/
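
If in doubt, a quick way to confirm the Python and Java versions on the machine (a small sketch, not part of the original steps):

# Check interpreter and JVM versions before installing Spark.
import sys, subprocess

print(sys.version)   # Spark 2.2.0 needs Python 2.7+ or 3.4+
# "java -version" writes to stderr, so redirect it into the captured output.
print(subprocess.check_output(["java", "-version"],
                              stderr=subprocess.STDOUT).decode())   # needs Java 8+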

Reference
----
https://wiki.apache.org/hadoop/WindowsProblems


Here is the example output when pyspark starts successfully:

E:\movie\spark\spark-2.2.0-bin-hadoop2.7>bin\pyspark.cmd

Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/11/17 19:07:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/17 19:07:36 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 3.6.1 (v3.6.1:69c0db5, Mar 21 2017 18:41:36)
SparkSession available as 'spark'.
>>>
>>> textFile = spark.read.text("README.md")
17/11/17 19:08:03 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
>>> textFile.count()
103

>>> textFile.select(explode(split(textFile.value, "\s+")).name("word")).groupBy("word").count().show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'explode' is not defined

>>> from pyspark.sql.functions import *
>>> textFile.select(explode(split(textFile.value, "\s+")).name("word")).groupBy("word").count().show()
+--------------------+-----+
|                word|count|
+--------------------+-----+
|              online|    1|
|              graphs|    1|
|          ["Parallel|    1|
|          ["Building|    1|
|              thread|    1|
|       documentation|    3|
|            command,|    2|
|         abbreviated|    1|
|            overview|    1|
|                rich|    1|
|                 set|    2|
|         -DskipTests|    1|
|                name|    1|
|page](http://spar...|    1|
|        ["Specifying|    1|
|              stream|    1|
|                run:|    1|
|                 not|    1|
|            programs|    2|
|               tests|    2|
+--------------------+-----+
only showing top 20 rows

>>>



# Modules for the PySpark window-function example.
from pyspark.sql import SparkSession, Row
from pyspark.sql.window import Window
from pyspark.sql.functions import *

sparkSession = SparkSession.builder.master("local").appName("Window Function").getOrCreate()

# Row behaves like a named tuple; this defines the classic EMP record layout.
Employee = Row("empno", "ename", "job", "mgr", "hiredate", "sal", "comm", "deptno")

emp1 = Employee(7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10)
emp2 = Employee(7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20)

# Small two-row DataFrame (inside the pyspark shell, sqlContext.createDataFrame works too).
df1 = sparkSession.createDataFrame([emp1, emp2])

empDF = sparkSession.createDataFrame([
      Employee(7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10),
      Employee(7499, "ALLEN", "SALESMAN", 7698, "20-Feb-81", 1600, 300, 30),
      Employee(7521, "WARD", "SALESMAN", 7698, "22-Feb-81", 1250, 500, 30),
      Employee(7566, "JONES", "MANAGER", 7839, "2-Apr-81", 2975, 0, 20),
      Employee(7654, "MARTIN", "SALESMAN", 7698, "28-Sep-81", 1250, 1400, 30),
      Employee(7698, "BLAKE", "MANAGER", 7839, "1-May-81", 2850, 0, 30),
      Employee(7782, "CLARK", "MANAGER", 7839, "9-Jun-81", 2450, 0, 10),
      Employee(7788, "SCOTT", "ANALYST", 7566, "19-Apr-87", 3000, 0, 20),
      Employee(7839, "KING", "PRESIDENT", 0, "17-Nov-81", 5000, 0, 10),
      Employee(7844, "TURNER", "SALESMAN", 7698, "8-Sep-81", 1500, 0, 30),
      Employee(7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20)
])


# Running sum of sal within each deptno, ordered by empno descending
# (default frame: unbounded preceding through the current row).
partitionWindow = Window.partitionBy("deptno").orderBy(desc("empno"))
sumTest = sum("sal").over(partitionWindow)
empDF.select("*", sumTest.name("PartSum")).show()

# Moving sum over a row frame of previous row, current row, next row within each deptno.
partitionWindowRow = Window.partitionBy("deptno").orderBy(desc("sal")).rowsBetween(-1, 1)
sumTest = sum("sal").over(partitionWindowRow)
empDF.select("*", sumTest.name("PartSum")).orderBy("deptno").show()

Monday, March 13, 2017

code review

The biggest rule is that the point of code review is to find problems in code before it gets committed - what you're looking for is correctness.
The most common mistake in code review - the mistake that everyone makes when they're new to it - is judging code by whether it's what the reviewer would have written.
Remember that there are many ways to solve a problem.

Advice


  • What you're looking for is correctness.

Programming decisions are a matter of opinion. Reviewers and developers should seek to understand each other’s perspective but shouldn’t get into a philosophical debate.
  • Be humble.

Don’t be a prick.  :-)
Phrase feedback as questions and observations:
“I didn’t see where these variables were initialized.”
“What do you think about the DRY principle, and whether it applies here?”
“I don’t understand why this is a global variable.”
  • Make sure you have coding standards in place.

Coding standards are a shared set of guidelines within an organization, with buy-in from everyone. If you don’t have coding standards, don’t let the discussion turn into a pissing contest over coding style (opening braces ‘{’ on the same line or the next!). If you run into a situation like that, take the discussion offline to your coding-standards forum.
One of the goals of code reviews is to create maintainable software.
  • Learn to communicate well.

You must be able to clearly express your ideas and reasons.
  • Authors should annotate source code before the review.

Authors should annotate code before the review occurs, explaining the problem, the goal, the design, and the solution implemented. Annotations guide the reviewer through the changes, showing which files to look at first and explaining the reason behind each code modification. Annotations should be written with other reviewers in mind, to ease the process and provide more context. As an added benefit, the author will often find additional errors before the peer review even begins, and more bugs found prior to peer review yields a lower overall defect density.
Effective code reviews require a healthy culture that values quality and excellence. Code reviews will not give you the desired results if the team doesn’t believe in delivering high-quality products. You need a positive culture in which people are engaged – one that thrives on constructive criticism and allows the best ideas to win.
Peer code review can be efficient and effective while fostering open communication and knowledge-share between teammates.



Tuesday, December 13, 2016

INSERT ALL and sequence NextVal

drop table t1 purge;
drop table t2 purge;
drop table t3 purge;

create table t1(n1 number, n2 number);
create table t2(n1 number, n2 number);
create table t3(n1 number, n2 number);

create sequence s1;

Problem:

INSERT ALL
    INTO t1(n1, n2) values(n1_seq, nn)
    INTO t2(n1, n2) values(n1_seq, nn) 
select s1.nextval n1_seq, rownum*10 nn from dual connect by level <= 2;

Error report -
SQL Error: ORA-02287: sequence number not allowed here
02287. 00000 -  "sequence number not allowed here"
*Cause:    The specified sequence number (CURRVAL or NEXTVAL) is inappropriate
           here in the statement.
*Action:   Remove the sequence number.

Fix: reference the sequence in the VALUES clauses instead of the subquery. s1.nextval in the first INTO clause draws a new value once per source row, and s1.currval in the remaining clauses returns that same value, so all three tables stay in sync.

INSERT ALL
    INTO t1(n1, n2) values(s1.nextval, nn)
    INTO t2(n1, n2) values(s1.currval, nn) 
    INTO t3(n1, n2) values(s1.currval, nn) 
select rownum*10 nn from dual connect by level <= 2;

select * from t1;
select * from t2;
select * from t3;

Friday, October 14, 2016

When do ANSI joins come into play?

ANSI join syntax has been around for many years.
There are a couple of situations where you are better off using ANSI joins rather than Oracle's native join syntax.

1. Join many tables.

Reason:

The case for explicit JOINs over implicit joins (whether they are outer joins or not) is that it is much easier to accidentally create a Cartesian product with implicit joins: the more tables are involved, the higher the risk that you miss one join condition. With explicit JOINs you cannot create one "by accident".

2. Outer join.

Reason:

Basically, (+) is severely limited compared to ANSI joins. ANSI syntax is also cleaner: you do not silently fall back to an inner join if you forget a (+) on one column of a multi-column outer join.
In the past there were some bugs with ANSI syntax, but on a recent release (11.2 or 12.1) those should already be fixed.


Example:

SELECT *
  FROM table_a a 
  JOIN table_b b 
    ON a.col1 = b.col1
  JOIN table_c c
    ON b.col2 = c.col2;

Example 2:

SELECT d.department_id, e.last_name, j.job_name
  FROM departments d 
  LEFT OUTER JOIN employees e
    ON d.department_id = e.department_id
  LEFT OUTER JOIN job j
    ON e.job_id = j.job_id
;


Wednesday, July 13, 2016

Join 3 tables - Oracle native outer join vs. ANSI outer join

select t1.id t1_id, t2.id t2_id, t3.id t3_id
from t1,t2,t3
where t1.id = t2.id(+)
and t2.id = t3.id(+)
--order by 1
;

     T1_ID      T2_ID      T3_ID
---------- ---------- ----------
         1          1          1
         3                      
         2          2           


select t1.id t1_id, t2.id t2_id, t3.id t3_id
from t1
left join t2 on t1.id = t2.id
left join t3 on t3.id = t2.id
--order by 1
;

     T1_ID      T2_ID      T3_ID
---------- ---------- ----------
         1          1          1
         3                      
         2          2           

Observation: they return the same result.


FYI, the test data setup:

drop table t1 purge;
drop table t2 purge;
drop table t3 purge;
create table t1 (id number);
create table t2 (id number);
create table t3 (id number);
insert into t1(id) select rownum from dual connect by level <= 3;
insert into t2(id) select rownum from dual connect by level <= 2;
insert into t3(id) select rownum from dual connect by level <= 1;
commit;