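Chinese Typesetting Style Guide

Typesetting needs careful design, yet it is often neglected. In WeChat official accounts, for example, most editors overlook the importance of a consistent typesetting style, and errors such as 「使用 github登录」 are common. These errors do not change the meaning of an article, but they do hurt the reading experience.

This article offers a reference typesetting style that you can adapt to your own needs. Parts of it come from LeanCloud's 《文案风格指南》 (Copywriting Style Guide) [1] and the GitHub repository mzlogin/chinese-copywriting-guidelines [2].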

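Copywriting Style

Before discussing typesetting, copywriting style has to come first: good typesetting is worth nothing without good copy.

  • Before publishing, have several people proofread the article to make sure it contains no typos.
  • Chinese internet slang is full of homophones and deliberate misspellings, such as 「墙裂」 and 「童鞋」. If the copy is fairly formal, using them is strongly discouraged.
  • Keep the copy concise. As long as the meaning is not affected, tighter sentences help readers grasp the key points.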

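Spaces

"Research shows that people who don't like putting spaces between Chinese and English when typing have a rough love life: 70% of them marry someone they do not love at age 34, and the remaining 30% can only leave their estate to their cats. After all, both love and writing need a timely bit of white space.

Let us all take this to heart." (from vinta/paranoid-auto-spacing, https://github.com/vinta/pangu.js)

When Chinese, English, and digits are mixed, spacing is probably the most easily overlooked detail. We recommend exactly one space between English words or digits and adjacent Chinese characters (punctuation excluded). Write 「使用 GitHub 登录」, not 「使用GitHub登录」, and certainly not 「使用  GitHub  登录」 with doubled spaces. The same applies to digits: write 「有超过 100,000 用户使用我们的产品」, not 「有超过100,000用户使用我们的产品」.

In a Chinese context, prefer Chinese numerals where possible. The example above can be written as 「有超过十万用户使用我们的产品」.

Do not overcorrect, though: no space goes between full-width punctuation and other characters. Write 「我觉得最好的搜索引擎是 Google。」, not 「我觉得最好的搜索引擎是 Google 。」 (note the space before the full stop in the incorrect example).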
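The spacing rule can be checked or applied mechanically. The sketch below is only an illustration, not the algorithm used by pangu.js: the function name `add_spacing` is hypothetical, and it assumes "Chinese character" can be approximated by the CJK Unified Ideographs block (U+4E00 to U+9FFF), which leaves full-width punctuation untouched.

```python
import re

# Assumption: Chinese characters are approximated by U+4E00-U+9FFF.
# Full-width punctuation such as "。" lies outside this range, so no
# space is ever inserted next to it.
CJK = r"\u4e00-\u9fff"
LATIN = r"A-Za-z0-9"


def add_spacing(text: str) -> str:
    """Insert one space between a Chinese character and an adjacent letter or digit."""
    text = re.sub(rf"([{CJK}])([{LATIN}])", r"\1 \2", text)
    text = re.sub(rf"([{LATIN}])([{CJK}])", r"\1 \2", text)
    return text


print(add_spacing("使用GitHub登录"))
print(add_spacing("我觉得最好的搜索引擎是Google。"))
```

Text that is already spaced correctly passes through unchanged, so the function can be run repeatedly on the same draft.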

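English Capitalization

Pay attention to the capitalization of proper nouns such as Android, iOS, iPhone, macOS, Google, and Apple. Use the official spelling whether or not the word begins a sentence.

As for abbreviations, avoid unofficial abbreviations of product names. For example, do not abbreviate HTML 5 as H5.

Some common mistakes:

Correct | Incorrect
iOS | IOS, ios, Ios
macOS | Mac
Python | py, python
PHP | php
HTML 5 | H5
Facebook | FB
AMD | Amd
Google | gg, google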

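Punctuation

If you do not know what full-width and half-width characters are, see the Wikipedia entry 「全形和半形」.

In text that is Chinese only, or mixed Chinese and English, always use Chinese (full-width) punctuation. If a complete English sentence or a special term appears, use English (half-width) punctuation inside that English text.

Never add a space between Chinese punctuation and other characters.

Examples:

Correct:

乔布斯那句话是怎么说的?「Stay hungry, stay foolish.」

嗨!你知道嘛?今天前台的小妹跟我说「喵」了哎!

Incorrect:

乔布斯那句话是怎么说的?「Stay hungry,stay foolish。」

嗨!你知道嘛?今天前台的小妹跟我说「喵」了哎!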

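Additional Tips

Add spaces around links

Because of WeChat official account restrictions, ordinary accounts cannot insert hyperlinks into articles. Most accounts now paste the URL directly into the text but forget to put spaces around it, which makes the link hard to copy. We recommend adding a space before and after every link. If the original URL is too long, use a URL shortening service. For example:

本文部分转载自 LeanCloud: https://open.leancloud.cn/copywriting-style-guide.html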

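Use corner brackets as quotation marks in Simplified Chinese

This is a highly contested point. Either way, whether or not you use corner brackets, the result is grammatically correct.

Usage:

「老师,『有条不紊』的『紊』是什么意思?」

For comparison:

“老师,‘有条不紊’的‘紊’是什么意思?”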

References

Parts of this article come from LeanCloud's 《文案风格指南》 (Copywriting Style Guide) [1] and the GitHub repository mzlogin/chinese-copywriting-guidelines [2].

[1] 文案风格指南 (Copywriting Style Guide), https://open.leancloud.cn/copywriting-style-guide.html
[2] GitHub repository, https://github.com/mzlogin/chinese-copywriting-guidelines

小工具 | Moodle快捷助手 & Essay句长检查器发布!

Moodle Helper and Essay Sentence Length Checker Published!
(For English version, please scroll down.)

你是否遇到……

  • 没人通知 Moodle 上传了新东西,每天只好一遍遍查看每个课程页?用
    Moodle Helper,在首页,一切动态尽收眼底!
  • Quiz 答题时,每一题都要“点选项 + 点 Check + 滑到下一题”,嫌操作太麻烦?用 Moodle Helper,只需轻按键盘上的 ABCD!
  • EBP 的 Essay 要求每句话不超过 9(或 8)个单词,不小心超了两句没发现?用 EssaySentenceLengthChecker,帮你标出不合要求的句子!

我最近为 Moodle 开发了一个小插件 MoodleHelper,另外又针对EBP作业开发了一个小工具 EssayChecker,希望能够让大家方便一些。

安装方式:

以后可能还将增加的功能有:

  • 智能打印:点一下moodle上的ppt,系统就会进行智能排版(如1面纸8页紧凑的ppt、双面打印……),并自动发送到打印机,省纸又方便。(另:貌似学校90%的人不知道ppt是可以紧凑地不留白地打印到一张纸上的哦)
  • 智能同步:自动将moodle上的课件同步到你的电脑文件夹,不必每次再下载新课件并整理到文件夹中。

常见问题

  • 为什么没有 Safari 版:注册费用要一年 99 刀,如果有好心人给一个账号我立马开发!
  • 只能使用 Chrome 浏览器吗:360 浏览器、QQ 浏览器可能也可以。
  • 程序是否安全:Moodle Helper完全开源,如觉得不安全,可查看源代码,也欢迎随时贡献新的代码和功能!
  • 开源地址:https://github.com/turtlegood/CUHKSZHelper

MoodleHelper 的一些应用场景

  • 想要提前预习,但是不知道教授什么时候才会更新 ppt,只好一遍又一遍地查看每个科目主页?
  • 教授喜欢把文件上传到文件夹里,因此得经常打开各个文件夹看看有没有新东西?
  • 新的 assignment 又神不知鬼不觉地出现了,根本没注意到课程页面下面多了一行?

这只是业余的一个小制作,难免存在不完美之处,请多多包涵。如果还有其他问题,欢迎随时联系我!(联系方式见下)

开发者:陈靖一(Tom)

联系方式:微信 fzyzcjy,邮箱 117010016

……

(Here is the English version.)

Have you ever run into these problems?

  • Nobody tells you when new files are uploaded to Moodle, so every day you have to check every course page and folder? With MoodleHelper, you can see every update right on the front page!
  • When doing quizzes, are you tired of “click the answer + click Check + scroll to the next question” for every single question? With MoodleHelper, all you need to do is press A, B, C or D on your keyboard!
  • EBP essays require no more than 9 (or 8) words per sentence, but you exceeded the limit in a couple of sentences without noticing? EssaySentenceLengthChecker marks every sentence that breaks the rule!

I recently developed an extension for Moodle called “Moodle Helper”, and also a tool called “EssaySentenceLengthChecker” for EBP assignments. I hope both will make things a little more convenient for everyone.

Installation:

A collaboration with the ITSO department is planned, so the following features may be added in the future:

  • Intelligent Printing: Click a ppt on Moodle and the system lays it out automatically (e.g. 8 slides packed onto one sheet, double-sided printing, …) and sends it to the printer, which is convenient and saves paper. (P.S. It seems that 90% of people here don’t know that a ppt can be printed compactly onto one sheet without large blank margins.)
  • Intelligent Sync: Automatically sync the files on Moodle to folders on your computer, so you never need to download new files and sort them into the corresponding folders by hand.

Some common questions:

  • Why is there no Safari version: Registration costs 99 dollars per year. If any kind soul can give me an account, I will develop it immediately!
  • Does it only work in Chrome: It may also work in 360 Browser and QQ Browser.
  • Is the software safe: Moodle Helper is fully open source. If you are worried about safety, you can read the source code, and you are welcome to contribute new code and features at any time!
  • Open source: https://github.com/turtlegood/CUHKSZHelper

Use cases of MoodleHelper:

  • You want to preview the material, but you don’t know when the professor will upload new ppts, so you have to check every course page on Moodle over and over again?
  • The professor likes to upload files into folders, so you have to keep opening every folder to see whether anything new has appeared?
  • A new assignment appears “mysteriously” without notification, and you don’t notice the extra line at the bottom of the course page?

This is just a small tool I made in my spare time, so it inevitably has shortcomings; please bear with them. If you have other questions, you are welcome to contact me at any time! (The contact information is listed below.)

Developer: Chen Jingyi (Tom)

Contact me: fzyzcjy (WeChat), 117010016 (Email)

“核战”看今朝——伪装机党观近日 AMD 大战英特尔

投稿:张润崧

高考结束以后,本人花了一定的时间来研究电脑硬件(主要是让自己买到合适的机子),对于民用电脑硬件的近期的发展也有了一定的了解。本文为科普性质,对于家用 CPU 近况非常了解或者完全不感兴趣的请略过此文章。文章中不可避免会出现一些错误,也请各位指正。

半年多以前,笔者刚入坑,遇到资深硬件党。问及电脑硬件发展,他们语重心长地说,当今桌面级电脑硬件中,发展最快的是显卡,发展最慢的无疑是CPU。笔者开始还不相信,后来才发现,这真的是事实。

近几年,桌面级 CPU 的发展相当缓慢,这是多方面的原因所造成的:桌面级 CPU 的市场一直为英特尔(Intel)和 AMD 所控制,而整整 10 年, AMD 一直被英特尔压一头,使得英特尔独步天下,几乎完全占有了桌面级电脑 CPU 市场;随着 CPU 制造工艺的进化,工艺改进越来越难(特别是最近各个代工厂商的 10nm 制程都出现了良品率不高的问题),使得直接细化电路来改进 CPU 受阻;改进架构、多核多线程化当然可以提升 CPU 性能,但整体极弱的市场竞争,也使得英特尔看不到拉升桌面级 CPU 性能的必要性。

所以,我们可以看到以下的现象:

英特尔玩起了大数据、人工智能、VR、物联网、云计算,CPU 改进的步伐越放越缓:7 代酷睿仅比 6 代酷睿提升了约 10%,而同期显卡市场上的英伟达 GTX1080Ti 则比上一代 GTX980Ti 提升了近 100%。英特尔因而被发烧友冠上了“牙膏厂”的名号。到了 2016 年,英特尔甚至公开宣布,不再把芯片业务作为重点。

与此同时,AMD 由于同时在桌面级 CPU 和 GPU 市场上表现乏力,因而被戏称为“农企”“按摩店”。在其他的一些方面,AMD 的表现还算可圈可点。同期家用游戏机市场的主要竞争对手,微软的 Xbox 系列和索尼的 PS 系列,不约而同地使用了 AMD 的 CPU 和 GPU,可见 AMD 的实力并没有发烧友口中那么不堪。但是其他方面的盈利没法挽救 AMD 最核心的桌面级 CPU 业务的亏损,导致 AMD 持续地处于低盈利状态,甚至连续3年亏损。

其他身处于 x86 体系外的厂商,如国产的龙芯和申威,受到现有的过于完善的市场生态的排斥,长期无法在桌面级市场上占有令人满意的份额。

但是在今年,英特尔可谓受到了当头一棒。在账目上还在不断亏损的情况下,AMD 却投入资金,潜心四年,研发全新的 Zen 架构,以取代不堪大任的 Bulldozer(推土机)。这可谓是一场豪赌,一旦失败,AMD 的财务状况无疑将雪上加霜。但 AMD 赌赢了:研制时预期架构性能提升 40%,最终却提升了足足 52%!

在今年2月21日,基于 Zen 架构,AMD 发布了受到圈内广泛赞誉的 Ryzen 7 处理器。AMD 的 CEO,Dr. Lisa Su 在发布会上开怀的笑容无疑令人印象深刻。一场如今依然在加剧的“热核战争”就此开启。

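Ryzen 7 targets the Core i7, and its asymmetric advantage is staggering. Matching performance at only a little over half of Intel's power consumption was already enough to tempt enthusiasts. Even more shocking is Ryzen's outright price advantage: the 8-core, 16-thread Ryzen 7 1700 launched in mainland China at just 2400 yuan, below even the 4-core, 8-thread i7-7700k, whose multi-core performance is clearly lower. The flagship 8-core, 16-thread 1800x launched at 3999 yuan, while the i7-6900k, also 8 cores and 16 threads with comparable performance, cost more than twice as much! At AMD's launch event, the moment the 1800x's price appeared on the big screen, the cheering could hardly be contained.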

值得一提的是,Ryzen 7 几乎完全放开了超频的限制(和各种上锁的英特尔形成鲜明对比),甚至内置了官方超频工具,直接响应了发烧玩家的要求。

虽然,Ryzen 7 的整体性能仅仅与 i7 持平,一些方面还有不小的差距(主频,多核优化),在初期发售的时候还存在问题(对内存的兼容,稳定性,掉帧),但其无可忽视的价格优势已经让曾经的“农企”受到了广泛的赞誉。

P.S: Ryzen 7 的官方发布会:http://www.bilibili.com/video/av8777517

之后,AMD 布局了 Ryzen 5 作为新一代中端处理器(对位酷睿 i5),没有了价格优势,却有了一定的性能优势。低端处理器 Ryzen 3,将于下半年发售、对位酷睿 i3,情况还不是太明朗,不过据悉同时研发 GPU 的 AMD 据传会为其配上性能远超英特尔的核显,这样不仅会为有一定显卡性能要求的低端用户省下 800 左右的经费,更可能顺势剿灭整个低端独立显卡市场!

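Intel, of course, would not sit still. The X79 and X99 platforms were clearly unable to hold up against Ryzen 7, so Intel decisively released the i9 series paired with the X299 chipset. According to reports, the i9 series is based on the new Skylake-X architecture, and the top-end i9-7980x has as many as 18 cores and 36 threads. One of Intel's traditional strengths is high clock speed (single-core frequency), and Intel has kept working on it to maintain that lead. To avoid falling behind AMD, Intel has also gone down the multi-core route, raising core counts to lift multi-core performance. Sparing no expense, Intel even gave the i9 AVX-512, an instruction set extension previously used only in Intel's mid-range and high-end server CPUs! Together with various other optimizations, this should give the i9 very impressive performance.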

作为根本措施,英特尔宣布 8 代酷睿将比 7 代提升至少 30%,试图维持住主频的优势,甩开和 AMD 的性能差距。分析指出,在 10nm 工艺受阻(导致摩尔定律被彻底打破)、英特尔 Tick-Tock 模式难以为继的情况下,英特尔会加强架构的改进以提升总体效率,并将促进 CPU 的多核化来弥补主频的不足,这意味着以后中低端 i7 上 8 核之类的现象可能会成为常态。

AMD 的后劲比英特尔想的足很多。与 i9 对标的 Ryzen Threadripper(线程撕裂者)紧随 i9 传出消息(坊间看做 AMD 接下了英特尔的“战书”),也和 i9 同期在台北电脑展上发布。

根据报道,Threadripper 的顶配为16核32线程。为了找到优势,AMD 用上了传说中的“胶水”大法:将两颗 CPU 封装在一起。 “胶水”一词来源于网友的调侃,定义为:使用特殊手段将两颗或多颗 CPU 封装在一起。最先采用“胶水”技术的英特尔在当年被 AMD 抓住了把柄,搞了一波“真双核,假双核”的公关(详见:https://www.zhihu.com/question/28135719 )。所以, 这次 AMD 的做法不免有点打自己脸的感觉。

AMD 似乎一点都没有顾及当年的恩怨,在英特尔传出i9的消息后就果断地以 Threadripper 作出了回应。这样的结果可谓是令人瞠目:根据披露的图片,巨大的 Threadripper 大到可以直接覆盖一个一个成年男子手掌除了手指的部分,相应的主板上 CPU 接口加上 8 个内存接口轻易占据了整块主板一半的面积。

多颗 CPU 封装在一起本来是一种弊端很大的做法,容易导致功耗高,资源利用率低下。令人意想不到的是,AMD 采用了 Infinity Fabric 技术(见
http://www.mykancolle.com/?post=1630 ),大幅改进了效率:据称使得多一倍的线程可以达到多近 100% 性能的水平。反而由于把两个 CPU 封装在一起,CPU 比原来拥有了多一倍的线程数、多一倍的缓存和多一倍的的通道数(PCI-e 通道和内存通道)以至于一些纸面数据还好于英特尔的 i9,让 CPU 具有了优势。

值得玩味的是,搭配 Threadripper 的主板芯片组被命名为 X399,截了英特尔后路(英特尔上一代个人旗舰级主板芯片组为 X99,搭配 i9 的新一代为 X299)。这可谓是明显的挑衅举动。其实历史上,AMD 就曾经多次干这样截英特尔后路的事情。比如 APU 系列的 A4、A6、A8,用数字压制 i3、i5、i7(就算性能干不过你,名字上也一定要压制你!手动斜眼)。

和 Ryzen 7 类似,Threadripper 据悉将延续低价路线,价格可能会只有 i9 的一半;更令人不得不吐槽的是,根据英特尔的官方消息,旗舰级的 i9 到了 2018 年才能够出货(使得几款产品沦为被坊间吐槽的“PPT 处理器”)。在整个 2017 下半年中,笔者实在不知道英特尔能够拿什么在最高端桌面市场对抗 AMD……

Threadripper 的出现不仅意在对抗 i9,同时也意在填补 Ryzen 7 和 AMD 服务器 CPU 之间的空白。基于 Zen 架构,AMD 的 Naples(那不勒斯)系列(属于 AMD 新子品牌 EPYC)意在对抗英特尔的 XEON(至强)。和 Threadripper 类似,EPYC 是 4 颗 CPU 封装在一起的产物,于是其最多可以达到 32 核 64 线程(一如既往地靠多核弥补主频)。至此,除笔记本用 CPU,AMD 和英特尔的“核战”几乎由最低端的领域一路延伸到最高端的领域。

当 Ryzen 7 发布的时候,很多发烧友和门户网站的评测者都激动地感叹到:AMD,我等了你足足十年!QAQ

那些年,AMD 的 Athlon(速龙)处理器在桌面级市场一骑绝尘;那些年,英特尔步入歧途,产品四处碰壁;那些年,AMD 意气风发,收购 ATI 进入显卡业务,濒临破产的英伟达差点与 AMD 签下收购合约……

好景不长。不久之后,AMD 就因为负债收购 ATI,陷入了持续亏损;而另一边,英特尔放弃了错误的“唯主频论”,拿出了“Tick-Tock”策略,推出了酷睿系列,再次走向辉煌。缓过劲来的 AMD 多次发起冲击,反而英特尔的霸主地位越坐越稳,自己的账本上连年亏损……

风水轮流转,当英特尔将重点移出芯片业务,进步越发迟缓,而被冠上了“挤牙膏”的恶名之时,AMD 潜心 4 年铸一剑,靠 Zen 架构奋起直追,振奋人心的多核性能,较低的功耗,惊人的价格,对发烧玩家诉求的回应,看得周围叫好连连。人们这才想起,这个大家口中的“农企”,当年和当今,都拥有和英特尔大战三百回合的不俗实力!

展望未来,英特尔和 AMD 的核战远没有停止的迹象,双方可谓各有千秋。英特尔旗舰级 i9 的出货延期,显示其应对仍显仓促,2017 年下半年,AMD 在高端桌面级市场上优势明显。然而,英特尔的老本依旧不容忽视,其连续 10 年作为市场霸主,手握巨大的市场惯性,和连续多年数倍于 AMD 的研发资金流所积累的技术优势(尤其体现在主频上)。一旦英特尔在保持性能优势的情况下及时调整定位,放低身段,压低售价,那么 AMD 所建立起来的一时优势,将顷刻间灰飞烟灭。另外一边,AMD 同样清楚,给到对手的时间越多,自己被反杀的几率越大,所以也是马不停蹄,全力做好公关,解决兼容性问题,拓展市场,扩大产能,将更多的资金投入研发,在英特尔的反击(如 8 代酷睿)到来之前做好准备。时间决定成败,当今的核战,红蓝对决,两家超级大厂的生死时速,还在继续。

P.S: AMD和英特尔的恩怨情仇:http://tech.huanqiu.com/news/2017-03/10398639.html

作为消费者,笔者还是比较现实的。企业毕竟不是慈善家,怎么赚钱才是他们的终极目的。这回 AMD 奋起直追,对消费者来说是个好事:桌面级 CPU 市场几乎沦为一片死水,英特尔的“挤牙膏”早已让很多人不满。AMD 近期的行为应了情怀,应了民心,获得消费者的叫好,无疑是必然。不过从消费者的角度来看,红蓝对决,斗得两败俱伤,才能让我们的利益最大化。期待两厂可以拿出更多令人惊讶的产品。

文章作者:张润崧(学生,香港中文大学(深圳)理工学院)

图片来源网络

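2017 The Chinese University of Hong Kong, Shenzhen Big Data Analysis Challenge: Task Description

  1. Background

    With e-commerce booming, industry pays close attention to user reviews on e-commerce platforms and uses them to guide product improvements and raise profits. Natural language processing (NLP) is a major research direction in artificial intelligence today. In this contest, participants develop a complete analysis pipeline for reviews of a specific product from a specific brand, and output a sentiment analysis of each review under ten given categories.

  2. Data

    The data set contains 101480 records. The text of each record is a shampoo review from an e-commerce platform. Each record has 12 columns, in this order:

    id, text, type1, type2, type3, type4, type5, type6, type7, type8, type9, type10

    For each type column, "0" means the type does not appear in the review, "1" means it appears with positive sentiment, and "-1" means it appears with negative sentiment.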

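    The types are defined as follows:

    • Price: the review is about price, e.g. whether the product is good value or cheap.
    • Fakeconcern: the review is about whether the product is genuine.
    • Promotion: the review is about a sales promotion, e.g. it contains words such as "活动" (promotional event).
    • Service: the review is about the seller's service.
    • Leakage: the review is about whether the shampoo leaked.
    • Package: the review is about the packaging.
    • Loyalty: the review is about customer loyalty. Phrases such as "多次购买" (bought many times) or "第二次购买" (second purchase) are labeled "1", and phrases such as "再也不买" (never buying again) are labeled "-1".
    • Smell: the review is about the scent.
    • Effect: the review is about how well the shampoo works.
    • Logistics: the review is about shipping and delivery.

    Example:

    The review with Id 11:

    “跟超市搞活动时价格差不多,这次礼盒多了5小瓶赠品很合算。” (The price is about the same as a supermarket promotion, and this gift box includes 5 extra small bottles, so it is a good deal.)

    Here "Price" is 1 and "Promotion" is also 1.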

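  3. Evaluation

    For each type there are three F1 values: one for whether the type was detected at all, one for the positive label, and one for the negative label. This gives 30 F1 values in total.

    Negative reviews matter more to sellers, so every negative F1 value is multiplied by 1.4 when the total score is computed. Among the types, Smell, Effect, Package and Logistics also matter more to sellers, so their F1 values are multiplied by 1.25 in the total score.

    Each $F1$ score is computed as:

    $$F1=\frac{2\cdot Precision\cdot Recall}{Precision+Recall}$$

    where the precision $Precision$ is:

    $$Precision=\frac{TP}{TP+FP}$$

    and the recall $Recall$ is:

    $$Recall=\frac{TP}{TP+FN}$$

    Here $TP$, $FP$ and $FN$ are the numbers of true positives, false positives and false negatives for the label being scored.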
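    The scoring can be sketched in a few lines of Python. This is a rough illustration, not the organizers' scoring script: the brief does not say how the two weights combine or whether the total is normalized, so the sketch assumes the weights multiply and the total is a plain weighted sum. The names `f1_score`, `weight`, `total_score` and the kind labels `"hit"`, `"positive"`, `"negative"` are hypothetical.

```python
# Types that the brief says matter more to sellers (factor 1.25).
IMPORTANT_TYPES = {"Smell", "Effect", "Package", "Logistics"}


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weight(type_name: str, kind: str) -> float:
    """Weight of one F1 value; assumes the two factors multiply."""
    w = 1.0
    if kind == "negative":
        w *= 1.4
    if type_name in IMPORTANT_TYPES:
        w *= 1.25
    return w


def total_score(f1_values: dict) -> float:
    """Weighted sum over {(type_name, kind): f1}; assumes no normalization."""
    return sum(weight(t, k) * f1 for (t, k), f1 in f1_values.items())


print(f1_score(tp=8, fp=2, fn=2))
print(weight("Smell", "negative"))
```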

2017 香港中文大学(深圳)大数据分析挑战赛

  1. 比赛主题
    本届大数据分析挑战赛的主题为:消费品行业中的数据分析

  2. 主办单位:香港中文大学(深圳)计算机协会(Computer @nd Comity)

  3. 协办单位:一面网络技术有限公司 (yimian.com.cn)

  4. 比赛目的及意义

    随着移动设备的完善和普及,移动互联网+各行各业进入了高速发展阶段,这其中以 O2O(Online to Offline)消费最为吸引眼球。据不完全统计,O2O行业估值上亿的创业公司至少有 10 家,也不乏百亿巨头的身影。O2O 行业天然关联数亿消费者,各类 APP 每天记录了超过百亿条用户行为和位置记录,因而成为大数据科研和商业化运营的最佳结合点之一。

    香港中文大学(深圳)大数据分析挑战赛由香港中文大学(深圳)计算机协会主办,一面网络技术有限公司协办,是面向全校学生的高端算法竞赛。通过开放由一面网络技术有限公司提供的海量电商评论和销量数据,大赛让所有参与者有机会运用自己设计的算法。

    本次大赛目的在于提升同学们对大数据的认识与理解,使得同学在比赛中学习、提高大数据分析能力,为今后的学习和工作提供宝贵的经验。

  5. 比赛赛题方向

    如今的电商平台中存在大量的商品评论,作为商家和数据分析者,希望提取其中的信息点来发现商业价值。你的任务是对商品评论进行类别和情感层面上的分类。

    训练集在赛事初发放。测试集发放分为两次,第一次发放约5w条数据,参赛者可以不限次数提交结果,但只在每日中午十二点返回最近一次的结果评测。第二次测试集发放在比赛截止前一晚,发放约5w条数据,参赛者可在最后一天无限次提交结果,最后以当夜 23:59 前最后一次提交的结果为准。

    为了方便对此有兴趣的同学参与比赛,计算机协会将会提供基础的数据分析指导,帮助大家完成自己第一次大数据分析。我们相信所有人都能从本次比赛获得宝贵的知识和经验。

    数据格式:

    1. 训练集(案例)包括如下字段:评论 ID、评论内容、类别1、类别2、类别3、情感
    2. 测试集(案例):评论 ID、评论内容
    3. 提交结果(案例):评论 ID、类别1、类别2、类别3、情感

    评测指标:

    对于每个类别和情感都可以得到一个f1-score, 最终总评为各个f1-score的加权

  6. 比赛评分

    比赛的成绩分为两个部分:

    • 对于销量的预测准确率 60%
    • 所用模型和方法答辩 40%

    线上测试开放后,每一位参赛队员每一日可提交一次,最终成绩取历次成绩中最好的一次。线上测试关闭后参赛选手需参与答辩,否则将没有第二部分成绩。

  7. 比赛赛程(可能会根据实际情况有微小变动)

    1. 报名:1月22号截止

      组队参赛,每个选手只能参与一支队伍,每组队员不多于3人(特殊情况可以提交申请,视情况放宽)。

      未找到队伍的同学可以个人报名,可以选择是否接受与其他单人参赛同学随机组队。

      选手需通过 Google 表单进行报名,报名成功后会有邮件提示报名成功。

    2. 数据发布:1月22号至1月24日

      数据发布后参赛选手就可以着手分析、编写脚本

    3. 线上测试开放:2月11日

    4. 线上测试关闭:3月9日

    5. 现场答辩:3月12日

      每队有五分钟时间陈述所用模型与处理方法,每队有五分钟的 Q&A 时间。现场颁奖。

  8. 其他信息请咨询计算机协会,email:[email protected]

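Parameter Norm Regularization

Generalization is a central concern in machine learning. Many models are deliberately designed to trade a higher training error for better generalization; these strategies are collectively called regularization. A common approach adds a parameter norm penalty $\Omega(\theta)$ to the objective function $J$, with the main purpose of limiting the capacity of the model. The regularized objective can be written as

$$\tilde{J}(\theta;X,y)=J(\theta;X,y)+\alpha\Omega(\theta)$$

where $\alpha\in[0,\infty)$ is a hyperparameter that controls the strength of the penalty. $\alpha=0$ means no regularization, and a larger $\alpha$ means a heavier penalty. Minimizing the regularized objective reduces both the original training objective and the size of the parameters. Different choices of norm lead to different preferred solutions.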

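$L^2$ Parameter Regularization

$L^2$ parameter regularization (also known as weight decay) adds the penalty $\Omega(\theta)=\frac{1}{2}\|\omega\|_2^2$ to the objective, driving the weights toward the origin. (More generally, regularizing toward any fixed point in parameter space has a regularizing effect. It works better the closer that point is to the true value, and zero is a sensible default.)

A natural way to study $L^2$ regularization is through the gradient of the objective, since gradient-based methods are currently the most popular way to train complex models. Assume for simplicity that the model has no bias parameter, so $\theta$ is just $\omega$. The objective is then

$$\tilde{J}(\omega;X,y)=\frac{\alpha}{2}\omega^T\omega+J(\omega;X,y)$$

with gradient

$$\nabla_\omega\tilde{J}(\omega;X,y)=\alpha\omega+\nabla_\omega J(\omega;X,y)$$

A single gradient descent step with learning rate $\epsilon$ updates the weights as

$$\omega\leftarrow\omega-\epsilon\left(\alpha\omega+\nabla_\omega J(\omega;X,y)\right)=(1-\epsilon\alpha)\omega-\epsilon\nabla_\omega J(\omega;X,y)$$

so the weight vector is multiplicatively shrunk by a constant factor before each ordinary gradient step.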
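The update rule for gradient descent with an $L^2$ penalty can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function name `weight_decay_step` is hypothetical, the weights and the gradient of the unregularized objective are given as plain lists, and `lr` plays the role of the learning rate.

```python
def weight_decay_step(w, grad, lr, alpha):
    """One gradient descent step on J(w) + (alpha / 2) * ||w||^2.

    w     -- current weights (list of floats)
    grad  -- gradient of the unregularized objective J at w
    lr    -- learning rate
    alpha -- strength of the L2 penalty
    """
    # Shrink each weight by the factor (1 - lr * alpha), then take
    # the ordinary gradient step.
    return [(1 - lr * alpha) * wi - lr * gi for wi, gi in zip(w, grad)]


# With a zero gradient, only the multiplicative shrinkage remains.
print(weight_decay_step([1.0, -2.0], [0.0, 0.0], lr=0.1, alpha=0.5))
```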

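To simplify the analysis, let $\omega^\ast=\arg \min_\omega J(\omega)$ be the unregularized minimizer and make a quadratic approximation of the objective in the neighborhood of $\omega^\ast$. If the objective is truly quadratic (least squares regression, or any other model fitted with mean squared error), the approximation is exact. The approximate objective is

$$\hat{J}(\omega)=J(\omega^\ast)+\frac{1}{2}(\omega-\omega^\ast)^TH(\omega-\omega^\ast)$$

where $H$ is the Hessian of $J$ evaluated at $\omega^\ast$. Because the gradient vanishes at the minimum, the approximation has no first-order term, and because $\omega^\ast$ is a minimum, $H$ is positive semidefinite. To study the effect of weight decay, add the weight decay gradient $\alpha\omega$ to the gradient of $\hat{J}$, which is $H(\omega-\omega^\ast)$, and let $\tilde{\omega}$ denote the minimizer of the regularized approximation:

$$\alpha\tilde{\omega}+H(\tilde{\omega}-\omega^\ast)=0$$

$$(H+\alpha I)\tilde{\omega}=H\omega^\ast$$

$$\tilde{\omega}=(H+\alpha I)^{-1}H\omega^\ast$$

As $\alpha$ approaches zero, the regularized solution $\tilde{\omega}$ approaches the unregularized solution $\omega^\ast$. To see what happens as $\alpha$ grows, use the fact that $H$ is real and symmetric and decompose it as $H=Q\Lambda Q^T$, where $\Lambda$ is diagonal and the columns of $Q$ are orthonormal eigenvectors:

$$\tilde{\omega}=(Q\Lambda Q^T+\alpha I)^{-1}Q\Lambda Q^T\omega^\ast=Q(\Lambda+\alpha I)^{-1}\Lambda Q^T\omega^\ast$$

Weight decay therefore rescales $\omega^\ast$ along the axes defined by the eigenvectors of $H$. Specifically, the component of $\omega^\ast$ aligned with the $i$-th eigenvector of $H$ is scaled by the factor $\frac{\lambda_i}{\lambda_i+\alpha}$. Along directions where the eigenvalues of $H$ are large, regularization has little effect, while components with $\lambda_i\ll\alpha$ are shrunk to nearly zero. Only the directions along which the parameters contribute significantly to reducing the objective are preserved; directions in which movement does not significantly increase the gradient are decayed away. $L^2$ regularization makes the learning algorithm "perceive" the input $x$ as having higher variance, so the weights of features whose covariance with the output target is small compared with this added variance are shrunk.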

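Applying this regularization to a linear regression model, and absorbing the factor $\frac{1}{2}$ into $\alpha$, the squared error $(X\omega-y)^T(X\omega-y)$ becomes

$$(X\omega-y)^T(X\omega-y)+\alpha\omega^T\omega$$

and the solution changes from the plain pseudo-inverse solution $\omega=(X^TX)^{-1}X^Ty$ to

$$\omega=(X^TX+\alpha I)^{-1}X^Ty$$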
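The effect of $\alpha$ on the closed-form solution is easiest to see with a single feature and no bias, where $X^TX$ and $X^Ty$ are scalars. The sketch below covers only that special case under the objective $\|X\omega-y\|_2^2+\alpha\omega^2$; the function name `ridge_1d` is hypothetical.

```python
def ridge_1d(xs, ys, alpha):
    """Closed-form ridge solution w = (X^T X + alpha)^(-1) X^T y for one feature."""
    xtx = sum(x * x for x in xs)              # X^T X (a scalar here)
    xty = sum(x * y for x, y in zip(xs, ys))  # X^T y (a scalar here)
    return xty / (xtx + alpha)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # generated by y = 2x, so the unregularized w is 2

print(ridge_1d(xs, ys, alpha=0.0))   # no penalty
print(ridge_1d(xs, ys, alpha=14.0))  # alpha equal to X^T X halves the weight
```

In this one-dimensional case the shrink factor is exactly $\frac{X^TX}{X^TX+\alpha}$, the scalar analogue of $\frac{\lambda_i}{\lambda_i+\alpha}$.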

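$L^1$ Parameter Regularization

Choosing the penalty $\Omega(\theta)=\|\omega\|_1=\sum_i|\omega_i|$ gives $L^1$ parameter regularization. As with $L^2$ regularization, a hyperparameter $\alpha$ controls how much the penalty contributes to the objective, which can be written as

$$\tilde{J}(\omega;X,y)=\alpha\|\omega\|_1+J(\omega;X,y)$$

Following the same analysis as for $L^2$, the gradient (strictly speaking, a subgradient) of this objective is

$$\nabla_\omega\tilde{J}(\omega;X,y)=\alpha\,\mathrm{sign}(\omega)+\nabla_\omega J(\omega;X,y)$$

So $L^1$ regularization does not scale each $\omega_i$ linearly. Instead it adds to each component a constant with the same sign as $\mathrm{sign}(\omega_i)$. In general we cannot obtain a closed-form solution for this form of regularization. For a linear model with a quadratic cost, a Taylor expansion represents the objective exactly, and in this setting the gradient of the approximation is

$$\nabla_\omega\hat{J}(\omega)=H(\omega-\omega^\ast)$$

Because $L^1$ regularization has no clean analytical expression for a general Hessian, we further assume that the Hessian is diagonal with every diagonal element greater than zero. (This holds when the data for a linear regression have been preprocessed, for example with PCA, so that the features are uncorrelated.) The quadratic approximation of the $L^1$ regularized objective then decomposes into a sum over the parameters:

$$\hat{J}(\omega;X,y)=J(\omega^\ast;X,y)+\sum_i\left[\frac{1}{2}H_{ii}(\omega_i-\omega_i^\ast)^2+\alpha|\omega_i|\right]$$

For this approximate objective, each dimension has the analytical solution

$$\omega_i=\mathrm{sign}(\omega_i^\ast)\max\left\{|\omega_i^\ast|-\frac{\alpha}{H_{ii}},0\right\}$$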

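Consider the case where $\omega_i^\ast>0$ for all $i$:

  1. If $\omega_i^\ast\leq\frac{\alpha}{H_{ii}}$, the regularized optimum is $\omega_i=0$. In direction $i$ the contribution of $J(\omega;X,y)$ to $\tilde{J}(\omega;X,y)$ is overwhelmed, and $L^1$ regularization pushes the parameter to zero.
  2. If $\omega_i^\ast>\frac{\alpha}{H_{ii}}$, regularization does not move the optimum to zero; it only shifts it in that direction by a distance of $\frac{\alpha}{H_{ii}}$.

The case $\omega_i^\ast<0$ is similar: regularization moves $\omega_i^\ast$ toward 0 by $\frac{\alpha}{H_{ii}}$, or sets it to 0.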
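The per-dimension solution under the diagonal-Hessian assumption is the soft-thresholding operation, which can be sketched as follows. The function name `soft_threshold` is hypothetical; `w_star`, `alpha` and `h_ii` stand for $\omega_i^\ast$, $\alpha$ and $H_{ii}$.

```python
import math


def soft_threshold(w_star: float, alpha: float, h_ii: float) -> float:
    """Minimizer of 0.5 * h_ii * (w - w_star)**2 + alpha * abs(w), with h_ii > 0."""
    shrink = alpha / h_ii
    magnitude = max(abs(w_star) - shrink, 0.0)
    # Keep the sign of the unregularized optimum.
    return math.copysign(magnitude, w_star)


print(soft_threshold(0.3, alpha=1.0, h_ii=2.0))   # small weight: set to zero
print(soft_threshold(2.0, alpha=1.0, h_ii=2.0))   # large weight: moved by alpha / h_ii
print(soft_threshold(-2.0, alpha=1.0, h_ii=2.0))  # negative weight: moved toward zero
```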

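In general, $L^1$ regularization tends to produce sparser solutions. $L^2$ regularization shrinks the weights of some inputs but does not make the parameters sparse, whereas $L^1$ regularization yields sparsity once $\alpha$ is large enough. Because of this sparsity property, $L^1$ regularization is widely used for feature selection. The well-known LASSO (Least Absolute Shrinkage and Selection Operator), for example, combines an $L^1$ penalty with a linear model and a least squares cost, driving a subset of the weights to zero and thereby indicating that the corresponding features can safely be ignored.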

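Regularization as MAP Bayesian Inference

Many regularization strategies can be interpreted as MAP Bayesian inference. Start with the original linear regression model, where we assume $y=h_\omega(x)=\omega^Tx+\epsilon$ and $\epsilon$ is a Gaussian error term with $p(\epsilon)=N(0,\sigma^2)$. For $n$ samples, the log likelihood is

$$\ell(\omega)=\sum_{i=1}^{n}\log\left[\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\omega^Tx^{(i)})^2}{2\sigma^2}\right)\right]=-n\log(\sqrt{2\pi}\sigma)-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y^{(i)}-\omega^Tx^{(i)})^2$$

Maximizing it is the ordinary least squares problem. Here no prior distribution is placed on the parameters, so overfitting happens easily when the model has many parameters. By introducing different priors on $\omega$, we can derive the familiar forms of regularization.

  1. $L^2$ regularization corresponds to MAP Bayesian inference with a Gaussian prior on the weights.

    Place a zero-mean Gaussian prior with variance $\alpha$ on each parameter, $p(\omega_j)=\frac{1}{\sqrt{2\pi\alpha}}\exp(-(\omega^{(j)})^2/2\alpha)$. The posterior to be maximized satisfies

    $$p(\omega|X,y)\propto\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\omega^Tx^{(i)})^2}{2\sigma^2}\right)\prod_{j}\frac{1}{\sqrt{2\pi\alpha}}\exp\left(-\frac{(\omega^{(j)})^2}{2\alpha}\right)$$

    Taking the logarithm gives

    $$\log p(\omega|X,y)=-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y^{(i)}-\omega^Tx^{(i)})^2-\frac{1}{2\alpha}\sum_{j}(\omega^{(j)})^2+\mathrm{const}$$

    Maximizing this is equivalent to $\omega=\arg\min_\omega\left(\frac{1}{n}\sum_{i=1}^{n}(y^{(i)}-\omega^Tx^{(i)})^2+\frac{\sigma^2}{n\alpha}\|\omega\|_2^2\right)$, which has the same form as $L^2$ regularization. Note that the penalty strength is inversely proportional to the prior variance $\alpha$.

  2. $L^1$ regularization corresponds to MAP Bayesian inference with a Laplace prior on the weights.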

Linear Model (1) - Linear Regression

Maximum Likelihood Linear Regression

Suppose we have a data set $S=\{(x^{(i)},y^{(i)}),i=1,\dots,m\}$ where $x^{(i)}\in\mathbb R^n$, that is, $m$ training examples with $n$ features each. Assume that the target variables and the inputs are related via the linear equation

$$y^{(i)}=\theta^Tx^{(i)}+\epsilon^{(i)}$$

where $\epsilon^{(i)}$ is an error term that captures either unmodeled effects or random noise. Assume further that the $\epsilon^{(i)}$ are i.i.d. (independently and identically distributed) according to a Gaussian distribution with mean zero and variance $\sigma^2$, written $\epsilon^{(i)}\sim N(0, \sigma^2)$. The pdf of $\epsilon^{(i)}$ is

$$p(\epsilon^{(i)})=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$$

Because $\epsilon^{(i)}=y^{(i)}-\theta^Tx^{(i)}$, this implies

$$p(y^{(i)}|x^{(i)};\theta)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)$$

The notation $p(y^{(i)}|x^{(i)};\theta)$ denotes the distribution of $y^{(i)}$ given $x^{(i)}$, parameterized by $\theta$. Since $\theta$ is not a random variable, this is not a probability conditioned on $\theta$. We can also write the distribution as $y^{(i)}|x^{(i)};\theta\sim N(\theta^Tx^{(i)},\sigma^2)$. Given the input matrix $X=(x^{(1)},x^{(2)},\dots,x^{(m)})^T$ and $\theta$, the distribution of the $y^{(i)}$ is $p(\overrightarrow{y}|X;\theta)$. When we wish to view this explicitly as a function of $\theta$, we call it the likelihood function:

$$L(\theta)=L(\theta;X,\overrightarrow{y})=p(\overrightarrow{y}|X;\theta)$$

By the independence assumption on the $\epsilon^{(i)}$, this can be written as

$$L(\theta)=\prod_{i=1}^{m}p(y^{(i)}|x^{(i)};\theta)=\prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right)$$

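Given this probabilistic model relating the $y^{(i)}$ and the $x^{(i)}$, the principle of maximum likelihood says that we should choose $\theta$ to make the observed data as probable as possible. So we face an optimization problem.

Since the logarithm is strictly increasing, we can equivalently maximize the log likelihood:

$$\ell(\theta)=\log L(\theta)=m\log\frac{1}{\sqrt{2\pi}\sigma}-\frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{m}(y^{(i)}-\theta^Tx^{(i)})^2$$

The first term does not depend on $\theta$, so the maximization problem $\max_{\theta}\ell(\theta)$ becomes a minimization problem:

$$\min_\theta\frac{1}{2}\sum_{i=1}^{m}(y^{(i)}-\theta^Tx^{(i)})^2$$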

This is our original least-squares cost function. Under the previous probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of $\theta$.

Back to our linear regression problem: assume we have the data set $S=\{(x^{(i)},y^{(i)}),i=1,\dots,m\}$ and the hypothesis $h_{\theta}(x)=\theta^Tx$ (we set $x_0$ to $1$ so that the constant $\theta_0$ is included in $\theta$, and thus $x^{(i)}\in\mathbb R^{n+1}, \theta\in\mathbb R^{n+1}$). Note that the $\theta$ in our hypothesis is not the population parameter; it is the parameter we are going to estimate by maximizing the likelihood function. Following the probabilistic analysis above, we define the cost function

$$J(\theta)=\frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$$

and the linear regression problem can be expressed as the optimization problem

$$\min_\theta J(\theta)$$

To rewrite the problem in matrix form, let $X=(x^{(1)},x^{(2)},\dots,x^{(m)})^T$ and $y=(y^{(1)},y^{(2)},\dots,y^{(m)})^T$, where $x^{(i)}\in\mathbb R^{n+1}$, $y\in\mathbb R^{m}$ and $\theta\in\mathbb R^{n+1}$. The constant factor $\frac{1}{2}$ does not change the minimizer, so the problem can be expressed as

$$\min_\theta \|X\theta-y\|^2_2$$

The fitted linear model is then $y=\theta^Tx$, where $\theta=\arg \min_\theta \|X\theta-y\|^2_2$. We will solve this problem later.

Least Squares Linear Regression

If we want a model that fits the sample data with the least error, a simple approach is to make the difference between the predictions and the samples as small as possible in some sense. In least squares linear regression, we minimize the sum of the squared errors. Suppose our hypothesis is $h_\theta=X\theta$ (in matrix form, where $X\in\mathbb R^{m\times(n+1)}$, $\theta\in\mathbb R^{n+1}$, and $x^{(i)}_0=1$ for all inputs so that $\theta_0$ acts as the constant term). The problem can be expressed as

$$\min_\theta \|X\theta-y\|^2_2$$

This has the same expression as maximum likelihood regression, but the two differ in motivation. Maximum likelihood estimation relies on the important assumption that the samples $x^{(1)}, x^{(2)},\dots,x^{(m)}$ are i.i.d. Here $\theta$ is the parameter of the model and $h$ is our model, or hypothesis. For maximum likelihood regression, the most reasonable estimate is the one that maximizes the probability of observing the $m$ sampled $y$ values under the model. For least squares, the most reasonable estimate is the one that fits the samples best (it minimizes the squared error). The two regression methods clearly come from different ideas.

When we employ maximum likelihood regression, we need to know the probability distribution of the errors. In general, we assume that the distribution is Gaussian. Under this assumption, maximum likelihood regression is equivalent to least squares regression.

Least squares can also be understood geometrically. Suppose all input samples $x^{(i)}$ lie in the vector space $\mathbb R^{n+1}$. Ideally we want a $\theta$ such that $(x^{(i)})^T\theta=y^{(i)}$ holds for all $m$ samples, which can be written as $X\theta=y$. In real-world problems, however, the number of samples $m$ is usually larger than the number of features $n$, so the system $X\theta=y$ is overdetermined and generally has no exact solution. Let $\theta'$ be the least squares solution; then $X\theta'$ is the orthogonal projection of $y$ onto the space spanned by the column vectors of $X$. From linear algebra, assuming $X^TX$ is invertible, this projection is $X(X^T X)^{-1}X^Ty$, so $\theta'=(X^T X)^{-1}X^Ty$. We will see later that this matches the solution obtained by differentiation.

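Solve $\min_\theta \|X\theta-y\|^2_2$

Besides viewing it as the least squares solution of an overdetermined system, we can solve the problem with calculus. Since $\|X\theta-y\|^2_2=(X\theta-y)^T(X\theta-y)$, we define $J(\theta)=(X\theta-y)^T(X\theta-y)$ and the optimization problem is

$$\min_\theta J(\theta)=\min_\theta (X\theta-y)^T(X\theta-y)$$

We take the derivative

$$\nabla_\theta J(\theta)=2X^TX\theta-2X^Ty$$

and set it to zero

$$X^TX\theta=X^Ty$$

Assuming $X^TX$ is invertible, we get the solution

$$\theta=(X^TX)^{-1}X^Ty$$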
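To make the normal equations concrete, here is a small pure-Python sketch for the simplest case: one feature plus the constant term $x_0=1$, so that $X^TX$ is a 2×2 matrix that can be inverted by hand. The function name `fit_line` is hypothetical, and a practical implementation would more likely use a linear algebra library.

```python
def fit_line(xs, ys):
    """Solve X^T X theta = X^T y for theta = (theta0, theta1), with x0 = 1."""
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[m, sx], [sx, sxx]] and X^T y = [sy, sxy].
    det = m * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular; all x values are identical")
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return theta0, theta1


# Points on the line y = 1 + 2x are recovered exactly.
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))
```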

Week One - Python in CUHK(SZ)

Brief Introduction to Computer Programming

Why programming?

Computers are built to help people solve problems, but they do not understand what we say.
So we need to communicate with computers in their own languages (computer programming languages).

Components in a computer

  1. Central processing unit (CPU): execute your program. Similar to human brain, very fast but not that smart
  2. Input device: take inputs from users or other devices
  3. Output device: output information to users or other devices
  4. Main memory: store data, fast and temporary storage
  5. Secondary memory: slower but large size, permanent storage

What can a computer actually understand

The computers used nowadays can understand only binary numbers (i.e. 0 and 1).
Computers use voltage levels to represent 0 and 1.
Instructions expressed in binary code are called machine language.

Low level language – Assembly Language

An assembly language is a low-level programming language, in which there is a very strong (generally one-to-one) correspondence between the language and machine code instructions.
Each assembly language is specific to a particular computer architecture
Assembly language is converted into executable machine code by a utility program referred to as an assembler

High-level languages: C, C++, Java, Python…

High-level languages cannot be executed directly.
They must first be converted into a low-level language.
Lower-level languages have higher run-time efficiency (they run faster on a computer).
Higher-level languages have higher development efficiency (it is easier to write programs in them).

Memory and addressing

A computer’s memory consists of an ordered sequence of bytes for storing data. Every location in the memory has a unique address.

The key difference between high-level and low-level programming languages is whether the programmer has to deal with memory addressing directly.

Operating Systems

The operating system (OS) is a low level program, which provides all basic services for managing and controlling a computer’s activities

Applications are programs which are built based upon an OS

Main functions of an OS:

  • Controlling and monitoring system activities
  • Allocating and assigning system resources
  • Scheduling operations

Popular operating systems: Windows, macOS, Linux, iOS, Android (built on the Linux kernel)…

The units of information (data)

  • Bit (比特/位): a binary digit which takes either 0 or 1.
    The bit is the smallest unit of information in computer programming.
  • Byte (字节): 1 byte = 8 bits; every English (ASCII) character is represented by 1 byte
  • KB (千字节): 1 KB = 2^10 B = 1024 B
  • MB (兆字节): 1 MB = 2^20 B = 1024 KB
  • GB (千兆字节): 1 GB = 2^30 B = 1024 MB
  • TB (兆兆字节): 1 TB = 2^40 B = 1024 GB

Number Systems

A numeral system (or system of numeration) is a writing system for expressing numbers; that is, a mathematical notation for representing numbers of a given set, using digits or other symbols in a consistent manner.

Each positional number system contains two elements, a base and a set of symbols. Using the decimal system as an example, its base is 10 and the symbols are, of course, numbers.

The decimal, binary and hexadecimal number systems are the ones most commonly used in computing.

Decimal Number System

In the decimal number system, the base is 10 and the symbols are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.

Every number can be decomposed into a sum of terms, each of which is a positional value times a weight:

$$N=a_n\times10^n+a_{n-1}\times10^{n-1}+\dots+a_0\times10^0+a_{-1}\times10^{-1}+a_{-2}\times10^{-2}+\dots$$

$a_n$ is the positional value (ranging from 0 to 9), while $10^n$ represents the weight.

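Binary Number System

In the binary number system, the base is 2 and the symbols are 0 and 1.

$$N=a_n\times2^n+a_{n-1}\times2^{n-1}+\dots+a_0\times2^0+a_{-1}\times2^{-1}+a_{-2}\times2^{-2}+\dots$$

$a_n$ is the positional value (either 0 or 1), while $2^n$ represents the weight.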

Hexadecimal Number System

In the hexadecimal system, the base is 16 and we use the 16 symbols {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f}. “10” is written when we reach sixteen (逢十六进一).

$$N=a_n\times16^n+a_{n-1}\times16^{n-1}+a_{n-2}\times16^{n-2}+\dots+a_0\times16^0+a_{-1}\times16^{-1}+a_{-2}\times16^{-2}+\dots$$

$a_n$ is the positional value (ranging from 0 to 15), while $16^n$ represents the weight.

Number System Conversion

There are many methods or techniques which can be used to convert numbers from one base to another.

Decimal to Other Base System

  • Step 1 − Divide the decimal number to be converted by the value of the new base.
  • Step 2 − Get the remainder from Step 1 as the rightmost digit (least significant digit) of new base number.
  • Step 3 − Divide the quotient of the previous divide by the new base.
  • Step 4 − Record the remainder from Step 3 as the next digit (to the left) of the new base number.

Repeat Steps 3 and 4, getting remainders from right to left, until the quotient becomes zero in Step 3.

The remainders have to be arranged in the reverse order so that the first remainder becomes the Least Significant Digit (LSD) and the last remainder becomes the Most Significant Digit (MSD).

Example

Decimal Number: 29 -> Binary Equivalent

Step | Operation | Result | Remainder
Step 1 | 29 / 2 | 14 | 1
Step 2 | 14 / 2 | 7 | 0
Step 3 | 7 / 2 | 3 | 1
Step 4 | 3 / 2 | 1 | 1
Step 5 | 1 / 2 | 0 | 1

Binary Number: 11101
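The repeated-division steps above can be sketched in Python as follows. The function name `to_base` is hypothetical; it handles non-negative integers and bases from 2 to 16, using the same lowercase hexadecimal symbols as this section.

```python
DIGITS = "0123456789abcdef"


def to_base(n: int, base: int) -> str:
    """Convert a non-negative decimal integer to the given base (2 to 16)."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, base)         # quotient for the next step, remainder now
        remainders.append(DIGITS[r])   # first remainder is the least significant digit
    # Reverse so that the last remainder becomes the most significant digit.
    return "".join(reversed(remainders))


print(to_base(29, 2))   # 11101
print(to_base(29, 16))  # 1d
```

Python's built-in `bin()` and `hex()` perform the same conversions for bases 2 and 16 and are convenient for checking results.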

Other Base System to Decimal System

  • Step 1 − Determine the column (positional) value of each digit (this depends on the position of the digit and the base of the number system).
  • Step 2 − Multiply the obtained column values (in Step 1) by the digits in the corresponding columns.
  • Step 3 − Sum the products calculated in Step 2. The total is the equivalent value in decimal.

Example

Binary Number: 11101 -> Decimal Equivalent

$$1\times2^4+1\times2^3+1\times2^2+0\times2^1+1\times2^0=16+8+4+0+1=29$$

Decimal Number: 29
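The three steps can be sketched in Python like this. The function name `to_decimal` is hypothetical; it accepts a string of digits in any base from 2 to 16.

```python
def to_decimal(digits: str, base: int) -> int:
    """Convert a digit string in the given base (2 to 16) to a decimal integer."""
    total = 0
    # Walk from the rightmost digit, whose positional weight is base ** 0.
    for position, ch in enumerate(reversed(digits)):
        value = int(ch, base)               # numeric value of a single digit
        total += value * base ** position   # digit times its positional weight
    return total


print(to_decimal("11101", 2))  # 29
print(to_decimal("1d", 16))    # 29
```

The built-in call `int("11101", 2)` does the same conversion in one step.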

Decimal Fraction to Binary

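You can convert a decimal fraction to binary by repeatedly multiplying the fractional part by 2. The integer part (the carry) produced at each step, read in order, gives the binary digits after the point. For example, 0.625 produces the carries 1, 0, 1, so 0.625 is 0.101 in binary.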
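A minimal Python sketch of this procedure follows. The function name `fraction_to_binary` and the `max_bits` cut-off are assumptions added for illustration; the cut-off is needed because many decimal fractions, such as 0.1, have no finite binary representation.

```python
def fraction_to_binary(frac: float, max_bits: int = 8) -> str:
    """Convert a fraction in [0, 1) to a binary string with at most max_bits digits."""
    bits = []
    while frac > 0 and len(bits) < max_bits:
        frac *= 2
        carry = int(frac)       # the integer part is the next binary digit
        bits.append(str(carry))
        frac -= carry           # keep only the fractional part
    return "0." + ("".join(bits) if bits else "0")


print(fraction_to_binary(0.625))  # 0.101
print(fraction_to_binary(0.1))    # truncated after 8 binary digits
```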

How does a program run?

A computer doesn’t actually understand the phrase 'Hello, world!', and it doesn’t know how to display it on screen. It only understands on and off. So to actually run a command like print('Hello, world!'), it has to translate all the code in a program into a series of ons and offs that it can understand.

To do that, a number of things happen:

  • The source code is translated into assembly language.
  • The assembly code is translated into machine language.
  • The machine language is directly executed as binary code.

Confused? Let’s go into a bit more detail. The coding language first has to translate its source code into assembly language, a super low-level language that uses words and numbers to represent binary patterns. Depending on the language, this may be done with an interpreter (where the program is translated line-by-line), or with a compiler (where the program is translated as a whole).

The coding language then sends off the assembly code to the computer’s assembler, which converts it into the machine language that the computer can understand and execute directly as binary code.

An interpreter (解释器) is a computer program that directly executes instructions written in a programming or scripting language, without first compiling them into a machine language program.

A compiler (编译器) is a computer program (or a set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language), with the latter often having a binary form known as object code.

Introduction to Python

What is Python, and why use it?

Python is a widely used high-level, general-purpose, interpreted, dynamic programming language. Its
design philosophy emphasizes code readability, and its syntax allows programmers to express
concepts in fewer lines of code than would be possible in languages such as C++ or Java. The
language provides constructs intended to enable clear programs on both a small and large scale.

Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles. It features a dynamic type system and automatic memory management and has a large and comprehensive standard library.

Python interpreters are available for many operating systems, allowing Python code to run on a wide variety of systems. Using third-party tools, such as Py2exe or Pyinstaller, Python code can be packaged into stand-alone executable programs for some of the most popular operating systems, so Python-based software can be distributed to, and used on, those environments with no need to install a Python interpreter.

CPython, the reference implementation of Python, is free and open-source software and has a community-based development model, as do nearly all of its variant implementations.

Install Python 3

In this course, we will use Python 3.5. Before the course formally starts, Python 3 must be installed on your computer.

If you do not know how to use PowerShell on Windows, Terminal on macOS or bash on Linux, you need to learn that first.

Why Python 3 and not Python 2? See the differences at https://wiki.python.org/moin/Python2orPython3

Windows Users:

  1. Open https://www.python.org/downloads/ in your browser
  2. Click Download Python 3.5.2, download Windows x86 executable installer
  3. Install

macOS Users (Recommended):

  1. Open your Terminal
  2. Type in the command shown here: http://brew.sh/
  3. Follow the instructions to install Homebrew, including the Xcode command-line tools
  4. After Homebrew is installed, type in brew install python3 virtualenv
  5. Type in python3 -V; if it shows Python 3.5.2, then everything is done.

Linux Users:

  1. Use your package manager to install Python 3

Python Syntax

Python uses indentation to identify program blocks. Here is a simple example of a Python script.

    from datetime import datetime


    def helloworld(name):
        # Greet by name, or complain if the name is empty
        if len(name) != 0:
            return "Hello World " + name
        else:
            return "Oh my god"


    if __name__ == "__main__":
        print(helloworld("Computer @nd Comity"))
        print(datetime.now())

Can you guess what will be on display?

Hello World Computer @nd Comity
2016-10-13 15:22:24.510837

Why? We will explain it in future courses.

IDE

What is an IDE? An integrated development environment (IDE) is a software application that provides comprehensive facilities to computer programmers for software development. An IDE normally consists of a source code editor, build automation tools and a debugger. Most modern IDEs have intelligent code completion.

We recommend PyCharm once you feel comfortable with Python. It is commercial software from JetBrains. Do we have to pay for it? No: as students, we can use the educational license.

Detailed information will not be shown here.

  1. Obtain JetBrains Education promotion here: https://www.jetbrains.com/students
  2. Download PyCharm here: https://www.jetbrains.com/pycharm

Basic Python

Comments

Anything after a “#” is ignored by Python.

Why comment?

  •  Describe what is going to happen in a sequence of code
  •  Document who wrote the code and other important information
  •  Turn off a line of code – usually temporarily

Variable

A variable is something that holds a value that may change. In simplest terms, a variable is just a
box that you can put stuff in. You can use variables to store all kinds of stuff, but for now, we are just
going to look at storing numbers in variables.

Rules for defining variables in Python

  •  Names must start with a letter or an underscore
  •  Names can only contain letters, numbers and underscores
  •  Names are case sensitive

Reserved words

The following identifiers are used as reserved words, or keywords of the language, and cannot be used as ordinary identifiers. They must be spelled exactly as written here:

False class finally is return None continue
for lambda try True def from nonlocal
while and del global not with as
elif if or yield assert else import
pass break except in raise
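
As a quick check of the rules above, the standard keyword module can report whether a word is reserved. The sketch below is only an illustration; the helper name is_valid_name is an assumption made for this example and is not part of Python.

```python
import keyword

# keyword.kwlist is the interpreter's own list of reserved words.
print(len(keyword.kwlist))


def is_valid_name(name):
    """Hypothetical helper: True if `name` can be used as a variable name."""
    return name.isidentifier() and not keyword.iskeyword(name)


print(is_valid_name("total_2"))   # letters, digits, underscore: fine
print(is_valid_name("2total"))    # starts with a digit: rejected
print(is_valid_name("import"))    # reserved word: rejected
```

The exact number of keywords depends on the Python version, so the first line is printed rather than assumed.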

Assign a variable

    a = "HELLO"

Multiple Assignment

Python allows you to assign a single value to several variables simultaneously. For example −

    a = b = c = 1

Here, an integer object is created with the value 1, and all three variables are assigned to the same memory location. You can also assign multiple objects to multiple variables. For example:

    a, b, c = 1, 2, "john"

Here, two integer objects with values 1 and 2 are assigned to variables a and b respectively, and one string object with the value "john" is assigned to the variable c.

Extensive Knowledge

When you assign to a variable you are binding the name to an object. From that point onwards you can refer to the object by using the name, until that name is rebound.

In the first example the name i is bound to the value 5. Binding different values to the name j does not have any effect on i, so when you later print the value of i the value is still 5.

    i = 5
    j = i
    j = 3
    print(i)

In the second example you bind both i and j to the same list object. When you modify the contents of the list, you can see the change regardless of which name you use to refer to the list.

    i = [1, 2, 3]
    j = i
    i[0] = 5
    print(j)

Note that it would be incorrect if you said “both lists have changed”. There is only one list but it has two names (i and j) that refer to it.
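
The two cases can be put side by side in one small script. This is a minimal sketch; the names first, second and third are chosen for the illustration only.

```python
# Rebinding a name leaves the other name untouched.
i = 5
j = i
j = 3
print(i, j)

# Mutating a shared list is visible through both names.
first = [1, 2, 3]
second = first            # same object, second name
second[0] = 5
print(first is second, first)

# Copying creates a new, independent list.
third = list(first)
third[0] = 99
print(first, third)
```

If two independent lists are wanted, copy the list (as with third above) instead of binding a second name to it.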

Reference: http://stackoverflow.com/questions/13530998/python-variables-are-pointers

Operators

We can easily do numeric operations in Python; in fact, you can use it as a simple calculator!

Basic mathematical operators

Operator   Description
+          add
-          subtract
*          multiply
/          divide
**         exponentiation
( )        parentheses
//         floor division
%          modulo, find the remainder

Operator precedence

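Precedence rules, from highest to lowest:

  • Parentheses always have the highest priority
  • Exponentiation
  • Multiplication, division, floor division and remainder
  • Addition and subtraction
  • Operators of equal precedence are evaluated left to right, except exponentiation, which groups right to left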
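
A few expressions make these rules concrete. This is only an illustrative sketch; note that exponentiation is the one arithmetic operator that groups from right to left.

```python
print(2 + 3 * 4)        # multiplication before addition -> 14
print((2 + 3) * 4)      # parentheses first -> 20
print(2 * 3 ** 2)       # exponentiation before multiplication -> 18
print(2 ** 3 ** 2)      # groups as 2 ** (3 ** 2) -> 512
print(10 - 4 - 3)       # equal precedence, left to right -> 3
print(7 / 2, 7 // 2, 7 % 2)   # -> 3.5 3 1
```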

Logical operators

Logical operators can be used to combine several logical expressions into a single expression.

Python has three logical operators: not, and, or
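
The following sketch shows the three operators at work. The variables age and has_ticket are made-up example values.

```python
age = 20             # example value
has_ticket = False   # example value

print(age >= 18 and has_ticket)   # both must be true -> False
print(age >= 18 or has_ticket)    # at least one must be true -> True
print(not has_ticket)             # negation -> True

# and/or short-circuit: the right operand is skipped when the left decides.
print(has_ticket and 1 / 0)       # no ZeroDivisionError -> False
```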

Comparison operators

Boolean expressions ask a question and produce a Yes/No result, which we use to control program flow. Boolean expressions use comparison operators to evaluate Yes/No or True/False.

Comparison operators check variables but do not change the values of variables.

Operators Description
x < y Is x less than y?
x <= y Is x less than or equal to y?
x == y Is x equal to y?
x >= y Is x greater than or equal to y?
x > y Is x greater than y?
x != y Is x not equal to y?

Careful! "=" is used for assignment, while "==" tests for equality.

Indentation

  • Increase indent: indent after an if or for statement (after :)
  • Maintain indent: to indicate the scope of the block (which lines are affected by the if/for)
  • Decrease indent: go back to the level of the if or for statement to indicate the end of the block
  • Blank lines are ignored – they do not affect indentation
  • Comments on a line by themselves are ignored with respect to indentation
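
The rules above can be seen in a short loop. This is a minimal sketch with made-up data; the comments mark where the indentation level changes.

```python
total = 0
for n in [1, 2, 3, 4]:
    # increase indent: this block belongs to the for loop
    if n % 2 == 0:
        # increase again: this line runs only for even n
        total += n

    # the blank line above does not end the block
    print("checked", n)
# decrease indent: the loop has ended
print("sum of even numbers:", total)
```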

Evaluate

The eval() function takes a string argument and evaluates that string as a Python expression, i.e., just as if the programmer had directly entered the expression as code. The function returns the result of that expression.

eval() gives programmers the flexibility to determine what to execute at run-time.

One should be cautious about using it in situations where users could potentially cause problems with “inappropriate” input.
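
The sketch below shows eval() on a trusted string and, as one possible precaution for untrusted text, the standard ast.literal_eval function, which accepts only literals. The string user_text is a made-up example of hostile input.

```python
import ast

x = 4
print(eval("x * 2 + 1"))          # evaluates with the current variables -> 9

# ast.literal_eval only accepts literals, so it is safer for untrusted text.
print(ast.literal_eval("[1, 2, 3]"))

user_text = "__import__('os').getcwd()"   # hypothetical hostile input
try:
    ast.literal_eval(user_text)
except ValueError:
    print("rejected: not a plain literal")
```

Passing user_text to eval() instead would run the call inside it, which is exactly the kind of problem the warning above is about.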

Data Type

Numbers

Python 3 supports three different numerical types:

  • int (signed integers)
  • float (floating point real values)
  • complex (complex numbers)

Note that int(1.23) does not raise an exception. Instead, it returns an int object with the value 1: the int class performs the conversion by discarding the fractional part.

Floating point real values

Floating-point numbers (float type) are numbers with a decimal point or an exponent (or both). Examples are 5.0, 10.24, 0.0, 12. and .3. We can use scientific notation to denote very large or very small floating-point numbers, e.g. 3.8 x 10^15. The first part of the number, 3.8, is the mantissa and 15 is the exponent. We can think of the exponent as the number of times we have to move the decimal point to the right to get to the actual value of the number.

In Python, we can write the number 3.8 x 10^15 as 3.8e15 or 3.8e+15. We can also write it as 38e14 or .038e17. They are all the same value. A negative exponent indicates smaller numbers, e.g. 2.5e-3 is the same as 0.0025. Negative exponents can be thought of as how many times we have to move the decimal point to the left. Negative mantissa indicates that the number itself is negative, e.g. -2.5e3 equals -2500 and -2.5e-3 equals -0.0025.

Be aware that Python stores floats in binary, so most decimal fractions cannot be represented exactly. float(3.2) is not exactly 3.2 but the nearest representable value, approximately 3.2000000000000001776. This rarely matters in everyday applications, but for precise scientific or financial calculations you may want the standard decimal module or a third-party package.
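
The effect is easy to observe. The sketch below uses only the standard library; math.isclose and decimal.Decimal are shown as two possible ways to cope with rounding, not as the only ones.

```python
import math
from decimal import Decimal

print(3.8e15 == 38e14 == .038e17)    # same value, different spellings
print(0.1 + 0.2 == 0.3)              # False: rounding error in binary
print(math.isclose(0.1 + 0.2, 0.3))  # compare with a tolerance instead

# Decimal works in base 10, so this sum is exact.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
print(int(1.23), int(-1.7))          # int() drops the fractional part
```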

List

Lists are the most versatile of Python’s compound data types. A list contains items separated by commas and enclosed within square brackets ([]). To some extent, lists are similar to arrays in C. One difference between them is that the items belonging to a list can be of different data types.

The values stored in a list can be accessed using the slice operator ([ ] and [:]) with indexes starting at 0 in the beginning of the list and working their way to end -1. The plus (+) sign is the list concatenation operator, and the asterisk (*) is the repetition operator.

For example,

    >>> List = [1, 2, "3", int(4.2)]  # This name is legal: the built-in list type is named "list", not "List"
    >>> List[0]  # Access the first element. Note that in Python, indexing starts at 0.
    1
    >>> List[0:2]  # Elements at indexes 0 and 1
    [1, 2]
    >>> List[0:3:2]  # Every second element from index 0 up to, but not including, index 3
    [1, '3']
    >>> List[-1]  # Access the last element.
    4

Slice Operator

What is a slice operator? The slice operator ([ ] and [:]) is to slice a list (of course).

It’s pretty simple really:

a[start:end] # items start through end-1
a[start:]    # items start through the rest of the array
a[:end]      # items from the beginning through end-1
a[:]         # a copy of the whole array

There is also the step value, which can be used with any of the above:

a[start:end:step] # start through not past end, by step

The key point to remember is that the :end value represents the first value that is not in the selected slice. So, the difference between end and start is the number of elements selected (if step is 1, the default).

The other feature is that start or end may be a negative number, which means it counts from the end of the array instead of the beginning. So:

a[-1]    # last item in the array
a[-2:]   # last two items in the array
a[:-2]   # everything except the last two items

Python is kind to the programmer if there are fewer items than you ask for. For example, if you ask for a[:-2] and a only contains one element, you get an empty list instead of an error. Sometimes you would prefer the error, so you have to be aware that this may happen.
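
Here is a small runnable sketch of the notation, using a made-up list of five numbers.

```python
a = [10, 20, 30, 40, 50]

print(a[1:4])     # [20, 30, 40]
print(a[:2])      # [10, 20]
print(a[::2])     # [10, 30, 50]
print(a[-2:])     # [40, 50]
print(a[:-2])     # [10, 20, 30]
print(len(a[1:4]) == 4 - 1)   # end - start elements when step is 1

short = [7]
print(short[:-2])  # [] instead of an error
copy = a[:]
print(copy == a, copy is a)   # equal contents, different object
```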

Reference: http://stackoverflow.com/questions/509211/explain-pythons-slice-notation

Loop

We can loop over a list using for element in <list>.
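
For instance, a minimal sketch with a made-up list of fruit names:

```python
fruits = ["apple", "banana", "cherry"]   # example data, not from the course

lengths = []
for fruit in fruits:
    # fruit is bound to each element in turn
    print(fruit)
    lengths.append(len(fruit))

print(lengths)
```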

Tuple

A tuple is another sequence data type that is similar to the list. A tuple consists of a number of values separated by commas. Unlike lists, however, tuples are enclosed within parentheses.

The main differences between lists and tuples are: Lists are enclosed in brackets ( [ ] ) and their elements and size can be changed, while tuples are enclosed in parentheses ( ( ) ) and cannot be updated. Tuples can be thought of as read-only lists.
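
The read-only behaviour can be checked directly. In this sketch the tuple point is a made-up example; trying to assign to one of its elements raises a TypeError.

```python
point = (3, 4)
print(point[0], point[-1])   # indexing works as for lists
print(point + (5,))          # concatenation builds a new tuple

try:
    point[0] = 10            # tuples cannot be updated in place
except TypeError as error:
    print("TypeError:", error)

single = (42,)               # the trailing comma makes a one-element tuple
print(type(single).__name__, type((42)).__name__)
```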

String

Strings in Python are identified as a contiguous set of characters represented in the quotation marks. Python allows for either pairs of single or double quotes. Subsets of strings can be taken using the slice operator ([ ] and [:] ) with indexes starting at 0 in the beginning of the string and working their way from -1 at the end.

The plus (+) sign is the string concatenation operator and the asterisk (*) is the multiple concatenation operator.

For example,

s = 'Hello' + 'LGU'  # s = 'HelloLGU'
s = 'A'*10             # s = 'AAAAAAAAAA'

A string behaves much like a list of characters or, more accurately, like a tuple, because it cannot be changed in place. If we want to update a string, what can we do?

    s = 'Hello LGU'

    # Suppose we want to turn it into 'Hello.LGU'.
    # s[5] = '.' would raise TypeError, because strings are immutable.
    # Instead, we can bind the name to a new string:
    s = 'Hello.LGU'
    # ... or build the new string from the old one:
    s = 'Hello LGU'.replace(' ', '.')

    # A string is a sequence, so it converts to a list of characters.
    >>> print(list(s))
    ['H', 'e', 'l', 'l', 'o', '.', 'L', 'G', 'U']
    # str() on a list gives the list's printed form, so use join() to go back.
    >>> print(''.join(['H', 'e', 'l', 'l', 'o', '.', 'L', 'G', 'U']))
    Hello.LGU

Dictionaries

Python’s dictionaries are a kind of hash table. They work like associative arrays or hashes found in Perl and consist of key-value pairs. A dictionary key can be almost any Python type (it must be hashable), but keys are usually numbers or strings. Values, on the other hand, can be any arbitrary Python object.

Dictionaries are enclosed by curly braces ({ }) and values can be assigned and accessed using square brackets ([]). In Python 3.5, the version used in this course, the elements of a dict have no guaranteed order; since Python 3.7, a dict keeps its keys in insertion order.

Example:

    d = {'lgu': 'cuhksz',
         'cuhk': 'shatin',
         631: ['Guangdong', 'Zhejiang']}
    print(d)  # what will be printed?
    # d[0] raises KeyError: a dict is indexed by key, not by position, so there is no "first element" to ask for.

We can also loop over a dictionary. (Since Python 3.7 the keys come out in insertion order, as shown here; in Python 3.5 the order may differ.)

    >>> for element in d:  # Iterating over a dict yields its keys, not key-value pairs.
    ...     print(element)
    lgu
    cuhk
    631
    >>> for key, value in d.items():  # items() yields each key together with its value.
    ...     print(key, value)
    lgu cuhksz
    cuhk shatin
    631 ['Guangdong', 'Zhejiang']
    >>> for value in d.values():  # values() yields only the values.
    ...     print(value)
    cuhksz
    shatin
    ['Guangdong', 'Zhejiang']

Boolean (bool)

Python contains a built-in Boolean type, which takes two values: True and False.

The number 0 can also be used to represent False. All other numbers represent True.

Example:

    >>> bool(11)
    True
    >>> bool(-1)
    True
    >>> bool(0)
    False
    >>> bool("abc")
    True

Data Type Conversion

Sometimes, you may need to perform conversions between the built-in types. To convert between types, you simply use the type name as a function. Most conversions have been shown above.
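
As a compact recap, the sketch below calls several type names as functions. The values are made up for the illustration.

```python
print(int("42") + 1)        # str -> int
print(float("3.5") * 2)     # str -> float
print(str(42) + "!")        # int -> str
print(list("abc"))          # str -> list of characters
print(tuple([1, 2, 3]))     # list -> tuple
print(bool(""), bool([]), bool("abc"))   # empty values are False

try:
    int("4.2")              # not every string converts
except ValueError as error:
    print("ValueError:", error)
```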

Logic Statements

Reference: http://python-textbok.readthedocs.io/en/latest/Selection_Control_Statements.html

In procedurally written code, the computer usually executes instructions in the order that they appear. However, this is not always the case. One of the ways in which programmers can change the flow of control is the use of selection control statements.

Selection statements allow a program to choose when to execute certain instructions. For example, a program might choose how to proceed on the basis of the user’s input. As you will be able to see, such statements make a program more versatile.

Selection: if statement

People make decisions on a daily basis. What should I have for lunch? What should I do this weekend? Every time you make a decision you base it on some criterion. For example, you might decide what to have for lunch based on your mood at the time, or whether you are on some kind of diet. After making this decision, you act on it. Thus decision-making is a two step process – first deciding what to do based on a criterion, and secondly taking an action.

Decision-making by a computer is based on the same two-step process. In Python, decisions are made with the if statement, also known as the selection statement. When processing an if statement, the computer first evaluates some criterion or condition. If it is met, the specified action is performed. Here is the syntax for the if statement:

    if condition:
        if_body

When it reaches an if statement, the computer executes the body of the statement only if the condition is true. Here is an example in Python, with a corresponding flowchart:

    if age < 18:
        print("Cannot vote")

[Figure: flowchart of the if statement]

As we can see from the flowchart, the instructions in the if body are only executed if the condition is met (i.e. if it is true). If the condition is not met (i.e. false), the instructions in the if body are skipped.

The else clause

An optional part of an if statement is the else clause. It allows us to specify an alternative instruction (or set of instructions) to be executed if the condition is not met:

    if condition:
        if_body
    else:
        else_body

To put it another way, the computer will execute the if body if the condition is true, otherwise it will execute the else body. In the example below, the computer will add 1 to x if it is zero, otherwise it will subtract 1 from x:

    if x == 0:
        x += 1
    else:
        x -= 1

This flowchart represents the same statement:

[Figure: flowchart of the if-else statement]

The computer will execute one of the branches before proceeding to the next instruction.

Value vs identity

So far, we have only compared integers in our examples. We can also use any of the above relational operators to compare floating-point numbers, strings and many other types:

    # we can compare the values of strings
    if name == "Jane":
        print("Hello, Jane!")

    # ... or floats
    if size < 10.5:
        print(size)

When comparing variables using ==, we are doing a value comparison: we are checking whether the two variables have the same value. In contrast to this, we might want to know if two objects such as lists, dictionaries or custom objects that we have created ourselves are the exact same object. This is a test of identity. Two objects might have identical contents, but be two different objects. We compare identity with the is operator:

    a = [1, 2, 3]
    b = [1, 2, 3]

    if a == b:
        print("These lists have the same value.")

    if a is b:
        print("These lists are the same list.")

It is generally the case (with some caveats) that if two variables are the same object, they are also equal. The reverse is not true – two variables could be equal in value, but not the same object.

To test whether two objects are not the same object, we can use the is not operator:

    if a is not b:
        print("a and b are not the same object.")

Note: In many cases, variables of built-in immutable types which have the same value will also be identical. In some cases this is because the Python interpreter saves memory (and comparison time) by representing multiple values which are equal by the same object. You shouldn’t rely on this behaviour and make value comparisons using is – if you want to compare values, always use ==.
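
The difference between value and identity is easiest to see with three names. This is a minimal sketch; the variable names are chosen for the illustration.

```python
a = [1, 2, 3]
b = [1, 2, 3]     # equal contents, separate object
c = a             # second name for the same object

print(a == b, a is b)   # True False
print(a == c, a is c)   # True True

c.append(4)             # changes the one list that a and c share
print(a, b)

# None is a single object, so identity is the usual test for it.
result = None
print(result is None)
```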

Nested if statements

In some cases you may want one decision to depend on the result of an earlier decision. For example, you might only have to choose which shop to visit if you decide that you are going to do your shopping, or what to have for dinner after you have made a decision that you are hungry enough for dinner.

[Figure: flowchart of a nested decision]

In Python this is equivalent to putting an if statement within the body of either the if or the else clause of another if statement. The following code fragment calculates the cost of sending a small parcel. The post office charges R5 for the first 300g, and R2 for every 100g thereafter (rounded up), up to a maximum weight of 1000g:

    import math

    if weight <= 1000:
        if weight <= 300:
            cost = 5
        else:
            # math.ceil rounds up, so every started 100g is charged
            cost = 5 + 2 * math.ceil((weight - 300) / 100)

        print("Your parcel will cost R%d." % cost)

    else:
        print("Maximum weight for small parcel exceeded.")
        print("Use large parcel service instead.")

Note that the bodies of the outer if and else clauses are indented, and the bodies of the inner if and else clauses are indented one more time. It is important to keep track of indentation, so that each statement is in the correct block. It doesn’t matter that there’s an empty line between the last line of the inner if statement and the following print statement – they are still both part of the same block (the outer if body) because they are indented by the same amount. We can use empty lines (sparingly) to make our code more readable.

The elif clause and if ladders

The addition of the else keyword allows us to specify actions for the case in which the condition is false. However, there may be cases in which we would like to handle more than two alternatives. For example, here is a flowchart of a program which works out which grade should be assigned to a particular mark in a test:

[Figure: flowchart of the grading program]

We should be able to write a code fragment for this program using nested if statements. It might look something like this:

    if mark >= 80:
        grade = "A"
    else:
        if mark >= 65:
            grade = "B"
        else:
            if mark >= 50:
                grade = "C"
            else:
                grade = "D"

This code is a bit difficult to read. Every time we add a nested if, we have to increase the indentation, so all of our alternatives are indented differently. We can write this code more cleanly using elif clauses:

    if mark >= 80:
        grade = "A"
    elif mark >= 65:
        grade = "B"
    elif mark >= 50:
        grade = "C"
    else:
        grade = "D"

Now all the alternatives are clauses of one if statement, and are indented to the same level. This is called an if ladder. Here is a flowchart which more accurately represents this code:

[Figure: flowchart of the if ladder]

The default (catch-all) condition is the else clause at the end of the statement. If none of the conditions specified earlier is matched, the actions in the else body will be executed. It is a good idea to include a final else clause in each ladder to make sure that we are covering all cases, especially if there’s a possibility that the options will change in the future. Consider the following code fragment:

    if course_code == "CSC":
        department_name = "Computer Science"
    elif course_code == "MAM":
        department_name = "Mathematics and Applied Mathematics"
    elif course_code == "STA":
        department_name = "Statistical Sciences"
    else:
        department_name = None
        print("Unknown course code: %s" % course_code)

    if department_name:
        print("Department: %s" % department_name)

What if we unexpectedly encounter an informatics course, which has a course code of "INF"? The catch-all else clause will be executed, and we will immediately see a printed message that this course code is unsupported. If the else clause were omitted, we might not have noticed that anything was wrong until we tried to use department_name and discovered that it had never been assigned a value. Including the else clause helps us to pick up potential errors caused by missing options early.
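
To try an if ladder on several inputs, it can be wrapped in a function. The sketch below reuses the grade boundaries from this section; the function name letter_grade is an assumption made for the example.

```python
def letter_grade(mark):
    """Hypothetical helper: map a numeric mark to a grade with an if ladder."""
    if mark >= 80:
        return "A"
    elif mark >= 65:
        return "B"
    elif mark >= 50:
        return "C"
    else:
        # catch-all: every mark below 50 ends up here
        return "D"


for mark in [80, 79, 65, 50, 49]:
    print(mark, letter_grade(mark))
```

Checking values right at the boundaries (80, 65, 50) is a quick way to confirm that each condition uses the intended comparison.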

If statement in one line?

Look back at the previous example.

    if x == 0:
        x += 1
    else:
        x -= 1

It is clear, but it takes 4 lines. Some may say, can I turn it into one line?

Of course you can :)

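    x += 1 if x == 0 else -1

The syntax is value_if_true if condition else value_if_false. Note that this is an expression that produces a value, not a pair of statements: here it yields 1 or -1, which is then added to x. It is easy to understand, right? One reminder: keep your code understandable.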
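
The one-line form can be checked against the four-line version. In this sketch the function step and the variables age and status are made up for the illustration.

```python
def step(x):
    """Hypothetical helper: add 1 if x is zero, otherwise subtract 1."""
    x += 1 if x == 0 else -1
    return x


print(step(0), step(5))          # 1 4

age = 16                         # example value
status = "adult" if age >= 18 else "minor"
print(status)
```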

Loop statement

A loop statement allows us to execute a statement or group of statements multiple times.

Python provides the following types of loops to handle looping requirements.

Loop Type       Description
while loop      Repeats a statement or group of statements while a given condition is true. It tests the condition before executing the loop body.
for loop        Executes a sequence of statements multiple times and abbreviates the code that manages the loop variable.
nested loops    You can use one or more loops inside another while or for loop.

Loop control statements change execution from its normal sequence. Python supports the following control statements.

Control Statement     Description
break statement       Terminates the loop statement and transfers execution to the statement immediately following the loop.
continue statement    Causes the loop to skip the remainder of its body and immediately retest its condition prior to reiterating.
pass statement        Used when a statement is required syntactically but you do not want any command or code to execute.
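
The three control statements can be seen together in a short sketch. The numbers are arbitrary example values.

```python
found = []
for n in range(1, 10):
    if n % 2 == 0:
        continue          # skip even numbers, go to the next iteration
    if n > 7:
        break             # leave the loop entirely
    found.append(n)
print(found)              # [1, 3, 5, 7]

for n in range(3):
    pass                  # placeholder: a body is required, nothing to do
print(n)                  # the loop variable keeps its last value -> 2
```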

While loop statement

A while loop statement in Python programming language repeatedly executes a target statement as long as a given condition is true.

The syntax of a while loop in Python programming language is −

    while expression:
        statement(s)

Here, statement(s) may be a single statement or a block of statements with uniform indent. The condition may be any expression; any non-zero number or non-empty object counts as true. The loop iterates while the condition is true.
When the condition becomes false, program control passes to the line immediately following the loop.
In Python, all the statements indented by the same number of character spaces after a programming construct are considered to be part of a single block of code. Python uses indentation as its method of grouping statements.

A key point of the while loop is that the loop body might never run. When the condition is tested and the result is false, the loop body will be skipped and the first statement after the while loop will be executed.
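
A small countdown shows both behaviours: a loop that runs until its condition turns false, and a loop whose body never runs. The variable names are chosen for this sketch.

```python
count = 3
while count > 0:
    print("count is", count)
    count -= 1            # without this line the loop would never end
print("after loop:", count)

runs = 0
while count > 0:          # already false, so the body is skipped
    runs += 1
print("second loop ran", runs, "times")
```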

The Infinite Loop

A loop becomes an infinite loop if its condition never becomes false. Use caution with while loops, because if the condition never resolves to a false value, the loop never ends.

An infinite loop might be useful in client/server programming where the server needs to run continuously so that client programs can communicate with it as and when required.

For loop statement

The for statement in Python has the ability to iterate over the items of any sequence, such as a list or a string.

    for iterating_var in sequence:
        statement(s)

If a sequence contains an expression list, it is evaluated first. Then, the first item in the sequence is assigned to the iterating variable iterating_var. Next, the statements block is executed. Each item in the list is assigned to iterating_var, and the statement(s) block is executed until the entire sequence is exhausted.
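
The same pattern works for any sequence. The sketch below iterates over a string and over a list; the data is made up for the illustration.

```python
vowels = 0
for letter in "Computer":        # a string is a sequence of characters
    if letter.lower() in "aeiou":
        vowels += 1
print(vowels)                    # 3

total = 0
for price in [2.5, 4.0, 1.5]:    # example data
    total += price
print(total)                     # 8.0
```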

The range() function

The built-in function range() is the right tool for iterating over a sequence of numbers. It returns a range object representing an arithmetic progression. For details, read the built-in help using help(range).
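
A few calls show how the start, stop and step arguments work. Wrapping the result in list() is only done here to make the numbers visible.

```python
print(list(range(5)))           # [0, 1, 2, 3, 4]
print(list(range(2, 6)))        # [2, 3, 4, 5]
print(list(range(0, 10, 3)))    # [0, 3, 6, 9]
print(list(range(5, 0, -1)))    # [5, 4, 3, 2, 1]

squares = []
for i in range(1, 4):
    squares.append(i ** 2)
print(squares)                  # [1, 4, 9]
```

As with slices, the stop value itself is never included.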

Using else Statement with Loops

Python allows an else clause to be attached to a loop statement.

If the else clause is used with a for loop, it is executed when the loop has finished iterating over the sequence.

If the else clause is used with a while loop, it is executed when the condition becomes false.

In both cases, the else clause is skipped if the loop is terminated by a break statement.
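
A typical use is a search loop: the else clause runs only when the loop finishes without hitting break. The function first_even below is a hypothetical helper written for this sketch.

```python
def first_even(numbers):
    """Hypothetical helper: report the first even number, if any."""
    for n in numbers:
        if n % 2 == 0:
            message = "found %d" % n
            break                 # leaving via break skips the else clause
    else:
        message = "no even number"
    return message


print(first_even([3, 5, 8, 9]))
print(first_even([3, 5, 7]))
print(first_even([]))
```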

Exception

An exception is an event, which occurs during the execution of a program that disrupts the normal flow of the program’s instructions. In general, when a Python script encounters a situation that it cannot cope with, it raises an exception. An exception is a Python object that represents an error.

When a Python script raises an exception, it must either handle the exception immediately or terminate and quit.

Handling an exception

If you have some suspicious code that may raise an exception, you can defend your program by placing the suspicious code in a try: block. After the try: block, include an except: statement, followed by a block of code which handles the problem as elegantly as possible.

    try:
        You do your operations here
        ......................
    except ExceptionI:
        If there is ExceptionI, then execute this block.
    except ExceptionII:
        If there is ExceptionII, then execute this block.
        ......................
    except (Exception1[, Exception2[,...ExceptionN]]):
        If there is any exception from the given list, then execute this block.
    except:
        If there is any other exception, then execute this block. A bare except must come last.
    else:
        If there is no exception, then execute this block.
    finally:
        This would always be executed.

Here are a few important points about the above-mentioned syntax:

  • A single try statement can have multiple except statements. This is useful when the try block contains statements that may throw different types of exceptions.
  • You can also provide a generic except clause, which handles any exception.
  • After the except clause(s), you can include an else-clause. The code in the else-block executes if the code in the try: block does not raise an exception.
  • The else-block is a good place for code that does not need the try: block’s protection.
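
The points above can be tried out with a small division routine. This is a sketch only; safe_divide and its log list are hypothetical names introduced to make the order of execution visible.

```python
def safe_divide(a, b):
    """Hypothetical helper: divide a by b, returning None on failure."""
    log = []
    try:
        result = a / b
    except ZeroDivisionError:
        log.append("division by zero")
        result = None
    except TypeError:
        log.append("operands must be numbers")
        result = None
    else:
        log.append("ok")          # runs only if no exception was raised
    finally:
        log.append("done")        # runs in every case
    return result, log


print(safe_divide(6, 3))
print(safe_divide(1, 0))
print(safe_divide("a", 2))
```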

Raise an exception

You can raise exceptions by using the raise statement. In Python 3, the general syntax for the raise statement is as follows.

raise [Exception[(args)] [from cause]]

Here, Exception is the type of exception (for example, NameError) and args are the values passed to the exception, typically an error message. The arguments are optional. A bare raise, with nothing after it, re-raises the exception that is currently being handled.

The from clause is also optional (and rarely needed in simple scripts). If present, it records another exception as the cause of the new one. The older three-argument form, raise Exception, args, traceback, is Python 2 syntax and is not valid in Python 3.
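
As an illustration in Python 3 syntax, the sketch below raises ValueError for bad input and uses raise ... from ... to keep the original error as the cause. The function parse_age is a hypothetical helper, not part of the course material.

```python
def parse_age(text):
    """Hypothetical helper: convert text to an age, raising ValueError for bad input."""
    try:
        age = int(text)
    except ValueError as error:
        # chain the original error as the cause of a clearer one
        raise ValueError("not a number: %r" % text) from error
    if age < 0:
        raise ValueError("age cannot be negative")
    return age


print(parse_age("18"))
try:
    parse_age("-3")
except ValueError as error:
    print("caught:", error)
```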

Functions

A function is a block of organized, reusable code that is used to perform a single, related action. Functions provide better modularity for your application and a high degree of code reuse.

As you already know, Python gives you many built-in functions like print(), etc. but you can also create your own functions. These functions are called user-defined functions.

The names of built-in functions are not reserved words, but we usually treat them as if they were, i.e. we do not use them as variable names.

Defining a Function

You can define functions to provide the required functionality. Here are simple rules to define a function in Python.

  • Function blocks begin with the keyword def followed by the function name and parentheses ( ( ) ).
  • Any input parameters or arguments should be placed within these parentheses. You can also define parameters inside these parentheses.
  • The first statement of a function can be an optional statement - the documentation string of the function or docstring.
  • The code block within every function starts with a colon (:) and is indented.
  • The statement return [expression] exits a function, optionally passing back an expression to the caller. A return statement with no arguments is the same as return None.
  • If a function does not return a value explicitly, it returns None by default. Such a function is sometimes called a void function.

    def functionname(parameters):
        "function_docstring"
        function_suite
        return [expression]
        print(abc)  # never executed: the function has already exited at the return statement
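
Filled in with real statements, the template might look like the sketch below. Both functions are made-up examples: one returns a value, the other relies on the default return value None.

```python
def rectangle_area(width, height):
    """Return the area of a rectangle (hypothetical example function)."""
    return width * height


def greet(name):
    """Print a greeting; there is no return statement."""
    print("Hello,", name)


print(rectangle_area(3, 4))        # 12
print(greet("LGU"))                # prints the greeting, then None
print(rectangle_area.__doc__)      # the docstring is stored on the function
```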

Function Arguments

You can call a function by using the following types of formal arguments:

  • Required arguments
  • Keyword arguments
  • Default arguments
  • Variable-length arguments

Required arguments

Required arguments are the arguments passed to a function in correct positional order. Here, the number of arguments in the function call should match exactly with the function definition.

Keyword arguments

Keyword arguments are related to the function calls. When you use keyword arguments in a function call, the caller identifies the arguments by the parameter name.

This allows you to skip arguments or place them out of order because the Python interpreter is able to use the keywords provided to match the values with parameters.

The following example gives a clear picture. Note that the order of the arguments does not matter.

    #!/usr/bin/python3
    # Function definition is here
    def printinfo(name, age):
        "This prints the info passed into this function"
        print("Name:", name)
        print("Age:", age)
        return

    # Now you can call the printinfo function
    printinfo(age=50, name="miki")

Default arguments

A default argument is an argument that assumes a default value if a value is not provided in the function call for that argument. The following example illustrates default arguments: it prints the default age if no age is passed.

    def printinfo(name, age=35):
        "This prints the info passed into this function"
        print("Name:", name)
        print("Age:", age)
        return

    # Now you can call the printinfo function
    printinfo(age=50, name="miki")
    printinfo(name="miki")

Variable-length arguments

You may need to process a function for more arguments than you specified while defining the function. These arguments are called variable-length arguments and are not named in the function definition, unlike required and default arguments.

Syntax for a function with non-keyword variable arguments is this −

    def functionname([formal_args,] *var_args_tuple):
        "function_docstring"
        function_suite
        return [expression]

An asterisk (*) is placed before the variable name that holds the values of all non-keyword variable arguments. This tuple remains empty if no additional arguments are specified during the function call. Following is a simple example −

    def printinfo(arg1, *vartuple):
        "This prints the variable-length arguments passed in"
        print("Output is:")
        print(arg1)
        for var in vartuple:
            print(var)
        return

    # Now you can call the printinfo function
    printinfo(10)
    printinfo(70, 60, 50)
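
The argument types described in this section can be combined in one signature. The sketch below is only an illustration; describe and total are hypothetical functions written for it.

```python
def describe(name, age=35, *extras):
    """Hypothetical helper combining required, default and variable-length arguments."""
    return "%s, %d, extras=%r" % (name, age, extras)


def total(*numbers):
    """Hypothetical helper: add up any number of arguments."""
    return sum(numbers)


print(describe("miki"))                 # default age, empty tuple
print(describe("miki", 50))             # positional override of the default
print(describe(age=50, name="miki"))    # keyword arguments in any order
print(describe("miki", 50, "a", "b"))   # extra values are collected in a tuple
print(total(), total(1, 2, 3))
```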

The return Statement

The statement return [expression] exits a function, optionally passing back an expression to the caller. A return statement with no arguments is the same as return None.

Return multiple values

Python allows a function to return multiple values, which are packed into a tuple. The sort function below returns two values, the larger one first; when it is invoked, you can unpack the returned values with a simultaneous assignment. Try to run it!

    def sort(a, b):
        if a > b:
            return a, b
        else:
            return b, a

    r1, r2 = sort(1, 2)
    print(r1, r2)