[2011.04.28] The science of science

2011-5-5 08:27 | Posted by: Somers | Views: 10290 | Comments: 12 | Original author: 胖白兔

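Summary: How to use the web to understand the way ideas evolve

Organising the web
The science of science
How to use the web to understand the way ideas evolve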

Apr 28th 2011 | from the print edition

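Computer scientists have long tried to impose order on the explosion of data that is the internet. One obvious approach is to organise information by topic, but comprehensively tagging all of it by hand is impossible. David Blei of Princeton University has therefore been trying to get machines to do the job.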

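He begins by defining topics as sets of words that tend to appear in the same document. “Big Bang” and “black hole”, for example, often occur together, though not as often as each occurs with “galaxy”, and neither would be expected to turn up next to “genome”. This captures the intuition that the first three terms, but not the fourth, belong to a single topic. Much depends, of course, on how narrow a topic is meant to be, but the model Dr Blei developed with John Lafferty of Carnegie Mellon University allows for that.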
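The co-occurrence idea in the paragraph above can be illustrated with a small counting sketch. This is only an illustration: the function name `cooccurrence_counts`, the four sample sentences, and the plain substring matching are assumptions made here, not part of the model described in the article.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, terms):
    """Count, for every pair of terms, how many documents contain both."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        # Terms found in this document (naive substring match, an assumption).
        present = sorted(t for t in terms if t in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Hypothetical toy corpus, invented for illustration only.
docs = [
    "The Big Bang seeded every galaxy we observe.",
    "A black hole sits at the centre of the galaxy.",
    "Big Bang models and black hole physics meet in the early galaxy.",
    "The genome of the fruit fly was sequenced.",
]
terms = ["big bang", "black hole", "galaxy", "genome"]
counts = cooccurrence_counts(docs, terms)

for pair, n in sorted(counts.items()):
    print(pair, n)
```

In this toy corpus each of the first two terms shares more documents with “galaxy” than with the other, and “genome” shares none with any of them, which is the pattern the paragraph describes.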

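The user decides how fine-grained the analysis should be by choosing the number of topics. The computer creates a virtual bin for each topic and then starts reading the documents to be analysed. After removing common words that are spread evenly through the original documents, it assigns each remaining word to a bin at random. It then picks pairs of words from a bin and checks whether they co-occur in the original documents more often than chance would predict. If they do, the association is kept. If not, the words (along with any others already tied to them) are dropped at random into another bin. Repeating this process makes networks of linked words emerge. Repeat it often enough and each network ends up corresponding to a single bin.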
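The bin procedure described above can be sketched roughly as follows. This follows the article's simplified account only and should not be read as the software Dr Blei and Dr Lafferty actually wrote. The function name `sort_into_bins`, the `common_share` threshold for "common" words, the chance test (observed joint document count against the product of document frequencies divided by the number of documents), and the sample corpus are all assumptions made for illustration.

```python
import random


def sort_into_bins(documents, n_bins, rounds=2000, common_share=0.8, seed=0):
    """Toy version of the article's description: random bins, keep pairs that
    co-occur more than chance, otherwise move a word and its tied group."""
    if n_bins < 2:
        raise ValueError("need at least two bins to move words between")
    rng = random.Random(seed)
    docs = [set(text.lower().split()) for text in documents]
    n_docs = len(docs)
    vocab = set().union(*docs)
    # Document frequency: how many documents contain each word.
    df = {w: sum(w in d for d in docs) for w in vocab}
    # Drop words spread across most documents (threshold is an assumption).
    words = sorted(w for w in vocab if df[w] / n_docs < common_share)
    bin_of = {w: rng.randrange(n_bins) for w in words}
    tied = {w: {w} for w in words}  # group of words each word is tied to

    def more_than_chance(a, b):
        together = sum(a in d and b in d for d in docs)
        expected = df[a] * df[b] / n_docs  # expectation if independent
        return together > expected

    for _ in range(rounds):
        b = rng.randrange(n_bins)
        members = [w for w in words if bin_of[w] == b]
        if len(members) < 2:
            continue
        first, second = rng.sample(members, 2)
        if tied[first] is tied[second]:
            continue  # already linked to each other
        if more_than_chance(first, second):
            merged = tied[first] | tied[second]  # preserve the association
            for w in merged:
                tied[w] = merged
        else:
            mover = rng.choice([first, second])
            target = rng.choice([x for x in range(n_bins) if x != b])
            for w in tied[mover]:  # tied words travel together
                bin_of[w] = target

    bins = [[] for _ in range(n_bins)]
    for w in words:
        bins[bin_of[w]].append(w)
    return bins


# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "the orbit of jupiter crosses the dust",
    "the dust and gas near mars",
    "the orbit of mars and jupiter",
    "the computer design uses the access method",
    "the computer method and the design",
]
bins = sort_into_bins(corpus, n_bins=2)
for i, members in enumerate(bins):
    print(i, members)
```

Nothing in this sketch guarantees a clean separation of topics on so small a corpus. It only shows the mechanics of the random assignment, the chance test, and the joint moves.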

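And it works. When Dr Blei and Dr Lafferty had their software find 50 topics in papers published in Science between 1980 and 2002, the words it grouped together were immediately recognisable as related. One topic contained “orbit”, “dust”, “Jupiter”, “line”, “system”, “solar”, “gas”, “atmospheric”, “Mars” and “field”. Another contained “computer”, “methods”, “number”, “two”, “principle”, “design”, “access” and “processing”.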

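All of this is interesting as a way of coping with information overload and of tagging papers so that they can be searched more usefully. But Dr Blei wondered whether his method could also yield genuinely new insights into the scientific method itself, and he thinks it can. Working with Sean Gerrish, a doctoral student at Princeton, he has produced a version of the software that not only reads text for topics but also tracks how those topics evolve, by looking at how the patterns in each topic bin change from year to year.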

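The new version can trace a topic over time. A 1903 paper with the evocative title “The Brain of Professor Laborde”, for example, was correctly placed in the same topic bin as “Reshaping the Cortical Motor Map by Unmasking Latent Intracortical Connections”, published in 1991. This makes it possible to trace important shifts in terminology back to their origins, which offers a way of identifying truly ground-breaking work: work that introduces new concepts, or combines old ones in novel and useful ways that later texts pick up and repeat. A paper's impact can thus be gauged by how large a shift it produces in the structure of the relevant topic.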
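One way to picture "how large a shift" is to compare the word frequencies of a topic bin before and after a given year. The sketch below does this with total variation distance. That choice of measure, the function names `word_distribution` and `topic_shift`, and the sample word lists are assumptions made for illustration; the article does not say how the Blei-Gerrish software quantifies the shift.

```python
from collections import Counter


def word_distribution(texts):
    """Relative frequency of each word across a list of texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def topic_shift(texts_before, texts_after):
    """Total variation distance between two word distributions:
    0.0 means identical vocabulary use, 1.0 means no overlap at all."""
    p = word_distribution(texts_before)
    q = word_distribution(texts_after)
    vocab = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in vocab)


# Hypothetical contents of one topic bin in two periods.
earlier = ["brain anatomy brain weight"]
later = ["cortex motor map cortex connections"]
print(topic_shift(earlier, earlier))
print(topic_shift(earlier, later))
```

Under this assumed measure, a paper after which the bin's vocabulary changes sharply would score close to 1, while one that leaves the vocabulary untouched would score 0.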

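In effect, Dr Blei and Mr Gerrish have devised an alternative to the citation indices beloved of scientific publishers. Those indices reflect how often a particular publication or author is cited as a source by others. A high score is treated as a proxy for high impact, but a proxy is all it is.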

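Dr Blei and Mr Gerrish do not claim that their method is necessarily a better proxy. But it can cast its net more widely, depending on the set of documents fed into it at the start. Citation indices work only where publications refer to their sources explicitly, and such publications form a tiny nebula in the digital universe. News articles, blog posts and e-mails often lack a systematic reference list from which a citation index could be built, yet they too are part of what makes an idea influential.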

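Besides, for all academia's pretensions to objectivity, it is as subject to political considerations as any other field of human endeavour. Many authors cite colleagues, bosses and mentors out of courtesy or supplication rather than because the citations are strictly needed. More rarely, an author may undercite: Albert Einstein's original paper on special relativity had no references at all, even though it drew heavily on earlier work. The upshot is that the Blei-Gerrish method may come closer to the real ebb and flow of scientific ideas and so, in its way, offer a more scientific approach to science.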

from the print edition | Science and Technology

http://www.economist.com/node/18618025?story_id=18618025

Translation provided by 胖白兔.

Latest comments

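BTnuts 2011-5-5 16:47 (last edited by BTnuts on 2011-5-5 16:56):

For example, “Big Bang” and “black hole” often will co-occur, but not as often as each does with “galaxy”. Suggested translation: “Big Bang” and “black hole” often appear together, but each of the two is even more likely to appear together with “galaxy”.

This captures the intuition that the first three terms, but not the fourth, are part of a single topic. Suggested translation: This gives us the sense that the first three terms, but not the fourth, belong to one shared topic.

Of course, much depends on how narrow you want a topic to be. Suggested translation: Of course, this depends largely on how precise you want your topic to be.

I'm new here, comments welcome~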

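天各一方 2011-5-5 21:48: He starts with defining topics as sets of words that tend to crop up in the same document.
Posted translation: “First, he defines the topics that often appear in the same document as several sets of phrases.”
You read it as defining the topics in terms of phrases.
I think it reads more smoothly the other way round.
He starts with defining topics as sets of words that tend to crop up in the same document.
Suggested translation: “To begin with, he uses sets of words that often appear in the same document to define topics.”
~ Just my two cents, offered in the spirit of discussion. Apologies if I have got anything wrong ~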

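aubreychen 2011-5-5 21:51: COMPUTER scientists have long tried to foist order on the explosion of data that is the internet.
Posted translation: “Computer scientists have long been working to set rigid rules for the exploding data on the internet.”
--------------------------------------------------------------------------------------------------------
I think this sentence should be:
“Computer scientists' long-standing effort to bring more order to the explosion of information is what gave rise to the internet.”
The “that” in this sentence refers back to the whole of the preceding “COMPUTER scientists have long tried to foist order on the explosion of data”.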

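astrolanguage 2011-5-7 13:17: Reply to aubreychen

I think the original poster's reading is the more reasonable one.
On your reading, “COMPUTER scientists have long tried to foist order on the explosion of data” would have to be a subject clause, in which case “that” should come at the start of the sentence: “That COMPUTER scientists have long tried to foist order on the explosion of data is the Internet.”
Since the main clause of a sentence cannot have two predicates, the “that” after “data” here introduces a relative clause, saying that the data in question is what we call the Internet.
So I think the original poster's translation is the more reasonable one.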

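aubreychen 2011-5-7 13:45: Reply to astrolanguage

Right. I originally thought “that” referred to the whole preceding sentence. Looking again, it refers to “explosion of data”: this explosion of data is the internet. It is not talking about the data on the internet. Rather, the explosion of data itself is the internet. It is indeed a relative clause, but it modifies “explosion” rather than “data”.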

astrolanguage 2011-5-7 14:09: Reply to aubreychen

Yes, that's right~~

胖白兔 2011-5-7 15:16: What a lively discussion. I am sure it helps everyone's English.
Truth becomes clearer through debate.

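胖白兔 2011-5-7 15:32 (last edited by 胖白兔 on 2011-5-7 15:33):

Reply to BTnuts

For example, “Big Bang” and “black hole” often will co-occur, but not as often as each does with “galaxy”. BTnuts's translation: “Big Bang” and “black hole” often appear together, but each of the two is even more likely to appear together with “galaxy”.
My view: “not as often as” means “not often” and carries a negative sense, while “each” refers to “Big Bang” or “black hole”.

This captures the intuition that the first three terms, but not the fourth, are part of a single topic. BTnuts's translation: This gives us the sense that the first three terms, but not the fourth, belong to one shared topic.
My view: “captures the intuition” is a single unit meaning intuitive perception. Clearly “are part of a single topic” modifies “captures”, and “are” is plural. “This”, being singular, refers to the act of capturing.

Of course, much depends on how narrow you want a topic to be. BTnuts's translation: Of course, this depends largely on how precise you want your topic to be.
I left “how narrow” untranslated here. Rendering it as “precise” is more apt.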

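BTnuts 2011-5-7 16:29: For example, “Big Bang” and “black hole” often will co-occur, but not as often as each does with “galaxy”. My translation: “Big Bang” and “black hole” often appear together, but each of the two is even more likely to appear together with “galaxy”.
胖白兔 wrote: “not as often as” means “not often” and carries a negative sense, while “each” refers to “Big Bang” or “black hole”.
Exactly: “not as often as” means “less frequently than ...”, which is why each of the two words is more likely to appear together with “galaxy”.
Literal translation: “Big Bang” and “black hole” often appear together, but less frequently than each of them appears together with “galaxy”. That reads awkwardly, so I changed it to “each of the two is even more likely to appear together with ‘galaxy’”.

This captures the intuition that the first three terms, but not the fourth, are part of a single topic. My translation: This gives us the sense that the first three terms, but not the fourth, belong to one shared topic.
胖白兔 wrote: “captures the intuition” is a single unit meaning intuitive perception. Clearly “are part of a single topic” modifies “captures”, and “are” is plural. “This”, being singular, refers to the act of capturing.
“captures the intuition” is a unit, and the “that” after it introduces an appositive clause explaining what the intuition is. The subject of “are” is “the first three terms”. “This” refers to the phenomenon described in the preceding sentences: For example, “Big Bang” and “black hole” often will co-occur, but not as often as each does with “galaxy”. Neither, however, would be expected to pop up next to “genome”.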

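Jackyang 2011-5-11 21:54: He starts with defining topics as sets of words that tend to crop up in the same document. For example, “Big Bang” and “black hole” often will co-occur, but not as often as each does with “galaxy”. Neither, however, would be expected to pop up next to “genome”. This captures the intuition that the first three terms, but not the fourth, are part of a single topic. Of course, much depends on how narrow you want a topic to be. But Dr Blei’s model, which he developed with John Lafferty, of Carnegie Mellon University, allows for that.

Suggested translation: David Blei first defines certain topics using sets of words, words that may suddenly appear in the same document. For example, “Big Bang” and “black hole” often appear together, but each word on its own appears less frequently than “Milky Way”. You would not, however, expect these two words to turn up suddenly after “genome”. This gives rise to an intuition that the first three words belong to the same topic while the fourth, “genome”, does not. Of course, much depends on how broadly you define the scope of a topic. The model built by Dr Blei and John Lafferty of Carnegie Mellon University also takes this into account.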

飞龙在天 2011-5-14 19:55: black hole” often will co-occur, but not as often as each does with “galaxy”

Everyone is debating this so hotly, but the trouble really comes from the grammar: “not as often as” is a variant of “as often as not”, meaning “frequently”.

join_soon 2011-5-17 00:07: Reply to 胖白兔

extremely interesting article, everybody should read it!!

--------------------------------------------
Organising the web
The science of science
How to use the web to understand the way ideas evolve

Apr 28th 2011 | from the print edition

COMPUTER scientists have long tried to foist order on the explosion of data that is the internet. One obvious way is to group information by topic, but tagging it all comprehensively by hand is impossible. David Blei, of Princeton University, has therefore been trying to teach machines to do the job.

He starts with defining topics as sets of words that tend to crop up in the same document. For example, “Big Bang” and “black hole” often will co-occur, but not as often as each does with “galaxy”. Neither, however, would be expected to pop up next to “genome”. This captures the intuition that the first three terms, but not the fourth, are part of a single topic. Of course, much depends on how narrow you want a topic to be. But Dr Blei’s model, which he developed with John Lafferty, of Carnegie Mellon University, allows for that.
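[join_soon's suggested corrections to the posted translation: (1) “defines as topics the sets of words that often appear in the same document”, rather than “defines the topics that often appear in the same document as sets of phrases”; (2) “but each of them co-occurs even more often with ‘galaxy’”, rather than “each of them tends not to co-occur with ‘galaxy’”; (3) “however, they would not appear right next to ‘genome’”; (4) “this matches our intuition: the first three words belong to one topic, and the fourth does not”; (5) “how precise you want a topic to be”, rather than “the scope of topic you expect”.]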

The user decides how fine-grained he wants the analysis to be by picking the number of topics. The computer then creates a virtual bin for each topic and begins to read the documents to be analysed. After removing common words that it finds evenly spread through the original documents, it assigns each of the remaining ones, at random, to a bin. The computer then selects pairs of words in a bin to see if they co-occur more often than they would by chance in the original documents. If so, the association is preserved. If not, the words (together with others to which they have already been tied) are dropped at random into another bin. Repeat this process and networks of linked words will emerge. Repeat it enough and each network will correspond with a single bin.
[join_soon's suggested correction to the posted translation: render “bin” as “box” rather than “storage file”.]

And it works. When Dr Blei and Dr Lafferty asked their software to find 50 topics in papers published in Science between 1980 and 2002, the words it threw up as belonging together were instantly recognisable as being related. One topic included “orbit”, “dust”, “Jupiter”, “line”, “system”, “solar”, “gas”, “atmospheric”, “Mars” and “field”. Another contained “computer”, “methods”, “number”, “two”, “principle”, “design”, “access” and “processing”.

All of which is interesting as a way of dealing with information overload, and tagging papers so that they can be searched in a more useful way. But Dr Blei found himself wondering if his method could yield any truly novel insights into the scientific method. And he thinks it can. In tandem with Sean Gerrish, a doctoral student at Princeton, he has now produced a version that not only peruses text for topics, but also tracks how these topics evolve, by looking at how the patterns in each topic bin change from year to year.
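[join_soon's suggested corrections to the posted translation: (1) “truly novel insights into the scientific method”, rather than “novel insights to be merged into the scientific method”; (2) together with Sean Gerrish he “has developed a new version”, rather than “a certain version of the software developed by” Gerrish.]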

The new version is able to trace a topic over time. For example, a 1903 paper with the evocative title “The Brain of Professor Laborde” was correctly assigned to the same topic bin as “Reshaping the Cortical Motor Map by Unmasking Latent Intracortical Connections”, published in 1991. This allows important shifts in terminology to be tracked down to their origins, which offers a way to identify truly ground-breaking work—the sort of stuff that introduces new concepts, or mixes old ones in novel and useful ways that are picked up and replicated in subsequent texts. So a paper’s impact can be determined by looking at how big a shift it creates in the structure of the relevant topic.
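[join_soon's suggested corrections to the posted translation: (1) “this lets us track important shifts in terminology and find where those shifts began”, rather than “the software allows important shifts in terminology to be made so as to trace the origin of a topic”; (2) “seen in terms of terminology, ground-breaking work introduces new concepts or combines old ones in novel and useful ways, and later texts pick up and replicate that ground-breaking work”, rather than “classifying the material”.]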

In effect, Dr Blei and Mr Gerrish have devised an alternative to the citation indices beloved of scientific publishers. These reflect how often a particular publication or author is cited as a source by others. High scores are treated as a proxy for high impact. But a proxy is all they are.
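[join_soon's suggested corrections to the posted translation: (1) “These” refers to the existing citation indices, not to the alternative; (2) “proxy” should be rendered as “indicator” rather than “synonym”; (3) the last sentence means “but an indicator is only an indicator, not impact itself”.]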

Dr Blei and Mr Gerrish are not claiming their method is necessarily a better proxy. But it can cast its net more widely, depending on the set of documents fed into it at the beginning. Citation indices, which work only where publications refer to their sources explicitly, form a tiny nebula in the digital universe. News articles, blog posts and e-mails often lack a systematic reference list that could be used to make a citation index. Yet they, too, are part of what makes an idea influential.
[join_soon's suggested correction to the posted translation: “it can cover a wider range of material, if we feed it a wide range of documents”, rather than “it can be extended to wider fields by continually adding configured documents at the start”.]

Besides, despite academia’s pretensions to objectivity, it is as subject to political considerations as any area of human endeavour. Many authors cite colleagues, bosses and mentors out of courtesy or supplication rather than because such citations are strictly required. More rarely, an author may undercite. Albert Einstein’s original paper on special relativity, for example, had no references at all, even though it drew heavily on previous work. The upshot is that the Blei-Gerrish method may get closer to the real ebb and flow of scientific ideas and thus, in its way, offer a more scientific approach to science.
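[join_soon's suggested corrections to the posted translation: (1) academia has pretensions to objectivity, yet like any field of human endeavour it is subject to political considerations, rather than “academia's pretentiousness objectively exists”; (2) Einstein's paper “made heavy use of earlier work”, rather than “excerpted heavily from earlier writings”.]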
