TensorFlow : a look back at a maturing ecosystem

TensorFlow was open sourced in November 2015 as the second, public generation of DistBelief, Google’s internal machine learning system. In February 2016, TensorFlow Serving was released. In November 2016, one year after being open sourced, TensorFlow was “the most popular machine learning project on Github”. In February 2017, version 1.0 of TensorFlow was released with an experimental API for Java and the promise of Python API stability. In March 2017, Google ML Engine reached GA and, in May 2017, GPU-enabled machines became available worldwide. At the same time, the second generation of TPUs (renamed Cloud TPUs) became available to researchers at no cost through the TensorFlow Research Cloud, and an alpha program was started for everybody else.

Progress has been fast. And even if more can be expected (org-scale TensorBoard, Java API stability, general availability of TPUs, etc.), TensorFlow has become, step by step, an attractive solution to consider.

Here are a few notes and videos for those who want to catch up.

TensorFlow : an optimiser of numerical computation

Machine learning, and even more so deep learning, requires lots of numerical computation. As a consequence, practitioners know the importance of efficient matrix computation libraries such as LAPACK, ATLAS or any other BLAS-related library. TensorFlow was historically an internal Google project focusing on this recurring problem: how to give end users the means to concisely formulate their numerical computations while at the same time offering great performance?

The data flow graphs define the interface between the end users and the core of TensorFlow. In a sense, they are the equivalent of an abstract syntax tree between a developer and the compiler of the programming language.
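The idea of a data flow graph can be sketched in a few lines of plain Java. This is only a toy illustration, not TensorFlow's API: the names `ToyGraph`, `Node`, `constant`, `add` and `mul` are hypothetical. The point it tries to show is the separation between describing a computation and running it.

```java
/** Toy data flow graph. Illustrative only: this is not how TensorFlow is implemented. */
final class ToyGraph {
    /** A node describes a computation; nothing is computed until eval() is called. */
    interface Node {
        double eval();
    }

    static Node constant(double value) {
        return () -> value;
    }

    static Node add(Node left, Node right) {
        return () -> left.eval() + right.eval();
    }

    static Node mul(Node left, Node right) {
        return () -> left.eval() * right.eval();
    }

    public static void main(String[] args) {
        // Building the graph performs no arithmetic yet.
        Node graph = add(mul(constant(2), constant(3)), constant(4));
        // The computation only happens when the graph is evaluated.
        System.out.println(graph.eval());
    }
}
```

Because the graph exists as data before it runs, an engine is free to inspect it, optimise it or place parts of it on different hardware, much like a compiler works on a syntax tree.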

Defining a good interface, or API, is always tricky. As much as possible, technical details regarding the implementation shouldn’t be exposed to the end user. But sometimes, those details do matter for the end user so it is not possible to hide them. It is especially true for performance. Leakiness of the API is one important aspect when selecting an interface. But the second, and more important, aspect is whether the created abstractions are at the right level.

TensorFlow is fundamentally a library for numerical computation. It is neither a machine learning nor a deep learning library. The boundary is murky, however, and will probably stay that way. TensorFlow does contain support for machine learning and deep learning, but it can be used for any kind of numerical computation, whatever the purpose. It is also possible to do deep learning with TensorFlow without directly manipulating its API: this is where Keras comes in.

Keras : simple but deep learning

Keras defines itself as the Python Deep Learning Library. One of its core guiding principles is user friendliness.

At a high level, deep learning with Keras can be seen as building a Lego house with Lego bricks. The types of the bricks, their number and how they fit together are the responsibility of the builder, but lots of details are abstracted away in order to create reusable bricks.

Creating a friendly deep learning library is the challenge addressed by Keras, and rather well. So well, in fact, that its TensorFlow implementation has been selected to be directly embedded into the TensorFlow library. It doesn’t help with the definition of what TensorFlow is, but it is great news for both Keras and TensorFlow users.

Valerio Maggio presented Keras during PyData London 2017. Prior knowledge of deep learning is assumed.

TensorBoard : Machine Learning with humans

With all the buzzwords around deep learning, machine learning and AI, it’s easy to lose track. How far away the creation of a strong AI is remains an open question, but it is certain that today humans still play a critical role in machine learning.

In almost every machine learning project, visualizing metrics such as the loss or the fit of the model helps to check the viability of a model and to find the cause of errors. While building such a graphical interface in an ad hoc fashion is not impossible, it is still a big loss of time when the real goal is to understand the behavior of the learning in a specific context. The good news? TensorFlow runs can be analyzed thanks to TensorBoard.

Dandelion Mané presented TensorBoard during TensorFlow Dev Summit 2017.

TensorBoard is still a work in progress, but the future looks promising. The roadmap includes an org-scale TensorBoard, which would allow multiple users to share their results and keep a history.

Google ML Engine : Fast exploration of hyperparameters

Not all companies require distributed learning. That’s a fact. But on the technical side, there is one step in almost any machine learning project that can require massive computation power: hyperparameter tuning. The good news is that it does not need to be performed regularly and that it is embarrassingly parallel and, as a consequence, relatively trivial to distribute.

Google ML Engine can be seen as a cloud version of TensorFlow. Arguably, the whole infrastructure of any application could be hosted by Google Cloud and its ML Engine. In practice, how much the infrastructure should depend on an external service is a question that needs to be answered in light of the constraints of the context.

That being said, Google ML Engine is very well positioned for hyperparameter tuning. Assume that 1 hour is required to train a model for a specific configuration of hyperparameters. Assume additionally that there are two hyperparameters with 10 values each that should be explored. A naive full exploration would require (10 x 10 x 1 =) 100 hours. But it could be performed by Google ML Engine in 1 hour with 100 times the hardware used for the initial training. It is doubtful that many companies can handle such a spike in hardware demand. Without Google ML Engine, a company would have a far longer feedback loop.
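The arithmetic above can be written down as a small Java sketch. The class and method names (`GridSearchCost`, `configurations`, `wallClockHours`) are hypothetical, and the model assumes that every run takes the same time and that runs are spread evenly over the available machines, which is a simplification.

```java
/** Back-of-the-envelope cost of a full grid search. Names are illustrative only. */
final class GridSearchCost {
    /** Number of configurations in a full grid: the product of the value counts. */
    static long configurations(int... valuesPerHyperparameter) {
        long total = 1;
        for (int count : valuesPerHyperparameter) {
            total *= count;
        }
        return total;
    }

    /** Wall-clock hours when the runs are spread evenly over the given number of workers. */
    static long wallClockHours(long configurations, long hoursPerRun, long workers) {
        long rounds = (configurations + workers - 1) / workers; // ceiling division
        return rounds * hoursPerRun;
    }

    public static void main(String[] args) {
        long configs = configurations(10, 10);
        System.out.println(wallClockHours(configs, 1, 1));   // one machine: 100 hours
        System.out.println(wallClockHours(configs, 1, 100)); // 100 machines: 1 hour
    }
}
```

The sketch also makes the intermediate cases visible: with 30 machines the same grid needs 4 rounds, so 4 hours.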

TensorFlow Serving : production is easy

Once upon a time, a businessman asked a data scientist colleague if it would be possible to predict X. After a few exchanges on the definition of the subject, the data scientist was able to frame the problem as a relatively classic machine learning problem, to retrieve the data and to select a model for which the conclusion was: yes, predictions could be made with enough quality to help the business. The reaction was “Great! Now, it should be integrated with my application. I need to be able to ask for predictions in almost real time.” The data scientist said “No problem. I will send my results to another team. They will inform you about what can be done.”

This is indeed a stereotypical story. The critical point is that building a predictive model does not stop at validating that the available data are sufficient to compute good enough predictions. In the end, the objective is often the integration of the model in production. A few use cases are easy in the sense that only cold predictions are required: they can be computed every night, for example. However, more and more projects require hot data (such as a user’s browsing data) and, as a consequence, hot predictions. In that case, the road to production, if not prepared beforehand, is not easy.

Can the technology used for the model be installed in production? Is there any equivalent tech? Is a complete rewrite required? How do other parts of the application ask for the prediction? Will it support the load? Should a new web service be implemented and deployed? How can high availability be guaranteed? What happens if the model needs to be fixed? Is it possible to perform a hot swap? If there is a model update, is it possible to A/B test it against the old one in production before a complete swap? And so on…

TensorFlow Serving is there to answer these questions. It is not necessary to reinvent the wheel in order to drive a TensorFlow project to production.

Noah Fiedel presented TensorFlow Serving during Google I/O 2017.

TensorFlow in production can be easy even on premise when Google ML Engine is not an option.

TensorFlow : on mobile

Machine learning implies data. As a consequence, that thought often leads to big data and data centers. But with the prevalence of smartphones and the Internet of Things (Raspberry Pi!), isn’t there another way? If a mobile application needs to detect specific objects in its video, does the full video need to be streamed to the data center in order to get an answer? Wouldn’t that result in horrible latency and require a fast and reliable internet connection?

Even though training on mobile is not a solved problem, doing prediction on a mobile device is a reality and can lead to novel kinds of applications. A typical example is real-time translation of text, sound or video, even in airplane mode.

Yufeng Guo presented how machine learning on mobile is possible right now during Google Cloud Next 2017.

Deep learning offers one advantage for fast training: to adapt to a specific domain, a generic model can be fine-tuned without a full training run.

Unit testing should be part of any project. Hard to disagree. However, many complementary tools exist for various purposes. This is the case for assertions, which make it possible to verify the expected behavior of the system. Different tools have different strengths and weaknesses. Starting in the Java world with JUnit, Hamcrest, FEST-Assert and AssertJ, this high-level review will end in the Scala world with ScalaTest and Specs2.

JUnit Assert, the first stop

JUnit is the most widely used unit testing library for Java. Of course, it comes with its own solution for assertions.

The assertions are static methods following the same signature pattern assertXXX([optionalMessage],expected,actual).

static void assertEquals(java.lang.Object expected,
                         java.lang.Object actual)

// or the same with a message
static void assertEquals(java.lang.String message,
                         java.lang.Object expected,
                         java.lang.Object actual)

A typical example would be :

assertEquals(3, 4);         // fails: expected 3 but was 4
assertEquals("eggs", 3, 4); // same failure, with the message "eggs" in front

That’s a good start. Most people will stop here. Indeed, why bother searching for an alternative if the default solution is sufficient? But let’s continue.

Hamcrest, to the rescue?

One issue with JUnit’s standard assertions is their stiffness. They work perfectly for their exact use case, but as soon as a more complex test needs to be written, it is the responsibility of the developer to write the additional boilerplate code. Hamcrest is a set of matchers created to solve this issue: assertions can now be composed. JUnit is nowadays bundled with the core of Hamcrest.

The basics are even simpler: there is a single assert,

static <T> void assertThat(T actual, org.hamcrest.Matcher<T> matcher)

but with powerful flexibility.

assertThat("Hello", is(not(anyOf(
                                nullValue(),
                                instanceOf(Integer.class),
                                equalTo("Goodbye")))));

Of course, this example is contrived. But what would the equivalent look like with JUnit Assert? More complicated, yes. For those interested, Ed Gibbs wrote a good walkthrough on Hamcrest.
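To make the idea of composition concrete, here is a minimal sketch that imitates matchers with plain `java.util.function.Predicate`. It is not Hamcrest itself: `ToyMatchers` and its methods are hypothetical stand-ins that only borrow the names of the real matchers, and the failure message is far poorer than what Hamcrest produces.

```java
import java.util.Objects;
import java.util.function.Predicate;

/** Toy matchers built on Predicate. Illustrative only: this is not the Hamcrest API. */
final class ToyMatchers {
    static Predicate<Object> nullValue() {
        return Objects::isNull;
    }

    static Predicate<Object> instanceOf(Class<?> type) {
        return type::isInstance;
    }

    static Predicate<Object> equalTo(Object expected) {
        return actual -> Objects.equals(expected, actual);
    }

    static Predicate<Object> not(Predicate<Object> matcher) {
        return matcher.negate();
    }

    @SafeVarargs
    static Predicate<Object> anyOf(Predicate<Object>... matchers) {
        return actual -> {
            for (Predicate<Object> matcher : matchers) {
                if (matcher.test(actual)) {
                    return true;
                }
            }
            return false;
        };
    }

    /** The single entry point: every check is expressed as a matcher. */
    static void assertThat(Object actual, Predicate<Object> matcher) {
        if (!matcher.test(actual)) {
            throw new AssertionError("no match for: " + actual);
        }
    }

    public static void main(String[] args) {
        assertThat("Hello", not(anyOf(nullValue(), instanceOf(Integer.class), equalTo("Goodbye"))));
        System.out.println("assertion passed");
    }
}
```

Each small matcher knows one thing, and combinators such as `not` and `anyOf` build bigger checks out of them, which is the part plain `assertEquals` calls cannot offer.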

FEST-Assert, the late solution

People may argue that Hamcrest matchers help readability. It may be true with regard to JUnit but not in general. One issue with Hamcrest is that efficient nesting of matchers requires a good knowledge of existing matchers. The Assertions from the FEST project are an alternative solution alleviating this issue by casting the Java type system as a tutor for the developer.

List<Employee> newEmployees = employees.hired(TODAY);
assertThat(newEmployees).hasSize(6).contains(frodo, sam);

Thanks to the Java type system and IDE autocompletion, the relevant assertions will be shown during the chaining. Learning becomes easier and so does reading. Indeed, the hasSize will only be available for Collection and the chaining removes all the parentheses noise that would have been created by Hamcrest.
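How the type system guides the developer can be sketched without the library. In the following plain Java example, `ToyFluent`, `CollectionAssert` and `StringAssert` are hypothetical names and not the FEST API: the overloaded `assertThat` picks an assertion type from the static type of its argument, so only relevant checks are offered.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

/** Toy fluent assertions. Illustrative only: this is not the FEST-Assert API. */
final class ToyFluent {
    /** Checks that only make sense for collections live on this type. */
    static final class CollectionAssert<E> {
        private final Collection<E> actual;

        CollectionAssert(Collection<E> actual) {
            this.actual = actual;
        }

        CollectionAssert<E> hasSize(int expected) {
            if (actual.size() != expected) {
                throw new AssertionError("expected size " + expected + " but was " + actual.size());
            }
            return this; // returning this is what enables chaining
        }

        @SafeVarargs
        final CollectionAssert<E> contains(E... values) {
            for (E value : values) {
                if (!actual.contains(value)) {
                    throw new AssertionError("missing element: " + value);
                }
            }
            return this;
        }
    }

    /** Checks for strings: note that there is no hasSize here. */
    static final class StringAssert {
        private final String actual;

        StringAssert(String actual) {
            this.actual = actual;
        }

        StringAssert startsWith(String prefix) {
            if (!actual.startsWith(prefix)) {
                throw new AssertionError("<" + actual + "> does not start with <" + prefix + ">");
            }
            return this;
        }
    }

    static <E> CollectionAssert<E> assertThat(Collection<E> actual) {
        return new CollectionAssert<>(actual);
    }

    static StringAssert assertThat(String actual) {
        return new StringAssert(actual);
    }

    public static void main(String[] args) {
        List<String> newEmployees = Arrays.asList("frodo", "sam", "merry");
        assertThat(newEmployees).hasSize(3).contains("frodo", "sam");
        assertThat("frodo").startsWith("fro");
        // assertThat("frodo").hasSize(5); // would not compile: StringAssert has no hasSize
        System.out.println("all assertions passed");
    }
}
```

In this sketch the compiler, and therefore IDE autocompletion, only proposes `hasSize` after `assertThat(newEmployees)`, never after `assertThat("frodo")`.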

Sadly, the project died during its attempt to improve itself with a second version. The latest stable release is from 2011.

AssertJ, the new old

FEST-Assert is dead, but it was forked and lives on under a new name: AssertJ. The principle has not changed, but additional features have appeared. Among those, the soft assertions are worth checking out! In short, even if dozens of assertions are written sequentially, on failure the error message will contain all the differences and not only the first one.

org.assertj.core.api.SoftAssertionError:
     The following 4 assertions failed:
     1) [Living Guests] expected:<[7]> but was:<[6]>
     2) [Library] expected:<'[clean]'> but was:<'[messy]'>
     3) [Candlestick] expected:<'[pristine]'> but was:<'[bent]'>
     4) [Professor] expected:<'[well kempt]'> but was:<'[bloodied and disheveled]'>
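The mechanism behind such a report is simple to sketch: failures are recorded rather than thrown, and a single error is raised at the end. The class below is a hypothetical toy written in plain Java, not AssertJ's implementation; in AssertJ the real entry point is `SoftAssertions` with its `assertAll()` method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Toy soft assertions. Illustrative only: AssertJ's SoftAssertions is far richer. */
final class ToySoftAssertions {
    private final List<String> failures = new ArrayList<>();

    /** Records a mismatch instead of throwing immediately. */
    ToySoftAssertions check(String description, Object actual, Object expected) {
        if (!Objects.equals(actual, expected)) {
            failures.add("[" + description + "] expected:<" + expected + "> but was:<" + actual + ">");
        }
        return this;
    }

    List<String> failures() {
        return failures;
    }

    /** Throws once, at the end, listing every recorded failure. */
    void assertAll() {
        if (failures.isEmpty()) {
            return;
        }
        StringBuilder message = new StringBuilder("The following " + failures.size() + " assertions failed:");
        for (int i = 0; i < failures.size(); i++) {
            message.append(System.lineSeparator()).append(i + 1).append(") ").append(failures.get(i));
        }
        throw new AssertionError(message.toString());
    }

    public static void main(String[] args) {
        ToySoftAssertions softly = new ToySoftAssertions();
        softly.check("Living Guests", 6, 7)
              .check("Library", "messy", "clean")
              .check("Kitchen", "clean", "clean");
        try {
            softly.assertAll();
        } catch (AssertionError e) {
            System.out.println(e.getMessage()); // lists both failures, not only the first
        }
    }
}
```

The trade-off is that later checks still run after an earlier one has failed, so they should not depend on each other.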

For new Java projects, I would definitely pick AssertJ over JUnit Assert, Hamcrest or FEST-Assert.

ScalaTest, for Scala

ScalaTest is one of the most widely used test frameworks for Scala. It “does not try to impose a testing philosophy on you” and indeed supports many ways of expressing a test. Of course, it also includes its own Matchers.

Here are a few examples.

result shouldBe 3  
result should have length 3
result should have size 10

Thanks to a different language, the readability is improved again, even though a few technical artefacts are still left (e.g. shouldBe). The syntactic sugar allows parentheses, dots and semicolons to be dropped. Here are the same examples, sugar free.

result.shouldBe(3);
result.should(have(length(3)));
result.should(have(size(10)));

The principle is quite close to how Hamcrest works. The main difference is that the root assertion (should) is applied to the observed data instead of being a static method. This relies on another Scala feature, i.e. implicit conversions: the result is transparently converted to another type owning the should or shouldBe methods. It’s the so-called ‘pimp my library’ pattern.

AssertJ’s fast learning curve is an asset. Should AssertJ be used for a project moving from Java to Scala? Well, its lack of native support for Scala types is not in its favor. Still, ScalaTest assertions do not seem to be the most attractive part of Scala for a well-advised Java developer.

Specs2?

Specs2, an alternative to ScalaTest, also has its own Matchers. But the conclusion is similar. Easier to read? Yes. Easier to learn? Well, maybe not during the first day of Scala.

And you?

Are you a Java developer switching to Scala? Are you a Scala developer coaching Java developers? Have you been on a project migrating from Java to Scala? What are your assertions about assertions?

The context: a Scala project has been started but, for various reasons, Maven was chosen instead of sbt. At first, it does seem an odd choice. Why pick such a common Java tool and not the default Scala build tool? In practice, it can make sense for a Java-centric shop making a first step towards the Scala world. The next question is then: how can the quality of the Scala code be checked while still using Maven?

Scalastyle, the Scala Checkstyle

Checkstyle is well known in the Java world. Scalastyle is similar, but for Scala. Both tools help teams converge on a shared coding standard with automated control.

Scalastyle should be the first stop, as it provides an integration with Maven but also with common IDEs: IntelliJ and Eclipse. Scala IDE, being based on Eclipse, is supported as a consequence.

The configuration for Maven is composed of 4 steps.

1) Update your pom.xml file.

<build>
  <plugins> 
    ...
    <plugin>
      <groupId>org.scalastyle</groupId>
      <artifactId>scalastyle-maven-plugin</artifactId>
      <version>0.8.0</version>
      <configuration>
        <failOnViolation>true</failOnViolation>
        <failOnWarning>false</failOnWarning>
        <verbose>false</verbose>
        <includeTestSourceDirectory>true</includeTestSourceDirectory>
        <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>
        <testSourceDirectory>${basedir}/src/test/scala</testSourceDirectory>
        <configLocation>${basedir}/src/test/resources/scalastyle_config.xml</configLocation>
        <outputFile>${project.basedir}/target/scalastyle-output.xml</outputFile>
        <outputEncoding>UTF-8</outputEncoding>
      </configuration>
      <executions>
        <execution>
          <goals>
            <goal>check</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
    ...
  </plugins>
</build>

2) Download scalastyle_config.xml to ${basedir}/src/test/resources/scalastyle_config.xml.

The rule org.scalastyle.file.HeaderMatchesChecker should probably be changed or disabled (enabled="false") for a project that is not open source. The other rules can be kept unchanged for a first run.

<check class="org.scalastyle.file.HeaderMatchesChecker" level="warning" enabled="false">
  <parameters>
    <parameter name="header"><![CDATA[// Copyright (C) 2011-2012 the original author or authors.
// See the LICENCE.txt file distributed with this work for additional
// information regarding copyright ownership.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.]]></parameter>
  </parameters>
</check>

3) Verify the integration with the Maven build lifecycle.

mvn verify

At the end of the build, after the tests, the Scalastyle output should be visible.

warning file=/xxx/xxx.scala message=ScalastyleWarningMessage
Saving to outputFile=/xxx/target/scalastyle-output.xml
Processed 1 file(s)
Found 0 errors
Found 1 warnings
Found 0 infos
Finished in xxxx ms
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: xx:xx min
[INFO] Finished at: xxxx-xx-xxTxx:xx:xx+xx:xx
[INFO] Final Memory: xxM/xxxM
[INFO] ------------------------------------------------------------------------

4) Run in isolation, without all the previous Maven build lifecycle phases.

mvn scalastyle:check

The end result is similar, but quicker.

WartRemover

WartRemover is “a flexible Scala code linting tool”. It is mainly used as an sbt plugin but, luckily, it can still be used with Maven.

1) Configure Maven to automatically copy the dependency by using the maven-dependency-plugin during the validate phase.

<build>
  <plugins>
    ...
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-dependency-plugin</artifactId>
      <executions>
        <execution>
          <id>scala_plugins</id>
          <phase>validate</phase>
          <goals>
            <goal>copy</goal>
          </goals>
          <configuration>
            <outputDirectory>${project.basedir}/target/scala_plugins/</outputDirectory>
            <artifactItems>
              <artifactItem>
                <groupId>org.brianmckenna</groupId>
                <artifactId>wartremover_2.10</artifactId>
                <version>0.14</version>
              </artifactItem>
            </artifactItems>
          </configuration>
        </execution>
      </executions>
    </plugin>
    ...
  </plugins>
</build>

2) Verify the configuration by running Maven.

mvn validate

The downloaded plugin (jar) should be visible from the logs.

[INFO] --- maven-dependency-plugin:2.9:copy (scala_plugins) @ xxx ---
[INFO] Configured Artifact: org.brianmckenna:wartremover_2.10:0.14:jar
[INFO] Copying wartremover_2.10-0.14.jar to /xxx/target/scala_plugins/wartremover_2.10-0.14.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Total time: xx:xx min
[INFO] Finished at: xxxx-xx-xxTxx:xx:xx+xx:xx
[INFO] Final Memory: xxM/xxxM
[INFO] ------------------------------------------------------------------------

3) Configure the compiler with this plugin.

<build>
  <plugins>
    ...
    <plugin>
      <!-- see http://davidb.github.com/scala-maven-plugin -->
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>3.2.1</version>
      <executions>
        <execution>
          <goals>
            <goal>compile</goal>
            <goal>testCompile</goal>
          </goals>
          <configuration>
            <args>
              <arg>-Xplugin:${project.basedir}/target/scala_plugins/wartremover_2.10-0.14.jar</arg>
              <arg>-P:wartremover:only-warn-traverser:org.brianmckenna.wartremover.warts.Unsafe</arg>
            </args>
          </configuration>
        </execution>
      </executions>
    </plugin>
    ...
  </plugins>
</build>

For more information about how to configure which rules should be activated, and whether as errors or warnings, the WartRemover GitHub repository should be consulted. Here, all checks from the Unsafe wart are reported as warnings. The build will not fail when the plugin is added, even if the code is warty.

4) Compile the project.

mvn compile

And if the project is not pristine, a few warts should pop up.

[WARNING] /xxx/xxx.scala:xx: warning: Option#get is disabled - use Option#fold instead
[WARNING]     result.get
[WARNING]            ^
[WARNING] /xxx/xxx.scala:xx: warning: var is disabled
[WARNING]     var result: Option[T] = None
[WARNING]         ^
[WARNING] warning: there were x deprecation warning(s); re-run with -deprecation for details
[WARNING] x warnings found
[INFO] prepare-compile in xx s
[INFO] compile in xx s
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Total time: xx:xx min
[INFO] Finished at: xxxx-xx-xxTxx:xx:xx+xx:xx
[INFO] Final Memory: xxM/xxxM
[INFO] ------------------------------------------------------------------------

Linter, Scapegoat, others?

Linter and Scapegoat are two other well-known tools in the Scala community. I am open to any feedback about their usage with Scala 2.10 and Maven.

A year has passed. A new year is beginning. And a new resolution is born: starting a personal blog in 2016.

It will start with a disclaimer, which should be familiar to many people, even though the content is clearly not under GPL.

    This blog is written in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Of course,

    The views expressed on this site are my own and
    do not reflect those of my employer or its clients.

If you revisit this blog later, you might see new posts around data science, computer science and software engineering.

Programmez article: “Developing a Map/Reduce job for Hadoop”

Hadoop Map/Reduce is a distributed computing framework inspired by the functional paradigm. In this article, we will first look at the theory, i.e. what this paradigm is, and then at the practice, by writing a complete job for Hadoop…