Since I wrote extensively about the user journey and narrative last week, I wanted to review some of the technical work I’ve been doing this week as I’ve attempted to get my deep learning framework (a tool for generating stories from images) up and running.

I started by following the installation & compilation steps outlined here for neural storyteller. The process makes use of skip-thought vectors, word embeddings, conditional neural language models, and style transfer. First, I installed the dependencies, including NumPy, SciPy, Lasagne, and Theano, along with everything they require. Once I finish setting up the framework, I’ll be able to do the following:

  • Train a recurrent neural network (RNN) decoder on a genre of text (in this case, mystery novels). Each passage from the novel is mapped to a skip-thought vector. The RNN is conditioned on the skip-thought vector and aims to generate the story that it has encoded.
  • While that’s happening, train a visual-semantic embedding between Microsoft’s COCO images and captions. In this model, captions and images are mapped to a common vector space. After training, I can embed new images and retrieve captions.
  • After getting the models & the vectors, I’ll need a function that maps the image caption vectors to a book passage vector, which I would then feed to the decoder to get the story.
  • The three vectors would be as follows: an image caption x, a “caption style” vector c and a “book style” vector b. The function F would therefore look like this: F(x) = x - c + b. In short, it’s like saying “Let’s keep the idea of the caption, but replace the image caption with a story.” This is essentially the style transfer formula that I will be using in my project. In this scenario, c is obtained from the skip-thought vectors for Microsoft COCO training captions and b is obtained from the skip-thought vectors for mystery novel passages.
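The style-shifting step in the last bullet can be sketched in a few lines of NumPy. This is a minimal illustration, not neural storyteller’s actual code: the function name style_shift and the toy four-dimensional vectors are made up for the example, and it assumes that c and b are each computed as the mean of their skip-thought vectors.

```python
import numpy as np

def style_shift(x, caption_vecs, book_vecs):
    """Return F(x) = x - c + b for a single caption vector x."""
    c = np.mean(caption_vecs, axis=0)  # "caption style" (assumption: the mean vector)
    b = np.mean(book_vecs, axis=0)     # "book style" (assumption: the mean vector)
    return x - c + b

# Toy stand-ins for real skip-thought vectors, which have thousands of dimensions.
caption_vecs = np.array([[1.0, 0.0, 0.0, 0.0],
                         [3.0, 0.0, 0.0, 0.0]])
book_vecs = np.array([[0.0, 2.0, 0.0, 0.0],
                      [0.0, 4.0, 0.0, 0.0]])
x = np.array([2.5, 0.0, 1.0, 0.0])

shifted = style_shift(x, caption_vecs, book_vecs)
print(shifted)
```

The third component of x is left untouched, since neither style vector has anything in that dimension. That is the "keep the idea of the caption" part of the formula.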

So far I’ve successfully set up the frameworks for skip-thought vectors (pre-trained on romance novels) & Microsoft’s COCO vectors. Now, I’m in the middle of installing and compiling Caffe, a deep learning framework that neural storyteller relies on for processing images. I feel like I’ve hit a bit of a wall in the compilation process. I’ve run the following make targets, all of which succeeded:

    make clean
    make all
    make runtest
    make pycaffe

When I try to import caffe, however, I get an error stating that the module caffe doesn’t exist, which suggests that either something went wrong in the build/compilation process or Python can’t find the compiled module. I’ve been troubleshooting the build for over a week now. I’ve met with several teachers, adjuncts, and students to troubleshoot. Today, I finally decided to use Docker to run Caffe inside an Ubuntu container (an approach that came highly recommended by a number of other students). I’m optimistic that Docker will help control some of the Python dependency/version issues I keep running into.
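One quick check for a missing-module error like this is whether Python can locate the module at all. The sketch below is written for Python 3 and is only a diagnostic aid. The helper names are made up, and the CAFFE_ROOT variable and the ~/caffe default are guesses at where the checkout lives. It relies on the fact that make pycaffe places the Python package in the python folder of the Caffe checkout.

```python
import importlib.util
import os
import sys

def can_import(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None

def add_to_path(caffe_root):
    """Put <caffe_root>/python at the front of sys.path if that folder exists."""
    python_dir = os.path.join(os.path.expanduser(caffe_root), "python")
    if os.path.isdir(python_dir) and python_dir not in sys.path:
        sys.path.insert(0, python_dir)
        return True
    return False

# CAFFE_ROOT is a placeholder; ~/caffe is only a guess at the checkout location.
caffe_root = os.environ.get("CAFFE_ROOT", "~/caffe")
if not can_import("caffe"):
    add_to_path(caffe_root)
print("caffe importable:", can_import("caffe"))
```

If this prints False even with the right folder on the path, the problem is more likely in the build itself, or in a mismatch between the Python version used to build and the one used to run.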

When I haven’t been working on the machine learning component of my project, I’ve been building the website (running server-side, using node + express + gulp) where the game will live. I’ll be using this jQuery plugin that mimics the look and feel of a Terminal window.

For my project, I plan to build an interactive, text-based narrative where the text and the plot are generated through machine learning methods. At various stages in the story, the user will be prompted to choose the next step.

The content of the game will be driven by a machine learning tool that takes image files and generates sequential stories from the images.

Here’s the storyboard / user flow for the game:

[Storyboard images: storyboard-gaze-01 through storyboard-gaze-10]

In terms of the technical details, I need to train my own model on a specific genre of literature (horror? detective stories? thrillers? choose-your-own-adventure books?) using the neural storyteller tool. Neural storyteller makes use of several different deep learning frameworks and tools, including skip-thoughts, Caffe, Theano, NumPy, and scikit. Here’s an overview of how the text in the game will be generated:


Here is the tentative schedule for the work:

Week 1: Nov. 2 – 8

  • Get the example encoder/trainer/models up and running (2-3 days).
  • Start training the same program on my own genre of literature (2-3 days).
  • Start building the website where the game will live (2 hrs).

Week 2: Nov. 9 – 15

  • After getting the machine learning framework working as I did with my solitaire app, start thinking about ways to structure the generative stories into the narrative arc (2-3 days).
  • Start building the front end of the game – upload buttons, submit forms. (1 day).

Week 3: Nov. 16 – 29

  • Start establishing the rules of game play & build the decision tree (2 days).
  • Continue building the website and tweaking the narrative (2 days).

Week 4: Nov. 30 – Dec. 7

  • User testing. Keep revising the game. Get feedback.


After spending many hours trying to articulate the perfect project concept that would appropriately communicate the research I’ve done thus far, I stumbled onto an idea that I think gets to the heart of what I’m trying to understand about computer vision. Namely, how might algorithms of the future use visual information to draw conclusions about you? And what are the consequences of ceding our decision-making capabilities to a computer?

Here’s the quick and dirty elevator pitch for the game:

What happens when we let a computer make decisions on our behalf? ALGORITHMIC GAZE is an interactive web-based choose-your-own-adventure game that makes personalized decisions for you based on a neural network trained on a collection of images. The project anticipates and satirizes a world in which we cede decision-making authority to our computers.

I plan to build a low-fidelity game in three.js and WebGL. At the start of the game, the user will upload a handful of pictures and enter information about herself. Then, she will be guided through three different scenarios/scenes, in which there are objects with which she can interact. Each object will prompt a moment of decision: Let me decide or let the computer decide for me.

The program will use the images uploaded by the user to make decisions on behalf of the user. By tapping into a machine learning API, the program will use object recognition, sentiment analysis, facial recognition, and color analysis to make certain conclusions about the user’s preferences. The decisions made on behalf of the user may prompt illogical or surprising outcomes.
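To make the idea concrete, here is a rough Python sketch of how a decision might be made from image tags. Everything in it is hypothetical: the decide function, the tag scores, and the two options are invented for illustration, and a real image API would return its own format.

```python
def decide(tag_scores, options):
    """Pick the option whose keywords best match the tags found in a user's photos.

    tag_scores: dict of tag -> confidence (0 to 1), as an image API might return.
    options: dict of option name -> list of keywords associated with that option.
    """
    def score(keywords):
        # Sum the confidence of every keyword that appears among the tags.
        return sum(tag_scores.get(word, 0.0) for word in keywords)

    return max(options, key=lambda name: score(options[name]))

# Made-up tags for a user's uploaded photos.
tag_scores = {"beach": 0.9, "dog": 0.8, "book": 0.2}
options = {
    "go outside": ["beach", "dog", "park"],
    "stay home": ["book", "sofa"],
}

choice = decide(tag_scores, options)
print(choice)  # go outside
```

A scoring rule this crude is part of the point: the computer’s choice follows mechanically from whatever happened to be in the photos, which is where the illogical or surprising outcomes would come from.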

A storyboard of the experience:

[Storyboard images 1 and 2]

Here’s what the basic decision tree will look like as you move through each scene.



Last week I presented a handful of different design concepts for my project. The feedback from my classmates was actually very positive – while I feel that the project still lacks focus at this stage, their comments reaffirmed that the different iterations of this project are all connected by a conceptual thread. My task in the coming weeks is to continue following that thread and consider each iteration of the project a creative intervention into the same set of questions.

Theory & conceptual framework.

We know that systems that are trained on datasets that contain biases may exhibit those biases when they’re used, thus digitizing cultural prejudices like institutional racism and classism. Researchers working in the field of computer vision operate in a liminal space, one in which the consequences of their work remain undefined by public policy. Very little work has been done on “computer vision as a critical technical practice that entangles aesthetics with politics and big data with bodies,” argues Jentery Sayers.

I want to explore the ways in which algorithmic authority exercises disciplinary power on the bodies it “sees” vis-a-vis computer vision. Last week I wrote about Lacan’s concept of the gaze, a scenario in which the subject of a viewer’s gaze internalizes his or her own subjectivization. Michel Foucault wrote in Discipline and Punish about how the gaze is employed in systems of power. I’ve written extensively about biopower and surveillance in previous blog posts (here and here), but I want to continue exploring how people regulate their behavior when they know a computer is watching. Whether real or not, the computer’s gaze has a self-regulating effect on the person who knows they are being looked at.

It’s important to remember that the processes involved in training a model to recognize patterns in images are so tedious that we tend to automate them. In his paper “Computer Vision as a Public Act: On Digital Humanities and Algocracy”, Jentery Sayers suggests that computer vision algorithms represent a new kind of power called algocracy – rule of the algorithm. He argues that the “programmatic treatment of the physical world in digital form” is so deeply embedded in our modern infrastructure that these algorithms have begun shaping our behavior and asserting authority over us. An excerpt from the paper’s abstract:

Computer vision is generally associated with the programmatic description and reconstruction of the physical world in digital form (Szeliski 2010: 3-10). It helps people construct and express visual patterns in data, such as patterns in image, video, and text repositories. The processes involved in this recognition are incredibly tedious, hence tendencies to automate them with algorithms. They are also increasingly common in everyday life, expanding the role of algorithms in the reproduction of culture.

From the perspective of economic sociology, A. Aneesh links such expansion to “a new kind of power” and governance, which he refers to as “algocracy—rule of the algorithm, or rule of the code” (Aneesh 2006: 5). Here, the programmatic treatment of the physical world in digital form is so significantly embedded in infrastructures that algorithms tacitly shape behaviors and prosaically assert authority in tandem with existing bureaucracies.

Routine decisions are delegated (knowingly or not) to computational procedures that—echoing the work of Alexander Galloway (2001), Wendy Chun (2011), and many others in media studies—run in the background as protocols or default settings.

For the purposes of this MLA panel, I am specifically interested in how humanities researchers may not only interpret computer vision as a public act but also intervene in it through a sort of “critical technical practice” (Agre 1997: 155) advocated by digital humanities scholars such as Tara McPherson (2012) and Alan Liu (2012). 

I love these questions posed tacitly by pioneering CV researchers in the 1970s: How does computer vision differ from human vision? To what degree should computer vision be modeled on human phenomenology, and to what effects? Can computer or human vision even be modeled? That is, can either even be generalized? Where and when do issues of processing and memory matter most for recognition and description? And how should computer vision handle ambiguity? Now, the CV questions posed by Facebook and Apple are more along these lines: Is this you? Is this them?

The project.

So how will these new ideas help me shape my project? For one, I’ve become much more wary of using pre-trained models and ready-made data sets like the Clarifai API or Microsoft’s COCO for image recognition. This week I built a Twitter bot that uses the Clarifai API to generate pithy descriptions of images tweeted at it.


I honestly was disappointed by the lack of specificity the data set offered. However, I’m excited that Clarifai announced today a new tool for users to train their own models for image classification.

I want to probe the boundaries of these pre-trained models – where do these tools break and why? How can I distort images in a way that objects are recognized as something other than themselves? What would happen if I trained my own model on a gallery of images that I have curated? Computer vision isn’t source code; it’s a system of power.


For my project, I want to have control over the content that the model is being trained on so that it outputs interesting or surprising results. In terms of the aesthetic, I want to try out different visual ways of organizing these images – clusters, tile patterns, etc. Since training one of these models can take as long as a month, the goal for this week is to start creating the data set and the model.

I’ve been reading Wendy Chun’s Programmed Visions and Alexander Galloway’s Protocol: How Control Exists After Decentralization for months, but I’m recommitting to finishing these books in order to develop my project’s concept more fully.

As mentioned last week, I’m exploring the idea of the algorithmic gaze vis-a-vis computer vision and machine learning tools. Specifically, I’m interested in how such algorithms describe and categorize images of people. I’d like to focus primarily on the human body as subject, starting with the traditional form of the portrait. What does a computer vision algorithm see when it looks at a human body? How does it categorize and identify parts of the body? When does the algorithm break? How are human assumptions baked into the way the computer sees us?

I’m also interested in exploring the gaze as mediated through the computer. Lacan first introduced the concept of the gaze into Western philosophy, suggesting that a human’s subjectivity is determined by being observed, causing the person to experience themselves as an object that is seen. Lacan (and later Foucault) argue that we enjoy being subjectivized by the gaze of someone else: “Man, in effect, knows how to play with the mask as that beyond which there is the gaze. The screen is here the locus of mediation.”

The following ideas are variations on this theme, exploring the different capabilities of computer vision.

Project idea #1: Generative text based on image input.

At its simplest, my project could be a poetic exploration of the text produced by a machine learning algorithm when it processes an image. This week I started working with several different tools for computer vision and image processing using machine learning. I’ve been checking out some Python tools, including SimpleCV and scikit. I also tested out the Clarifai API in JavaScript.

In the example below, I’ve taken the array of keywords generated by the Clarifai API and arranged them into sentences to give the description some rhythm.

Check out the live prototype here.

I used Clarifai’s image tagging endpoint to generate an array of keywords based on the image it’s seeing and then included the top 5 keywords in a simple description.
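The keyword-to-sentence step can be sketched roughly like this. The prototype itself was written in JavaScript; this is a Python stand-in, and the describe function, the “I see” template, and the sample keywords are all invented for the example. It assumes the keywords arrive ordered from most to least confident.

```python
def describe(keywords, limit=5):
    """Turn the top few keywords into one simple sentence."""
    top = keywords[:limit]
    if not top:
        return "I see nothing."
    if len(top) == 1:
        return "I see " + top[0] + "."
    # Join all but the last keyword with commas, then add the last one with "and".
    return "I see " + ", ".join(top[:-1]) + " and " + top[-1] + "."

# Made-up keywords of the kind a tagging API might return for a portrait.
keywords = ["woman", "portrait", "fashion", "people", "adult", "one", "indoors"]
sentence = describe(keywords)
print(sentence)  # I see woman, portrait, fashion, people and adult.
```

Swapping the template, or splitting the keywords across several short sentences, is an easy way to play with the rhythm of the description.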

[Example images: toni, hijab, nina]

You can find the code in my repo over here on GitHub.


In the first project idea, I’m exploring which words an algorithm might use to describe a photo of a person. With this next idea, I’d be seeking to understand how a computer algorithm might categorize those images based on similarity. The user would input/access a large body of images and then the program would generate a cluster of related images or image pairs. Ideally the algorithm would take into account object recognition, facial recognition, composition, and context.
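As a rough sketch of the pairing step, suppose each image has already been turned into a feature vector (by a neural net or any other method). Finding each image’s closest match then comes down to comparing vectors. The function name and the tiny three-dimensional features below are invented for illustration, and cosine similarity is just one reasonable choice of measure.

```python
import numpy as np

def most_similar_pairs(features):
    """For each row, return the index of the most similar other row (cosine similarity)."""
    # Scale every vector to length 1 (assumes no all-zero rows).
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = unit @ unit.T              # cosine similarity between every pair of rows
    np.fill_diagonal(sim, -np.inf)   # an image should not be paired with itself
    return sim.argmax(axis=1)

# Toy features: images 0 and 1 resemble each other, as do images 2 and 3.
features = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.8, 0.3],
])

pairs = most_similar_pairs(features)
print(pairs)  # [1 0 3 2]
```

How interesting the pairs are depends entirely on what the feature vectors capture, whether that’s objects, faces, composition, or context.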

I was very much inspired by the work done in Tate’s most recent project Recognition, a machine learning AI that pairs photojournalism with British paintings from the Tate collection based on similarity and outputs something like this:

The result is a stunning side-by-side comparison of two images you might never have paired together. It’s the result of what happens when a neural net curates an art exhibition – not terribly far off from what a human curator might do. I’d love to riff on this idea, perhaps using the NYPL’s photo archive of portraits.

Another project that has been inspiring me lately was this clustering algorithm created by Mario Klingemann that groups together similar items:

I would love to come up with a way to categorize similar images according to content, style, and facial information – and then generate a beautiful cluster or grid of images grouped by those categories.


A variation on the first project idea, I’d like to explore the object recognition capabilities of popular computer vision libraries by taking a portrait of a person and slowly, frame by frame, incrementally distorting the image until it’s no longer recognized by the algorithm. The idea here is to test the limits of what computers can see and identify.
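The frame-by-frame test could be structured something like the sketch below. It is only an outline: find_breakpoint, fade_to_gray, and toy_recognizer are invented names, the fade is the simplest distortion I could think of, and the toy recognizer stands in for a call to a real computer vision library or API.

```python
import numpy as np

def find_breakpoint(image, recognizes, distort, steps=10):
    """Increase the distortion step by step; return the first level that is not recognized."""
    for step in range(steps + 1):
        level = step / steps  # 0.0 means untouched, 1.0 means fully distorted
        if not recognizes(distort(image, level)):
            return level
    return None  # still recognized even at full distortion

def fade_to_gray(image, level):
    """Stand-in distortion: blend the image toward a flat mid-gray."""
    return (1 - level) * image + level * 0.5

def toy_recognizer(image):
    """Stand-in for a real recognition call: 'sees' the subject while the image is bright."""
    return image.mean() > 0.77

image = np.ones((8, 8))  # a plain white square as a placeholder portrait
breakpoint_level = find_breakpoint(image, toy_recognizer, fade_to_gray)
print(breakpoint_level)  # 0.5
```

Because the distortion and the recognizer are passed in as functions, the same loop could be reused with blurring, warping, or pixelation, and with any recognition tool.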

I’m taking my cues from the project Flower, in which the artist distorted stock images of flowers and ran them through Google’s Cloud Vision API to see how far they could morph a picture while still keeping it recognizable as a flower by computer vision algorithms. It’s essentially a way to determine the algorithm’s recognizable range of distortion (as well as a human’s).

I’m interested in testing the boundaries of such algorithms and seeing where their breakpoints are when it comes to the human face.*

*After writing this post, I found an art installation Unseen Portraits that did what I’m describing – distorted images of faces in order to challenge face recognition algorithms. I definitely want to continue investigating this idea.

Project idea #4: Interpreting body gestures in paintings.

Finally, I want to return to the idea I started with last week, which was focused on the interpretation of individual human body parts. When a computer looks at an ear, a knee, a toenail, what does it see? How does it describe bodies?

Last week, I started researching hand gestures in Italian Renaissance paintings because I was interested in knowing whether a computer vision algorithm trained on hand gestures would be able to interpret hand signals from art. I thought that if traditional gestural machine learning tools proved unhelpful, it would be an amazing exercise to train a neural net on the hand signals found in religious paintings.


Terrapattern, Golan Levin

For this week’s assignment, we were to reframe or revisit our project idea through a scientific lens. Since computer vision — characterized by image analysis, recognition, and interpretation — is itself considered a scientific discipline, I struggled to find a new scientific framework through which to re-articulate my project.

Because my project is so deeply rooted in computer vision and optics, I’m interested in exploring the idea of “algorithmic gaze” as the means by which computers categorize and label bodies according to specific (and flawed) modalities of power.

Donna Haraway’s concept of the “scientific gaze” has very much influenced my research. In her paper “Situated Knowledges: The Science Question in Feminism and the Privilege of Partial Perspective”, Haraway tears apart traditional ideas of scientific objectivity, including the idea of the subject as a passive, single point of empirical knowledge and the scientific gaze as objective observer. Instead, she advocates for situated knowledge, in which subjects are recognized as complex and the scientific gaze is dissolved into a network of imperfect/contested observations. In this new framework, objects and observers are far from passive, exercising control over the scientific process.

Haraway relies on the metaphor of vision, the all-seeing eye of Western science. She describes the scientific gaze as a kind of “god trick,” a move that positions science as the omniscient observer. The metaphor of optics, vision, and gaze will be central to the development of my project. I’m interested in exploring how the “algorithmic gaze” mediates and shapes the information we receive.

Sandro Botticelli (Florentine, 1446 - 1510 ), Portrait of a Youth, c. 1482/1485, tempera on poplar panel, Andrew W. Mellon Collection 1937.1.19

My first test was using ConvNetJS, a JS library built by Andrej Karpathy that uses neural networks to paint based on an image as input. I used a detail from the painting above and ran it through the neural network. Here’s an example of the process.
[Screenshots of the ConvNetJS painting process]

Project proposal.

I intend to use this class to explore generative text as a new poetic form, culminating in the production of some kind of physical or digital artifact.

Over the coming semester, I will conduct a series of text-based experiments using deep learning methods such as Recurrent Neural Networks (RNNs) for sequence learning and Convolutional Neural Networks (CNNs) to classify images and text. I’ll also use Python (w/ Flask), JavaScript (w/ Node), and Natural Language Processing (NLP) libraries in both of those programming languages. The goal behind these experiments is to teach myself different ways of training a computer program on text to generate something new.

I’m still not sure what form the final artifact will take, whether it’s a physical book, an installation, an interactive web-based tool, a chatbot, a mobile app, or otherwise. My hope is that the form will eventually emerge through my experimentation.

Some major questions I still have about this work deal with the audience response. What can I build that will elicit an emotional response? Will people understand the intent of this project? How will they connect with it if they aren’t writers/readers/theorists?

Here’s the project map I sketched out during our class activity:


Next steps.

Since I’m still unfamiliar with some of the tools I’d like to use, for the first few weeks I intend to teach myself the basics of deep learning. I plan on using resources from Gene Kogan’s course Machine Learning for Artists, Patrick Hebron’s course Learning Machines, and Andrej Karpathy’s amazing work on RNNs. I’m going to build a week-to-week schedule to lend some structure to my experimentation. I’m taking another JavaScript-based generative text class right now, so my experiments might align with that class as well.

Resources & inspiration (an ongoing list I will update).