
Before GitHub CoPilot There Was Facebook Aroma

 3 years ago
source link: https://jrodthoughts.medium.com/before-github-copilot-there-was-facebook-aroma-a3c2751b91b6

The Facebook model improves developer productivity by filling in the details of high-level programming ideas.

Source: https://hub.packtpub.com/facebook-ai-introduces-aroma-a-new-code-recommendation-tool-for-developers/

Last week, Microsoft and OpenAI astonished the developer community by unveiling CoPilot, a machine learning model that can act as a developer assistant. After the release, a colleague asked about relevant research in this area, and I remembered a cool paper that Facebook AI Research (FAIR) published a couple of years ago.

Code reusability is one of the biggest challenges for large development teams. Programmers regularly work on tasks that are similar to code written somewhere else, but the processes for finding and reusing that code remain mostly manual. Developer documentation is always falling out of sync, and existing code search tools return very inconsistent results. However, with large knowledge bases such as StackOverflow, you would think that achieving code reusability across a large number of developers is an attainable goal. Could we model code reusability as a machine learning problem? In 2019, the Facebook engineering team published a research paper in which they unveiled Aroma, a method to achieve code reusability using machine learning and structural search.

The goal of Aroma is relatively simple: making programming a semi-automated task in which developers express higher-level ideas and algorithms fill in the details. Every code snippet associated with a specific task contains many details that go beyond the task itself, such as error handling, callbacks and many others. Consider the scenario in which a developer would like to write Android code to decode a bitmap without incurring a high memory overhead. Our developer, who is familiar with the Android libraries, intends to solve the task using the following code:

Bitmap bitmap = BitmapFactory.decodeStream(input);

However, a popular post on StackOverflow reveals that the previous code is vulnerable to memory errors and that there is a better option:

final BitmapFactory.Options options = new BitmapFactory.Options();
options.inSampleSize = 2;
// ...
Bitmap bmp = BitmapFactory.decodeStream(is, null, options);

The traditional approach to code reusability has been code-to-code search tools. However, those tools tend to return all sorts of code snippets without much filtering. Alternatives such as pattern-based code completion tools are able to mine common API usage patterns from a large corpus and use those patterns to recommend code completions for partially written programs. However, those tools tend to be effective only for well-known patterns. Code clone detectors are another set of techniques that could potentially be used to retrieve recommended code snippets. However, code clone detection tools usually retrieve code snippets that are almost identical to the query snippet, so they add little beyond what the developer has already written.

The ideal code recommendation tool should have several key characteristics:

· Fast

· Provide recommendations that are similar to a given code snippet.

· Provide code recommendations that are different from each other.

· Require no training on specific patterns or tasks.

Introducing Aroma

Facebook Aroma is a code search and recommendation tool. Given a code snippet as an input query and a large corpus of code containing millions of methods, Aroma returns a set of recommended code snippets, each of which approximately contains the query snippet and reflects code that is common to several methods in the corpus. Aroma improves over traditional code search tools in several key areas:

  • Aroma performs search on syntax trees. Rather than looking for string-level or token-level matches, Aroma can find instances that are syntactically similar to the query code and highlight the matching code by pruning unrelated syntax structures.
  • Aroma automatically clusters together similar search results to generate code recommendations. These recommendations represent idiomatic coding patterns and are easier to consume than unclustered search matches.
  • Aroma is fast enough to use in real time. In practice, it creates recommendations within seconds even for very large codebases and does not require pattern mining ahead of time.
  • Aroma’s core algorithm is language-agnostic. Facebook has deployed Aroma across its internal codebases in Hack, JavaScript, Python, and Java.

The Aroma pipeline can be divided into two main stages. The first stage focuses on indexing a large code corpus into a feature matrix, which the second stage then uses to produce recommendations.

Image Credit: Facebook AI Research

The first step in the Aroma workflow is to index the code corpus as a sparse matrix. Specifically, Aroma parses each method in the corpus and creates its parse tree. Then it extracts a set of structural features from the parse tree of each method.
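
To make the indexing idea concrete, here is a minimal sketch in Java. It is not Aroma's feature extractor: Aroma derives its features from a parse tree, while this sketch assumes a crude stand-in that uses each token and each pair of adjacent tokens as features. The class name FeatureIndexSketch and all of its methods are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: maps method bodies to rows of a binary sparse matrix.
final class FeatureIndexSketch {
    // An identifier, a run of digits, or any single non-space symbol.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|[0-9]+|\\S");

    // Global vocabulary: feature string -> column index of the matrix.
    private final Map<String, Integer> vocabulary = new LinkedHashMap<>();

    static List<String> tokenize(String code) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Assumption: tokens and adjacent token pairs stand in for the
    // structural features that Aroma extracts from a parse tree.
    static Set<String> features(String code) {
        List<String> tokens = tokenize(code);
        Set<String> result = new TreeSet<>();
        for (int i = 0; i < tokens.size(); i++) {
            result.add(tokens.get(i));
            if (i + 1 < tokens.size()) {
                result.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return result;
    }

    // Returns the sorted column indices that are set to 1 for this method.
    Set<Integer> indexMethod(String methodBody) {
        Set<Integer> row = new TreeSet<>();
        for (String feature : features(methodBody)) {
            Integer column = vocabulary.get(feature);
            if (column == null) {
                column = vocabulary.size();
                vocabulary.put(feature, column);
            }
            row.add(column);
        }
        return row;
    }

    public static void main(String[] args) {
        FeatureIndexSketch index = new FeatureIndexSketch();
        System.out.println(index.indexMethod(
                "Bitmap bitmap = BitmapFactory.decodeStream(input);"));
        System.out.println(index.indexMethod(
                "Bitmap bmp = BitmapFactory.decodeStream(is, null, options);"));
    }
}
```

Each method body becomes one sparse row, and two methods that share API calls end up sharing column indices, which is what makes the later similarity search cheap.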

Image Credit: Facebook AI Research

After processing a new code snippet, Aroma creates a sparse vector in the manner described above and takes the dot product of this vector with the matrix containing the feature vectors of all existing methods. The top 1,000 method bodies whose dot products are highest are retrieved as the candidate set for recommendation.
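
The retrieval step can be sketched as follows, assuming binary feature vectors stored as sets of column indices, so that the dot product is simply the number of shared features. The class RetrievalSketch and its methods are hypothetical names, and a real deployment would use a sparse matrix library rather than a linear scan.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of dot-product retrieval over binary sparse vectors.
final class RetrievalSketch {
    // For 0/1 vectors the dot product is the size of the intersection.
    static int dot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> small = a.size() <= b.size() ? a : b;
        Set<Integer> large = (small == a) ? b : a;
        int shared = 0;
        for (Integer column : small) {
            if (large.contains(column)) {
                shared++;
            }
        }
        return shared;
    }

    // Returns the indices of the k corpus rows with the highest dot product.
    static List<Integer> topK(Set<Integer> query, List<Set<Integer>> corpus, int k) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < corpus.size(); i++) {
            ids.add(i);
        }
        // List.sort is stable, so ties keep their original corpus order.
        ids.sort(Comparator.comparingInt(
                (Integer i) -> dot(query, corpus.get(i))).reversed());
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }

    public static void main(String[] args) {
        List<Set<Integer>> corpus = List.of(
                Set.of(1, 2, 3), Set.of(7, 8), Set.of(1, 2));
        System.out.println(topK(Set.of(1, 2, 9), corpus, 2));
    }
}
```

In the article's setting, k would be 1,000 and the corpus would hold one row per indexed method.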

Image Credit: Facebook AI Research

From the candidate set, Aroma needs to discard similar code snippets. Aroma achieves that by first reranking the candidate methods by their similarity to the query code snippet. For this ranking, Aroma prunes each retrieved code snippet so that the resulting snippet becomes maximally similar to the query snippet. After obtaining a list of candidate code snippets in descending order of similarity to the query, Aroma runs an iterative clustering algorithm to find clusters of code snippets that are similar to each other and contain extra statements useful for creating code recommendations.
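
The clustering idea can be illustrated with a generic greedy threshold clustering. This is a swapped-in technique, not Aroma's actual clustering algorithm. It assumes candidates arrive already ranked, represents each one as a set of feature indices, and groups a candidate with an earlier seed when their Jaccard similarity reaches a threshold. The class name, method names and threshold are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: greedy threshold clustering of ranked candidates.
final class ClusteringSketch {
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return union == 0 ? 0.0 : (double) common.size() / union;
    }

    // ranked: feature sets in descending order of similarity to the query.
    // Returns clusters as lists of positions in the ranked list.
    static List<List<Integer>> cluster(List<Set<Integer>> ranked, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        boolean[] assigned = new boolean[ranked.size()];
        for (int seed = 0; seed < ranked.size(); seed++) {
            if (assigned[seed]) {
                continue;
            }
            List<Integer> members = new ArrayList<>();
            members.add(seed);
            assigned[seed] = true;
            // Pull in every later, unassigned candidate close enough to the seed.
            for (int j = seed + 1; j < ranked.size(); j++) {
                if (!assigned[j]
                        && jaccard(ranked.get(seed), ranked.get(j)) >= threshold) {
                    members.add(j);
                    assigned[j] = true;
                }
            }
            clusters.add(members);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<Set<Integer>> ranked = List.of(
                Set.of(1, 2, 3), Set.of(7, 8), Set.of(1, 2, 3, 4), Set.of(7, 8, 9));
        System.out.println(cluster(ranked, 0.5));
    }
}
```

Because the best-ranked candidate always seeds the first cluster, the most query-like snippets drive the first recommendation.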

Image Credit: Facebook AI Research

The last stage of the Aroma workflow uses an intersection algorithm to recommend succinct code snippets that have fewer irrelevant statements. The intersection algorithm works by taking the first code snippet in a cluster as the “base” code and then iteratively pruning it with respect to every other method in the cluster. The code that remains after the pruning process is the code that is common among all methods, and it becomes the code recommendation.

Let’s go back to our Android image processing example and assume that we have the following 3 code snippets.

Code Snippet 1:

InputStream is = ...;
final BitmapFactory.Options options = new BitmapFactory.Options();
options.inSampleSize = 2;
Bitmap bmp = BitmapFactory.decodeStream(is, null, options);
ImageView imageView = ...;
imageView.setImageBitmap(bmp);

Code Snippet 2:

BitmapFactory.Options options = new BitmapFactory.Options();
while (...) {
    in = ...;
    options.inSampleSize = 2;
    options.inJustDecodeBounds = false;
    bitmap = BitmapFactory.decodeStream(in, null, options);
}

Code Snippet 3:

BitmapFactory.Options bmpFactoryOptions = new BitmapFactory.Options();
// some setup code
try {
    bmpFactoryOptions.inSampleSize = 2;
    loadedBitmap = BitmapFactory.decodeStream(inputStream, null, bmpFactoryOptions);
    // some code...
} catch (OutOfMemoryError oom) {
}

In this scenario, Aroma finds the common code by first pruning the lines in the first code snippet that do not appear in the second snippet. The code in code snippet 1 about ImageView does not appear in code snippet 2 and is therefore removed. Now Aroma takes this intermediate snippet and prunes the lines that do not appear in code snippet 3, code snippet 4, and so on. The resulting code is returned as a code recommendation. The recommended result would look like the following:

InputStream is = ...;
final BitmapFactory.Options options = new BitmapFactory.Options();
options.inSampleSize = 2;
Bitmap bmp = BitmapFactory.decodeStream(is, null, options);
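
The intersection step described above can be approximated with a line-level sketch. Aroma prunes parse trees, whereas this sketch assumes a much cruder matcher: two lines count as the same statement when they share at least half of their API tokens (capitalized type names and member names after a dot). The class IntersectionSketch, its methods and the 50% rule are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: prune a base snippet against the rest of its cluster.
final class IntersectionSketch {
    // Member names (identifier after a dot) or capitalized type names.
    private static final Pattern API_TOKEN =
            Pattern.compile("\\.([A-Za-z_]\\w*)|\\b([A-Z]\\w*)");

    static Set<String> apiTokens(String line) {
        Set<String> tokens = new HashSet<>();
        Matcher m = API_TOKEN.matcher(line);
        while (m.find()) {
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    // Assumption: lines match when their API tokens have Jaccard >= 0.5.
    static boolean similar(String a, String b) {
        Set<String> ta = apiTokens(a);
        Set<String> tb = apiTokens(b);
        if (ta.isEmpty() || tb.isEmpty()) {
            return false;
        }
        Set<String> common = new HashSet<>(ta);
        common.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return 2 * common.size() >= union.size();
    }

    // Keeps only the base lines that have a counterpart in the other snippet.
    static List<String> prune(List<String> base, List<String> other) {
        List<String> kept = new ArrayList<>();
        for (String line : base) {
            if (other.stream().anyMatch(o -> similar(line, o))) {
                kept.add(line);
            }
        }
        return kept;
    }

    // The first snippet is the base; prune it against every other snippet.
    static List<String> intersect(List<List<String>> cluster) {
        List<String> result = cluster.get(0);
        for (List<String> other : cluster.subList(1, cluster.size())) {
            result = prune(result, other);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> s1 = List.of(
                "InputStream is = ...;",
                "final BitmapFactory.Options options = new BitmapFactory.Options();",
                "options.inSampleSize = 2;",
                "Bitmap bmp = BitmapFactory.decodeStream(is, null, options);",
                "ImageView imageView = ...;",
                "imageView.setImageBitmap(bmp);");
        List<String> s2 = List.of(
                "BitmapFactory.Options options = new BitmapFactory.Options();",
                "options.inSampleSize = 2;",
                "options.inJustDecodeBounds = false;",
                "bitmap = BitmapFactory.decodeStream(in, null, options);");
        List<String> s3 = List.of(
                "BitmapFactory.Options bmpFactoryOptions = new BitmapFactory.Options();",
                "bmpFactoryOptions.inSampleSize = 2;",
                "loadedBitmap = BitmapFactory.decodeStream(inputStream, null, bmpFactoryOptions);");
        intersect(List.of(s1, s2, s3)).forEach(System.out::println);
    }
}
```

On these abbreviated copies of the three snippets, the sketch keeps the options setup, the inSampleSize assignment and the decodeStream call, and drops both ImageView lines. It also drops the InputStream declaration that the recommendation above retains, because a token-level matcher cannot see the structural similarity that parse-tree pruning can.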

To test Aroma, Facebook indexed a dataset of 5,417 Android projects that haven’t been forked. For evaluation, Facebook picked the 500 most popular questions on Stack Overflow with the android tag. From these questions, Facebook only considered the top-voted answers. From each answer, Facebook extracted all Java code snippets containing at least 3 tokens, a method call, and fewer than 20 lines, excluding comments. From this dataset, Facebook randomly picked 64 Java code snippets, which were then used as queries to evaluate the quality of Aroma’s recommendations. Based on those queries, Aroma’s recommendations fell into five key groups:

1) Configuring Objects: The recommended code suggests additional configurations on objects that are already appearing in the query code.

2) Error Checking and Handling: The recommended code adds null checks and other checks before using an object, or adds a try-catch block that guards the original code snippet.

3) Post-processing: The recommended code extends the query code to perform some common operations on the objects or values computed by the query code.

4) Correlated Statements: The recommended code adds statements that do not affect the original functionalities of the query code, but rather suggests related statements that commonly appear alongside the query code.

5) Unclustered Recommendations: In rare cases, the query code snippet could match method bodies that are mostly different from each other.

Image Credit: Facebook AI Research

The variety shows that Aroma is able to recommend code snippets that are not only different but that complement a given code input in non-trivial ways.

Code recommendations can definitely be modeled as a machine learning problem with structured search capabilities. The idea that algorithms can improve code in real time seems achievable given the rich programming datasets available on sites like GitHub or StackOverflow. Aroma is still highly experimental, but its applications within Facebook are likely to raise the bar for this type of stack.

