This is a small, deprecated list of papers that can be used as seeds for the seminar. Papers are loosely categorized for your convenience. You can propose other categories that are not listed here and suggest papers that you believe fit the domain of AI4CA, such as:
- “Alternative machine learning models for code-related tasks, e.g. the S4 paper”,
- “Code generation as planners (deliberately left vague), e.g. ViperGPT 🐍”,
- “Neuro-symbolic methodologies for code-related tasks”.
Use this list to understand the domain we are working in and to find more up-to-date papers by:
- Looking through the authors of interesting papers to find their newer work on Google Scholar or Semantic Scholar
- Using online tools such as Connected Papers, Papers with Code (useful for Project Path), etc.
- Any other approach you can think of…
Abbreviations
- Code Classification: cc
- Code Clone Detection: ccd
- Code Completion: cco
- Clone Detection: cd
- Code Documentation: cdo
- Code Summarization: cs
- Code Search: cse
- Code Retrieval: cr
- Code Translation: ct
- Program Similarity: ps
- Program Repair: pr
- Vulnerability Identification: vi
- Variable Misuse Prediction: vmp
Papers are loosely labeled with:
- Related major subject: the section name (e.g. Analysis)
- Items have the following structure:
  - Extra related subjects, in braces (see Abbreviations above), e.g. {cc} for Code Classification
  - Nickname of the model/dataset presented, in parentheses, e.g. (CodeBLEU)
Analysis
- What Do They Capture? – A Structural Analysis of Pre-Trained Language Models for Source Code
- What do pre-trained code models know about code?
- An Extensive Study on Pre-trained Models for Program Understanding and Generation
- (CodeBLEU) CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
- {vi,ccd} Learning Program Semantics with Code Representations: An Empirical Study
- {cs} Semantic Similarity Metrics for Evaluating Source Code Summarization
Sample-based Methods
- Fix-Filter-Fix: Intuitively Connect Any Models for Effective Bug Fixing
- Code Generation Tools (Almost) for Free? A Study of Few-Shot, Pre-Trained Language Models on Code
Pretrained Language Models (PLMs)
- (IntelliCode) IntelliCode Compose: Code Generation Using Transformer
- (CuBERT) Learning and Evaluating Contextual Embedding of Source Code
- (PLBART) Unified pre-training for program understanding and generation
- (Codex) Evaluating large language models trained on code
- (CodeBERT) CodeBERT: A Pre-Trained Model for Programming and Natural Languages
- (GraphCodeBERT) GraphCodeBERT: Pre-training Code Representations with Data Flow
- (UniXcoder) UniXcoder: Unified Cross-Modal Pre-training for Code Representation
- (CodeT5) CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
- (CoTexT) CoTexT: Multi-task Learning with Code-Text Transformer
- {cs,cdo} (TreeBERT) TreeBERT: A Tree-Based Pre-Trained Model for Programming Language
- (CodeTrans) CodeTrans: Towards Cracking the Language of Silicon’s Code Through Self-Supervised Deep Learning and High Performance Computing
- (SynCoBERT) SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation
Transformer-based
- {cs} Language-Agnostic Representation Learning of Source Code from Structure and Context
- {cs,cco} Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy
- {cse,cco} (ReACC) ReACC: A Retrieval-Augmented Code Completion Framework
- {cse,cco} Retrieval Augmented Code Generation and Summarization
- A Structural Transformer with Relative Positions in Trees for Code-to-Sequence Tasks
- Code Prediction by Feeding Trees to Transformers
- {cs} Integrating Tree Path in Transformer for Code Representation
- {ct} (TransCoder) Unsupervised Translation of Programming Languages
Graph-based
- {cs} Learning to Represent Programs with Heterogeneous Graphs
- {cs} CoCoGUM: Contextual Code Summarization with Multi-Relational GNN on UMLs
- {cs} Improved Code Summarization via a Graph Neural Network
- Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks
- {pr} (GREAT) Global Relational Models of Source Code
Dataset
- (CodeXGLUE) CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
- (CodeSearchNet) CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
- (Graph4Code) A Toolkit for Generating Code Knowledge Graphs
- (BigCloneBench) Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree
- (POJ-104) Convolutional Neural Networks over Tree Structures for Programming Language Processing
- CONTEST: A Unit Test Completion Benchmark featuring Context
- An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation
- Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow
- {cse} Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries
Representation learning (Embedding)
- code2vec: Learning Distributed Representations of Code
- Disentangled Code Representation Learning for Multiple Programming Languages
- Blended, Precise Semantic Program Embeddings
- {cco,vmp} On the Embeddings of Variables in Recurrent Neural Networks for Source Code
- {pr} (SCELMo) SCELMo: Source Code Embeddings from Language Models
- {ccd} Contrastive Code Representation Learning
- {vi,ccd} (DISCO) Towards Learning (Dis)-Similarity of Source Code from Program Contrasts
- Neural Code Comprehension: A Learnable Representation of Code Semantics
Code Generation
- (code2seq) code2seq: Generating Sequences from Structured Representations of Code
- (Grammformer) Learning to Complete Code with Sketches
- Using Deep Learning to Generate Complete Log Statements
- (CommitBERT) CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model
Program Repair
- Can we learn from developer mistakes? Learning to localize and repair real bugs from real bug fixes
- (TFix) TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer
- Self-Supervised Bug Detection and Repair
- {pr} Jointly Learning to Repair Code and Generate Commit Message