Seed Papers
These are small, curated lists of papers that can be used as seeds for the seminar. Papers are loosely categorized for your convenience. You can also come up with categories that are not listed here and suggest papers that you believe fit the domain of AI4CA, such as
- “alternative machine learning models for code-related tasks”, e.g., S4,
- “code generation as planners” (deliberately left vague), e.g., ViperGPT 🐍,
- “neuro-symbolic methodologies for code-related tasks”.
Use this list to get familiar with the domain we are working in and to find related, more up-to-date papers by
- looking up the authors (or conferences) of interesting papers to find their newer work on Google Scholar, Semantic Scholar, DBLP, etc. (see the sketch after this list for a programmatic starting point),
- using online tools such as Connected Papers, Papers With Code, etc.,
- any other way you can think of …
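If you prefer to search programmatically, the sketch below starts from a seed paper and lists newer papers that cite it. This is a minimal sketch, assuming the public Semantic Scholar Graph API (https://api.semanticscholar.org/graph/v1) and the Python requests package; the find_citing_papers helper, the field selection, and the limits are illustrative choices, not something prescribed by any paper on this list.

```python
# Minimal sketch: resolve a seed paper by title, then list papers that cite it.
# Assumes the public Semantic Scholar Graph API; unauthenticated requests are rate-limited.
import requests

API = "https://api.semanticscholar.org/graph/v1"


def find_citing_papers(title: str, limit: int = 20) -> list[dict]:
    """Return papers citing the best search match for `title`, newest first."""
    # 1. Resolve the seed paper to an ID via keyword search.
    search = requests.get(
        f"{API}/paper/search",
        params={"query": title, "fields": "title,year", "limit": 1},
        timeout=30,
    ).json()
    if not search.get("data"):
        return []
    paper_id = search["data"][0]["paperId"]

    # 2. Fetch citing papers and sort by year, newest first.
    citations = requests.get(
        f"{API}/paper/{paper_id}/citations",
        params={"fields": "title,year,externalIds", "limit": limit},
        timeout=30,
    ).json()
    citing = [c["citingPaper"] for c in citations.get("data", [])]
    return sorted(citing, key=lambda p: p.get("year") or 0, reverse=True)


if __name__ == "__main__":
    seed = "GraphCodeBERT: Pre-training Code Representations with Data Flow"
    for paper in find_citing_papers(seed):
        print(paper.get("year"), "-", paper["title"])
```

Running it with any title from the lists below is a quick way to grow each category with more recent work.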
Abbreviations
- Code Classification: cc
- Code Clone Detection: ccd
- Code Completion: cco
- Code Documentation: cdo
- Code Summarization: cs
- Code Search: cse
- Code Retrieval: cr
- Code Translation: ct
- Program Similarity: ps
- Program Repair: pr
- Vulnerability Identification: vi
- Variable Misuse Prediction: vmp
- Constrained Decoding: cd
Papers are loosely labeled with:
- related major subject: the section name (e.g., Analysis)
- per-item annotations:
  - extra related subjects in curly braces (see Abbreviations above), e.g., {cc} = Code Classification
  - the nickname of the model or dataset presented, in parentheses, e.g., (CodeBLEU)

For example, "{cse,cco} (ReACC) ReACC: A Retrieval-Augmented Code Completion Framework" under Transformer-based is also relevant to Code Search and Code Completion and presents the ReACC framework.
Analysis
- What Do They Capture? – A Structural Analysis of Pre-Trained Language Models for Source Code
- What do pre-trained code models know about code?
- An Extensive Study on Pre-trained Models for Program Understanding and Generation
- (CodeBLEU) CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
- {vi,ccd} Learning Program Semantics with Code Representations: An Empirical Study
- {cs} Semantic Similarity Metrics for Evaluating Source Code Summarization
- Robustness, Security, Privacy, Explainability, Efficiency, and Usability of Large Language Models for Code
- {vi} A Survey of Trojans in Neural Models of Source Code: Taxonomy and Techniques
- {vi} Prompt Injection Attacks and Defenses in LLM-Integrated Applications
Sample-based Methods
- Fix-Filter-Fix: Intuitively Connect Any Models for Effective Bug Fixing
- Code Generation Tools (Almost) for Free? A Study of Few-Shot, Pre-Trained Language Models on Code
Pre-trained Language Models (PLMs)
- (IntelliCode) IntelliCode Compose: Code Generation Using Transformer
- (CuBERT) Learning and Evaluating Contextual Embedding of Source Code
- (PLBART) Unified Pre-training for Program Understanding and Generation
- (Codex) Evaluating Large Language Models Trained on Code
- (CodeBERT) CodeBERT: A Pre-Trained Model for Programming and Natural Languages
- (GraphCodeBERT) GraphCodeBERT: Pre-training Code Representations with Data Flow
- (UniXcoder) UniXcoder: Unified Cross-Modal Pre-training for Code Representation
- (CodeT5) CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
- (CoTexT) CoTexT: Multi-task Learning with Code-Text Transformer
- {cs,cdo} (TreeBERT) TreeBERT: A Tree-Based Pre-Trained Model for Programming Language
- (CodeTrans) CodeTrans: Towards Cracking the Language of Silicon’s Code Through Self-Supervised Deep Learning and High Performance Computing
- (SynCoBERT) SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation
Transformer-based
- {cs} Language-Agnostic Representation Learning of Source Code from Structure and Context
- {cs,cco} Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy
- {cse,cco} (ReACC) ReACC: A Retrieval-Augmented Code Completion Framework
- {cse,cco} Retrieval Augmented Code Generation and Summarization
- A Structural Transformer with Relative Positions in Trees for Code-to-Sequence Tasks
- Code Prediction by Feeding Trees to Transformers
- {cs} Integrating Tree Path in Transformer for Code Representation
- {ct} (TransCoder) Unsupervised Translation of Programming Languages
Graph-based
- {cs} Learning to Represent Programs with Heterogeneous Graphs
- {cs} CoCoGUM: Contextual Code Summarization with Multi-Relational GNN on UMLs
- {cs} Improved Code Summarization via a Graph Neural Network
- {vi} Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks
- {pr} (GREAT) Global Relational Models of Source Code
Datasets
- (CodeXGLUE) CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
- (CodeSearchNet) CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
- (Graph4Code) A Toolkit for Generating Code Knowledge Graphs
- (BigCloneBench) Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree
- (POJ-104) Convolutional Neural Networks over Tree Structures for Programming Language Processing
- CONTEST: A Unit Test Completion Benchmark featuring Context
- An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation
- Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow
- {cse} Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries
- {cd,vi} (CodeGuard+) Constrained Decoding for Secure Code Generation
Representation Learning (Embeddings)
- (code2vec) code2vec: Learning Distributed Representations of Code
- Disentangled Code Representation Learning for Multiple Programming Languages
- Blended, Precise Semantic Program Embeddings
- {cco,vmp} On the Embeddings of Variables in Recurrent Neural Networks for Source Code
- {pr} (SCELMo) SCELMo: Source Code Embeddings from Language Models
- {ccd} Contrastive Code Representation Learning
- {vi,ccd} (DISCO) Towards Learning (Dis)-Similarity of Source Code from Program Contrasts
- Neural Code Comprehension: A Learnable Representation of Code Semantics
- TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation
Code Generation
- (code2seq) code2seq: Generating Sequences from Structured Representations of Code
- (Grammformer) Learning to Complete Code with Sketches
- Using Deep Learning to Generate Complete Log Statements
- (CommitBERT) CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model
- {cd} Synchromesh: Reliable code generation from pre-trained language models
- {cd} (MGD) Guiding Language Models of Code with Global Context using Monitors
- {pr} CodeFusion: A Pre-trained Diffusion Model for Code Generation
Program Repair
- Can we learn from developer mistakes? Learning to localize and repair real bugs from real bug fixes
- (TFix) TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer
- Self-Supervised Bug Detection and Repair
- {pr} Jointly Learning to Repair Code and Generate Commit Message
- {cd,pr} (Repilot) Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair
Miscellaneous
- Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent
- Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations
- A Systematic Evaluation of Large Language Models of Code
- {cco} RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion
- HiRoPE: Length Extrapolation for Code Models Using Hierarchical Position
- {cd} (LMQL) Prompting Is Programming: A Query Language For Large Language Models