Building Statistical Language Models of code

Peter Schulam; Roni Rosenfeld; Premkumar Devanbu

doi:10.1109/DAPSE.2013.6603797

2013 1st International Workshop on Data Analysis Patterns in Software Engineering (DAPSE)

Year: 2013, Pages: 1-3

Authors

Peter Schulam, Language Technologies Institute, Carnegie Mellon University, USA
Roni Rosenfeld, Language Technologies Institute, Carnegie Mellon University, USA
Premkumar Devanbu, Dept. of Computer Science, University of California at Davis, USA

Abstract

We present the Source Code Statistical Language Model data analysis pattern. Statistical language models have been an enabling tool for a wide array of important language technologies. Speech recognition, machine translation, and document summarization (to name a few) all rely on statistical language models to assign probability estimates to natural language utterances or sentences. In this data analysis pattern, we describe the process of building n-gram language models over software source files. We hope that by introducing the empirical software engineering community to best practices that have been established over the years in research for natural languages, statistical language models can become a tool that SE researchers are able to use to explore new research directions.
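The abstract does not spell out the modeling details, so the following is only a minimal sketch of what an n-gram model over source files might look like. It assumes a bigram model, a simple regular-expression tokenizer, add-one smoothing, and an `<unk>` token for unseen vocabulary; none of these choices, nor the names `tokenize`, `train_bigram`, `prob`, and `cross_entropy`, come from the paper. Stronger smoothing methods (for example Kneser-Ney) are standard in natural language work and would normally replace add-one smoothing.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not the paper's): identifiers,
# integer literals, or any single non-whitespace character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

START, END, UNK = "<s>", "</s>", "<unk>"


def tokenize(source):
    """Split a source string into a flat list of lexical tokens."""
    return TOKEN_RE.findall(source)


def train_bigram(files):
    """Count bigrams and their left contexts over a list of source strings."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = {END, UNK}
    for source in files:
        tokens = [START] + tokenize(source) + [END]
        vocab.update(tokens[1:])
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def prob(model, prev, cur):
    """Add-one smoothed estimate of P(cur | prev)."""
    context_counts, bigram_counts, vocab = model
    return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))


def cross_entropy(model, source):
    """Average negative log2 probability per token of a source string."""
    vocab = model[2]
    # Map tokens never seen in training to the unknown-token symbol.
    body = [t if t in vocab else UNK for t in tokenize(source)]
    tokens = [START] + body + [END]
    log_prob = sum(math.log2(prob(model, p, c)) for p, c in zip(tokens, tokens[1:]))
    return -log_prob / (len(tokens) - 1)


if __name__ == "__main__":
    model = train_bigram(["x = x + 1", "y = y + 1"])
    print(cross_entropy(model, "x = x + 1"))
    print(cross_entropy(model, "foo(bar)"))
```

Under these assumptions, a line that resembles the training files receives a lower cross-entropy than one made up of unseen tokens, which is the kind of probability estimate the pattern is meant to provide.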
