Hash function - Wikipedia

A hash function is any function that can be used to map data of arbitrary size to data of a fixed size. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. One use is a data structure called a hash table, widely used in computer software for rapid data lookup. Hash functions accelerate table or database lookup by detecting duplicated records in a large file. An example is finding similar stretches in DNA sequences. They are also useful in cryptography. A cryptographic hash function allows one to easily verify that some input data maps to a given hash value, but if the input data is unknown, it is deliberately difficult to reconstruct it (or equivalent alternatives) from the stored hash value alone. This is used for assuring the integrity of transmitted data, and is the building block for HMACs, which provide message authentication.

Hash functions are related to (and often confused with) checksums, check digits, fingerprints, lossy compression, randomization functions, error-correcting codes, and ciphers. Although these concepts overlap to some extent, each has its own uses and requirements and is designed and optimized differently.
The Hash Keeper database maintained by the American National Drug Intelligence Center, for instance, is more aptly described as a catalogue of file fingerprints than of hash values.

Hash tables

In a hash table, the hash function is used to map the search key to an index; the index gives the place in the hash table where the corresponding record should be stored. Hash tables, in turn, are used to implement associative arrays and dynamic sets. Typically, the domain of a hash function (the set of possible keys) is larger than its range (the number of different table indices), and so it will map several different keys to the same index. Therefore, each slot of a hash table is associated with (implicitly or explicitly) a set of records, rather than a single record. For this reason, each slot of a hash table is often called a bucket, and hash values are also called bucket indices. Thus, the hash function only hints at the record's location. Still, in a half-full table, a good hash function will typically narrow the search down to only one or two entries.

Hash functions are also used to build caches for large data sets stored in slow media. A cache is generally simpler than a hashed search table, since any collision can be resolved by discarding or writing back the older of the two colliding items. This is also used in file comparison.

Bloom filters

Hash functions are an essential ingredient of the Bloom filter, a space-efficient probabilistic data structure for testing whether an element is a member of a set.

To find duplicate records in a large file, one can build a table T of record numbers indexed by the hash value of each record. Once the table is complete, any two duplicate records will end up in the same bucket. The duplicates can then be found by scanning every bucket of T that contains two or more records and comparing them. With a table of appropriate size, this method is likely to be much faster than any alternative approach (such as sorting the file and comparing all consecutive pairs).

Protecting data

Using a hash value to identify or verify data requires that the hash function be collision-resistant, which means that it is very hard to find data that will generate the same hash value. These functions are categorized into cryptographic hash functions and provably secure hash functions.
Functions in the second category are the most secure but also too slow for most practical purposes. Collision resistance is accomplished in part by generating very large hash values. For example, SHA-1, one of the most widely used cryptographic hash functions, generates 160-bit values.

Finding similar records

To locate records whose keys are similar but not identical, one needs a hash function that maps similar keys to hash values that differ by at most m, where m is a small integer (say, 1 or 2). If one builds a table T of all record numbers, using such a hash function, then similar records will end up in the same bucket, or in nearby buckets. Then one need only compare the records in each bucket with those in buckets at most m positions away. For such applications, the hash function must be as insensitive as possible to data capture or transmission errors, and to trivial changes such as timing and volume changes, compression, etc.

The same techniques can be used to find equal or similar stretches in a large collection of strings. In this case, the input strings are broken into many small pieces, and a hash function is used to detect potentially equal pieces, as above. The Rabin-Karp string-searching algorithm is based on the use of hashing to compare strings.

Geometric hashing

When hashing is applied to geometric data, the set of all inputs is some sort of metric space, and the hashing function can be interpreted as a partition of that space into a grid of cells. The table is often an array with two or more indices (called a grid file, grid index, bucket grid, and similar names), and the hash function returns an index tuple. This special case of hashing is known as geometric hashing or the grid method. Geometric hashing is also used in telecommunications (usually under the name vector quantization) to encode and compress multi-dimensional signals.

Standard uses of hashing in cryptography

The exact requirements placed on a hash function depend on the application; for example, a hash function well suited to indexing data will probably be a poor choice for a cryptographic hash function.

Determinism

A hash procedure must be deterministic: for a given input value it must always generate the same hash value. In other words, it must be a function of the data to be hashed, in the mathematical sense of the term.
This requirement excludes hash functions that depend on external variable parameters, such as pseudo-random number generators or the time of day. It also excludes functions that depend on the memory address of the object being hashed in cases where the address may change during execution (as may happen on systems that use certain methods of garbage collection), although sometimes rehashing of the item is possible. Determinism here is relative to the context in which the function is reused. For example, Python's hash function makes use of a randomized seed, generated once when the Python process starts, in addition to the input to be hashed. The Python hash is still a valid hash function when used within a single run. But if the values are persisted (for example, written to disk), they can no longer be treated as valid hash values, since in the next run the random seed might differ.

Uniformity

A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability. The reason for this requirement is that the cost of hashing-based methods goes up sharply as the number of collisions increases. If some hash values are more likely to occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding table entries. Note that this criterion only requires the values to be uniformly distributed, not random in any sense. A good randomizing function is (barring computational efficiency concerns) generally a good choice as a hash function, but the converse need not be true.

Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries.
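One way to see the uniformity criterion in practice is to hash a small set of keys into n buckets and compare the largest bucket with the average load. The sketch below is illustrative only: it assumes 32-bit FNV-1a as the hash function and uses a made-up list of member names, neither of which comes from the text above, and the names fnv1a_32 and bucket_loads are hypothetical.

```python
from collections import Counter


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a, used here only as an example hash function (an assumption)."""
    h = 0x811C9DC5                          # FNV offset basis
    for byte in data:
        h ^= byte                           # fold in one input byte
        h = (h * 0x01000193) & 0xFFFFFFFF   # multiply by the FNV prime, keep 32 bits
    return h


def bucket_loads(keys, n):
    """Count how many keys fall into each of the n buckets."""
    return Counter(fnv1a_32(k.encode("utf-8")) % n for k in keys)


# Hypothetical membership list: m = 100 names hashed into n = 128 slots.
members = [f"member-{i:03d}" for i in range(100)]
loads = bucket_loads(members, 128)
print("largest bucket:", max(loads.values()))
print("average load m/n:", len(members) / 128)
```

A hash function that meets the criterion keeps the largest bucket close to the average load m/n for typical key sets, not only for the set of all possible keys.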
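In other words, if a typical set of m records is hashed to n table slots, the probability of a bucket receiving many more than m/n records should be vanishingly small. In particular, if m is less than n, very few buckets should have more than one or two records.

It is often desirable that the output of a hash function have a fixed size. If, for example, the output is constrained to fixed-width integer values, the hash values can be used to index into an array. Such hashing is commonly used to accelerate data searches. Hash functions used for data searches use some arithmetic expression which iteratively processes chunks of the input (such as the characters in a string) to produce the hash value. In cryptographic hash functions, the size of these chunks, which is called the block size, is typically larger than the size of the hash value.

When the range of hash values can change, for instance from one run of the program to the next, one needs a hash function which takes two parameters: the input data and the number n of allowed hash values. A common solution is to compute a fixed hash function with a very large range and use the remainder of dividing the result by n. If n is itself a power of 2, this can be done by bit masking and bit shifting. When this approach is used, the hash function must be chosen so that the result has a fairly uniform distribution between 0 and n - 1 for any value of n that may occur in the application. Depending on the function, the remainder may be uniform only for certain values of n.

The table size n need not be a power of 2 in order to avoid a remainder or division operation. For example, let n be significantly less than 2^b. Consider a pseudorandom number generator (PRNG) function P(key) that is uniform on the interval [0, 2^b - 1]. A hash function uniform on the interval [0, n - 1] is n * P(key) / 2^b. We can replace the division by a (possibly faster) right bit shift: (n * P(key)) >> b.

Variable range with minimal movement (dynamic hash function)

A dynamic hash function is one whose range can grow or shrink while relocating as few stored records as possible.

In some applications, the input data may contain features that are irrelevant for comparison purposes. For example, when looking up a personal name, it may be desirable to ignore the distinction between upper and lower case letters. For such data, one must use a hash function that is compatible with the data equivalence criterion being used: that is, any two inputs that are considered equivalent must yield the same hash value. This can be accomplished by normalizing the input before hashing it, as by upper-casing all letters.

Continuity

Continuity is desirable for hash functions only in some applications, such as hash tables used in nearest neighbor search.

Non-invertible

In cryptographic applications, hash functions are typically expected to be practically non-invertible: it should be infeasible to reconstruct the input from its hash value alone.

If the data to be hashed is small enough, one can use the data itself, reinterpreted as an integer, as the hash value; the cost of computing this function is effectively zero.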
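This hash function is perfect, as it maps each input to a distinct hash value. Whether a value is small enough to be used directly depends on the size of the type used for the hash value. For example, in Java, the hash code is a 32-bit integer. Thus the 32-bit integer Integer and 32-bit floating-point Float objects can simply use the value directly, whereas the 64-bit integer Long and 64-bit floating-point Double cannot use this method.

Other types of data can also use this perfect hashing scheme. For example, when mapping character strings between upper and lower case, one can use the binary encoding of each character, interpreted as an integer, to index a table that gives the alternative form of that character. If each character is stored in 8 bits (as in extended ASCII), the table has only 2^8 = 256 entries. The same technique applies to other small key spaces, such as two-letter country codes; invalid data values may be left undefined in the table or mapped to an appropriate null value. With a perfect hash function one can directly locate the desired entry in a hash table, without any additional searching.

Minimal perfect hashing

A perfect hash function for n keys is said to be minimal if its range consists of n consecutive integers, usually from 0 to n - 1. Besides providing single-step lookup, a minimal perfect hash function also yields a compact hash table, without any vacant slots. Minimal perfect hash functions are much harder to find than perfect ones with a wider range.

Hashing uniformly distributed data

If each input occurs independently and with uniform probability, a hash function only needs to map roughly the same number of inputs to each hash value. For instance, suppose that each input is an integer z in the range 0 to N - 1, and the output must be an integer h in the range 0 to n - 1, where N is much larger than n. Then the hash function could be h = z mod n (the remainder of z divided by n), or h = (z * n) / N (the value z scaled down by n/N and truncated to an integer).

These simple formulas will not do if the input values are not equally likely or not independent. For instance, most patrons of a supermarket will live in the same geographic area, so their telephone numbers are likely to begin with the same 3 to 4 digits. In that case, the division formula, which depends mainly on the leading digits, will generate many collisions, whereas the remainder formula, which is sensitive to the trailing digits, may still yield a fairly even distribution.

The distribution of long or variable-length strings is usually very uneven. For example, text in any natural language has highly non-uniform distributions of characters, and character pairs, very characteristic of the language. For such data, it is prudent to use a hash function that depends on all characters of the string. In general, the scheme for hashing such data is to break the input into a sequence of small units (bits, bytes, words, etc.) and combine all the units b[1], b[2], ..., b[m] sequentially: start with a state S = S0, replace S by F(S, b[k]) for each unit in turn, and return G(S, n). The state variable S may be a fixed-width unsigned integer; in that case S0 can be 0, and G(S, n) can be just S mod n.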
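The iterative scheme described above (a state S that starts at S0, is updated once per input unit, and is finally reduced by G(S, n) = S mod n) can be sketched in Python. The combining step used here, multiply by 31 and add the unit, is an assumption chosen purely for illustration, and the names mix and iterative_hash are hypothetical.

```python
def mix(s: int, b: int) -> int:
    """Assumed combining step F(S, b): multiply by 31, add the unit, keep 32 bits."""
    return (s * 31 + b) & 0xFFFFFFFF


def iterative_hash(units, n: int, s0: int = 0, f=mix) -> int:
    """Fold every input unit into the state S, then reduce S to the range 0..n-1."""
    s = s0                  # S <- S0
    for b in units:         # visit the units b[1], b[2], ... in order
        s = f(s, b)         # S <- F(S, b[k])
    return s % n            # G(S, n) = S mod n


# Iterating over a bytes object yields one integer (0..255) per byte.
print(iterative_hash(b"ab", 1000))  # (97 * 31 + 98) % 1000 = 105
print(iterative_hash(b"ba", 1000))  # (98 * 31 + 97) % 1000 = 135
```

Because the combining step is applied to every unit in order, the result depends on all characters of the input and on their positions, which is why "ab" and "ba" land in different buckets in this sketch.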
The best choice of the combining function F is a complex issue and depends on the nature of the data. For example, suppose that the input data are file names such as FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, etc., with mostly sequential numbers. For such data, a function that extracts the numeric part k of the file name and returns k mod n would be nearly optimal. Needless to say, a function that is exceptionally good for a specific kind of data may have dismal performance on data with a different distribution.

Rolling hash

In some applications, such as substring search, one must compute a hash function h for every k-character substring of a given n-character string t. The straightforward solution, which is to extract every such substring s of t and compute h(s) separately, requires a number of operations proportional to k * n. However, with the proper choice of h, one can use the technique of rolling hash to compute all those hashes with an effort proportional to k + n.

Universal hashing

A universal hashing scheme is a randomized algorithm that selects a hash function from a family of such functions, so that the probability of a collision between any two distinct keys is small regardless of which keys they are. Universal hashing ensures (in a probabilistic sense) that the hash function application will behave as well as if it were using a random function, for any distribution of the input data. It will, however, have more collisions than perfect hashing and may require more operations than a special-purpose hash function. See also unique permutation hashing.

Certain checksum or fingerprinting algorithms can be adapted for use as hash functions. Some of those algorithms will map arbitrarily long string data z, with any typical real-world distribution, to a fixed-size value from which a hash value can be extracted. However, some checksums fare poorly in the avalanche test, which may be a concern in some applications. In particular, each bit of the input has a deterministic effect on each bit of the popular CRC32 checksum.

In many applications, such as hash tables, collisions make the system slightly slower but are otherwise harmless. In such systems, it is often better to use hash functions based on multiplication, such as MurmurHash and SBoxHash, or even simpler hash functions such as CRC32, and tolerate more collisions.

Hash functions that remain uniformly distributed even when the keys are chosen by a malicious agent have a further benefit: this feature may help to protect services against denial-of-service attacks.

Hashing by nonlinear table lookup

A table of random numbers can provide a high-quality nonlinear function for hashing. The key to be hashed is split into 8-bit (one-byte) parts, and each part is used as an index for the nonlinear table.
The table values are then combined into the hash output value by arithmetic addition or XOR. Because an 8-bit index selects among only 256 table entries, the table is small enough to fit into the cache of modern microprocessors, which allows very fast execution. As each table value is on average much longer than 8 bits, one bit of input affects nearly all output bits. This algorithm has proven to be very fast and of high quality for hashing purposes (especially hashing of integer-number keys).

Efficient hashing of strings

Strings can be hashed faster by processing a machine word at a time rather than one character at a time. The remaining characters of the string, which are fewer than the word length of the CPU, must be handled differently (for example, one character at a time).

One method that avoids the problem of strings having great similarity is to use a cyclic redundancy check (CRC) of the string as the value to hash. While it is possible that two different strings will have the same CRC, the likelihood is very small and only requires that one check the actual string found to determine whether one has an exact match. CRCs will be different for strings that differ only in their final character. Although CRC codes can be used as hash values, they are not cryptographically secure, since they are not collision-resistant.

In locality-sensitive hashing, the basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items). This is different from conventional hash functions, such as those used in cryptography, as in this case the goal is to maximize the probability of collision of similar items rather than to avoid collisions.

One example is the MinHash scheme. Let h be a hash function that maps the members of two sets A and B to distinct integers, and for any set S let hmin(S) be the member of S with the minimum value of h. Then hmin(A) = hmin(B) exactly when the minimum hash value of the union of A and B lies in their intersection. A variable that is one when hmin(A) = hmin(B) and zero otherwise therefore estimates the similarity of A and B, but with a high variance. The idea of the MinHash scheme is to reduce the variance by averaging together several variables constructed in the same way.

Origins of the term

The term "hash" comes by way of analogy with its non-technical meaning, to chop and mix. Indeed, typical hash functions, like the mod operation, chop the input domain into many sub-domains that get mixed into the output range.

Note: Plain ASCII is a 7-bit character encoding, although it is often stored in 8-bit bytes with the highest-order bit always clear (zero). Therefore, for plain ASCII, the bytes have only 2^7 = 128 valid values.