The DATAMP Patent XML Specification
Table Of Contents
XML Basics
DATAMP Patent XML
Examples
The following documents the XML format the DATAMP project uses for input of patent data. All XML input will be checked against this specification, and only uploaded if it adheres to it.
For those unfamiliar with XML, there are several very good tutorials available on the web. My personal favorite is the one from W3Schools.org.
XML is a text format very similar to HTML, but with stricter formatting rules. Like HTML, XML is composed of elements (sometimes with attributes) and text. Elements have a start and end tag, which consist of the element name enclosed in angle brackets, as such:
<element>(element contents)</element>
Note that the end tag uses a '/' character to indicate that it is the end of the element. Between the start and end tags you can include other elements or text, depending on what data you are representing. Each XML format defines its own rules for what is valid content in each element.
Unlike HTML, XML does not define specific tags that can be used; it allows different systems to define tags that make sense for their specific task. For example, in DATAMP, we define tags for things like "Patent" and "Patentee".
If an element has no sub-elements or text content, you can use a special shorthand notation to indicate this. Rather than using a separate start and end tag for these elements, you can use an empty tag, as such:
<element/>
The trailing slash on the name indicates that there is no closing tag.
Certain elements may also support attributes, which give additional information about the element. Attributes are specified as a set of "name=value" fields inside the start tag, as such:
<element attr1='Hello' attr2="123">(element contents)</element>
Note that the attribute values must be enclosed in quotes (single or double, it doesn't matter). Each element will define the set of attributes that it allows.
You will notice that certain characters are "important" to XML: the less-than and greater-than symbols (<, >), quotation marks (', "), and the ampersand (&). If these symbols show up in your data, you need to take some special steps to prevent them from being interpreted as XML markup (and causing errors).
XML handles this by defining entities for special characters. An entity is specified using the format "&name;", where "name" is a fixed value representing the character. The following entities are defined by default:
Entity | Encoding | Example
Apostrophe (') | &apos; | Scarlett O'Hara => Scarlett O&apos;Hara
Quotes (") | &quot; | This is "unusual" => This is &quot;unusual&quot;
Less than (<) | &lt; | 5 < 10 => 5 &lt; 10
Greater than (>) | &gt; | 15 > 10 => 15 &gt; 10
Ampersand (&) | &amp; | Rock & Roll => Rock &amp; Roll
These special encodings can be used in element attribute values, as well as in element text, as such:
<name first='Scarlett' last='O&apos;Hara'> Proprietor of &quot;Tara&quot; </name>
This example shows one of the key reasons we need to worry about these characters. If we were to use a plain apostrophe in O'Hara, the attribute value would read 'O'Hara', and the XML parser would assume the value stopped after the O.
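As a further illustration of the same rule, the sketch below escapes the ampersand in a company name, once in an attribute value and once in element text. The Company and Description elements are described later in this document, and the values are borrowed from the example at the end of it.

```xml
<!-- In an attribute value, the ampersand is written as the amp entity -->
<Company name="Stanley Rule &amp; Level Co."/>

<!-- In element text, the same entity is used -->
<Description>Produced by Stanley Rule &amp; Level Co.</Description>
```

A bare ampersand in either place would be rejected by an XML parser.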
All XML files start with a fixed preamble, indicating that this is an XML file, and what kind of characters it uses. For all DATAMP XML files, we use the following preamble:
<?xml version="1.0" encoding="UTF-8"?>
This tells the system that this is an XML version 1.0 file, and that it uses UTF-8 characters (don't worry if you don't know what that means; just include the line exactly as written and you'll be fine).
Every XML file has exactly one top-level element, which is called the document element. It must follow immediately after the preamble. All other elements in the file are nested inside this top-level element.
<?xml version="1.0" encoding="UTF-8"?>
<documentElement>
  ...contents of the XML file...
</documentElement>
The actual name of the document element is specified by the type of XML file you are using. For DATAMP, our document element is called "Patents".
As you can see from the preceding sections, XML puts very little restriction on what the individual elements contain, and what nesting of elements is permitted. This allows a great deal of flexibility, but it also means that there is no way for XML to validate that the contents of the file are actually meaningful.
To allow a system to define a structure for its particular "flavor" of XML, a special XML file called an XML Schema is used. This contains rules about what elements and attributes are valid, and how they are nested. The schema can be used to validate an XML file's contents to make sure it is correct and meaningful.
The schema definition is included in a file using a special attribute on the document element, called an XML namespace (xmlns for short). The namespace contains the URL of the namespace definition, as such:
<?xml version="1.0" encoding="UTF-8"?>
<documentElement xmlns="http://...">
  ...contents of the XML file...
</documentElement>
For DATAMP, we use the URL "http://www.datamp.org/2003/xsd/patentInput.xsd" for our schema definition. This is a normal XML file, so if you are interested in exactly what the rules are you can refer to this file (if you know how to read a schema!).
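Putting the preamble, the document element, and the namespace together, the skeleton of a DATAMP input file would look roughly like the sketch below. It uses only the names and the URL given above; the comment marks where the patent data would go.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Patents xmlns="http://www.datamp.org/2003/xsd/patentInput.xsd">
  <!-- one Patent element per patent goes here -->
</Patents>
```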
The following sections describe the various elements which are used by the DATAMP XML schema, what attributes they support, and how they are nested.
The Patents element is the highest-level (document) element. There must be exactly one Patents element, and it must be the first element defined in the XML file.
The following attributes may be defined for the Patents element:
Attribute Name | Comments
xmlns | This specifies the schema to use for validation of the XML file contents. It MUST contain the URL: http://www.datamp.org/2003/xsd/patentInput.xsd
The following elements may be nested under the Patents element:
Element Name | Comments
Patent | There will be one Patent element for each patent in the data set.
This element may not define any non-element content (PCDATA).
The Patent element defines the information for a single patent. There will be one Patent element for each patent in the file.
The following attributes may be defined for the Patent element:
Attribute Name | Comments
number | Specifies the number of the patent. This number should not include any type specification, so patent "X1,234" will have the number "1234".
type | Specifies the type of patent. Only a fixed set of values is valid for this field.
country | Specifies the country that granted the patent. If not explicitly specified otherwise, it is assumed to be a US patent.
title | Specifies the title of the patent, as taken from the specification.
alt | Specifies a more descriptive title for the patent. Many patents have very general titles (such as "Woodworking Machine"), so this field allows us to put something more descriptive (like "Surface Planer").
The following elements may be nested under the Patent element:
Element Name | Comments
Description | Textual description of the tool.
GrantDate | Specifies the grant date for the patent.
ApplicationDate | Specifies the application date for the patent.
EffectiveDate | For antedated patents, specifies the date when the patent became effective. If not specified, the effective date is the same as the grant date.
Patentee | Specifies information about a person the patent was granted to. If the patent was granted to more than one person, each will have its own Patentee element.
Classification | Specifies information about the patent classification. If the patent is in multiple classes, there will be one element for each classification.
Type | Specifies information about the type of tool covered by the patent, taken from a standardized list. A tool can be of more than one type (i.e. a combination tool).
Witness | Specifies information about a person who witnessed the patent. Each witness (there are usually at least 2) will have its own element.
Assignee | Specifies information about a person or company the patent was assigned to.
Manufacturer | Specifies information about a company that manufactured the patented tool. In the case of long-lived or widely copied patents, only those who manufactured the tool while under patent will be listed.
Link | Specifies the URL of a web page or other external source of information about the patented tool.
ReissuedAs | Specifies the reissued patent for this patent, if any.
ReissueOf | For reissued patents, specifies the patent which this patent is a reissue of.
KnownExamples | This element will be present if there are known extant examples of tools using the patent.
Gizmo | This element will be present if the patent is unusual or amusing.
Notable | This element will be present if the patent is notable for some reason.
Extended | This element will be present if the patent was ever granted an extension. Details of the extension should be in the Description field.
This element may not define any non-element content (PCDATA).
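As a sketch of how these attributes and nested elements fit together, here is a single Patent element built from the data in the example at the end of this document. The order of the child elements simply follows that example; whether the schema requires a particular order is not stated here.

```xml
<Patent number="15556" title="Carpenter's Gage" alt="Replaceable points for marking gages">
  <GrantDate month="8" day="19" year="1856"/>
  <Classification class="33" subclass="44"/>
  <Patentee>
    <Name first="Joel" last="Bryant"/>
    <Location city="Brooklyn" state="NY"/>
  </Patentee>
  <Type>layout tools:marking gauges</Type>
  <Description>Replaceable points for marking gages.</Description>
</Patent>
```

The country attribute is left out, so the patent is assumed to be a US patent.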
The GrantDate element contains information about the date the patent was granted.
The following attributes may be defined for the GrantDate element:
Attribute Name | Comments
month | Month component of date (1-12).
day | Day component of date (1-31).
year | Year component of date.
No elements may be nested under the GrantDate element.
This element may not define any non-element content (PCDATA).
The Name element contains information about a person. It is used within the Patentee, Witness, Assignee, and Manufacturer elements.
The following attributes may be defined for the Name element:
Attribute Name | Comments
first | Person's first name.
middle | Person's middle name or initial (if any).
last | Person's last name.
suffix | Suffix for the person's name ("Jr.", "Esq.", etc.).
No elements may be nested under the Name element.
This element may not define any non-element content (PCDATA).
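The fragments below sketch how the Name attributes might be combined. The first name is taken from the example at the end of this document; the second is a made-up person, included only to show the suffix attribute.

```xml
<!-- First name, middle initial, and last name -->
<Name first="J." middle="L." last="Marcellus"/>

<!-- Hypothetical person, shown only to illustrate the suffix attribute -->
<Name first="John" last="Doe" suffix="Jr."/>
```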
The Company element contains information about a company. It is used within the Assignee and Manufacturer elements.
The following attributes may be defined for the Company element:
Attribute Name | Comments
name | Name of the company.
sortby | Name to use for sorting the company. If this is not specified, the company name will be used. This allows us to sort something like "E.A. Fay and Co." as "Fay and Co."
No elements may be nested under the Company element.
This element may not define any non-element content (PCDATA).
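The sortby attribute is easiest to see in a concrete case. The first fragment below uses the company name quoted in the table above; the second borrows a company from the example at the end of this document and leaves sortby out.

```xml
<!-- Named "E.A. Fay and Co." but sorted as "Fay and Co." -->
<Company name="E.A. Fay and Co." sortby="Fay and Co."/>

<!-- Without sortby, the company name itself is used for sorting -->
<Company name="Stanley Rule &amp; Level Co."/>
```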
The Location element contains information about where a company or person is located.
The following attributes may be defined for the Location element:
Attribute Name | Comments
city | Name of the city the person/company is located in.
state | Name of the state the person/company is located in.
country | Name of the country the person/company is located in.
No elements may be nested under the Location element.
This element may not define any non-element content (PCDATA).
The Classification element contains information about a patent classification.
The following attributes may be defined for the Classification element:
Attribute Name | Comments
class | Main patent classification for the patent.
subclass | Subclass for the patent classification.
No elements may be nested under the Classification element.
This element may not define any non-element content (PCDATA).
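Because a patent can fall into several classes, a Patent element may carry more than one Classification element. The pair below comes from the first patent in the example at the end of this document.

```xml
<Classification class="33" subclass="44"/>
<Classification class="33" subclass="486"/>
```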
The Link element contains information about an external URL associated with the patent.
The following attributes may be defined for the Link element:
Attribute Name | Comments
href | URL of the associated file.
No elements may be nested under the Link element.
The non-element content specifies the text of the hyperlink.
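A Link element might look like the sketch below. The URL and the link text are placeholders invented for illustration, not a real DATAMP resource.

```xml
<!-- href holds the URL; the element text becomes the text of the hyperlink -->
<Link href="http://www.example.com/marking-gauges.html">Article on marking gauges</Link>
```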
The Patentee element contains information about a single patentee. There can be any number of patentees defined for a patent.
No attributes are defined for the Patentee element.
The following elements may be nested under the Patentee element:
Element Name | Comments
Name | Contains information about the patentee name.
Location | Contains information about where the patentee lived when the patent was issued.
This element may not define any non-element content (PCDATA).
The Witness element contains information about a person who witnessed the patent. There will be one Witness element for each person (at least two witnesses are required for a patent).
No attributes are defined for the Witness element.
The following elements may be nested under the Witness element:
Element Name | Comments
Name | Contains information about the witness name.
This element may not define any non-element content (PCDATA).
The Assignee element contains information about a single assignee for the patent. There can be any number of assignees defined for a patent.
No attributes are defined for the Assignee element.
The following elements may be nested under the Assignee element:
Element Name | Comments
Name | If the assignee is a person, this element will contain information about the person's name. There must be either a Name or Company element defined for each Assignee, but not both.
Company | If the assignee is a company, this element will contain information about the company name. There must be either a Name or Company element defined for each Assignee, but not both.
Location | Contains information about where the assignee (person or company) was located at the time the patent was issued.
This element may not define any non-element content (PCDATA).
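Since an Assignee holds either a Name or a Company, but not both, the two variants can be sketched as follows. All values here are illustrative only: the person is made up, and the company and its location are borrowed from the Manufacturer entry in the example at the end of this document.

```xml
<!-- Assignee who is a person (hypothetical name and location) -->
<Assignee>
  <Name first="John" last="Doe"/>
  <Location city="Brooklyn" state="NY"/>
</Assignee>

<!-- Assignee that is a company -->
<Assignee>
  <Company name="Stanley Rule &amp; Level Co."/>
  <Location city="New Britain" state="CT"/>
</Assignee>
```

The Manufacturer element described next follows the same either/or pattern.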
The Manufacturer element contains information about a single manufacturer of the tool covered by the patent. There can be any number of manufacturers defined for a patent.
No attributes are defined for the Manufacturer element.
The following elements may be nested under the Manufacturer element:
Element Name | Comments
Company | This element will contain information about a company that manufactured the patented piece. There must be either a Name or Company element defined, but not both.
Name | This element will contain information about a person who manufactured the patented piece. There must be either a Name or Company element defined, but not both.
Location | Contains information about where the company/person was located at the time they produced the tool.
This element may not define any non-element content (PCDATA).
The Type element contains information about the type of tool described by the patent. A patent will define at least one type; a tool that embodies more than one type (a combination tool) will define one Type element for each.
The list of valid types is fixed by the system, but it is fairly large.
No attributes are defined for the Type element.
No elements may be nested under the Type element.
The non-element content will contain the name of the type. This type name consists of a colon-separated list in the format "class:category:type", and must be one defined in our category taxonomy.
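For instance, both patents in the example at the end of this document are filed under a single type, written as shown below. A combination tool would repeat the Type element, each time with a different name from the taxonomy.

```xml
<Type>layout tools:marking gauges</Type>
```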
The Description element contains a brief description of the tool covered by the patent, and any other pertinent information. A patent may define only one Description element.
The Description text can contain basic XHTML markup if desired.
No attributes are defined for the Description element.
Apart from the optional XHTML markup, no elements may be nested under the Description element.
The non-element content will contain the descriptive text.
Two special non-XML tags are allowed in the description text: links to other patents, and line breaks.
To insert a link to another patent in DATAMP, use the syntax "{xxx}", where "xxx" is the number of the patent you are linking to. For example, to insert a link to patent D1234, you could do something like:
<Description>This is similar to patent {D1,234}</Description>
Note that the text inside the braces will become the link, so it is normally wise to include basic formatting as shown above.
To break a long description into paragraphs, you can use the "{br}" tag to insert a break. This will add a blank line to the description.
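The two special tags can be combined in one description. The sketch below reuses the patent number from the link example above; the wording of the text is made up for illustration, and it assumes that "{br}" may be placed directly between two sentences.

```xml
<!-- {br} starts a new paragraph; {D1,234} becomes a link to that patent -->
<Description>Combination gage with three marking points.{br}This is similar to patent {D1,234}.</Description>
```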
The ApplicationDate element contains information about the date the patent was applied for. If the application date is not known, this element should not be present.
The following attributes may be defined for the ApplicationDate element:
Attribute Name | Comments
month | Month component of date (1-12).
day | Day component of date (1-31).
year | Year component of date.
No elements may be nested under the ApplicationDate element.
This element may not define any non-element content (PCDATA).
The EffectiveDate element contains information about the date the patent went into effect. If the patent is not explicitly antedated this element is not necessary, since the patent became effective on the grant date.
The following attributes may be defined for the EffectiveDate element:
Attribute Name | Comments
month | Month component of date (1-12).
day | Day component of date (1-31).
year | Year component of date.
No elements may be nested under the EffectiveDate element.
This element may not define any non-element content (PCDATA).
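The three date elements share the same attributes, so an antedated patent might carry all of them, as sketched below. The grant date is taken from the example at the end of this document; the application and effective dates are invented purely for illustration.

```xml
<GrantDate month="5" day="26" year="1857"/>
<!-- Hypothetical dates, for illustration only -->
<ApplicationDate month="1" day="12" year="1857"/>
<EffectiveDate month="3" day="10" year="1857"/>
```

For a patent that is not antedated, the EffectiveDate line would simply be left out.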
In a reissued patent, the ReissueOf element specifies the patent that is being reissued. A reissued patent will have exactly one ReissueOf element.
The following attributes may be defined for the ReissueOf element:
Attribute Name | Comments
number | Specifies the number of the original patent.
type | Specifies the type of the original patent. Only a fixed set of values is valid for this field.
country | Specifies the country that granted the original patent. If not explicitly specified otherwise, it is assumed to be a US patent.
No elements may be nested under the ReissueOf element.
This element may not define any non-element content (PCDATA).
If a patent is reissued, the ReissuedAs element specifies the name and number of the reissued patent.
The following attributes may be defined for the ReissuedAs element:
Attribute Name | Comments
number | Specifies the number of the reissued patent.
type | Specifies the type of the reissued patent. In this case, the only valid type for the patent is "RE".
country | Specifies the country that reissued the patent. If not explicitly specified otherwise, it is assumed to be a US patent.
No elements may be nested under the ReissuedAs element.
This element may not define any non-element content (PCDATA).
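The two reissue elements point at each other. Based on the first patent in the example at the end of this document, the pair might be sketched as follows. The title given to the reissued patent, its type="RE" attribute, and the omission of the type attribute on ReissueOf are all assumptions; the remaining child elements are left out for brevity.

```xml
<!-- Original patent, pointing forward to its reissue -->
<Patent number="15556" title="Carpenter's Gage">
  <ReissuedAs number="448" type="RE"/>
</Patent>

<!-- Reissued patent, pointing back to the original (assumed attributes) -->
<Patent number="448" type="RE" title="Carpenter's Gage">
  <ReissueOf number="15556"/>
</Patent>
```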
The KnownExamples element should be specified if there are known extant examples of the tool covered by the patent.
No attributes are defined for the KnownExamples element.
No elements may be nested under the KnownExamples element.
This element may not define any non-element content (PCDATA).
The Notable element should be specified if the patent is notable for some reason.
No attributes are defined for the Notable element.
No elements may be nested under the Notable element.
This element may not define any non-element content (PCDATA).
The Gizmo element should be specified if the patent is particularly amusing or unusual.
No attributes are defined for the Gizmo element.
No elements may be nested under the Gizmo element.
This element may not define any non-element content (PCDATA).
The Extended element should be specified if the patent was extended during its history.
No attributes are defined for the Extended element.
No elements may be nested under the Extended element.
This element may not define any non-element content (PCDATA).
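KnownExamples, Notable, Gizmo, and Extended act as simple flags: they carry no attributes or content, so the empty-tag shorthand is the natural way to write them. As a sketch, a patent that is notable and has known surviving examples might include:

```xml
<KnownExamples/>
<Notable/>
```

Leaving a flag element out means the corresponding property does not apply to the patent.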
Here is a very basic example of the DATAMP XML input file format. It specifies the data for a pair of marking gage patents:
<?xml version="1.0" encoding="UTF-8"?>
<Patents xmlns='http://www.datamp.org/2003/xsd/patentInput.xsd'>
  <Patent number="15556" title="Carpenter's Gage" alt="Replaceable points for marking gages">
    <GrantDate month="8" day="19" year="1856"/>
    <Classification class="33" subclass="44"/>
    <Classification class="33" subclass="486"/>
    <Patentee>
      <Name first="Joel" last="Bryant"/>
      <Location city="Brooklyn" state="NY"/>
    </Patentee>
    <Type>layout tools:marking gauges</Type>
    <Description>Replaceable points for marking gages.</Description>
    <Witness>
      <Name first="J." middle="L." last="Marcellus"/>
    </Witness>
    <Witness>
      <Name first="John" middle="C." last="Schencke"/>
    </Witness>
    <ReissuedAs number="448" type="RE"/>
  </Patent>
  <Patent number="17403" title="Compound Gage">
    <GrantDate month="5" day="26" year="1857"/>
    <Classification class="33" subclass="44"/>
    <Patentee>
      <Name first="Albert" last="Williams"/>
      <Location city="Philadelphia" state="PA"/>
    </Patentee>
    <Manufacturer>
      <Company name="Stanley Rule &amp; Level Co."/>
      <Location city="New Britain" state="CT"/>
    </Manufacturer>
    <Type>layout tools:marking gauges</Type>
    <Description>Combination Gage containing 3 marking points and a pair of mortise points. Produced by Stanley Rule &amp; Level Co.</Description>
    <KnownExamples/>
    <Witness>
      <Name first="Enoch" last="Remick"/>
    </Witness>
    <Witness>
      <Name first="James" last="Nichol"/>
    </Witness>
  </Patent>
</Patents>